SSE Streaming

DVARA supports Server-Sent Events (SSE) streaming for all providers. Streaming delivers tokens incrementally as they are generated, reducing perceived latency.

Enabling Streaming

Set "stream": true in your chat completion request:

curl -s -N -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [
      {"role": "user", "content": "Count to five slowly."}
    ],
    "stream": true
  }'

SSE Protocol

The response uses Content-Type: text/event-stream. Each event is a JSON chunk on a line prefixed with "data: ", and events are separated by a blank line:

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","model":"gpt-4o","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","model":"gpt-4o","choices":[{"index":0,"delta":{"content":"One"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","model":"gpt-4o","choices":[{"index":0,"delta":{"content":", two"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","model":"gpt-4o","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]

Chunk Format

Each chunk contains:

  • id — Stable completion ID across all chunks
  • object — Always "chat.completion.chunk"
  • model — The model used
  • choices[].delta — Incremental content (content for text, role for the first chunk)
  • choices[].finish_reason — null during streaming, "stop" on the final chunk of a stream that completes normally

The stream terminates with data: [DONE].
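As a rough illustration of the protocol and chunk fields above, the sketch below parses already-received SSE lines and reassembles the text. It is a minimal example rather than a full SSE client: iter_deltas is a hypothetical helper name, and the input is assumed to be decoded text lines, such as those produced by an HTTP library's line iterator.

```python
import json
from typing import Iterable, Iterator, Optional, Tuple


def iter_deltas(lines: Iterable[str]) -> Iterator[Tuple[str, Optional[str]]]:
    """Yield (content, finish_reason) per chunk until the [DONE] sentinel.

    Hypothetical helper: assumes each event is a single "data: ..." line.
    """
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # blank separator lines and any non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return  # end-of-stream sentinel, not JSON
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            # The role-only first chunk and the final chunk carry no content.
            yield delta.get("content") or "", choice.get("finish_reason")


sample = [
    'data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","model":"gpt-4o","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}',
    "",
    'data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","model":"gpt-4o","choices":[{"index":0,"delta":{"content":"One"},"finish_reason":null}]}',
    "",
    'data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","model":"gpt-4o","choices":[{"index":0,"delta":{"content":", two"},"finish_reason":null}]}',
    "",
    'data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","model":"gpt-4o","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}',
    "",
    "data: [DONE]",
]

text = ""
finish = None
for content, reason in iter_deltas(sample):
    text += content
    finish = reason or finish

print(text, finish)
```

Keeping the last non-null finish_reason lets the caller decide afterwards how the stream ended.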

Streaming with Structured Outputs

Streaming works with response_format. From the client's point of view the experience is identical no matter which provider handles the call — JSON bytes arrive progressively as content deltas. The gateway absorbs the per-provider streaming-protocol differences (see Structured Outputs for the rewrite mechanics).

| Provider group | Streaming behaviour with response_format |
| --- | --- |
| OpenAI, Azure OpenAI, Mistral, Gemini, Grok | Native. Standard OpenAI-format content deltas. |
| Anthropic, Bedrock | Tool-use rewrite for json_schema. The gateway flattens the upstream tool-call stream into normal content deltas: JSON bytes arrive progressively, no client-side tool-call handling required. |
| DeepSeek, Moonshot, ChatGLM, Groq | Native json_object. json_schema routed away by capability filter. |
| Qwen, Cohere, Ollama | Filtered out for any response_format. A request on a route with only these providers fails fast with no_capable_provider (HTTP 400) before any streaming starts. |
| Mock | Streams the wrapped {"result": "..."} text one token at a time. |
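Because the JSON arrives as ordinary content deltas, a client can usually just concatenate them and parse once the stream has finished. The sketch below assumes the client has already turned the stream into (content, finish_reason) pairs; collect_structured is a hypothetical helper name, and the sample fragments are made up for illustration.

```python
import json
from typing import Any, Iterable, Optional, Tuple


def collect_structured(events: Iterable[Tuple[str, Optional[str]]]) -> Any:
    """Join content deltas and parse them as JSON once the stream ends.

    Hypothetical helper: refuses to parse unless the stream ended with
    finish_reason "stop", since a truncated stream is rarely valid JSON.
    """
    parts = []
    finish = None
    for content, reason in events:
        parts.append(content)
        if reason is not None:
            finish = reason
    if finish != "stop":
        raise ValueError(f"stream ended with finish_reason={finish!r}")
    return json.loads("".join(parts))


# Illustrative fragments: the split points fall wherever the provider puts them.
events = [
    ('{"city": "Par', None),
    ('is", "population": 2', None),
    ("100000}", None),
    ("", "stop"),
]
result = collect_structured(events)
print(result)
```

Individual fragments are not valid JSON on their own, which is why the parse is deferred to the end of the stream.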

Timeout Configuration

Streaming requests have a separate, longer timeout than non-streaming requests:

dvara:
  llm-gateway:
    resilience:
      timeout:
        chat-timeout-ms: 30000        # non-streaming: 30 seconds
        streaming-timeout-ms: 120000  # streaming: 120 seconds

Caching Behavior

Streaming requests bypass the response cache entirely — no cache lookup, no cache storage, no X-Cache header. This is by design, as streaming responses are typically unique and the cache cannot store partial streams.

Limitations

A few cases where the streaming path differs from non-streaming. None are blockers, but they're worth knowing before you wire up a client.

  • No usage chunks. Streaming responses don't emit a final chunk carrying token counts; OpenAI's stream_options: {include_usage: true} is not honored. The gateway still records token usage server-side — query it via GET /v1/admin/token-usage after the call, or use a non-streaming request when per-call token counts are needed in the response itself.
  • Native tool calls aren't surfaced as tool deltas. The gateway normalizes every chunk to a content delta on the wire. When the upstream emits native tool-call deltas (OpenAI/Anthropic/Bedrock without a json_schema rewrite), those deltas don't carry through — the client sees empty content deltas and a finish_reason: tool_calls. Use non-streaming when your client needs to consume tool calls. The json_schema rewrite path on Anthropic/Bedrock does flatten its tool-use into content text deltas, so structured-output streaming via response_format works fine.
  • No X-Gateway-Strict-Downgraded header on streams. That header is set on a non-streaming response; streamed responses don't carry it. If you need to detect strict downgrade for a streamed call, do the same call non-streaming first to read the header, then stream subsequent calls knowing the route's downgrade behavior.
  • Client disconnect. When the SSE consumer closes the connection, the gateway flags the emitter as completed and stops sending. The upstream provider call may continue briefly until the underlying HTTP client times out — token usage for the partial response is still recorded.

Governance on streaming responses

PII, guardrail, and grounding checks run on streamed responses just like non-streaming. When enforcement triggers, the stream terminates with finish_reason: content_filter instead of the normal stop. Clients distinguish "the model finished" (stop / length) from "the gateway pulled the plug" (content_filter) by inspecting this field — that's the only signal callers need to react to.

Because decisions have to be made on partial text, the gateway buffers each text delta into a scan window and runs the detectors at every window boundary. Detections fire in-flight; grounding checks the full response once at the end of the stream.
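A client-side check of finish_reason might look like the sketch below. The category labels it returns are illustrative names chosen for this example, not part of the gateway API.

```python
from typing import Optional

# finish_reason values that mean the model itself ended the response
MODEL_FINISHED = {"stop", "length"}


def classify_finish(finish_reason: Optional[str]) -> str:
    """Map a chunk's finish_reason to a coarse, client-defined outcome."""
    if finish_reason is None:
        return "in_progress"          # more chunks are expected
    if finish_reason == "content_filter":
        return "blocked_by_gateway"   # enforcement terminated the stream
    if finish_reason in MODEL_FINISHED:
        return "model_finished"
    return "other"                    # for example tool_calls


for reason in (None, "stop", "length", "content_filter", "tool_calls"):
    print(reason, "->", classify_finish(reason))
```

A caller would typically discard or flag the partial text it has already rendered when the outcome is blocked_by_gateway.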

Per-action behaviour

| Subsystem | Action | What the client sees | Audit event |
| --- | --- | --- | --- |
| PII | BLOCK | Final chunk with finish_reason: content_filter, stream stops. | PII_BLOCKED_STREAMING (per detection) |
| PII | REDACT | Detected spans rewritten in-flight to tokenized placeholders ([EMAIL], [PHONE], [SSN], …) before the delta leaves the gateway. | summary at stream end (see below) |
| PII | LOG | Stream passes through unchanged. | summary at stream end |
| Guardrail | BLOCK | Final chunk with finish_reason: content_filter, stream stops. | GUARDRAIL_BLOCKED_STREAMING (per detection) |
| Guardrail | FLAG / LOG | Stream passes through unchanged. | summary at stream end |
| Grounding | BLOCK | Final delta is suppressed; stream terminates with finish_reason: content_filter. | HALLUCINATION_DETECTED_STREAMING |
| Grounding | FLAG / LOG | Stream completes normally. | HALLUCINATION_DETECTED_STREAMING |

When at least one detection fired but the stream completed (no BLOCK), the gateway writes a single STREAMING_ENFORCEMENT_SUMMARY event at stream end with counts: pii_entity_count, guardrail_detection_count. Per-detection events fire only for the BLOCK actions — LOG / FLAG / REDACT modes don't emit per-detection rows during the stream.

This is a deliberate trade-off: a long stream that triggers PII LOG 500 times shouldn't produce 500 audit rows. If you need per-detection visibility on a non-block path, use a non-streaming call or graduate the action to BLOCK for the categories that warrant a row each.

Because grounding runs once at stream end against the full accumulated response, it adds no mid-stream latency — full streaming throughput holds right up to the last chunk.

Scan window mechanics

Scanning every character as it arrives would be expensive, so the gateway buffers text into a scan window of size scanWindowSize (characters). When the buffer fills, the gateway scans the "safe region" (everything except the last overlapMargin characters) and retains the overlap for the next scan. The overlap exists to catch patterns that straddle two scan windows; without it, an email address that begins near the end of one 256-character window and ends in the next would slip through silently.

  • Default scan window: 256 characters
  • Default overlap margin: 64 characters
  • Per-request floor: scanWindowSize is raised to at least 32, overlapMargin is raised to at least 16, and overlapMargin is reduced to scanWindowSize / 2 if it would otherwise equal or exceed scanWindowSize. Out-of-range YAML values are silently clamped at request time, not rejected at boot. For example, streaming-overlap-margin: 0 looks accepted in the gateway log on startup but runs as 16 in production. Pick the value you actually want.
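The clamping rules in the list above can be written out as a small function. This is a sketch of the rules as stated here, not the gateway's code: the function name, the order in which the three rules are applied, and the use of integer division for scanWindowSize / 2 are assumptions.

```python
from typing import Tuple

MIN_SCAN_WINDOW = 32     # floor for scanWindowSize
MIN_OVERLAP_MARGIN = 16  # floor for overlapMargin


def clamp_scan_settings(scan_window_size: int, overlap_margin: int) -> Tuple[int, int]:
    """Return the (window, overlap) pair that would actually be used."""
    window = max(scan_window_size, MIN_SCAN_WINDOW)
    overlap = max(overlap_margin, MIN_OVERLAP_MARGIN)
    if overlap >= window:
        # The overlap may not equal or exceed the window; halve the window instead.
        overlap = window // 2
    return window, overlap


for configured in [(256, 64), (256, 0), (8, 0), (64, 64), (100, 500)]:
    print(configured, "->", clamp_scan_settings(*configured))
```

Running the configured values through a function like this before deploying makes it easier to spot a setting that will not take effect as written.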

Tune the window if you need earlier-firing detection (smaller window = more scans, more CPU, earlier catches) or cheaper streams (larger window = fewer scans, later catches). The defaults are calibrated for typical chat completions; most deployments never touch them.
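The sketch below is one way to picture the buffering, not the gateway's actual scanner. It assumes the detector is a plain regular expression and that detections are only reported (no redaction). As a simplification of the safe-region rule, it runs the regex over the whole buffer and postpones any match that touches the end of the buffer until the next scan, where the retained overlap lets it be seen whole. WindowedScanner, EMAIL, and the 32/16 sizes are illustrative names and values.

```python
import re
from typing import List, Tuple

# Deliberately simple email pattern, for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


class WindowedScanner:
    def __init__(self, window: int = 256, overlap: int = 64) -> None:
        self.window = window
        self.overlap = overlap
        self.buffer = ""
        self.base = 0            # absolute stream offset of buffer[0]
        self.reported_until = 0  # absolute end offset of the last reported match

    def feed(self, delta: str) -> List[Tuple[int, str]]:
        """Buffer a text delta; scan only once the window has filled."""
        self.buffer += delta
        if len(self.buffer) < self.window:
            return []
        return self._scan(final=False)

    def finish(self) -> List[Tuple[int, str]]:
        """Scan whatever is left when the stream ends."""
        return self._scan(final=True)

    def _scan(self, final: bool) -> List[Tuple[int, str]]:
        hits = []
        for m in EMAIL.finditer(self.buffer):
            start, end = self.base + m.start(), self.base + m.end()
            if start < self.reported_until:
                continue  # already reported, or the tail of a reported match
            if not final and m.end() == len(self.buffer):
                continue  # may still be growing; the next scan sees it whole
            hits.append((start, m.group()))
            self.reported_until = end
        # Keep only the overlap for the next scan (nothing after the final scan).
        keep_from = len(self.buffer) if final else max(len(self.buffer) - self.overlap, 0)
        self.base += keep_from
        self.buffer = self.buffer[keep_from:]
        return hits


text = "Please send the report to alice@example.com before Friday, thanks a lot."
scanner = WindowedScanner(window=32, overlap=16)
hits: List[Tuple[int, str]] = []
for i in range(0, len(text), 8):  # feed 8-character deltas
    hits.extend(scanner.feed(text[i:i + 8]))
hits.extend(scanner.finish())
print(hits)
```

In this example the address straddles the first 32-character boundary, so scanning each window in isolation would see only "alice@" and then "example.com" and report nothing. Patterns longer than the overlap can still be missed, which is the cost side of choosing a small overlap margin.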

Configuration

Streaming governance inherits the PII and guardrail settings from the non-streaming path, plus a pair of scan-streaming-responses toggles that can disable streaming enforcement specifically:

dvara:
  llm-gateway:
    pii:
      enabled: true
      default-action: LOG              # BLOCK / REDACT / LOG
      scan-streaming-responses: true   # turn off to skip streaming PII scans only
      streaming-scan-window-size: 256
      streaming-overlap-margin: 64
    guardrail:
      enabled: true
      default-action: LOG              # BLOCK / FLAG / LOG
      scan-streaming-responses: true
      streaming-scan-window-size: 256
      streaming-overlap-margin: 64

The gateway uses the smaller of the two streaming-scan-window-size values and the larger of the two streaming-overlap-margin values when both PII and guardrail scanning are active, so a single scan boundary covers both subsystems. You can't pick one set of window mechanics for PII and a different set for guardrail — the scanner is shared.
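The merge rule can be expressed in a few lines. ScanSettings and effective_settings are hypothetical names used only for this sketch, and it covers only the case where both subsystems are scanning.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanSettings:
    window_size: int     # streaming-scan-window-size
    overlap_margin: int  # streaming-overlap-margin


def effective_settings(pii: ScanSettings, guardrail: ScanSettings) -> ScanSettings:
    """Shared scanner settings when both PII and guardrail scanning are active."""
    return ScanSettings(
        window_size=min(pii.window_size, guardrail.window_size),
        overlap_margin=max(pii.overlap_margin, guardrail.overlap_margin),
    )


merged = effective_settings(
    ScanSettings(window_size=256, overlap_margin=64),
    ScanSettings(window_size=128, overlap_margin=32),
)
print(merged)
```

Note that taking the smaller window and the larger overlap can produce a pair where the overlap is large relative to the window. How that interacts with the per-request clamping is not spelled out here, so it is safest to keep the two subsystems' values aligned.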

Per-tenant overrides

Any tenant can turn streaming scanning on or off independently of the global default, and can pick a different action than the global default, via Tenant.metadata:

| Metadata key | Type | Effect |
| --- | --- | --- |
| pii.scan-streaming-responses | boolean | Enable or disable streaming PII scanning for this tenant. |
| guardrail.scan-streaming-responses | boolean | Enable or disable streaming guardrail scanning for this tenant. |
| pii.action | BLOCK / REDACT / LOG | Per-tenant PII action applied to both streaming and non-streaming paths. |
| guardrail.action | BLOCK / FLAG / LOG | Per-tenant guardrail action applied to both paths. |
| grounding.enabled | boolean | Enable or disable grounding at stream end for this tenant. |
| grounding.action | BLOCK / FLAG / LOG | Action when grounding detects ungrounded claims. |

Scan window size and overlap margin are not per-tenant — they're shared across the whole gateway and set at startup. If different tenants need different strictness, adjust action per tenant rather than the window mechanics.
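To make the precedence concrete, the sketch below resolves the effective streaming policy for one subsystem from a tenant's metadata and the global defaults. It is illustrative only: StreamingPolicy and resolve_policy are hypothetical names, and the metadata is assumed to be a flat mapping whose values are already typed (booleans and action strings).

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class StreamingPolicy:
    scan_streaming: bool
    action: str


def resolve_policy(
    subsystem: str,                      # "pii" or "guardrail"
    global_scan: bool,                   # global scan-streaming-responses
    global_action: str,                  # global default-action
    tenant_metadata: Mapping[str, Any],  # assumed shape of Tenant.metadata
) -> StreamingPolicy:
    """Tenant metadata wins when a key is present; otherwise use the global value."""
    scan = tenant_metadata.get(f"{subsystem}.scan-streaming-responses", global_scan)
    action = tenant_metadata.get(f"{subsystem}.action", global_action)
    return StreamingPolicy(scan_streaming=bool(scan), action=str(action))


strict_tenant = {"pii.action": "REDACT", "guardrail.scan-streaming-responses": False}
print(resolve_policy("pii", True, "LOG", strict_tenant))
print(resolve_policy("guardrail", True, "LOG", strict_tenant))
```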

Try it

The minimal curl invocation:

curl -s -N -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4-5",
    "messages": [{"role": "user", "content": "Write a haiku about programming."}],
    "stream": true
  }'

An equivalent call from Python with the OpenAI SDK, this time against gpt-4o:

from openai import OpenAI

# base_url points the SDK at the gateway instead of the OpenAI API
client = OpenAI(api_key="<your-dvara-api-key>", base_url="http://localhost:8080/v1")

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a haiku about programming."}],
    stream=True,
)

for chunk in stream:
    # The role-only first chunk and the final chunk carry no content; skip them.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()

For end-to-end Node.js and Java consumers — including line-iterator parsing, partial-JSON accumulation for json_schema streams, and a reconnect strategy — see the Streaming end-to-end cookbook recipe.

Provider-Specific Notes

Every built-in provider supports streaming. The gateway normalizes every provider's upstream chunk format, regardless of its native streaming protocol, into the same OpenAI-format chunks the client receives. Governance and audit always see the same shape, so no client-side branching by provider is needed.

| Provider | Upstream streaming protocol | Notes |
| --- | --- | --- |
| OpenAI | data: {"choices":[{"delta":…}]} lines | Standard OpenAI-format deltas. |
| Azure OpenAI | Same shape as OpenAI | Same. |
| Mistral | OpenAI-compatible chunks | Same. |
| Grok | OpenAI-compatible chunks | Same. Supports json_schema natively. |
| Gemini | data: {"candidates":[{"content":…}]} lines | The gateway walks the candidates[].content.parts[] tree and concatenates text parts per chunk. |
| Anthropic | Event-based: message_start, content_block_delta, message_delta, message_stop | Each event becomes one content delta on the wire. With json_schema, the upstream's tool-use deltas are flattened into content text so the JSON arrives progressively. |
| Bedrock | Binary bedrock-runtime event stream | Decoded server-side; both normal text and the json_schema tool-use path produce progressive content deltas. The upstream tool_use stop reason is mapped to stop. |
| DeepSeek, Moonshot, ChatGLM | OpenAI-compatible chunks | Same as OpenAI for json_object; json_schema filtered out by capability routing. |
| Cohere | OpenAI-style chunks from Cohere's v2 chat API | No response_format support; filtered out when one is requested. |
| Groq | OpenAI-compatible chunks | Supports json_object streaming but not json_schema. |
| Qwen | OpenAI-compatible chunks | No response_format support; filtered out when one is requested. |
| Ollama | OpenAI-compatible chunks | No response_format support; filtered out when one is requested. |
| Mock | Simulated token-by-token emission with configurable delay | Splits the configured response text into tokens and emits one chunk per token, separated by dvara.llm-gateway.providers.mock.stream-token-delay-ms (default 20 ms). Useful for integration tests and for exercising the streaming governance path with deterministic content. |
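To give a feel for what normalization involves, the sketch below converts one Gemini-style streaming chunk into an OpenAI-format chunk by walking candidates[].content.parts[] and concatenating the text parts. It is an illustration of the idea rather than the gateway's implementation: gemini_to_openai_chunk is a hypothetical name, a single candidate is assumed, and finish-reason mapping is left out.

```python
from typing import Any, Dict


def gemini_to_openai_chunk(upstream: Dict[str, Any], completion_id: str, model: str) -> Dict[str, Any]:
    """Build an OpenAI-format chunk from one Gemini-style streaming chunk."""
    text = "".join(
        part.get("text", "")
        for candidate in upstream.get("candidates", [])
        for part in candidate.get("content", {}).get("parts", [])
    )
    return {
        "id": completion_id,
        "object": "chat.completion.chunk",
        "model": model,
        "choices": [{"index": 0, "delta": {"content": text}, "finish_reason": None}],
    }


upstream_chunk = {
    "candidates": [
        {"content": {"role": "model", "parts": [{"text": "One"}, {"text": ", two"}]}}
    ]
}
normalized = gemini_to_openai_chunk(upstream_chunk, "chatcmpl-abc123", "gemini-example")
print(normalized["choices"][0]["delta"]["content"])
```

The model name "gemini-example" is a placeholder, not a real model identifier.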