SSE Streaming
DVARA supports Server-Sent Events (SSE) streaming for all providers. Streaming delivers tokens incrementally as they are generated, reducing perceived latency.
Enabling Streaming
Set "stream": true in your chat completion request:
curl -s -N -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [
      {"role": "user", "content": "Count to five slowly."}
    ],
    "stream": true
  }'
SSE Protocol
The response uses Content-Type: text/event-stream. Each event is a line that starts with the data: prefix, followed by a JSON chunk:
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","model":"gpt-4o","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","model":"gpt-4o","choices":[{"index":0,"delta":{"content":"One"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","model":"gpt-4o","choices":[{"index":0,"delta":{"content":", two"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","model":"gpt-4o","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
data: [DONE]
Chunk Format
Each chunk contains:
- id: stable completion ID across all chunks
- object: always "chat.completion.chunk"
- model: the model used
- choices[].delta: incremental content (content for text, role for the first chunk)
- choices[].finish_reason: null during streaming, "stop" on the final chunk of a normally completed response
The stream terminates with data: [DONE].
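As an illustration of the protocol above, here is a minimal Python sketch of a client-side reader that strips the data: prefix, stops at [DONE], and joins the content deltas. The function name read_sse_content is hypothetical, and the sample lines are shortened versions of the example chunks shown earlier; a real client would read these lines from the HTTP response body instead of a list.

```python
import json
from typing import Iterable, Optional, Tuple


def read_sse_content(lines: Iterable[str]) -> Tuple[str, Optional[str]]:
    """Join the content deltas from SSE lines; return (text, finish_reason)."""
    parts = []
    finish_reason = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not data
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            if delta.get("content"):
                parts.append(delta["content"])
            if choice.get("finish_reason") is not None:
                finish_reason = choice["finish_reason"]
    return "".join(parts), finish_reason


# Sample lines modelled on the example stream above (ids and model omitted).
sample = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{"content":"One"},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{"content":", two"},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}',
    "data: [DONE]",
]
text, reason = read_sse_content(sample)
print(text, reason)
```

The first chunk carries only the role and the last carries only the finish reason, so the reader treats a missing content field as "nothing to append" rather than an error.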
Streaming with Structured Outputs
Streaming works with response_format. From the client's point of view the experience is identical no matter which provider handles the call — JSON bytes arrive progressively as content deltas. The gateway absorbs the per-provider streaming-protocol differences (see Structured Outputs for the rewrite mechanics).
| Provider group | Streaming behaviour with response_format |
|---|---|
| OpenAI, Azure OpenAI, Mistral, Gemini, Grok | Native. Standard OpenAI-format content deltas. |
| Anthropic, Bedrock | Tool-use rewrite for json_schema. The gateway flattens the upstream tool-call stream into normal content deltas — JSON bytes arrive progressively, no client-side tool-call handling required. |
| DeepSeek, Moonshot, ChatGLM, Groq | Native json_object. json_schema routed away by capability filter. |
| Qwen, Cohere, Ollama | Filtered out for any response_format. A request on a route with only these providers fails fast with no_capable_provider (HTTP 400) before any streaming starts. |
| Mock | Streams the wrapped {"result": "..."} text one token at a time. |
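Because the JSON arrives as ordinary content deltas, a client can buffer the pieces and parse once the stream has ended. The sketch below assumes the deltas have already been extracted from the chunks; collect_structured_output is a hypothetical helper name, and the sample payload is made up for illustration.

```python
import json
from typing import Any, Iterable


def collect_structured_output(deltas: Iterable[str]) -> Any:
    """Buffer streamed JSON fragments and parse them after the stream ends."""
    buffer = []
    for piece in deltas:
        buffer.append(piece)  # fragments are not valid JSON on their own
    return json.loads("".join(buffer))


# Illustrative fragments: the split points are arbitrary, as they would be on the wire.
fragments = ['{"resu', 'lt": "fo', 'ur"}']
result = collect_structured_output(fragments)
print(result)
```

Parsing only at the end keeps the client simple; incremental rendering of partial JSON needs a tolerant parser and is outside the scope of this sketch.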
Timeout Configuration
Streaming requests have a separate, longer timeout than non-streaming requests:
dvara:
  llm-gateway:
    resilience:
      timeout:
        chat-timeout-ms: 30000        # non-streaming: 30 seconds
        streaming-timeout-ms: 120000  # streaming: 120 seconds
Caching Behavior
Streaming requests bypass the response cache entirely — no cache lookup, no cache storage, no X-Cache header. This is by design, as streaming responses are typically unique and the cache cannot store partial streams.
Limitations
A few cases where the streaming path differs from non-streaming. None are blockers, but they're worth knowing before you wire up a client.
- No usage chunks. Streaming responses don't emit a final chunk carrying token counts; OpenAI's stream_options: {include_usage: true} is not honored. The gateway still records token usage server-side. Query it via GET /v1/admin/token-usage after the call, or use a non-streaming request when per-call token counts are needed in the response itself.
- Native tool calls aren't surfaced as tool deltas. The gateway normalizes every chunk to a content delta on the wire. When the upstream emits native tool-call deltas (OpenAI/Anthropic/Bedrock without a json_schema rewrite), those deltas don't carry through: the client sees empty content deltas and a finish_reason: tool_calls. Use non-streaming when your client needs to consume tool calls. The json_schema rewrite path on Anthropic/Bedrock does flatten its tool-use into content text deltas, so structured-output streaming via response_format works fine.
- No X-Gateway-Strict-Downgraded header on streams. That header is set on a non-streaming response; streamed responses don't carry it. If you need to detect strict downgrade for a streamed call, do the same call non-streaming first to read the header, then stream subsequent calls knowing the route's downgrade behavior.
- Client disconnect. When the SSE consumer closes the connection, the gateway flags the emitter as completed and stops sending. The upstream provider call may continue briefly until the underlying HTTP client times out; token usage for the partial response is still recorded.
Governance on streaming responses
PII, guardrail, and grounding checks run on streamed responses just like non-streaming. When enforcement triggers, the stream terminates with finish_reason: content_filter instead of the normal stop. Clients distinguish "the model finished" (stop / length) from "the gateway pulled the plug" (content_filter) by inspecting this field — that's the only signal callers need to react to.
Because decisions have to be made on partial text, the gateway buffers each text delta into a scan window and runs the detectors at every window boundary. Detections fire in-flight; grounding checks the full response once at the end of the stream.
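On the client, the distinction described above comes down to a check on the final finish_reason. A minimal sketch follows; the outcome labels and the function name stream_outcome are hypothetical and chosen only for this example.

```python
from typing import Optional


def stream_outcome(finish_reason: Optional[str]) -> str:
    """Classify how a stream ended, based on the last finish_reason seen."""
    if finish_reason in ("stop", "length"):
        return "model_finished"   # the model ended the response itself
    if finish_reason == "content_filter":
        return "gateway_blocked"  # enforcement terminated the stream
    return "other"                # e.g. tool_calls, or no finish_reason at all


for reason in ("stop", "length", "content_filter", "tool_calls", None):
    print(reason, "->", stream_outcome(reason))
```

A caller would typically discard or flag the partial text it has already rendered when the outcome is gateway_blocked.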
Per-action behaviour
| Subsystem | Action | What the client sees | Audit event |
|---|---|---|---|
| PII | BLOCK | Final chunk with finish_reason: content_filter, stream stops. | PII_BLOCKED_STREAMING (per detection) |
| PII | REDACT | Detected spans rewritten in-flight to tokenized placeholders ([EMAIL], [PHONE], [SSN], …) before the delta leaves the gateway. | summary at stream end (see below) |
| PII | LOG | Stream passes through unchanged. | summary at stream end |
| Guardrail | BLOCK | Final chunk with finish_reason: content_filter, stream stops. | GUARDRAIL_BLOCKED_STREAMING (per detection) |
| Guardrail | FLAG / LOG | Stream passes through unchanged. | summary at stream end |
| Grounding | BLOCK | Final delta is suppressed; stream terminates with finish_reason: content_filter. | HALLUCINATION_DETECTED_STREAMING |
| Grounding | FLAG / LOG | Stream completes normally. | HALLUCINATION_DETECTED_STREAMING |
When at least one detection fired but the stream completed (no BLOCK), the gateway writes a single STREAMING_ENFORCEMENT_SUMMARY event at stream end with counts: pii_entity_count, guardrail_detection_count. Per-detection events fire only for the BLOCK actions — LOG / FLAG / REDACT modes don't emit per-detection rows during the stream.
This is a deliberate trade-off: a long stream that triggers PII LOG 500 times shouldn't produce 500 audit rows. If you need per-detection visibility on a non-block path, use a non-streaming call or graduate the action to BLOCK for the categories that warrant a row each.
Because grounding runs once at stream end against the full accumulated response, it adds no mid-stream latency — full streaming throughput holds right up to the last chunk.
Scan window mechanics
Scanning every character as it arrives would be expensive, so the gateway buffers text into a scan window of size scanWindowSize (characters). When the buffer fills, the gateway scans the "safe region" (everything except the last overlapMargin characters) and retains the overlap for the next scan. The overlap exists to catch patterns that would otherwise be missed because they straddle two scan windows; without it, an email address that starts before the 256-character boundary and ends after it would be split across two scans and slip through silently.
- Default scan window: 256 characters
- Default overlap margin: 64 characters
- Per-request floor: scanWindowSize is clamped up to 32, overlapMargin is clamped up to 16, and overlapMargin is clamped to scanWindowSize / 2 if it would otherwise equal or exceed scanWindowSize. Out-of-range YAML values are silently clamped at request time, not rejected at boot: setting streaming-overlap-margin: 0 looks accepted in the gateway log on startup but runs as 16 in production. Pick the value you actually want.
Tune the window if you need earlier-firing detection (smaller window = more scans, more CPU, earlier catches) or cheaper streams (larger window = fewer scans, later catches). The defaults are calibrated for typical chat completions; most deployments never touch them.
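The clamping rules listed above can be worked through with a small Python sketch, which is handy for checking what a given pair of YAML values will actually run as. It assumes the floors are applied before the half-window rule and that scanWindowSize / 2 is integer division; neither detail is confirmed here, and clamp_scan_settings is a hypothetical name.

```python
from typing import Tuple

MIN_SCAN_WINDOW = 32  # floor for scanWindowSize, as stated above
MIN_OVERLAP = 16      # floor for overlapMargin, as stated above


def clamp_scan_settings(scan_window_size: int, overlap_margin: int) -> Tuple[int, int]:
    """Return the (window, overlap) pair a request would effectively use."""
    window = max(scan_window_size, MIN_SCAN_WINDOW)
    overlap = max(overlap_margin, MIN_OVERLAP)
    if overlap >= window:
        # Assumption: the half-window rule uses integer division.
        overlap = window // 2
    return window, overlap


print(clamp_scan_settings(256, 64))  # defaults pass through unchanged
print(clamp_scan_settings(256, 0))   # overlap 0 is raised to the floor
print(clamp_scan_settings(64, 64))   # overlap equal to the window is halved
```

Under these assumptions the three calls yield (256, 64), (256, 16), and (64, 32).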
Configuration
Streaming governance inherits the PII and guardrail settings from the non-streaming path, plus a pair of scan-streaming-responses toggles that can disable streaming enforcement specifically:
dvara:
  llm-gateway:
    pii:
      enabled: true
      default-action: LOG              # BLOCK / REDACT / LOG
      scan-streaming-responses: true   # turn off to skip streaming PII scans only
      streaming-scan-window-size: 256
      streaming-overlap-margin: 64
    guardrail:
      enabled: true
      default-action: LOG              # BLOCK / FLAG / LOG
      scan-streaming-responses: true
      streaming-scan-window-size: 256
      streaming-overlap-margin: 64
The gateway uses the smaller of the two streaming-scan-window-size values and the larger of the two streaming-overlap-margin values when both PII and guardrail scanning are active, so a single scan boundary covers both subsystems. You can't pick one set of window mechanics for PII and a different set for guardrail — the scanner is shared.
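The merge rule above (smaller window, larger overlap) is simple arithmetic, sketched here in Python. ScanSettings and shared_scan_settings are hypothetical names used only for this example, and the sketch does not model how clamping interacts with the merge.

```python
from typing import NamedTuple


class ScanSettings(NamedTuple):
    window_size: int     # streaming-scan-window-size
    overlap_margin: int  # streaming-overlap-margin


def shared_scan_settings(pii: ScanSettings, guardrail: ScanSettings) -> ScanSettings:
    """Combine both subsystems' settings into the single shared scanner config."""
    return ScanSettings(
        window_size=min(pii.window_size, guardrail.window_size),
        overlap_margin=max(pii.overlap_margin, guardrail.overlap_margin),
    )


merged = shared_scan_settings(ScanSettings(256, 64), ScanSettings(128, 32))
print(merged)  # ScanSettings(window_size=128, overlap_margin=64)
```

In this example the guardrail's smaller window and the PII scanner's larger overlap both win, so tightening either subsystem tightens the scan for both.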
Per-tenant overrides
Any tenant can turn streaming scanning on or off independently of the global default, and can pick a different action than the global default, via Tenant.metadata:
| Metadata key | Type | Effect |
|---|---|---|
| pii.scan-streaming-responses | boolean | Enable or disable streaming PII scanning for this tenant. |
| guardrail.scan-streaming-responses | boolean | Enable or disable streaming guardrail scanning for this tenant. |
| pii.action | BLOCK / REDACT / LOG | Per-tenant PII action applied to both streaming and non-streaming paths. |
| guardrail.action | BLOCK / FLAG / LOG | Per-tenant guardrail action applied to both paths. |
| grounding.enabled | boolean | Enable or disable grounding at stream end for this tenant. |
| grounding.action | BLOCK / FLAG / LOG | Action when grounding detects ungrounded claims. |
Scan window size and overlap margin are not per-tenant — they're shared across the whole gateway and set at startup. If different tenants need different strictness, adjust action per tenant rather than the window mechanics.
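One way to picture the override behaviour is a lookup that prefers the tenant's metadata and falls back to the global default. The sketch below is an illustration only: it treats Tenant.metadata as a plain dictionary, maps the global default-action settings onto the pii.action and guardrail.action keys, and uses the hypothetical function name effective_settings. The gateway's actual resolution logic is not described here.

```python
from typing import Any, Dict

# Assumed global defaults, mirroring the YAML example in the Configuration section.
GLOBAL_DEFAULTS: Dict[str, Any] = {
    "pii.scan-streaming-responses": True,
    "pii.action": "LOG",
    "guardrail.scan-streaming-responses": True,
    "guardrail.action": "LOG",
}


def effective_settings(tenant_metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Overlay a tenant's metadata on the global defaults, key by key."""
    settings = dict(GLOBAL_DEFAULTS)
    for key in GLOBAL_DEFAULTS:
        if key in tenant_metadata:
            settings[key] = tenant_metadata[key]
    return settings


# A stricter tenant: block PII, leave everything else at the global default.
strict_tenant = effective_settings({"pii.action": "BLOCK"})
print(strict_tenant["pii.action"], strict_tenant["guardrail.action"])
```

Keys the sketch does not know about are ignored, which is a choice made for the example rather than a statement about how the gateway treats unknown metadata.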
Try it
The minimal curl invocation:
curl -s -N -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4-5",
    "messages": [{"role": "user", "content": "Write a haiku about programming."}],
    "stream": true
  }'
The equivalent call from Python with the OpenAI SDK (this example targets gpt-4o):
from openai import OpenAI

# Point the OpenAI SDK at the gateway rather than the default OpenAI endpoint.
client = OpenAI(api_key="<your-dvara-api-key>", base_url="http://localhost:8080/v1")

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a haiku about programming."}],
    stream=True,
)

# Print each text delta as it arrives; role-only and final chunks carry no content.
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
For end-to-end Node.js and Java consumers — including line-iterator parsing, partial-JSON accumulation for json_schema streams, and a reconnect strategy — see the Streaming end-to-end cookbook recipe.
Provider-Specific Notes
Every built-in provider supports streaming. Regardless of a provider's native streaming protocol, the gateway normalizes its upstream chunks into the same OpenAI-format chunks the client receives. Governance and audit always see the same shape, so no client-side branching by provider is needed.
| Provider | Upstream streaming protocol | Notes |
|---|---|---|
| OpenAI | data: {"choices":[{"delta":…}]} lines | Standard OpenAI-format deltas. |
| Azure OpenAI | Same shape as OpenAI | Same. |
| Mistral | OpenAI-compatible chunks | Same. |
| Grok | OpenAI-compatible chunks | Same. Supports json_schema natively. |
| Gemini | data: {"candidates":[{"content":…}]} lines | The gateway walks the candidates[].content.parts[] tree and concatenates text parts per chunk. |
| Anthropic | Event-based: message_start, content_block_delta, message_delta, message_stop | Each event becomes one content delta on the wire. With json_schema, the upstream's tool-use deltas are flattened into content text so the JSON arrives progressively. |
| Bedrock | Binary bedrock-runtime event stream | Decoded server-side; both normal text and the json_schema tool-use path produce progressive content deltas. The upstream tool_use stop reason is mapped to stop. |
| DeepSeek, Moonshot, ChatGLM | OpenAI-compatible chunks | Same as OpenAI for json_object; json_schema filtered out by capability routing. |
| Cohere | OpenAI-style chunks from Cohere's v2 chat API | No response_format support — filtered out when one is requested. |
| Groq | OpenAI-compatible chunks | Supports json_object streaming but not json_schema. |
| Qwen | OpenAI-compatible chunks | No response_format support — filtered out when one is requested. |
| Ollama | OpenAI-compatible chunks | No response_format support — filtered out when one is requested. |
| Mock | Simulated token-by-token emission with configurable delay | Splits the configured response text into tokens and emits one chunk per token, separated by dvara.llm-gateway.providers.mock.stream-token-delay-ms (default 20 ms). Useful for integration tests and for exercising the streaming governance path with deterministic content. |
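To make the normalization concrete, here is a rough Python sketch of the Gemini row: walk candidates[].content.parts[], concatenate the text parts, and wrap the result in an OpenAI-format chunk. The function names and the sample payload are hypothetical, and the gateway's real implementation may differ in details such as how it handles multiple candidates.

```python
from typing import Any, Dict


def gemini_chunk_text(chunk: Dict[str, Any]) -> str:
    """Concatenate every text part found under candidates[].content.parts[]."""
    pieces = []
    for candidate in chunk.get("candidates", []):
        for part in candidate.get("content", {}).get("parts", []):
            text = part.get("text")
            if text:
                pieces.append(text)  # non-text parts are skipped
    return "".join(pieces)


def to_openai_chunk(chunk: Dict[str, Any], completion_id: str, model: str) -> Dict[str, Any]:
    """Wrap the extracted text in the chunk shape the client receives."""
    return {
        "id": completion_id,
        "object": "chat.completion.chunk",
        "model": model,
        "choices": [
            {"index": 0, "delta": {"content": gemini_chunk_text(chunk)}, "finish_reason": None}
        ],
    }


# Illustrative upstream chunk with two text parts.
upstream = {"candidates": [{"content": {"parts": [{"text": "One"}, {"text": ", two"}]}}]}
normalized = to_openai_chunk(upstream, "chatcmpl-abc123", "gemini-example")
print(normalized["choices"][0]["delta"]["content"])
```

The model name gemini-example is a placeholder; the completion ID reuses the one from the SSE example earlier in this page.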