Version: 1.3.0

SSE Streaming

DVARA supports Server-Sent Events (SSE) streaming for all providers. Streaming delivers tokens incrementally as they are generated, reducing perceived latency.

Enabling Streaming

Set "stream": true in your chat completion request:

curl -s -N -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [
      {"role": "user", "content": "Count to five slowly."}
    ],
    "stream": true
  }'

SSE Protocol

The response uses Content-Type: text/event-stream. Each event is a JSON chunk on a line prefixed with "data: ":

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","model":"gpt-4o","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","model":"gpt-4o","choices":[{"index":0,"delta":{"content":"One"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","model":"gpt-4o","choices":[{"index":0,"delta":{"content":", two"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","model":"gpt-4o","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]

Chunk Format

Each chunk contains:

id — Stable completion ID across all chunks
object — Always "chat.completion.chunk"
model — The model used
choices[].delta — Incremental content (content for text, role for the first chunk)
choices[].finish_reason — null while the stream is in progress; set on the final chunk ("stop" on normal completion, "content_filter" when gateway enforcement ends the stream)

The stream terminates with data: [DONE].
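As a rough illustration of the protocol and chunk fields above, the following sketch parses "data: " lines, concatenates the content deltas, and stops at the [DONE] sentinel. The helper names iter_chunks and collect_text are hypothetical, and the sample lines are abbreviated versions of the chunks shown earlier; a real client would read the lines from the HTTP response body.

```python
import json
from typing import Iterable, Iterator, Optional, Tuple


def iter_chunks(lines: Iterable[str]) -> Iterator[dict]:
    """Yield parsed chunk dicts from SSE lines until the [DONE] sentinel."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not data
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        yield json.loads(payload)


def collect_text(lines: Iterable[str]) -> Tuple[str, Optional[str]]:
    """Return (concatenated content, last non-null finish_reason)."""
    parts = []
    finish_reason = None
    for chunk in iter_chunks(lines):
        for choice in chunk.get("choices", []):
            parts.append(choice.get("delta", {}).get("content") or "")
            finish_reason = choice.get("finish_reason") or finish_reason
    return "".join(parts), finish_reason


# Abbreviated sample stream (id/object/model fields omitted for brevity).
SAMPLE = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":"One"},"finish_reason":null}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":", two"},"finish_reason":null}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}',
    "",
    "data: [DONE]",
]

print(collect_text(SAMPLE))  # ('One, two', 'stop')
```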

Streaming with Structured Outputs

Streaming works with response_format. From the client's point of view the experience is identical no matter which provider handles the call — JSON bytes arrive progressively as content deltas. The gateway absorbs the per-provider streaming-protocol differences (see Structured Outputs for the rewrite mechanics).

Provider group	Streaming behaviour with `response_format`
OpenAI, Azure OpenAI, Mistral, Gemini, Grok	Native. Standard OpenAI-format `content` deltas.
Anthropic, Bedrock	Tool-use rewrite for `json_schema`. The gateway flattens the upstream tool-call stream into normal `content` deltas — JSON bytes arrive progressively, no client-side tool-call handling required.
DeepSeek, Moonshot, ChatGLM, Groq	Native `json_object`. `json_schema` routed away by capability filter.
Qwen, Cohere, Ollama	Filtered out for any `response_format`. A request on a route with only these providers fails fast with `no_capable_provider` (HTTP 400) before any streaming starts.
Mock	Streams the wrapped `{"result": "..."}` text one token at a time.
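Because structured-output JSON arrives as ordinary content deltas, a client only has to concatenate them and parse once the stream ends. The sketch below shows that idea; try_parse and accumulate_json are hypothetical helper names, and the delta strings are made-up illustrative data rather than real gateway output.

```python
import json
from typing import Any, Iterable, Optional


def try_parse(buffer: str) -> Optional[Any]:
    """Return the parsed JSON value, or None while the text is still incomplete."""
    try:
        return json.loads(buffer)
    except json.JSONDecodeError:
        return None


def accumulate_json(content_deltas: Iterable[str]) -> Any:
    """Join all content deltas and parse the result once the stream has ended."""
    return json.loads("".join(content_deltas))


# Made-up deltas that split one JSON object at arbitrary points.
deltas = ['{"answer": "for', 'ty-two", "conf', 'idence": 0.9}']

print(try_parse("".join(deltas[:2])))  # None: the object is not closed yet
print(accumulate_json(deltas))         # {'answer': 'forty-two', 'confidence': 0.9}
```

Parsing only after the final chunk is the simplest approach; incremental parsing of partial JSON needs a dedicated parser and is out of scope for this sketch.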

Timeout Configuration

Streaming requests have a separate, longer timeout than non-streaming requests:

dvara:
  llm-gateway:
    resilience:
      timeout:
        chat-timeout-ms: 30000        # non-streaming: 30 seconds
        streaming-timeout-ms: 120000  # streaming: 120 seconds

Caching Behavior

Streaming requests bypass the response cache entirely — no cache lookup, no cache storage, no X-Cache header. This is by design, as streaming responses are typically unique and the cache cannot store partial streams.

Limitations

A few cases where the streaming path differs from non-streaming. None are blockers, but they're worth knowing before you wire up a client.

No usage chunks. Streaming responses don't emit a final chunk carrying token counts; OpenAI's stream_options: {include_usage: true} is not honored. The gateway still records token usage server-side — query it via GET /v1/admin/token-usage after the call, or use a non-streaming request when per-call token counts are needed in the response itself.
Native tool calls aren't surfaced as tool deltas. The gateway normalizes every chunk to a content delta on the wire. When the upstream emits native tool-call deltas (OpenAI/Anthropic/Bedrock without a json_schema rewrite), those deltas don't carry through — the client sees empty content deltas and a finish_reason: tool_calls. Use non-streaming when your client needs to consume tool calls. The json_schema rewrite path on Anthropic/Bedrock does flatten its tool-use into content text deltas, so structured-output streaming via response_format works fine.
No X-Gateway-Strict-Downgraded header on streams. That header is set on a non-streaming response; streamed responses don't carry it. If you need to detect strict downgrade for a streamed call, do the same call non-streaming first to read the header, then stream subsequent calls knowing the route's downgrade behavior.
Client disconnect. When the SSE consumer closes the connection, the gateway flags the emitter as completed and stops sending. The upstream provider call may continue briefly until the underlying HTTP client times out — token usage for the partial response is still recorded.

Governance on streaming responses

PII, guardrail, and grounding checks run on streamed responses just like non-streaming. When enforcement triggers, the stream terminates with finish_reason: content_filter instead of the normal stop. Clients distinguish "the model finished" (stop / length) from "the gateway pulled the plug" (content_filter) by inspecting this field — that's the only signal callers need to react to.
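A client can branch on finish_reason along these lines. This is a minimal sketch: the function name classify_finish and the returned labels are hypothetical, and only the finish_reason values mentioned on this page are handled.

```python
from typing import Optional


def classify_finish(finish_reason: Optional[str]) -> str:
    """Map a chunk's finish_reason to a coarse outcome label (labels are illustrative)."""
    if finish_reason is None:
        return "in_progress"      # more chunks are expected
    if finish_reason in ("stop", "length"):
        return "model_finished"   # the model ended the response itself
    if finish_reason == "content_filter":
        return "gateway_blocked"  # gateway enforcement terminated the stream
    if finish_reason == "tool_calls":
        return "tool_calls"       # see Limitations: retry non-streaming to read tool calls
    return "unknown"


print(classify_finish("content_filter"))  # gateway_blocked
```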

Because decisions have to be made on partial text, the gateway buffers each text delta into a scan window and runs the detectors at every window boundary. Detections fire in-flight; grounding checks the full response once at the end of the stream.

Per-action behaviour

Subsystem	Action	What the client sees	Audit event
PII	`BLOCK`	Final chunk with `finish_reason: content_filter`, stream stops.	`PII_BLOCKED_STREAMING` (per detection)
PII	`REDACT`	Detected spans rewritten in-flight to tokenized placeholders of the form `{{PII_<TYPE>_<hex>}}` (e.g. `{{PII_EMAIL_a1b2c3}}`) before the delta leaves the gateway. The same token format drives the detokenize round-trip.	summary at stream end (see below)
PII	`LOG`	Stream passes through unchanged.	summary at stream end
Guardrail	`BLOCK`	Final chunk with `finish_reason: content_filter`, stream stops.	`GUARDRAIL_BLOCKED_STREAMING` (per detection)
Guardrail	`FLAG` / `LOG`	Stream passes through unchanged.	summary at stream end
Grounding	`BLOCK`	Final delta is suppressed; stream terminates with `finish_reason: content_filter`.	`HALLUCINATION_DETECTED_STREAMING`
Grounding	`FLAG` / `LOG`	Stream completes normally.	`HALLUCINATION_DETECTED_STREAMING`
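On the client side, the REDACT placeholders described in the table can be located with a regular expression. The pattern below is an assumption derived only from the documented form {{PII_<TYPE>_<hex>}} and the single example; in particular, uppercase type names and lowercase hex digits are guesses, and find_placeholders is a hypothetical helper.

```python
import re
from typing import List, Tuple

# Assumed shape: uppercase TYPE (may contain underscores), then lowercase hex.
PLACEHOLDER = re.compile(r"\{\{PII_([A-Z0-9_]+)_([0-9a-f]+)\}\}")


def find_placeholders(text: str) -> List[Tuple[str, str]]:
    """Return (type, hex) pairs for every redaction placeholder found in text."""
    return PLACEHOLDER.findall(text)


redacted = "Contact {{PII_EMAIL_a1b2c3}} or call {{PII_PHONE_0f9e8d}}."
print(find_placeholders(redacted))  # [('EMAIL', 'a1b2c3'), ('PHONE', '0f9e8d')]
```

A placeholder may be split across two deltas, so it is safer to run the match on the accumulated text than on each delta in isolation.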

When at least one detection fired but the stream completed (no BLOCK), the gateway writes a single STREAMING_ENFORCEMENT_SUMMARY event at stream end with counts: pii_entity_count, guardrail_detection_count. Per-detection events fire only for the BLOCK actions — LOG / FLAG / REDACT modes don't emit per-detection rows during the stream.

This is a deliberate trade-off: a long stream that triggers PII LOG 500 times shouldn't produce 500 audit rows. If you need per-detection visibility on a non-block path, use a non-streaming call or graduate the action to BLOCK for the categories that warrant a row each.

Because grounding runs once at stream end against the full accumulated response, it adds no mid-stream latency — full streaming throughput holds right up to the last chunk.

Scan window mechanics

Scanning every character as it arrives would be expensive, so the gateway buffers text into a scan window of scanWindowSize characters. When the buffer fills, the gateway scans the "safe region" (everything except the last overlapMargin characters) and retains the overlap for the next scan. The overlap exists to catch patterns that straddle two scan windows; without it, an email address that begins before the 256-character boundary and ends after it would slip through silently.

Default scan window: 256 characters
Default overlap margin: 64 characters
Per-request floor: scanWindowSize is raised to at least 32 and overlapMargin to at least 16; if overlapMargin would otherwise equal or exceed scanWindowSize, it is reduced to scanWindowSize / 2. Out-of-range YAML values are silently clamped at request time, not rejected at boot: setting streaming-overlap-margin: 0 appears accepted in the gateway log at startup but runs as 16 in production. Pick the value you actually want.
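The clamping rules can be written out as a small function. This is a sketch of the rules as stated, not the gateway's code: the function name is hypothetical, and both the order in which the rules apply and the use of integer division for scanWindowSize / 2 are assumptions.

```python
from typing import Tuple

MIN_WINDOW = 32   # documented floor for scanWindowSize
MIN_OVERLAP = 16  # documented floor for overlapMargin


def clamp_scan_settings(scan_window_size: int, overlap_margin: int) -> Tuple[int, int]:
    """Apply the documented floors, then keep the overlap below the window size."""
    window = max(scan_window_size, MIN_WINDOW)
    overlap = max(overlap_margin, MIN_OVERLAP)
    if overlap >= window:
        overlap = window // 2  # assumption: integer division
    return window, overlap


print(clamp_scan_settings(256, 0))  # (256, 16)
print(clamp_scan_settings(64, 64))  # (64, 32)
```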

Tune the window if you need earlier-firing detection (smaller window = more scans, more CPU, earlier catches) or cheaper streams (larger window = fewer scans, later catches). The defaults are calibrated for typical chat completions; most deployments never touch them.
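To make the buffering arithmetic concrete, here is a minimal sketch of a buffer that releases a safe region whenever it fills and keeps the trailing overlap. ScanWindowBuffer is a hypothetical class, not the gateway's implementation; it covers only how text is split between "scanned now" and "retained", and the flush at stream end is an assumption. The demo uses a window of 8 and an overlap of 3, which are below the documented floors, purely to keep the output readable.

```python
from typing import Optional


class ScanWindowBuffer:
    """Buffers streamed text and hands out 'safe regions' for scanning."""

    def __init__(self, window_size: int = 256, overlap_margin: int = 64) -> None:
        if not 0 < overlap_margin < window_size:
            raise ValueError("overlap_margin must be positive and smaller than window_size")
        self.window_size = window_size
        self.overlap_margin = overlap_margin
        self._buffer = ""

    def feed(self, delta: str) -> Optional[str]:
        """Append a delta; return the safe region once the buffer has filled."""
        self._buffer += delta
        if len(self._buffer) < self.window_size:
            return None  # not enough text yet, nothing to scan
        cut = len(self._buffer) - self.overlap_margin
        safe, self._buffer = self._buffer[:cut], self._buffer[cut:]
        return safe

    def flush(self) -> str:
        """Return whatever is still buffered when the stream ends (assumed behaviour)."""
        remaining, self._buffer = self._buffer, ""
        return remaining


buf = ScanWindowBuffer(window_size=8, overlap_margin=3)
regions = []
for delta in ["abcdefgh", "ij", "klmnop"]:
    safe = buf.feed(delta)
    if safe is not None:
        regions.append(safe)
regions.append(buf.flush())
print(regions)  # ['abcde', 'fghijklm', 'nop']
```

Shrinking window_size makes feed return regions sooner and more often, which is the CPU-versus-latency trade-off described above.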

Configuration

Streaming governance inherits the PII and guardrail settings from the non-streaming path, plus a pair of scan-streaming-responses toggles that can disable streaming enforcement specifically:

dvara:
  llm-gateway:
    pii:
      enabled: true
      default-action: LOG                   # BLOCK / REDACT / LOG
      scan-streaming-responses: true        # turn off to skip streaming PII scans only
      streaming-scan-window-size: 256
      streaming-overlap-margin: 64
    guardrail:
      enabled: true
      default-action: LOG                   # BLOCK / FLAG / LOG
      scan-streaming-responses: true
      streaming-scan-window-size: 256
      streaming-overlap-margin: 64

The gateway uses the smaller of the two streaming-scan-window-size values and the larger of the two streaming-overlap-margin values when both PII and guardrail scanning are active, so a single scan boundary covers both subsystems. You can't pick one set of window mechanics for PII and a different set for guardrail — the scanner is shared.
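The merge rule amounts to a min over window sizes and a max over overlap margins. In the sketch below, ScanSettings and effective_scan_settings are hypothetical names, and any clamping applied afterwards is left out.

```python
from typing import NamedTuple


class ScanSettings(NamedTuple):
    window_size: int
    overlap_margin: int


def effective_scan_settings(pii: ScanSettings, guardrail: ScanSettings) -> ScanSettings:
    """Combine both subsystems into the single shared scanner configuration."""
    return ScanSettings(
        window_size=min(pii.window_size, guardrail.window_size),
        overlap_margin=max(pii.overlap_margin, guardrail.overlap_margin),
    )


shared = effective_scan_settings(ScanSettings(256, 64), ScanSettings(128, 32))
print(shared)  # ScanSettings(window_size=128, overlap_margin=64)
```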

Per-tenant overrides

Any tenant can turn streaming scanning on or off independently of the global default, and can pick a different action than the global default, via Tenant.metadata:

Metadata key	Type	Effect
`pii.scan-streaming-responses`	boolean	Enable or disable streaming PII scanning for this tenant.
`guardrail.scan-streaming-responses`	boolean	Enable or disable streaming guardrail scanning for this tenant.
`pii.action`	`BLOCK` / `REDACT` / `LOG`	Per-tenant PII action applied to both streaming and non-streaming paths.
`guardrail.action`	`BLOCK` / `FLAG` / `LOG`	Per-tenant guardrail action applied to both paths.
`grounding.enabled`	boolean	Enable or disable grounding at stream end for this tenant.
`grounding.action`	`BLOCK` / `FLAG` / `LOG`	Action when grounding detects ungrounded claims.

Scan window size and overlap margin are not per-tenant — they're shared across the whole gateway and set at startup. If different tenants need different strictness, adjust action per tenant rather than the window mechanics.
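The override lookup can be pictured as a metadata read with a fallback to the global value. This is a sketch under the assumption that a tenant value, when present, always wins; the function names are hypothetical, and only the metadata keys come from the table above.

```python
from typing import Any, Mapping


def resolve_action(subsystem: str, tenant_metadata: Mapping[str, Any], global_default: str) -> str:
    """Return the tenant's action for 'pii', 'guardrail' or 'grounding', else the global one."""
    return tenant_metadata.get(f"{subsystem}.action", global_default)


def streaming_scan_enabled(subsystem: str, tenant_metadata: Mapping[str, Any], global_enabled: bool) -> bool:
    """Return the tenant's scan-streaming-responses flag, else the global setting."""
    value = tenant_metadata.get(f"{subsystem}.scan-streaming-responses")
    return global_enabled if value is None else bool(value)


metadata = {"pii.action": "REDACT", "guardrail.scan-streaming-responses": False}
print(resolve_action("pii", metadata, "LOG"))               # REDACT
print(resolve_action("guardrail", metadata, "LOG"))         # LOG
print(streaming_scan_enabled("guardrail", metadata, True))  # False
```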

Try it

The minimal curl invocation:

curl -s -N -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4-5",
    "messages": [{"role": "user", "content": "Write a haiku about programming."}],
    "stream": true
  }'

The equivalent call from Python with the OpenAI SDK (this example targets gpt-4o):

from openai import OpenAI

client = OpenAI(api_key="<your-dvara-api-key>", base_url="http://localhost:8080/v1")

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a haiku about programming."}],
    stream=True,
)

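for chunk in stream:
    # Skip chunks with no choices, and role-only or empty deltas.
    if chunk.choices and chunk.choices[0].delta.content:
        # Print each text fragment as soon as it arrives.
        print(chunk.choices[0].delta.content, end="", flush=True)
print()  # final newline once the stream ends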

For end-to-end Node.js and Java consumers — including line-iterator parsing, partial-JSON accumulation for json_schema streams, and a reconnect strategy — see the Streaming end-to-end cookbook recipe.

Provider-Specific Notes

Every built-in provider supports streaming. Whatever the provider's native streaming protocol, the gateway normalizes upstream chunks into the same OpenAI-format chunks the client receives. Governance and audit always see the same shape, so no client-side branching by provider is needed.

Provider	Upstream streaming protocol	Notes
OpenAI	`data: {"choices":[{"delta":…}]}` lines	Standard OpenAI-format deltas.
Azure OpenAI	Same shape as OpenAI	Same.
Mistral	OpenAI-compatible chunks	Same.
Grok	OpenAI-compatible chunks	Same. Supports `json_schema` natively.
Gemini	`data: {"candidates":[{"content":…}]}` lines	The gateway walks the `candidates[].content.parts[]` tree and concatenates text parts per chunk.
Anthropic	Event-based: `message_start`, `content_block_delta`, `message_delta`, `message_stop`	Each event becomes one `content` delta on the wire. With `json_schema`, the upstream's tool-use deltas are flattened into `content` text so the JSON arrives progressively.
Bedrock	Binary `bedrock-runtime` event stream	Decoded server-side; both normal text and the `json_schema` tool-use path produce progressive `content` deltas. The upstream `tool_use` stop reason is mapped to `stop`.
DeepSeek, Moonshot, ChatGLM	OpenAI-compatible chunks	Same as OpenAI for `json_object`; `json_schema` filtered out by capability routing.
Cohere	OpenAI-style chunks from Cohere's v2 chat API	No `response_format` support — filtered out when one is requested.
Groq	OpenAI-compatible chunks	Supports `json_object` streaming but not `json_schema`.
Qwen	OpenAI-compatible chunks	No `response_format` support — filtered out when one is requested.
Ollama	OpenAI-compatible chunks	No `response_format` support — filtered out when one is requested.
Mock	Simulated token-by-token emission with configurable delay	Splits the configured response text into tokens and emits one chunk per token, separated by `dvara.llm-gateway.providers.mock.stream-token-delay-ms` (default 20 ms). Useful for integration tests and for exercising the streaming governance path with deterministic content.
