# SSE Streaming
Dvara supports Server-Sent Events (SSE) streaming for all providers. Streaming delivers tokens incrementally as they are generated, reducing perceived latency.
## Enabling Streaming

Set `"stream": true` in your chat completion request:
```bash
curl -s -N -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [
      {"role": "user", "content": "Count to five slowly."}
    ],
    "stream": true
  }'
```
## SSE Protocol

The response uses `Content-Type: text/event-stream`. Each event is a JSON chunk prefixed with `data: `:
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","model":"gpt-4o","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","model":"gpt-4o","choices":[{"index":0,"delta":{"content":"One"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","model":"gpt-4o","choices":[{"index":0,"delta":{"content":", two"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","model":"gpt-4o","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
data: [DONE]
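If you are not using an SDK, any HTTP client that exposes the response body line by line can consume the stream. Below is a minimal sketch using Python's `requests` library; the endpoint and payload mirror the examples on this page, but the parsing code itself is illustrative rather than part of Dvara:

```python
import json
import requests

# Minimal SSE consumer sketch, assuming the local gateway from the examples above.
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "gpt-4o",
        "messages": [{"role": "user", "content": "Count to five slowly."}],
        "stream": True,
    },
    stream=True,  # tell requests not to buffer the body
)

for line in resp.iter_lines(decode_unicode=True):
    if not line or not line.startswith("data: "):
        continue  # skip the blank lines that separate SSE events
    payload = line[len("data: "):]
    if payload == "[DONE]":
        break  # end-of-stream sentinel
    chunk = json.loads(payload)
    delta = chunk["choices"][0]["delta"]
    if "content" in delta:
        print(delta["content"], end="", flush=True)
print()
```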
## Chunk Format
Each chunk contains:
- `id`: Stable completion ID across all chunks
- `object`: Always `"chat.completion.chunk"`
- `model`: The model used
- `choices[].delta`: Incremental content (`content` for text, `role` for the first chunk)
- `choices[].finish_reason`: `null` during streaming, `"stop"` on the final content chunk

The stream terminates with `data: [DONE]`.
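To reconstruct the full assistant message, concatenate the `content` deltas until `finish_reason` is set. A small helper sketch, operating on parsed chunk dicts like those produced by the consumer above (`accumulate` is a hypothetical name, not a Dvara API):

```python
def accumulate(chunks):
    """Concatenate content deltas into the final message text.

    `chunks` is an iterable of parsed chunk dicts; this helper is
    illustrative, not part of Dvara.
    """
    parts = []
    for chunk in chunks:
        choice = chunk["choices"][0]
        delta = choice.get("delta", {})
        if "content" in delta:
            parts.append(delta["content"])
        if choice.get("finish_reason") == "stop":
            break  # final content chunk
    return "".join(parts)
```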
## Streaming with Structured Outputs

Streaming works with `response_format`. For providers that use the tool-use rewrite (Anthropic, Bedrock), the gateway translates `input_json_delta` chunks into content delta chunks, so streaming works end-to-end and your application sees the same format regardless of provider.
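For example, with the OpenAI SDK you can combine `stream=True` with a `response_format` and assemble the streamed JSON before parsing it; the schema below is purely illustrative:

```python
import json
from openai import OpenAI

client = OpenAI(api_key="any-key", base_url="http://localhost:8080/v1")

# Illustrative schema; any JSON Schema accepted by response_format works.
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "List three colors as JSON."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "colors",
            "schema": {
                "type": "object",
                "properties": {
                    "colors": {"type": "array", "items": {"type": "string"}}
                },
                "required": ["colors"],
            },
        },
    },
    stream=True,
)

# The JSON arrives as ordinary content deltas, even when the provider
# (e.g. Anthropic) streams it as input_json_delta events upstream.
buf = "".join(
    chunk.choices[0].delta.content or ""
    for chunk in stream
    if chunk.choices
)
print(json.loads(buf))
```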
## Timeout Configuration
Streaming requests have a separate, longer timeout than non-streaming requests:
```yaml
gateway:
  resilience:
    timeout:
      chat-timeout-ms: 30000        # non-streaming: 30 seconds
      streaming-timeout-ms: 120000  # streaming: 120 seconds
```
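Client-side timeouts should be at least as generous as the gateway's. A sketch with the OpenAI SDK, assuming the 120-second streaming timeout configured above:

```python
from openai import OpenAI

# Align the client's read timeout with the gateway's streaming timeout
# (120 s above) so long generations are not cut off on the client side.
client = OpenAI(
    api_key="any-key",
    base_url="http://localhost:8080/v1",
    timeout=120.0,
)
```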
## Caching Behavior

Streaming requests bypass the response cache entirely: no cache lookup, no cache storage, no `X-Cache` header. This is by design; streaming responses are typically unique, and the cache cannot store partial streams.
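You can observe this from the response headers. A sketch using `requests`; it assumes the response cache is enabled for non-streaming requests:

```python
import requests

body = {
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Hello!"}],
}

# Non-streaming: goes through the cache, so an X-Cache header is present
# (its exact value depends on your gateway's cache configuration).
r = requests.post("http://localhost:8080/v1/chat/completions", json=body)
print(r.headers.get("X-Cache"))

# Streaming: bypasses the cache entirely, so the header is absent.
r = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={**body, "stream": True},
    stream=True,
)
print(r.headers.get("X-Cache"))  # None
```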
## Code Examples

### curl
```bash
curl -s -N -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4-5",
    "messages": [{"role": "user", "content": "Write a haiku about programming."}],
    "stream": true
  }'
```
### Python (OpenAI SDK)
```python
from openai import OpenAI

client = OpenAI(
    api_key="any-key",
    base_url="http://localhost:8080/v1"
)

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a haiku about programming."}],
    stream=True
)

for chunk in stream:
    # Guard against chunks with an empty choices list or a null delta
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```
### Node.js (OpenAI SDK)
```javascript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "any-key",
  baseURL: "http://localhost:8080/v1",
});

const stream = await client.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Write a haiku about programming." }],
  stream: true,
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) process.stdout.write(content);
}
console.log();
```
### Java (RestClient)
```java
RestClient client = RestClient.create();

ResponseEntity<String> response = client.post()
    .uri("http://localhost:8080/v1/chat/completions")
    .contentType(MediaType.APPLICATION_JSON)
    .body("""
        {
          "model": "gpt-4o",
          "messages": [{"role": "user", "content": "Hello!"}],
          "stream": true
        }
        """)
    .retrieve()
    .toEntity(String.class);
```
Note that this call buffers the entire response into a single `String`. For incremental SSE consumption in Java, use Spring's reactive `WebClient` (which can expose the event stream as a `Flux`) or a dedicated SSE client library.
## Provider-Specific Notes
All six providers support streaming. The gateway normalizes each provider's SSE format to the OpenAI chunk format:
| Provider | Native SSE Format | Gateway Output |
|---|---|---|
| OpenAI | `data: {"choices":[{"delta":...}]}` | Passthrough |
| Anthropic | Event-based: `message_start`, `content_block_delta`, `message_stop` | Normalized to OpenAI chunks |
| Gemini | `data: {"candidates":[{"content":...}]}` | Normalized to OpenAI chunks |
| Bedrock | Binary event stream | Normalized to OpenAI chunks |
| Ollama | OpenAI-compatible chunks | Passthrough |
| Mock | Simulated token-by-token with configurable delay | Normalized to OpenAI chunks |