
SSE Streaming

Dvara supports Server-Sent Events (SSE) streaming for all providers. Streaming delivers tokens incrementally as they are generated, reducing perceived latency.

Enabling Streaming

Set "stream": true in your chat completion request:

curl -s -N -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [
      {"role": "user", "content": "Count to five slowly."}
    ],
    "stream": true
  }'

SSE Protocol

The response uses Content-Type: text/event-stream. Each event is a JSON chunk on its own line, prefixed with data:

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","model":"gpt-4o","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","model":"gpt-4o","choices":[{"index":0,"delta":{"content":"One"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","model":"gpt-4o","choices":[{"index":0,"delta":{"content":", two"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","model":"gpt-4o","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]
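The event lines above can be parsed with nothing but the standard library. A minimal sketch (parse_sse_lines is a hypothetical helper, not part of Dvara): skip non-data lines, stop at the [DONE] sentinel, and decode everything else as JSON.

```python
import json

def parse_sse_lines(lines):
    """Parse raw SSE lines into chunk dicts, stopping at the [DONE] sentinel."""
    for line in lines:
        if not line.startswith("data: "):
            continue  # blank separator lines between events carry no payload
        payload = line[len("data: "):]
        if payload == "[DONE]":
            return  # end-of-stream sentinel, not JSON
        yield json.loads(payload)

# A shortened version of the stream shown above:
raw = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":"One"},"finish_reason":null}]}',
    "",
    "data: [DONE]",
]
chunks = list(parse_sse_lines(raw))
```

Note that [DONE] must be special-cased before JSON decoding, since it is the only payload that is not a JSON object.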

Chunk Format

Each chunk contains:

  • id — Stable completion ID across all chunks
  • object — Always "chat.completion.chunk"
  • model — The model used
  • choices[].delta — Incremental content (content for text, role for the first chunk)
  • choices[].finish_reason — null during streaming, "stop" on the final content chunk

The stream terminates with data: [DONE].
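Reassembling the full message from these chunks follows directly from the field descriptions above: take the role from the first delta and concatenate the content fragments. A sketch (accumulate is an illustrative helper, not a Dvara API):

```python
def accumulate(chunks):
    """Reassemble the complete assistant message from delta chunks."""
    role, parts = None, []
    for chunk in chunks:
        delta = chunk["choices"][0]["delta"]
        role = delta.get("role", role)  # role arrives only in the first chunk
        if delta.get("content"):
            parts.append(delta["content"])  # final chunk has an empty delta
    return {"role": role, "content": "".join(parts)}

# The chunks from the protocol example, as decoded dicts:
chunks = [
    {"choices": [{"index": 0, "delta": {"role": "assistant"}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {"content": "One"}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {"content": ", two"}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]},
]
message = accumulate(chunks)
```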

Streaming with Structured Outputs

Streaming works with response_format. For providers that use tool-use rewrite (Anthropic, Bedrock), the gateway translates input_json_delta chunks into content delta chunks — streaming works end-to-end and your application sees the same format regardless of provider.
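Because individual deltas are generally not valid JSON on their own, a structured-output consumer should concatenate all content fragments and parse once at the end. A sketch with illustrative chunk contents (the split points are assumptions, not a guarantee of how any provider tokenizes):

```python
import json

def parse_streamed_json(chunks):
    """Concatenate content deltas, then parse the complete JSON object."""
    text = "".join(
        c["choices"][0]["delta"].get("content", "") or "" for c in chunks
    )
    return json.loads(text)

# Hypothetical deltas as a structured-output stream might split them:
chunks = [
    {"choices": [{"delta": {"role": "assistant"}}]},
    {"choices": [{"delta": {"content": '{"count": '}}]},
    {"choices": [{"delta": {"content": "5}"}}]},
]
result = parse_streamed_json(chunks)
```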

Timeout Configuration

Streaming requests have a separate, longer timeout than non-streaming requests:

gateway:
  resilience:
    timeout:
      chat-timeout-ms: 30000        # non-streaming: 30 seconds
      streaming-timeout-ms: 120000  # streaming: 120 seconds

Caching Behavior

Streaming requests bypass the response cache entirely — no cache lookup, no cache storage, no X-Cache header. This is by design, as streaming responses are typically unique and the cache cannot store partial streams.

Code Examples

curl

curl -s -N -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4-5",
    "messages": [{"role": "user", "content": "Write a haiku about programming."}],
    "stream": true
  }'

Python (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    api_key="any-key",
    base_url="http://localhost:8080/v1"
)

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a haiku about programming."}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()

Node.js (OpenAI SDK)

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "any-key",
  baseURL: "http://localhost:8080/v1",
});

const stream = await client.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Write a haiku about programming." }],
  stream: true,
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) process.stdout.write(content);
}
console.log();

Java (RestClient)

RestClient client = RestClient.create();

ResponseEntity<String> response = client.post()
    .uri("http://localhost:8080/v1/chat/completions")
    .contentType(MediaType.APPLICATION_JSON)
    .body("""
        {
          "model": "gpt-4o",
          "messages": [{"role": "user", "content": "Hello!"}],
          "stream": true
        }
        """)
    .retrieve()
    .toEntity(String.class);

For full SSE consumption in Java, use Spring's SseEmitter client or a dedicated SSE library to process the event stream incrementally.

Provider-Specific Notes

All six providers support streaming. The gateway normalizes each provider's SSE format to the OpenAI chunk format:

| Provider  | Native SSE Format                                              | Gateway Output             |
|-----------|----------------------------------------------------------------|----------------------------|
| OpenAI    | data: {"choices":[{"delta":...}]}                              | Passthrough                |
| Anthropic | Event-based: message_start, content_block_delta, message_stop  | Normalized to OpenAI chunks |
| Gemini    | data: {"candidates":[{"content":...}]}                         | Normalized to OpenAI chunks |
| Bedrock   | Binary event stream                                            | Normalized to OpenAI chunks |
| Ollama    | OpenAI-compatible chunks                                       | Passthrough                |
| Mock      | Simulated token-by-token with configurable delay               | Normalized to OpenAI chunks |