Beyond the Basics: Streaming, Structured Outputs, Observability, and Local Models with DVARA
In the previous post, you got DVARA running and sent requests to multiple providers through a single OpenAI SDK client. This post covers the next layer of the platform's capabilities: LLM streaming, structured JSON outputs, Prometheus observability, and local development with Ollama.
All examples below assume you have DVARA running with at least one provider key, and the OpenAI Python SDK installed (pip install openai).
Streaming
Streaming is supported across all providers, and the OpenAI SDK's streaming interface works unchanged.
Create a file called stream_test.py:
from openai import OpenAI

client = OpenAI(
    api_key="any-key",
    base_url="http://localhost:8080/v1"
)

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a haiku about gateways"}],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
print()  # newline at the end
Run it:
python stream_test.py
You'll see tokens appear one at a time as the model generates them.
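If you also want the complete reply as a single string, a small helper along these lines may be useful. This is only a sketch: the name `collect_text` is hypothetical, and it assumes nothing beyond the chunk shape used above (`choices[0].delta.content`). It also skips chunks whose `choices` list is empty, which some providers send (for example, a usage-only chunk). The demo uses stand-in objects so it runs without a gateway.

```python
from types import SimpleNamespace


def collect_text(chunks, echo=False):
    """Concatenate delta.content from OpenAI-style streaming chunks.

    Hypothetical helper: `chunks` is any iterable whose items expose
    `choices[0].delta.content`, such as the `stream` object above.
    """
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # some chunks carry no choices at all
            continue
        text = chunk.choices[0].delta.content
        if text:  # role-only and final chunks have no content
            parts.append(text)
            if echo:
                print(text, end="", flush=True)
    if echo:
        print()
    return "".join(parts)


def _fake_chunk(text):
    """Stand-in chunk for the offline demo (not a real SDK object)."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo = [
    _fake_chunk(None),
    _fake_chunk("Hello"),
    _fake_chunk(", world"),
    SimpleNamespace(choices=[]),
]
full_text = collect_text(demo)
print(full_text)
```

With a real stream, `collect_text(stream, echo=True)` would print incrementally and return the full reply once the stream ends.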
Behind the scenes, DVARA translates between each provider's streaming format and emits standard OpenAI-format SSE chunks. If you switch model to claude-sonnet-4-5, the streaming response looks identical to your application — DVARA handles the Anthropic-to-OpenAI translation.
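To make that translation concrete, here is a rough sketch of how a single Anthropic text-delta event could be mapped to an OpenAI-style chunk and serialized as an SSE frame. It is an illustration only, not DVARA's implementation: `to_openai_chunk`, `to_sse_line`, and the `chunk_id` value are hypothetical, and only text deltas are handled.

```python
import json
import time


def to_openai_chunk(event, model, chunk_id="chatcmpl-demo"):
    """Map one Anthropic streaming event to an OpenAI-style chunk dict.

    Returns None for event types this sketch does not translate.
    """
    if event.get("type") != "content_block_delta":
        return None
    delta = event.get("delta", {})
    if delta.get("type") != "text_delta":
        return None
    return {
        "id": chunk_id,  # hypothetical id; a real gateway would generate one
        "object": "chat.completion.chunk",
        "created": int(time.time()),
        "model": model,
        "choices": [
            {"index": 0, "delta": {"content": delta["text"]}, "finish_reason": None}
        ],
    }


def to_sse_line(chunk):
    """Serialize a chunk as a server-sent-events data frame."""
    return "data: " + json.dumps(chunk) + "\n\n"


anthropic_event = {
    "type": "content_block_delta",
    "index": 0,
    "delta": {"type": "text_delta", "text": "Gates"},
}
chunk = to_openai_chunk(anthropic_event, model="claude-sonnet-4-5")
print(to_sse_line(chunk), end="")
```

A complete translator would also need to cover message start and stop events, stop reasons, and tool-call deltas, which this sketch leaves out.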
Structured Outputs
If you're using OpenAI's structured output feature (response_format), it works through the gateway with every provider that supports it.
Create a file called structured_test.py:
from openai import OpenAI
import json

client = OpenAI(
    api_key="any-key",
    base_url="http://localhost:8080/v1"
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "List 3 programming languages"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "languages",
            "schema": {
                "type": "object",
                "properties": {
                    "languages": {
                        "type": "array",
                        "items": {"type": "string"}
                    }
                },
                "required": ["languages"]
            }
        }
    }
)

result = json.loads(response.choices[0].message.content)
print(result)
Run it:
python structured_test.py
Output (the exact languages may vary between runs):
{'languages': ['Python', 'JavaScript', 'Rust']}
DVARA translates response_format into each provider's native format — tool-use rewrite for Anthropic, generationConfig for Gemini, toolConfig for Bedrock — and normalizes the response back to the OpenAI format. Your code doesn't need provider-specific handling. See the structured outputs docs for the full provider compatibility matrix.
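As an illustration of the tool-use rewrite, the sketch below shows one way an OpenAI `response_format` could be turned into Anthropic tool parameters, and how the resulting `tool_use` block could be folded back into an OpenAI-style JSON string. This is not DVARA's code: both function names, the tool description, and the sample response block are assumptions made for the example.

```python
import json


def response_format_to_anthropic_tool(response_format):
    """Turn an OpenAI json_schema response_format into Anthropic tool params."""
    if response_format.get("type") != "json_schema":
        raise ValueError("only json_schema response formats are handled here")
    spec = response_format["json_schema"]
    tool = {
        "name": spec["name"],
        "description": "Return the answer as structured JSON.",  # assumed wording
        "input_schema": spec["schema"],
    }
    # Forcing the tool asks the model to reply with arguments for this tool.
    return {"tools": [tool], "tool_choice": {"type": "tool", "name": spec["name"]}}


def tool_use_to_openai_content(content_blocks):
    """Return the tool call's input as a JSON string, the shape an
    OpenAI-format message.content would carry."""
    for block in content_blocks:
        if block.get("type") == "tool_use":
            return json.dumps(block["input"])
    raise ValueError("no tool_use block in response")


response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "languages",
        "schema": {
            "type": "object",
            "properties": {"languages": {"type": "array", "items": {"type": "string"}}},
            "required": ["languages"],
        },
    },
}
params = response_format_to_anthropic_tool(response_format)

# Made-up Anthropic response content for the demo.
anthropic_content = [
    {
        "type": "tool_use",
        "id": "toolu_demo",
        "name": "languages",
        "input": {"languages": ["Python", "JavaScript", "Rust"]},
    }
]
content = tool_use_to_openai_content(anthropic_content)
print(content)
```

Because the normalized content is a plain JSON string, the `json.loads(response.choices[0].message.content)` call in structured_test.py stays the same regardless of provider.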
Observability
Every request that passes through the gateway is metered. Prometheus metrics are available out of the box:
curl -s http://localhost:8080/actuator/prometheus | grep gateway_requests_total
You'll see counters broken down by model, provider, and HTTP status:
gateway_requests_total{model="gpt-4o",provider="openai",status="200"} 3.0
gateway_requests_total{model="claude-sonnet-4-5",provider="anthropic",status="200"} 1.0
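For a quick check without a full monitoring stack, the exposition text can be aggregated with a few lines of Python. The sketch below sums the counter by one label, using the two sample lines above as input. The function name `requests_by_label` is hypothetical, and the parser is deliberately simplified: it does not handle escaped quotes or braces inside label values.

```python
import re
from collections import defaultdict

SAMPLE = '''# TYPE gateway_requests_total counter
gateway_requests_total{model="gpt-4o",provider="openai",status="200"} 3.0
gateway_requests_total{model="claude-sonnet-4-5",provider="anthropic",status="200"} 1.0
'''

# metric_name{label="value",...} number
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def requests_by_label(text, label, metric="gateway_requests_total"):
    """Sum a labeled counter from Prometheus text output, grouped by one label."""
    totals = defaultdict(float)
    for line in text.splitlines():
        match = LINE_RE.match(line)  # comment lines starting with '#' do not match
        if not match or match.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels")))
        totals[labels.get(label, "unknown")] += float(match.group("value"))
    return dict(totals)


by_provider = requests_by_label(SAMPLE, "provider")
print(by_provider)
```

To run it against a live gateway, pass in the body returned by the `/actuator/prometheus` endpoint instead of `SAMPLE`.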
Key metrics at a glance
| Metric | What it tells you |
|---|---|
| gateway_requests_total | Request count by model, provider, status |
| gateway_latency_seconds | Latency histograms (P50/P95/P99) |
| gateway_tokens_total | Token consumption by model and direction (input/output) |
| gateway_provider_errors_total | Error count by provider and error code |
| gateway_fallbacks_total | Failover events between providers |
These are standard Prometheus metrics — scrape them into Grafana, Datadog, or any Prometheus-compatible monitoring stack. See the observability docs for the full list of 30+ metrics.
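Percentiles such as P95 are not stored directly in a Prometheus histogram; they are estimated from cumulative bucket counts. The sketch below shows the usual linear-interpolation estimate, similar in spirit to PromQL's `histogram_quantile`. It assumes `gateway_latency_seconds` is exposed as a classic bucketed histogram, and the bucket boundaries and counts in the demo are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound and ending with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # lowest bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the last finite bound.
                return prev_bound
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Made-up data: 100 requests, cumulative counts per latency bound in seconds.
demo_buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
for q in (0.5, 0.95, 0.99):
    print(f"P{round(q * 100)}: {histogram_quantile(q, demo_buckets):.3f}s")
```

The estimate is only as precise as the bucket layout, so a P99 that lands in a wide bucket should be read as approximate.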
Local Models with Ollama
For development and testing, you can run a fully local setup with Ollama — no API keys, no external network calls, no cost.
Start Ollama
# Install Ollama (macOS)
brew install ollama
# Pull and run a model
ollama run llama3.2
Start DVARA with Ollama enabled
docker run -d --name dvara \
  -p 8080:8080 \
  -e OLLAMA_ENABLED=true \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  dvarahq/dvara-gateway:latest
:::tip Docker networking
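host.docker.internal lets the Docker container reach Ollama running on your host machine. Docker Desktop (macOS and Windows) resolves it automatically. On Linux, you may need to add --add-host=host.docker.internal:host-gateway to the docker run command, or use --network host and set OLLAMA_BASE_URL to http://localhost:11434 (with host networking, the -p flag is ignored).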
:::
Use it from your SDK
Create a file called ollama_test.py:
from openai import OpenAI

client = OpenAI(
    api_key="any-key",
    base_url="http://localhost:8080/v1"
)

response = client.chat.completions.create(
    model="ollama/llama3.2",
    messages=[{"role": "user", "content": "What is the capital of India?"}]
)

print(response.choices[0].message.content)
Run it:
python ollama_test.py
Same SDK, same endpoint, same code — just a different model prefix. You can develop against Ollama locally and switch to gpt-4o or claude-sonnet-4-5 in production by changing one string.
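One way to keep that string out of the code is to read it from the environment. The sketch below is a minimal example of the idea; the variable name `APP_LLM_MODEL` and the function `pick_model` are hypothetical choices for your own application, not settings that DVARA reads.

```python
import os

# Local default; assumes the Ollama setup described above.
DEFAULT_MODEL = "ollama/llama3.2"


def pick_model(env=None):
    """Return the model name from APP_LLM_MODEL, falling back to the local default."""
    env = os.environ if env is None else env
    return env.get("APP_LLM_MODEL", DEFAULT_MODEL)


# Passing explicit mappings here keeps the demo independent of the real environment.
print(pick_model({}))
print(pick_model({"APP_LLM_MODEL": "gpt-4o"}))
```

In ollama_test.py you could then pass `model=pick_model()` and set `APP_LLM_MODEL=gpt-4o` or `APP_LLM_MODEL=claude-sonnet-4-5` in the production environment.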
What's Next
New to DVARA? Start with the Getting Started guide first.
- Add the DVARA Flightdeck — start the DVARA Flightdeck on port 8090 for a dashboard with real-time metrics, route management, and audit logs.
- Configure routing — set up round-robin, weighted, latency-aware, or cost-aware routing across providers.
- Enable PII scanning — detect and redact sensitive data in prompts before they leave your network.
- Set budget caps — prevent bill shock with per-tenant token budgets and automatic model downgrade.
- Deploy with Docker Compose or Helm — production-ready deployment with PostgreSQL persistence.