
Beyond the Basics: Streaming, Structured Outputs, Observability, and Local Models with DVARA


In the previous post, you got DVARA running and sent requests to multiple providers through a single OpenAI SDK client. This post covers the next layer of the platform's capabilities: LLM streaming, structured JSON outputs, Prometheus observability, and local development with Ollama.

All examples below assume you have DVARA running with at least one provider key, and the OpenAI Python SDK installed (pip install openai).


Streaming

Streaming is fully supported across all providers, and the SDK's streaming interface works unchanged.

Create a file called stream_test.py:

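from openai import OpenAI

# Point the standard OpenAI client at the local DVARA gateway
client = OpenAI(
    api_key="any-key",
    base_url="http://localhost:8080/v1",
)

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a haiku about gateways"}],
    stream=True,
)

for chunk in stream:
    # Some chunks carry no choices or no text (for example, the final one)
    if chunk.choices and chunk.choices[0].delta.content:
        # flush=True so each token is shown as soon as it arrives
        print(chunk.choices[0].delta.content, end="", flush=True)
print()  # newline at the end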

Run it:

python stream_test.py

You'll see tokens appear one at a time as the model generates them.

Behind the scenes, DVARA translates between each provider's streaming format and emits standard OpenAI-format SSE chunks. If you switch model to claude-sonnet-4-5, the streaming response looks identical to your application — DVARA handles the Anthropic-to-OpenAI translation.
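To make "standard OpenAI-format SSE chunks" concrete, here is a minimal sketch of what a client does with that wire format. The sample payload is illustrative (not captured from DVARA) and the helper name iter_sse_content is hypothetical; the SDK does this parsing for you, so you would only need something like it when reading the raw HTTP stream yourself.

```python
import json


def iter_sse_content(lines):
    """Yield text deltas from OpenAI-format server-sent event lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # sentinel that ends an OpenAI-style stream
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Illustrative stream: role chunk, two content chunks, finish chunk, sentinel.
sample_lines = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant"}}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":"Open "}}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":"gates"}}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}',
    "",
    "data: [DONE]",
]

print("".join(iter_sse_content(sample_lines)))  # prints: Open gates
```

Because every provider's stream is normalized to this shape, the same parsing logic applies whichever model name you send.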


Structured Outputs

If you're using OpenAI's structured output feature (response_format), it works through the gateway with every provider that supports it.

Create a file called structured_test.py:

from openai import OpenAI
import json

client = OpenAI(
    api_key="any-key",
    base_url="http://localhost:8080/v1",
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "List 3 programming languages"}],
    # Ask for JSON that conforms to this schema
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "languages",
            "schema": {
                "type": "object",
                "properties": {
                    "languages": {
                        "type": "array",
                        "items": {"type": "string"},
                    }
                },
                "required": ["languages"],
            },
        },
    },
)

# The message content is a JSON string; parse it, then print it back as JSON
result = json.loads(response.choices[0].message.content)
print(json.dumps(result))

Run it:

python structured_test.py

Output:

{"languages": ["Python", "JavaScript", "Rust"]}

DVARA translates response_format into each provider's native format — tool-use rewrite for Anthropic, generationConfig for Gemini, toolConfig for Bedrock — and normalizes the response back to the OpenAI format. Your code doesn't need provider-specific handling. See the structured outputs docs for the full provider compatibility matrix.
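As a rough illustration of what a "tool-use rewrite" can look like, the sketch below maps an OpenAI-style response_format onto the tools and tool_choice fields of Anthropic's Messages API, then turns the resulting tool_use block back into an OpenAI-style JSON string. This is not DVARA's actual implementation; the function names and the tool description are assumptions made for the example.

```python
import json


def to_anthropic_tool_request(response_format):
    """Map an OpenAI json_schema response_format to Anthropic tool fields."""
    spec = response_format["json_schema"]
    return {
        "tools": [
            {
                "name": spec["name"],
                # Hypothetical description; any short instruction would do.
                "description": "Return the answer as structured JSON.",
                "input_schema": spec["schema"],
            }
        ],
        # Force the model to call this one tool so the reply is always JSON.
        "tool_choice": {"type": "tool", "name": spec["name"]},
    }


def tool_use_to_openai_content(content_blocks, tool_name):
    """Turn an Anthropic tool_use block into OpenAI-style message content."""
    for block in content_blocks:
        if block.get("type") == "tool_use" and block.get("name") == tool_name:
            return json.dumps(block["input"])
    raise ValueError(f"no tool_use block named {tool_name!r}")


response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "languages",
        "schema": {
            "type": "object",
            "properties": {
                "languages": {"type": "array", "items": {"type": "string"}}
            },
            "required": ["languages"],
        },
    },
}

anthropic_fields = to_anthropic_tool_request(response_format)
print(anthropic_fields["tool_choice"])

# Illustrative Anthropic-style reply content, not real model output.
reply_blocks = [
    {
        "type": "tool_use",
        "id": "toolu_example",
        "name": "languages",
        "input": {"languages": ["Python", "JavaScript", "Rust"]},
    }
]
print(tool_use_to_openai_content(reply_blocks, "languages"))
```

The point of the round trip is that your application only ever sees the second print: a JSON string in message.content, exactly as with an OpenAI model.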


Observability

Every request that passes through the gateway is metered. Prometheus metrics are available out of the box:

curl -s http://localhost:8080/actuator/prometheus | grep gateway_requests_total

You'll see counters broken down by model, provider, and HTTP status:

gateway_requests_total{model="gpt-4o",provider="openai",status="200"} 3.0
gateway_requests_total{model="claude-sonnet-4-5",provider="anthropic",status="200"} 1.0
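If you want to inspect these counters from a script rather than with grep, a small parser is enough. The sketch below totals gateway_requests_total by provider from the Prometheus text format, using the two sample lines above as input. It is deliberately simplified (it does not handle escaped quotes inside label values), and the function name is hypothetical.

```python
import re
from collections import defaultdict

SAMPLE = """\
gateway_requests_total{model="gpt-4o",provider="openai",status="200"} 3.0
gateway_requests_total{model="claude-sonnet-4-5",provider="anthropic",status="200"} 1.0
"""

# metric_name{labels} value [optional timestamp]
LINE_RE = re.compile(
    r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)(?:\s+\d+)?$"
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def requests_by_provider(text, metric="gateway_requests_total"):
    """Sum a counter's samples per value of the provider label."""
    totals = defaultdict(float)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = LINE_RE.match(line)
        if not match or match.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels")))
        totals[labels.get("provider", "unknown")] += float(match.group("value"))
    return dict(totals)


print(requests_by_provider(SAMPLE))
```

To run it against a live gateway, pass in the body returned by the /actuator/prometheus endpoint instead of SAMPLE.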

Key metrics at a glance

| Metric | What it tells you |
| --- | --- |
| gateway_requests_total | Request count by model, provider, status |
| gateway_latency_seconds | Latency histograms (P50/P95/P99) |
| gateway_tokens_total | Token consumption by model and direction (input/output) |
| gateway_provider_errors_total | Error count by provider and error code |
| gateway_fallbacks_total | Failover events between providers |

These are standard Prometheus metrics — scrape them into Grafana, Datadog, or any Prometheus-compatible monitoring stack. See the observability docs for the full list of 30+ metrics.
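For the latency histogram, percentiles such as P95 are not stored directly; they are estimated from cumulative buckets, which is what PromQL's histogram_quantile function does. The sketch below reproduces that estimate in Python under the assumption that gateway_latency_seconds is exposed as a standard Prometheus histogram. The bucket boundaries and counts are made up for the example.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with the +Inf bucket,
    mirroring a Prometheus histogram's le-labelled series.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper) or count == prev_count:
                # Cannot interpolate inside +Inf or an empty bucket.
                return prev_bound
            # Linear interpolation within the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return prev_bound


# Hypothetical cumulative counts: 50 requests took <= 0.5s, 80 took <= 1s, ...
latency_buckets = [
    (0.5, 50),
    (1.0, 80),
    (2.0, 95),
    (5.0, 100),
    (math.inf, 100),
]

for q in (0.5, 0.95, 0.99):
    print(f"P{int(q * 100)}: {histogram_quantile(q, latency_buckets):.2f}s")
```

The estimate is only as fine-grained as the bucket boundaries, which is worth keeping in mind when reading P99 values off a dashboard.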


Local Models with Ollama

For development and testing, you can run a fully local setup with Ollama — no API keys, no external network calls, no cost.

Start Ollama

# Install Ollama (macOS)
brew install ollama

# Pull and run a model
ollama run llama3.2

Start DVARA with Ollama enabled

docker run -d --name dvara \
  -p 8080:8080 \
  -e OLLAMA_ENABLED=true \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  dvarahq/dvara-gateway:latest

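Tip (Docker networking): host.docker.internal lets the Docker container reach Ollama running on your host machine. On Linux, that hostname is not defined by default: either add --add-host=host.docker.internal:host-gateway to the docker run command, or use --network host and set OLLAMA_BASE_URL to http://localhost:11434 (with host networking, the -p port mapping is ignored).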
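If the gateway cannot reach Ollama, it helps to first confirm that Ollama itself is answering on the host. The sketch below queries Ollama's /api/tags endpoint, which lists locally pulled models, and returns None when nothing is listening. The helper names are hypothetical, and the check runs on the host, so it says nothing about whether the container can resolve host.docker.internal.

```python
import json
import urllib.request


def model_names(tags_payload):
    """Extract model names from an Ollama /api/tags response body."""
    return [model["name"] for model in tags_payload.get("models", [])]


def list_ollama_models(base_url="http://localhost:11434", timeout=2.0):
    """Return the locally available model names, or None if unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return model_names(json.load(resp))
    except (OSError, ValueError):
        # OSError covers refused connections, timeouts, and HTTP errors;
        # ValueError covers a body that is not valid JSON.
        return None


# Offline demonstration with an illustrative payload shape.
print(model_names({"models": [{"name": "llama3.2:latest"}]}))
```

Calling list_ollama_models() on the host should return a list that includes the model you pulled; None suggests Ollama is not running or is listening on a different port.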

Use it from your SDK

Create a file called ollama_test.py:

from openai import OpenAI

client = OpenAI(
    api_key="any-key",
    base_url="http://localhost:8080/v1",
)

response = client.chat.completions.create(
    model="ollama/llama3.2",  # the ollama/ prefix routes to the local model
    messages=[{"role": "user", "content": "What is the capital of India?"}],
)
print(response.choices[0].message.content)

Run it:

python ollama_test.py

Same SDK, same endpoint, same code — just a different model prefix. You can develop against Ollama locally and switch to gpt-4o or claude-sonnet-4-5 in production by changing one string.
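One way to make that one-string switch without editing code is to read the model name from the environment. This is a minimal sketch: DVARA_MODEL is a hypothetical variable name chosen for the example, not a setting that DVARA itself reads.

```python
import os

# Fall back to the local model when nothing is configured.
DEFAULT_MODEL = "ollama/llama3.2"


def pick_model(env=None):
    """Return the model name from DVARA_MODEL, or the local default."""
    env = os.environ if env is None else env
    return env.get("DVARA_MODEL") or DEFAULT_MODEL


print(pick_model({}))                          # prints: ollama/llama3.2
print(pick_model({"DVARA_MODEL": "gpt-4o"}))   # prints: gpt-4o
```

Pass pick_model() as the model argument to client.chat.completions.create, then set the variable in your production environment and leave it unset on your laptop.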


What's Next

New to DVARA? Start with the Getting Started guide first.

  • Add the DVARA Flightdeck — start the DVARA Flightdeck on port 8090 for a dashboard with real-time metrics, route management, and audit logs.
  • Configure routing — set up round-robin, weighted, latency-aware, or cost-aware routing across providers.
  • Enable PII scanning — detect and redact sensitive data in prompts before they leave your network.
  • Set budget caps — prevent bill shock with per-tenant token budgets and automatic model downgrade.
  • Deploy with Docker Compose or Helm — production-ready deployment with PostgreSQL persistence.