Routing & Load Balancing
DVARA decides which provider handles each LLM request by walking a configurable list of routes, matching the request's model field against each route's pattern, and applying the route's strategy to pick a provider from its pool. By default, with no routes configured, the gateway falls back to matching on the model-name prefix — gpt-* goes to OpenAI, claude-* to Anthropic, gemini-* to Google, and so on. Route configuration is persisted in PostgreSQL and propagates across every gateway node live, so updates take effect on the next request without a restart.
This page covers every routing strategy DVARA supports — from the default prefix match to enterprise features like latency-aware EWMA routing, cost-aware cheapest-provider selection, canary A/B splits, shadow traffic for safe provider evaluation, region-aware routing for data residency, and an intelligent complexity classifier that picks the right model tier per request. See Providers for the provider-specific quirks and the capability matrix routing uses to filter incompatible providers out.
Default routing: model prefix
This is the fallback the gateway uses when no configured route matches. If you've created a route for a model pattern, that route runs first and the table below isn't consulted. Treat this as the bundled set of default behaviours; anything outside the table needs a route. To add a route, see Route configuration.
| Model in request | Matched provider |
|---|---|
| gpt-4o, gpt-4.1, o1-preview, o3-mini, o4-mini, chatgpt-4o-latest | OpenAI |
| text-embedding-3-small, text-embedding-3-large | OpenAI (embeddings) |
| claude-sonnet-4-5, claude-3-5-haiku-20241022 | Anthropic |
| gemini-2.0-flash, gemini-1.5-pro | Google Gemini |
| bedrock/anthropic.claude-3-sonnet-20240229-v1:0 | AWS Bedrock |
| azure/gpt-4o | Azure OpenAI |
| mistral-large-latest, mistral-small-latest | Mistral |
| command-r-plus, command-r | Cohere |
| groq/llama-3.3-70b-versatile | Groq |
| ollama/llama3.2 | Ollama |
| qwen2.5-72b-instruct, qwen-max | Alibaba Qwen (DashScope) |
| deepseek-chat, deepseek-reasoner | DeepSeek |
| moonshot-v1-128k, moonshot-v1-32k | Moonshot (Kimi) |
| glm-4, glm-4-air, glm-4v-plus | Zhipu ChatGLM |
| grok-2-1212, grok-2-vision-1212, grok-3-latest | xAI Grok |
| mock/test-model | Mock |
OpenAI claims three prefix families on the chat path — classic gpt-*, the o-series reasoning models (o1-*, o3-*, o4-*), and the chatgpt-* alias models. Embeddings route under text-embedding-* separately. All four route to the same provider and the same OPENAI_API_KEY.
Models from OpenAI-compatible providers that aren't first-class — ERNIE (Baidu), Fireworks, Together AI, Perplexity, vLLM hosts, internal corporate gateways — aren't routed by default prefix matching. Configure a route or override an existing provider's base URL. See Additional providers.
Each provider registers at startup only if its credential trigger is set (OPENAI_API_KEY, ANTHROPIC_API_KEY, etc.; dvara.llm-gateway.providers.ollama.enabled=true for Ollama; BEDROCK_ENABLED=true for Bedrock — note Bedrock registers on the flag alone, and AWS credentials are checked per request by the SigV4 signer, not at startup). If no provider is registered for the matched prefix, the gateway returns NO_PROVIDER (HTTP 400) with an error message naming the env var to set. Per-tenant Flightdeck credentials override the startup credential per request but do not register a provider on their own — set a placeholder startup value if you plan to onboard tenants who bring their own keys. See Credentials & BYOK.
Example request:
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer $DVARA_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "claude-sonnet-4-5",
    "messages": [{"role": "user", "content": "Summarize this PR description in two bullets."}]
  }'
claude-sonnet-4-5 has no custom route, so the default prefix matcher runs. The claude prefix resolves to Anthropic and the gateway forwards the request upstream. You can confirm which provider handled the call by grepping the access log for the X-Trace-ID that came back on the response — the provider field of the structured JSON line shows anthropic.
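The fallback lookup can be pictured as a plain prefix scan. The following is a minimal sketch, not the gateway's actual code: the `DEFAULT_PREFIXES` mapping and the `resolve_by_prefix` function are hypothetical names, and the mapping covers only a subset of the table above.

```python
# Illustrative subset of the default prefix table; the names here are
# assumptions made for this sketch, not DVARA internals.
DEFAULT_PREFIXES = {
    "gpt-": "openai",
    "o1-": "openai",
    "o3-": "openai",
    "o4-": "openai",
    "chatgpt-": "openai",
    "text-embedding-": "openai",
    "claude-": "anthropic",
    "gemini-": "gemini",
    "mistral-": "mistral",
}


def resolve_by_prefix(model: str, registered: set[str]) -> str:
    """Return the registered provider whose prefix matches the model name."""
    for prefix, provider in DEFAULT_PREFIXES.items():
        # A provider only counts if its credential trigger registered it at startup.
        if model.startswith(prefix) and provider in registered:
            return provider
    raise LookupError(f"NO_PROVIDER: no registered provider for model {model!r}")
```

With only OpenAI registered, a `claude-*` request raises the error, which mirrors the NO_PROVIDER behaviour described above.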
Routing strategies
DVARA supports eight strategies. Pick one per route based on how you want traffic distributed across your provider pool:
| Strategy | Description |
|---|---|
| model-prefix | Default. First provider whose prefix matches the model. |
| round-robin | Distributes requests evenly across listed providers in rotation. |
| weighted | Distributes requests by configured relative weights. |
| latency-aware | Routes to the lowest-latency healthy provider using a running EWMA of response times. |
| cost-aware | Routes to the cheapest provider, optionally constrained by a latency SLA. |
| canary | Splits traffic between a baseline and candidate provider for A/B testing. |
| geo-aware | Filters providers by region and prefers the gateway node's own region, subject to tenant data-residency constraints. |
| intelligent | Classifies request complexity and picks the cheapest model tier that can handle it. |
Shadow traffic (Shadow Routing) is a property of any strategy above, not a strategy itself — it sends a copy of traffic to a secondary provider without affecting the client response, for side-by-side comparison.
Priority admission control (SLA-Aware Priority Routing) is independent of strategy selection — it governs whether a request is admitted at all, based on per-tenant priority tier under load.
Route configuration
The durable source of truth for routes is the PostgreSQL routes table. Create and update routes through the DVARA Flightdeck at /routes or the Automation API at POST /v1/admin/routes, both of which persist and propagate across every node (see Hot reload).
The dvara.llm-gateway.routes block in application.yml is a convenience for static startup configuration — it's loaded into the in-memory engine at boot and is fine for development or a simple single-node deployment, but it is replaced wholesale by whatever is in the routes table as soon as any admin write triggers a config-change notification. For anything beyond a local dev box, manage routes through the API.
Example YAML (static)
dvara:
  llm-gateway:
    routes:
      # Listed first: routes match top-to-bottom, so the exact gpt-4o pin
      # must precede the broader gpt* pattern or it would never be reached.
      - id: pin-gpt4o
        model-pattern: "gpt-4o"
        strategy: model-prefix
        pinned-model-version: "gpt-4o-2024-08-06"
      - id: load-balance-gpt
        model-pattern: "gpt*"
        strategy: round-robin
        providers:
          - provider: openai
          - provider: bedrock
      - id: weighted-claude
        model-pattern: "claude*"
        strategy: weighted
        providers:
          - provider: anthropic
            weight: 70
          - provider: bedrock
            weight: 30
Route fields
| Field | Type | Required | Description |
|---|---|---|---|
| id | string | yes | Unique route identifier |
| model-pattern | string | yes | Prefix pattern. A trailing * matches any remainder (gpt* matches gpt-4o, gpt-4o-mini, gpt-5). Anything else is treated as a literal exact match; there is no support for ?, **, or regex. |
| strategy | string | no | One of model-prefix (default), round-robin, weighted, latency-aware, cost-aware, canary, geo-aware, intelligent. |
| model-tiers | map | no | Complexity → model mapping for intelligent routing (e.g., {SIMPLE: "gpt-4o-mini", MODERATE: "gpt-4o", COMPLEX: "claude-3-opus"}) |
| cost-tolerance-pct | int | no | Cost tolerance percentage for cost-aware routing (0–100, default: 0). Not read by latency-aware. |
| latency-sla-ms | long | no | Latency SLA in ms for cost-aware routing (providers exceeding the SLA are excluded; default: 0 = no SLA) |
| canary-config | object | no | Canary A/B test configuration (see Canary Routing) |
| shadow-config | object | no | Shadow traffic configuration for parallel testing (see Shadow Routing) |
| providers | list | yes* | Provider pool for this route (*required for every strategy except model-prefix) |
| providers[].provider | string | yes | Provider name: openai, anthropic, gemini, bedrock, azure-openai, mistral, cohere, groq, ollama, qwen, deepseek, moonshot, chatglm, grok, or mock |
| providers[].weight | int | no | Weight for weighted routing (default: 1) |
| providers[].region | string | no | Region affinity for this provider entry (used by geo-aware routing) |
| pinned-model-version | string | no | Overrides the model name before sending to the provider |
How a request reaches a provider
Every POST /v1/chat/completions walks five stages before any byte hits an upstream provider:
1. Match a route → which route's pattern fits the model?
2. Pin the model → does the route rewrite the model name?
3. Filter by capability → can each provider honor response_format?
4. Run the strategy → which provider in the filtered pool wins?
5. Failover on error → if the pick fails, retry on a healthy peer
Stages 1, 4, and 5 always run. Stage 2 runs only when the matched route configures pinning. Stage 3 runs only when the request carries a response_format.
The rest of this section walks each stage with a small example, then ends with one request that exercises all five.
1. Match a route
The gateway walks the configured routes top-to-bottom and stops at the first one whose model-pattern matches the request's model. If nothing matches — or the request has no model field, like an embedding call — the built-in model-prefix fallback takes over and picks a provider from the registered prefix table.
Example. Two routes are configured: gpt-canary (pattern gpt-4o-mini, canary strategy) and gpt-default (pattern gpt*, round-robin). A request for gpt-4o-mini matches gpt-canary first and stops there. A request for gpt-4o walks past gpt-canary (pattern doesn't match — gpt-4o-mini is a literal exact match, not a prefix) and lands on gpt-default. A request for mistral-large-latest matches neither route, so the prefix fallback runs — the mistral prefix resolves to the Mistral provider.
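The matching rules above (trailing `*` is a prefix match, anything else is a literal, first match wins) can be sketched in a few lines. This is an illustration under those stated rules only; the `Route` class and function names are hypothetical and not part of DVARA's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Route:
    id: str
    model_pattern: str
    strategy: str = "model-prefix"


def pattern_matches(pattern: str, model: str) -> bool:
    # A trailing * matches any remainder; everything else is an exact match.
    if pattern.endswith("*"):
        return model.startswith(pattern[:-1])
    return model == pattern


def match_route(routes: list[Route], model: str) -> Optional[Route]:
    # Walk top-to-bottom and stop at the first match.
    # None means the built-in model-prefix fallback should take over.
    for route in routes:
        if pattern_matches(route.model_pattern, model):
            return route
    return None


routes = [
    Route("gpt-canary", "gpt-4o-mini", "canary"),
    Route("gpt-default", "gpt*", "round-robin"),
]
```

Because the walk stops at the first hit, narrow literal patterns need to sit above broad wildcard patterns in the route list.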
2. Pin the model
If the matched route has pinned-model-version, the gateway rewrites the request's model field to the pinned value before any strategy runs. The original alias the client sent never reaches the provider; billing, audit events, and the access log all show the pinned version.
Example. Route pin-gpt4o matches gpt-4o and pins to gpt-4o-2024-08-06. A request for gpt-4o has its model rewritten in place — OpenAI receives gpt-4o-2024-08-06, the cost record stamps gpt-4o-2024-08-06, the audit event references gpt-4o-2024-08-06. Useful when "the latest gpt-4o" silently changes underneath you and you need deterministic behavior across deployments.
3. Filter by capability
If the request carries a response_format, the route's provider pool is narrowed to only providers that natively support that format. Stages 4 and 5 then run against this filtered pool — a provider that can't honor json_schema will never be picked, and never be picked as a failover either.
| response_format | Required capability |
|---|---|
| json_schema | Structured outputs |
| json_object | JSON mode |
| text / absent | No filtering |
If no provider on the route can honor the requested format, the gateway short-circuits with HTTP 400 NO_CAPABLE_PROVIDER — no upstream call is made. See Capability-aware filtering for the full rules.
Example. A route's pool is [openai, ollama]. A request with response_format: {"type": "json_schema", ...} filters Ollama out — Ollama doesn't support structured outputs — leaving only OpenAI for the strategy to pick from. A request with response_format: {"type": "text"}, or no response_format at all, leaves both providers in the pool.
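A rough sketch of the capability filter follows. The capability sets passed in are placeholders for illustration (the real values live in the capability matrix on the Providers page), and `filter_by_capability` is a hypothetical name.

```python
def filter_by_capability(pool, response_format, capabilities):
    """Narrow a provider pool to those that support the requested format.

    pool:            ordered list of provider names on the route
    response_format: the request's response_format dict, or None
    capabilities:    provider name -> set of supported format types (assumed shape)
    """
    fmt = (response_format or {}).get("type", "text")
    if fmt == "text":
        return list(pool)  # text or absent: no filtering
    capable = [p for p in pool if fmt in capabilities.get(p, set())]
    if not capable:
        # Corresponds to the HTTP 400 short-circuit; no upstream call is made.
        raise ValueError("NO_CAPABLE_PROVIDER")
    return capable
```

The same function would be re-applied to the failover candidates, which is why an incapable provider never shows up as a fallback.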
4. Run the strategy
The route's strategy runs against the filtered pool and picks exactly one provider:
- round-robin rotates through the pool one request at a time
- weighted rolls a uniform random number against the configured weights
- latency-aware picks the lowest current EWMA latency
- cost-aware picks the cheapest, optionally bounded by a latency SLA
- canary splits traffic on a configured percentage between baseline and candidate
- geo-aware filters by data-residency policy and prefers same-region providers
- intelligent classifies the request's complexity and picks the cheapest tier that fits
- model-prefix is the default: first provider whose registered prefix matches
Each strategy's payload semantics are documented in its own section below.
Example. Route claude-weighted has [anthropic weight 70, bedrock weight 30]. The strategy rolls a number in [0, 100): if it lands under 70, Anthropic wins; otherwise Bedrock. The strategy outputs a single provider — say Anthropic — and the gateway opens its connection there.
5. Failover on error
If the chosen provider returns PROVIDER_ERROR or its circuit breaker is already open, the resilience layer retries against a healthy alternative on the same route. Stage 3 (capability) and any region constraints re-apply on every retry, so a fallback that can't honor json_schema — or that sits in a disallowed region — is never picked as a fallback. See Resilience for retry budgets and circuit-breaker tuning.
Example. The strategy picked OpenAI in stage 4. OpenAI's circuit breaker has just opened due to a string of 5xxs. The gateway hands off to the failover layer, which scans the same route's pool for healthy providers, re-applies the capability filter, and lands on Bedrock. The client sees the Bedrock response — or FAILOVER_CAPABILITY_MISMATCH (HTTP 503) if no capable peer exists.
End-to-end: one request through all five stages
A route configured like this:
dvara:
  llm-gateway:
    routes:
      - id: claude-split
        model-pattern: "claude*"
        strategy: weighted
        pinned-model-version: "claude-3-5-haiku-20241022"
        providers:
          - provider: anthropic
            weight: 70
          - provider: bedrock
            weight: 30
And a client sending:
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer $DVARA_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "claude-3-5-haiku",
    "response_format": {"type": "json_object"},
    "messages": [{"role": "user", "content": "Reply with a JSON object containing keys ok and ts."}]
  }'
What the gateway does:
- Match. claude-3-5-haiku matches claude* → route claude-split selected.
- Pin. pinned-model-version is set → model rewritten to claude-3-5-haiku-20241022.
- Filter. response_format: json_object requires JSON mode. Both Anthropic and Bedrock support it, so the pool stays [anthropic, bedrock].
- Strategy. The weighted roll lands at 42, which falls in the [0, 70) band → Anthropic wins.
- Failover. Anthropic returns the JSON object cleanly, so no failover is needed. (If Anthropic had failed, Bedrock would have been tried with the same pinned model and the same JSON-mode capability still applied.)
The client gets Anthropic's JSON object back. The audit event records provider=anthropic, model=claude-3-5-haiku-20241022 (the pinned version, not the alias). The Prometheus counter gateway_requests_total{provider="anthropic", model="claude-3-5-haiku-20241022", ...} increments by one.
Round-robin example
Distribute GPT requests evenly between OpenAI and Bedrock:
dvara:
  llm-gateway:
    routes:
      - id: gpt-round-robin
        model-pattern: "gpt*"
        strategy: round-robin
        providers:
          - provider: openai
          - provider: bedrock
Requests cycle through the providers in order: OpenAI, Bedrock, OpenAI, Bedrock, and so on. The counter is per-route, not global — two different routes with round-robin strategy rotate independently.
Example request:
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer $DVARA_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
gpt-4o-mini matches gpt* on the gpt-round-robin route, so the strategy picks OpenAI on call 1, Bedrock on call 2, OpenAI on call 3, and so on. The counter advances by one per request regardless of outcome. If a named provider is not in the live pool — typically because its credentials weren't configured at startup, or because a response-format capability filter has removed it — the strategy scans forward through the list until it finds one that is present. Round-robin does not check provider health. An unhealthy provider (open circuit breaker) is still eligible to be picked; if the upstream call then errors out, the gateway's failover layer takes over and retries on a healthy alternative.
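The rotation and the scan-forward behaviour described above might look roughly like this. It is a sketch under the stated rules only (per-route counter, one advance per request, no health check); the `RoundRobin` class is a hypothetical name.

```python
class RoundRobin:
    """Per-route rotation; each route would hold its own instance."""

    def __init__(self, providers):
        self.providers = list(providers)
        self.counter = 0

    def pick(self, live):
        """Pick the next provider, skipping names absent from the live pool."""
        start = self.counter % len(self.providers)
        self.counter += 1  # advances once per request, regardless of outcome
        for offset in range(len(self.providers)):
            candidate = self.providers[(start + offset) % len(self.providers)]
            if candidate in live:  # presence only; health is not consulted
                return candidate
        raise LookupError("NO_PROVIDER")
```

Note that `live` models the registered-and-capable pool only. An unhealthy provider stays in it and is handed to the failover layer if the call fails.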
Weighted routing example
Send 70% of Claude traffic to Anthropic and 30% to Bedrock:
dvara:
  llm-gateway:
    routes:
      - id: claude-weighted
        model-pattern: "claude*"
        strategy: weighted
        providers:
          - provider: anthropic
            weight: 70
          - provider: bedrock
            weight: 30
Weights are relative, not percentages — weight: 7 and weight: 3 behave identically to 70/30. A provider with weight: 0 is included in the pool but never selected, as long as at least one other provider on the route has a positive weight — you can use this to temporarily take a provider out of rotation without editing the route structure. A weighted route where every provider has weight: 0 is rejected at construction time (the total weight must be positive). Like round-robin, weighted routing does not check provider health; unhealthy picks fall through to the failover layer.
Example request:
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer $DVARA_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "claude-sonnet-4-5",
    "messages": [{"role": "user", "content": "Write a short release note."}]
  }'
claude-sonnet-4-5 matches claude*, so the claude-weighted route rolls a per-request uniform random number against the cumulative weight ranges: [0, 70) → Anthropic, [70, 100) → Bedrock. Over thousands of requests the distribution converges to 70/30, but short bursts can drift — weighted is probabilistic, not a strict ratio. If you flip Anthropic's weight to 0 with a live route update, the next request skips Anthropic entirely and the 30% slice for Bedrock becomes 100%.
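The cumulative-range selection can be sketched as follows. This is an illustrative implementation of the behaviour described above, not the gateway's code; `pick_weighted` and its `unit_roll` parameter (exposed so the roll can be fixed in tests) are assumptions.

```python
import random


def pick_weighted(providers, unit_roll=None):
    """providers: list of (name, weight); unit_roll: float in [0, 1)."""
    total = sum(weight for _, weight in providers)
    if total <= 0:
        # Mirrors the rule that an all-zero route is rejected.
        raise ValueError("total weight must be positive")
    if unit_roll is None:
        unit_roll = random.random()
    roll = unit_roll * total  # scale to [0, total), so 7/3 behaves like 70/30
    upper = 0
    chosen = None
    for name, weight in providers:
        if weight <= 0:
            continue  # weight 0 stays in the pool but is never selected
        upper += weight
        chosen = name
        if roll < upper:
            break
    return chosen
```

Scaling the roll by the total weight is what makes weights relative rather than percentages.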
Model version pinning
Lock a model alias to a specific version so your production traffic never drifts when a provider updates the default:
dvara:
  llm-gateway:
    routes:
      - id: pin-gpt4o
        model-pattern: "gpt-4o"
        strategy: model-prefix
        pinned-model-version: "gpt-4o-2024-08-06"
When a request for gpt-4o matches this route, the gateway rewrites the model to gpt-4o-2024-08-06 before forwarding to OpenAI. This is how you guarantee consistent behavior across deployments — you pin the exact version in the route, clients keep calling the friendly alias.
Capability-aware filtering
When response_format is present on a request, the gateway automatically filters the provider pool before applying the routing strategy. Only providers that support the requested format are considered:
| response_format type | Required capability |
|---|---|
| json_schema | Structured outputs support |
| json_object | JSON mode support |
| text / absent | No filtering applied |
This prevents routing to providers that cannot handle the request. If no provider on the route supports the requested format, the gateway returns HTTP 400 with error code NO_CAPABLE_PROVIDER. If the chosen provider fails and no capable fallback exists, the gateway returns HTTP 503 with FAILOVER_CAPABILITY_MISMATCH and an X-Gateway-Failover-Blocked: capability_mismatch header so you can distinguish capability failures from generic provider errors.
See Structured Outputs for per-provider details on which format mappings are native versus emulated, and the Capabilities Matrix on the Providers page.
Latency-aware routing
The latency-aware strategy tracks live EWMA (exponentially weighted moving average) latency per provider-and-model pair and automatically routes each request to the fastest healthy provider. This is ideal for multi-provider setups where provider performance varies over time — regional congestion, provider-side load, or a new model deployment are all picked up within a handful of samples.
How it works
- Latency recording: every successful provider call records its wall-clock response time. The running average is computed as ewma = alpha * sample + (1 - alpha) * previous, with alpha = 0.2 by default.
- Provider selection: for each request, healthy providers are ranked by EWMA. The lowest-latency provider is selected.
- Cold-start exploration: providers without enough samples (below min-samples, default 5) are bypassed for EWMA selection. A 10% exploration slice still routes to them so the EWMA has data to converge on. When every provider on the route is cold, the strategy falls back to round-robin until samples accumulate.
- Time decay: stale entries (no samples for more than 60 seconds by default) get a decay penalty so the router does not keep preferring a provider it hasn't measured recently.
Configuration
dvara:
  llm-gateway:
    routes:
      - id: latency-gpt
        model-pattern: "gpt*"
        strategy: latency-aware
        providers:
          - provider: openai
          - provider: bedrock
cost-tolerance-pct is read by the cost-aware strategy, not latency-aware — if you want latency and cost trade-off together, use cost-aware with a latency-sla-ms cap (Cost-aware routing).
Example request:
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer $DVARA_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Generate a JSON response."}]
  }'
The strategy pulls the current EWMA latency for gpt-4o on both providers. If OpenAI is averaging 180 ms and Bedrock is averaging 250 ms, OpenAI wins. When a fresh provider has fewer than dvara.llm-gateway.routing.latency.min-samples (5 by default), it's considered "cold" — it gets a 10% exploration slice so the EWMA has data to converge on before the strategy commits to a winner. If every provider on the route is cold, the strategy falls through to round-robin until samples accumulate.
EWMA tuning
These global properties apply to every latency-aware route:
| Property | Default | Description |
|---|---|---|
| dvara.llm-gateway.routing.latency.alpha | 0.2 | EWMA smoothing factor (0.0–1.0). Higher = more weight on recent samples |
| dvara.llm-gateway.routing.latency.decay-threshold-ms | 60000 | Staleness threshold in ms. Entries older than this get a decay penalty |
| dvara.llm-gateway.routing.latency.decay-multiplier | 0.5 | Stale EWMA is divided by this value (lower = harsher penalty) |
| dvara.llm-gateway.routing.latency.min-samples | 5 | Minimum samples before EWMA is used for routing decisions |
| dvara.llm-gateway.routing.latency.snapshot-interval | 100 | Persist a latency snapshot every N samples |
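To see how the tuning properties interact, here is a small sketch of the EWMA update and the staleness penalty using the defaults from the table. The function names are hypothetical, and seeding the average with the first sample is an assumption; the page does not say how the first value is initialised.

```python
ALPHA = 0.2                  # routing.latency.alpha
DECAY_THRESHOLD_MS = 60_000  # routing.latency.decay-threshold-ms
DECAY_MULTIPLIER = 0.5       # routing.latency.decay-multiplier


def update_ewma(previous, sample_ms, alpha=ALPHA):
    """ewma = alpha * sample + (1 - alpha) * previous."""
    if previous is None:
        return sample_ms  # assumption: the first sample seeds the average
    return alpha * sample_ms + (1 - alpha) * previous


def effective_latency(ewma_ms, age_ms):
    """Apply the decay penalty: a stale EWMA is divided by the multiplier."""
    if age_ms > DECAY_THRESHOLD_MS:
        return ewma_ms / DECAY_MULTIPLIER  # 0.5 doubles the apparent latency
    return ewma_ms
```

With the defaults, a provider whose EWMA is 180 ms but has not been sampled for over a minute is ranked as if it were at 360 ms, so a freshly measured 250 ms provider would win.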
Monitoring latency data
Query current EWMA latency for every tracked provider + model pair:
curl http://localhost:8090/v1/admin/latency
Response:
{
"object": "list",
"data": [
{
"provider": "openai",
"model": "gpt-4o",
"ewmaLatencyMs": 120.5,
"rawLatencyMs": 115.0,
"sampleCount": 42,
"lastUpdated": "2026-01-01T00:00:00Z"
}
]
}
Cost-aware routing
The cost-aware strategy selects the cheapest provider capable of handling the request, optionally constrained by a latency SLA. Per-provider pricing comes from the PostgreSQL-backed pricing table managed via the DVARA Flightdeck UI + Automation API — see Cost Management for how prices are added and rotated.
How it works
1. Filter to healthy + configured providers. Providers the resilience layer has marked unavailable are dropped. If the candidate pool is empty after this filter, the strategy throws NO_PROVIDER.
2. SLA filter. When latency-sla-ms > 0, providers whose current EWMA latency exceeds the SLA are excluded. Providers without enough latency data pass through so they get a chance to prove themselves. If the SLA is so tight that every provider exceeds it, the filter is bypassed and the strategy continues with the full healthy pool: the gateway logs a debug line and picks the cheapest from the original candidates rather than returning NO_PROVIDER. Set latency-sla-ms to a value you've actually measured under load; too-tight settings degrade silently to "cheapest", not to a hard error.
3. Cost estimation. For every remaining provider, the CostEstimator looks up the pricing-table entry and computes estimated per-request cost from the request's estimated input + output token counts.
4. Split by cost-data availability. Providers with a pricing entry go to a warm pool; providers without a pricing entry go to a cold pool. (Note: this is about whether the provider has pricing data, not about how many samples it has. There is no min-samples knob on cost-aware; the min-samples property applies to the latency-aware strategy only.)
5. All cold → round-robin fallback. If no provider has cost data, the strategy falls back to round-robin so traffic still flows while pricing data accumulates.
6. 5% cold-start exploration. When warm + cold pools both exist, 5% of requests are routed to a cold provider so its cost data can accumulate. The other 95% go to the warm pool. Cold-start providers are not skipped; they're sampled at a small fixed rate.
7. Cheapest with cost-tolerance tiebreak. From the warm pool, the strategy picks the cheapest provider. If cost-tolerance-pct > 0, providers within the tolerance band (cost ≤ minCost × (1 + costTolerancePct/100)) are re-ranked by latency. So cost-tolerance-pct: 15 means "I'll accept anything within 15% of the cheapest; among those, pick the fastest."
dvara:
  llm-gateway:
    routes:
      - id: cost-optimized
        model-pattern: "gpt*"
        strategy: cost-aware
        latency-sla-ms: 500        # exclude providers with >500ms EWMA latency
        cost-tolerance-pct: 15     # within 15% of cheapest, prefer lowest latency
        providers:
          - provider: openai
          - provider: bedrock
Example request:
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer $DVARA_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Please analyze this paragraph for tone."}]
  }'
Walk the decision: the strategy estimates cost for each provider from the pricing table (the cost-estimation step). Suppose the request comes in at roughly 400 tokens in and 200 tokens out, with pricing $2.50/M + $10/M for OpenAI gpt-4o and $3.00/M + $15/M for Bedrock. The estimated cost is $0.003 for OpenAI and $0.0042 for Bedrock, so OpenAI wins. Now suppose OpenAI's EWMA crosses the 500 ms SLA: OpenAI is dropped by the SLA filter, Bedrock passes, and Bedrock wins even at the higher price. With cost-tolerance-pct: 15, if both providers' estimates sit within 15% of the cheapest, the tiebreak flips to lowest latency, so you can tell the strategy "roughly the same price, pick the faster one". In this example Bedrock is 40% more expensive than OpenAI, so it falls outside the 15% band and the tiebreak does not apply.
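The arithmetic in this walkthrough, plus the tolerance-band tiebreak, can be checked with a short sketch. Both function names and the candidate dict shape are hypothetical; only the cost formula and the band rule come from the text above.

```python
def estimate_cost(tokens_in, tokens_out, price_in_per_m, price_out_per_m):
    """Estimated request cost in dollars from per-million-token prices."""
    return tokens_in * price_in_per_m / 1e6 + tokens_out * price_out_per_m / 1e6


def pick_cost_aware(candidates, tolerance_pct=0):
    """candidates: list of dicts with name, cost, latency_ms (warm pool only)."""
    cheapest = min(candidates, key=lambda c: c["cost"])
    if tolerance_pct <= 0:
        return cheapest["name"]
    # Band rule: cost <= minCost * (1 + tolerance/100), then fastest wins.
    ceiling = cheapest["cost"] * (1 + tolerance_pct / 100)
    band = [c for c in candidates if c["cost"] <= ceiling]
    return min(band, key=lambda c: c["latency_ms"])["name"]


openai_cost = estimate_cost(400, 200, 2.50, 10.0)   # about 0.003
bedrock_cost = estimate_cost(400, 200, 3.00, 15.0)  # about 0.0042
pool = [
    {"name": "openai", "cost": openai_cost, "latency_ms": 250},
    {"name": "bedrock", "cost": bedrock_cost, "latency_ms": 180},
]
```

With these prices, a 15% tolerance still picks OpenAI even though Bedrock is faster here; the tolerance would have to reach 40% before Bedrock enters the band. The latency figures in `pool` are made up for the illustration.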
Canary routing
The canary strategy splits traffic between a baseline and a candidate provider for A/B testing. Use it to validate a new provider on a fraction of real traffic before promoting it to the primary path.
canary-config blocks under dvara.llm-gateway.routes in application.yml are not bound at startup. Canary routes must be created through the Automation API (POST /v1/admin/routes) or from the DVARA Flightdeck at /routes — both persist the canary block to the routes table and hot-reload across every node. The YAML dvara.llm-gateway.routes key is fine for model-prefix, round-robin, weighted, latency-aware, cost-aware, geo-aware, and intelligent routes, but canary (and shadow, below) need the API.
Creating a canary route
The route ID is assigned by the server as a UUID — don't send id in the create payload, it'll be silently ignored. Capture the assigned UUID from the response's Location header for the monitoring + update endpoints below.
curl -i -X POST http://localhost:8090/v1/admin/routes \
  -H 'Content-Type: application/json' \
  -d '{
    "model_pattern": "gpt*",
    "strategy": "canary",
    "canary_config": {
      "baseline_provider": "openai",
      "candidate_provider": "bedrock",
      "split_pct": 20,
      "tenant_scope": "tenant-a",
      "test_name": "bedrock-eval"
    },
    "providers": [
      {"provider": "openai"},
      {"provider": "bedrock"}
    ]
  }'
# Response:
# HTTP/1.1 201 Created
# Location: /v1/admin/routes/a1b2c3d4-e5f6-7890-abcd-ef1234567890
# ...
#
# Capture the UUID — the {id} placeholder in the monitoring commands
# below is that UUID.
| Field | Type | Required | Description |
|---|---|---|---|
| baseline_provider | string | yes | Provider name for baseline traffic. Must also appear in the route's providers list. See the note below. |
| candidate_provider | string | yes | Provider name for candidate traffic. Must also appear in the route's providers list. |
| split_pct | int | yes | Percentage of traffic sent to the candidate (0–100) |
| tenant_scope | string | no | Restrict canary to a specific tenant (null = all tenants). When set, requests from any other tenant always route to the baseline. |
| test_name | string | no | Human-readable test name shown in the comparison report |
Keep providers aligned with both variants. If the variant the strategy picks isn't in the route's providers list, the gateway silently falls back to the other variant instead of erroring out. So if baseline_provider is "openai" but providers only includes bedrock, every "openai" decision will land on Bedrock without an error: even a 0% split routes 100% of traffic to the candidate. Include both baseline_provider and candidate_provider explicitly in the providers list.
Example request against the canary route:
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer $TENANT_A_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Summarize this paragraph."}],
    "metadata": {"tenant_id": "tenant-a"}
  }'
gpt-4o matches gpt* on the canary route created above. Because tenant_scope is set to tenant-a, the strategy reads metadata.tenant_id from the request body and compares it to tenant_scope. Only requests that set tenant_id to tenant-a enter the split; every other tenant, and any request that omits the field, lands on the baseline (OpenAI) unconditionally. Inside tenant-a, the strategy rolls a uniform random number in [0, 100): values under 20 go to Bedrock, the rest go to OpenAI. Every decision records a canary metric (latency, cost, error) so the comparison report can show baseline vs. candidate side by side.
tenant_scope checks metadata.tenant_id on the incoming request body, not the tenant resolved from the API key by the auth layer. If your client doesn't include metadata.tenant_id in the request body, the canary split never kicks in and every request lands on the baseline. When you set tenant_scope, you must pass tenant_id explicitly from the calling code.
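Put together, the canary decision described in this section behaves roughly like the sketch below. It is an approximation built from the prose only: `pick_canary` is a hypothetical name, and the `roll` parameter exists purely so the split can be fixed in tests.

```python
import random


def pick_canary(config, providers, request_metadata, roll=None):
    """Return the provider a canary route would choose for one request."""
    baseline = config["baseline_provider"]
    candidate = config["candidate_provider"]
    scope = config.get("tenant_scope")

    if scope is not None and request_metadata.get("tenant_id") != scope:
        # Out-of-scope tenants, and requests without tenant_id, go to baseline.
        choice, other = baseline, candidate
    else:
        if roll is None:
            roll = random.random() * 100  # uniform in [0, 100)
        if roll < config["split_pct"]:
            choice, other = candidate, baseline
        else:
            choice, other = baseline, candidate

    # Silent fallback: a variant missing from the providers list is swapped
    # for the other one rather than raising an error.
    return choice if choice in providers else other
```

The last line is the pitfall called out above: with only `bedrock` in the providers list, every baseline decision quietly becomes a Bedrock call.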
Monitoring
# View canary comparison report (latency, success rate, token usage per variant)
curl http://localhost:8090/v1/admin/routes/{id}/canary/report
# Reset canary metrics (starts a fresh comparison window)
curl -X POST http://localhost:8090/v1/admin/routes/{id}/canary/reset
# Update split percentage live — no restart, next request uses the new split
curl -X PUT http://localhost:8090/v1/admin/routes/{id}/canary \
-H "Content-Type: application/json" \
-d '{"split_pct": 50}'
Geo-aware routing
The geo-aware strategy respects data residency policies by filtering providers to only those in allowed regions before selecting one. Use it when regulatory requirements forbid sending data outside a specific region (GDPR, India DPDP, sector-specific rules).
dvara:
  llm-gateway:
    routes:
      - id: geo-gpt
        model-pattern: "gpt*"
        strategy: geo-aware
        providers:
          - provider: openai
            region: us-east-1
          - provider: bedrock
            region: eu-west-1
Region inputs
Each request is evaluated against two region signals:
- Current region: the calling gateway node's own region, set at gateway startup via DVARA_REGION_ID (or dvara.region.id in application.yml). This is the region the gateway is running in, not the tenant's home region.
- Tenant region: passed via request.metadata["tenant_region"] on the incoming request. Despite the field name, the value must be the tenant ID as it appears in the tenants table. The gateway passes it to tenantRepository.findById(tenant_region) and reads data-residency.allowed-regions off the resolved tenant's metadata. Setting tenant_region to a literal region string ("eu-west-1") silently fails: the tenant lookup misses, the resolved allow-list is empty, and the residency policy degrades to "everything allowed." Pass the tenant's ID, not its region. (The field name is a historical artifact; canary's tenant_id field is the same shape.)
Allowed-region configuration
Data residency is driven by a per-tenant metadata key, not route configuration. Set data-residency.allowed-regions on the tenant as a comma-separated list of region strings — only these regions are permitted upstream for the tenant's traffic:
curl -X PUT http://localhost:8090/v1/admin/tenants/{id} \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "Acme EU",
    "status": "active",
    "metadata": {
      "data-residency.allowed-regions": "eu-west-1,eu-central-1"
    }
  }'
The residency filter is silently bypassed in any of the following three cases. For enforcement to actually take effect, all three gaps must be closed:
| Bypass case | Effect |
|---|---|
| Tenant has no data-residency.allowed-regions set (or an unknown tenant ID resolves to an empty set) | All providers pass; the strategy selects purely on same-region preference |
| One or more of the route's provider entries lack region: | Those entries pass the residency filter unconditionally. Set region: on every entry in a geo-aware route |
| Request omits metadata.tenant_region | All providers pass. Clients must include the tenant ID on every request that should be residency-scoped |
Selection order
For each request, the strategy walks the provider pool in this order:
1. Health filter: drop providers the resilience layer has marked unavailable. If every provider is unhealthy, the strategy falls back to the full pool (including unhealthy ones) rather than failing fast; health is preferred, not strictly enforced. The residency filter in step 2 still runs, so a disallowed region is never silently chosen even when health is degraded.
2. Data-residency filter: per-provider check via DataResidencyPolicy.isAllowed(tenantRegion, providerRegion). Providers whose region is not in the tenant's allow-list are dropped. If this leaves the pool empty, the strategy throws DATA_RESIDENCY_VIOLATION (HTTP 403) and no upstream call is made. Subject to the three bypass cases above.
3. Same-region preference: if DVARA_REGION_ID is set, pick the first residency-compliant provider whose region matches the gateway node's own region (and whose region is currently healthy).
4. Cross-region fallback: no same-region provider? Pick the first residency-compliant provider whose region is currently healthy.
5. Last resort: still nothing? Return the first residency-compliant provider regardless of region health (logs a warning).
Step 2 runs before any provider call, so sensitive data never leaves an allowed region — provided the three bypass cases above are all closed.
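The selection order above can be sketched in Python roughly as follows. This is an illustrative model, not DVARA's implementation: `Provider`, `select_provider`, `is_allowed`, and the `healthy_regions` set are hypothetical names, the allow-list is passed in already resolved from the tenant's metadata, and treating a provider without a region as never "region-healthy" is an assumption made to keep the sketch small.

```python
from dataclasses import dataclass
from typing import Optional


class DataResidencyViolation(Exception):
    """Stands in for the DATA_RESIDENCY_VIOLATION (HTTP 403) error."""


@dataclass(frozen=True)
class Provider:
    name: str
    region: Optional[str] = None  # None models a route entry without `region:`
    healthy: bool = True


def is_allowed(allowed_regions, provider_region):
    # Bypass cases: an empty allow-list, or a provider entry with no region,
    # passes the residency filter unconditionally.
    if not allowed_regions or provider_region is None:
        return True
    return provider_region in allowed_regions


def select_provider(pool, allowed_regions, node_region, healthy_regions):
    # 1. Health filter; fall back to the full pool if nothing is healthy.
    candidates = [p for p in pool if p.healthy] or list(pool)
    # 2. Residency filter; an empty result is a hard failure.
    compliant = [p for p in candidates if is_allowed(allowed_regions, p.region)]
    if not compliant:
        raise DataResidencyViolation("no provider in an allowed region")
    # 3. Same-region preference (only when the node's region is configured).
    if node_region:
        for p in compliant:
            if p.region == node_region and p.region in healthy_regions:
                return p
    # 4. Cross-region fallback: first compliant provider in a healthy region.
    for p in compliant:
        if p.region in healthy_regions:
            return p
    # 5. Last resort: first compliant provider regardless of region health.
    return compliant[0]


pool = [Provider("openai", "us-east-1"), Provider("bedrock", "eu-west-1")]
chosen = select_provider(
    pool,
    allowed_regions={"eu-west-1", "eu-central-1"},
    node_region="eu-west-1",
    healthy_regions={"us-east-1", "eu-west-1"},
)
print(chosen.name)  # bedrock
```

Calling the same function with an empty allow-list shows the first bypass case: every provider passes the filter and the choice falls through to same-region preference.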
Example request:
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer $DVARA_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Classify this ticket."}],
    "metadata": {"tenant_region": "acme-eu"}
  }'
metadata.tenant_region is the tenant ID used to look up the data-residency.allowed-regions entry in the tenant's metadata. If tenant acme-eu has "eu-west-1,eu-central-1" set, the OpenAI entry (pinned to us-east-1) is dropped at the residency filter because us-east-1 is not in the allow-list, and Bedrock (in eu-west-1) is selected. If the gateway node itself is running in eu-west-1 (via DVARA_REGION_ID), the same-region preference locks in Bedrock even if a second residency-compliant provider were also available.
Like canary tenant_scope, tenant_region must be present in the request body's metadata — the geo-aware strategy does not pull it from the API-key-resolved tenant.
Intelligent routing
The intelligent strategy classifies request complexity and automatically picks the cheapest model tier that can handle it. Use it when you want a single endpoint that scales from cheap short-form answers to expensive complex reasoning without clients having to pick the right model themselves.
dvara:
  llm-gateway:
    routes:
      - id: auto-tier
        model-pattern: "*"
        strategy: intelligent
        model-tiers:
          SIMPLE: gpt-4o-mini      # factual Q&A, translation
          MODERATE: gpt-4o         # summarization, analysis
          COMPLEX: claude-3-opus   # creative writing, code generation
        providers:
          - provider: openai
          - provider: anthropic
Complexity scoring
The classifier scores each request on these factors:
| Factor | Points |
|---|---|
| 1 message | 0 |
| 2–5 messages | 1 |
| 6+ messages | 2 |
| Non-system message text ≤ 500 chars | 0 |
| Non-system message text 501–2000 chars | 1 |
| Non-system message text > 2000 chars | 2 |
| Tool use present in the conversation | 2 |
| System prompt ≤ 500 chars | 0 |
| System prompt 501–1500 chars | 1 |
| System prompt > 1500 chars | 2 |
| Complexity keywords (full list below) | 0.5 each, max 2 |
The character counts in the non-system message text rows deliberately cover user, assistant, and tool messages only (anything except role: "system"); the system prompt is scored on its own rows and is intentionally not added in. Boundary values fall in the lower band: exactly 500 chars scores 0, exactly 2000 scores 1, and exactly 1500 (system prompt) scores 1.
The complexity-keyword list, in full: analyze, compare, create, design, architecture, refactor, debug, optimize, implement, algorithm, multi-step, reasoning, evaluate, synthesize, critique. Case-insensitive whole-word match, 0.5 points each, capped at 2 total. Useful to know up-front because common code-assist prompts like "implement and evaluate this algorithm" tip the score by +1.5 from keywords alone, often the difference between SIMPLE and MODERATE.
Thresholds: 0–2 = SIMPLE, 3–5 = MODERATE, 6+ = COMPLEX.
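A minimal Python sketch of the scoring table, under several stated assumptions: tool use is detected by a `tool` role or a `tool_calls` field, keywords are counted once each across all message text, message content is a plain string, and fractional scores fall into the lower tier (below 3 is SIMPLE, below 6 is MODERATE). None of these details are confirmed by the table itself, and the function names are hypothetical.

```python
import re

KEYWORDS = (
    "analyze", "compare", "create", "design", "architecture", "refactor",
    "debug", "optimize", "implement", "algorithm", "multi-step", "reasoning",
    "evaluate", "synthesize", "critique",
)


def band(value, low, high):
    """0 up to `low` inclusive, 1 up to `high` inclusive, 2 above."""
    return 0 if value <= low else 1 if value <= high else 2


def score(messages):
    system_chars = sum(len(m.get("content") or "") for m in messages if m["role"] == "system")
    other_chars = sum(len(m.get("content") or "") for m in messages if m["role"] != "system")
    # Assumption: a tool-role message or a tool_calls field counts as tool use.
    tool_use = any(m["role"] == "tool" or m.get("tool_calls") for m in messages)
    text = " ".join(m.get("content") or "" for m in messages)
    # Assumption: each distinct keyword counts once, matched as a whole word.
    hits = sum(
        1 for kw in KEYWORDS
        if re.search(r"\b" + re.escape(kw) + r"\b", text, re.IGNORECASE)
    )
    return (
        band(len(messages), 1, 5)          # message count
        + band(other_chars, 500, 2000)     # non-system text length
        + (2 if tool_use else 0)           # tool use
        + band(system_chars, 500, 1500)    # system prompt length
        + min(0.5 * hits, 2.0)             # keywords, capped at 2
    )


def tier(points):
    # Assumption: fractional scores round down into the lower tier.
    if points < 3:
        return "SIMPLE"
    if points < 6:
        return "MODERATE"
    return "COMPLEX"


question = [{"role": "user", "content": "What is the capital of Japan?"}]
print(tier(score(question)))  # SIMPLE
```

Running `score` on a single user message containing "implement and evaluate this algorithm" yields 1.5, matching the keyword arithmetic described above.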
Fallback behavior
1. If the tier model's provider is unavailable, the router tries one lower-tier hop, not a full cascade. So COMPLEX falls back to MODERATE; MODERATE falls back to SIMPLE; SIMPLE has no further tier fallback. If the one-hop attempt also has no available provider, step 2 takes over.
2. Round-robin across the route's configured providers, regardless of tier. This is the safety net for "neither the target tier nor its immediate fallback can be served".
3. The resolved model overrides the request's model field before calling the provider, so your billing and audit trail show the actual model used, not the wildcard the client sent.
Worked example: a COMPLEX classification points at claude-3-opus. Anthropic is down. The router tries MODERATE → gpt-4o. OpenAI is also down. Rather than then try SIMPLE → gpt-4o-mini, the router goes straight to round-robin across the configured providers. If gpt-4o-mini happens to live on a third configured provider that is up, you get there by chance (round-robin walks the pool), not by tier cascade. The practical implication: if you want resilient COMPLEX traffic, make sure your MODERATE tier is on a different provider than COMPLEX — and ideally have at least one configured provider outside both tiers as a safety net.
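The one-hop rule in the worked example can be modelled with a short Python sketch. The names here (`resolve`, `model_provider`, `available`, and the third provider `azure`) are hypothetical, and two behaviours are assumptions rather than documented facts: that round-robin skips unavailable providers, and that the model served after a round-robin pick is left undetermined (returned as `None`).

```python
LOWER_TIER = {"COMPLEX": "MODERATE", "MODERATE": "SIMPLE"}  # SIMPLE has no lower tier


def resolve(tier, model_tiers, model_provider, available, pool, rr_index=0):
    """Return (provider, model) for a classified request."""
    # Step 1: the target tier, then exactly one lower-tier hop (no cascade).
    for candidate in (tier, LOWER_TIER.get(tier)):
        if candidate is None:
            continue
        model = model_tiers[candidate]
        provider = model_provider[model]
        if provider in available:
            return provider, model
    # Step 2: round-robin over the route's pool, regardless of tier.
    # Assumption: unavailable providers are skipped while walking the pool.
    for offset in range(len(pool)):
        provider = pool[(rr_index + offset) % len(pool)]
        if provider in available:
            return provider, None
    raise RuntimeError("no available provider on this route")


model_tiers = {"SIMPLE": "gpt-4o-mini", "MODERATE": "gpt-4o", "COMPLEX": "claude-3-opus"}
model_provider = {"claude-3-opus": "anthropic", "gpt-4o": "openai", "gpt-4o-mini": "azure"}
pool = ["openai", "anthropic", "azure"]

# Anthropic and OpenAI are both down: the SIMPLE tier is never tried as a tier,
# and the request reaches the third provider only through round-robin.
print(resolve("COMPLEX", model_tiers, model_provider, {"azure"}, pool))
# ('azure', None)
```

With only Anthropic down, the same call returns `('openai', 'gpt-4o')`, which is the single MODERATE hop.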
Example requests
A short factual question classifies as SIMPLE (score 0 — one message, under 500 chars, no tool use, no keywords) and routes to gpt-4o-mini:
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer $DVARA_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "What is the capital of Japan?"}]
  }'
A long multi-turn conversation with a detailed system prompt, tool use, and complexity keywords classifies as COMPLEX (score ≥ 6) and routes to claude-3-opus:
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer $DVARA_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "auto",
    "messages": [
      {"role": "system", "content": "You are a senior staff engineer. When asked to analyze, refactor, or design, walk through the problem systematically and cite the relevant sections of the code."},
      {"role": "user", "content": "Here is a 2000-line service. Analyze the race condition in the tenant-resolution path and propose a refactor with tests."},
      {"role": "assistant", "content": "..."},
      {"role": "tool", "content": "..."},
      {"role": "user", "content": "Now design a migration plan and implement the first two steps."}
    ]
  }'
Scoring breakdown for the second request, reading the abbreviated payload above as a stand-in for the full conversation (the service source pasted into the first user message, and a system prompt longer than the excerpt shown): 5 messages (+1), non-system text well over 2000 chars (+2; the long user message alone clears the threshold, and the system prompt isn't counted here because it is scored separately on its own rows), tool use present (+2), system prompt 501–1500 chars (+1), multiple complexity keywords (analyze, refactor, design, implement; +2, capped). Total 8, well above the COMPLEX threshold of 6.
The client's "model": "auto" is rewritten to the tier-resolved model (gpt-4o-mini or claude-3-opus) before the upstream call, so the provider bills for and the audit trail records the actual model, not the wildcard.
Shadow routing
Shadow routing sends a copy of production traffic to a shadow provider for comparison without affecting the primary response. The client always sees the primary response; the shadow is invoked asynchronously in the background and its output is compared for analysis.
Shadow is a property on top of any other strategy — create a round-robin route with shadow traffic via the Automation API:
Like canary-config, shadow-config blocks in dvara.llm-gateway.routes YAML are not bound at startup. Create shadow routes through POST /v1/admin/routes or the DVARA Flightdeck at /routes.
As with canary, the route ID is assigned by the server as a UUID — don't send id in the create payload, it'll be silently ignored. Capture the assigned UUID from the response's Location header for the monitoring endpoints below.
curl -i -X POST http://localhost:8090/v1/admin/routes \
  -H 'Content-Type: application/json' \
  -d '{
    "model_pattern": "gpt*",
    "strategy": "round-robin",
    "shadow_config": {
      "shadow_provider": "bedrock",
      "sample_pct": 10,
      "tenant_scope": "tenant-a",
      "test_name": "bedrock-shadow"
    },
    "providers": [
      {"provider": "openai"}
    ]
  }'
# Response:
# HTTP/1.1 201 Created
# Location: /v1/admin/routes/<uuid>
providers lists the primary candidates only; shadow_provider does not need to be (and typically isn't) in that list. But it must be a registered provider in the gateway — credentials set, provider activated at startup. If the shadow provider isn't registered, every shadow attempt logs a WARN (Shadow provider [<name>] not found, skipping shadow dispatch) and silently skips. Total requests in the shadow report stays at zero; check the gateway logs if your report is empty.
tenant_scope for shadow
Shadow's tenant_scope filters which requests get sampled, but its behavior differs from canary in one important case:
- If tenant_scope is set AND the request's metadata.tenant_id matches → shadow may be dispatched (subject to the sample_pct roll)
- If tenant_scope is set AND metadata.tenant_id is present but doesn't match → shadow is skipped
- If tenant_scope is set AND metadata.tenant_id is missing → shadow IS still dispatched. (Canary's tenant_scope, by contrast, short-circuits to the baseline when tenant_id is missing.)
If you want strict tenant-scoped sampling, ensure the client always passes metadata.tenant_id on every request. Otherwise scoped shadow effectively becomes global shadow for any client that omits the field.
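The difference between the two scoping rules is easiest to see side by side. The following Python sketch is illustrative only: `should_shadow` and `canary_applies` are hypothetical names, the injectable `roll` parameter exists purely to make the sampling deterministic in examples, and treating an unset canary `tenant_scope` as "applies to everyone" is an assumption.

```python
import random


def should_shadow(tenant_scope, metadata, sample_pct, roll=None):
    """Shadow rule: only a present-but-different tenant_id blocks dispatch."""
    tenant_id = (metadata or {}).get("tenant_id")
    if tenant_scope is not None and tenant_id is not None and tenant_id != tenant_scope:
        return False  # scoped, and the request belongs to another tenant
    # A missing tenant_id falls through to the sampling roll.
    if roll is None:
        roll = random.uniform(0, 100)
    return roll < sample_pct


def canary_applies(tenant_scope, metadata):
    """Canary rule: a missing tenant_id short-circuits to the baseline."""
    if tenant_scope is None:
        return True  # assumption: an unscoped canary applies to every request
    return (metadata or {}).get("tenant_id") == tenant_scope


# The same request without metadata.tenant_id is treated differently:
print(should_shadow("tenant-a", {}, sample_pct=10, roll=5))  # True
print(canary_applies("tenant-a", {}))                        # False
```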
Monitoring
# View shadow comparison report (substitute the UUID from the create response)
curl http://localhost:8090/v1/admin/routes/{id}/shadow/report
# Reset shadow metrics
curl -X POST http://localhost:8090/v1/admin/routes/{id}/shadow/reset
The report includes total sampled request count, primary-vs-shadow average latency, and a response match rate — note that "match" here is a coarse word-level Jaccard similarity > 0.5 on the first text block, not exact equality or semantic equivalence. Two paraphrases with mostly-overlapping vocabulary count as a match; an LLM that reworded heavily but kept the meaning won't. Treat the match rate as a rough drift signal, not a regression-test pass/fail.
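To get a feel for how coarse the match signal is, here is a small Python sketch of word-level Jaccard similarity with the 0.5 threshold. The tokenization (lowercasing and splitting on whitespace) and the handling of two empty responses are assumptions; the gateway's actual tokenizer is not described here, and the function names are hypothetical.

```python
def jaccard(a, b):
    """Word-level Jaccard similarity: |A ∩ B| / |A ∪ B| over word sets."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0  # assumption: two empty responses count as identical
    return len(words_a & words_b) / len(words_a | words_b)


def is_match(primary, shadow, threshold=0.5):
    # Strictly greater than the threshold, as in "similarity > 0.5".
    return jaccard(primary, shadow) > threshold


# Near-identical wording matches (5 shared words out of 6 distinct).
print(is_match("the cat sat on the mat", "the cat sat on a mat"))  # True

# Same meaning, different vocabulary: no shared words, so no match.
print(is_match("Restart the service to apply the change",
               "Changes take effect once you reboot"))  # False
```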
Example request against the shadow route:
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer $DVARA_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Rewrite this paragraph in plain English."}]
  }'
The client gets the primary response back (from whichever provider the round-robin strategy picked). On roughly 10% of requests (sample_pct: 10), a copy of the request is dispatched asynchronously on a virtual thread to the shadow provider (bedrock). The client never sees the shadow response — its latency, success status, and response-match score are recorded to the shadow metrics collector and surfaced on the comparison report. An outage on the shadow provider can't affect the primary path.
SLA-aware priority routing
Priority admission control is off by default. Set dvara.llm-gateway.routing.priority.enabled=true to turn it on.
Priority routing provides concurrency-based admission control with per-tier throttle thresholds. Premium tenants are admitted up to full capacity; standard and bulk tenants are throttled earlier as load rises, so high-value traffic stays responsive under pressure.
Priority tiers
| Tier | Default threshold | Behavior |
|---|---|---|
| premium | 100% | Never throttled; admitted up to full capacity |
| standard | 80% | Throttled when concurrent load reaches 80% of max capacity |
| bulk | 50% | Throttled when concurrent load reaches 50% of max capacity |
Rejected requests return HTTP 429 with error code PRIORITY_THROTTLED.
Per-tenant configuration
Set a tenant's priority tier through the tenant metadata priority-tier key, either from the DVARA Flightdeck or the Automation API:
curl -X PUT http://localhost:8090/v1/admin/tenants/{id} \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "Acme Corp",
    "status": "active",
    "metadata": {"priority-tier": "premium"}
  }'
Tenants without a priority-tier metadata key default to standard.
Configuration
dvara:
  llm-gateway:
    routing:
      priority:
        enabled: true                   # default: false
        max-concurrent-requests: 1000   # default: 1000
        tiers:
          premium:
            throttle-threshold-pct: 100
          standard:
            throttle-threshold-pct: 80
          bulk:
            throttle-threshold-pct: 50
        resolver-cache-ttl-seconds: 5   # default: 5
How it works
1. After policy and PII enforcement, DVARA resolves the tenant's priority tier from tenant metadata (cached for 5 seconds to avoid a repository hit on every request).
2. An in-flight request counter tracks the current concurrent load. The load percentage is (current / maxConcurrent) * 100.
3. If the load percentage meets or exceeds the tier's threshold, the request is rejected with PRIORITY_THROTTLED.
4. Admitted requests increment the counter; it's decremented after the request completes, whether it succeeded or failed, so a failed request never leaks a capacity slot.
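The admission steps can be sketched as a small thread-safe counter in Python. This is a simplified model under assumptions, not the gateway's code: the class and method names are hypothetical, tier resolution and its cache are left out, and an unknown tier is treated as standard to mirror the tenant default.

```python
import threading


class PriorityAdmission:
    def __init__(self, max_concurrent=1000, thresholds=None):
        self.max_concurrent = max_concurrent
        self.thresholds = thresholds or {"premium": 100, "standard": 80, "bulk": 50}
        self._current = 0
        self._lock = threading.Lock()

    def try_admit(self, tier):
        """Return True and take a slot, or False (PRIORITY_THROTTLED, HTTP 429)."""
        threshold = self.thresholds.get(tier, self.thresholds["standard"])
        with self._lock:
            # Same value as (current / maxConcurrent) * 100.
            load_pct = self._current * 100 / self.max_concurrent
            if load_pct >= threshold:
                return False
            self._current += 1
            return True

    def release(self):
        """Give the slot back; call this whether the request succeeded or failed."""
        with self._lock:
            self._current -= 1


def handle(admission, tier, call_upstream):
    if not admission.try_admit(tier):
        return 429, "PRIORITY_THROTTLED"
    try:
        return 200, call_upstream()
    finally:
        admission.release()  # runs on success and on exceptions alike
```

With `max_concurrent=10`, a bulk request is rejected once 5 requests are in flight, a standard request at 8, and a premium request only at 10.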
Monitoring
Priority admission has two observability surfaces — pick the one that matches the workflow.
Prometheus metrics — primary. Live dashboards, alerting, and historical trend analysis all belong on your observability stack. DVARA emits the admission counters every request:
- gateway_priority_requests_total{tenant, tier}: admitted requests by tenant and tier
- gateway_priority_throttled_total{tenant, tier}: rejected requests by tenant and tier
Graph the ratio throttled / (admitted + throttled) per tier in Grafana to see the exact moments a tier hit its throttle threshold under load. The metrics carry per-tenant cardinality, so you can isolate a noisy tenant or cross-reference with that tenant's cost records.
REST endpoint — for scripts. GET /v1/admin/priority/stats returns a point-in-time snapshot of current concurrency and per-tier loads, useful for runbooks, capacity spot-checks, and automated oncall tooling:
curl http://localhost:8090/v1/admin/priority/stats
Response:
{
  "object": "priority_stats",
  "data": {
    "currentConcurrent": 17,
    "maxConcurrent": 1000,
    "perTierConcurrent": {"premium": 5, "standard": 10, "bulk": 2},
    "perTierThresholdPct": {"premium": 100, "standard": 80, "bulk": 50}
  }
}
There is no dedicated Flightdeck UI for priority stats by design — the data is observability, not configuration, and it belongs on the Grafana dashboard your oncall already watches. If you land on the Flightdeck Dashboard looking for throttle rates, see the Prometheus counters above.
Hot reload
Routes in the PostgreSQL routes table hot-reload across every node in the fleet within a second or two of the change landing. Writes from the Admin API or the DVARA Flightdeck fire a config_change notification on the routes channel; every gateway node listens, re-reads routes via routeRepository.findAll(), rebuilds its in-memory route list, and swaps atomically. No restart required.
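The rebuild-then-swap pattern described here can be illustrated with a few lines of Python. This is a conceptual sketch only: `RouteTable`, `on_config_change`, and `load_routes` are hypothetical names (`load_routes` stands in for `routeRepository.findAll()`), and the notification listener itself is left out.

```python
class RouteTable:
    def __init__(self, load_routes):
        self._load_routes = load_routes
        self._routes = tuple(load_routes())

    def on_config_change(self):
        # Build the complete new list first, then publish it with a single
        # reference assignment, so readers never observe a half-built list.
        fresh = tuple(self._load_routes())
        self._routes = fresh

    def snapshot(self):
        # In-flight requests keep using the immutable tuple they grabbed,
        # even if a reload swaps in a new one while they run.
        return self._routes


rows = [{"id": "geo-gpt", "strategy": "geo-aware"}]
table = RouteTable(lambda: list(rows))

in_flight = table.snapshot()
rows.append({"id": "auto-tier", "strategy": "intelligent"})
table.on_config_change()

print(len(in_flight), len(table.snapshot()))  # 1 2
```

The design choice worth noting is that the old and new lists never share mutable state, which is what makes the swap safe without locking readers.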
What hot-reloads:
- Route pattern, strategy, provider pool, weights, pinned model version
- Canary config on a route (split percentage, tenant scope, test name)
- Shadow config on a route (sample percentage, shadow provider)
- Tenant priority-tier metadata, read through a 5-second TTL cache per tenant, so a tenant-metadata update takes effect within 5 seconds
What does not hot-reload (these require a restart):
- The list of activated providers (set at startup from environment variables and the credentials table)
- The DVARA license key
- Every property under dvara.llm-gateway.routing.* in application.yml: EWMA smoothing factor, decay thresholds, priority admission max concurrent requests, per-tier throttle thresholds, resolver TTLs. These are bound at gateway startup and aren't refreshable without a restart.
- The dvara.llm-gateway.routes: block itself in application.yml. Any edits to that file need a gateway restart to reach the engine, and those static routes are still replaced wholesale by the routes table as soon as an admin write fires a config-change notification.
Update routes from the DVARA Flightdeck at /routes or through the Automation API (POST /v1/admin/routes, PUT /v1/admin/routes/{id}, DELETE /v1/admin/routes/{id}). Route versions are tracked on every save — roll back from DVARA Flightdeck if a live change causes a regression.
Next steps
- Providers — provider-specific quirks, capability matrix, and BYOK credentials for every upstream DVARA can talk to.
- Structured Outputs: how json_schema and json_object interact with capability-aware filtering.
- Resilience: circuit breakers, retries, and failover between providers on a route.
- Cost Management — the pricing table cost-aware routing reads from, plus budgets, forecasts, and chargeback reports.
- Multi-Tenancy — how per-tenant priority tiers and region context feed priority and geo-aware routing.