Routing & Load Balancing
Dvara routes requests to providers using configurable strategies. By default, the model field in the request determines the provider. You can override this with route definitions that distribute traffic across multiple providers.
Default Routing: Model Prefix
Without any route configuration, the gateway matches the model field against each provider's prefix:
| Model in Request | Matched Provider |
|---|---|
| gpt-4o | OpenAI |
| claude-sonnet-4-5 | Anthropic |
| gemini-2.0-flash | Gemini |
| bedrock/anthropic.claude-3-sonnet-20240229-v1:0 | Bedrock |
| ollama/llama3.2 | Ollama |
| mock/test-model | Mock |
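The default lookup above can be sketched as an ordered first-match scan. This is an illustrative sketch only; the prefix table below is derived from the table above, and the class and method names are not the gateway's internal types.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PrefixRouter {
    // Ordered prefix -> provider table (illustrative values mirroring the docs table)
    static final Map<String, String> PREFIXES = new LinkedHashMap<>();
    static {
        PREFIXES.put("gpt", "openai");
        PREFIXES.put("claude", "anthropic");
        PREFIXES.put("gemini", "gemini");
        PREFIXES.put("bedrock/", "bedrock");
        PREFIXES.put("ollama/", "ollama");
        PREFIXES.put("mock/", "mock");
    }

    /** Returns the first provider whose prefix matches the model, or null if none match. */
    public static String match(String model) {
        for (Map.Entry<String, String> e : PREFIXES.entrySet()) {
            if (model.startsWith(e.getKey())) return e.getValue();
        }
        return null;
    }
}
```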
Routing Strategies
| Strategy | Description |
|---|---|
| model-prefix | Default. First provider whose prefix matches the model. |
| round-robin | Distributes requests evenly across listed providers in order. |
| weighted | Distributes requests by configured percentage weights. |
| latency-aware | Enterprise. Routes to the lowest-latency healthy provider (EWMA). |
Route Configuration
Add routes under the gateway.routes key in application.yml:
gateway:
routes:
- id: load-balance-gpt
model-pattern: "gpt*"
strategy: round-robin
providers:
- provider: openai
- provider: bedrock
- id: weighted-claude
model-pattern: "claude*"
strategy: weighted
providers:
- provider: anthropic
weight: 70
- provider: bedrock
weight: 30
- id: pin-gpt4o
model-pattern: "gpt-4o"
strategy: model-prefix
pinned-model-version: "gpt-4o-2024-08-06"
Route Fields
| Field | Type | Required | Description |
|---|---|---|---|
| id | string | yes | Unique route identifier |
| model-pattern | string | yes | Glob pattern to match model names (* suffix matches any remainder) |
| strategy | string | no | model-prefix (default), round-robin, weighted, or latency-aware (enterprise) |
| cost-tolerance-pct | int | no | Cost tolerance percentage for latency-aware routing (0–100, default: 0) |
| providers | list | yes* | Provider pool for this route (*required for round-robin/weighted) |
| providers[].provider | string | yes | Provider name: openai, anthropic, gemini, bedrock, ollama |
| providers[].weight | int | no | Weight for weighted routing (default: 1) |
| providers[].region | string | no | Region affinity for this provider entry (enterprise multi-region) |
| pinned-model-version | string | no | Overrides the model name before sending to the provider |
How Route Matching Works
- The gateway checks each configured route in order. The first route whose model-pattern matches the request's model field is selected.
- The route's strategy picks a provider from the route's provider pool.
- If no route matches, the default model-prefix strategy is used.
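The pattern semantics described above (exact match, or a trailing * matching any remainder) can be sketched as a first-match scan. This is a minimal illustration under those assumptions, not the gateway's actual matcher.

```java
public class RouteMatcher {
    /** Glob match where only a trailing '*' is a wildcard, matching any remainder. */
    public static boolean matches(String pattern, String model) {
        if (pattern.endsWith("*")) {
            return model.startsWith(pattern.substring(0, pattern.length() - 1));
        }
        return model.equals(pattern);
    }

    /** First-match wins: returns the index of the first matching route pattern, or -1. */
    public static int firstMatch(String[] patterns, String model) {
        for (int i = 0; i < patterns.length; i++) {
            if (matches(patterns[i], model)) return i;
        }
        return -1;
    }
}
```

Because the scan is ordered, a specific route like "gpt-4o" should be listed before a broad one like "gpt*" if both are configured.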
Round-Robin Example
Distribute GPT requests evenly between OpenAI and Bedrock:
gateway:
routes:
- id: gpt-round-robin
model-pattern: "gpt*"
strategy: round-robin
providers:
- provider: openai
- provider: bedrock
Requests cycle through the providers in order: OpenAI, Bedrock, OpenAI, Bedrock, ...
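The rotation can be sketched with an atomic counter; this is one reasonable implementation under that assumption, not necessarily the gateway's.

```java
import java.util.concurrent.atomic.AtomicLong;

public class RoundRobin {
    private final String[] pool;
    private final AtomicLong counter = new AtomicLong();

    public RoundRobin(String... pool) { this.pool = pool; }

    /** Thread-safe rotation: request N goes to pool[N % pool.length]. */
    public String next() {
        return pool[(int) (counter.getAndIncrement() % pool.length)];
    }
}
```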
Weighted Routing Example
Send 70% of Claude traffic to Anthropic and 30% to Bedrock:
gateway:
routes:
- id: claude-weighted
model-pattern: "claude*"
strategy: weighted
providers:
- provider: anthropic
weight: 70
- provider: bedrock
weight: 30
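The weighted split can be sketched as cumulative-weight selection. The sketch below is illustrative; in practice the value r would come from a random source rather than being passed in.

```java
public class WeightedPicker {
    /** Picks index i with probability weights[i] / sum(weights), driven by r in [0, 1). */
    public static int pick(int[] weights, double r) {
        int total = 0;
        for (int w : weights) total += w;
        double target = r * total;   // point on the cumulative weight line
        double cumulative = 0;
        for (int i = 0; i < weights.length; i++) {
            cumulative += weights[i];
            if (target < cumulative) return i;
        }
        return weights.length - 1;   // guard against floating-point edge cases
    }
}
```

With weights {70, 30}, values of r below 0.7 land on the first provider and the rest on the second, matching the 70/30 split.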
Model Version Pinning
Lock a model alias to a specific version:
gateway:
routes:
- id: pin-gpt4o
model-pattern: "gpt-4o"
strategy: model-prefix
pinned-model-version: "gpt-4o-2024-08-06"
When a request for gpt-4o matches this route, the gateway rewrites the model to gpt-4o-2024-08-06 before forwarding to the provider. This ensures consistent behavior even when the provider updates the default model version.
Capability-Aware Filtering
When response_format is present on a request, the gateway automatically filters the provider pool before applying the routing strategy. Only providers that support the requested format are considered:
| response_format type | Required capability |
|---|---|
| json_schema | supportsStructuredOutputs = true |
| json_object | supportsJsonMode = true |
| text / absent | No filtering applied |
This prevents routing to providers that cannot handle the request. If no provider on the route supports the requested format, the gateway returns HTTP 400 with error code no_capable_provider.
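The filtering step can be sketched as below. The Provider class and its boolean fields are illustrative stand-ins for the capability flags named in the table, not the gateway's real classes.

```java
import java.util.ArrayList;
import java.util.List;

public class CapabilityFilter {
    // Illustrative capability flags, mirroring the names in the docs table.
    public static class Provider {
        public final String name;
        public final boolean supportsStructuredOutputs;
        public final boolean supportsJsonMode;
        public Provider(String name, boolean structured, boolean jsonMode) {
            this.name = name;
            this.supportsStructuredOutputs = structured;
            this.supportsJsonMode = jsonMode;
        }
    }

    /** Keeps only providers able to honor the requested response_format type. */
    public static List<Provider> filter(List<Provider> pool, String responseFormatType) {
        List<Provider> out = new ArrayList<>();
        for (Provider p : pool) {
            boolean ok;
            if ("json_schema".equals(responseFormatType)) {
                ok = p.supportsStructuredOutputs;
            } else if ("json_object".equals(responseFormatType)) {
                ok = p.supportsJsonMode;
            } else {
                ok = true; // "text" or absent: no filtering
            }
            if (ok) out.add(p);
        }
        // An empty result surfaces as HTTP 400 with error code no_capable_provider.
        return out;
    }
}
```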
See Structured Outputs for details on capability-aware routing behavior.
Latency-Aware Routing (Enterprise)
Requires the enterprise-routing module and a valid JWT license key (via GATEWAY_ENTERPRISE_LICENSE_KEY).
The latency-aware strategy tracks live EWMA (Exponentially Weighted Moving Average) latency per provider/model pair and automatically routes requests to the fastest healthy provider. This is ideal for multi-provider setups where provider performance varies over time.
How It Works
- Latency recording — Every successful provider call records its response time. The tracker computes a running EWMA: ewma = alpha * sample + (1 - alpha) * previous_ewma.
- Provider selection — For each request, healthy providers are ranked by EWMA latency. The lowest-latency provider is selected.
- Cold-start exploration — Providers without enough latency samples receive ~10% of traffic (round-robin) so the system can learn their performance.
- Time decay — Stale latency entries (no samples for > 60s by default) receive a penalty to prevent routing based on outdated data.
- Fallback — If all providers lack data, the strategy falls back to round-robin.
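The recording and selection steps can be sketched as follows, using the documented default smoothing factor of 0.2 (gateway.routing.latency.alpha). Names are illustrative, not the gateway's internals.

```java
public class EwmaTracker {
    static final double ALPHA = 0.2; // default gateway.routing.latency.alpha

    /** One recording step: ewma = alpha * sample + (1 - alpha) * previous_ewma. */
    public static double update(double previousEwma, double sampleMs) {
        return ALPHA * sampleMs + (1 - ALPHA) * previousEwma;
    }

    /** Provider selection: index of the lowest-EWMA (fastest) entry. */
    public static int fastest(double[] ewmaByProvider) {
        int best = 0;
        for (int i = 1; i < ewmaByProvider.length; i++) {
            if (ewmaByProvider[i] < ewmaByProvider[best]) best = i;
        }
        return best;
    }
}
```

For example, a provider sitting at 100 ms that returns a 200 ms sample moves to 0.2 * 200 + 0.8 * 100 = 120 ms.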
Configuration
gateway:
routes:
- id: latency-gpt
model-pattern: "gpt*"
strategy: latency-aware
cost-tolerance-pct: 10 # optional: accept a provider up to 10% more expensive if faster
providers:
- provider: openai
weight: 1
- provider: bedrock
weight: 1
Route Fields (Latency-Aware)
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| strategy | string | yes | — | Must be latency-aware |
| cost-tolerance-pct | int | no | 0 | Percentage tolerance for cost vs. latency trade-off (0–100) |
| providers | list | yes | — | Provider pool to route across |
EWMA Tuning (Enterprise Properties)
| Property | Default | Description |
|---|---|---|
| gateway.routing.latency.alpha | 0.2 | EWMA smoothing factor (0.0–1.0). Higher = more weight on recent samples |
| gateway.routing.latency.decay-threshold-ms | 60000 | Staleness threshold in ms. Entries older than this get a decay penalty |
| gateway.routing.latency.decay-multiplier | 0.5 | Stale EWMA is divided by this value (lower = harsher penalty) |
| gateway.routing.latency.min-samples | 5 | Minimum samples before EWMA is used for routing decisions |
| gateway.routing.latency.snapshot-interval | 100 | Persist latency snapshot to repository every N samples |
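The decay penalty can be sketched with the default property values above: dividing a stale EWMA by 0.5 doubles the effective latency used for ranking, so a provider with outdated data is deprioritized until fresh samples arrive. This is a sketch of that arithmetic, not the gateway's implementation.

```java
public class DecayPenalty {
    static final long DECAY_THRESHOLD_MS = 60_000; // gateway.routing.latency.decay-threshold-ms
    static final double DECAY_MULTIPLIER = 0.5;    // gateway.routing.latency.decay-multiplier

    /** Effective EWMA used for ranking: stale entries are divided by the multiplier. */
    public static double effective(double ewmaMs, long msSinceLastSample) {
        return msSinceLastSample > DECAY_THRESHOLD_MS ? ewmaMs / DECAY_MULTIPLIER : ewmaMs;
    }
}
```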
Monitoring Latency Data
Query current EWMA latency for all tracked provider+model pairs:
curl http://localhost:8080/admin/v1/latency
Response:
{
"object": "list",
"data": [
{
"provider": "openai",
"model": "gpt-4o",
"ewmaLatencyMs": 120.5,
"rawLatencyMs": 115.0,
"sampleCount": 42,
"lastUpdated": "2026-01-01T00:00:00Z"
}
]
}
SLA-Aware Priority Routing (Enterprise)
Requires the enterprise-routing module, a valid JWT license key, and gateway.routing.priority.enabled=true.
Priority routing provides concurrency-based admission control with per-tier throttle thresholds. Premium tenants are never throttled; standard and bulk tenants are throttled at progressively lower load levels, with bulk throttled first.
Priority Tiers
| Tier | Default Threshold | Behavior |
|---|---|---|
| premium | 100% | Never throttled (admitted up to full capacity) |
| standard | 80% | Throttled when concurrent load reaches 80% of max |
| bulk | 50% | Throttled when concurrent load reaches 50% of max |
Per-Tenant Configuration
Set priority tier via tenant metadata:
curl -X PUT http://localhost:8080/admin/v1/tenants/{id} \
-H 'Content-Type: application/json' \
-d '{"name": "Acme Corp", "status": "active", "metadata": {"priority-tier": "premium"}}'
Tenants without a priority-tier metadata key default to standard.
Configuration
gateway:
routing:
priority:
enabled: true # default: false
max-concurrent-requests: 1000 # default: 1000
tiers:
premium:
throttle-threshold-pct: 100
standard:
throttle-threshold-pct: 80
bulk:
throttle-threshold-pct: 50
resolver-cache-ttl-seconds: 5 # default: 5
How It Works
- After policy and PII enforcement, the gateway resolves the tenant's priority tier from Tenant.metadata["priority-tier"].
- An AtomicInteger tracks total in-flight requests. The load percentage is (current / maxConcurrent) * 100.
- If the load percentage meets or exceeds the tier's threshold, the request is rejected with HTTP 429 (PRIORITY_THROTTLED).
- Admitted requests increment the counter; a finally block decrements it after completion.
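The admission check can be sketched as below. The tier thresholds and the 1000-request cap mirror the documented defaults; the class and method names are illustrative, not the gateway's.

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

public class PriorityAdmission {
    static final int MAX_CONCURRENT = 1000; // gateway.routing.priority.max-concurrent-requests
    static final Map<String, Integer> THRESHOLD_PCT =
            Map.of("premium", 100, "standard", 80, "bulk", 50);
    public static final AtomicInteger IN_FLIGHT = new AtomicInteger();

    /** Returns true if admitted; otherwise the caller replies 429 PRIORITY_THROTTLED. */
    public static boolean tryAdmit(String tier) {
        int loadPct = IN_FLIGHT.get() * 100 / MAX_CONCURRENT;
        int threshold = THRESHOLD_PCT.getOrDefault(tier, THRESHOLD_PCT.get("standard"));
        if (loadPct >= threshold) return false;
        IN_FLIGHT.incrementAndGet();
        return true;
    }

    /** Call from a finally block once the request completes. */
    public static void release() { IN_FLIGHT.decrementAndGet(); }
}
```

At 80% load, for instance, a premium request is still admitted while standard and bulk requests are rejected.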
Monitoring
Query current admission stats:
curl http://localhost:8080/admin/v1/priority/stats
Response:
{
"object": "priority_stats",
"data": {
"currentConcurrent": 17,
"maxConcurrent": 1000,
"perTierConcurrent": {"premium": 5, "standard": 10, "bulk": 2},
"perTierThresholdPct": {"premium": 100, "standard": 80, "bulk": 50}
}
}
Prometheus metrics:
- gateway_priority_requests_total (labels: tenant, tier) — admitted requests by tier
- gateway_priority_throttled_total (labels: tenant, tier) — rejected requests by tier
Hot Reload
Route configuration is stored in-memory and can be updated at runtime via the RoutingEngine.updateRoutes() API without restarting the gateway.