
Routing & Load Balancing

Dvara routes requests to providers using configurable strategies. By default, the model field in the request determines the provider. You can override this with route definitions that distribute traffic across multiple providers.

Default Routing: Model Prefix

Without any route configuration, the gateway matches the model field against each provider's prefix:

| Model in Request | Matched Provider |
|---|---|
| gpt-4o | OpenAI |
| claude-sonnet-4-5 | Anthropic |
| gemini-2.0-flash | Gemini |
| bedrock/anthropic.claude-3-sonnet-20240229-v1:0 | Bedrock |
| ollama/llama3.2 | Ollama |
| mock/test-model | Mock |

Routing Strategies

| Strategy | Description |
|---|---|
| model-prefix | Default. First provider whose prefix matches the model. |
| round-robin | Distributes requests evenly across listed providers in order. |
| weighted | Distributes requests by configured percentage weights. |
| latency-aware | Enterprise. Routes to the lowest-latency healthy provider (EWMA). |

Route Configuration

Add routes under the gateway.routes key in application.yml:

```yaml
gateway:
  routes:
    - id: load-balance-gpt
      model-pattern: "gpt*"
      strategy: round-robin
      providers:
        - provider: openai
        - provider: bedrock

    - id: weighted-claude
      model-pattern: "claude*"
      strategy: weighted
      providers:
        - provider: anthropic
          weight: 70
        - provider: bedrock
          weight: 30

    - id: pin-gpt4o
      model-pattern: "gpt-4o"
      strategy: model-prefix
      pinned-model-version: "gpt-4o-2024-08-06"
```

Route Fields

| Field | Type | Required | Description |
|---|---|---|---|
| id | string | yes | Unique route identifier |
| model-pattern | string | yes | Glob pattern to match model names (`*` suffix matches any remainder) |
| strategy | string | no | model-prefix (default), round-robin, weighted, or latency-aware (enterprise) |
| cost-tolerance-pct | int | no | Cost tolerance percentage for latency-aware routing (0–100, default: 0) |
| providers | list | yes* | Provider pool for this route (*required for round-robin/weighted) |
| providers[].provider | string | yes | Provider name: openai, anthropic, gemini, bedrock, ollama |
| providers[].weight | int | no | Weight for weighted routing (default: 1) |
| providers[].region | string | no | Region affinity for this provider entry (enterprise multi-region) |
| pinned-model-version | string | no | Overrides the model name before sending to the provider |

How Route Matching Works

  1. The gateway checks each configured route in order. The first route whose model-pattern matches the request's model field is selected.
  2. The route's strategy picks a provider from the route's provider pool.
  3. If no route matches, the default model-prefix strategy is used.
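The matching steps above can be sketched with glob matching. This is an illustrative Python sketch, not the gateway's actual code; the route dictionaries mirror the YAML route fields:

```python
import fnmatch

def match_route(model, routes):
    """Return the first route whose model-pattern matches the model.

    Returns None when no route matches, in which case the gateway falls
    back to the default model-prefix strategy.
    """
    for route in routes:
        if fnmatch.fnmatch(model, route["model-pattern"]):
            return route
    return None

routes = [
    {"id": "load-balance-gpt", "model-pattern": "gpt*"},
    {"id": "weighted-claude", "model-pattern": "claude*"},
]
```

Because matching is first-win and in configuration order, more specific patterns should be listed before broader ones.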

Round-Robin Example

Distribute GPT requests evenly between OpenAI and Bedrock:

```yaml
gateway:
  routes:
    - id: gpt-round-robin
      model-pattern: "gpt*"
      strategy: round-robin
      providers:
        - provider: openai
        - provider: bedrock
```

Requests cycle through the providers in order: OpenAI, Bedrock, OpenAI, Bedrock, ...
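The cycling behavior reduces to a counter modulo the pool size. A minimal sketch in illustrative Python (a production implementation would use an atomic counter; class and method names here are hypothetical):

```python
class RoundRobin:
    """Cycle through a provider pool in order: index = counter mod pool size."""

    def __init__(self, providers):
        self.providers = providers
        self._counter = 0

    def next_provider(self):
        provider = self.providers[self._counter % len(self.providers)]
        self._counter += 1
        return provider

rr = RoundRobin(["openai", "bedrock"])
```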

Weighted Routing Example

Send 70% of Claude traffic to Anthropic and 30% to Bedrock:

```yaml
gateway:
  routes:
    - id: claude-weighted
      model-pattern: "claude*"
      strategy: weighted
      providers:
        - provider: anthropic
          weight: 70
        - provider: bedrock
          weight: 30
```
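Weight-proportional selection amounts to sampling against cumulative weights. A sketch in illustrative Python (the injectable `rng` parameter is an assumption made here so the examples are deterministic):

```python
import random

def pick_weighted(providers, rng=random.random):
    """Pick a provider with probability proportional to its weight."""
    total = sum(p.get("weight", 1) for p in providers)  # documented default weight: 1
    r = rng() * total
    for p in providers:
        r -= p.get("weight", 1)
        if r < 0:
            return p["provider"]
    return providers[-1]["provider"]  # guard against floating-point edge cases

claude_pool = [
    {"provider": "anthropic", "weight": 70},
    {"provider": "bedrock", "weight": 30},
]
```

With weights 70/30, draws below 0.7 land on anthropic and the rest on bedrock, matching the 70%/30% split.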

Model Version Pinning

Lock a model alias to a specific version:

```yaml
gateway:
  routes:
    - id: pin-gpt4o
      model-pattern: "gpt-4o"
      strategy: model-prefix
      pinned-model-version: "gpt-4o-2024-08-06"
```

When a request for gpt-4o matches this route, the gateway rewrites the model to gpt-4o-2024-08-06 before forwarding to the provider. This ensures consistent behavior even when the provider updates the default model version.
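The effective rewrite can be sketched in a few lines (illustrative Python; the helper name is hypothetical):

```python
def apply_pin(request, route):
    """Rewrite the model field when the matched route pins a version."""
    pinned = route.get("pinned-model-version")
    if pinned:
        return {**request, "model": pinned}  # pin overrides the requested alias
    return request
```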

Capability-Aware Filtering

When response_format is present on a request, the gateway automatically filters the provider pool before applying the routing strategy. Only providers that support the requested format are considered:

| response_format type | Required capability |
|---|---|
| json_schema | supportsStructuredOutputs = true |
| json_object | supportsJsonMode = true |
| text / absent | No filtering applied |

This prevents routing to providers that cannot handle the request. If no provider on the route supports the requested format, the gateway returns HTTP 400 with error code no_capable_provider.
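The rule in the table maps to a straightforward pool reduction. A sketch in illustrative Python (the capability flag names follow the table above; the function name and provider dictionaries are assumptions):

```python
def filter_capable(providers, response_format):
    """Drop providers that cannot satisfy the requested response_format."""
    fmt = (response_format or {}).get("type", "text")
    if fmt == "json_schema":
        pool = [p for p in providers if p.get("supportsStructuredOutputs")]
    elif fmt == "json_object":
        pool = [p for p in providers if p.get("supportsJsonMode")]
    else:  # text / absent: no filtering applied
        pool = list(providers)
    if not pool:
        # the gateway surfaces this as HTTP 400 with error code no_capable_provider
        raise ValueError("no_capable_provider")
    return pool

pool = [
    {"provider": "openai", "supportsStructuredOutputs": True, "supportsJsonMode": True},
    {"provider": "bedrock", "supportsStructuredOutputs": False, "supportsJsonMode": True},
]
```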

See Structured Outputs for details on capability-aware routing behavior.

Latency-Aware Routing (Enterprise)

Requires enterprise-routing module + valid JWT license key (via GATEWAY_ENTERPRISE_LICENSE_KEY).

The latency-aware strategy tracks live EWMA (Exponentially Weighted Moving Average) latency per provider/model pair and automatically routes requests to the fastest healthy provider. This is ideal for multi-provider setups where provider performance varies over time.

How It Works

  1. Latency recording — Every successful provider call records its response time. The tracker computes a running EWMA: ewma = alpha * sample + (1 - alpha) * previous_ewma.
  2. Provider selection — For each request, healthy providers are ranked by EWMA latency. The lowest-latency provider is selected.
  3. Cold-start exploration — Providers without enough latency samples receive ~10% of traffic (round-robin) so the system can learn their performance.
  4. Time decay — Stale latency entries (no samples for > 60s by default) receive a penalty to prevent routing based on outdated data.
  5. Fallback — If all providers lack data, the strategy falls back to round-robin.
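Steps 1 and 4 can be sketched as a small tracker using the documented defaults. This is illustrative Python, not the enterprise module's code:

```python
class LatencyTracker:
    """EWMA latency per (provider, model) pair, with a staleness penalty."""

    def __init__(self, alpha=0.2, decay_threshold_ms=60_000, decay_multiplier=0.5):
        self.alpha = alpha
        self.decay_threshold_ms = decay_threshold_ms
        self.decay_multiplier = decay_multiplier
        self._ewma = {}       # (provider, model) -> EWMA latency in ms
        self._last_seen = {}  # (provider, model) -> timestamp of last sample, ms

    def record(self, key, sample_ms, now_ms):
        prev = self._ewma.get(key)
        if prev is None:
            self._ewma[key] = sample_ms  # first sample seeds the EWMA
        else:
            # ewma = alpha * sample + (1 - alpha) * previous_ewma
            self._ewma[key] = self.alpha * sample_ms + (1 - self.alpha) * prev
        self._last_seen[key] = now_ms

    def effective(self, key, now_ms):
        """EWMA used for ranking; stale entries are divided by the decay
        multiplier, so a lower multiplier is a harsher penalty."""
        ewma = self._ewma[key]
        if now_ms - self._last_seen[key] > self.decay_threshold_ms:
            ewma /= self.decay_multiplier
        return ewma
```

Providers are then ranked by `effective()` latency, with cold-start exploration and the round-robin fallback layered on top as described above.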

Configuration

```yaml
gateway:
  routes:
    - id: latency-gpt
      model-pattern: "gpt*"
      strategy: latency-aware
      cost-tolerance-pct: 10  # optional: accept a provider up to 10% more expensive if faster
      providers:
        - provider: openai
          weight: 1
        - provider: bedrock
          weight: 1
```

Route Fields (Latency-Aware)

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| strategy | string | yes | — | Must be latency-aware |
| cost-tolerance-pct | int | no | 0 | Percentage tolerance for cost vs. latency trade-off (0–100) |
| providers | list | yes | — | Provider pool to route across |

EWMA Tuning (Enterprise Properties)

| Property | Default | Description |
|---|---|---|
| gateway.routing.latency.alpha | 0.2 | EWMA smoothing factor (0.0–1.0). Higher = more weight on recent samples |
| gateway.routing.latency.decay-threshold-ms | 60000 | Staleness threshold in ms. Entries older than this get a decay penalty |
| gateway.routing.latency.decay-multiplier | 0.5 | Stale EWMA is divided by this value (lower = harsher penalty) |
| gateway.routing.latency.min-samples | 5 | Minimum samples before EWMA is used for routing decisions |
| gateway.routing.latency.snapshot-interval | 100 | Persist latency snapshot to repository every N samples |

Monitoring Latency Data

Query current EWMA latency for all tracked provider+model pairs:

```shell
curl http://localhost:8080/admin/v1/latency
```

Response:

```json
{
  "object": "list",
  "data": [
    {
      "provider": "openai",
      "model": "gpt-4o",
      "ewmaLatencyMs": 120.5,
      "rawLatencyMs": 115.0,
      "sampleCount": 42,
      "lastUpdated": "2026-01-01T00:00:00Z"
    }
  ]
}
```

SLA-Aware Priority Routing (Enterprise)

Requires enterprise-routing module + valid JWT license key + gateway.routing.priority.enabled=true.

Priority routing provides concurrency-based admission control with per-tier throttle thresholds. Premium tenants are never throttled, while standard and bulk tenants are throttled earliest under heavy load.

Priority Tiers

| Tier | Default Threshold | Behavior |
|---|---|---|
| premium | 100% | Never throttled (admitted up to full capacity) |
| standard | 80% | Throttled when concurrent load reaches 80% of max |
| bulk | 50% | Throttled when concurrent load reaches 50% of max |

Per-Tenant Configuration

Set priority tier via tenant metadata:

```shell
curl -X PUT http://localhost:8080/admin/v1/tenants/{id} \
  -H 'Content-Type: application/json' \
  -d '{"name": "Acme Corp", "status": "active", "metadata": {"priority-tier": "premium"}}'
```

Tenants without a priority-tier metadata key default to standard.

Configuration

```yaml
gateway:
  routing:
    priority:
      enabled: true                  # default: false
      max-concurrent-requests: 1000  # default: 1000
      tiers:
        premium:
          throttle-threshold-pct: 100
        standard:
          throttle-threshold-pct: 80
        bulk:
          throttle-threshold-pct: 50
      resolver-cache-ttl-seconds: 5  # default: 5
```

How It Works

  1. After policy and PII enforcement, the gateway resolves the tenant's priority tier from Tenant.metadata["priority-tier"].
  2. An AtomicInteger tracks total in-flight requests. The load percentage is (current / maxConcurrent) * 100.
  3. If the load percentage meets or exceeds the tier's threshold, the request is rejected with HTTP 429 (PRIORITY_THROTTLED).
  4. Admitted requests increment the counter; a finally block decrements it after completion.
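The admission check above can be sketched as follows. This is illustrative Python (the gateway uses an AtomicInteger; a lock stands in for it here, and the class name is hypothetical):

```python
import threading

class PriorityAdmission:
    """Concurrency-based admission control with per-tier throttle thresholds."""

    def __init__(self, max_concurrent=1000, thresholds=None):
        self.max_concurrent = max_concurrent
        # documented defaults; tenants without a tier resolve to "standard"
        self.thresholds = thresholds or {"premium": 100, "standard": 80, "bulk": 50}
        self._current = 0
        self._lock = threading.Lock()

    def try_admit(self, tier):
        threshold = self.thresholds.get(tier, self.thresholds["standard"])
        with self._lock:
            load_pct = self._current / self.max_concurrent * 100
            if load_pct >= threshold:
                return False  # caller responds with HTTP 429 (PRIORITY_THROTTLED)
            self._current += 1
            return True

    def release(self):
        """Called from a finally block once the request completes."""
        with self._lock:
            self._current -= 1
```

Note that with a 100% threshold, premium traffic is only ever rejected at full capacity, while bulk traffic starts being shed once half of the slots are in use.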

Monitoring

Query current admission stats:

```shell
curl http://localhost:8080/admin/v1/priority/stats
```

Response:

```json
{
  "object": "priority_stats",
  "data": {
    "currentConcurrent": 17,
    "maxConcurrent": 1000,
    "perTierConcurrent": {"premium": 5, "standard": 10, "bulk": 2},
    "perTierThresholdPct": {"premium": 100, "standard": 80, "bulk": 50}
  }
}
```

Prometheus metrics:

  • gateway_priority_requests_total (labels: tenant, tier) — admitted requests by tier
  • gateway_priority_throttled_total (labels: tenant, tier) — rejected requests by tier

Hot Reload

Route configuration is held in memory and can be updated at runtime via the RoutingEngine.updateRoutes() API without restarting the gateway.