Resilience & Failover

DVARA wraps every provider with retry, circuit breaker, and timeout protection. When a provider fails, the gateway automatically attempts fallback providers.

Configuration is static at startup

Every value on this page is read once at boot from application.yml (or the matching environment variables). Changing any value requires a deploy roll — there is no Console UI, no Admin API, and no live-reload via the configuration-refresh pipeline that powers routes and policies. The values are baked into each provider's wrappers when the gateway starts and stay there for the life of the pod.

Retry with Exponential Backoff

Failed requests are automatically retried up to a configurable number of attempts. The wait between attempts starts at initial-backoff-ms, multiplies by backoff-multiplier on each retry, and is capped at max-backoff-ms:

dvara:
  llm-gateway:
    resilience:
      retry:
        max-attempts: 3 # total attempts including the first try
        initial-backoff-ms: 500 # wait before the first retry
        backoff-multiplier: 2.0 # delay grows: 500ms → 1s → 2s → …
        max-backoff-ms: 10000 # capped so a long chain can't explode

Or via environment variables:

DVARA_LLM_GATEWAY_RESILIENCE_RETRY_MAX_ATTEMPTS=3
DVARA_LLM_GATEWAY_RESILIENCE_RETRY_INITIAL_BACKOFF_MS=500
DVARA_LLM_GATEWAY_RESILIENCE_RETRY_BACKOFF_MULTIPLIER=2.0
DVARA_LLM_GATEWAY_RESILIENCE_RETRY_MAX_BACKOFF_MS=10000

Only provider errors (upstream 5xx, timeouts, connection failures — surfaced as PROVIDER_ERROR at the gateway) trigger a retry. Gateway-internal errors like NO_PROVIDER, INVALID_REQUEST, POLICY_DENIED, and every other client-error code are returned immediately without retry.
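The backoff schedule and the retry rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the gateway's implementation: the function names are hypothetical, and it assumes no jitter is applied and that the cap is enforced per delay.

```python
# Hypothetical helpers illustrating the documented retry settings.
def backoff_delays_ms(max_attempts=3, initial_backoff_ms=500,
                      backoff_multiplier=2.0, max_backoff_ms=10000):
    """Return the wait (ms) before each retry; there are max_attempts - 1 retries."""
    delays = []
    delay = float(initial_backoff_ms)
    for _ in range(max_attempts - 1):
        delays.append(int(min(delay, max_backoff_ms)))  # cap each individual wait
        delay *= backoff_multiplier
    return delays

# Only provider errors are retried; every other code is returned immediately.
RETRYABLE_CODES = {"PROVIDER_ERROR"}

def is_retryable(error_code):
    return error_code in RETRYABLE_CODES

print(backoff_delays_ms())                # [500, 1000]
print(backoff_delays_ms(max_attempts=7))  # [500, 1000, 2000, 4000, 8000, 10000]
```

With the defaults, a request that fails every time waits 500 ms and then 1 s before the third and final attempt, so the cap only comes into play when max-attempts is raised.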

Circuit Breaker

Each provider has an independent circuit breaker that prevents cascading failures:

dvara:
  llm-gateway:
    resilience:
      circuit-breaker:
        failure-rate-threshold: 50 # percentage of failing calls in the window that opens the breaker
        sliding-window-size: 10 # number of calls the breaker averages over
        minimum-number-of-calls: 5 # minimum calls in the window before the breaker considers tripping; suppresses early-startup noise
        wait-duration-in-open-state-ms: 30000 # time the breaker stays OPEN before probing upstream again
        permitted-calls-in-half-open: 3 # number of probe calls allowed in HALF_OPEN; if all succeed, the breaker closes

Or via environment variables:

DVARA_LLM_GATEWAY_RESILIENCE_CIRCUIT_BREAKER_FAILURE_RATE_THRESHOLD=50
DVARA_LLM_GATEWAY_RESILIENCE_CIRCUIT_BREAKER_SLIDING_WINDOW_SIZE=10
DVARA_LLM_GATEWAY_RESILIENCE_CIRCUIT_BREAKER_MINIMUM_NUMBER_OF_CALLS=5
DVARA_LLM_GATEWAY_RESILIENCE_CIRCUIT_BREAKER_WAIT_DURATION_IN_OPEN_STATE_MS=30000
DVARA_LLM_GATEWAY_RESILIENCE_CIRCUIT_BREAKER_PERMITTED_CALLS_IN_HALF_OPEN=3

State Transitions

CLOSED ──(failure rate ≥ 50%)──▶ OPEN ──(30s wait)──▶ HALF_OPEN ──(test calls)──▶ CLOSED
                                  ▲                       │
                                  │                       │
                                  └────(still failing)────┘
State      Behavior
CLOSED     Normal operation. Tracks success/failure in a sliding window.
OPEN       All requests fail immediately with PROVIDER_CIRCUIT_OPEN (503). No upstream calls.
HALF_OPEN  A limited number of test calls are allowed through. If successful, circuit closes. If not, circuit re-opens.
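The state machine above can be approximated with a small count-based breaker. This is a sketch under stated assumptions, not the gateway's code: the class and method names are hypothetical, any failed probe in HALF_OPEN re-opens the circuit immediately, and the limit on concurrent probes is not enforced.

```python
from collections import deque

class CircuitBreaker:
    """Count-based breaker sketch; `clock` returns the current time in milliseconds."""

    def __init__(self, clock, failure_rate_threshold=50, sliding_window_size=10,
                 minimum_number_of_calls=5, wait_duration_in_open_state_ms=30000,
                 permitted_calls_in_half_open=3):
        self.clock = clock
        self.threshold = failure_rate_threshold
        self.minimum_calls = minimum_number_of_calls
        self.wait_ms = wait_duration_in_open_state_ms
        self.permitted = permitted_calls_in_half_open
        self.window = deque(maxlen=sliding_window_size)  # True marks a failed call
        self.state = "CLOSED"
        self.opened_at = None
        self.probe_successes = 0

    def allow_request(self):
        """Return True if a call may go upstream; moves OPEN -> HALF_OPEN after the wait."""
        if self.state == "OPEN" and self.clock() - self.opened_at >= self.wait_ms:
            self.state = "HALF_OPEN"
            self.probe_successes = 0
        return self.state != "OPEN"

    def record(self, failed):
        """Record the outcome of a call that allow_request() let through."""
        if self.state == "HALF_OPEN":
            if failed:
                self._open()  # still failing: back to OPEN
            else:
                self.probe_successes += 1
                if self.probe_successes >= self.permitted:
                    self.state = "CLOSED"
                    self.window.clear()
            return
        self.window.append(failed)
        if len(self.window) >= self.minimum_calls:
            failure_rate = 100 * sum(self.window) / len(self.window)
            if failure_rate >= self.threshold:
                self._open()

    def _open(self):
        self.state = "OPEN"
        self.opened_at = self.clock()
        self.window.clear()

now = [0]  # fake clock so the example runs instantly
breaker = CircuitBreaker(clock=lambda: now[0])
for failed in [False, False, True, True, True]:
    breaker.record(failed)
print(breaker.state)  # OPEN (3 of 5 calls failed, 60% >= 50%)
```

Note how minimum-number-of-calls delays the decision: the third failure is the fifth call overall, which is the first point at which the failure rate is evaluated at all.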

Mock Provider Exclusion

The mock provider is excluded from circuit breaker and retry wrapping. Simulated errors injected via error-rate are intended for testing client-side error handling and will never trip the circuit breaker or trigger retries. The mock provider is always considered healthy.

Timeout Configuration

Separate timeouts for non-streaming and streaming requests:

dvara:
  llm-gateway:
    resilience:
      timeout:
        chat-timeout-ms: 30000 # 30 seconds for non-streaming
        streaming-timeout-ms: 120000 # 120 seconds for streaming

Or via environment variables:

DVARA_LLM_GATEWAY_RESILIENCE_TIMEOUT_CHAT_TIMEOUT_MS=30000
DVARA_LLM_GATEWAY_RESILIENCE_TIMEOUT_STREAMING_TIMEOUT_MS=120000

When a timeout is reached, the request is cancelled and the gateway proceeds with failover (if available) or returns an error.

Streaming and the resilience layer

The retry, circuit-breaker, timeout, and fallback machinery wraps the connection-establishment phase of a streaming call only. Once the SSE stream starts emitting chunks, the gateway is in pass-through mode and an error mid-stream is forwarded to the client as-is. The boundary matters when you tune for streaming:

  • Retry covers the initial streaming POST. If the connection fails before the first chunk lands, retry kicks in. If a chunk lands and then the upstream closes the connection or sends an error event mid-stream, that error is forwarded to the SSE consumer — retry does not engage.
  • Circuit breaker counts the initial connection success or failure. A clean connection that later breaks mid-stream still counts as a success for the breaker (the upstream call returned an iterator; what happened after isn't tracked).
  • Fallback is only attempted if the initial connection fails. Once chunks are flowing, falling back would mean re-issuing the prompt to a different provider and concatenating two partial responses — semantically ambiguous, so the gateway doesn't try. The broken stream is surfaced to the client.
  • Timeout (streaming-timeout-ms, 120s default) bounds the connection-establishment phase. Once the first chunk has been sent, the timeout doesn't fire mid-stream; total session duration is whatever the upstream provider allows.

The practical tuning implication: size streaming-timeout-ms for the maximum reasonable connection-establishment latency, not for the total streaming session duration. If your upstream is slow to first-token (large prompts, heavy reasoning models), bump this; if you want to fail fast on a dead connection, tighten it.
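The boundary between the connection-establishment phase and pass-through mode can be illustrated with a Python generator. This is a simplified sketch, not the gateway's code: `stream_with_retry` and `open_stream` are hypothetical names, backoff and the circuit breaker are omitted, and `max_attempts` is assumed to be at least 1.

```python
def stream_with_retry(open_stream, max_attempts=3):
    """Yield chunks from open_stream(); retry only until the first chunk arrives."""
    last_error = None
    for _ in range(max_attempts):
        try:
            chunks = iter(open_stream())
            first = next(chunks)  # connection-establishment phase ends here
        except StopIteration:
            return  # upstream produced an empty stream
        except Exception as exc:
            last_error = exc  # failed before the first chunk: try again
            continue
        yield first
        yield from chunks  # pass-through: a mid-stream error propagates as-is
        return
    raise last_error
```

Everything inside the `try` block is covered by retry; everything after the first `yield` is not, which mirrors why a stream that breaks after the first chunk reaches the client unchanged.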

Automatic Failover

When a provider fails (error or circuit open), the gateway automatically tries fallback providers:

  1. Primary provider fails — error, timeout, or circuit breaker open
  2. Candidate resolution — the gateway picks alternative providers on the same route that support the requested model
  3. Capability filter — providers that can't satisfy the request's response_format (if specified) are dropped from the candidate list
  4. Retry with fallback — the request is sent to the next available capable candidate

Unhealthy candidates are skipped silently. Before dispatching to a fallback, the gateway checks that candidate's own circuit breaker via the shared provider-health registry. If the candidate's circuit is currently OPEN, the gateway skips past it and tries the next one — it doesn't probe a provider that's already known to be failing. So a route like [OpenAI, Anthropic, Gemini] where Anthropic's circuit is open from prior failures routes an OpenAI failure straight to Gemini, never bothering Anthropic. This avoids cascading retries where every fallback attempt is itself blocked. If every capable fallback is unhealthy, the original primary error is re-thrown to the client.

Failover with Capability Matching

When response_format is present, fallback candidates must support the required capability:

Primary: OpenAI (supports json_schema ✓) → FAILS
Fallback candidates: [Anthropic (json_schema ✓), Ollama (json_schema ✗)]
Filtered fallbacks: [Anthropic]
→ Retry with Anthropic

If no capable fallback exists, the gateway returns HTTP 503 with:

  • Error code: failover_capability_mismatch
  • Header: X-Gateway-Failover-Blocked: capability_mismatch
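The candidate-selection rules above (same route, same model, capability filter, skip open circuits) can be sketched as follows. This is an illustrative sketch only: `Provider`, `FailoverBlocked`, and `pick_fallback` are hypothetical names, and it assumes the capability mismatch is raised only when the response_format filter removes every candidate that otherwise supports the model.

```python
from dataclasses import dataclass, field

@dataclass
class Provider:
    name: str
    models: set
    response_formats: set = field(default_factory=set)
    circuit_open: bool = False

class FailoverBlocked(Exception):
    """Maps to HTTP 503 with error code failover_capability_mismatch."""

def pick_fallback(route, primary, model, response_format=None):
    """Return the next fallback provider, or None if every capable one is unhealthy."""
    supporting = [p for p in route if p.name != primary and model in p.models]
    capable = supporting
    if response_format is not None:
        capable = [p for p in supporting if response_format in p.response_formats]
        if supporting and not capable:
            raise FailoverBlocked("failover_capability_mismatch")
    for candidate in capable:
        if not candidate.circuit_open:  # skip providers whose circuit is OPEN
            return candidate
    return None  # caller re-throws the original primary error

route = [
    Provider("openai", {"example-model"}, {"json_schema"}),
    Provider("anthropic", {"example-model"}, {"json_schema"}, circuit_open=True),
    Provider("gemini", {"example-model"}, {"json_schema"}),
]
print(pick_fallback(route, "openai", "example-model").name)  # gemini
```

The model name `example-model` is a placeholder. The two failure outcomes are kept distinct on purpose: a capability mismatch is reported as its own error, while an all-unhealthy candidate list falls back to the original primary error.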

Disabling failover

Automatic failover is on by default. To keep the resilience wrappers (retry, circuit breaker, timeouts) but never attempt a fallback on failure, set:

dvara:
  llm-gateway:
    resilience:
      fallback:
        enabled: false

Or via environment variable:

DVARA_LLM_GATEWAY_RESILIENCE_FALLBACK_ENABLED=false

With fallback.enabled: false, a failed primary provider returns the error straight to the client — no alternative provider is tried.

Per-provider overrides

Every field on retry, circuit-breaker, and timeout can be overridden per provider under dvara.llm-gateway.resilience.providers.<name>. Anything you don't override inherits the top-level default. Use this to give a slow upstream a longer chat timeout, or a flaky upstream a more permissive circuit breaker, without changing the baseline for the rest of the pool:

dvara:
  llm-gateway:
    resilience:
      retry:
        max-attempts: 3
        initial-backoff-ms: 500
      circuit-breaker:
        failure-rate-threshold: 50
        sliding-window-size: 10

      providers:
        bedrock:
          timeout:
            chat-timeout-ms: 60000 # Bedrock can be slow on first call from a cold SigV4 signer
        ollama:
          retry:
            max-attempts: 1 # no point retrying a local Ollama; fail fast to the fallback
          circuit-breaker:
            failure-rate-threshold: 80 # Ollama flaps more in dev, don't trip the breaker as eagerly

The override keys are the same provider names the Providers page lists — openai, anthropic, gemini, bedrock, azure-openai, mistral, cohere, groq, qwen, deepseek, moonshot, chatglm, grok, ollama, mock. An unrecognized provider name is silently ignored at startup, so double-check spelling if your override appears not to apply.
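The inheritance rule (override what you set, inherit the rest) amounts to a field-level overlay of providers.<name> on the top-level sections. A minimal sketch, assuming the configuration has already been parsed into nested dictionaries; `effective_config` is a hypothetical helper, not part of the gateway:

```python
import copy

def effective_config(resilience, provider):
    """Overlay providers.<name> onto the top-level retry/circuit-breaker/timeout defaults."""
    sections = ("retry", "circuit-breaker", "timeout")
    merged = {s: copy.deepcopy(resilience.get(s, {})) for s in sections}
    overrides = resilience.get("providers", {}).get(provider, {})
    for section, fields in overrides.items():
        if section in merged:
            merged[section].update(fields)  # overridden fields win, the rest inherit
    return merged

resilience = {
    "retry": {"max-attempts": 3, "initial-backoff-ms": 500},
    "circuit-breaker": {"failure-rate-threshold": 50, "sliding-window-size": 10},
    "providers": {
        "bedrock": {"timeout": {"chat-timeout-ms": 60000}},
        "ollama": {
            "retry": {"max-attempts": 1},
            "circuit-breaker": {"failure-rate-threshold": 80},
        },
    },
}
print(effective_config(resilience, "ollama")["retry"])
# {'max-attempts': 1, 'initial-backoff-ms': 500}
```

A provider with no entry under providers simply gets a copy of the top-level values, which matches the behavior described for unrecognized or absent names.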

Per-provider overrides require a config file, not environment variables

Spring Boot's environment-variable binding can't hydrate a Map<String, ...> of nested objects, so DVARA_LLM_GATEWAY_RESILIENCE_PROVIDERS_OPENAI_TIMEOUT_CHAT_TIMEOUT_MS does not work — it parses but is silently dropped. Use application.yml for per-provider overrides, or pass the structured config via SPRING_APPLICATION_JSON:

SPRING_APPLICATION_JSON='{"dvara":{"llm-gateway":{"resilience":{"providers":{"openai":{"timeout":{"chatTimeoutMs":60000}}}}}}}'

The top-level keys (retry.*, circuit-breaker.*, timeout.*, fallback.enabled) all support direct env-var binding as shown in the per-section examples above.
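Hand-writing the nested SPRING_APPLICATION_JSON value is error-prone, so it can help to generate it. The following is a small sketch using only the Python standard library; `spring_application_json` is a hypothetical helper, and the key names are taken from the example above.

```python
import json
import shlex

def spring_application_json(provider_overrides):
    """Build a shell-safe SPRING_APPLICATION_JSON assignment for per-provider overrides."""
    payload = {"dvara": {"llm-gateway": {"resilience": {"providers": provider_overrides}}}}
    compact = json.dumps(payload, separators=(",", ":"))  # no whitespace in the value
    return "SPRING_APPLICATION_JSON=" + shlex.quote(compact)

print(spring_application_json({"openai": {"timeout": {"chatTimeoutMs": 60000}}}))
```

For the input shown, the printed line is the same single-quoted assignment as the SPRING_APPLICATION_JSON example above.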

Error Responses

When the resilience layer surfaces a failure to the client (an upstream provider error, an open circuit breaker, or a failover blocked because no capable fallback exists), the response shape and the full HTTP-status / error-code mapping are documented in one place, on the Error Handling page. That page also covers retry-after semantics, the X-Gateway-Failover-Blocked header, and example error bodies.

Full configuration example

Every resilience knob on the gateway, shown with its out-of-the-box default. The block is safe to paste into application.yml unchanged and then use as a starting point for tuning:

dvara:
  llm-gateway:
    resilience:
      enabled: true # master switch; set false only in tests
      retry:
        max-attempts: 3
        initial-backoff-ms: 500
        backoff-multiplier: 2.0
        max-backoff-ms: 10000
      circuit-breaker:
        failure-rate-threshold: 50
        sliding-window-size: 10
        minimum-number-of-calls: 5
        wait-duration-in-open-state-ms: 30000
        permitted-calls-in-half-open: 3
      timeout:
        chat-timeout-ms: 30000
        streaming-timeout-ms: 120000
      fallback:
        enabled: true
      providers: {} # per-provider overrides, see "Per-provider overrides" above