
Resilience & Failover

Dvara wraps every provider with retry, circuit breaker, and timeout protection using Resilience4j. When a provider fails, the gateway automatically attempts fallback providers.

Retry with Exponential Backoff

Failed requests are automatically retried up to a configurable number of attempts:

gateway:
  resilience:
    retry:
      max-attempts: 3
      wait-duration-ms: 500

Retries use exponential backoff. Only transient errors (5xx, timeouts, connection failures) trigger retries — client errors (4xx) are returned immediately.
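
The retry behaviour described above can be sketched as follows. This is a minimal illustration, not the gateway's Resilience4j-based code: `TransientError`, `backoff_delays`, and `call_with_retry` are hypothetical names, the backoff multiplier of 2 is an assumption (the page does not state one), and `max-attempts` is assumed to count the initial call, as Resilience4j's `maxAttempts` does.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for retryable failures (5xx, timeout, connection error)."""


def backoff_delays(max_attempts=3, wait_duration_ms=500, multiplier=2.0):
    """Waits in ms between attempts. The multiplier of 2 is an assumption."""
    return [wait_duration_ms * multiplier ** i for i in range(max_attempts - 1)]


def call_with_retry(call, max_attempts=3, wait_duration_ms=500, sleep=time.sleep):
    """Invoke call() up to max_attempts times, retrying only TransientError.

    Any other exception (for example a 4xx client error) propagates at once.
    """
    delays = backoff_delays(max_attempts, wait_duration_ms)
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # attempts exhausted; surface the last error
            sleep(delays[attempt] / 1000.0)
```

Under these assumptions the default configuration waits 500 ms before the second attempt and 1000 ms before the third.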

Circuit Breaker

Each provider has an independent circuit breaker that prevents cascading failures:

gateway:
  resilience:
    circuit-breaker:
      failure-rate-threshold: 50              # percentage
      sliding-window-size: 20                 # number of calls
      wait-duration-in-open-state-ms: 30000   # 30 seconds

State Transitions

CLOSED ──(failure rate ≥ 50%)──▶ OPEN ──(30s wait)──▶ HALF_OPEN ──(test calls)──▶ CLOSED
                                  ▲                       │
                                  └────(still failing)────┘

| State     | Behavior |
|-----------|----------|
| CLOSED    | Normal operation. Tracks success/failure in a sliding window. |
| OPEN      | All requests fail immediately with PROVIDER_CIRCUIT_OPEN (503). No upstream calls. |
| HALF_OPEN | A limited number of test calls are allowed through. If successful, circuit closes. If not, circuit re-opens. |
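
The state machine can be illustrated with a small count-based breaker. This is a sketch under stated assumptions, not the Resilience4j implementation the gateway uses: the class and its `test_calls` parameter (the number of HALF_OPEN probes, set to 3 here) are hypothetical, the failure rate is only evaluated once the window is full, and a single failed probe re-opens the circuit.

```python
import time
from collections import deque


class CircuitBreaker:
    """Count-based breaker sketch; one instance per provider."""

    def __init__(self, failure_rate_threshold=50, sliding_window_size=20,
                 wait_duration_in_open_state_ms=30000, test_calls=3,
                 now_ms=lambda: time.monotonic() * 1000):
        self.threshold = failure_rate_threshold
        self.window = deque(maxlen=sliding_window_size)  # True means failure
        self.wait_ms = wait_duration_in_open_state_ms
        self.test_calls = test_calls  # hypothetical HALF_OPEN probe count
        self.now_ms = now_ms
        self.state = "CLOSED"
        self.opened_at = 0.0
        self.permits = 0
        self.successes = 0

    def allow_request(self):
        """Return False when the call must fail fast without going upstream."""
        if self.state == "OPEN":
            if self.now_ms() - self.opened_at < self.wait_ms:
                return False
            self.state = "HALF_OPEN"  # wait elapsed: start probing
            self.permits = self.test_calls
            self.successes = 0
        if self.state == "HALF_OPEN":
            if self.permits == 0:
                return False  # probe budget already handed out
            self.permits -= 1
        return True

    def record(self, failed):
        """Record one call outcome and apply the state transitions."""
        if self.state == "HALF_OPEN":
            if failed:
                self._open()  # still failing: back to OPEN
            else:
                self.successes += 1
                if self.successes >= self.test_calls:
                    self.state = "CLOSED"
                    self.window.clear()
        elif self.state == "CLOSED":
            self.window.append(failed)
            if len(self.window) == self.window.maxlen:
                rate = 100.0 * sum(self.window) / len(self.window)
                if rate >= self.threshold:
                    self._open()

    def _open(self):
        self.state = "OPEN"
        self.opened_at = self.now_ms()
        self.window.clear()
```

A caller would check `allow_request()` before each upstream call and report the result through `record(failed=...)`.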

Mock Provider Exclusion

The mock provider is excluded from circuit breaker and retry wrapping. Simulated errors injected via error-rate are intended for testing client-side error handling and will never trip the circuit breaker or trigger retries. The mock provider is always considered healthy.

Timeout Configuration

The gateway applies separate timeouts to non-streaming and streaming requests:

gateway:
  resilience:
    timeout:
      chat-timeout-ms: 30000        # 30 seconds for non-streaming
      streaming-timeout-ms: 120000  # 120 seconds for streaming

When a timeout is reached, the request is cancelled and the gateway proceeds with failover (if available) or returns an error.
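
As a rough illustration of how the two settings might be applied, the sketch below picks a timeout from the request and stops waiting once it elapses. The `stream` request field and both function names are assumptions for this example. Python cannot forcibly stop a running thread, so unlike the gateway's cancellation this sketch only abandons the call.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Values from the configuration above.
TIMEOUTS_MS = {"chat": 30000, "streaming": 120000}


def timeout_ms_for(request):
    """Pick the timeout; a truthy 'stream' field (assumed name) means streaming."""
    return TIMEOUTS_MS["streaming"] if request.get("stream") else TIMEOUTS_MS["chat"]


def call_with_timeout(call, timeout_ms):
    """Run call() on a worker thread and give up after timeout_ms."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(call)
    try:
        return future.result(timeout=timeout_ms / 1000.0)
    except FutureTimeout:
        future.cancel()  # best effort; a call that is already running keeps going
        raise TimeoutError(f"no response within {timeout_ms} ms") from None
    finally:
        executor.shutdown(wait=False)  # do not block on the abandoned worker
```

The caller would treat the raised `TimeoutError` like any other provider failure and move on to failover.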

Automatic Failover

When a provider fails (error, timeout, or open circuit), the gateway automatically tries fallback providers:

  1. Primary provider fails — error, timeout, or circuit breaker open
  2. FallbackResolver identifies alternative providers that support the requested model
  3. Capability filter removes providers that don't support the request's response_format (if specified)
  4. Retry with fallback — the request is sent to the next available provider

Failover with Capability Matching

When response_format is present, fallback candidates must support the required capability:

Primary: OpenAI (supports json_schema ✓) → FAILS
Fallback candidates: [Anthropic (json_schema ✓), Ollama (json_schema ✗)]
Filtered fallbacks: [Anthropic]
→ Retry with Anthropic

If no capable fallback exists, the gateway returns HTTP 503 with:

  • Error code: failover_capability_mismatch
  • Header: X-Gateway-Failover-Blocked: capability_mismatch
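
The failover steps and the capability filter can be sketched together. Everything named here is hypothetical and is not the real `FallbackResolver` API: providers are modelled as dicts with `models` and `capabilities` sets, `response_format` is assumed to carry a `type` field such as `json_schema`, and `send` stands in for the upstream call.

```python
class ProviderError(Exception):
    """Hypothetical: upstream error, timeout, or open circuit."""


class FailoverBlocked(Exception):
    """Mirrors the 503 response described above."""
    status = 503
    code = "failover_capability_mismatch"
    headers = {"X-Gateway-Failover-Blocked": "capability_mismatch"}


def capable_fallbacks(candidates, request):
    """Keep providers that support the model and, if set, the response_format."""
    fmt = request.get("response_format")
    needed = fmt["type"] if fmt else None
    return [p for p in candidates
            if request["model"] in p["models"]
            and (needed is None or needed in p["capabilities"])]


def route(request, primary, candidates, send):
    """Try the primary provider, then each capable fallback in order."""
    try:
        return send(primary, request)
    except ProviderError as exc:
        last_error = exc
    fallbacks = capable_fallbacks(candidates, request)
    if not fallbacks and request.get("response_format"):
        raise FailoverBlocked()  # nothing left that can honour response_format
    for provider in fallbacks:
        try:
            return send(provider, request)
        except ProviderError as exc:
            last_error = exc
    raise last_error  # every provider failed
```

With the example above, a failing OpenAI primary and candidates Anthropic and Ollama, the filter leaves only Anthropic, so `route` sends the request there.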

Error Responses

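| Scenario                | HTTP | Error Code                   | Description                      |
|-------------------------|------|------------------------------|----------------------------------|
| Provider returned error | 502  | provider_error               | Upstream 4xx/5xx                 |
| Circuit breaker open    | 503  | provider_circuit_open        | Provider temporarily unavailable |
| Failover blocked        | 503  | failover_capability_mismatch | No capable fallback available    |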

Full Configuration Example

gateway:
  resilience:
    retry:
      max-attempts: 3
      wait-duration-ms: 500
    circuit-breaker:
      failure-rate-threshold: 50
      sliding-window-size: 20
      wait-duration-in-open-state-ms: 30000
    timeout:
      chat-timeout-ms: 30000
      streaming-timeout-ms: 120000