Resilience & Failover
Dvara wraps each provider with retry, circuit breaker, and timeout protection using Resilience4j (the mock provider is exempt from retry and circuit breaker wrapping). When a provider fails, the gateway automatically attempts fallback providers.
Retry with Exponential Backoff
Failed requests are automatically retried up to a configurable number of attempts:
gateway:
  resilience:
    retry:
      max-attempts: 3
      wait-duration-ms: 500
Retries use exponential backoff. Only transient errors (5xx, timeouts, connection failures) trigger retries — client errors (4xx) are returned immediately.
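The retry rule above can be sketched in a few lines of Python. This is an illustration only, not Dvara's code: the function names are hypothetical, the backoff multiplier of 2 is an assumption (the configuration above does not expose one), and the sketch assumes max-attempts counts the initial call, so three attempts mean two waits.

```python
def is_retryable(status_code=None, timed_out=False, connection_failed=False):
    """Return True for transient failures: 5xx, timeouts, connection failures."""
    if timed_out or connection_failed:
        return True
    # Client errors (4xx) fall through and are not retried.
    return status_code is not None and 500 <= status_code <= 599


def backoff_delays_ms(max_attempts=3, wait_duration_ms=500, multiplier=2.0):
    """Waits between attempts, in milliseconds.

    Assumptions: max_attempts includes the first call, and each wait is the
    previous one times `multiplier` (2.0 is a guess, not a documented value).
    """
    return [int(wait_duration_ms * multiplier ** i) for i in range(max_attempts - 1)]


print(backoff_delays_ms())  # [500, 1000]
```

Under these assumptions, the default configuration waits 500 ms before the second attempt and 1000 ms before the third.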
Circuit Breaker
Each provider has an independent circuit breaker that prevents cascading failures:
gateway:
  resilience:
    circuit-breaker:
      failure-rate-threshold: 50             # percentage
      sliding-window-size: 20                # number of calls
      wait-duration-in-open-state-ms: 30000  # 30 seconds
State Transitions
CLOSED ──(failure rate ≥ 50%)──▶ OPEN ──(30s wait)──▶ HALF_OPEN ──(test calls succeed)──▶ CLOSED
                                  ▲                       │
                                  └────(still failing)────┘
| State | Behavior |
|---|---|
| CLOSED | Normal operation. Tracks success/failure in a sliding window. |
| OPEN | All requests fail immediately with provider_circuit_open (503). No upstream calls are made. |
| HALF_OPEN | A limited number of test calls are allowed through. If successful, circuit closes. If not, circuit re-opens. |
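The three states can be modeled with a small count-based state machine. The sketch below is a simplified illustration, not Resilience4j's or Dvara's implementation. It makes several assumptions: the failure rate is evaluated only once the sliding window is full, any failed test call in HALF_OPEN re-opens the circuit, and the half_open_calls parameter is hypothetical (the configuration above does not list it). It also does not cap how many test calls are admitted at once.

```python
import time
from collections import deque


class CircuitBreaker:
    """Illustrative count-based breaker; one instance per provider."""

    def __init__(self, failure_rate_threshold=50, sliding_window_size=20,
                 wait_in_open_ms=30000, half_open_calls=3, clock=time.monotonic):
        self.threshold = failure_rate_threshold
        self.window = deque(maxlen=sliding_window_size)  # True means failure
        self.wait_s = wait_in_open_ms / 1000
        self.half_open_calls = half_open_calls  # hypothetical setting
        self.clock = clock
        self.state = "CLOSED"
        self.opened_at = 0.0
        self.trial_results = []

    def allow_request(self):
        """Return True if a call may go upstream in the current state."""
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.wait_s:
                return False  # fail fast with provider_circuit_open (503)
            self.state = "HALF_OPEN"
            self.trial_results = []
        return True

    def record(self, failed):
        """Record the outcome of one call and update the state."""
        if self.state == "HALF_OPEN":
            self.trial_results.append(failed)
            if failed:
                self._open()  # still failing: back to OPEN
            elif len(self.trial_results) >= self.half_open_calls:
                self.state = "CLOSED"
                self.window.clear()
            return
        self.window.append(failed)
        # Assumption: only evaluate the rate once the window is full.
        if len(self.window) == self.window.maxlen:
            rate = 100 * sum(self.window) / len(self.window)
            if rate >= self.threshold:
                self._open()

    def _open(self):
        self.state = "OPEN"
        self.opened_at = self.clock()
        self.window.clear()
```

A caller would check allow_request() before each upstream call and pass the outcome to record() afterwards. The injectable clock exists only to make the wait in the OPEN state easy to test.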
Mock Provider Exclusion
The mock provider is excluded from circuit breaker and retry wrapping. Simulated errors injected via error-rate are intended for testing client-side error handling and will never trip the circuit breaker or trigger retries. The mock provider is always considered healthy.
Timeout Configuration
Separate timeouts for non-streaming and streaming requests:
gateway:
  resilience:
    timeout:
      chat-timeout-ms: 30000        # 30 seconds for non-streaming
      streaming-timeout-ms: 120000  # 120 seconds for streaming
When a timeout is reached, the request is cancelled and the gateway proceeds with failover (if available) or returns an error.
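As a rough illustration of this behavior, the following Python sketch picks a timeout by request type and cancels the call when it expires. The helper names are hypothetical and the gateway itself does not use asyncio; the sketch only mirrors the described semantics, using the values from the configuration above.

```python
import asyncio

CHAT_TIMEOUT_MS = 30000        # chat-timeout-ms
STREAMING_TIMEOUT_MS = 120000  # streaming-timeout-ms


def timeout_seconds(streaming):
    """Pick the timeout that matches the request type."""
    return (STREAMING_TIMEOUT_MS if streaming else CHAT_TIMEOUT_MS) / 1000


async def call_with_timeout(call, timeout_s):
    """Await call() and return (result, timed_out).

    asyncio.wait_for cancels the underlying call when the timeout expires,
    which leaves the caller free to move on to a fallback provider.
    """
    try:
        return await asyncio.wait_for(call(), timeout=timeout_s), False
    except asyncio.TimeoutError:
        return None, True
```

A caller that receives timed_out=True would then either try the next fallback provider or return an error, as described above.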
Automatic Failover
When a provider fails (error or circuit open), the gateway automatically tries fallback providers:
- Primary provider fails: error, timeout, or circuit breaker open
- FallbackResolver identifies alternative providers that support the requested model
- Capability filter removes providers that don't support the request's response_format (if specified)
- Retry with fallback: the request is sent to the next available provider
Failover with Capability Matching
When response_format is present, fallback candidates must support the required capability:
Primary: OpenAI (supports json_schema ✓) → FAILS
Fallback candidates: [Anthropic (json_schema ✓), Ollama (json_schema ✗)]
Filtered fallbacks: [Anthropic]
→ Retry with Anthropic
If no capable fallback exists, the gateway returns HTTP 503 with:
- Error code: failover_capability_mismatch
- Header: X-Gateway-Failover-Blocked: capability_mismatch
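The selection and filtering steps can be sketched as follows. This is a minimal illustration rather than the actual FallbackResolver: the Provider dataclass, the function names, and the shape of the returned dictionary are all hypothetical. Only the error code, status, and header value come from the text above.

```python
from dataclasses import dataclass, field


@dataclass
class Provider:
    name: str
    models: set                                      # models this provider can serve
    capabilities: set = field(default_factory=set)   # e.g. {"json_schema"}


def resolve_fallbacks(primary, providers, model, response_format=None):
    """Return fallback providers for `model`, filtered by capability."""
    candidates = [p for p in providers if p.name != primary and model in p.models]
    if response_format is None:
        return candidates  # no capability requirement on this request
    return [p for p in candidates if response_format in p.capabilities]


def failover_blocked_response():
    """Hypothetical representation of the 503 sent when no fallback qualifies."""
    return {
        "status": 503,
        "error_code": "failover_capability_mismatch",
        "headers": {"X-Gateway-Failover-Blocked": "capability_mismatch"},
    }


providers = [
    Provider("openai", {"example-model"}, {"json_schema"}),
    Provider("anthropic", {"example-model"}, {"json_schema"}),
    Provider("ollama", {"example-model"}),
]
fallbacks = resolve_fallbacks("openai", providers, "example-model", "json_schema")
print([p.name for p in fallbacks])  # ['anthropic']
```

With the same providers and no response_format on the request, both Anthropic and Ollama would remain as candidates.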
Error Responses
| Scenario | HTTP | Error Code | Description |
|---|---|---|---|
| Provider returned error | 502 | provider_error | Upstream 4xx/5xx |
| Circuit breaker open | 503 | provider_circuit_open | Provider temporarily unavailable |
| Failover blocked | 503 | failover_capability_mismatch | No capable fallback available |
Full Configuration Example
gateway:
  resilience:
    retry:
      max-attempts: 3
      wait-duration-ms: 500
    circuit-breaker:
      failure-rate-threshold: 50
      sliding-window-size: 20
      wait-duration-in-open-state-ms: 30000
    timeout:
      chat-timeout-ms: 30000
      streaming-timeout-ms: 120000