Rate Limiting
DVARA enforces two rate limits on every /v1/* request: a request count cap and a token budget cap. Each API key gets its own 60-second sliding-window bucket; the cap values themselves are global — every key gets the same requests-per-minute allowance. State is shared in-process across the fleet, so a request served by any pod counts against the same window — no external infrastructure required.
Rate limiting is off by default. Enable it with one of these:
# application.yml
dvara:
  llm-gateway:
    rate-limit:
      enabled: true
      per-key:
        requests-per-minute: 100   # max requests per API key per 60-second sliding window
        tokens-per-minute: 100000  # max total tokens per API key per 60-second sliding window

# Or via environment variables; Spring Boot's relaxed binding picks these up automatically
DVARA_LLM_GATEWAY_RATE_LIMIT_ENABLED=true
DVARA_LLM_GATEWAY_RATE_LIMIT_PER_KEY_REQUESTS_PER_MINUTE=100
DVARA_LLM_GATEWAY_RATE_LIMIT_PER_KEY_TOKENS_PER_MINUTE=100000
For per-tenant differentiation (free-tier vs paid-tier customers), see Limitations — the global cap values are uniform today.
How it works
Every request on /v1/* passes through the rate limiter, which buckets the request by API key (see Bucket-key resolution below for the full table — the bucket isn't always the literal Authorization: Bearer token).
Two checks run per request:
| Check | Window | Config property | What it counts |
|---|---|---|---|
| Per-key request count | 60 seconds | dvara.llm-gateway.rate-limit.per-key.requests-per-minute (default 100) | Number of requests from this bucket |
| Per-key token budget | 60 seconds | dvara.llm-gateway.rate-limit.per-key.tokens-per-minute (default 100000) | estimated_tokens at admission + total_tokens from the response at completion |
Order of operations: the request-count check fires first; if it passes, the token-budget pre-charge runs next. The first check to trip produces the 429. When both would have failed, only the first one's headers appear on the response.
The token check estimates input tokens from the messages body before dispatch and pre-charges the budget, then records the provider's actual usage.total_tokens after the response. If the actual count exceeds the pre-charge estimate, the difference is still deducted from the same window — so a request that slipped through pre-check on a low estimate can push the next request in that window over the limit.
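The two checks and the pre-charge/settle cycle can be sketched as follows. This is a minimal in-memory illustration, not the gateway's implementation: the names `SlidingWindow`, `TwoCheckLimiter`, `admit`, and `settle` are hypothetical, and three behaviors are assumptions the text does not settle: a request rejected by the token check is not counted as a request, only the overage beyond the estimate is charged at completion, and an over-estimate is not refunded.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class SlidingWindow:
    """Timestamped charges for one bucket over a rolling window."""
    window_seconds: float = 60.0
    events: deque = field(default_factory=deque)  # (timestamp, amount) pairs

    def total(self, now: float) -> int:
        # Evict charges that have aged out of the window, then sum the rest.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            self.events.popleft()
        return sum(amount for _, amount in self.events)

    def add(self, now: float, amount: int) -> None:
        self.events.append((now, amount))


class TwoCheckLimiter:
    """Hypothetical per-bucket limiter: request count first, token budget second."""

    def __init__(self, requests_per_minute: int = 100, tokens_per_minute: int = 100_000):
        self.rpm = requests_per_minute
        self.tpm = tokens_per_minute
        self.requests = {}  # bucket key -> SlidingWindow of request counts
        self.tokens = {}    # bucket key -> SlidingWindow of token charges

    def admit(self, bucket: str, estimated_tokens: int, now: float):
        """Return None when admitted, else the name of the check that tripped."""
        req = self.requests.setdefault(bucket, SlidingWindow())
        tok = self.tokens.setdefault(bucket, SlidingWindow())
        # Check 1: the request-count check fires first.
        if req.total(now) + 1 > self.rpm:
            return "requests"
        # Check 2: pre-charge the token budget with the input estimate.
        if tok.total(now) + estimated_tokens > self.tpm:
            return "tokens"
        req.add(now, 1)
        tok.add(now, estimated_tokens)
        return None

    def settle(self, bucket: str, estimated_tokens: int, actual_total_tokens: int, now: float) -> None:
        """After the response, charge whatever the estimate did not cover."""
        overage = actual_total_tokens - estimated_tokens
        if overage > 0:  # assumption: over-estimates are not refunded
            self.tokens.setdefault(bucket, SlidingWindow()).add(now, overage)
```

In this sketch, a request admitted on an estimate of 400 tokens that actually uses 900 leaves 900 tokens charged to the window, so the next request is measured against that higher total until the charges age out.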
Large-body bypass. If the request body is over 10 MB, the gateway skips token estimation for that call (parsing very large request bodies is expensive in memory) and runs only the request-count check. The body is still forwarded normally to the upstream provider; the actual usage.total_tokens from the response is still recorded, so the budget isn't free-running — just unenforced at admission for that one call.
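The bypass amounts to a size gate in front of the token estimate. A minimal sketch, assuming "10 MB" means 10 × 1024 × 1024 bytes (the exact byte threshold is not stated here) and using a hypothetical helper name:

```python
# Assumption: "10 MB" is read as 10 MiB; the gateway's exact threshold is not documented here.
MAX_ESTIMATION_BODY_BYTES = 10 * 1024 * 1024


def admission_checks(body_size_bytes: int) -> list[str]:
    """Return the checks that run at admission for a request body of this size."""
    checks = ["requests"]  # the request-count check always runs
    if body_size_bytes <= MAX_ESTIMATION_BODY_BYTES:
        checks.append("tokens")  # token estimation is skipped only for oversized bodies
    return checks
```

Either way, the provider-reported usage is still recorded after the response, so only the admission-time token check is affected.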
When either check trips, the request is rejected before the upstream provider call and the gateway returns HTTP 429 Too Many Requests with headers and a structured error body.
Rate limit exceeded response
HTTP/1.1 429 Too Many Requests
Retry-After: 60
X-RateLimit-Retry-After-Seconds: 60
X-Trace-Id: a6783439db1f46a6bfed511a0011e955

{
  "error": {
    "message": "Rate limit exceeded",
    "type": "rate_limit_error",
    "code": "rate_limit_exceeded",
    "trace_id": "a6783439db1f46a6bfed511a0011e955",
    "rate_limit": {
      "limited_resource": "requests",
      "limit_type": "requests_per_minute",
      "limit": 100,
      "remaining": 0,
      "retry_after_seconds": 60,
      "reset_at": "2026-04-14T10:00:00Z"
    }
  }
}
When the token budget is the one that trips, the gateway also emits extra headers so a client can surface the exact cap without parsing the body:
X-RateLimit-Tokens-Limit: 100000
X-RateLimit-Tokens-Remaining: 2345
X-RateLimit-Reset: 2026-04-14T10:00:00Z
And error.rate_limit.limited_resource flips from requests to tokens with limit_type: tokens_per_minute. Both branches populate retry_after_seconds, reset_at, and remaining, so a client's backoff logic can work off a single field regardless of which cap hit.
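A client can read the structured body the same way for both branches. The sketch below parses the `error.rate_limit` object shown above into a small record; `RateLimitInfo`, `parse_rate_limit`, and `seconds_to_wait` are hypothetical names, and the code assumes the body always carries the six fields from the example.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class RateLimitInfo:
    limited_resource: str   # "requests" or "tokens"
    limit_type: str         # "requests_per_minute" or "tokens_per_minute"
    limit: int
    remaining: int
    retry_after_seconds: int
    reset_at: datetime


def parse_rate_limit(body: str) -> RateLimitInfo:
    """Extract the rate_limit object from a 429 response body."""
    rl = json.loads(body)["error"]["rate_limit"]
    # Normalize the trailing "Z" so fromisoformat accepts it on older Python versions.
    reset_at = datetime.fromisoformat(rl["reset_at"].replace("Z", "+00:00"))
    return RateLimitInfo(
        limited_resource=rl["limited_resource"],
        limit_type=rl["limit_type"],
        limit=int(rl["limit"]),
        remaining=int(rl["remaining"]),
        retry_after_seconds=int(rl["retry_after_seconds"]),
        reset_at=reset_at,
    )


def seconds_to_wait(info: RateLimitInfo) -> int:
    """Backoff driven by one field, regardless of which cap was hit."""
    return max(0, info.retry_after_seconds)
```

A caller would typically sleep for `seconds_to_wait(info)` before retrying, and log `info.limited_resource` to see which cap is being hit.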
Bucket-key resolution
The bucket key the rate limiter uses depends on the Authorization header and on dvara.llm-gateway.data-plane.require-api-key (env: DVARA_LLM_GATEWAY_REQUIRE_API_KEY, default false). The full matrix:
| Authorization header state | require-api-key=true (production) | require-api-key=false (default) |
|---|---|---|
| Missing or malformed | HTTP 401 api_key_required — never reaches the rate limiter | Bucket key anonymous (shared across every unauth caller) |
| Bearer token unknown to the tenant API-key store | HTTP 401 invalid_api_key | Bucket key = the raw bearer token (each unique token gets its own bucket) |
| No tenant API-key store configured (e.g. dev mode without persistence) | bypass auth — bucket key = the raw bearer token | bypass auth — bucket key = the raw bearer token |
| Valid registered API key | Bucket key = the raw bearer token | Bucket key = the raw bearer token |
Whenever the caller sends a bearer token, the bucket key is that raw token string from the Authorization header, even on the valid-key path. The gateway does resolve the token to a tenant and stash the API-key id for audit / billing, but the rate limiter does not key off the id. One consequence: rotating an API key starts a fresh 60-second window for that key. The rotated value is a new string, so it gets a brand-new allowance. If you depend on rate-limit continuity across rotation, plan for it.
The "tenant API-key store" in the third row is the persistence layer where tenant API keys live (PostgreSQL in production). It's separate from the rate-limit cache that holds the sliding-window counters. Dev / test deployments can run without persistence — that's when the third row applies.
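The matrix above can be read as a small decision function. The sketch below is an illustration only: `resolve_bucket_key`, `parse_bearer`, and `Unauthorized` are hypothetical names, `known_keys=None` stands in for "no tenant API-key store configured", and both the precedence between rows (the missing-header case is checked first) and the definition of "malformed" are assumptions.

```python
from typing import Optional, Set

ANONYMOUS_BUCKET = "anonymous"


class Unauthorized(Exception):
    """Stands in for an HTTP 401 carrying an error code."""

    def __init__(self, code: str):
        super().__init__(code)
        self.code = code


def parse_bearer(authorization: Optional[str]) -> Optional[str]:
    """Return the raw bearer token, or None when the header is missing or malformed."""
    if not authorization:
        return None
    scheme, _, token = authorization.partition(" ")
    token = token.strip()
    # Assumption: anything that is not "Bearer <non-empty token>" counts as malformed.
    if scheme.lower() != "bearer" or not token:
        return None
    return token


def resolve_bucket_key(
    authorization: Optional[str],
    require_api_key: bool,
    known_keys: Optional[Set[str]],
) -> str:
    """Map a request to its rate-limit bucket key, or raise Unauthorized."""
    token = parse_bearer(authorization)
    if token is None:
        if require_api_key:
            raise Unauthorized("api_key_required")
        return ANONYMOUS_BUCKET      # shared by every unauthenticated caller
    if known_keys is None:
        return token                 # no key store: auth is bypassed
    if token not in known_keys and require_api_key:
        raise Unauthorized("invalid_api_key")
    return token                     # valid key, or unknown token in permissive mode
```

Note that every non-401 path with a token returns the raw token itself, which is why a rotated key lands in a new bucket.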
A few consequences worth understanding:
- Production deployments should set require-api-key=true. That short-circuits the missing/invalid cases at 401 before they consume rate-limit budget. The anonymous bucket only exists in permissive mode.
- The anonymous bucket is shared. Every unauthenticated caller in a permissive deployment competes for the same 60-second window, so it's an aggregate cap, not a per-caller cap. Use it for unauthenticated smoke tests and internal health probes; gate real traffic with API keys.
- Unknown tokens get their own buckets in permissive mode. A stranger sending random Bearer tokens won't be lumped into one shared bucket; each unique token gets its own 100-req/min allowance. If that's a concern, set require-api-key=true and let the gateway 401-reject unknown tokens.
Distributed state across pods
Rate limit state is shared in-process across every DVARA pod in the cluster. A single-pod deployment uses a local map; multi-pod deployments auto-cluster, so a request served by pod A and a request served by pod B count against the same sliding window without any extra configuration. Rolling restarts work without losing state — new pods join the existing cluster mid-life.
Pods discover each other based on the environment:
- Local / Docker Compose: multicast discovery. No configuration required.
- Kubernetes: headless-Service discovery. The gateway detects Kubernetes via the KUBERNETES_NAMESPACE environment variable (typically injected via the downward API). When that variable is set, you must also provide CACHE_SERVICE_NAME pointing at a headless Service that fronts the gateway pods. Startup fails fast if CACHE_SERVICE_NAME is missing; the error message starts with "Kubernetes clustering requires CACHE_SERVICE_NAME when KUBERNETES_NAMESPACE is set", so it's easy to spot in the logs.
The Kubernetes Service should be headless (clusterIP: None) with publishNotReadyAddresses: true so rolling-restart pods can still discover each other while their readiness probes catch up. The gateway container's env then supplies both variables:
env:
  - name: KUBERNETES_NAMESPACE
    valueFrom:
      fieldRef:
        fieldPath: metadata.namespace
  - name: CACHE_SERVICE_NAME
    value: dvara-server-cluster
The official Helm chart wires the Service, the env vars, and the RBAC permissions automatically — see Kubernetes Deployment for the full YAML.
When rate-limiting is disabled
When dvara.llm-gateway.rate-limit.enabled is false (the default), no limits are enforced. You can keep the filter installed between dev and prod and toggle enforcement at deploy time — the runtime cost when disabled is negligible.
Configuration is static at startup
The two limits (per-key.requests-per-minute and per-key.tokens-per-minute) are read once at startup. Changing the values requires a gateway restart — there is no live-reload, and the values aren't tunable from the Console or via the Admin API. If you need to raise or lower limits on a running fleet, roll the deployment.
Limitations
A few cases worth knowing about before you tune the limiter:
- No per-tenant differentiation today. Each API key gets its own bucket (keys don't share counters), but the cap values are platform-wide: every key on every tenant is governed by the same requests-per-minute and tokens-per-minute numbers. A free-tier tenant's key and a paid-tier tenant's key each get their own 100/min allowance, and you can't raise one without raising the other. If you need tenant-aware enforcement today, budget caps cover the dollar-spend dimension and can be set per-tenant or per-API-key.
- Static configuration. Limit values are read once at startup; changing them requires a deploy roll. Not editable from the Console or the Admin API today.
- Token pre-charge can over-spend. A request admitted on a low input estimate can consume far more output tokens than expected; the overage is still deducted from the same window and pushes the next request closer to its cap. The window is sliding, not bursty, so it self-corrects within 60 seconds, but a single big-output call won't be retroactively refused.
- Large bodies skip the token-budget pre-check. Bodies over 10 MB skip token estimation (large-body parsing is expensive in memory); the request-count check still applies, and the upstream-reported usage.total_tokens is still recorded after the response.
- The anonymous bucket is shared. Every caller with a missing or malformed Authorization header in permissive mode shares one 60-second window. See Bucket-key resolution for the full table.
Next steps
- Cost Management & Budgets: per-tenant, per-API-key dollar budgets with soft and hard thresholds, auto-downgrade, and chargeback reports. This is where to go for per-tenant differentiation; the rate limiter's two global knobs are deliberately coarse.
- Resilience & Failover: retry, circuit breaker, and timeout behavior for upstream provider calls. The rate limiter runs before the resilience layer, so a rate-limited request never reaches a provider and never counts against a circuit breaker.
- Observability: the gateway_rate_limit_allowed_total / gateway_rate_limit_denied_total / gateway_rate_limit_errors_total Prometheus counters all carry a reason label valued request or token, so you can graph denials by which check tripped.