Rate Limiting

DVARA enforces two rate limits on every /v1/* request: a request count cap and a token budget cap. Each API key gets its own 60-second sliding-window bucket; the cap values themselves are global — every key gets the same requests-per-minute allowance. State is shared in-process across the fleet, so a request served by any pod counts against the same window — no external infrastructure required.

Rate limiting is off by default. Enable it with one of these:

# application.yml
dvara:
  llm-gateway:
    rate-limit:
      enabled: true
      per-key:
        requests-per-minute: 100       # max requests per API key per 60-second sliding window
        tokens-per-minute: 100000      # max total tokens per API key per 60-second sliding window

# Or via environment variables — Spring Boot's relaxed binding picks these up automatically
DVARA_LLM_GATEWAY_RATE_LIMIT_ENABLED=true
DVARA_LLM_GATEWAY_RATE_LIMIT_PER_KEY_REQUESTS_PER_MINUTE=100
DVARA_LLM_GATEWAY_RATE_LIMIT_PER_KEY_TOKENS_PER_MINUTE=100000

For per-tenant differentiation (free-tier vs paid-tier customers), see Limitations — the global cap values are uniform today.

How it works

Every request on /v1/* passes through the rate limiter, which buckets the request by API key (see Bucket-key resolution below for the full table — the bucket isn't always the literal Authorization: Bearer token).

Two checks run per request:

Check	Window	Config property	What it counts
Per-key request count	60 seconds	`dvara.llm-gateway.rate-limit.per-key.requests-per-minute` (default `100`)	Number of requests from this bucket
Per-key token budget	60 seconds	`dvara.llm-gateway.rate-limit.per-key.tokens-per-minute` (default `100000`)	`estimated_tokens` at admission + `total_tokens` from the response at completion

Order of operations: the request-count check fires first; if it passes, the token-budget pre-charge runs next. The first check to trip produces the 429. When both would have failed, only the first one's headers appear on the response.
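
The ordering above can be sketched as a small decision function. This is an illustration only, not the gateway's code: the function name, its parameters, and the exact boundary semantics (whether a request that lands exactly on the cap is admitted) are assumptions made for the example.

```python
from typing import Optional


def first_tripped_check(
    requests_in_window: int,
    tokens_in_window: int,
    estimated_tokens: int,
    requests_per_minute: int = 100,
    tokens_per_minute: int = 100_000,
) -> Optional[str]:
    """Return which check would produce the 429, or None if the request is admitted."""
    # The request-count check runs first (boundary handling is an assumption).
    if requests_in_window + 1 > requests_per_minute:
        return "requests"
    # Only when the count check passes does the token-budget pre-charge run.
    if tokens_in_window + estimated_tokens > tokens_per_minute:
        return "tokens"
    return None
```

In this sketch a request that would fail both checks reports only `requests`, which mirrors the rule that only the first check's headers appear on the response.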

The token check estimates input tokens from the messages body before dispatch and pre-charges the budget, then records the provider's actual usage.total_tokens after the response. If the actual count exceeds the pre-charge estimate, the difference is still deducted from the same window — so a request that slipped through pre-check on a low estimate can push the next request in that window over the limit.
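
A minimal sketch of the pre-charge and reconcile cycle, assuming a simple timestamped log of charges per bucket. The class and method names are hypothetical, and two behaviors are assumptions of the sketch rather than documented gateway behavior: only the excess over the estimate is charged at completion, and nothing is refunded when the actual count comes in under the estimate.

```python
from collections import deque
from typing import Deque, Tuple


class TokenWindow:
    """Illustrative 60-second sliding window of token charges for one bucket."""

    def __init__(self, tokens_per_minute: int = 100_000, window_seconds: float = 60.0):
        self.limit = tokens_per_minute
        self.window = window_seconds
        self.charges: Deque[Tuple[float, int]] = deque()  # (timestamp, tokens)

    def used(self, now: float) -> int:
        # Drop charges that have slid out of the window, then sum the rest.
        while self.charges and self.charges[0][0] <= now - self.window:
            self.charges.popleft()
        return sum(tokens for _, tokens in self.charges)

    def pre_charge(self, estimated_tokens: int, now: float) -> bool:
        """Admit and charge the estimate, or refuse without charging."""
        if self.used(now) + estimated_tokens > self.limit:
            return False
        self.charges.append((now, estimated_tokens))
        return True

    def reconcile(self, estimated_tokens: int, actual_total_tokens: int, now: float) -> None:
        """After the response, charge any tokens beyond the estimate (no refund)."""
        extra = actual_total_tokens - estimated_tokens
        if extra > 0:
            self.charges.append((now, extra))
```

With a 1,000-token cap, a request admitted on a 200-token estimate that actually uses 900 tokens leaves 100 tokens in the window, so a following 200-token request is refused until the earlier charges age out.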

Large-body bypass. If the request body is over 10 MB, the gateway skips token estimation for that call (parsing very large request bodies is expensive in memory) and runs only the request-count check. The body is still forwarded normally to the upstream provider; the actual usage.total_tokens from the response is still recorded, so the budget isn't free-running — just unenforced at admission for that one call.
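
The bypass can be pictured as a size guard in front of the estimator. Everything in this sketch is hypothetical: the function name, reading "10 MB" as 10 MiB, the strict greater-than comparison, and the four-characters-per-token heuristic are all assumptions, not the gateway's actual estimator.

```python
import json
from typing import Optional

MAX_ESTIMATION_BODY_BYTES = 10 * 1024 * 1024  # assumption: "10 MB" read as 10 MiB


def estimate_tokens_or_skip(body: bytes) -> Optional[int]:
    """Return a rough input-token estimate, or None when the body is too large to parse."""
    if len(body) > MAX_ESTIMATION_BODY_BYTES:
        return None  # caller runs only the request-count check for this call
    messages = json.loads(body).get("messages", [])
    # Hypothetical heuristic: about four characters per token.
    chars = sum(len(str(m.get("content", ""))) for m in messages)
    return max(1, chars // 4)
```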

When either check trips, the request is rejected before the upstream provider call and the gateway returns HTTP 429 Too Many Requests with headers and a structured error body.

Rate limit exceeded response

HTTP/1.1 429 Too Many Requests
Retry-After: 60
X-RateLimit-Retry-After-Seconds: 60
X-Trace-Id: a6783439db1f46a6bfed511a0011e955

{
  "error": {
    "message": "Rate limit exceeded",
    "type": "rate_limit_error",
    "code": "rate_limit_exceeded",
    "trace_id": "a6783439db1f46a6bfed511a0011e955",
    "rate_limit": {
      "limited_resource": "requests",
      "limit_type": "requests_per_minute",
      "limit": 100,
      "remaining": 0,
      "retry_after_seconds": 60,
      "reset_at": "2026-04-14T10:00:00Z"
    }
  }
}

When the token budget is the one that trips, the gateway also emits these extra headers so a client can surface the exact cap without parsing the body:

X-RateLimit-Tokens-Limit: 100000
X-RateLimit-Tokens-Remaining: 2345
X-RateLimit-Reset: 2026-04-14T10:00:00Z

And error.rate_limit.limited_resource flips from requests to tokens with limit_type: tokens_per_minute. Both branches populate retry_after_seconds, reset_at, and remaining, so a client's backoff logic can work off a single field regardless of which cap hit.
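
Because both branches populate `retry_after_seconds`, a client can compute its backoff in one place. The helper below is a sketch of client-side handling, not part of any DVARA SDK; the function name, the fallback to the `Retry-After` header, and the one-second default are assumptions of the example.

```python
import json
from typing import Mapping, Optional


def retry_delay_seconds(status: int, headers: Mapping[str, str], body: str) -> Optional[float]:
    """Return how long to wait before retrying, or None if the response is not a 429."""
    if status != 429:
        return None
    try:
        rate_limit = json.loads(body)["error"]["rate_limit"]
        return float(rate_limit["retry_after_seconds"])
    except (ValueError, KeyError, TypeError):
        pass
    # Fall back to the Retry-After header if the body is missing or unparseable.
    lowered = {k.lower(): v for k, v in headers.items()}
    try:
        return float(lowered["retry-after"])
    except (KeyError, ValueError):
        return 1.0  # assumption: conservative default chosen by this sketch
```

A caller would sleep for the returned number of seconds before retrying, regardless of whether `limited_resource` was `requests` or `tokens`.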

Bucket-key resolution

The bucket key the rate limiter uses depends on the Authorization header and on dvara.llm-gateway.data-plane.require-api-key (env: DVARA_LLM_GATEWAY_REQUIRE_API_KEY, default false). The full matrix:

`Authorization` header state	`require-api-key=true` (production)	`require-api-key=false` (default)
Missing or malformed	HTTP 401 `api_key_required` — never reaches the rate limiter	Bucket key `anonymous` (shared across every unauth caller)
Bearer token unknown to the tenant API-key store	HTTP 401 `invalid_api_key`	Bucket key = the raw bearer token (each unique token gets its own bucket)
No tenant API-key store configured (e.g. dev mode without persistence)	bypass auth — bucket key = the raw bearer token	bypass auth — bucket key = the raw bearer token
Valid registered API key	Bucket key = the raw bearer token	Bucket key = the raw bearer token

Whenever the request carries a bearer token and reaches the rate limiter, the bucket key is the raw bearer-token string the caller sends in the Authorization header, even on the valid-key path. The gateway does resolve the token to a tenant and stash the API-key id for audit / billing, but the rate limiter does not key off the id. One consequence: rotating an API key starts a fresh 60-second window for that key. The rotated value is a new string, so it gets a brand-new allowance. If you depend on rate-limit continuity across rotation, plan for it.

The "tenant API-key store" in the third row is the persistence layer where tenant API keys live (PostgreSQL in production). It's separate from the rate-limit cache that holds the sliding-window counters. Dev / test deployments can run without persistence — that's when the third row applies.
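
The matrix can be restated as a small function. This is a sketch for reasoning about the table, not the gateway's implementation: the function name and signature are hypothetical, "malformed" is assumed to mean anything that is not `Bearer <token>`, and the handling of a missing header when no store is configured (treated here like the first row) is an assumption the table does not spell out.

```python
from typing import Optional, Set, Tuple


def resolve_bucket(
    authorization: Optional[str],
    require_api_key: bool,
    known_keys: Optional[Set[str]],
) -> Tuple[int, str]:
    """Return (200, bucket_key) when the request reaches the limiter, else (401, error_code).

    known_keys=None models a deployment with no tenant API-key store configured.
    """
    token = None
    if authorization and authorization.startswith("Bearer "):
        token = authorization[len("Bearer "):].strip() or None

    if token is None:  # missing or malformed header
        return (401, "api_key_required") if require_api_key else (200, "anonymous")
    if known_keys is not None and token not in known_keys and require_api_key:
        return 401, "invalid_api_key"
    # Valid key, unknown key in permissive mode, or no store: raw token is the bucket.
    return 200, token
```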

A few consequences worth understanding:

Production deployments should set require-api-key=true. That short-circuits the missing/invalid cases at 401 before they consume rate-limit budget. The anonymous bucket only exists in permissive mode.
The anonymous bucket is shared. Every unauth caller in a permissive deployment competes for the same 60-second window, so it's an aggregate cap, not a per-caller cap. Use it for unauthenticated smoke tests and internal health probes; gate real traffic with API keys.
Unknown tokens get their own buckets in permissive mode. A stranger sending random Bearer tokens won't be lumped into one shared bucket — each unique token gets its own 100-req/min allowance. If that's a concern, set require-api-key=true and let the gateway 401-reject unknown tokens.

Distributed state across pods

Rate limit state is shared in-process across every DVARA pod in the cluster. A single-pod deployment uses a local map; multi-pod deployments auto-cluster, so a request served by pod A and a request served by pod B count against the same sliding window without any extra configuration. Rolling restarts work without losing state — new pods join the existing cluster mid-life.

Pods discover each other based on the environment:

Local / Docker Compose — multicast discovery. No configuration required.
Kubernetes: headless-Service discovery. The gateway detects Kubernetes via the KUBERNETES_NAMESPACE environment variable (typically injected via the downward API). When that variable is set, you must also provide CACHE_SERVICE_NAME pointing at a headless Service that fronts the gateway pods. Startup fails fast if CACHE_SERVICE_NAME is missing. The error message starts with "Kubernetes clustering requires CACHE_SERVICE_NAME when KUBERNETES_NAMESPACE is set", so it's easy to spot in the logs.
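
The same rule can be mirrored in a pre-deploy check, for example in a CI step that validates rendered manifests. The function below is a hypothetical helper, not gateway code; it only reproduces the documented condition and the documented start of the error message.

```python
from typing import Mapping, Optional


def cluster_service_name(env: Mapping[str, str]) -> Optional[str]:
    """Return the headless Service name used for discovery, or None outside Kubernetes."""
    if not env.get("KUBERNETES_NAMESPACE"):
        return None  # local / Docker Compose: multicast discovery
    service = env.get("CACHE_SERVICE_NAME")
    if not service:
        raise RuntimeError(
            "Kubernetes clustering requires CACHE_SERVICE_NAME "
            "when KUBERNETES_NAMESPACE is set"
        )
    return service
```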

The Kubernetes Service should be headless (clusterIP: None) with publishNotReadyAddresses: true so rolling-restart pods can still discover each other while their readiness probes catch up:

env:
  - name: KUBERNETES_NAMESPACE
    valueFrom:
      fieldRef:
        fieldPath: metadata.namespace
  - name: CACHE_SERVICE_NAME
    value: dvara-server-cluster

The official Helm chart wires the Service, the env vars, and the RBAC permissions automatically — see Kubernetes Deployment for the full YAML.

When rate-limiting is disabled

When dvara.llm-gateway.rate-limit.enabled is false (the default), no limits are enforced. You can keep the filter installed in both dev and prod and toggle enforcement at deploy time; the runtime cost when disabled is negligible.

Configuration is static at startup

The two limits (per-key.requests-per-minute and per-key.tokens-per-minute) are read once at startup. Changing the values requires a gateway restart — there is no live-reload, and the values aren't tunable from the Console or via the Admin API. If you need to raise or lower limits on a running fleet, roll the deployment.

Limitations

A few cases worth knowing about before you tune the limiter:

No per-tenant differentiation today. Each API key gets its own bucket — keys don't share counters — but the cap values are platform-wide: every key on every tenant is governed by the same requests-per-minute and tokens-per-minute numbers. A free-tier tenant's key and a paid-tier tenant's key each get their own 100/min allowance, and you can't raise one without raising the other. If you need tenant-aware enforcement today, budget caps cover the dollar-spend dimension and can be set per-tenant or per-API-key.
Static configuration. Limit values are read once at startup; changing them requires a deploy roll. Not editable from the Console or the Admin API today.
Token pre-charge can over-spend. A request with a low input estimate that consumes far more output tokens than expected is still deducted from the same window; the over-spend pushes the next request closer to its cap. The window is sliding, not bursty — it self-corrects within 60 seconds, but a single big-output call won't be retroactively refused.
Large bodies skip the token-budget pre-check. Bodies over 10 MB skip token estimation (large-body parsing is expensive in memory); the request-count check still applies, and the upstream-reported usage.total_tokens is still recorded after the response.
The anonymous bucket is shared. Every caller with a missing or malformed Authorization header in permissive mode shares one 60-second window. See Bucket-key resolution for the full table.

Next steps

Cost Management & Budgets — per-tenant, per-API-key dollar budgets with soft and hard thresholds, auto-downgrade, and chargeback reports. This is where to go for per-tenant differentiation — the rate limiter's two global knobs are deliberately coarse.
Resilience & Failover — retry, circuit breaker, and timeout behavior for upstream provider calls. The rate limiter runs before the resilience layer, so a rate-limited request never reaches a provider and never counts against a circuit breaker.
Observability — the gateway_rate_limit_allowed_total / gateway_rate_limit_denied_total / gateway_rate_limit_errors_total Prometheus counters all carry a reason label valued request or token so you can graph denials by which check tripped.
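
For a quick look without a dashboard, the denial counter can be summed by reason from scraped metrics text. This sketch assumes the counters are exposed in the standard Prometheus text exposition format and that you already have the scrape output as a string; the function name is hypothetical, and any labels other than reason are simply aggregated over.

```python
import re
from collections import defaultdict
from typing import Dict

# Matches one sample line of the denied counter: name{labels} value
_SAMPLE = re.compile(r'^gateway_rate_limit_denied_total\{([^}]*)\}\s+(\S+)')
_REASON = re.compile(r'(?:^|,)\s*reason="([^"]*)"')


def denials_by_reason(metrics_text: str) -> Dict[str, float]:
    """Sum gateway_rate_limit_denied_total samples per value of the reason label."""
    totals: Dict[str, float] = defaultdict(float)
    for line in metrics_text.splitlines():
        match = _SAMPLE.match(line)
        if not match:
            continue  # comments, other metrics, or samples without labels
        reason = _REASON.search(match.group(1))
        if reason:
            totals[reason.group(1)] += float(match.group(2))
    return dict(totals)
```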
