Response Caching
DVARA is the AI governance platform for LLM and MCP traffic. Response caching is one of the cost-control mechanisms inside that platform: it short-circuits identical or near-identical prompts before they reach an upstream provider, so repeat traffic doesn't pay the same per-token cost twice. DVARA requires a valid license key at startup — every feature on this page, including caching, is part of the licensed platform.
Backends
| Backend | Use case | Activation |
|---|---|---|
| Exact-match (default) | Repeat traffic with identical prompts | dvara.llm-gateway.cache.enabled=true |
| Semantic | Repeat traffic with paraphrased or rearranged prompts that should still hit | dvara.llm-gateway.cache.enabled=true + dvara.llm-gateway.cache.semantic.enabled=true |
Only one backend is active at a time. When semantic caching is enabled it replaces the exact-match backend; the gateway either hashes the request for exact-match lookup or embeds it for similarity lookup, never both.
Both backends are shared across pods. Cached entries live in the same embedded distributed cache that carries rate-limit counters and API-key lookups, so a response cached by pod A is visible to pod B on the next call. No external cache infrastructure is required — pods auto-cluster across the fleet (multicast in dev, headless-Service discovery in Kubernetes; see Rate Limiting for the discovery setup).
Enabling Caching
Set dvara.llm-gateway.cache.enabled=true in application.yml:
dvara:
  llm-gateway:
    cache:
      enabled: true
      ttl-seconds: 3600   # cache entry time-to-live (default: 3600)
      max-size: 10000     # per-node max entries before LRU eviction (default: 10000)
Or via environment variables — Spring Boot's relaxed binding picks these up automatically:
DVARA_LLM_GATEWAY_CACHE_ENABLED=true
DVARA_LLM_GATEWAY_CACHE_TTL_SECONDS=3600
DVARA_LLM_GATEWAY_CACHE_MAX_SIZE=10000
When caching is disabled (default), a no-op cache is injected. All requests pass through to providers with zero overhead.
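The two tuning settings above interact: an entry is dropped once it is older than ttl-seconds, and the least-recently-used entry is evicted once a node holds more than max-size entries. The following Python sketch illustrates that behaviour with an in-process OrderedDict. It is not DVARA's distributed map; the class name TtlLruCache and the injectable clock are hypothetical and exist only for illustration.

```python
import time
from collections import OrderedDict


class TtlLruCache:
    """Illustrative only: TTL expiry plus LRU eviction (not DVARA's implementation)."""

    def __init__(self, ttl_seconds=3600, max_size=10000, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.max_size = max_size
        self.clock = clock  # injectable so expiry can be exercised without sleeping
        self._entries = OrderedDict()  # key -> (stored_at, value), least recent first

    def get(self, key):
        item = self._entries.get(key)
        if item is None:
            return None
        stored_at, value = item
        if self.clock() - stored_at >= self.ttl_seconds:
            del self._entries[key]  # expired: behaves like a miss
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value):
        self._entries[key] = (self.clock(), value)
        self._entries.move_to_end(key)
        while len(self._entries) > self.max_size:
            self._entries.popitem(last=False)  # evict the least recently used entry
```

A lower ttl-seconds trades hit rate for freshness; a lower max-size trades hit rate for memory per node.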
Cache Key Derivation
The cache key is a SHA-256 hash of the canonical request:
Included in the hashed string: model, each message's role, each message's extracted text content (in order), temperature, maxTokens.
Excluded from the hashed string:
- stream: streaming requests bypass the cache entirely before the key is computed.
- metadata: transient request-routing hints, not part of the cache identity.
- responseFormat, tools, toolChoice: not currently included in the key. Two otherwise-identical requests with different response_format values hash to the same key and can share a cache entry. If this matters for your workload, send X-Cache-Control: no-cache on the affected requests so they skip the cache lookup.
- Non-text content blocks: an ImageBlock is replaced with the placeholder [image:<mediaType>] in the hashed string, so two requests with different images of the same MIME type hash to the same key. Multi-modal callers that rely on cache correctness should send X-Cache-Control: no-cache.
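The inclusion and exclusion rules above can be sketched in a few lines of Python. This is an approximation under stated assumptions, not the gateway's code: the newline field separator, the dict-based request shape, and the image block fields ("type", "mediaType") are all hypothetical, since the exact canonical string is not specified here.

```python
import hashlib


def extract_text(content):
    """Flatten message content to text; image blocks become a MIME-type placeholder."""
    if isinstance(content, str):
        return content
    parts = []
    for block in content:
        if block.get("type") == "image":  # assumed block shape
            parts.append(f"[image:{block.get('mediaType', '')}]")
        else:
            parts.append(block.get("text", ""))
    return "".join(parts)


def cache_key(request):
    """SHA-256 over model, each role and text in order, temperature, maxTokens."""
    fields = [str(request.get("model"))]
    for message in request.get("messages", []):
        fields.append(str(message.get("role")))
        fields.append(extract_text(message.get("content", "")))
    fields.append(str(request.get("temperature")))
    fields.append(str(request.get("maxTokens")))
    # The "\n" separator is an assumption made for this sketch.
    canonical = "\n".join(fields)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Because responseFormat, tools, toolChoice, and the image bytes never enter the canonical string, requests that differ only in those fields produce the same key in this sketch, which is the collision the exclusion list warns about.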
PII stripping before cache lookup and write
Before the cache is consulted, the request is passed through DVARA's PII enforcer, which rewrites any PII tokens in the request text into tenant-specific stable placeholders. The stripped form is then used for both the cache lookup and the cache write. Two requests that differ only in their PII values therefore resolve to the same cache key and share a hit, regardless of whether the tenant's pii.action is LOG, REDACT, or any other non-BLOCK mode. (BLOCK short-circuits the request before it reaches the cache at all.)
This means PII-heavy workloads — the canonical "summarize this support ticket" pattern where hundreds of tickets differ only in identifying fields — collapse onto a single cache entry by default, with no extra knob to flip.
Per-tenant PII configuration controls which entity types get tokenized and how. See PII Detection for the enforcer configuration; the cache uses the same policies with no separate knob.
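To make the effect on cache keys concrete, here is a minimal Python sketch of stripping PII before hashing. Everything in it is an assumption for illustration: a single email regex stands in for the PII enforcer, and the placeholder format ([EMAIL:<tenant tag>]) is invented. The only property it tries to reproduce is that the placeholder depends on the tenant and entity type, not on the PII value.

```python
import hashlib
import re

# Hypothetical detector: one email regex stands in for the real PII enforcer.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def strip_pii(text, tenant_id):
    """Replace each email with a placeholder that is stable per tenant."""
    tenant_tag = hashlib.sha256(tenant_id.encode("utf-8")).hexdigest()[:8]
    return EMAIL.sub(f"[EMAIL:{tenant_tag}]", text)


def lookup_key(text, tenant_id):
    """Hash the stripped text, so the PII value never influences the key."""
    stripped = strip_pii(text, tenant_id)
    return hashlib.sha256(stripped.encode("utf-8")).hexdigest()
```

Under these assumptions, two tickets from the same tenant that differ only in the email address map to one key, while the same ticket text from a different tenant maps to a different key.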
How It Works
- A non-streaming request arrives at POST /v1/chat/completions.
- If the X-Cache-Control: no-cache header is present, the cache lookup is skipped.
- Otherwise, the gateway looks up the request in the cache.
  - Cache hit: the cached response is returned immediately with X-Cache: HIT. No provider call is made.
  - Cache miss: the request is forwarded to the provider. The response is stored in the cache and returned with X-Cache: MISS.
- Streaming requests ("stream": true) bypass the cache entirely: no lookup, no storage, no X-Cache header.
Response Headers
| Header | Value | When |
|---|---|---|
| X-Cache | HIT | Response served from cache |
| X-Cache | MISS | Response fetched from provider, now cached |
| (absent) | — | Streaming request (cache bypassed) |
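The request flow and the header table above can be summarised as a small decision function. This Python sketch is illustrative only: the function name, the plain dict used as the cache, and the call_provider and key_fn parameters are hypothetical stand-ins, not gateway APIs.

```python
def handle_chat_completion(request, headers, cache, call_provider, key_fn):
    """Return (response, response_headers) following the documented flow."""
    if request.get("stream"):
        # Streaming: no lookup, no storage, no X-Cache header.
        return call_provider(request), {}

    key = key_fn(request)
    bypass = headers.get("X-Cache-Control", "").lower() == "no-cache"
    if not bypass:
        cached = cache.get(key)
        if cached is not None:
            return cached, {"X-Cache": "HIT"}

    response = call_provider(request)
    cache[key] = response  # stored even when the lookup was bypassed
    return response, {"X-Cache": "MISS"}
```

Note that the bypass header only skips the lookup in this sketch; the fresh response still overwrites the stored entry, so later requests without the header see the newer response.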
Bypassing the Cache
Send the X-Cache-Control: no-cache header to force a fresh provider call:
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-Cache-Control: no-cache" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "What time is it?"}]
  }'
The response will still be stored in the cache (with X-Cache: MISS), so subsequent requests without the bypass header will hit the cache.
Cache Metering
Cache hits are logged at INFO level with the number of tokens saved:
Cache HIT for key=a1b2c3d4e5f6 — tokens saved: 33
Token usage from cached responses is still recorded for rate-limiting purposes.
Semantic Cache
The semantic cache ships in 1.0.0 with a hash-based embedding that catches close paraphrases (re-orderings, added pleasantries, minor word swaps) but is not semantically rich. Two sentences that mean the same thing but share few characters — "summarize this paragraph" vs. "give me the gist" — will usually miss. A neural embedding model (and a real similarity index) ships in 1.1.0.
If you turn this on for 1.0.0, validate hit rate against your actual traffic with the stats endpoint before relying on it for cost savings, and keep default-threshold at its default of 0.92 (high precision, low recall) until you've measured. Tenant isolation is enforced (see Vector store), so a hit for tenant A is bounded to tenant A's cache, but the hit-rate ceiling is governed by the embedding quality, and that is the part that is preview-grade.
When the semantic cache is enabled, it replaces the exact-match backend for chat-completion caching. Instead of hashing the request into an exact-match key, the semantic cache embeds the request text and performs a cosine-similarity search against previously-cached embeddings, scoped to the tenant id from the request's resolved API key. A hit fires when the nearest neighbor's similarity exceeds the configured threshold.
Why fuzzy matching matters
Exact-match caching only hits when two requests produce the same bytes. Chat traffic rarely does — users rephrase, rearrange, add a pleasantry, or shift a word. A semantic cache aims to treat "summarize this paragraph" and "can you give me a short summary of this text" as the same query if the embeddings land close enough in vector space.
In 1.0.0 the embedding is a deterministic character-trigram hash, not a neural model. It works for surface paraphrases that share substrings; it doesn't work for sentences that mean the same thing but use different words. Treat 1.0.0 hit rates as a lower bound for what 1.1.0 will deliver, not as a representative measurement.
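A character-trigram hash embedding of the kind described here can be sketched in Python as follows. The vector size of 256, the CRC32 hash, and the lowercase/whitespace normalisation are assumptions chosen for the example; the actual 1.0.0 parameters are not stated on this page.

```python
import math
import zlib

DIMENSIONS = 256  # assumed vector size for this sketch


def embed(text):
    """Hash each character trigram into a fixed-size count vector, then L2-normalise."""
    normalized = " ".join(text.lower().split())
    vector = [0.0] * DIMENSIONS
    for i in range(len(normalized) - 2):
        trigram = normalized[i:i + 3]
        # crc32 is deterministic across processes, unlike Python's built-in hash().
        vector[zlib.crc32(trigram.encode("utf-8")) % DIMENSIONS] += 1.0
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else vector


def cosine(a, b):
    """For unit vectors, the dot product equals the cosine similarity."""
    return sum(x * y for x, y in zip(a, b))
```

With this sketch, "please summarize this paragraph" scores much closer to "summarize this paragraph" than "give me the gist" does, because similarity comes only from shared substrings. That is the limitation the paragraph above describes.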
Configuration
dvara:
  llm-gateway:
    cache:
      semantic:
        enabled: true                # default: false
        default-threshold: 0.92      # cosine similarity threshold; lower = more aggressive matching
        max-entries: 10000           # per-tenant cap before oldest-first eviction
        ttl-seconds: 3600            # per-entry TTL; same semantics as exact-match cache
        drift:
          check-schedule: ""         # cron expression for drift detection; blank = disabled
          default-threshold: 0.85    # drift detection threshold (separate from match threshold)
Or via environment variables:
DVARA_LLM_GATEWAY_CACHE_SEMANTIC_ENABLED=true
DVARA_LLM_GATEWAY_CACHE_SEMANTIC_DEFAULT_THRESHOLD=0.92
DVARA_LLM_GATEWAY_CACHE_SEMANTIC_MAX_ENTRIES=10000
DVARA_LLM_GATEWAY_CACHE_SEMANTIC_TTL_SECONDS=3600
DVARA_LLM_GATEWAY_CACHE_SEMANTIC_DRIFT_CHECK_SCHEDULE=""
DVARA_LLM_GATEWAY_CACHE_SEMANTIC_DRIFT_DEFAULT_THRESHOLD=0.85
Per-tenant thresholds
The match threshold is overridable per tenant through the Admin API (GET/POST/PUT/DELETE /v1/admin/cache/configs) or the DVARA Flightdeck. A tenant that prefers precision over hit rate can set a higher threshold (say 0.96) while another tenant chasing cost savings can lower theirs (say 0.85). Threshold resolution walks: tenant-scoped config matching the model pattern → platform-global config matching the model pattern → default-threshold.
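The resolution order can be expressed as a short lookup. In this Python sketch the config record shape (tenantId, modelPattern, threshold), the use of None for platform-global configs, and glob-style pattern matching are all assumptions; the Admin API's actual payload is not shown on this page.

```python
from fnmatch import fnmatchcase


def resolve_threshold(tenant_id, model, configs, default_threshold=0.92):
    """Walk tenant-scoped configs, then platform-global ones, then the default."""
    for scope in (tenant_id, None):  # None marks a platform-global config in this sketch
        for config in configs:
            if config["tenantId"] == scope and fnmatchcase(model, config["modelPattern"]):
                return config["threshold"]
    return default_threshold
```

For example, with a global config for "gpt-*" at 0.90 and a tenant config for "gpt-4o*" at 0.96, that tenant gets 0.96 on gpt-4o, 0.90 on other gpt models, and the default everywhere else.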
Vector store
Embeddings and cached response bodies live in one distributed map per tenant, named dvara-semantic-cache:{tenantId} (the platform-default tenant uses dvara-semantic-cache:_platform_). The store is shared across pods through the same embedded distributed cluster that carries rate-limit counters and API-key lookups, so a hit cached by pod A is visible to pod B on the next call. Tenant isolation is enforced by the per-tenant map name — there is no path for tenant A's lookup to consider tenant B's entries. The store evicts the oldest entry past max-entries and drops entries past ttl-seconds.
Search is a brute-force linear cosine scan over the tenant's map. This is acceptable while per-tenant entry counts stay under ~10K; an indexed (HNSW) implementation lands in 1.1.0 alongside the neural embedding model.
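The store behaviour described in this section (one map per tenant, oldest-first eviction, TTL expiry, linear scan) can be approximated in Python as below. The class SemanticStore, its in-memory dicts, and the injectable clock are hypothetical stand-ins for the distributed maps; only the map naming scheme is taken from the text.

```python
import math
import time
from collections import OrderedDict, defaultdict


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


class SemanticStore:
    """Illustrative stand-in for the per-tenant distributed maps."""

    def __init__(self, max_entries=10000, ttl_seconds=3600, clock=time.monotonic):
        self.max_entries = max_entries
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._maps = defaultdict(OrderedDict)  # map name -> {id: (time, vector, body)}
        self._next_id = 0

    @staticmethod
    def map_name(tenant_id):
        return f"dvara-semantic-cache:{tenant_id or '_platform_'}"

    def put(self, tenant_id, vector, body):
        entries = self._maps[self.map_name(tenant_id)]
        entries[self._next_id] = (self.clock(), vector, body)
        self._next_id += 1
        while len(entries) > self.max_entries:
            entries.popitem(last=False)  # oldest-first eviction

    def search(self, tenant_id, vector, threshold):
        """Linear scan of one tenant's map; returns the best body above the threshold."""
        entries = self._maps[self.map_name(tenant_id)]
        now = self.clock()
        best_score, best_body = threshold, None
        for entry_id, (stored_at, candidate, body) in list(entries.items()):
            if now - stored_at >= self.ttl_seconds:
                del entries[entry_id]  # drop expired entries as they are found
                continue
            score = cosine(vector, candidate)
            if score > best_score:
                best_score, best_body = score, body
        return best_body
```

Each lookup costs one cosine computation per live entry in the tenant's map, which is why the linear scan only suits modest per-tenant entry counts.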
Stats
The semantic cache tracks live hit count, miss count, and cumulative similarity sum across all tenants. The admin endpoint GET /v1/admin/cache/stats returns a JSON object with the following shape (Jackson camelCase, no snake_case strategy):
{
  "totalHits": 1247,
  "totalMisses": 3891,
  "hitRate": 0.243,
  "avgSimilarity": 0.94,
  "cacheSize": 5138,
  "computedAt": "2026-06-05T18:42:11.227Z"
}
Use it to validate that the default threshold is right for your traffic before rolling out tenant-specific overrides. cacheSize is the count of stored entries across all per-tenant maps at the snapshot moment; computedAt is the snapshot timestamp so you can correlate against per-tenant traffic changes.
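The derived fields in the sample payload are consistent with simple ratios over the running counters. The Python sketch below shows that arithmetic; the function name, the rounding, and the assumption that the similarity sum accumulates only over hits are illustrative guesses, not documented behaviour.

```python
def summarize(total_hits, total_misses, similarity_sum):
    """Derive hitRate and avgSimilarity from the running counters."""
    lookups = total_hits + total_misses
    return {
        "hitRate": round(total_hits / lookups, 3) if lookups else 0.0,
        # Assumes the similarity sum accumulates only over hits.
        "avgSimilarity": round(similarity_sum / total_hits, 2) if total_hits else 0.0,
    }
```

With the sample values, 1247 hits out of 1247 + 3891 = 5138 lookups gives the 0.243 hit rate shown in the payload.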
For per-tenant observability, two Prometheus counters are emitted on every lookup:
- gateway_semantic_cache_hits_total{tenant}: incremented on each cache hit
- gateway_semantic_cache_misses_total{tenant}: incremented on each cache miss (including paths that bypass the embed step)
The tenant label is the resolved tenant id, or _platform_ for non-tenant-scoped requests (admin tooling, internal probes). Graph hit rate per tenant in Prometheus to spot tenants whose threshold needs tuning.
Distributed cache for API-key lookups
Multi-instance deployments share API-key lookups through an embedded distributed cache that runs inside the gateway process — no external cache infrastructure required. Pods auto-cluster across the fleet, so an API key resolved once on pod A is sub-millisecond on pod B too. Reads check the distributed map first and fall back to PostgreSQL; writes update PostgreSQL and immediately evict the cached entry across the cluster so revoked or rotated keys are never served stale.
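The read and write paths described above follow a read-through, evict-on-write pattern. The Python sketch below illustrates it with plain dicts standing in for the distributed map and for PostgreSQL; the class ApiKeyStore and its method names are hypothetical.

```python
class ApiKeyStore:
    """Illustrative read-through cache with evict-on-write."""

    def __init__(self, database):
        self.database = database  # stand-in for PostgreSQL
        self.cache = {}           # stand-in for the cluster-wide distributed map
        self.database_reads = 0

    def resolve(self, api_key):
        """Check the cache first, then fall back to the database."""
        if api_key in self.cache:
            return self.cache[api_key]
        self.database_reads += 1
        record = self.database.get(api_key)
        if record is not None:
            self.cache[api_key] = record
        return record

    def update(self, api_key, record):
        """Write to the database first, then evict so the old value is not served."""
        if record is None:
            self.database.pop(api_key, None)  # revoke the key
        else:
            self.database[api_key] = record
        self.cache.pop(api_key, None)
```

Evicting rather than updating the cached entry keeps the database as the single source of truth: the next read repopulates the cache from whatever was committed.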
The same distributed cache also carries per-key rate-limit counters. Discovery, headless-Service setup, and the CACHE_SERVICE_NAME environment variable are documented once on Rate Limiting.