
Hallucination Detection

When a response is supposed to be grounded in a set of source documents — a RAG answer, a summary, a citation — DVARA can catch claims that aren't actually supported by the sources. The detector splits the response into sentence-level claims, embeds each claim and each source passage, and flags any claim whose maximum cosine similarity to any source falls below a configurable threshold.

Grounding runs as a response-side check on non-streaming calls and as an at-stream-end check on SSE responses.

How it works

The detector is fail-open — if embedding fails, the filter passes the response through and writes a warning log. It never blocks on its own failures.

The sentence splitter applies a 5-word minimum per claim: sentences shorter than 5 words (after whitespace split) are filtered out and never embedded. Very short answers like "Yes.", "False.", or "It is.", as well as short list bullets, are therefore skipped silently, and a response consisting only of such sentences is treated as grounded. The detector is designed for long-form responses; if your workload needs grounding on short-token outputs, gate it differently (e.g. structured-output schemas with required citation fields).
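The pipeline described above (split into claims, embed, compare against every source, flag anything under the threshold) can be sketched in a few lines of Python. This is only an illustration, not the gateway's code: `split_claims`, `toy_embed`, and `ungrounded_claims` are hypothetical names, the regex splitter is a guess at sentence splitting, and the hashed bag-of-words embedding is a stand-in for the real embedding service.

```python
import hashlib
import math
import re

MIN_WORDS = 5  # claims shorter than this are skipped, as described above
DIM = 256      # hypothetical embedding width for the toy embedding


def split_claims(response: str) -> list[str]:
    """Split on sentence-ending punctuation and drop sentences under MIN_WORDS."""
    sentences = re.split(r"(?<=[.!?])\s+", response.strip())
    return [s for s in sentences if len(s.split()) >= MIN_WORDS]


def toy_embed(text: str) -> list[float]:
    """Stand-in embedding (assumption): hashed bag of words, L2-normalised."""
    vec = [0.0] * DIM
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a: list[float], b: list[float]) -> float:
    """Dot product; equals cosine similarity because inputs are unit vectors."""
    return sum(x * y for x, y in zip(a, b))


def ungrounded_claims(response: str, sources: list[str], threshold: float = 0.7) -> list[str]:
    """Return every claim whose best similarity to any source is below threshold."""
    source_vecs = [toy_embed(s) for s in sources]
    flagged = []
    for claim in split_claims(response):
        claim_vec = toy_embed(claim)
        best = max((cosine(claim_vec, sv) for sv in source_vecs), default=0.0)
        if best < threshold:
            flagged.append(claim)
    return flagged
```

Note that a response such as "Yes." produces no claims at all in this sketch, so nothing can be flagged, which mirrors the short-answer behavior described above.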

Supplying source documents

The detector looks for sources in request.metadata["grounding.sources"] as a list of strings. Your client supplies them on each chat-completion request:

POST /v1/chat/completions
{
  "model": "gpt-4o",
  "messages": [
    { "role": "system", "content": "Answer based only on the provided context." },
    { "role": "user", "content": "What is DVARA's auth model?" }
  ],
  "metadata": {
    "grounding.sources": [
      "DVARA supports three auth modes: built-in, OIDC, and SAML 2.0.",
      "The data plane always requires API keys."
    ]
  }
}

If metadata["grounding.sources"] is missing or empty while grounding is enabled, the detector is a no-op: it doesn't fabricate sources for you.
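On the client side, a RAG pipeline would typically attach its retrieved passages when it builds the request body. The sketch below shows one way to do that in Python; `build_grounded_request` is a hypothetical helper, and the only details taken from this page are the body shape and the flat `grounding.sources` metadata key.

```python
import json


def build_grounded_request(model: str, system_prompt: str, question: str, sources: list[str]) -> dict:
    """Build a chat-completion body that carries the grounding sources."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
        # The key is a single flat string containing a dot,
        # not a nested {"grounding": {"sources": ...}} object.
        "metadata": {"grounding.sources": list(sources)},
    }


body = build_grounded_request(
    model="gpt-4o",
    system_prompt="Answer based only on the provided context.",
    question="What is DVARA's auth model?",
    sources=[
        "DVARA supports three auth modes: built-in, OIDC, and SAML 2.0.",
        "The data plane always requires API keys.",
    ],
)
payload = json.dumps(body)  # send this as the POST body
```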

Sources longer than dvara.llm-gateway.guardrail.grounding.max-source-length (default 10000 chars) are silently dropped. Lists longer than dvara.llm-gateway.guardrail.grounding.max-sources (default 50) have their tail dropped. This protects you from runaway embedding bills on pathological inputs.
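The two limits can be pictured with a small Python sketch. `apply_source_limits` is a hypothetical name, and the order of operations (drop oversized sources first, then cap the list) is an assumption; this page does not say which limit is applied first.

```python
def apply_source_limits(
    sources: list[str],
    max_sources: int = 50,
    max_source_length: int = 10000,
) -> list[str]:
    """Drop oversized sources, then keep only the first max_sources entries."""
    # Assumption: the length filter runs before the list cap.
    kept = [s for s in sources if len(s) <= max_source_length]
    return kept[:max_sources]
```

Because both drops are silent, it may be worth applying the same limits client-side and logging what was removed, so that a missing source does not show up later as an unexplained ungrounded claim.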

Configuration

Set these in application.yml:

dvara:
  llm-gateway:
    guardrail:
      grounding:
        enabled: true
        similarity-threshold: 0.7   # cosine threshold; claims below this are ungrounded
        action: LOG                 # LOG, FLAG, or BLOCK
        max-sources: 50             # max source documents per request
        max-source-length: 10000    # max chars per source document

Or via environment variables — the gateway maps dvara.llm-gateway.guardrail.grounding.* YAML keys to DVARA_LLM_GATEWAY_GUARDRAIL_GROUNDING_* env vars automatically:

DVARA_LLM_GATEWAY_GUARDRAIL_GROUNDING_ENABLED=true
DVARA_LLM_GATEWAY_GUARDRAIL_GROUNDING_SIMILARITY_THRESHOLD=0.7
DVARA_LLM_GATEWAY_GUARDRAIL_GROUNDING_ACTION=LOG
DVARA_LLM_GATEWAY_GUARDRAIL_GROUNDING_MAX_SOURCES=50
DVARA_LLM_GATEWAY_GUARDRAIL_GROUNDING_MAX_SOURCE_LENGTH=10000

| Property | Default | Description |
|---|---|---|
| enabled | false | Enable embedding-based grounding detection |
| similarity-threshold | 0.7 | Cosine similarity threshold; claims below this are ungrounded |
| action | LOG | LOG, FLAG, or BLOCK |
| max-sources | 50 | Maximum source documents per request (excess silently dropped) |
| max-source-length | 10000 | Maximum chars per source document (oversized silently dropped) |
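
The YAML-key-to-environment-variable mapping follows a pattern that is easy to reproduce when generating deployment manifests. The helper below is a hypothetical sketch inferred from the examples on this page (uppercase, with dots and hyphens replaced by underscores); it is not the gateway's own binding code.

```python
def env_var_for(yaml_key: str) -> str:
    """Derive the env var name for a dotted YAML key, as inferred from the examples."""
    return yaml_key.upper().replace(".", "_").replace("-", "_")
```

For example, `env_var_for("dvara.llm-gateway.guardrail.grounding.max-sources")` yields `DVARA_LLM_GATEWAY_GUARDRAIL_GROUNDING_MAX_SOURCES`.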

Actions

  • LOG (default) — write a HALLUCINATION_DETECTED audit event and a Prometheus metric sample, then pass the response through unchanged. Use this in shadow mode while you tune the threshold.
  • FLAG — behaviorally identical to LOG in the current release: the audit event and the metric sample are written, and the response is passed through unchanged. The value of setting FLAG is that the action label on the gateway_grounding_check_total counter reflects your intent, so dashboards can distinguish "running in shadow" from "running in warn-before-block" when you're staging a rollout. There is no X-Gateway-Grounding-Warning response header in this release — do not rely on a client-side header to drive UX.
  • BLOCK — the original response bytes are discarded and the client receives a structured error (see Error code below for the wire format).

Start with LOG, measure the false-positive rate on real traffic, then promote to FLAG or BLOCK once you're confident.

Per-tenant overrides

The DVARA Flightdeck exposes per-tenant grounding settings on the Grounding tab of the tenant edit form (Identity → Tenants → Edit). The form covers enabled, action, max-sources, and max-source-length with server-side validation. Updates emit a TENANT_METADATA_UPDATED audit event with a diff of every changed key.

For Terraform, CI/CD, and other programmatic tooling, the same overrides go through the Automation API as tenant metadata:

curl -X PUT http://localhost:8090/v1/admin/tenants/acme-corp \
  -H "Content-Type: application/json" \
  -d '{
    "metadata": {
      "grounding.enabled": "true",
      "grounding.action": "BLOCK",
      "grounding.max-sources": "20",
      "grounding.max-source-length": "5000"
    }
  }'

| Metadata key | Type | Overrides |
|---|---|---|
| grounding.enabled | boolean | Enable / disable per tenant |
| grounding.action | BLOCK / FLAG / LOG | Action for this tenant |
| grounding.max-sources | int | Per-tenant source cap |
| grounding.max-source-length | int | Per-tenant source length cap |

Missing keys fall back to the global config.
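
The fallback behavior can be sketched as a merge of string-valued tenant metadata over the global settings. `GroundingConfig` and `resolve_for_tenant` are hypothetical names, and the parsing rules (for example, treating only "true" as enabled) are assumptions made for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class GroundingConfig:
    # Defaults mirror the global property table; similarity-threshold is
    # deliberately absent because it is not a per-tenant setting.
    enabled: bool = False
    action: str = "LOG"
    max_sources: int = 50
    max_source_length: int = 10000


def resolve_for_tenant(global_cfg: GroundingConfig, tenant_metadata: dict) -> GroundingConfig:
    """Overlay any grounding.* tenant keys on the global config."""
    overrides = {}
    if "grounding.enabled" in tenant_metadata:
        # Assumption: metadata values arrive as strings, as in the curl example.
        overrides["enabled"] = tenant_metadata["grounding.enabled"].strip().lower() == "true"
    if "grounding.action" in tenant_metadata:
        action = tenant_metadata["grounding.action"].upper()
        if action not in {"LOG", "FLAG", "BLOCK"}:
            raise ValueError(f"unsupported grounding.action: {action}")
        overrides["action"] = action
    if "grounding.max-sources" in tenant_metadata:
        overrides["max_sources"] = int(tenant_metadata["grounding.max-sources"])
    if "grounding.max-source-length" in tenant_metadata:
        overrides["max_source_length"] = int(tenant_metadata["grounding.max-source-length"])
    # Keys that are missing keep the global value.
    return replace(global_cfg, **overrides)
```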

similarity-threshold is not overridable per tenant — it's bound at startup and applies to every tenant uniformly. Changing it requires a gateway restart. If different tenants need different strictness, use per-tenant grounding.action (e.g. BLOCK for the strict tenant, LOG for the rest) rather than reaching for a per-tenant threshold.

Streaming responses

For streaming (stream=true) responses, the gateway buffers the full response text across all SSE chunks. At stream end, the grounding check runs against the accumulated text. The action determines what happens:

  • BLOCK — the stream is terminated with a final chunk containing finishReason: content_filter and a HALLUCINATION_DETECTED_STREAMING audit event is written. The client sees a graceful stream termination.
  • LOG / FLAG — a HALLUCINATION_DETECTED_STREAMING audit event is written and the stream completes normally.

Because streaming grounding checks happen at stream end, they don't add latency mid-response — you get full streaming throughput.
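
The buffer-then-check flow can be illustrated with a Python generator. Everything here is a simplified assumption: `stream_with_grounding` is a hypothetical name, chunks are modeled as plain dicts rather than real SSE frames, `is_grounded` stands in for the detector, and the `stop` finish reason for the normal case is a placeholder.

```python
from typing import Callable, Iterable, Iterator, List


def stream_with_grounding(
    chunks: Iterable[str],
    is_grounded: Callable[[str], bool],
    action: str,
    audit_log: List[str],
) -> Iterator[dict]:
    """Forward chunks as they arrive, then run one check on the full text."""
    buffered = []
    for text in chunks:
        buffered.append(text)   # keep a copy for the end-of-stream check
        yield {"delta": text}   # forwarded without waiting for the check
    if is_grounded("".join(buffered)):
        yield {"delta": "", "finishReason": "stop"}
        return
    # The audit event is written for every action when the check fails.
    audit_log.append("HALLUCINATION_DETECTED_STREAMING")
    finish = "content_filter" if action == "BLOCK" else "stop"
    yield {"delta": "", "finishReason": finish}
```

In this sketch every content chunk has already been forwarded by the time the check runs, so the action only changes the final chunk.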

Observability

The grounding detector emits a single Prometheus counter and two audit-event types. Hook these into your existing dashboards before turning the action up from LOG to BLOCK.

Prometheus metric

| Metric | Type | Labels | Description |
|---|---|---|---|
| gateway_grounding_check_total | Counter | grounded (true/false), action (LOG/FLAG/BLOCK) | Incremented once per non-streaming response that runs through the detector. |

PromQL to dashboard the ungrounded rate, i.e. the fraction of responses (0 to 1) in which the detector flagged at least one ungrounded claim:

sum(rate(gateway_grounding_check_total{grounded="false"}[5m]))
/
sum(rate(gateway_grounding_check_total[5m]))

PromQL to alert on regressions — fire when ungrounded rate exceeds 10% over a 15-minute window:

sum(rate(gateway_grounding_check_total{grounded="false"}[15m]))
/
sum(rate(gateway_grounding_check_total[15m]))
> 0.1

PromQL to track the rollout from LOG → FLAG → BLOCK, as a stacked breakdown by action:

sum by (action) (rate(gateway_grounding_check_total[5m]))

Audit events

An audit event is written only when the detector flags at least one ungrounded claim — fully grounded responses do not emit an event. Two event types are emitted depending on the response mode:

  • HALLUCINATION_DETECTED — non-streaming responses.
  • HALLUCINATION_DETECTED_STREAMING — streaming responses, written at stream end.

Envelope fields (set by the audit writer): event ID, timestamp, tenantId, and the event type. The trace ID and request metadata live on the surrounding HTTP access log line, not on the audit payload.

Payload fields:

| Field | Type | Description |
|---|---|---|
| grounded | boolean | Always false on these events; the event is only written when the check fails. |
| confidence | double | Detector's confidence in the grounding decision (0.0–1.0). |
| overall_similarity | double | Aggregate cosine similarity across all claims. |
| ungrounded_claim_count | int | Number of sentence claims that fell below the similarity threshold. |
| ungrounded_claims | list of strings | The flagged claim text for every claim that failed the check. Truncated to 100 characters with a ... suffix when the original is longer; reviewers seeing the suffix should cross-reference the original response to read the full sentence. |
| source | string | Present only on HALLUCINATION_DETECTED_STREAMING, with value streaming_response. Distinguishes the streaming origin in log queries. |
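
To make the payload shape concrete, the sketch below assembles one in Python. `truncate_claim` and `build_audit_payload` are hypothetical helpers, and the reading of the truncation rule (keep the first 100 characters, then append the suffix, for 103 characters in total) is an assumption; the suffix may instead count toward the 100.

```python
MAX_CLAIM_CHARS = 100


def truncate_claim(claim: str) -> str:
    """Keep the first 100 characters and mark the cut with a '...' suffix."""
    if len(claim) <= MAX_CLAIM_CHARS:
        return claim
    # Assumption: the suffix is added on top of the 100 kept characters.
    return claim[:MAX_CLAIM_CHARS] + "..."


def build_audit_payload(
    ungrounded: list[str],
    confidence: float,
    overall_similarity: float,
    streaming: bool = False,
) -> dict:
    """Assemble the payload fields listed in the table above."""
    payload = {
        "grounded": False,  # events are only written when the check fails
        "confidence": confidence,
        "overall_similarity": overall_similarity,
        "ungrounded_claim_count": len(ungrounded),
        "ungrounded_claims": [truncate_claim(c) for c in ungrounded],
    }
    if streaming:
        payload["source"] = "streaming_response"
    return payload
```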

Error code

When action=BLOCK and the detector finds an ungrounded claim, the gateway returns HTTP 403 with type: guardrail_violation and code: hallucination_detected — the same envelope shape as the other guardrail enforcement codes (guardrail_blocked, pii_detected).

{
  "error": {
    "type": "guardrail_violation",
    "code": "hallucination_detected",
    "message": "Response blocked: hallucination detected (3 ungrounded claims)",
    "trace_id": "a6783439db1f46a6bfed511a0011e955"
  }
}
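
A client can tell this error apart from other 403 responses by checking the type and code fields. The Python sketch below shows one possible approach; `HallucinationBlocked` and `raise_if_blocked` are hypothetical names, and the function takes a status code and raw body so that it stays independent of any particular HTTP library.

```python
import json


class HallucinationBlocked(Exception):
    """Raised when the gateway blocked a response as ungrounded."""

    def __init__(self, message: str, trace_id):
        super().__init__(message)
        self.trace_id = trace_id


def raise_if_blocked(status: int, body: str) -> None:
    """Raise HallucinationBlocked for the envelope shown above; otherwise do nothing."""
    if status != 403:
        return
    try:
        error = json.loads(body)["error"]
        kind, code = error["type"], error["code"]
    except (ValueError, KeyError, TypeError):
        return  # not a recognisable guardrail envelope
    if kind == "guardrail_violation" and code == "hallucination_detected":
        raise HallucinationBlocked(error.get("message", ""), error.get("trace_id"))
```

Keeping the `trace_id` on the exception lets callers log it next to their own request ID when reporting a blocked response.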

Tuning the threshold

The default threshold (0.7) is a conservative starting point. Things to keep in mind:

  • Claim tokenizer quality matters more than the embedding model. A bad sentence split can attribute unrelated content to a claim.
  • Source passage length affects similarity scores — very long passages dilute similarity. Break sources into semantically coherent chunks (1–3 sentences each).
  • Embedding model affects absolute similarity values. The 1.0.0 release uses the same in-process embedding service as the semantic cache: a deterministic hash-based embedding that catches close paraphrases but isn't semantically rich. A neural embedding model is on the roadmap; switching to it will require retuning the threshold from scratch. Treat 1.0.0 false-positive rates as a worst-case baseline that a neural model is expected to improve on.
  • False positives are most common on responses that paraphrase sources heavily or on lists where each bullet is short. Lower the threshold cautiously to reduce them; below 0.5 is usually too permissive.
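
One way to choose a threshold is to sweep candidate values over a hand-labeled sample of claims and see how the false-positive rate moves. The Python sketch below assumes you have collected, outside the gateway, pairs of (best similarity to any source, whether a reviewer judged the claim grounded); `false_positive_rate` and `sweep` are hypothetical helpers and the sample values are made up for illustration.

```python
def false_positive_rate(samples: list[tuple[float, bool]], threshold: float) -> float:
    """Share of truly grounded claims that would be flagged at this threshold."""
    grounded_scores = [score for score, truly_grounded in samples if truly_grounded]
    if not grounded_scores:
        return 0.0
    flagged = sum(1 for score in grounded_scores if score < threshold)
    return flagged / len(grounded_scores)


def sweep(samples: list[tuple[float, bool]], thresholds: list[float]) -> dict:
    """Map each candidate threshold to its false-positive rate."""
    return {t: false_positive_rate(samples, t) for t in thresholds}


# Illustrative, made-up data: (max similarity, reviewer says grounded)
samples = [(0.9, True), (0.65, True), (0.55, True), (0.3, False), (0.8, True)]
rates = sweep(samples, [0.5, 0.6, 0.7])
```

Since a claim is flagged only when its score is below the threshold, lower thresholds flag fewer claims. The same sweep can be extended to count truly ungrounded claims that slip through, which shows the other side of the trade-off.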