ML & Plugin Guardrails
DVARA's regex-based guardrails catch the obvious stuff (profanity, well-known prompt-injection patterns, PII). For harder detections — novel jailbreaks, subtle manipulation, domain-specific policy violations — DVARA ships two extensibility layers:
- ML classifier integration — call out to a commercial classifier (Lakera, Google ShieldGemini, or a generic endpoint) for injection / jailbreak scoring
- External guardrail plugins — call any HTTPS endpoint you run yourself with an HMAC-signed payload
ML classifier integration
The ML classifier runs as part of the gateway's request-side guardrail pipeline. When enabled, it sends the request text to an external classifier, waits for a score, and applies the configured action (BLOCK / FLAG / LOG) based on whether the score exceeds the confidence threshold.
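The threshold check described above can be sketched in a few lines. This is an illustration rather than gateway code: the `decide` function and `Action` enum are hypothetical names, and the sketch assumes a score exactly equal to the threshold counts as a detection.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    BLOCK = "BLOCK"
    FLAG = "FLAG"
    LOG = "LOG"


def decide(score: float, threshold: float, configured: Action) -> Optional[Action]:
    """Return the configured action when the score clears the threshold, else None.

    Assumption: score == threshold is treated as a detection.
    """
    if score >= threshold:
        return configured
    return None  # below the threshold: the detection is ignored
```

With the default threshold of 0.8, a score of 0.92 yields the configured action and a score of 0.5 yields nothing.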
Supported providers
| Provider | Value for dvara.llm-gateway.guardrail.ml-classifier.provider | Default endpoint |
|---|---|---|
| Generic | generic | (must be set explicitly) |
| Lakera Guard | lakera | https://api.lakera.ai/v1/guard |
| Google ShieldGemini | shield-gemini | derived from project-id + location |
Configuration
dvara.llm-gateway.guardrail.ml-classifier.enabled=true
dvara.llm-gateway.guardrail.ml-classifier.provider=lakera
dvara.llm-gateway.guardrail.ml-classifier.api-key=${LAKERA_API_KEY}
dvara.llm-gateway.guardrail.ml-classifier.confidence-threshold=0.8
dvara.llm-gateway.guardrail.ml-classifier.timeout-seconds=5
dvara.llm-gateway.guardrail.ml-classifier.cache-max-size=1000
dvara.llm-gateway.guardrail.ml-classifier.cache-ttl-seconds=300
| Property | Default | Description |
|---|---|---|
| enabled | false | Master switch |
| provider | generic | generic, lakera, or shield-gemini |
| endpoint | auto | Auto-defaulted for Lakera / ShieldGemini. Set explicitly for generic. |
| api-key | (none) | Vendor API key. The gateway reads it from DVARA_LLM_GATEWAY_GUARDRAIL_ML_CLASSIFIER_API_KEY; LAKERA_API_KEY and GOOGLE_API_KEY are not auto-mapped |
| project-id | (none) | GCP project ID (ShieldGemini only) |
| location | us-central1 | GCP region (ShieldGemini only) |
| confidence-threshold | 0.8 | Detections below this score are ignored |
| timeout-seconds | 5 | HTTP call timeout |
| cache-max-size | 1000 | LRU cache max entries (0 disables the cache) |
| cache-ttl-seconds | 300 | Cache entry TTL |
Lakera Guard
Lakera's Guard API scores prompts for prompt injection, jailbreak attempts, PII, and toxic content. DVARA flattens the prompt to a single text field ({"input": "<flattened-text>"}) — not the full messages array — then extracts results[0].flagged and picks the highest-confidence flagged category from results[0].category_scores.
Category mapping covers only two categories. DVARA maps Lakera's jailbreak → JAILBREAK and prompt_injection → INJECTION. Every other Lakera category (PII, toxic content, anything not in the explicit map) falls through to INJECTION in DVARA's audit events. If you need per-category telemetry beyond injection vs jailbreak, add a separate guardrail plugin that surfaces those distinctions explicitly.
dvara-llm-proxy:
  environment:
    DVARA_LLM_GATEWAY_GUARDRAIL_ML_CLASSIFIER_ENABLED: "true"
    DVARA_LLM_GATEWAY_GUARDRAIL_ML_CLASSIFIER_PROVIDER: "lakera"
    DVARA_LLM_GATEWAY_GUARDRAIL_ML_CLASSIFIER_API_KEY: "${LAKERA_API_KEY}"
    DVARA_LLM_GATEWAY_GUARDRAIL_ML_CLASSIFIER_CONFIDENCE_THRESHOLD: "0.85"
The env var DVARA actually reads is DVARA_LLM_GATEWAY_GUARDRAIL_ML_CLASSIFIER_API_KEY — the same name regardless of provider. Vendor-convention names like LAKERA_API_KEY and GOOGLE_API_KEY are not auto-mapped; if you keep your vendor key in a LAKERA_API_KEY env var, indirect through it as shown above so Compose / Kubernetes substitutes it at deploy time.
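To make the Lakera response handling described earlier in this section concrete, here is a minimal sketch of the extraction and category mapping. It is not the gateway's implementation: `parse_lakera` is a hypothetical name, and the sketch simplifies "highest-confidence flagged category" to "highest score in category_scores".

```python
from typing import Optional, Tuple

# Explicit map from the text above; every other category falls through to INJECTION.
LAKERA_TO_DVARA = {"jailbreak": "JAILBREAK", "prompt_injection": "INJECTION"}


def parse_lakera(response: dict) -> Optional[Tuple[str, float]]:
    """Return (dvara_category, score) for a flagged result, or None if not flagged."""
    results = response.get("results") or []
    if not results or not results[0].get("flagged"):
        return None
    scores = results[0].get("category_scores") or {}
    if not scores:
        return None
    # Simplifying assumption: pick the highest-scoring category overall.
    name, score = max(scores.items(), key=lambda kv: kv[1])
    return LAKERA_TO_DVARA.get(name, "INJECTION"), float(score)
```

A response whose top-scoring category is pii would therefore be reported as INJECTION, which is the fall-through behaviour noted above.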
Google ShieldGemini
ShieldGemini is Google's Vertex AI-hosted safety classifier. DVARA derives the endpoint from project-id and location:
dvara-llm-proxy:
  environment:
    DVARA_LLM_GATEWAY_GUARDRAIL_ML_CLASSIFIER_ENABLED: "true"
    DVARA_LLM_GATEWAY_GUARDRAIL_ML_CLASSIFIER_PROVIDER: "shield-gemini"
    DVARA_LLM_GATEWAY_GUARDRAIL_ML_CLASSIFIER_PROJECT_ID: "my-gcp-project"
    DVARA_LLM_GATEWAY_GUARDRAIL_ML_CLASSIFIER_LOCATION: "us-central1"
    DVARA_LLM_GATEWAY_GUARDRAIL_ML_CLASSIFIER_API_KEY: "${GOOGLE_API_KEY}"
Same env-var rule as Lakera: the auth key must land in DVARA_LLM_GATEWAY_GUARDRAIL_ML_CLASSIFIER_API_KEY. A bare GOOGLE_API_KEY is not picked up — indirect through it as shown.
Caching
ML classifier calls add latency (typically 50–200 ms). DVARA caches results by the SHA-256 of the flattened request text with an LRU of cache-max-size entries and a TTL of cache-ttl-seconds. Repeat prompts hit the cache and skip the classifier call. Two requests that differ only in their messages framing but flatten to the same text share a cache entry.
Set cache-max-size=0 to disable caching entirely, which is useful when compliance requires every request to be scored.
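The caching behaviour above (SHA-256 key, LRU eviction, per-entry TTL, size 0 meaning disabled) can be sketched as follows. `ScoreCache` is a hypothetical class written for illustration, not the gateway's cache; the injectable `clock` parameter is an assumption added to make expiry easy to test.

```python
import hashlib
import time
from collections import OrderedDict
from typing import Callable, Optional


class ScoreCache:
    """LRU cache with per-entry TTL, keyed by SHA-256 of the flattened text."""

    def __init__(self, max_size: int, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_size = max_size
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = OrderedDict()  # key -> (expires_at, score)

    @staticmethod
    def key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get(self, text: str) -> Optional[float]:
        k = self.key(text)
        entry = self._entries.get(k)
        if entry is None:
            return None
        expires_at, score = entry
        if self.clock() >= expires_at:
            del self._entries[k]  # entry outlived its TTL
            return None
        self._entries.move_to_end(k)  # mark as most recently used
        return score

    def put(self, text: str, score: float) -> None:
        if self.max_size <= 0:  # max_size=0 disables caching
            return
        k = self.key(text)
        self._entries[k] = (self.clock() + self.ttl, score)
        self._entries.move_to_end(k)
        while len(self._entries) > self.max_size:
            self._entries.popitem(last=False)  # evict least recently used
```

Because the key is derived only from the flattened text, two requests with different message structure but identical flattened text resolve to the same entry.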
Metrics
- gateway_ml_guardrail_total (Counter). Labels: provider, category, action
When to use it
- You're already paying for Lakera or ShieldGemini and want a second layer beyond regex patterns
- Your threat model includes novel jailbreaks that regex can't catch
- Compliance requires an independent safety assessment
If your regex-based guardrails are already catching what you need, adding an ML classifier is optional.
Self-hosted open-source alternatives
You don't need a Lakera or ShieldGemini contract to use the ML hook. The provider: generic path will call any HTTP endpoint that accepts POST {"text": "..."} and returns {"label": "...", "confidence": 0.x}, so any of the following can be deployed as a sidecar and slotted in without changing gateway code.
| Service | License | What it detects | How to point DVARA at it |
|---|---|---|---|
| ProtectAI LLM Guard | Apache 2.0 | Prompt injection, jailbreak, toxicity, anonymization, code detection (composite Python service) | Deploy on Kubernetes with the published Helm chart; set dvara.llm-gateway.guardrail.ml-classifier.endpoint to its /scan endpoint. |
| NVIDIA NeMo Guardrails | Apache 2.0 | Programmable rails — injection / jailbreak / topic boundaries via Colang policies | Run the NeMo server, configure a rail for prompt injection, point DVARA at the resulting REST endpoint. |
| Microsoft Presidio | MIT | NER-based PII detection | Already integrated via the dedicated PII Detection configuration path; mentioned here for completeness. |
For a deployment that wants a real ML classifier but rejects every SaaS dependency, the practical shape is:
- Pull a small open-weight model, for example protectai/deberta-v3-base-prompt-injection-v2 (~150 MB ONNX, Apache 2.0, ~98% F1 on the prompt-injection benchmark) or deepset/deberta-v3-base-injection-v2 (~95% F1).
- Serve it via HuggingFace TEI, vLLM, or NVIDIA Triton (all Apache 2.0 / BSD), or via a thin Python wrapper using transformers + fastapi.
- Set:
  dvara:
    llm-gateway:
      guardrail:
        ml-classifier:
          enabled: true
          provider: generic
          endpoint: http://injection-classifier.internal:8080/classify
          confidence-threshold: 0.85
The gateway treats the response identically to a Lakera or ShieldGemini response.
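As a rough illustration of the generic contract described in this section (POST {"text": ...} in, {"label", "confidence"} out), the sketch below uses only the Python standard library. The keyword check in `classify` is a placeholder for real model inference, and the label values "INJECTION" and "SAFE", the port, and the `serve` helper are all assumptions rather than anything DVARA prescribes.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical stand-in for model inference; replace with your classifier.
SUSPICIOUS = ("ignore previous instructions", "disregard the system prompt")


def classify(text: str) -> dict:
    """Return the {"label", "confidence"} shape the generic provider expects."""
    hit = any(phrase in text.lower() for phrase in SUSPICIOUS)
    return {"label": "INJECTION" if hit else "SAFE",
            "confidence": 0.95 if hit else 0.05}


class ClassifyHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        try:
            text = json.loads(self.rfile.read(length))["text"]
            if not isinstance(text, str):
                raise TypeError("text must be a string")
        except (ValueError, KeyError, TypeError):
            self.send_error(400, "expected JSON body with a string 'text' field")
            return
        body = json.dumps(classify(text)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Blocking server loop; the port is an assumption matching the sample endpoint."""
    HTTPServer(("0.0.0.0", port), ClassifyHandler).serve_forever()
```

Calling `serve()` starts the listener; a production deployment would put a proper model server behind the same request and response shape.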
External guardrail plugins
Plugins are the "everything else" path: any HTTPS endpoint you run yourself, called with an HMAC-signed payload. Use this when you want to integrate your own ML model, a domain-specific rules engine, or a third-party vendor that isn't natively supported.
Plugins run inside the request-side guardrail pipeline, after the built-in guardrails and in parallel with the ML classifier. Each plugin is registered with a unique name, a URL, an HMAC secret, a timeout, and a fail mode that determines what happens when the plugin is unavailable.
Master switch
Turn the plugin subsystem on with a single property:
dvara:
  llm-gateway:
    guardrail:
      plugins:
        enabled: true
With the switch off (the default), plugin definitions are ignored even if they exist in the database.
Managing plugin definitions
Plugin definitions live in the DVARA control plane, not in a YAML file. Three surfaces, picked by audience:
- DVARA Flightdeck Console at /guardrail-plugins (platform owner role). The list view shows tenant-scoped and platform-global definitions together; an optional ?tenant_id= filter narrows to one tenant. Operators create, edit, rotate, and delete both global and per-tenant plugins from here.
- DVARA Flightdeck tenant portal at /portal/guardrail-plugins (tenant admin and developer roles). Scoped to the caller's own tenant; global plugins are not shown here even though they still apply at evaluation time.
- Automation API under /v1/admin/guardrail-plugins (any deployment automation). Reaches the same rows the two UIs do.
Create a plugin:
curl -X POST http://localhost:8090/v1/admin/guardrail-plugins \
-H "Authorization: Bearer $DVARA_PAT" \
-H "Content-Type: application/json" \
-d '{
"name": "internal-jailbreak-classifier",
"url": "https://safety.internal.example.com/classify",
"secret": "<hmac-signing-secret>",
"timeout_ms": 3000,
"fail_mode": "OPEN",
"enabled": true,
"tenant_id": null
}'
Set tenant_id to a tenant ID to scope the plugin to that tenant only; leave it null for a platform-global plugin. Tenant-scoped plugins shadow platform-global plugins with the same name.
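The shadowing rule can be illustrated with a small resolver. This is a sketch under assumptions: `effective_plugins` is a hypothetical helper, and definitions are modelled as plain dicts carrying the name and tenant_id fields from the create payload.

```python
from typing import Dict, List, Optional


def effective_plugins(definitions: List[dict], tenant_id: Optional[str]) -> Dict[str, dict]:
    """Resolve which definition applies per plugin name for one tenant."""
    resolved: Dict[str, dict] = {}
    # Platform-global definitions (tenant_id is None) apply to everyone.
    for d in definitions:
        if d["tenant_id"] is None:
            resolved[d["name"]] = d
    # A tenant-scoped definition replaces a global one with the same name.
    if tenant_id is not None:
        for d in definitions:
            if d["tenant_id"] == tenant_id:
                resolved[d["name"]] = d
    return resolved
```

Definitions scoped to other tenants never appear in the result, and an untenanted request sees only the global rows.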
| Endpoint | Purpose |
|---|---|
| POST /v1/admin/guardrail-plugins | Create a plugin definition (secret encrypted at rest) |
| GET /v1/admin/guardrail-plugins | List definitions (optional ?tenant_id= filter; secrets masked) |
| GET /v1/admin/guardrail-plugins/{id} | Fetch one definition |
| PUT /v1/admin/guardrail-plugins/{id} | Update URL, timeout, fail mode, enabled flag |
| POST /v1/admin/guardrail-plugins/{id}/rotate-secret | Rotate the HMAC signing secret |
| DELETE /v1/admin/guardrail-plugins/{id} | Delete a definition |
| POST /v1/admin/guardrail-plugins/{id}/test | Send a synthetic request to this definition and return latency + detections |
Changes propagate to every running gateway instance within a few seconds; no restart is required.
Runtime registry snapshot. Two read-only endpoints reflect the in-memory registry used by the request path, which can be useful when debugging:
| Endpoint | Purpose |
|---|---|
| GET /v1/admin/guardrail/plugins | List plugins currently live in the request path with their availability status |
| POST /v1/admin/guardrail/plugins/{name}/test | Test a live plugin by name (not by ID) |
Fail modes
- OPEN: if the plugin HTTP call times out or returns an error, the gateway treats the call as producing no detections and the request proceeds. Use this for optional / advisory plugins.
- CLOSED: if the plugin fails, the gateway rejects the request. In the current release this surfaces as HTTP 500 with type: guardrail_plugin_error and code: guardrail_plugin_error; match on the code field rather than the status if you need to distinguish plugin failure from other 500-class errors. Use CLOSED for compliance-critical plugins where a failing check must stop the request.
Start with OPEN for every new plugin. Only promote to CLOSED once you're confident in the plugin's uptime.
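The difference between the two fail modes can be sketched as a wrapper around the plugin call. The names here are hypothetical: `run_plugin` and `GuardrailPluginError` are illustrative, and `call` stands for whatever function performs the HTTP request and raises on a timeout, connection error, or non-200 response.

```python
from typing import Callable, List


class GuardrailPluginError(Exception):
    """Stands in for the guardrail_plugin_error the gateway returns as HTTP 500."""
    code = "guardrail_plugin_error"


def run_plugin(call: Callable[[], List[dict]], fail_mode: str) -> List[dict]:
    """Invoke a plugin and apply its fail mode if the call fails."""
    try:
        return call()
    except Exception as exc:
        if fail_mode == "CLOSED":
            # CLOSED: a failing check must stop the request.
            raise GuardrailPluginError(str(exc)) from exc
        # OPEN: treat the failure as "no detections" and let the request proceed.
        return []
```

A client that needs to tell plugin failures apart from other 500-class errors would match on the `code` value rather than the status.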
Request and response contract
DVARA sends the following payload to the plugin URL:
POST /classify HTTP/1.1
Content-Type: application/json
X-Gateway-Signature: sha256=<hex-hmac>
{
"text": "The concatenated prompt text the gateway wants scored",
"tenant_id": "acme-corp",
"config": {
"strict": true
}
}
- text: the full scannable prompt text for this request, already flattened from the chat messages.
- tenant_id: the calling tenant; an empty string when the request is untenanted.
- config: free-form per-tenant plugin config passed through from tenant metadata. Omitted when empty.
- X-Gateway-Signature: HMAC-SHA256 of the raw request body using the configured signing secret, prefixed with sha256=. Your plugin must verify this header and reject any request whose signature does not match.
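On the plugin side, verifying X-Gateway-Signature might look like the sketch below. It assumes the hex digest is lowercase and that you verify against the raw request bytes before any JSON parsing; `sign` and `verify` are hypothetical helper names.

```python
import hashlib
import hmac


def sign(body: bytes, secret: bytes) -> str:
    """Compute the header value: 'sha256=' plus the hex HMAC-SHA256 of the raw body."""
    return "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(body: bytes, header: str, secret: bytes) -> bool:
    """Return True only if the header matches the signature of the raw body."""
    if not header:
        return False
    expected = sign(body, secret)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode("utf-8"), header.encode("utf-8"))
```

Verify the bytes exactly as received: re-serializing the parsed JSON can change whitespace or key order and produce a different digest.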
Expected response (HTTP 200):
{
"detections": [
{
"category": "JAILBREAK",
"label": "jailbreak-attempt",
"matched_text": "ignore previous instructions",
"risk_score": 0.92,
"rule_id": "jailbreak-v3"
}
]
}
- category: one of the built-in guardrail categories (INJECTION, JAILBREAK, PROFANITY, VIOLENCE, SEXUAL, COMPETITOR_MENTION, TOPIC_RESTRICTION, CONTENT_POLICY, HALLUCINATION, CUSTOM). Unknown values fall back to CUSTOM.
- label: short human-readable label, surfaced in audit events. Defaults to the plugin name.
- matched_text: the fragment that triggered the detection, truncated to 100 characters if longer. Defaults to a truncation of the request text.
- risk_score: detection confidence in [0.0, 1.0]. Defaults to 0.8.
- rule_id: stable identifier for this rule inside your plugin. Defaults to the plugin name.
Return an empty detections array when nothing is flagged. A malformed 200 response is treated as an empty result even in CLOSED mode — the HTTP call succeeded, so it is not a transport failure; add schema tests on your plugin side if this matters.
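The defaulting rules for the response fields can be sketched as a normalizer, which is also a handy starting point for schema tests on the plugin side. `normalize` is a hypothetical function; skipping non-object entries inside detections is an assumption of this sketch, not documented gateway behaviour.

```python
import json
from typing import List

CATEGORIES = {
    "INJECTION", "JAILBREAK", "PROFANITY", "VIOLENCE", "SEXUAL",
    "COMPETITOR_MENTION", "TOPIC_RESTRICTION", "CONTENT_POLICY",
    "HALLUCINATION", "CUSTOM",
}


def normalize(raw_body: str, plugin_name: str, request_text: str) -> List[dict]:
    """Apply the documented defaults to a plugin's HTTP 200 response body."""
    try:
        detections = json.loads(raw_body)["detections"]
    except (ValueError, KeyError, TypeError):
        return []  # malformed 200 body is treated as an empty result
    if not isinstance(detections, list):
        return []
    out = []
    for d in detections:
        if not isinstance(d, dict):
            continue  # assumption: ignore entries that are not objects
        category = d.get("category")
        known = isinstance(category, str) and category in CATEGORIES
        out.append({
            "category": category if known else "CUSTOM",
            "label": d.get("label") or plugin_name,
            # Truncate to 100 characters; fall back to the request text.
            "matched_text": str(d.get("matched_text") or request_text)[:100],
            "risk_score": d.get("risk_score", 0.8),
            "rule_id": d.get("rule_id") or plugin_name,
        })
    return out
```

Note that a body such as `not json` produces an empty list here, mirroring the point above that a malformed 200 is not a transport failure.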
Plugins do not choose their own enforcement action. The action applied when a detection fires is the tenant's configured guardrail action (BLOCK / FLAG / LOG), not something the plugin returns. If you need a single plugin to block while other plugins only log, split the traffic across two tenants with different guardrail.action metadata.
Per-tenant plugin overrides
Tenants can enable, disable, or pass config to individual plugins by setting the guardrail.plugins key in tenant metadata:
{
"guardrail.plugins": {
"internal-jailbreak-classifier": { "enabled": true, "strict": true },
"legal-policy-check": { "enabled": false }
}
}
Any extra keys under a plugin entry are forwarded to the plugin under the config field in the request body.
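The split between the enabled flag and the forwarded config can be sketched as follows. `plugin_settings` is a hypothetical helper, and treating a plugin with no tenant entry as enabled (the `default_enabled` parameter) is an assumption of this sketch.

```python
from typing import Optional, Tuple


def plugin_settings(tenant_metadata: dict, plugin_name: str,
                    default_enabled: bool = True) -> Tuple[bool, Optional[dict]]:
    """Return (enabled, config) for one plugin from tenant metadata."""
    overrides = tenant_metadata.get("guardrail.plugins") or {}
    entry = overrides.get(plugin_name) or {}
    enabled = bool(entry.get("enabled", default_enabled))
    # Every key other than "enabled" is forwarded under "config".
    config = {k: v for k, v in entry.items() if k != "enabled"}
    return enabled, (config or None)  # None means the config field is omitted
```

For the metadata shown above, internal-jailbreak-classifier resolves to enabled with config {"strict": true}, which matches the config object in the sample request payload.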
Metrics
- gateway_plugin_guardrail_total (Counter). Labels: plugin (name), category, action
Error code
- guardrail_plugin_error: raised only when fail_mode: CLOSED and the plugin is unavailable or returns a non-200. Surfaces as HTTP 500 with type: guardrail_plugin_error.
Related
- Guardrails overview — regex-based PII, injection, and content guardrails
- Hallucination Detection — output-side grounding checks
- Observability — guardrail metrics dashboards