ML & Plugin Guardrails

DVARA's regex-based guardrails catch the obvious stuff (profanity, well-known prompt-injection patterns, PII). For harder detections — novel jailbreaks, subtle manipulation, domain-specific policy violations — DVARA ships two extensibility layers:

  1. ML classifier integration — call out to a commercial classifier (Lakera, Google ShieldGemini, or a generic endpoint) for injection / jailbreak scoring
  2. External guardrail plugins — call any HTTPS endpoint you run yourself with an HMAC-signed payload

ML classifier integration

The ML classifier runs as part of the gateway's request-side guardrail pipeline. When enabled, it sends the request text to an external classifier, waits for a score, and applies the configured action (BLOCK / FLAG / LOG) based on whether the score exceeds the confidence threshold.
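
The threshold-and-action step can be pictured with a short Python sketch. It is illustrative only, not DVARA's implementation: the names `Action`, `Decision`, and `decide` are hypothetical, and it assumes that a score equal to the threshold counts as a detection and that only BLOCK stops the request.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    BLOCK = "BLOCK"
    FLAG = "FLAG"
    LOG = "LOG"


@dataclass
class Decision:
    detected: bool             # did the score clear the threshold?
    allow: bool                # does the request continue to the model?
    action: Optional[Action]   # action applied, None when nothing was detected


def decide(score: float, threshold: float, action: Action) -> Decision:
    # Assumption: a score exactly at the threshold is treated as a detection.
    if score < threshold:
        return Decision(detected=False, allow=True, action=None)
    # Assumption: only BLOCK rejects; FLAG and LOG record the detection and continue.
    return Decision(detected=True, allow=action is not Action.BLOCK, action=action)
```

With the default threshold of 0.8, `decide(0.9, 0.8, Action.BLOCK)` rejects the request, while the same score under FLAG or LOG lets it through with the detection recorded.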

Supported providers

| Provider | Value for dvara.llm-gateway.guardrail.ml-classifier.provider | Default endpoint |
| --- | --- | --- |
| Generic | generic | (must be set explicitly) |
| Lakera Guard | lakera | https://api.lakera.ai/v1/guard |
| Google ShieldGemini | shield-gemini | derived from project-id + location |

Configuration

dvara.llm-gateway.guardrail.ml-classifier.enabled=true
dvara.llm-gateway.guardrail.ml-classifier.provider=lakera
dvara.llm-gateway.guardrail.ml-classifier.api-key=${LAKERA_API_KEY}
dvara.llm-gateway.guardrail.ml-classifier.confidence-threshold=0.8
dvara.llm-gateway.guardrail.ml-classifier.timeout-seconds=5
dvara.llm-gateway.guardrail.ml-classifier.cache-max-size=1000
dvara.llm-gateway.guardrail.ml-classifier.cache-ttl-seconds=300

| Property | Default | Description |
| --- | --- | --- |
| enabled | false | Master switch |
| provider | generic | generic, lakera, or shield-gemini |
| endpoint | auto | Auto-defaulted for Lakera / ShieldGemini. Set explicitly for generic. |
| api-key | (none) | Vendor API key. Vendor env vars such as LAKERA_API_KEY or GOOGLE_API_KEY are not read automatically; pass them through by indirection. |
| project-id | (none) | GCP project ID (ShieldGemini only) |
| location | us-central1 | GCP region (ShieldGemini only) |
| confidence-threshold | 0.8 | Detections below this score are ignored |
| timeout-seconds | 5 | HTTP call timeout |
| cache-max-size | 1000 | LRU cache max entries (0 disables the cache) |
| cache-ttl-seconds | 300 | Cache entry TTL, in seconds |

Lakera Guard

Lakera's Guard API scores prompts for prompt injection, jailbreak attempts, PII, and toxic content. DVARA flattens the prompt to a single text field ({"input": "<flattened-text>"}) — not the full messages array — then extracts results[0].flagged and picks the highest-confidence flagged category from results[0].category_scores.

Category mapping covers only two categories. DVARA maps Lakera's jailbreak → JAILBREAK and prompt_injection → INJECTION. Every other Lakera category (PII, toxic content, anything not in the explicit map) falls through to INJECTION in DVARA's audit events. If you need per-category telemetry beyond injection vs. jailbreak, add a separate guardrail plugin that surfaces those distinctions explicitly.
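
The extraction and fall-through rules above can be sketched in Python. This is an illustration under assumptions, not the gateway's code: `map_lakera_result` is a hypothetical name, the response is assumed to carry only the fields named above, and the sketch simply takes the highest-scoring entry of category_scores once the result is flagged.

```python
from typing import Optional, Tuple

# Explicit map described above; every other category falls through to INJECTION.
CATEGORY_MAP = {"jailbreak": "JAILBREAK", "prompt_injection": "INJECTION"}


def map_lakera_result(response: dict) -> Optional[Tuple[str, float]]:
    """Return (dvara_category, score) for a flagged result, or None."""
    results = response.get("results") or []
    if not results or not results[0].get("flagged"):
        return None
    scores = results[0].get("category_scores") or {}
    if not scores:
        return None
    # Assumption: the highest score among the reported categories wins.
    name, score = max(scores.items(), key=lambda item: item[1])
    return CATEGORY_MAP.get(name, "INJECTION"), float(score)
```

Note how a high-scoring category outside the map, such as a PII score, is still reported as INJECTION, which is why per-category telemetry needs a separate plugin.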

dvara-llm-proxy:
  environment:
    DVARA_LLM_GATEWAY_GUARDRAIL_ML_CLASSIFIER_ENABLED: "true"
    DVARA_LLM_GATEWAY_GUARDRAIL_ML_CLASSIFIER_PROVIDER: "lakera"
    DVARA_LLM_GATEWAY_GUARDRAIL_ML_CLASSIFIER_API_KEY: "${LAKERA_API_KEY}"
    DVARA_LLM_GATEWAY_GUARDRAIL_ML_CLASSIFIER_CONFIDENCE_THRESHOLD: "0.85"

The env var DVARA actually reads is DVARA_LLM_GATEWAY_GUARDRAIL_ML_CLASSIFIER_API_KEY — the same name regardless of provider. Vendor-convention names like LAKERA_API_KEY and GOOGLE_API_KEY are not auto-mapped; if you keep your vendor key in a LAKERA_API_KEY env var, indirect through it as shown above so Compose / Kubernetes substitutes it at deploy time.

Google ShieldGemini

ShieldGemini is Google's Vertex AI-hosted safety classifier. DVARA derives the endpoint from project-id and location:

dvara-llm-proxy:
  environment:
    DVARA_LLM_GATEWAY_GUARDRAIL_ML_CLASSIFIER_ENABLED: "true"
    DVARA_LLM_GATEWAY_GUARDRAIL_ML_CLASSIFIER_PROVIDER: "shield-gemini"
    DVARA_LLM_GATEWAY_GUARDRAIL_ML_CLASSIFIER_PROJECT_ID: "my-gcp-project"
    DVARA_LLM_GATEWAY_GUARDRAIL_ML_CLASSIFIER_LOCATION: "us-central1"
    DVARA_LLM_GATEWAY_GUARDRAIL_ML_CLASSIFIER_API_KEY: "${GOOGLE_API_KEY}"

Same env-var rule as Lakera: the auth key must land in DVARA_LLM_GATEWAY_GUARDRAIL_ML_CLASSIFIER_API_KEY. A bare GOOGLE_API_KEY is not picked up — indirect through it as shown.

Caching

ML classifier calls add latency (typically 50–200 ms). DVARA caches results by the SHA-256 of the flattened request text, using an LRU of cache-max-size entries with a TTL of cache-ttl-seconds. Repeat prompts hit the cache and skip the classifier call. Two requests that differ only in their messages framing but flatten to the same text share a cache entry.

Set cache-max-size=0 to disable caching entirely, for example when compliance requires every request to be scored.
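
A minimal Python sketch of this caching behavior follows. It is not DVARA's cache: the class name `ScoreCache` and the injectable `clock` parameter are assumptions made for illustration, and the eviction details of the real gateway may differ.

```python
import hashlib
import time
from collections import OrderedDict
from typing import Callable, Optional


class ScoreCache:
    """LRU cache with per-entry TTL, keyed by SHA-256 of the flattened text."""

    def __init__(self, max_size: int = 1000, ttl_seconds: float = 300,
                 clock: Callable[[], float] = time.monotonic):
        self.max_size = max_size
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries: "OrderedDict[str, tuple]" = OrderedDict()

    @staticmethod
    def key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get(self, text: str) -> Optional[float]:
        k = self.key(text)
        entry = self._entries.get(k)
        if entry is None:
            return None
        score, expires_at = entry
        if self.clock() >= expires_at:
            del self._entries[k]       # expired: drop it and report a miss
            return None
        self._entries.move_to_end(k)   # mark as most recently used
        return score

    def put(self, text: str, score: float) -> None:
        if self.max_size <= 0:         # max_size=0 disables caching
            return
        k = self.key(text)
        self._entries[k] = (score, self.clock() + self.ttl)
        self._entries.move_to_end(k)
        while len(self._entries) > self.max_size:
            self._entries.popitem(last=False)  # evict least recently used
```

Because the key is derived from the flattened text alone, two differently framed requests with the same flattened content map to the same entry, matching the sharing behavior described above.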

Metrics

  • gateway_ml_guardrail_total — Counter, labels: provider, category, action

When to use it

  • You're already paying for Lakera or ShieldGemini and want a second layer beyond regex patterns
  • Your threat model includes novel jailbreaks that regex can't catch
  • Compliance requires an independent safety assessment

If your regex-based guardrails are already catching what you need, adding an ML classifier is optional.

Self-hosted open-source alternatives

You don't need a Lakera or ShieldGemini contract to use the ML hook. The provider: generic path will call any HTTP endpoint that accepts POST {"text": "..."} and returns {"label": "...", "confidence": 0.x}, so any of the following can be deployed as a sidecar and slotted in without changing gateway code.

| Service | License | What it detects | How to point DVARA at it |
| --- | --- | --- | --- |
| ProtectAI LLM Guard | Apache 2.0 | Prompt injection, jailbreak, toxicity, anonymization, code detection (composite Python service) | Deploy on Kubernetes with the published Helm chart; set dvara.llm-gateway.guardrail.ml-classifier.endpoint to its /scan endpoint. |
| NVIDIA NeMo Guardrails | Apache 2.0 | Programmable rails: injection / jailbreak / topic boundaries via Colang policies | Run the NeMo server, configure a rail for prompt injection, point DVARA at the resulting REST endpoint. |
| Microsoft Presidio | MIT | NER-based PII detection | Already integrated via the dedicated PII Detection configuration path; mentioned here for completeness. |

For a deployment that wants a real ML classifier but rejects every SaaS dependency, the practical shape is:

  1. Pull a small open-weight model — for example protectai/deberta-v3-base-prompt-injection-v2 (~150 MB ONNX, Apache 2.0, ~98% F1 on the prompt-injection benchmark) or deepset/deberta-v3-base-injection-v2 (~95% F1).

  2. Serve it via HuggingFace TEI, vLLM, or NVIDIA Triton (all Apache 2.0 / BSD), or via a thin Python wrapper using transformers + fastapi.

  3. Set:

    dvara:
      llm-gateway:
        guardrail:
          ml-classifier:
            enabled: true
            provider: generic
            endpoint: http://injection-classifier.internal:8080/classify
            confidence-threshold: 0.85

The gateway treats the response identically to a Lakera or ShieldGemini response.
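
For orientation, a sidecar that speaks the generic contract (POST {"text": "..."} in, {"label": "...", "confidence": 0.x} out) could look like the standard-library Python sketch below. Everything specific here is an assumption: `score_text` is a keyword placeholder standing in for a real model call, the label strings INJECTION and SAFE are hypothetical, and the sketch assumes confidence expresses how likely the text is an attack, so benign text gets a low value.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def score_text(text: str) -> dict:
    """Placeholder scorer; replace the keyword check with a real model call."""
    suspicious = "ignore previous instructions" in text.lower()
    return {"label": "INJECTION" if suspicious else "SAFE",   # hypothetical labels
            "confidence": 0.99 if suspicious else 0.01}


class ClassifyHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/classify":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
            text = payload["text"]
            if not isinstance(text, str):
                raise TypeError("text must be a string")
        except (ValueError, KeyError, TypeError):
            self.send_error(400, "expected a JSON body with a 'text' string")
            return
        body = json.dumps(score_text(text)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Blocks forever; call this from your own entry point."""
    HTTPServer(("0.0.0.0", port), ClassifyHandler).serve_forever()
```

Calling `serve(8080)` would expose the /classify path used in the endpoint setting above. In a real deployment the placeholder scorer would wrap one of the open-weight models listed in step 1.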


External guardrail plugins

Plugins are the "everything else" path: any HTTPS endpoint you run yourself, which DVARA calls with an HMAC-signed JSON payload. Use this when you want to integrate your own ML model, a domain-specific rules engine, or a third-party vendor that isn't natively supported.

Plugins run inside the request-side guardrail pipeline, after the built-in guardrails and in parallel with the ML classifier. Each plugin is registered with a unique name, a URL, an HMAC secret, a timeout, and a fail mode that determines what happens when the plugin is unavailable.

Master switch

Turn the plugin subsystem on with a single property:

dvara:
  llm-gateway:
    guardrail:
      plugins:
        enabled: true

With the switch off (the default), plugin definitions are ignored even if they exist in the database.

Managing plugin definitions

Plugin definitions live in the DVARA control plane, not in a YAML file. Three surfaces, picked by audience:

  • DVARA Flightdeck Console at /guardrail-plugins — platform owner role. The list view shows tenant-scoped and platform-global definitions together; an optional ?tenant_id= filter narrows to one tenant. Operators create, edit, rotate, and delete both global and per-tenant plugins from here.
  • DVARA Flightdeck tenant portal at /portal/guardrail-plugins — tenant admin and developer roles. Scoped to the caller's own tenant — global plugins are not shown here even though they still apply at evaluation time.
  • Automation API under /v1/admin/guardrail-plugins — any deployment automation; reaches the same rows the two UIs do.

Create a plugin:

curl -X POST http://localhost:8090/v1/admin/guardrail-plugins \
  -H "Authorization: Bearer $DVARA_PAT" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "internal-jailbreak-classifier",
    "url": "https://safety.internal.example.com/classify",
    "secret": "<hmac-signing-secret>",
    "timeout_ms": 3000,
    "fail_mode": "OPEN",
    "enabled": true,
    "tenant_id": null
  }'

Set tenant_id to a tenant ID to scope the plugin to that tenant only; leave it null for a platform-global plugin. Tenant-scoped plugins shadow platform-global plugins with the same name.
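
The shadowing rule can be illustrated with a few lines of Python. This is a sketch of the resolution order only, under the assumption that definitions are plain dicts with name and tenant_id keys; `effective_plugins` is a hypothetical helper, not a DVARA API.

```python
from typing import Dict, List, Optional


def effective_plugins(definitions: List[dict], tenant_id: Optional[str]) -> Dict[str, dict]:
    """Resolve which definition applies per plugin name for one tenant."""
    resolved: Dict[str, dict] = {}
    # Platform-global definitions (tenant_id is null) form the baseline.
    for d in definitions:
        if d.get("tenant_id") is None:
            resolved[d["name"]] = d
    # Tenant-scoped definitions replace same-named globals for that tenant.
    if tenant_id is not None:
        for d in definitions:
            if d.get("tenant_id") == tenant_id:
                resolved[d["name"]] = d
    return resolved
```

A tenant with its own definition named internal-jailbreak-classifier would therefore be evaluated against its own URL and secret, while other tenants keep using the global one.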

| Endpoint | Purpose |
| --- | --- |
| POST /v1/admin/guardrail-plugins | Create a plugin definition (secret encrypted at rest) |
| GET /v1/admin/guardrail-plugins | List definitions (optional ?tenant_id= filter; secrets masked) |
| GET /v1/admin/guardrail-plugins/{id} | Fetch one definition |
| PUT /v1/admin/guardrail-plugins/{id} | Update URL, timeout, fail mode, enabled flag |
| POST /v1/admin/guardrail-plugins/{id}/rotate-secret | Rotate the HMAC signing secret |
| DELETE /v1/admin/guardrail-plugins/{id} | Delete a definition |
| POST /v1/admin/guardrail-plugins/{id}/test | Send a synthetic request to this definition and return latency + detections |

Changes propagate to every running gateway instance within a few seconds; no restart is required.

Runtime registry snapshot. Two read-only endpoints reflect the in-memory registry used by the request path, which can be useful when debugging:

| Endpoint | Purpose |
| --- | --- |
| GET /v1/admin/guardrail/plugins | List plugins currently live in the request path with their availability status |
| POST /v1/admin/guardrail/plugins/{name}/test | Test a live plugin by name (not by ID) |

Fail modes

  • OPEN — if the plugin HTTP call times out or returns an error, the gateway treats the call as producing no detections and the request proceeds. Use this for optional / advisory plugins.
  • CLOSED — if the plugin fails, the gateway rejects the request. In the current release this surfaces as HTTP 500 with type: guardrail_plugin_error and code: guardrail_plugin_error; match on the code field rather than the status if you need to distinguish plugin failure from other 500-class errors. Use CLOSED for compliance-critical plugins where a failing check must stop the request.

Start with OPEN for every new plugin. Only promote to CLOSED once you're confident in the plugin's uptime.
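
The difference between the two modes comes down to what happens in the error path. The Python sketch below shows that shape; `run_plugin` and `GuardrailPluginError` are hypothetical names, and the callable stands in for the actual HTTP call to the plugin.

```python
from typing import Callable, List


class GuardrailPluginError(Exception):
    """Stands in for the guardrail_plugin_error rejection described above."""
    code = "guardrail_plugin_error"


def run_plugin(call: Callable[[], List[dict]], fail_mode: str) -> List[dict]:
    """Invoke a plugin and apply its fail mode when the call itself fails."""
    try:
        return call()
    except Exception as exc:  # timeout, connection error, non-200 status, ...
        if fail_mode == "CLOSED":
            # CLOSED: a failing check must stop the request.
            raise GuardrailPluginError(str(exc)) from exc
        # OPEN: treat the failure as "no detections" and let the request proceed.
        return []
```

A caller that receives the resulting error response would branch on the code field, as recommended above, rather than on the HTTP status.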

Request and response contract

DVARA sends the following payload to the plugin URL:

POST /classify HTTP/1.1
Content-Type: application/json
X-Gateway-Signature: sha256=<hex-hmac>

{
  "text": "The concatenated prompt text the gateway wants scored",
  "tenant_id": "acme-corp",
  "config": {
    "strict": true
  }
}

  • text — the full scannable prompt text for this request, already flattened from the chat messages.
  • tenant_id — the calling tenant, empty string when the request is untenanted.
  • config — free-form per-tenant plugin config passed through from tenant metadata. Omitted when empty.
  • X-Gateway-Signature — HMAC-SHA256 of the raw request body using the configured signing secret, prefixed with sha256=. Your plugin must verify this header and reject any request whose signature does not match.
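
On the plugin side, signature verification might look like the following Python sketch. It assumes the hex digest is lowercase and that the secret is available to the plugin as bytes; `sign` and `verify_signature` are illustrative helper names, not part of any DVARA SDK.

```python
import hashlib
import hmac


def sign(raw_body: bytes, secret: bytes) -> str:
    """Compute the header value for a body: 'sha256=' + hex HMAC-SHA256."""
    digest = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    return "sha256=" + digest


def verify_signature(raw_body: bytes, header_value: str, secret: bytes) -> bool:
    """Check X-Gateway-Signature against the raw, unparsed request body."""
    if not header_value:
        return False
    expected = sign(raw_body, secret)
    # compare_digest runs in constant time, which avoids leaking how many
    # leading characters of the signature matched.
    return hmac.compare_digest(expected.encode("utf-8"), header_value.encode("utf-8"))
```

Verify against the exact bytes received, before any JSON parsing or re-serialization, since even a whitespace change alters the HMAC. A request that fails the check should be rejected without being scored.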

Expected response (HTTP 200):

{
  "detections": [
    {
      "category": "JAILBREAK",
      "label": "jailbreak-attempt",
      "matched_text": "ignore previous instructions",
      "risk_score": 0.92,
      "rule_id": "jailbreak-v3"
    }
  ]
}

  • category — one of the built-in guardrail categories (INJECTION, JAILBREAK, PROFANITY, VIOLENCE, SEXUAL, COMPETITOR_MENTION, TOPIC_RESTRICTION, CONTENT_POLICY, HALLUCINATION, CUSTOM). Unknown values fall back to CUSTOM.
  • label — short human-readable label, surfaced in audit events. Defaults to the plugin name.
  • matched_text — the fragment that triggered the detection, truncated to 100 characters if longer. Defaults to a truncation of the request text.
  • risk_score — detection confidence in [0.0, 1.0]. Defaults to 0.8.
  • rule_id — stable identifier for this rule inside your plugin. Defaults to the plugin name.

Return an empty detections array when nothing is flagged. A malformed 200 response is treated as an empty result even in CLOSED mode — the HTTP call succeeded, so it is not a transport failure; add schema tests on your plugin side if this matters.
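
The defaulting rules for a 200 response can be summarized in a Python sketch. It mirrors the field descriptions above but is not the gateway's parser: `parse_detections` is a hypothetical name, and treating a missing category the same as an unknown one is an assumption.

```python
import json
from typing import List

CATEGORIES = {"INJECTION", "JAILBREAK", "PROFANITY", "VIOLENCE", "SEXUAL",
              "COMPETITOR_MENTION", "TOPIC_RESTRICTION", "CONTENT_POLICY",
              "HALLUCINATION", "CUSTOM"}


def parse_detections(body: str, plugin_name: str, request_text: str) -> List[dict]:
    """Normalize a plugin's 200 response, filling in the documented defaults."""
    try:
        raw = json.loads(body)["detections"]
        if not isinstance(raw, list):
            return []
    except (ValueError, KeyError, TypeError):
        return []  # malformed 200: treated as an empty result
    detections = []
    for d in raw:
        if not isinstance(d, dict):
            continue
        category = d.get("category")
        known = isinstance(category, str) and category in CATEGORIES
        detections.append({
            "category": category if known else "CUSTOM",
            "label": d.get("label") or plugin_name,
            # Fall back to the request text, then truncate to 100 characters.
            "matched_text": str(d.get("matched_text") or request_text)[:100],
            "risk_score": d.get("risk_score", 0.8),
            "rule_id": d.get("rule_id") or plugin_name,
        })
    return detections
```

The early return on malformed input is the behavior that makes schema tests on the plugin side worthwhile: a typo in the detections key silently turns every response into "nothing flagged".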

Plugins do not choose their own enforcement action. The action applied when a detection fires is the tenant's configured guardrail action (BLOCK / FLAG / LOG), not something the plugin returns. If you need a single plugin to block while other plugins only log, split the traffic across two tenants with different guardrail.action metadata.

Per-tenant plugin overrides

Tenants can enable, disable, or pass config to individual plugins by setting the guardrail.plugins key in tenant metadata:

{
  "guardrail.plugins": {
    "internal-jailbreak-classifier": { "enabled": true, "strict": true },
    "legal-policy-check": { "enabled": false }
  }
}

Any extra keys under a plugin entry are forwarded to the plugin under the config field in the request body.
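
A short Python sketch of how an override entry splits into an enabled flag and a forwarded config object. The helper name `resolve_override` is hypothetical, and the sketch assumes a plugin with no override entry stays enabled.

```python
from typing import Tuple


def resolve_override(tenant_metadata: dict, plugin_name: str) -> Tuple[bool, dict]:
    """Return (enabled, config) for one plugin from tenant metadata."""
    overrides = tenant_metadata.get("guardrail.plugins") or {}
    entry = overrides.get(plugin_name) or {}
    # Assumption: plugins without an explicit override remain enabled.
    enabled = bool(entry.get("enabled", True))
    # Every key other than "enabled" is forwarded as the request's config field.
    config = {k: v for k, v in entry.items() if k != "enabled"}
    return enabled, config
```

Applied to the metadata above, internal-jailbreak-classifier resolves to enabled with config {"strict": true}, which matches the config object shown in the request payload earlier.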

Metrics

  • gateway_plugin_guardrail_total — Counter, labels: plugin (name), category, action

Error code

  • guardrail_plugin_error — raised only when fail_mode: CLOSED and the plugin is unavailable or returns a non-200. Surfaces as HTTP 500 with type: guardrail_plugin_error.