Version: 1.3.0

ML & Plugin Guardrails

DVARA's regex-based guardrails catch the obvious stuff (profanity, well-known prompt-injection patterns, PII). For harder detections — novel jailbreaks, subtle manipulation, domain-specific policy violations — DVARA ships two extensibility layers:

ML classifier integration — call out to a commercial classifier (Lakera, Google ShieldGemini, AWS Bedrock Guardrails, Aporia, or a generic endpoint) for injection / jailbreak / policy scoring
External guardrail plugins — call any HTTPS endpoint you run yourself with an HMAC-signed payload

ML classifier integration

The ML classifier runs as part of the gateway's request-side guardrail pipeline. When enabled, it sends the request text to the configured classifier, waits for a score, and applies the configured action (BLOCK / FLAG / LOG) when the score meets or exceeds the confidence threshold.

Supported providers

Provider	Value for `dvara.llm-gateway.guardrail.ml-classifier.provider`	Default endpoint
Generic	`generic`	(must be set explicitly)
Lakera Guard	`lakera`	`https://api.lakera.ai/v1/guard`
Google ShieldGemini	`shield-gemini`	derived from `project-id` + `location`
AWS Bedrock Guardrails	`bedrock-guardrails`	derived from `region` + `guardrail-id` (+ `guardrail-version`)
Aporia	`aporia`	(must be set explicitly — the per-project Guardrails URL from the Aporia dashboard)
In-process (self-hosted)	`onnx-injection`	none — runs a bundled ONNX model in the JVM (see below)

Configuration

dvara.llm-gateway.guardrail.ml-classifier.enabled=true
dvara.llm-gateway.guardrail.ml-classifier.provider=lakera
dvara.llm-gateway.guardrail.ml-classifier.api-key=${LAKERA_API_KEY}
dvara.llm-gateway.guardrail.ml-classifier.confidence-threshold=0.8
dvara.llm-gateway.guardrail.ml-classifier.timeout-seconds=5
dvara.llm-gateway.guardrail.ml-classifier.cache-max-size=1000
dvara.llm-gateway.guardrail.ml-classifier.cache-ttl-seconds=300

Property	Default	Description
`enabled`	`false`	Master switch
`provider`	`generic`	`generic`, `lakera`, `shield-gemini`, `bedrock-guardrails`, `aporia`, or `onnx-injection` (in-process)
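`endpoint`	auto	Defaults to the Lakera URL for `lakera`; derived from `project-id` + `location` for ShieldGemini and from `region` + `guardrail-id` (+ `guardrail-version`) for Bedrock Guardrails. Set explicitly for `generic` and `aporia`. Unused for `onnx-injection`.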
`model-path`	—	`onnx-injection` only — model dir (`model.onnx` + `tokenizer.json`); blank = bundled in the image
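`api-key`	—	Vendor API key (Lakera, ShieldGemini, or Aporia). Read from `DVARA_LLM_GATEWAY_GUARDRAIL_ML_CLASSIFIER_API_KEY`; vendor-convention variables such as `LAKERA_API_KEY` or `GOOGLE_API_KEY` are not auto-mapped.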
`project-id`	—	GCP project ID (ShieldGemini only)
`location`	`us-central1`	GCP region (ShieldGemini only)
`region`	`us-east-1`	AWS region for the Bedrock `ApplyGuardrail` endpoint (Bedrock Guardrails only)
`aws-access-key`	—	AWS access key for SigV4 signing (Bedrock Guardrails only)
`aws-secret-key`	—	AWS secret key for SigV4 signing (Bedrock Guardrails only)
`guardrail-id`	—	Bedrock `guardrailIdentifier` to apply (Bedrock Guardrails only)
`guardrail-version`	`DRAFT`	Bedrock `guardrailVersion` to apply (Bedrock Guardrails only)
`confidence-threshold`	`0.8`	Detections below this score are ignored
`timeout-seconds`	`5`	HTTP call timeout
`cache-max-size`	`1000`	LRU cache max entries (0 disables the cache)
`cache-ttl-seconds`	`300`	Cache entry TTL

Lakera Guard

Lakera's Guard API scores prompts for prompt injection, jailbreak attempts, PII, and toxic content. DVARA flattens the prompt to a single text field ({"input": "<flattened-text>"}) — not the full messages array — then extracts results[0].flagged and picks the highest-confidence flagged category from results[0].category_scores.

Only two categories are mapped explicitly. DVARA maps Lakera's jailbreak → JAILBREAK and prompt_injection → INJECTION. Every other Lakera category (PII, toxic content, anything not in the explicit map) falls through to INJECTION in DVARA's audit events, so if you need per-category telemetry beyond injection vs jailbreak, add a separate guardrail plugin that surfaces those distinctions explicitly.
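The extraction and fall-through behaviour described above can be sketched roughly as follows. This is an illustration only, not DVARA's implementation: the function name map_lakera_result is hypothetical, and the sketch assumes category_scores is a flat name-to-score mapping. Only the field names results[0].flagged and results[0].category_scores come from the description above.

```python
from typing import Optional, Tuple

# The two explicit mappings named in the text; everything else falls through.
LAKERA_TO_DVARA = {"jailbreak": "JAILBREAK", "prompt_injection": "INJECTION"}


def map_lakera_result(response: dict, threshold: float = 0.8) -> Optional[Tuple[str, float]]:
    """Return (dvara_category, score) for a flagged result, or None."""
    results = response.get("results") or []
    if not results or not results[0].get("flagged"):
        return None
    # Assumption: category_scores is a flat {category_name: score} mapping.
    scores = results[0].get("category_scores") or {}
    if not scores:
        return None
    name, score = max(scores.items(), key=lambda item: item[1])
    if score < threshold:
        return None  # below confidence-threshold, so the detection is ignored
    return LAKERA_TO_DVARA.get(name, "INJECTION"), score
```

With this shape, a flagged PII result and a flagged prompt-injection result are indistinguishable downstream, which is why a separate plugin is the suggested route for finer-grained telemetry.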

dvara-gateway:
  environment:
    DVARA_LLM_GATEWAY_GUARDRAIL_ML_CLASSIFIER_ENABLED: "true"
    DVARA_LLM_GATEWAY_GUARDRAIL_ML_CLASSIFIER_PROVIDER: "lakera"
    DVARA_LLM_GATEWAY_GUARDRAIL_ML_CLASSIFIER_API_KEY: "${LAKERA_API_KEY}"
    DVARA_LLM_GATEWAY_GUARDRAIL_ML_CLASSIFIER_CONFIDENCE_THRESHOLD: "0.85"

The env var DVARA actually reads is DVARA_LLM_GATEWAY_GUARDRAIL_ML_CLASSIFIER_API_KEY — the same name regardless of provider. Vendor-convention names like LAKERA_API_KEY and GOOGLE_API_KEY are not auto-mapped; if you keep your vendor key in a LAKERA_API_KEY env var, indirect through it as shown above so Compose / Kubernetes substitutes it at deploy time.

Google ShieldGemini

ShieldGemini is Google's Vertex AI-hosted safety classifier. DVARA derives the endpoint from project-id and location:

dvara-gateway:
  environment:
    DVARA_LLM_GATEWAY_GUARDRAIL_ML_CLASSIFIER_ENABLED: "true"
    DVARA_LLM_GATEWAY_GUARDRAIL_ML_CLASSIFIER_PROVIDER: "shield-gemini"
    DVARA_LLM_GATEWAY_GUARDRAIL_ML_CLASSIFIER_PROJECT_ID: "my-gcp-project"
    DVARA_LLM_GATEWAY_GUARDRAIL_ML_CLASSIFIER_LOCATION: "us-central1"
    DVARA_LLM_GATEWAY_GUARDRAIL_ML_CLASSIFIER_API_KEY: "${GOOGLE_API_KEY}"

Same env-var rule as Lakera: the auth key must land in DVARA_LLM_GATEWAY_GUARDRAIL_ML_CLASSIFIER_API_KEY. A bare GOOGLE_API_KEY is not picked up — indirect through it as shown.

AWS Bedrock Guardrails

Bedrock Guardrails runs the request text through a guardrail you configure in the AWS console (content filters, denied topics, word filters, sensitive-information/PII policies) via the ApplyGuardrail API. DVARA calls the regional bedrock-runtime data-plane endpoint, signed with AWS Signature V4 (service bedrock), and derives the endpoint from region + guardrail-id + guardrail-version.

dvara-gateway:
  environment:
    DVARA_LLM_GATEWAY_GUARDRAIL_ML_CLASSIFIER_ENABLED: "true"
    DVARA_LLM_GATEWAY_GUARDRAIL_ML_CLASSIFIER_PROVIDER: "bedrock-guardrails"
    DVARA_LLM_GATEWAY_GUARDRAIL_ML_CLASSIFIER_REGION: "us-east-1"
    DVARA_LLM_GATEWAY_GUARDRAIL_ML_CLASSIFIER_GUARDRAIL_ID: "abcd1234efgh"
    DVARA_LLM_GATEWAY_GUARDRAIL_ML_CLASSIFIER_GUARDRAIL_VERSION: "DRAFT"
    DVARA_LLM_GATEWAY_GUARDRAIL_ML_CLASSIFIER_AWS_ACCESS_KEY: "${BEDROCK_GUARDRAIL_ACCESS_KEY}"
    DVARA_LLM_GATEWAY_GUARDRAIL_ML_CLASSIFIER_AWS_SECRET_KEY: "${BEDROCK_GUARDRAIL_SECRET_KEY}"

Credentials are static config, not the per-tenant BYOK chain. A guardrail account is typically distinct from your model-inference account, so DVARA deliberately does not reuse the provider.bedrock.* credential resolution here — supply a dedicated access-key / secret-key pair scoped to bedrock:ApplyGuardrail. guardrail-version defaults to DRAFT (the working draft); pin a published version number in production.

Category mapping. A GUARDRAIL_INTERVENED response yields the highest-confidence triggered assessment:

Bedrock assessment	DVARA category	Score
Content filter `PROMPT_ATTACK`	`INJECTION`	band → 0.5 / 0.75 / 0.95 (LOW / MEDIUM / HIGH)
Content filter `SEXUAL` / `VIOLENCE`	`SEXUAL` / `VIOLENCE`	band → 0.5 / 0.75 / 0.95
Content filter `INSULTS` / `HATE` / `MISCONDUCT`	`CONTENT_POLICY`	band → 0.5 / 0.75 / 0.95
Denied topic	`TOPIC_RESTRICTION`	`1.0`
Word filter	`PROFANITY`	`1.0`
Sensitive-information / PII policy	`CONTENT_POLICY`	`1.0`

DVARA's own confidence-threshold still applies on top of Bedrock's decision: with the default 0.8, a content filter that fired at LOW confidence (score 0.5) is dropped by DVARA even though Bedrock flagged it. Topic / word / PII policies report 1.0 (Bedrock has already applied its configured strength) and so always pass the threshold. Lower confidence-threshold if you want DVARA to honour every band Bedrock returns.
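To make the interaction between Bedrock's confidence bands and DVARA's threshold concrete, here is a small sketch. The band scores and category mapping are copied from the table above; the function name gate_content_filter and its signature are hypothetical and are not part of DVARA.

```python
from typing import Optional, Tuple

# Band-to-score values from the mapping table.
BAND_SCORES = {"LOW": 0.5, "MEDIUM": 0.75, "HIGH": 0.95}

# Content-filter types and the DVARA category each one maps to.
CONTENT_FILTER_CATEGORIES = {
    "PROMPT_ATTACK": "INJECTION",
    "SEXUAL": "SEXUAL",
    "VIOLENCE": "VIOLENCE",
    "INSULTS": "CONTENT_POLICY",
    "HATE": "CONTENT_POLICY",
    "MISCONDUCT": "CONTENT_POLICY",
}


def gate_content_filter(filter_type: str, band: str, threshold: float = 0.8) -> Optional[Tuple[str, float]]:
    """Map a triggered content filter to (category, score), or None if gated out."""
    score = BAND_SCORES[band]
    if score < threshold:
        return None  # Bedrock flagged it, but it falls below confidence-threshold
    return CONTENT_FILTER_CATEGORIES[filter_type], score
```

At the default threshold of 0.8 only HIGH-band content filters survive; lowering the threshold to 0.5 lets every band through.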

Aporia

Aporia exposes a per-project Guardrails URL that you copy from the Aporia dashboard and set as endpoint — there is no universal default, so both endpoint and api-key are required.

dvara-gateway:
  environment:
    DVARA_LLM_GATEWAY_GUARDRAIL_ML_CLASSIFIER_ENABLED: "true"
    DVARA_LLM_GATEWAY_GUARDRAIL_ML_CLASSIFIER_PROVIDER: "aporia"
    DVARA_LLM_GATEWAY_GUARDRAIL_ML_CLASSIFIER_ENDPOINT: "https://gr-prd.aporia.com/<your-project>"
    DVARA_LLM_GATEWAY_GUARDRAIL_ML_CLASSIFIER_API_KEY: "${APORIA_API_KEY}"

DVARA sends the request as a single user message with validation_target: prompt and authenticates with the X-APORIA-API-KEY header. Any response action other than passthrough (e.g. block, modify, rephrase) is treated as a detection. The category is inferred from the violated policy name (jailbreak → JAILBREAK, injection / prompt → INJECTION, sexual → SEXUAL, violen* → VIOLENCE, profan* / toxic → PROFANITY, topic / restrict → TOPIC_RESTRICTION, everything else → CONTENT_POLICY). Aporia applies its own per-policy thresholds, so a triggered action is reported at confidence 1.0 unless the payload carries an explicit numeric score — which is then gated by DVARA's confidence-threshold.
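The policy-name inference above might look something like the sketch below. It assumes case-insensitive substring matching applied in the order the rules are listed; the actual matching logic and precedence are not documented here, and the names POLICY_RULES, infer_category, and is_detection are hypothetical.

```python
# (substrings, category) pairs, checked in the order the text lists them.
POLICY_RULES = [
    (("jailbreak",), "JAILBREAK"),
    (("injection", "prompt"), "INJECTION"),
    (("sexual",), "SEXUAL"),
    (("violen",), "VIOLENCE"),
    (("profan", "toxic"), "PROFANITY"),
    (("topic", "restrict"), "TOPIC_RESTRICTION"),
]


def infer_category(policy_name: str) -> str:
    """Infer a DVARA category from the name of the violated Aporia policy."""
    name = policy_name.lower()
    for needles, category in POLICY_RULES:
        if any(needle in name for needle in needles):
            return category
    return "CONTENT_POLICY"  # everything else


def is_detection(action: str) -> bool:
    """Any action other than passthrough counts as a detection."""
    return action != "passthrough"
```

Because the category comes from the policy name, naming Aporia policies predictably (for example, including "jailbreak" in the name of a jailbreak policy) keeps DVARA's audit categories meaningful.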

Self-hosted in-process classifier (no SaaS, no sidecar)

Every provider above is out-of-process — an HTTP call to a SaaS or a sidecar. The onnx-injection provider is the exception: it runs a prompt-injection classifier (protectai/deberta-v3-base-prompt-injection-v2, Apache-2.0) entirely inside the gateway JVM via ONNX Runtime — no HTTP, no API key, no egress, and no per-call network round-trip. This is the path for data-residency / compliance-bound and air-gapped deployments that want ML-quality injection detection without a second service.

dvara:
  llm-gateway:
    guardrail:
      ml-classifier:
        enabled: true
        provider: onnx-injection
        confidence-threshold: 0.85
        # model-path: /opt/models/deberta-injection   # optional; blank = the model bundled in the image

Every prompt is classified in-process. Text scoring at or above confidence-threshold on the injection class becomes an INJECTION detection that flows through the same BLOCK / FLAG / LOG pipeline as the HTTP providers. The provider is fail-safe: if the model can't be loaded, the classifier reports itself unavailable and the gateway falls back to the NoOp classifier (the request path is never broken), and any inference error fails open. The model loads from model-path (a directory containing model.onnx and tokenizer.json) or from the copy bundled in the gateway image, so an air-gapped operator can point model-path at a pre-staged local directory. The provider is CPU-only and does single-purpose injection detection in v1 (multi-category models are a later addition).
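The fail-safe behaviour can be pictured with the sketch below. It is written in Python for brevity even though the gateway runs on the JVM, and every name in it (NoOpClassifier, FailSafeClassifier, build_classifier, and the load_model factory with its score method) is hypothetical; it only illustrates the fall-back-and-fail-open pattern, not how DVARA loads or runs the ONNX model.

```python
class NoOpClassifier:
    """Fallback that never reports a detection."""

    def classify(self, text, threshold=0.85):
        return None


class FailSafeClassifier:
    def __init__(self, load_model):
        # load_model is a hypothetical zero-argument factory that returns an
        # object exposing score(text) -> float for the injection class.
        try:
            self._model = load_model()
            self.available = True
        except Exception:
            self._model = None
            self.available = False

    def classify(self, text, threshold=0.85):
        try:
            score = self._model.score(text)
        except Exception:
            return None  # inference errors fail open
        return ("INJECTION", score) if score >= threshold else None


def build_classifier(load_model):
    """Use the in-process model when it loads, otherwise fall back to NoOp."""
    candidate = FailSafeClassifier(load_model)
    return candidate if candidate.available else NoOpClassifier()
```

The design point is that neither a missing model nor a failing inference call can reject a request; the cost is that such failures silently reduce coverage, so they are worth alerting on.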

Caching

ML classifier calls add latency (typically 50–200 ms). DVARA caches results by the SHA-256 of the flattened request text with an LRU of cache-max-size entries and a TTL of cache-ttl-seconds. Repeat prompts hit the cache and skip the classifier call. Two requests that differ only in their messages framing but flatten to the same text share a cache entry.

Set cache-max-size=0 to disable caching entirely (useful if you want every request scored for compliance reasons).
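For readers who want a mental model of the cache, here is a minimal sketch of an LRU cache with a TTL, keyed by the SHA-256 of the flattened text. The class name ClassifierCache, its method names, and the injectable clock are assumptions made for illustration; DVARA's internal cache is not shown in this document.

```python
import hashlib
import time
from collections import OrderedDict


class ClassifierCache:
    def __init__(self, max_size=1000, ttl_seconds=300, clock=time.monotonic):
        self.max_size = max_size
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._entries = OrderedDict()  # key -> (expires_at, result)

    @staticmethod
    def key_for(flattened_text: str) -> str:
        return hashlib.sha256(flattened_text.encode("utf-8")).hexdigest()

    def get(self, flattened_text):
        if self.max_size == 0:  # a size of 0 disables the cache
            return None
        key = self.key_for(flattened_text)
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, result = entry
        if self.clock() >= expires_at:
            del self._entries[key]  # entry outlived its TTL
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return result

    def put(self, flattened_text, result):
        if self.max_size == 0:
            return
        key = self.key_for(flattened_text)
        self._entries[key] = (self.clock() + self.ttl_seconds, result)
        self._entries.move_to_end(key)
        while len(self._entries) > self.max_size:
            self._entries.popitem(last=False)  # evict the least recently used
```

Keying on the flattened text rather than the raw request is what makes two differently framed requests share an entry.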

Metrics

gateway_ml_guardrail_total — Counter, labels: provider, category, action

When to use it

You're already paying for Lakera or ShieldGemini and want a second layer beyond regex patterns
Your threat model includes novel jailbreaks that regex can't catch
Compliance requires an independent safety assessment

If your regex-based guardrails are already catching what you need, adding an ML classifier is optional.

Self-hosted open-source alternatives

You don't need a Lakera or ShieldGemini contract to use the ML hook. The provider: generic path will call any HTTP endpoint that accepts POST {"text": "..."} and returns {"label": "...", "confidence": 0.x}, so any of the following can be deployed as a sidecar and slotted in without changing gateway code.

Service	License	What it detects	How to point DVARA at it
ProtectAI LLM Guard	Apache 2.0	Prompt injection, jailbreak, toxicity, anonymization, code detection (composite Python service)	Deploy on Kubernetes with the published Helm chart; set `dvara.llm-gateway.guardrail.ml-classifier.endpoint` to its `/scan` endpoint.
NVIDIA NeMo Guardrails	Apache 2.0	Programmable rails — injection / jailbreak / topic boundaries via Colang policies	Run the NeMo server, configure a rail for prompt injection, point DVARA at the resulting REST endpoint.
Microsoft Presidio	MIT	NER-based PII detection	Already integrated via the dedicated PII Detection configuration path; mentioned here for completeness.

For a deployment that wants a real ML classifier but rejects every SaaS dependency, the practical shape is:

Pull a small open-weight model — for example protectai/deberta-v3-base-prompt-injection-v2 (~150 MB ONNX, Apache 2.0, ~98% F1 on the prompt-injection benchmark) or deepset/deberta-v3-base-injection-v2 (~95% F1).
Serve it via HuggingFace TEI, vLLM, or NVIDIA Triton (all Apache 2.0 / BSD), or via a thin Python wrapper using transformers + fastapi.

Set:

dvara:
  llm-gateway:
    guardrail:
      ml-classifier:
        enabled: true
        provider: generic
        endpoint: http://injection-classifier.internal:8080/classify
        confidence-threshold: 0.85

The gateway treats the response identically to a Lakera or ShieldGemini response.
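As a rough sketch of the "thin Python wrapper" option, the following standard-library server implements the generic contract described above: it accepts POST {"text": "..."} and returns {"label": "...", "confidence": 0.x}. The keyword check in classify is a placeholder for a real model call, and the label values (INJECTION, SAFE), handler name, and port are assumptions, not values DVARA is documented to require.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def classify(text: str) -> dict:
    """Placeholder scorer; a real deployment would run a model here."""
    suspicious = "ignore previous instructions" in text.lower()
    return {
        "label": "INJECTION" if suspicious else "SAFE",
        "confidence": 0.99 if suspicious else 0.01,
    }


class ClassifyHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        try:
            length = int(self.headers.get("Content-Length", 0))
            payload = json.loads(self.rfile.read(length))
            text = payload["text"]
            if not isinstance(text, str):
                raise TypeError("text must be a string")
        except (ValueError, KeyError, TypeError):
            self.send_error(400, "expected a JSON body with a string 'text' field")
            return
        body = json.dumps(classify(text)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Serve the classifier on every path, including /classify."""
    HTTPServer(("0.0.0.0", port), ClassifyHandler).serve_forever()
```

Calling serve() starts the endpoint; swapping the body of classify for a transformers pipeline or an ONNX session is the only change needed to back it with a real model.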

External guardrail plugins

Plugins are the "everything else" path: any HTTPS endpoint you run yourself, called with an HMAC-signed payload. Use this when you want to integrate your own ML model, a domain-specific rules engine, or a third-party vendor that isn't natively supported.

Plugins run inside the request-side guardrail pipeline, after the built-in guardrails and in parallel with the ML classifier. Each plugin is registered with a unique name, a URL, an HMAC secret, a timeout, and a fail mode that determines what happens when the plugin is unavailable.

Master switch

Turn the plugin subsystem on with a single property:

dvara:
  llm-gateway:
    guardrail:
      plugins:
        enabled: true

With the switch off (the default), plugin definitions are ignored even if they exist in the database.

Managing plugin definitions

Plugin definitions live in the DVARA control plane, not in a YAML file. Three surfaces, picked by audience:

DVARA Flightdeck Console at /guardrail-plugins — platform owner role. The list view shows tenant-scoped and platform-global definitions together; an optional ?tenant_id= filter narrows to one tenant. Operators create, edit, rotate, and delete both global and per-tenant plugins from here.
DVARA Flightdeck tenant portal at /portal/guardrail-plugins — tenant admin and developer roles. Scoped to the caller's own tenant — global plugins are not shown here even though they still apply at evaluation time.
Automation API under /v1/admin/guardrail-plugins — any deployment automation; reaches the same rows the two UIs do.

Create a plugin:

curl -X POST http://localhost:8090/v1/admin/guardrail-plugins \
  -H "Authorization: Bearer $DVARA_PAT" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "internal-jailbreak-classifier",
    "url": "https://safety.internal.example.com/classify",
    "secret": "<hmac-signing-secret>",
    "timeout_ms": 3000,
    "fail_mode": "OPEN",
    "enabled": true,
    "tenant_id": null
  }'

Set tenant_id to a tenant ID to scope the plugin to that tenant only; leave it null for a platform-global plugin. Tenant-scoped plugins shadow platform-global plugins with the same name.
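The shadowing rule can be illustrated with a short sketch. It assumes plugin definitions are plain dicts with name and tenant_id keys, as in the request body above; the function name effective_plugins is hypothetical and says nothing about how the gateway stores or resolves definitions internally.

```python
def effective_plugins(definitions: list, tenant_id: str) -> dict:
    """Return {name: definition} for a tenant, with tenant rows shadowing globals."""
    effective = {}
    # Start from the platform-global definitions (tenant_id is None).
    for definition in definitions:
        if definition["tenant_id"] is None:
            effective.setdefault(definition["name"], definition)
    # Tenant-scoped definitions replace globals that share the same name.
    for definition in definitions:
        if definition["tenant_id"] == tenant_id:
            effective[definition["name"]] = definition
    return effective
```

A tenant therefore sees its own definition wherever one exists and the global definition otherwise; definitions scoped to other tenants never apply.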

Endpoint	Purpose
`POST /v1/admin/guardrail-plugins`	Create a plugin definition (secret encrypted at rest)
`GET /v1/admin/guardrail-plugins`	List definitions (optional `?tenant_id=` filter; secrets masked)
`GET /v1/admin/guardrail-plugins/{id}`	Fetch one definition
`PUT /v1/admin/guardrail-plugins/{id}`	Update URL, timeout, fail mode, enabled flag
`POST /v1/admin/guardrail-plugins/{id}/rotate-secret`	Rotate the HMAC signing secret
`DELETE /v1/admin/guardrail-plugins/{id}`	Delete a definition
`POST /v1/admin/guardrail-plugins/{id}/test`	Send a synthetic request to this definition and return latency + detections

Changes propagate to every running gateway instance within a few seconds; no restart is required.

Runtime registry snapshot. Two read-only endpoints reflect the in-memory registry used by the request path, which can be useful when debugging:

Endpoint	Purpose
`GET /v1/admin/guardrail/plugins`	List plugins currently live in the request path with their availability status
`POST /v1/admin/guardrail/plugins/{name}/test`	Test a live plugin by name (not by ID)

Fail modes

OPEN — if the plugin HTTP call times out or returns an error, the gateway treats the call as producing no detections and the request proceeds. Use this for optional / advisory plugins.
CLOSED — if the plugin fails, the gateway rejects the request. In the current release this surfaces as HTTP 500 with type: guardrail_plugin_error and code: guardrail_plugin_error; match on the code field rather than the status if you need to distinguish plugin failure from other 500-class errors. Use CLOSED for compliance-critical plugins where a failing check must stop the request.

Start with OPEN for every new plugin. Only promote to CLOSED once you're confident in the plugin's uptime.
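The difference between the two modes comes down to what happens in the error branch. The sketch below assumes a hypothetical zero-argument call_plugin callable that returns a list of detections or raises on timeout, transport error, or a non-200 response; GuardrailPluginError stands in for the gateway's guardrail_plugin_error and is not a real DVARA class.

```python
class GuardrailPluginError(Exception):
    """Stand-in for the gateway's guardrail_plugin_error (surfaced as HTTP 500)."""

    code = "guardrail_plugin_error"


def run_plugin(call_plugin, fail_mode: str) -> list:
    """Invoke a plugin and apply its fail mode to any failure."""
    try:
        return call_plugin()
    except Exception as exc:
        if fail_mode == "CLOSED":
            # CLOSED: a failing check must stop the request.
            raise GuardrailPluginError(str(exc)) from exc
        # OPEN: treat the failure as "no detections" and let the request proceed.
        return []
```

Note that in OPEN mode a plugin outage is invisible to callers, so the plugin metrics are the place to watch for it.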

Request and response contract

DVARA sends the following payload to the plugin URL:

POST /classify HTTP/1.1
Content-Type: application/json
X-Gateway-Signature: sha256=<hex-hmac>

{
  "text": "The concatenated prompt text the gateway wants scored",
  "tenant_id": "acme-corp",
  "config": {
    "strict": true
  }
}

text — the full scannable prompt text for this request, already flattened from the chat messages.
tenant_id — the calling tenant, empty string when the request is untenanted.
config — free-form per-tenant plugin config passed through from tenant metadata. Omitted when empty.
X-Gateway-Signature — HMAC-SHA256 of the raw request body using the configured signing secret, prefixed with sha256=. Your plugin must verify this header and reject any request whose signature does not match.
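On the plugin side, verifying X-Gateway-Signature might look like the sketch below. It assumes the secret is used as UTF-8 bytes and the digest is lowercase hex, neither of which is stated above; the function names are hypothetical. The important details are to hash the raw body bytes exactly as received (before any JSON parsing) and to compare in constant time.

```python
import hashlib
import hmac


def sign_body(raw_body: bytes, secret: str) -> str:
    """Compute the header value: sha256=<hex HMAC-SHA256 of the raw body>."""
    digest = hmac.new(secret.encode("utf-8"), raw_body, hashlib.sha256).hexdigest()
    return "sha256=" + digest


def verify_signature(raw_body: bytes, header_value, secret: str) -> bool:
    """Return True only if the header matches the HMAC of the raw body."""
    expected = sign_body(raw_body, secret).encode("utf-8")
    received = (header_value or "").encode("utf-8")
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(expected, received)
```

A plugin would call verify_signature before parsing the body and respond with an error status (for example 401) when it returns False.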

Expected response (HTTP 200):

{
  "detections": [
    {
      "category": "JAILBREAK",
      "label": "jailbreak-attempt",
      "matched_text": "ignore previous instructions",
      "risk_score": 0.92,
      "rule_id": "jailbreak-v3"
    }
  ]
}

category — one of the built-in guardrail categories (INJECTION, JAILBREAK, PROFANITY, VIOLENCE, SEXUAL, COMPETITOR_MENTION, TOPIC_RESTRICTION, CONTENT_POLICY, HALLUCINATION, CUSTOM). Unknown values fall back to CUSTOM.
label — short human-readable label, surfaced in audit events. Defaults to the plugin name.
matched_text — the fragment that triggered the detection, truncated to 100 characters if longer. Defaults to a truncation of the request text.
risk_score — detection confidence in [0.0, 1.0]. Defaults to 0.8.
rule_id — stable identifier for this rule inside your plugin. Defaults to the plugin name.

Return an empty detections array when nothing is flagged. A malformed 200 response is treated as an empty result even in CLOSED mode — the HTTP call succeeded, so it is not a transport failure; add schema tests on your plugin side if this matters.
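The defaulting rules above can be summarised in code. This sketch applies the documented fallbacks to a raw 200 response body; the function name normalize_detections is hypothetical, and skipping non-object entries inside detections is an assumption rather than documented behaviour.

```python
import json

KNOWN_CATEGORIES = {
    "INJECTION", "JAILBREAK", "PROFANITY", "VIOLENCE", "SEXUAL",
    "COMPETITOR_MENTION", "TOPIC_RESTRICTION", "CONTENT_POLICY",
    "HALLUCINATION", "CUSTOM",
}


def normalize_detections(raw_body: str, plugin_name: str, request_text: str) -> list:
    """Parse a plugin's 200 response and fill in the documented defaults."""
    try:
        detections = json.loads(raw_body)["detections"]
    except (ValueError, KeyError, TypeError):
        return []  # a malformed 200 is treated as an empty result
    if not isinstance(detections, list):
        return []
    normalized = []
    for item in detections:
        if not isinstance(item, dict):
            continue  # assumption: entries that are not objects are skipped
        category = item.get("category")
        known = isinstance(category, str) and category in KNOWN_CATEGORIES
        normalized.append({
            "category": category if known else "CUSTOM",
            "label": item.get("label") or plugin_name,
            "matched_text": str(item.get("matched_text") or request_text)[:100],
            "risk_score": item.get("risk_score", 0.8),
            "rule_id": item.get("rule_id") or plugin_name,
        })
    return normalized
```

Running your plugin's real responses through a check like this in its test suite is one way to catch the malformed-200 case before it silently disables a CLOSED-mode plugin.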

Plugins do not choose their own enforcement action. The action applied when a detection fires is the tenant's configured guardrail action (BLOCK / FLAG / LOG), not something the plugin returns. If you need a single plugin to block while other plugins only log, split the traffic across two tenants with different guardrail.action metadata.

Per-tenant plugin overrides

Tenants can enable, disable, or pass config to individual plugins by setting the guardrail.plugins key in tenant metadata:

{
  "guardrail.plugins": {
    "internal-jailbreak-classifier": { "enabled": true, "strict": true },
    "legal-policy-check": { "enabled": false }
  }
}

Any extra keys under a plugin entry are forwarded to the plugin under the config field in the request body.
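A sketch of how an override entry splits into an enabled flag and a forwarded config is shown below. It assumes the enabled key is consumed by the gateway rather than forwarded, which matches the sample request body earlier (config contains only strict), and that a plugin with no override stays enabled. Both function names are hypothetical.

```python
def resolve_plugin_settings(tenant_metadata: dict, plugin_name: str, default_enabled: bool = True):
    """Return (enabled, config) for one plugin from tenant metadata."""
    overrides = tenant_metadata.get("guardrail.plugins") or {}
    entry = overrides.get(plugin_name) or {}
    enabled = entry.get("enabled", default_enabled)
    # Every key other than "enabled" is forwarded to the plugin as config.
    config = {key: value for key, value in entry.items() if key != "enabled"}
    return enabled, config


def build_request_body(text: str, tenant_id, config: dict) -> dict:
    """Assemble the plugin request body; config is omitted when empty."""
    body = {"text": text, "tenant_id": tenant_id or ""}
    if config:
        body["config"] = config
    return body
```

For the metadata shown above, internal-jailbreak-classifier resolves to enabled with config {"strict": true}, and legal-policy-check resolves to disabled with no config.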

Metrics

gateway_plugin_guardrail_total — Counter, labels: plugin (name), category, action

Error code

guardrail_plugin_error — raised only when fail_mode: CLOSED and the plugin is unavailable or returns a non-200. Surfaces as HTTP 500 with type: guardrail_plugin_error.

Related

Guardrails overview — regex-based PII, injection, and content guardrails
Hallucination Detection — output-side grounding checks
Observability — guardrail metrics dashboards
