Skip to main content

Prompts & Evals

DVARA's prompt management surface has four tightly related features:

  1. Prompt templates — versioned, variable-substituted prompts with a lifecycle (DRAFT → ACTIVE → ARCHIVED).
  2. Prompt experiments — A/B tests across template variants with per-variant metrics.
  3. Golden prompts + model fingerprinting + drift reports — periodically re-run a fixed prompt against a model to catch silent model changes.
  4. Eval pipeline — batch-run a suite of prompts against a model and score the results.

All four are admin-only surfaces.

Prompt templates

Open Prompts → Templates in the sidebar (or use the Automation API at /v1/admin/prompts).

Template list

Lists every template with ID, name, tenant, status badge (DRAFT / ACTIVE / ARCHIVED), version, and last-updated timestamp. Use the Tenant dropdown filter to narrow the list.

Prompt template list with status badges, version numbers, and tag pillsPrompt template list with status badges, version numbers, and tag pills
Figure 1. Prompt template list with status badges, version numbers, and tag pills

Create / edit template

Click New Template to create a template with:

  • Name — human-readable template name
  • Tenant ID — scope to a tenant (blank = global)
  • System prompt — optional system message, with {{placeholders}} for variables
  • User prompt — the main prompt body, also with {{placeholders}}
  • Status — DRAFT, ACTIVE, or ARCHIVED

Variables are auto-extracted from {{name}} tokens on save. A template starts in DRAFT status; only ACTIVE templates can be referenced from the Playground or request.metadata["prompt_template_id"] on chat completions.

Template editor with variable placeholders, version history, status controls, and render previewTemplate editor with variable placeholders, version history, status controls, and render preview
Figure 2. Template editor with variable placeholders, version history, status controls, and render preview

Render preview

Click Preview (or call POST /v1/admin/prompts/{id}/render) with a map of variable values to see the rendered system / user text before saving. The render endpoint is also how the playground's template loader works.

Version history & rollback

Same mechanism as routes and policies: last 10 versions retained, diff viewer highlights changed fields, rollback creates a new version with the restored content.

Error codes: PROMPT_TEMPLATE_NOT_FOUND (404), PROMPT_TEMPLATE_NOT_ACTIVE (400), PROMPT_VARIABLE_MISSING (400), PROMPT_VERSION_NOT_FOUND (404), INVALID_TEMPLATE_STATUS (400).

Prompt experiments

Open Prompts → Experiments in the sidebar.

An experiment is an A/B test across multiple template variants. Each variant gets a weight (percentage of traffic routed to that variant), and the weights must sum to 100. DVARA tracks per-variant request count, token usage, cost, latency, and error rate in a dedicated metrics collector.

Experiment list

Lists every experiment with ID, name, tenant, status (DRAFT / RUNNING / PAUSED / COMPLETED), variant count, and start time. Filter by tenant.

Experiment list with RUNNING status badge and variant countExperiment list with RUNNING status badge and variant count
Figure 3. Experiment list with RUNNING status badge and variant count

Create / edit experiment

  • Name — identifier
  • Tenant ID — scope to a tenant (blank = global)
  • Variants — a list of {templateId, weight} pairs. Weights must total 100. DVARA rejects lopsided or incomplete configurations.
  • Status — DRAFT to start; transition through RUNNING → PAUSED → COMPLETED.

Lifecycle

  • DRAFT — editable, no traffic is routed
  • RUNNING — traffic is split by weight; metrics collected
  • PAUSED — traffic reverts to the default template, metrics preserved
  • COMPLETED — terminal; read-only

Metrics report

GET /v1/admin/prompts/experiments/{id}/report returns a per-variant snapshot: requests, tokens (in/out), total cost, average latency, error rate. The UI renders this as a side-by-side comparison table.

Experiment report showing variant weights and metrics panelExperiment report showing variant weights and metrics panel
Figure 4. Experiment report showing variant weights and metrics panel

Error codes: PROMPT_EXPERIMENT_NOT_FOUND (404), PROMPT_EXPERIMENT_NOT_RUNNING (400 on metric submissions when the experiment is not in RUNNING state).

Golden prompts & model fingerprinting

Golden prompts are the canaries in your coal mine for silent model changes. DVARA periodically runs a fixed prompt against a model, hashes the response (and a feature-extracted fingerprint), and compares it to a baseline. When the fingerprint diverges beyond a threshold, a drift report is generated.

Open Prompts → Golden Prompts in the sidebar. Requires enterprise license.

Golden prompt list

Lists every golden prompt with ID, name, model, last fingerprint timestamp, drift status badge, and actions. Filter by model.

Golden prompts list with model, drift threshold, baseline status, and enabled badgesGolden prompts list with model, drift threshold, baseline status, and enabled badges
Figure 5. Golden prompts list with model, drift threshold, baseline status, and enabled badges

Create golden prompt

  • Name — identifier
  • Model — the exact model name to test (e.g. gpt-4o-2024-08-06)
  • Prompt — the canary prompt body
  • Expected response — optional, used for stricter match
  • Similarity threshold — cosine similarity below which a result is considered drifted (default: 0.85)
Create golden prompt form with model, prompt, baseline response, and drift thresholdCreate golden prompt form with model, prompt, baseline response, and drift threshold
Figure 6. Create golden prompt form with model, prompt, baseline response, and drift threshold

Trigger a fingerprint check

Click Check Now (or POST /v1/admin/golden-prompts/{id}/fingerprint) to immediately re-run the prompt and compute a new fingerprint. The response includes the current and baseline fingerprints, similarity score, and drift flag.

Drift report

GET /v1/admin/golden-prompts/{id}/drift?days=7 returns the drift history for the last N days (default: 7), including a time series of similarity scores and any events where the threshold was breached.

Drift report with similarity scores, drift rate, and trend indicatorDrift report with similarity scores, drift rate, and trend indicator
Figure 7. Drift report with similarity scores, drift rate, and trend indicator

Under the hood, DVARA vectorizes each response with an embedding model, stores the resulting fingerprint alongside the baseline, and computes cosine similarity on every check. When the similarity drops below the configured threshold, a drift event is recorded.

Eval pipeline

Open Prompts → Eval Prompts (or Eval Reports) in the sidebar. Requires enterprise license.

The eval pipeline is the batch counterpart to golden prompts — run a suite of prompts against a model and produce a scored report rather than a single drift event. Typical use cases: regression-testing a prompt template before promoting it to ACTIVE, or qualifying a new model before routing production traffic.

Eval prompts

CRUD at /v1/admin/eval/prompts. Each eval prompt has:

  • Name — identifier
  • Prompt — the input to send
  • Expected — the expected output (exact match or semantic similarity)
  • Scoringexact-match, embedding-similarity, or custom
Eval prompts list with model, judge model, pass threshold, and enabled statusEval prompts list with model, judge model, pass threshold, and enabled status
Figure 8. Eval prompts list with model, judge model, pass threshold, and enabled status

Running an eval

  • Single eval: POST /v1/admin/eval/{id}/run — runs one prompt against a specified model and returns the scored result.
  • Suite run: POST /v1/admin/eval/suite?model=gpt-4o — runs every configured eval prompt against the given model and produces a report with pass / fail counts, scores, and per-prompt results.

Reports

  • GET /v1/admin/eval/reports — list all eval reports, optionally filtered by ?model=
  • GET /v1/admin/eval/reports/{id} — detailed view of a single report
  • GET /v1/admin/eval/reports/model/{model} — reports for a specific model in a time range
Eval reports list with pass rate, avg score, and run-suite formEval reports list with pass rate, avg score, and run-suite form
Figure 9. Eval reports list with pass rate, avg score, and run-suite form
Eval report detail with summary cards and per-prompt PASS/FAIL resultsEval report detail with summary cards and per-prompt PASS/FAIL results
Figure 10. Eval report detail with summary cards and per-prompt PASS/FAIL results

The eval runner fans out prompts in parallel, respects each model's context window, and aggregates scores into a single report. Reports and per-prompt results are persisted so you can compare runs across models and over time.

API summary

EndpointPurpose
POST /v1/admin/promptsCreate prompt template
GET /v1/admin/promptsList templates (filter by ?tenant_id=)
PUT /v1/admin/prompts/{id}Update template (increments version)
POST /v1/admin/prompts/{id}/statusChange DRAFT/ACTIVE/ARCHIVED
POST /v1/admin/prompts/{id}/renderPreview rendered template
POST /v1/admin/prompts/{id}/rollbackRollback to a prior version
POST /v1/admin/prompts/experimentsCreate experiment
POST /v1/admin/prompts/experiments/{id}/statusLifecycle transitions
GET /v1/admin/prompts/experiments/{id}/reportPer-variant metrics
POST /v1/admin/golden-promptsCreate golden prompt
POST /v1/admin/golden-prompts/{id}/fingerprintTrigger fingerprint check
GET /v1/admin/golden-prompts/{id}/driftDrift history
POST /v1/admin/eval/promptsCreate eval prompt
POST /v1/admin/eval/{id}/runRun a single eval
POST /v1/admin/eval/suite?model=XRun the full eval suite against a model
GET /v1/admin/eval/reportsList eval reports

Every write through the Automation API emits an audit event so a compliance trail of prompt and eval changes exists alongside the Console UI's own emit: PROMPT_TEMPLATE_CREATED / _UPDATED / _STATUS_CHANGED / _ROLLED_BACK / _DELETED, PROMPT_EXPERIMENT_CREATED / _UPDATED / _STATUS_CHANGED / _RESET / _DELETED, GOLDEN_PROMPT_CREATED / _UPDATED / _DELETED / _FINGERPRINTED, EVAL_PROMPT_CREATED / _UPDATED / _DELETED / _RUN, and EVAL_SUITE_RUN.