Version: 1.3.0

Multimodal & Vision

DVARA routes vision requests — messages that include images alongside text — using the same OpenAI-compatible content array format on POST /v1/chat/completions (image input is also supported on POST /v1/responses, the Responses API). Every governance stage (policy, PII scanning, guardrails, budget, audit) applies to vision requests identically to text-only calls. The image payload itself passes through to the upstream provider; DVARA does not decode or store image bytes.

Supported providers

Provider	Vision	Notes
OpenAI	yes	`gpt-4o`, `gpt-4o-mini`, `gpt-4.1` — native `image_url`
Azure OpenAI	yes	Same models as OpenAI via your Azure deployment
Anthropic	yes	`claude-sonnet-4-5`, `claude-opus-4-1`, `claude-3-5-haiku-20241022` — native
Google Gemini	yes	`gemini-2.0-flash`, `gemini-1.5-pro` — native
AWS Bedrock	yes	Claude vision models hosted on Bedrock (Titan models on Bedrock are text-only)
xAI Grok	yes	`grok-2-vision-1212` and similar — native. Vision is over-declared at the provider level; non-vision Grok variants will return their own model-side error.
Mistral	no	Text models only; Pixtral (`pixtral-12b`, `pixtral-large`) is not supported in this release. See Provider capability gaps.
Cohere, Groq, Ollama	no	Reject with `UNSUPPORTED_CAPABILITY` when image content is sent. See What happens when vision isn't supported.
Qwen, DeepSeek, Moonshot	no	OpenAI-compatible providers — image content forwarded to the upstream API verbatim. The upstream rejects with its own error since these providers' chat models are text-only.
ChatGLM (Zhipu)	model-specific	The provider declares vision capability off at the route level (a conservative under-declaration). The `glm-4v-*` model family is vision-capable: image content is forwarded to the upstream and the model responds. Other glm variants reject.

Vision capability is not pre-checked by the routing layer

Unlike response_format (where DVARA rejects at the routing layer before any upstream call), there is no routing-level filter for image content. The request always reaches whichever provider the route selects — that provider then either handles it (vision-capable) or rejects with its own error (Cohere / Groq / Ollama via UNSUPPORTED_CAPABILITY, OpenAI-compat providers via the upstream's own model-side error). Mistral specifically replaces image blocks with a stringified placeholder and silently sends garbage to the upstream — no error surfaced. The safe configuration is a route that contains only vision-capable providers.

Sending an image by URL

Pass a publicly accessible image URL in the image_url content block:

curl -s -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <your-dvara-api-key>" \
  -d '{
    "model": "gpt-4o",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Describe what you see in this image."
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/280px-PNG_transparency_demonstration_1.png"
            }
          }
        ]
      }
    ]
  }'

Response:

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "model": "gpt-4o",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The image shows a transparent PNG demonstration — coloured spheres on a checkered background that indicates transparency."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 272,
    "completion_tokens": 28,
    "total_tokens": 300
  }
}

Sending an image as base64

For images that aren't publicly accessible, encode the file as base64 and embed it inline as a data: URI:

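# Encode the image as a single line. GNU base64 wraps output at 76 columns,
# which would break the JSON string, so strip newlines (works on macOS too).
IMAGE_B64=$(base64 < /path/to/screenshot.png | tr -d '\n')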

curl -s -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <your-dvara-api-key>" \
  -d "{
    \"model\": \"gpt-4o\",
    \"messages\": [
      {
        \"role\": \"user\",
        \"content\": [
          {
            \"type\": \"text\",
            \"text\": \"What error is shown in this screenshot?\"
          },
          {
            \"type\": \"image_url\",
            \"image_url\": {
              \"url\": \"data:image/png;base64,${IMAGE_B64}\"
            }
          }
        ]
      }
    ]
  }"

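Supported MIME types depend on the upstream provider. OpenAI and Anthropic accept image/png, image/jpeg, image/gif, and image/webp; Gemini accepts image/png, image/jpeg, image/heif, and image/webp. See Image size and MIME limits for the per-provider table.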

Python example

import base64
from pathlib import Path
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="<your-dvara-api-key>",
)

# Option 1 — image URL
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/chart.png"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)

# Option 2 — base64 from disk
image_data = base64.standard_b64encode(Path("chart.png").read_bytes()).decode("utf-8")

response = client.chat.completions.create(
    model="claude-sonnet-4-5",   # or gpt-4o, gemini-2.0-flash, etc.
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Summarise the trend in this chart."},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_data}"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
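
Option 2 hardcodes image/png in the data: URI. As a minimal sketch, the helper below infers the MIME type from the file extension with the standard library. The name image_to_data_url is hypothetical and not part of DVARA or the OpenAI SDK, and whether the upstream accepts the resulting type still depends on the provider.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path: str) -> str:
    """Return a data: URI for the image at `path` (hypothetical helper)."""
    # Guess the MIME type from the file extension only; the bytes are not inspected.
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type for {path!r}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

The returned string can be passed as the "url" value of an image_url content block, in place of the f-string in Option 2.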

Multiple images in one request

All vision-capable providers support multiple images in a single message:

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Compare these two diagrams."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/before.png"},
                },
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/after.png"},
                },
            ],
        }
    ],
)
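
When the number of images varies, the content array can be assembled programmatically. The sketch below assumes nothing beyond the content-block shapes shown above; build_vision_content is a hypothetical helper name.

```python
from typing import Dict, List


def build_vision_content(prompt: str, image_urls: List[str]) -> List[Dict]:
    """Return a content array: one text block, then one image_url block per URL."""
    content: List[Dict] = [{"type": "text", "text": prompt}]
    for url in image_urls:
        content.append({"type": "image_url", "image_url": {"url": url}})
    return content


# Same request body as the two-image example above, built from a list.
messages = [
    {
        "role": "user",
        "content": build_vision_content(
            "Compare these two diagrams.",
            ["https://example.com/before.png", "https://example.com/after.png"],
        ),
    }
]
```

The resulting messages list can be passed directly as the messages argument of client.chat.completions.create.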

Route setup for vision traffic

Configure a route whose pool contains only vision-capable providers. A vision request that hits a non-vision provider either errors out (Cohere / Groq / Ollama / Qwen / DeepSeek / Moonshot, and non-vision glm variants) or silently returns an incorrect response (Mistral). A vision-only route avoids both classes of failure.

Create one via the Admin API:

curl -s -X POST http://localhost:8090/v1/admin/routes \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <admin-pat>" \
  -d '{
    "id": "vision-route",
    "model_pattern": "gpt-4o*",
    "strategy": "weighted",
    "providers": [
      {"provider": "openai",    "weight": 60},
      {"provider": "anthropic", "weight": 30},
      {"provider": "gemini",    "weight": 10}
    ]
  }'

A vision request matching gpt-4o* is distributed across OpenAI, Anthropic, and Gemini by weight; if one provider's circuit breaker opens, the resilience layer fails over to the remaining two. See Routing for strategy details.
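
Because the routing layer does not filter on vision capability, it can help to lint a route definition before posting it. The sketch below is a client-side check, not a DVARA feature. The provider ids "azure-openai", "bedrock", and "xai" are assumed spellings (only "openai", "anthropic", and "gemini" appear in the example above), and traffic_shares assumes that shares are simply proportional to weight.

```python
from typing import Dict, List

# Providers marked "yes" in the Supported providers table.
# Ids other than openai / anthropic / gemini are assumed spellings.
VISION_CAPABLE = {"openai", "azure-openai", "anthropic", "gemini", "bedrock", "xai"}


def non_vision_providers(route: Dict) -> List[str]:
    """Return the providers in the route's pool that are not in VISION_CAPABLE."""
    return [
        entry["provider"]
        for entry in route.get("providers", [])
        if entry["provider"] not in VISION_CAPABLE
    ]


def traffic_shares(route: Dict) -> Dict[str, float]:
    """Return each provider's expected share, assuming share = weight / total."""
    total = sum(entry["weight"] for entry in route["providers"])
    return {entry["provider"]: entry["weight"] / total for entry in route["providers"]}


route = {
    "id": "vision-route",
    "model_pattern": "gpt-4o*",
    "strategy": "weighted",
    "providers": [
        {"provider": "openai", "weight": 60},
        {"provider": "anthropic", "weight": 30},
        {"provider": "gemini", "weight": 10},
    ],
}

offenders = non_vision_providers(route)  # empty list for the route above
shares = traffic_shares(route)
```

An empty offenders list means every provider in the pool is on the vision-capable list; anything else is a candidate for removal before the route goes live.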

What happens when vision isn't supported

Vision capability is not pre-checked by the routing layer. Behaviour when image content reaches a non-vision provider depends on the provider:

Provider	Behaviour when image content arrives
Cohere, Groq, Ollama	Reject with `UNSUPPORTED_CAPABILITY` (HTTP 400, `invalid_request_error`).
Qwen, DeepSeek, Moonshot	Image forwarded to the upstream OpenAI-compatible endpoint as a `data:` URI. The upstream model is text-only and rejects with its own provider error. No special DVARA handling.
ChatGLM	Same forward-to-upstream path. The `glm-4v-*` family accepts the image and responds normally; other glm variants reject.
Mistral	Image blocks are silently replaced with a stringified placeholder before the request leaves DVARA. The upstream gets garbage; the response is incorrect; no error surfaced. Do not put Mistral on a vision route.

The safest configuration: route only to providers in the "vision: yes" rows of the Supported providers table.
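
A client can tell the Cohere / Groq / Ollama rejection apart from other 400s by looking for the UNSUPPORTED_CAPABILITY marker. The sketch below assumes an OpenAI-style error envelope ({"error": {"type", "code", "message"}}); the exact field that carries the marker is not specified here, so the function checks all three. is_unsupported_capability is a hypothetical helper name.

```python
from typing import Any, Dict


def is_unsupported_capability(status: int, body: Dict[str, Any]) -> bool:
    """True when a response looks like the UNSUPPORTED_CAPABILITY rejection.

    Assumes an OpenAI-style envelope: {"error": {"type", "code", "message"}}.
    """
    if status != 400:
        return False
    error = body.get("error") or {}
    # The field carrying the marker is an assumption, so check each candidate.
    fields = (error.get("code"), error.get("type"), error.get("message"))
    return any("UNSUPPORTED_CAPABILITY" in str(f) for f in fields if f is not None)
```

This only catches explicit rejections. The Mistral case returns a normal-looking response, so no client-side error check can detect it.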

Image size and MIME limits

DVARA does not pre-validate image payloads — size, dimensions, and MIME type are checked by the upstream provider, and oversized or unsupported images surface as the provider's own error. Per-provider caps as of this writing:

Provider	Max per-image size	Accepted MIME types
OpenAI / Azure OpenAI	20 MB	`image/png`, `image/jpeg`, `image/gif`, `image/webp`
Anthropic	5 MB	`image/png`, `image/jpeg`, `image/gif`, `image/webp`
Google Gemini	7 MB inline (larger via Files API, not exposed by DVARA today)	`image/png`, `image/jpeg`, `image/heif`, `image/webp`
AWS Bedrock	3.75 MB per image (Claude on Bedrock)	`image/png`, `image/jpeg`, `image/gif`, `image/webp`

Numbers shift over time — when in doubt, treat the smallest cap on your route as the binding limit and verify against the upstream's docs. Base64 encoding inflates payload by ~33%, so a 5 MB raw image lands as ~6.7 MB on the wire — relevant when comparing against the cap.
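
The base64 arithmetic and the "smallest cap wins" rule can be sketched as a pre-flight check. The caps are copied from the table above and will drift. Treating 1 MB as 1,000,000 bytes and comparing the encoded size against the cap are both conservative assumptions, since providers differ on which size they measure. The provider keys are illustrative.

```python
from typing import Iterable

MB = 1_000_000  # assumption: decimal megabytes; some providers may mean MiB

# Per-image caps from the table above, in bytes (keys are illustrative ids).
CAPS = {
    "openai": 20 * MB,
    "anthropic": 5 * MB,
    "gemini": 7 * MB,
    "bedrock": 3_750_000,
}


def base64_size(raw_bytes: int) -> int:
    """Length in bytes of the padded base64 encoding of `raw_bytes` bytes."""
    return 4 * ((raw_bytes + 2) // 3)


def binding_cap(providers: Iterable[str]) -> int:
    """Smallest per-image cap among the providers on a route."""
    return min(CAPS[p] for p in providers)


def fits(raw_bytes: int, providers: Iterable[str]) -> bool:
    """Conservative check: the encoded size must fit under the smallest cap."""
    return base64_size(raw_bytes) <= binding_cap(providers)
```

For example, base64_size(5 * MB) is 6,666,668 bytes, which is the ~6.7 MB figure quoted above.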

Vision token cost

Vision requests cost more tokens than text-only requests of similar apparent size. Each provider has its own counting rule:

OpenAI / Azure OpenAI — tile-based: a low-detail image is a fixed 85 tokens; high-detail scales with megapixels and aspect ratio (typically 765 tokens for a 1024×1024 image, more for larger).
Anthropic — roughly (width × height) / 750 tokens, capped per-image.
Gemini — 258 tokens per image up to a baseline resolution; rescaled images count similarly.
Bedrock (Claude) — same shape as Anthropic.
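
For budget sizing, the rules above can be turned into rough estimators. This is a sketch, not an exact accounting: the OpenAI function follows the published gpt-4o tile rule (fit within 2048×2048, shrink so the shortest side is at most 768, then 170 tokens per 512×512 tile plus 85), and other models may use different constants. The Anthropic function omits the per-image cap, and the Gemini figure is the flat baseline only. The upstream-reported count remains authoritative.

```python
import math

GEMINI_BASELINE_TOKENS = 258  # flat per-image figure at baseline resolution


def openai_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate tokens for one image under the gpt-4o tile rule."""
    if detail == "low":
        return 85
    # Step 1: shrink to fit within 2048 x 2048, keeping the aspect ratio.
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Step 2: shrink so the shortest side is at most 768 (no upscaling assumed).
    scale = min(1.0, 768 / min(w, h))
    w, h = int(w * scale), int(h * scale)
    # Step 3: count the 512 x 512 tiles needed to cover the result.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles


def anthropic_image_tokens(width: int, height: int) -> int:
    """Estimate tokens as (width x height) / 750, ignoring the per-image cap."""
    return round(width * height / 750)
```

openai_image_tokens(1024, 1024) returns 765, matching the figure quoted above; a 2048×4096 image comes to 1105.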

Budget caps are evaluated against the actual upstream-reported token count, so a tenant sending high-resolution images will hit a budget faster than the headline "messages per minute" suggests. Size budget caps for vision-heavy tenants with this in mind, and consider downsampling images client-side before posting if the use case allows.

Governance on vision requests

Policy, PII scanning, guardrails, budget, and audit all apply to vision requests. A few details specific to image input:

Policy rules evaluate the text portions of the message (the text content blocks). Image bytes are not evaluated against text-based policy conditions.
PII scanning runs on the text blocks only. Image bytes are not decoded or scanned — if PII is rendered as text inside an image (a screenshot of an SSN, for example), PII detection will not catch it.
Audit records every vision request as a GATEWAY_RESPONSE event, including model, provider, token counts, and the text portions of the content array. Image bytes are never stored in the audit trail.
Budget enforcement counts vision tokens the same way as text tokens — see Vision token cost above for the per-provider counting rules. Set budget caps accordingly for tenants that send images.
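
To see what text-based policy and PII checks have to work with, the sketch below extracts the text portions of a request the way the points above describe. It illustrates the visibility boundary only and is not DVARA's implementation; text_portions is a hypothetical name.

```python
from typing import Any, Dict, List


def text_portions(messages: List[Dict[str, Any]]) -> List[str]:
    """Collect the text a text-only scanner would see; image blocks are skipped."""
    texts: List[str] = []
    for message in messages:
        content = message.get("content")
        if isinstance(content, str):
            # Plain string content is text in its entirety.
            texts.append(content)
        elif isinstance(content, list):
            for block in content:
                if block.get("type") == "text":
                    texts.append(block.get("text", ""))
    return texts


request_messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What error is shown in this screenshot?"},
            {"type": "image_url", "image_url": {"url": "data:image/png;base64,AAAA"}},
        ],
    }
]

visible = text_portions(request_messages)
```

Here visible holds only the question text. Anything rendered inside the image never appears in it, which is why a screenshot of sensitive data passes through unscanned.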

Limitations

A few cases worth knowing before you wire up a vision-capable client.

No routing-level capability filter. Image content reaches the chosen provider regardless of whether that provider is vision-capable. Use a route that contains only vision-capable providers.
No image-byte audit. The image data itself is never logged. The audit trail captures the text content blocks and metadata only.
No image-byte scanning. PII or guardrail rules don't decode images — text rendered into an image evades both.
No pre-validation of size or MIME. Oversized or unsupported images surface as the upstream's own error, not a DVARA-level rejection.
Mistral's silent garble. Mistral specifically replaces image blocks with a stringified placeholder and dispatches the resulting nonsense to the upstream. The response will be incorrect with no error returned. Don't put Mistral on a vision route.

Next steps

Provider Setup — capabilities matrix and which providers support vision
Routing — configure weighted, latency-aware, or canary routes for vision traffic
Cost Management — vision requests generate higher token counts; set budget caps accordingly
