
Multimodal & Vision

DVARA routes vision requests — messages that include images alongside text — using the same OpenAI-compatible content array format. Every governance stage (policy, PII scanning, guardrails, budget, audit) applies to vision requests identically to text-only calls. The image payload itself passes through to the upstream provider; DVARA does not decode or store image bytes.

Supported providers

| Provider | Vision | Notes |
| --- | --- | --- |
| OpenAI | yes | gpt-4o, gpt-4o-mini, gpt-4.1; native image_url |
| Azure OpenAI | yes | Same models as OpenAI via your Azure deployment |
| Anthropic | yes | claude-sonnet-4-5, claude-opus-4-1, claude-3-5-haiku-20241022; native |
| Google Gemini | yes | gemini-2.0-flash, gemini-1.5-pro; native |
| AWS Bedrock | yes | Claude vision models hosted on Bedrock (Titan models on Bedrock are text-only) |
| xAI Grok | yes | grok-2-vision-1212 and similar; native. Vision is over-declared at the provider level; non-vision Grok variants will return their own model-side error. |
| Mistral | no | Text models only; Pixtral (pixtral-12b, pixtral-large) is not supported in this release. See Provider capability gaps. |
| Cohere, Groq, Ollama | no | Reject with UNSUPPORTED_CAPABILITY when image content is sent. See What happens when vision isn't supported. |
| Qwen, DeepSeek, Moonshot | no | OpenAI-compatible providers; image content is forwarded to the upstream API verbatim. The upstream rejects with its own error since these providers' chat models are text-only. |
| ChatGLM (Zhipu) | model-specific | Provider declares vision capability off at the route level (conservatively under-declared). The glm-4v-* model family is vision-capable; image content is forwarded to the upstream and the model responds. Other glm variants reject. |

Vision capability is not pre-checked by the routing layer

Unlike response_format (where DVARA rejects at the routing layer before any upstream call), there is no routing-level filter for image content. The request always reaches whichever provider the route selects. That provider then either handles it (vision-capable) or rejects it: Cohere, Groq, and Ollama return UNSUPPORTED_CAPABILITY, and OpenAI-compatible providers return the upstream's own model-side error. Mistral is the exception: its image blocks are replaced with a stringified placeholder before the request leaves DVARA, so the upstream receives garbage and no error is surfaced. The safe configuration is a route that contains only vision-capable providers.

Sending an image by URL

Pass a publicly accessible image URL in the image_url content block:

curl -s -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <your-dvara-api-key>" \
  -d '{
    "model": "gpt-4o",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Describe what you see in this image."
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/280px-PNG_transparency_demonstration_1.png"
            }
          }
        ]
      }
    ]
  }'

Response:

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "model": "gpt-4o",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The image shows a transparent PNG demonstration — coloured spheres on a checkered background that indicates transparency."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 272,
    "completion_tokens": 28,
    "total_tokens": 300
  }
}

Sending an image as base64

For images that aren't publicly accessible, encode the file as base64 and embed it inline as a data: URI:

# Encode the image. Reading from stdin and stripping newlines works with both
# GNU and BSD/macOS base64 (GNU wraps output at 76 columns by default, which
# would break the JSON string below).
IMAGE_B64=$(base64 < /path/to/screenshot.png | tr -d '\n')

curl -s -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <your-dvara-api-key>" \
  -d "{
    \"model\": \"gpt-4o\",
    \"messages\": [
      {
        \"role\": \"user\",
        \"content\": [
          {
            \"type\": \"text\",
            \"text\": \"What error is shown in this screenshot?\"
          },
          {
            \"type\": \"image_url\",
            \"image_url\": {
              \"url\": \"data:image/png;base64,${IMAGE_B64}\"
            }
          }
        ]
      }
    ]
  }"

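Supported MIME types depend on the upstream provider. OpenAI and Anthropic accept image/png, image/jpeg, image/gif, and image/webp; Gemini accepts image/png, image/jpeg, image/webp, and image/heif, but not image/gif. See Image size and MIME limits for the per-provider list.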
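
Because the MIME type is part of the data: URI, it helps to derive it from the file rather than hard-code image/png. The following is a minimal sketch using only the Python standard library; the helper name image_data_uri is hypothetical and not part of DVARA or the OpenAI SDK, and the MIME type is guessed from the file extension only, not from the file contents.

```python
import base64
import mimetypes
from pathlib import Path


def image_data_uri(path: str) -> str:
    """Return a data: URI for the image at `path`.

    Hypothetical helper: the MIME type is guessed from the file extension,
    so a mislabelled file produces a mislabelled URI.
    """
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot determine an image MIME type for {path!r}")
    # Standard base64 with no line breaks, as a data: URI expects.
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

The returned string can be used directly as the url value of an image_url content block.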

Python example

import base64
from pathlib import Path
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="<your-dvara-api-key>",
)

# Option 1 — image URL
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/chart.png"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)

# Option 2 — base64 from disk
image_data = base64.standard_b64encode(Path("chart.png").read_bytes()).decode("utf-8")

response = client.chat.completions.create(
    model="claude-sonnet-4-5",  # or gpt-4o, gemini-2.0-flash, etc.
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Summarise the trend in this chart."},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_data}"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)

Multiple images in one request

All vision-capable providers support multiple images in a single message:

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Compare these two diagrams."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/before.png"},
                },
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/after.png"},
                },
            ],
        }
    ],
)
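
When the number of images varies per request, building the content array by hand gets repetitive. One possible sketch is a small helper that puts the text block first and appends one image_url block per image. The name vision_content is hypothetical; the block shapes are the same ones used in the examples above.

```python
def vision_content(prompt: str, image_urls: list[str]) -> list[dict]:
    """Build an OpenAI-style content array: one text block, then one block per image.

    `image_urls` may mix https:// URLs and data: URIs.
    """
    content = [{"type": "text", "text": prompt}]
    for url in image_urls:
        content.append({"type": "image_url", "image_url": {"url": url}})
    return content
```

The result is passed as the content of a user message, for example messages=[{"role": "user", "content": vision_content("Compare these two diagrams.", urls)}].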

Route setup for vision traffic

Configure a route whose pool contains only vision-capable providers. A vision request that hits a non-vision provider either errors out (Cohere, Groq, Ollama, Qwen, DeepSeek, Moonshot, and non-vision ChatGLM variants) or fails silently (Mistral). A vision-only route avoids both classes of failure.

Create one via the Admin API:

curl -s -X POST http://localhost:8090/v1/admin/routes \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <admin-pat>" \
  -d '{
    "id": "vision-route",
    "model_pattern": "gpt-4o*",
    "strategy": "weighted",
    "providers": [
      {"provider": "openai", "weight": 60},
      {"provider": "anthropic", "weight": 30},
      {"provider": "gemini", "weight": 10}
    ]
  }'

A vision request matching gpt-4o* is distributed across OpenAI, Anthropic, and Gemini by weight; if one provider's circuit breaker opens, the resilience layer fails over to the remaining two. See Routing for strategy details.
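
Since the routing layer does not filter on vision capability, it can be worth checking a route definition before submitting it. The sketch below is a client-side check, not a DVARA feature. The function name and the VISION_CAPABLE set are assumptions: only the identifiers openai, anthropic, and gemini appear in the example above, so add the identifiers your deployment uses for other vision-capable providers.

```python
# Provider identifiers taken from the route example above. Identifiers for
# other vision-capable providers are deployment-specific and must be added.
VISION_CAPABLE = {"openai", "anthropic", "gemini"}


def non_vision_providers(route: dict, vision_capable=VISION_CAPABLE) -> list[str]:
    """Return the providers in a route body that are not in the vision-capable set."""
    return [
        entry["provider"]
        for entry in route.get("providers", [])
        if entry["provider"] not in vision_capable
    ]


route = {
    "id": "vision-route",
    "model_pattern": "gpt-4o*",
    "strategy": "weighted",
    "providers": [
        {"provider": "openai", "weight": 60},
        {"provider": "anthropic", "weight": 30},
        {"provider": "gemini", "weight": 10},
    ],
}
offenders = non_vision_providers(route)
if offenders:
    raise SystemExit(f"route {route['id']} includes non-vision providers: {offenders}")
```

An empty result means every provider in the pool is in the allowed set; anything else should be removed before the route carries image traffic.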

What happens when vision isn't supported

Vision capability is not pre-checked by the routing layer. Behaviour when image content reaches a non-vision provider depends on the provider:

| Provider | Behaviour when image content arrives |
| --- | --- |
| Cohere, Groq, Ollama | Reject with UNSUPPORTED_CAPABILITY (HTTP 400, invalid_request_error). |
| Qwen, DeepSeek, Moonshot | Image forwarded to the upstream OpenAI-compatible endpoint as a data: URI. The upstream model is text-only and rejects with its own provider error. No special DVARA handling. |
| ChatGLM | Same forward-to-upstream path. The glm-4v-* family accepts the image and responds normally; other glm variants reject. |
| Mistral | Image blocks are silently replaced with a stringified placeholder before the request leaves DVARA. The upstream gets garbage; the response is incorrect; no error is surfaced. Do not put Mistral on a vision route. |

The safest configuration: route only to providers in the "vision: yes" rows of the Supported providers table.

Image size and MIME limits

DVARA does not pre-validate image payloads — size, dimensions, and MIME type are checked by the upstream provider, and oversized or unsupported images surface as the provider's own error. Per-provider caps as of this writing:

| Provider | Max per-image size | Accepted MIME types |
| --- | --- | --- |
| OpenAI / Azure OpenAI | 20 MB | image/png, image/jpeg, image/gif, image/webp |
| Anthropic | 5 MB | image/png, image/jpeg, image/gif, image/webp |
| Google Gemini | 7 MB inline (larger via Files API, not exposed by DVARA today) | image/png, image/jpeg, image/heif, image/webp |
| AWS Bedrock | 3.75 MB per image (Claude on Bedrock) | image/png, image/jpeg, image/gif, image/webp |

Numbers shift over time — when in doubt, treat the smallest cap on your route as the binding limit and verify against the upstream's docs. Base64 encoding inflates payload by ~33%, so a 5 MB raw image lands as ~6.7 MB on the wire — relevant when comparing against the cap.
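
The base64 arithmetic and the "smallest cap wins" rule can be sketched as a client-side pre-check. This is illustrative only: the cap values are copied from the table above, 1 MB is taken as 1,000,000 bytes, and the provider keys and function names are hypothetical. Whether a provider measures its cap against the raw or the encoded size varies, so the sketch conservatively compares the encoded size.

```python
MB = 1_000_000  # assumption: decimal megabytes

# Per-image caps from the table above (subject to change upstream).
CAPS_BYTES = {
    "openai": 20 * MB,
    "anthropic": 5 * MB,
    "gemini": 7 * MB,
    "bedrock": int(3.75 * MB),
}


def base64_size(raw_bytes: int) -> int:
    """Length of the padded base64 encoding of `raw_bytes` bytes (4 chars per 3 bytes)."""
    return 4 * ((raw_bytes + 2) // 3)


def fits_route(raw_bytes: int, providers: list[str]) -> bool:
    """True if the encoded image is within the smallest cap among `providers`."""
    binding_cap = min(CAPS_BYTES[p] for p in providers)
    return base64_size(raw_bytes) <= binding_cap


print(base64_size(5 * MB))  # 6666668, i.e. about 6.7 MB on the wire
```

For a route containing OpenAI, Anthropic, and Gemini, the binding cap is Anthropic's 5 MB, so under this conservative reading a raw image needs to stay below roughly 3.75 MB.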

Vision token cost

Vision requests cost more tokens than text-only requests of similar apparent size. Each provider has its own counting rule:

  • OpenAI / Azure OpenAI — tile-based: a low-detail image is a fixed 85 tokens; high-detail scales with megapixels and aspect ratio (typically 765 tokens for a 1024×1024 image, more for larger).
  • Anthropic — roughly (width × height) / 750 tokens, capped per-image.
  • Gemini — 258 tokens per image up to a baseline resolution; rescaled images count similarly.
  • Bedrock (Claude) — same shape as Anthropic.
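
The counting rules above can be approximated in a few lines, which is useful for rough budget planning. Treat this as a sketch under stated assumptions: the OpenAI constants (85 base tokens, 170 per 512-pixel tile, fit within 2048×2048, shortest side reduced to 768) follow the rules OpenAI has published for gpt-4o and are not taken from DVARA; other models use different multipliers. The Anthropic function ignores the per-image cap. The function names are hypothetical, and the authoritative number is always the upstream-reported usage.

```python
import math


def openai_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Approximate gpt-4o image tokens (assumed tile rule, see lead-in)."""
    if detail == "low":
        return 85
    # Step 1: shrink to fit inside a 2048x2048 square (never upscale).
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Step 2: shrink so the shortest side is at most 768 px.
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # Step 3: count 512x512 tiles; 170 tokens per tile plus 85 base tokens.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles


def anthropic_image_tokens(width: int, height: int) -> int:
    """Approximate Claude image tokens as (width * height) / 750, ignoring the cap."""
    return math.ceil(width * height / 750)


GEMINI_BASELINE_TOKENS = 258  # per image up to the baseline resolution

print(openai_image_tokens(1024, 1024))  # 765, matching the figure quoted above
```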

Budget caps are evaluated against the actual upstream-reported token count, so a tenant sending high-resolution images will hit a budget faster than the headline "messages per minute" suggests. Size budget caps for vision-heavy tenants with this in mind, and consider downsampling images client-side before posting if the use case allows.
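
To make the downsampling suggestion concrete, the sketch below computes target dimensions that keep an image under a per-image token budget while preserving aspect ratio. It assumes the rough Anthropic-style rule of one token per 750 pixels and ignores any provider-side cap or resizing; the function name and the pixels_per_token parameter are hypothetical. The actual resize would be done with an imaging library of your choice.

```python
import math


def dimensions_for_token_budget(
    width: int, height: int, max_tokens: int, pixels_per_token: int = 750
) -> tuple[int, int]:
    """Return (width, height) scaled down so width * height / pixels_per_token <= max_tokens."""
    max_pixels = max_tokens * pixels_per_token
    if width * height <= max_pixels:
        return width, height  # already within budget; never upscale
    # Area scales with the square of the linear factor, hence the square root.
    scale = math.sqrt(max_pixels / (width * height))
    # Flooring both sides guarantees the result stays within the pixel budget.
    return max(1, math.floor(width * scale)), max(1, math.floor(height * scale))


print(dimensions_for_token_budget(4000, 3000, max_tokens=1600))  # (1264, 948)
```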

Governance on vision requests

Policy, PII scanning, guardrails, budget, and audit all apply to vision requests. A few details specific to image input:

  • Policy rules evaluate the text portions of the message (the text content blocks). Image bytes are not evaluated against text-based policy conditions.
  • PII scanning runs on the text blocks only. Image bytes are not decoded or scanned — if PII is rendered as text inside an image (a screenshot of an SSN, for example), PII detection will not catch it.
  • Audit records every vision request as a GATEWAY_RESPONSE event, including model, provider, token counts, and the text portions of the content array. Image bytes are never stored in the audit trail.
  • Budget enforcement counts vision tokens the same way as text tokens — see Vision token cost above for the per-provider counting rules. Set budget caps accordingly for tenants that send images.
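
To see which parts of a request the text-based stages described above operate on, the following sketch separates the text blocks of a content array from its image blocks. It illustrates the distinction only and is not DVARA's implementation; the function name is hypothetical.

```python
def split_content(content) -> tuple[list[str], int]:
    """Return (text portions, number of image blocks) for a message's content.

    `content` is either a plain string or an OpenAI-style content array.
    """
    if isinstance(content, str):
        return [content], 0
    texts = [block["text"] for block in content if block.get("type") == "text"]
    images = sum(1 for block in content if block.get("type") == "image_url")
    return texts, images


content = [
    {"type": "text", "text": "What error is shown in this screenshot?"},
    {"type": "image_url", "image_url": {"url": "data:image/png;base64,AAAA"}},
]
texts, image_count = split_content(content)
# Only `texts` is visible to policy rules, PII scanning, and the audit record;
# anything rendered inside the image itself is not inspected.
```

If a use case depends on detecting sensitive text inside screenshots, that check has to happen client-side before the request is sent.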

Limitations

A few cases worth knowing before you wire up a vision-capable client.

  • No routing-level capability filter. Image content reaches the chosen provider regardless of whether that provider is vision-capable. Use a route that contains only vision-capable providers.
  • No image-byte audit. The image data itself is never logged. The audit trail captures the text content blocks and metadata only.
  • No image-byte scanning. PII or guardrail rules don't decode images — text rendered into an image evades both.
  • No pre-validation of size or MIME. Oversized or unsupported images surface as the upstream's own error, not a DVARA-level rejection.
  • Mistral's silent garble. Mistral specifically replaces image blocks with a stringified placeholder and dispatches the resulting nonsense to the upstream. The response will be incorrect with no error returned. Don't put Mistral on a vision route.

Next steps

  • Provider Setup — capabilities matrix and which providers support vision
  • Routing — configure weighted, latency-aware, or canary routes for vision traffic
  • Cost Management — vision requests generate higher token counts; set budget caps accordingly