Multimodal & Vision
DVARA routes vision requests — messages that include images alongside text — using the same OpenAI-compatible content array format. Every governance stage (policy, PII scanning, guardrails, budget, audit) applies to vision requests identically to text-only calls. The image payload itself passes through to the upstream provider; DVARA does not decode or store image bytes.
Supported providers
| Provider | Vision | Notes |
|---|---|---|
| OpenAI | yes | gpt-4o, gpt-4o-mini, gpt-4.1 — native image_url |
| Azure OpenAI | yes | Same models as OpenAI via your Azure deployment |
| Anthropic | yes | claude-sonnet-4-5, claude-opus-4-1, claude-3-5-haiku-20241022 — native |
| Google Gemini | yes | gemini-2.0-flash, gemini-1.5-pro — native |
| AWS Bedrock | yes | Claude vision models hosted on Bedrock (Titan models on Bedrock are text-only) |
| xAI Grok | yes | grok-2-vision-1212 and similar — native. Vision is over-declared at the provider level; non-vision Grok variants will return their own model-side error. |
| Mistral | no | Text models only; Pixtral (pixtral-12b, pixtral-large) is not supported in this release. See Provider capability gaps. |
| Cohere, Groq, Ollama | no | Reject with UNSUPPORTED_CAPABILITY when image content is sent. See What happens when vision isn't supported. |
| Qwen, DeepSeek, Moonshot | no | OpenAI-compatible providers — image content forwarded to the upstream API verbatim. The upstream rejects with its own error since these providers' chat models are text-only. |
| ChatGLM (Zhipu) | model-specific | The provider declares vision capability off at the route level (a conservative under-declaration). The glm-4v-* model family is vision-capable: image content is forwarded to the upstream and the model responds. Other glm variants reject. |
Unlike response_format (where DVARA rejects at the routing layer before any upstream call), there is no routing-level filter for image content. The request always reaches whichever provider the route selects — that provider then either handles it (vision-capable) or rejects with its own error (Cohere / Groq / Ollama via UNSUPPORTED_CAPABILITY, OpenAI-compat providers via the upstream's own model-side error). Mistral specifically replaces image blocks with a stringified placeholder and silently sends garbage to the upstream — no error surfaced. The safe configuration is a route that contains only vision-capable providers.
Sending an image by URL
Pass a publicly accessible image URL in the image_url content block:
curl -s -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <your-dvara-api-key>" \
-d '{
"model": "gpt-4o",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe what you see in this image."
},
{
"type": "image_url",
"image_url": {
"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/280px-PNG_transparency_demonstration_1.png"
}
}
]
}
]
}'
Response:
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"model": "gpt-4o",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "The image shows a transparent PNG demonstration — coloured spheres on a checkered background that indicates transparency."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 272,
"completion_tokens": 28,
"total_tokens": 300
}
}
Sending an image as base64
For images that aren't publicly accessible, base64-encode the file and embed it inline as a data: URI:
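# Encode the image as a single line (GNU base64 wraps output at 76 columns by default, which would break the JSON string)
IMAGE_B64=$(base64 < /path/to/screenshot.png | tr -d '\n')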
curl -s -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <your-dvara-api-key>" \
-d "{
\"model\": \"gpt-4o\",
\"messages\": [
{
\"role\": \"user\",
\"content\": [
{
\"type\": \"text\",
\"text\": \"What error is shown in this screenshot?\"
},
{
\"type\": \"image_url\",
\"image_url\": {
\"url\": \"data:image/png;base64,${IMAGE_B64}\"
}
}
]
}
]
}"
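Supported MIME types depend on the upstream provider. OpenAI and Anthropic accept image/png, image/jpeg, image/gif, and image/webp; Gemini accepts image/png, image/jpeg, image/heif, and image/webp. See Image size and MIME limits for the per-provider list.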
Python example
import base64
from pathlib import Path
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8080/v1",
api_key="<your-dvara-api-key>",
)
# Option 1 — image URL
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{
"role": "user",
"content": [
{"type": "text", "text": "What is in this image?"},
{
"type": "image_url",
"image_url": {"url": "https://example.com/chart.png"},
},
],
}
],
)
print(response.choices[0].message.content)
# Option 2 — base64 from disk
image_data = base64.standard_b64encode(Path("chart.png").read_bytes()).decode("utf-8")
response = client.chat.completions.create(
model="claude-sonnet-4-5", # or gpt-4o, gemini-2.0-flash, etc.
messages=[
{
"role": "user",
"content": [
{"type": "text", "text": "Summarise the trend in this chart."},
{
"type": "image_url",
"image_url": {"url": f"data:image/png;base64,{image_data}"},
},
],
}
],
)
print(response.choices[0].message.content)
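The base64 option above hardcodes image/png. A minimal sketch of a helper that picks the MIME type from the file name instead, assuming the file extension is trustworthy. The function names (data_uri, file_to_data_uri, image_block) are hypothetical and not part of DVARA or the OpenAI SDK; only standard-library calls are used.

```python
import base64
import mimetypes
from pathlib import Path


def data_uri(data: bytes, mime: str) -> str:
    """Build a data: URI from raw image bytes and an explicit MIME type."""
    encoded = base64.standard_b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def file_to_data_uri(path: str) -> str:
    """Build a data: URI for a local file, guessing the MIME type from its name."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        # Refuse anything that does not look like an image by extension.
        raise ValueError(f"cannot determine an image MIME type for {path!r}")
    return data_uri(Path(path).read_bytes(), mime)


def image_block(url: str) -> dict:
    """Wrap an https URL or a data: URI in an image_url content block."""
    return {"type": "image_url", "image_url": {"url": url}}
```

With these helpers, the content array from Option 2 could be written as `[{"type": "text", "text": "..."}, image_block(file_to_data_uri("chart.png"))]`. The guess is based on the extension only; the upstream provider still decides whether the MIME type is accepted.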
Multiple images in one request
All vision-capable providers support multiple images in a single message:
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{
"role": "user",
"content": [
{"type": "text", "text": "Compare these two diagrams."},
{
"type": "image_url",
"image_url": {"url": "https://example.com/before.png"},
},
{
"type": "image_url",
"image_url": {"url": "https://example.com/after.png"},
},
],
}
],
)
Route setup for vision traffic
Configure a route whose pool contains only vision-capable providers. A vision request that hits a non-vision provider either errors out (Cohere / Groq / Ollama with UNSUPPORTED_CAPABILITY; Qwen / DeepSeek / Moonshot and the non-vision glm variants with the upstream's own error) or fails silently (Mistral). A vision-only route avoids both classes of failure.
Create one via the Admin API:
curl -s -X POST http://localhost:8090/v1/admin/routes \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <admin-pat>" \
-d '{
"id": "vision-route",
"model_pattern": "gpt-4o*",
"strategy": "weighted",
"providers": [
{"provider": "openai", "weight": 60},
{"provider": "anthropic", "weight": 30},
{"provider": "gemini", "weight": 10}
]
}'
A vision request matching gpt-4o* is distributed across OpenAI, Anthropic, and Gemini by weight; if one provider's circuit breaker opens, the resilience layer fails over to the remaining two. See Routing for strategy details.
What happens when vision isn't supported
Vision capability is not pre-checked by the routing layer. Behaviour when image content reaches a non-vision provider depends on the provider:
| Provider | Behaviour when image content arrives |
|---|---|
| Cohere, Groq, Ollama | Reject with UNSUPPORTED_CAPABILITY (HTTP 400, invalid_request_error). |
| Qwen, DeepSeek, Moonshot | Image forwarded to the upstream OpenAI-compatible endpoint as a data: URI. The upstream model is text-only and rejects with its own provider error. No special DVARA handling. |
| ChatGLM | Same forward-to-upstream path. The glm-4v-* family accepts the image and responds normally; other glm variants reject. |
| Mistral | Image blocks are silently replaced with a stringified placeholder before the request leaves DVARA. The upstream gets garbage; the response is incorrect; no error surfaced. Do not put Mistral on a vision route. |
The safest configuration: route only to providers in the "vision: yes" rows of the Supported providers table.
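Because the routing layer does not pre-check vision capability, a client or deployment script can do the check itself. The following is a minimal sketch, not DVARA functionality: the VISION_CAPABLE set is copied by hand from the "vision: yes" rows of the Supported providers table, and the provider identifier strings other than openai, anthropic, and gemini are assumptions that may not match the ids in your configuration.

```python
# Hand-maintained copy of the "vision: yes" rows. Identifier strings are
# assumptions; xAI is listed even though non-vision Grok models still reject.
VISION_CAPABLE = {"openai", "azure-openai", "anthropic", "gemini", "bedrock", "xai"}


def has_image_content(messages: list) -> bool:
    """Return True if any message carries an image_url content block."""
    for message in messages:
        content = message.get("content")
        if not isinstance(content, list):
            continue  # plain string content is text-only
        for block in content:
            if isinstance(block, dict) and block.get("type") == "image_url":
                return True
    return False


def non_vision_providers(route_providers: list) -> list:
    """Return the providers on a route that are not known to be vision-capable."""
    return [p for p in route_providers if p not in VISION_CAPABLE]
```

A pre-flight check might refuse to send when `has_image_content(messages)` is true and `non_vision_providers(...)` is non-empty for the target route, which catches the silent Mistral case before any request is made.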
Image size and MIME limits
DVARA does not pre-validate image payloads — size, dimensions, and MIME type are checked by the upstream provider, and oversized or unsupported images surface as the provider's own error. Per-provider caps as of this writing:
| Provider | Max per-image size | Accepted MIME types |
|---|---|---|
| OpenAI / Azure OpenAI | 20 MB | image/png, image/jpeg, image/gif, image/webp |
| Anthropic | 5 MB | image/png, image/jpeg, image/gif, image/webp |
| Google Gemini | 7 MB inline (larger via Files API, not exposed by DVARA today) | image/png, image/jpeg, image/heif, image/webp |
| AWS Bedrock | 3.75 MB per image (Claude on Bedrock) | image/png, image/jpeg, image/gif, image/webp |
Numbers shift over time — when in doubt, treat the smallest cap on your route as the binding limit and verify against the upstream's docs. Base64 encoding inflates payload by ~33%, so a 5 MB raw image lands as ~6.7 MB on the wire — relevant when comparing against the cap.
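The "smallest cap on the route" rule and the base64 inflation can be checked before a request is sent. This is a rough sketch under stated assumptions: the caps are the figures from the table above and may be out of date, MB is taken as 1,000,000 bytes, the provider keys are hypothetical labels, and whether a provider measures the raw or the encoded size is not stated here, so the check exposes both.

```python
MB = 1_000_000  # assumption: decimal megabytes; a provider may mean MiB

# Per-image caps from the table above, in bytes. Keys are hypothetical labels.
MAX_IMAGE_BYTES = {
    "openai": 20 * MB,
    "anthropic": 5 * MB,
    "gemini": 7 * MB,
    "bedrock": 3_750_000,
}


def base64_size(raw_bytes: int) -> int:
    """Length of the padded base64 encoding of raw_bytes bytes (4 chars per 3 bytes)."""
    return 4 * ((raw_bytes + 2) // 3)


def binding_cap(route_providers: list) -> int:
    """Smallest per-image cap among the providers on a route."""
    return min(MAX_IMAGE_BYTES[p] for p in route_providers)


def fits_route(raw_bytes: int, route_providers: list, count_encoded: bool = True) -> bool:
    """Check an image against the route's binding cap, by encoded or raw size."""
    size = base64_size(raw_bytes) if count_encoded else raw_bytes
    return size <= binding_cap(route_providers)
```

For example, a 4 MB image on a route that includes Anthropic passes when measured raw but fails when measured encoded (about 5.33 MB), so the conservative choice is to leave `count_encoded` at its default.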
Vision token cost
Vision requests cost more tokens than text-only requests of similar apparent size. Each provider has its own counting rule:
- OpenAI / Azure OpenAI: tile-based. A low-detail image is a fixed 85 tokens; high-detail scales with megapixels and aspect ratio (typically 765 tokens for a 1024×1024 image, more for larger).
- Anthropic: roughly (width × height) / 750 tokens, capped per-image.
- Gemini: 258 tokens per image up to a baseline resolution; rescaled images count similarly.
- Bedrock (Claude): same shape as Anthropic.
Budget caps are evaluated against the actual upstream-reported token count, so a tenant sending high-resolution images will hit a budget faster than the headline "messages per minute" suggests. Size budget caps for vision-heavy tenants with this in mind, and consider downsampling images client-side before posting if the use case allows.
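The counting rules above can be turned into a rough pre-flight estimator. The sketch below is an approximation for budgeting, not DVARA's or any provider's billing code, and all function names are hypothetical. The OpenAI branch assumes the tile rule from OpenAI's public documentation (fit within 2048×2048, shrink the shortest side to 768, then 170 tokens per 512-pixel tile plus an 85-token base), which this page does not spell out. The Anthropic branch does not model the per-image cap because its value is not given here, and the Gemini branch returns only the flat baseline.

```python
import math


def openai_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate OpenAI image tokens under the assumed tile rule."""
    if detail == "low":
        return 85  # fixed cost for low-detail images
    # Step 1: shrink to fit within 2048 x 2048 (never upscale).
    scale = min(1.0, 2048 / max(width, height))
    w, h = round(width * scale), round(height * scale)
    # Step 2: shrink so the shortest side is at most 768 (never upscale).
    scale = min(1.0, 768 / min(w, h))
    w, h = round(w * scale), round(h * scale)
    # Step 3: count 512-pixel tiles.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles


def anthropic_image_tokens(width: int, height: int) -> int:
    """Estimate Anthropic image tokens as (width * height) / 750, rounded up.

    The per-image cap mentioned above is not modeled here.
    """
    return (width * height + 749) // 750


def gemini_image_tokens() -> int:
    """Flat baseline per image; larger images are not modeled here."""
    return 258
```

Under these assumptions `openai_image_tokens(1024, 1024)` gives 765, matching the figure quoted above. Budget enforcement still uses the upstream-reported count, so treat the estimate as a sizing aid only.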
Governance on vision requests
Policy, PII scanning, guardrails, budget, and audit all apply to vision requests. A few details specific to image input:
- Policy rules evaluate the text portions of the message (the text content blocks). Image bytes are not evaluated against text-based policy conditions.
- PII scanning runs on the text blocks only. Image bytes are not decoded or scanned: if PII is rendered as text inside an image (a screenshot of an SSN, for example), PII detection will not catch it.
- Audit records every vision request as a GATEWAY_RESPONSE event, including model, provider, token counts, and the text portions of the content array. Image bytes are never stored in the audit trail.
- Budget enforcement counts vision tokens the same way as text tokens. See Vision token cost above for the per-provider counting rules, and set budget caps accordingly for tenants that send images.
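To make concrete what the text-only stages can see, the following sketch extracts the text portions of a content array and discards image blocks. It illustrates the behaviour described above and is not DVARA's implementation; the function name text_portions is hypothetical.

```python
def text_portions(messages: list) -> list:
    """Collect the text a text-only policy or PII stage could inspect.

    String content is taken whole; in a content array only blocks of
    type "text" are kept, so image_url blocks contribute nothing.
    """
    texts = []
    for message in messages:
        content = message.get("content")
        if isinstance(content, str):
            texts.append(content)
        elif isinstance(content, list):
            for block in content:
                if isinstance(block, dict) and block.get("type") == "text":
                    texts.append(block.get("text", ""))
    return texts
```

A request whose only sensitive data lives inside a screenshot yields just the prompt text here, which is why rendering PII into an image evades both PII scanning and text-based policy rules.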
Limitations
A few cases worth knowing before you wire up a vision-capable client.
- No routing-level capability filter. Image content reaches the chosen provider regardless of whether that provider is vision-capable. Use a route that contains only vision-capable providers.
- No image-byte audit. The image data itself is never logged. The audit trail captures the text content blocks and metadata only.
- No image-byte scanning. PII or guardrail rules don't decode images — text rendered into an image evades both.
- No pre-validation of size or MIME. Oversized or unsupported images surface as the upstream's own error, not a DVARA-level rejection.
- Mistral's silent garble. Mistral specifically replaces image blocks with a stringified placeholder and dispatches the resulting nonsense to the upstream. The response will be incorrect with no error returned. Don't put Mistral on a vision route.
Next steps
- Provider Setup — capabilities matrix and which providers support vision
- Routing — configure weighted, latency-aware, or canary routes for vision traffic
- Cost Management — vision requests generate higher token counts; set budget caps accordingly