
Why an AI Gateway? What API Gateways Can't Do for LLM and Agent Traffic

Dvara Team · 8 min read

Your API gateway handles TLS termination, rate limiting, and request routing. It does these things well. But when your traffic carries prompts, tokens, tool calls, and model-specific payloads, an API gateway becomes a passthrough — it can see the HTTP envelope but not the AI semantics inside it.

This post explains the six gaps that emerge when teams try to govern AI workloads with traditional API gateways, and how a purpose-built AI gateway closes them.

API Gateway vs AI Gateway


1. API Gateways Route Requests. AI Gateways Translate Them.

An API gateway routes POST /v1/chat/completions to a backend. It doesn't know or care what's in the body.

An AI gateway does something fundamentally different — it translates between incompatible provider APIs. When you send an OpenAI-format request with model: claude-sonnet-4-5, Dvara:

  • Strips system role messages from the messages array and passes them as Anthropic's separate system field
  • Defaults max_tokens to 1024 (Anthropic requires it; OpenAI doesn't)
  • Translates response_format: { type: "json_schema" } into Anthropic's tool-use rewrite pattern, then extracts the structured output from tool_use blocks back into a standard response

For Gemini, it maps to generationConfig.responseMimeType + responseSchema. For Bedrock, it rewrites the auth as SigV4 and restructures tool calls into toolConfig format.
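As a sketch, the OpenAI-to-Anthropic mapping described above can be written as a pure transformation. This is illustrative only — Dvara's real translation layer covers streaming, tool calls, structured output, and many more fields:

```python
def to_anthropic(openai_req: dict) -> dict:
    """Simplified OpenAI -> Anthropic request translation."""
    system_parts = [m["content"] for m in openai_req["messages"]
                    if m["role"] == "system"]
    body = {
        "model": openai_req["model"],
        # Anthropic requires max_tokens; OpenAI does not.
        "max_tokens": openai_req.get("max_tokens", 1024),
        # system-role messages move out of the messages array...
        "messages": [m for m in openai_req["messages"]
                     if m["role"] != "system"],
    }
    if system_parts:
        # ...and into Anthropic's separate top-level system field.
        body["system"] = "\n".join(system_parts)
    return body
```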

None of this is configurable in Kong or NGINX. It requires deep knowledge of each provider's API contract, and it changes every time a provider ships a new feature.

What this means for your team

Your application code stays on one SDK (OpenAI-compatible). Switching from GPT-4o to Claude to Gemini is a one-field change. No SDK swap, no integration rewrite, no deployment.


2. Request-Based Rate Limiting vs Token-Based Economics

API gateways rate-limit by requests per second — a metric designed for REST APIs where requests have roughly uniform cost. AI workloads break this model:

| Metric | REST API | LLM API |
| --- | --- | --- |
| Cost driver | Request count | Token count (input + output) |
| Cost variance | ~uniform | 1,000x (a 10-token prompt vs a 128K-context conversation) |
| Budget unit | $/request | $/million tokens, varies by model |
| Overspend signal | HTTP 429 | A $500 bill from a runaway agent loop |

Dvara enforces token-based budget caps — daily, weekly, or monthly — with per-model pricing, soft-limit warnings, hard-limit blocks, and automatic model downgrade (e.g., switch from GPT-4o to GPT-4o-mini when a tenant hits 80% of budget). It attributes every dollar to a tenant, API key, model, and provider.

An API gateway counting requests per second has no way to know that one request costs $0.001 and the next costs $2.40.
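The soft-limit/downgrade behavior can be sketched in a few lines. The prices and the 80% threshold below are hypothetical placeholders, not Dvara's actual configuration:

```python
# Hypothetical blended $/1K-token prices for illustration only.
PRICES_PER_1K = {"gpt-4o": 0.0075, "gpt-4o-mini": 0.000375}

class TokenBudget:
    """Sketch of token-based budget enforcement with a soft-limit
    model downgrade and a hard-limit block."""

    def __init__(self, monthly_limit_usd: float):
        self.limit = monthly_limit_usd
        self.spent = 0.0

    def record(self, model: str, tokens: int) -> None:
        # Attribute cost per model, not per request.
        self.spent += PRICES_PER_1K[model] * tokens / 1000

    def route(self, requested: str) -> str:
        pct = self.spent / self.limit
        if pct >= 1.0:
            raise RuntimeError("hard limit reached: request blocked")
        if pct >= 0.8 and requested == "gpt-4o":
            return "gpt-4o-mini"  # soft limit: automatic downgrade
        return requested
```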


3. Access Logging vs AI-Aware Audit Trail

An API gateway access log tells you:

POST /v1/chat/completions 200 450ms 2.1KB api_key=sk-abc***

Dvara's audit trail tells you:

{
  "event_type": "GATEWAY_RESPONSE",
  "tenant_id": "acme",
  "model": "gpt-4o",
  "provider": "openai",
  "input_tokens": 1200,
  "output_tokens": 340,
  "cost_usd": 0.018,
  "policy_decision": "ALLOW",
  "pii_action": "REDACT",
  "pii_entities": ["SSN", "SSN"],
  "trace_id": "abc-123",
  "session_id": "agent-session-456",
  "budget_utilization_pct": 62
}

Every event is HMAC-SHA256 signed and hash-chained — each event's signature includes the previous event's hash, creating a tamper-evident chain. This is what auditors need for SOC2 Type II, HIPAA, and GDPR compliance. An API gateway's access log is an unsigned text file.
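The hash-chaining idea is simple to demonstrate with the standard library. This is a minimal sketch of the concept, not Dvara's wire format:

```python
import hashlib
import hmac
import json

def sign_event(key: bytes, event: dict, prev_hash: str) -> dict:
    """Sign an event so its signature covers the previous event's hash.
    Altering any record breaks every signature after it."""
    payload = json.dumps(event, sort_keys=True) + prev_hash
    sig = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    return {**event, "prev_hash": prev_hash, "signature": sig}

def verify_chain(key: bytes, chain: list) -> bool:
    prev = "genesis"
    for rec in chain:
        body = {k: v for k, v in rec.items()
                if k not in ("prev_hash", "signature")}
        payload = json.dumps(body, sort_keys=True) + prev
        expected = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
        if rec["prev_hash"] != prev or rec["signature"] != expected:
            return False  # tamper-evident: chain breaks here
        prev = rec["signature"]
    return True
```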


4. No Plugin Can Scan Prompts for PII

When a developer sends a chat completion request, the prompt might contain customer SSNs, credit card numbers, medical record numbers, or email addresses. An API gateway sees this as an opaque JSON body. It can't:

  • Parse the messages array and scan each message's content for PII patterns
  • Apply per-tenant rules (tenant A wants PII blocked, tenant B wants it redacted, tenant C wants it logged)
  • Replace detected PII with reversible tokens ([PII:SSN:tok_abc123]) that can be de-tokenized later for authorized users
  • Scan the LLM response for PII output leaks (the model hallucinating or echoing sensitive data)

Dvara's PII engine runs 14 built-in regex patterns with checksum validation (Luhn for credit cards, DEA checksums for prescriber IDs) on both requests and responses. Per-tenant configuration controls the action: BLOCK (reject the request), REDACT (replace with tokens), or LOG (pass through but record the detection).

This isn't a plugin you can bolt onto NGINX. It requires understanding the LLM message format, maintaining a per-tenant token store, and coordinating with the audit trail.
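To see why checksum validation matters, here is a minimal Luhn-gated redaction sketch. The regex, the token format, and the single built-in pattern are simplified stand-ins for Dvara's 14-pattern engine:

```python
import re

CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def luhn_valid(number: str) -> bool:
    """Luhn checksum: filters out random digit runs that merely
    look like card numbers."""
    digits = [int(d) for d in re.sub(r"\D", "", number)][::-1]
    total = 0
    for i, d in enumerate(digits):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def redact_cards(text: str) -> str:
    # Replace only checksum-valid matches with a redaction token;
    # leave lookalike digit runs (order IDs, refs) untouched.
    return CARD_RE.sub(
        lambda m: "[PII:CARD:tok]" if luhn_valid(m.group()) else m.group(),
        text,
    )
```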


5. Prompt Injection Detection Doesn't Exist in API Gateways

OWASP lists prompt injection as the #1 risk for LLM applications. API gateways have no concept of it. Dvara scans every request through 32 injection detection patterns covering:

  • Jailbreak attempts — "ignore previous instructions", role-play attacks, encoding tricks
  • Direct injection — system prompt override, delimiter attacks
  • Indirect injection — instructions embedded in tool call results or retrieved documents
  • System prompt extraction — attempts to leak the system prompt (OWASP LLM07)

On the response side, Dvara scans for output sanitization issues (OWASP LLM05): XSS payloads, SQL injection fragments, command injection, and SSRF URLs that a model might generate.

It also enforces input size limits (OWASP LLM10): maximum messages per request, maximum message length, and maximum input tokens — preventing resource exhaustion attacks that an API gateway's byte-level limits can't meaningfully address.
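The shape of such a scanner is straightforward; the hard part is the pattern corpus. The three patterns and the size limit below are illustrative examples, not Dvara's actual 32 rules:

```python
import re

# A few illustrative injection patterns (not Dvara's actual rules).
INJECTION_PATTERNS = [
    (re.compile(r"ignore\s+(all\s+)?previous\s+instructions", re.I), "jailbreak"),
    (re.compile(r"you\s+are\s+now\s+in\s+developer\s+mode", re.I), "role-play"),
    (re.compile(r"(repeat|reveal|print)\s+(your\s+)?system\s+prompt", re.I),
     "prompt-extraction"),
]

MAX_MESSAGE_CHARS = 8_000  # hypothetical input size limit (OWASP LLM10)

def scan_messages(messages: list) -> list:
    """Return labels for every pattern hit across the messages array."""
    findings = []
    for m in messages:
        content = m.get("content", "")
        if len(content) > MAX_MESSAGE_CHARS:
            findings.append("input-size-exceeded")
        for pattern, label in INJECTION_PATTERNS:
            if pattern.search(content):
                findings.append(label)
    return findings
```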


6. MCP Tool Governance Is a New Category

The Model Context Protocol lets AI agents call tools — databases, file systems, Slack, GitHub. When an agent decides to call DROP TABLE users through a Postgres MCP server, something needs to intervene before that call reaches the database.

API gateways have no concept of:

  • Tool-level policy — "deny the write_file tool for tenant X" or "require human approval for any tool on server prod-database"
  • Agent loop detection — the same tool called 50 times in a row, or an A-B-A-B cycle that indicates the agent is stuck
  • Human-in-the-loop approval gates — blocking a tool call until a human approves or denies it, with configurable timeouts and webhook notifications
  • Session tracking — correlating a sequence of LLM calls and tool calls into one agent session, with kill-switch capability
  • Argument scanning — checking tool call arguments for PII before they reach the MCP server, and scanning responses for PII output leaks

Dvara's MCP Proxy is a dedicated process (port 8070) that sits between your agents and your MCP servers. It runs the same policy engine, PII detector, and audit trail as the LLM Gateway — governed from the same control plane, same admin UI, same compliance reports.
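Loop detection in particular is easy to reason about with a sketch. The window size and labels below are hypothetical thresholds, not Dvara's defaults:

```python
from collections import deque

class LoopDetector:
    """Flag N identical consecutive tool calls, or a two-tool
    A-B-A-B cycle, over a sliding window of recent calls."""

    def __init__(self, repeat_limit: int = 5):
        self.repeat_limit = repeat_limit
        self.recent = deque(maxlen=repeat_limit)

    def observe(self, tool_name: str):
        self.recent.append(tool_name)
        if len(self.recent) == self.repeat_limit:
            names = list(self.recent)
            if len(set(names)) == 1:
                return "repeat-loop"  # same tool N times in a row
            if len(set(names)) == 2 and all(
                names[i] != names[i + 1] for i in range(len(names) - 1)
            ):
                return "alternating-loop"  # A-B-A-B stuck cycle
        return None
```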


Where Does the API Gateway Fit?

You don't replace your API gateway with Dvara. They serve different layers:

| Layer | Responsibility | Tool |
| --- | --- | --- |
| Edge | TLS termination, global rate limiting, IP filtering, DDoS protection | NGINX / ALB / Cloudflare |
| AI Governance | Provider translation, policy, PII, guardrails, cost, audit, MCP governance | Dvara |
| Upstream | LLM inference, tool execution | OpenAI API, MCP servers |

Your API gateway handles the network. Dvara handles the AI semantics. They compose, not compete.

Internet → [API Gateway] → [Dvara LLM Gateway] → OpenAI / Anthropic / Gemini

Agents → [API Gateway] → [Dvara MCP Proxy] → postgres:// slack:// filesystem://

The Cost of Not Having an AI Gateway

Teams that try to govern LLM traffic through API gateways typically hit these failure modes within the first quarter:

  1. Surprise bills — No token-based budgets. A runaway agent loop racks up thousands of dollars before anyone notices.
  2. PII leaks to providers — Customer data flows through prompts to external LLMs with no scanning or redaction.
  3. Provider lock-in — Direct SDK integrations with one provider. Switching requires weeks of code changes.
  4. Compliance gaps — Auditors ask for tamper-evident logs of every AI interaction. Access logs don't qualify.
  5. Ungoverned tool calls — Agents call destructive tools with no policy check, no approval gate, no audit trail.
  6. Shadow AI — Without a governed gateway, teams route around IT and call LLM APIs directly.

Getting Started

Dvara is a single binary. Run it with one provider key and you have a governed LLM gateway in under 5 minutes:

OPENAI_API_KEY=sk-... ./mvnw -pl gateway-server spring-boot:run

Then point any OpenAI SDK at http://localhost:8080:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="any-key")
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello, world!"}]
)

Every request now flows through Dvara's governance filter chain — routing, failover, rate limiting, and audit — even before you enable the enterprise features.

For the full setup guide, see the Quickstart.


Further Reading