One API for Every Model: Vercel AI Gateway in Production

The Vercel AI Gateway gives you a single endpoint that proxies every major LLM provider with built-in observability, automatic failover, and zero data retention. A practical guide to adopting it, configuring fallbacks, and when it actually saves you money.

Gerson · https://vercel.com/ai-gateway
Fiber optic cables converging, representing a unified network gateway

When the Vercel AI Gateway went GA in August 2025, the pitch was simple: one API for every model. No more juggling three provider SDKs, three API keys, three billing accounts, and three subtly different streaming formats. Point your AI SDK at the gateway, pick a model by string, and keep shipping.

After a few months of using it in production, the story is a bit richer than the pitch. The gateway is genuinely great at the obvious things — provider failover, unified observability, a single invoice — and it also quietly changes how you think about model selection. Below is what it actually is, how to wire it up, and where it pays for itself.

What the Gateway Actually Does

The gateway sits between your app and the model providers (OpenAI, Anthropic, Google, Mistral, xAI, Groq, and others). You call one endpoint; it routes to the upstream provider, streams the response back, and records per-request telemetry: latency, tokens, cost, cache hit rate, and errors.

Three things make it more than a proxy:

  • Automatic failover. Configure a primary model and one or more fallbacks. If the primary 5xxs or times out, the gateway retries against the fallback without your app knowing.
  • Zero data retention. The gateway does not store prompts or completions. Telemetry is metadata only.
  • Provider-agnostic caching. Responses can be cached at the gateway layer across providers, so repeat prompts (think: evals, regression tests, high-overlap tool calls) avoid upstream costs entirely.
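The failover behavior in the first bullet is worth internalizing: the gateway only moves to the next model in the chain on upstream errors, never on client errors. A minimal sketch of that decision logic (the function and type names here are hypothetical, purely for illustration — the real routing lives inside the gateway):

```typescript
// Hypothetical sketch of gateway-style failover: walk the model chain and
// pick the first model that has not failed with a retryable error.
type Attempt = { model: string; status: number };

function shouldFailover(status: number): boolean {
  // 5xx (provider outage), 429 (rate limit), and 408 (timeout) trigger
  // fallback; 4xx client errors like 400/404 do not — retrying the same
  // bad request against another provider would fail identically.
  return status >= 500 || status === 429 || status === 408;
}

function resolveModel(chain: string[], attempts: Attempt[]): string | null {
  for (const model of chain) {
    const failed = attempts.some(
      (a) => a.model === model && shouldFailover(a.status),
    );
    if (!failed) return model;
  }
  return null; // every model in the group has failed this request
}
```

The key design point: a 400 from the primary means your request is malformed, so the gateway surfaces it immediately instead of burning fallback quota on it.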

Wiring It Up

If you are on the AI SDK v5.2+ or v6, gateway usage is the default when you pass a string model ID. Set one env var and stop thinking about provider SDKs.

.env.local

AI_GATEWAY_API_KEY=vk_live_...

app/api/chat/route.ts

import { streamText, convertToModelMessages } from 'ai';

export async function POST(req: Request) {
  const { messages } = await req.json();

  const result = streamText({
    model: 'anthropic/claude-sonnet-4-6',
    messages: convertToModelMessages(messages),
  });

  return result.toUIMessageStreamResponse();
}

That is it. No @ai-sdk/anthropic install, no ANTHROPIC_API_KEY, no provider-specific config block. Swap anthropic/claude-sonnet-4-6 for openai/gpt-4o or google/gemini-2-pro and it works the same way.
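Because every gateway model ID follows the same provider/model shape, it is easy to derive per-provider logging or metrics tags from the string itself. A small helper (the name parseModelId is my own, not part of the AI SDK):

```typescript
// Hypothetical helper: split a gateway model ID ("provider/model") so that
// logs and dashboards can group requests by upstream provider.
function parseModelId(id: string): { provider: string; model: string } {
  const slash = id.indexOf("/");
  if (slash === -1) {
    throw new Error(`expected "provider/model", got "${id}"`);
  }
  return { provider: id.slice(0, slash), model: id.slice(slash + 1) };
}
```

This also makes a one-line model swap safe to grep for: there is exactly one string to change, and a typo fails loudly at parse time instead of deep inside a provider call.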

Configuring Fallbacks

Fallbacks live in the Vercel dashboard under your project's AI tab. You define a model group — a primary plus one or more fallbacks — and reference the group name in code. Changing the routing later is a dashboard edit, not a redeploy.

app/api/summarize/route.ts

import { generateText } from 'ai';

export async function POST(req: Request) {
  const { text } = await req.json();

  const { text: summary } = await generateText({
    model: 'group/summarizer',
    prompt: `Summarize in 3 bullets:\n\n${text}`,
  });

  return Response.json({ summary });
}

In the dashboard, group/summarizer might be configured as: primary openai/gpt-4o-mini, fallback anthropic/claude-haiku-4-5, fallback google/gemini-2-flash. If OpenAI throws a 429, the gateway retries Claude. If Claude is also down, it tries Gemini. Your route handler sees a single successful response.

Worth knowing: Fallbacks are not free load-balancing. The gateway only fails over on actual errors (5xx, 429, timeout). It will not spread traffic across providers for latency reasons — for that you want explicit model selection in your own code.
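If you do want latency-aware routing, it has to live in your own code. One sketch of what "explicit model selection" can look like — a router that tracks an exponential moving average of observed latency per model and picks the currently fastest one (the class and its API are hypothetical, not a gateway or AI SDK feature):

```typescript
// Hypothetical app-side latency router. The gateway will not do this for
// you: it only fails over on errors, so latency-based spreading is yours.
class LatencyRouter {
  private avg = new Map<string, number>();

  constructor(
    private models: string[],
    private alpha = 0.3, // EMA weight for the newest observation
  ) {}

  // Call after each request with the measured round-trip time.
  record(model: string, ms: number): void {
    const prev = this.avg.get(model);
    this.avg.set(
      model,
      prev === undefined ? ms : this.alpha * ms + (1 - this.alpha) * prev,
    );
  }

  // Pick the model with the lowest moving-average latency. Models with no
  // measurements yet default to 0 so they get tried early.
  pick(): string {
    let best = this.models[0];
    let bestMs = this.avg.get(best) ?? 0;
    for (const m of this.models) {
      const ms = this.avg.get(m) ?? 0;
      if (ms < bestMs) {
        best = m;
        bestMs = ms;
      }
    }
    return best;
  }
}
```

You would then pass router.pick() as the model string into streamText, and record() the elapsed time when the response finishes — the gateway still handles the error-path failover underneath.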

When It Pays For Itself

Three concrete scenarios where the gateway changes the economics:

1. Bursty traffic on the edge of rate limits

If your app sometimes hits OpenAI's TPM limits during peak hours, a fallback to a second provider converts hard 429 failures into "a slightly different response." In practice this can be the difference between a 0.1% and a 2% error rate during spikes.

2. Eval and regression test runs

Caching makes a huge dent here. A test suite that replays 200 prompts against a model hits the gateway cache after the first run, and subsequent runs cost pennies. We cut eval CI cost by around 70% after enabling caching on the eval-specific model group.
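The arithmetic behind that saving is simple enough to sketch. Assuming (hypothetically) that only the first run pays upstream and later runs hit the cache at some rate, total eval cost over a CI window looks like this (integer cents to keep the math exact; the numbers are illustrative, not our real figures):

```typescript
// Back-of-envelope cost model for gateway caching on eval runs.
// runs: total CI runs; prompts: prompts per run; centsPerPrompt: upstream
// cost per prompt in cents; hitRate: fraction of prompts served from cache
// on runs after the first.
function evalCostCents(
  runs: number,
  prompts: number,
  centsPerPrompt: number,
  hitRate: number,
): number {
  const firstRun = prompts * centsPerPrompt; // cold cache pays full price
  const laterRuns = (runs - 1) * prompts * centsPerPrompt * (1 - hitRate);
  return firstRun + laterRuns;
}
```

With a perfect hit rate, 10 runs cost the same as 1; at a more realistic partial hit rate the savings converge toward hitRate × (runs − 1) / runs, which is how a suite rerun many times per week ends up roughly 70% cheaper.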

3. Comparing providers without rewrites

Swapping openai/gpt-4o for anthropic/claude-sonnet-4-6 in one line means a meaningful provider bake-off is an afternoon, not a sprint.

When It Might Not Be Worth It

  • You need provider-specific features the gateway does not proxy yet. Anthropic's fine-grained cache control and OpenAI's Realtime API historically lagged gateway support. Check the feature matrix before committing.
  • You are on one provider and will stay there. If you are Anthropic-only and all-in on Claude features, direct SDK usage has fewer layers to debug.
  • Strict residency. The gateway runs in Vercel's infrastructure. If your compliance story requires calls to originate from a specific region or VPC, direct provider SDKs in your own compute may be the answer.

Resources