Running Untrusted AI Code Safely with Vercel Sandbox

Vercel Sandbox gives you Firecracker microVMs for executing untrusted or AI-generated code — the kind of isolation you need when your agent wants to run arbitrary Python or Node. A hands-on walkthrough with a code-interpreter tool.

Gerson · https://vercel.com/docs/vercel-sandbox

The moment an AI agent can write code, the question stops being "can it help?" and becomes "where do we run this?" Dropping unreviewed Python into eval or a shared container is how you get outbound spam, cryptominers, and an awkward call with AWS abuse. Vercel Sandbox, which went GA in January 2026, is Vercel's answer: ephemeral Firecracker microVMs with filesystem and network isolation, designed to run exactly this kind of untrusted code.

This article covers what Sandbox is, how to wire it into an AI agent as a code-execution tool, and a few of the footguns worth knowing before production.

What a Sandbox Actually Is

Each sandbox is a Firecracker microVM — the same lightweight KVM-based virtualization AWS uses for Lambda and Fargate. You get:

  • A fresh root filesystem per sandbox
  • Your choice of runtime (Node.js 24, Python 3.13, Bun) with common tooling preinstalled
  • A network policy you control — outbound allowlist, or fully offline
  • Automatic teardown after a configurable idle timeout
  • Cold-start under a second; warm-start effectively instant

Because the isolation boundary is a real VM rather than a shared kernel, breaking out requires a hypervisor- or kernel-level exploit, not just a container escape — the same property that makes Firecracker safe for multi-tenant compute.

Creating and Running a Sandbox

The SDK is straightforward. You create a sandbox, pipe code into it, and collect stdout and the exit code.

lib/sandbox/run-python.ts

```typescript
import { Sandbox } from '@vercel/sandbox';

export async function runPython(code: string) {
  const sandbox = await Sandbox.create({
    runtime: 'python3.13',
    idleTimeoutMs: 30_000,
    network: { allow: [] }, // fully offline
  });

  try {
    const result = await sandbox.exec({
      cmd: ['python', '-c', code],
      timeoutMs: 15_000,
    });

    return {
      stdout: result.stdout,
      stderr: result.stderr,
      exitCode: result.exitCode,
    };
  } finally {
    await sandbox.close();
  }
}
```

Wiring It Into an Agent

The usual shape is to expose the sandbox as a tool on an agent: the model writes code, you run it, the output goes back to the model, and the loop repeats.

agents/code-interpreter.ts

```typescript
import { generateText, tool, stepCountIs } from 'ai';
import { z } from 'zod';
import { runPython } from '@/lib/sandbox/run-python';

const executeCode = tool({
  description: 'Run a Python snippet. Return stdout and stderr.',
  inputSchema: z.object({
    code: z.string().describe('Self-contained Python source'),
  }),
  execute: async ({ code }) => runPython(code),
});

export async function solve(problem: string) {
  const { text } = await generateText({
    model: 'anthropic/claude-sonnet-4-6',
    tools: { executeCode },
    stopWhen: stepCountIs(6),
    system: 'Solve problems by writing and running small Python programs.',
    prompt: problem,
  });
  return text;
}
```

With this tool plus a math question, the agent typically writes, runs, debugs, and re-runs code a few times before producing an answer. Because runPython creates a fresh sandbox on every call, there is no state leakage between invocations — a useful property when the same endpoint serves many users.
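One practical detail: whatever the tool returns feeds straight back into the model's context, so a script that prints a million lines will blow your token budget long before the sandbox bills you. A minimal sketch of a clamp you could apply to stdout and stderr before returning them from the tool — the name `clampOutput` and the limits are illustrative, not part of any SDK:

```typescript
// Clamp tool output before it re-enters the model's context.
// Keeps the head and tail of long output, which is usually where
// the useful signal (first error, final result) lives.
export function clampOutput(text: string, maxChars = 4_000): string {
  if (text.length <= maxChars) return text;
  const half = Math.floor(maxChars / 2);
  const omitted = text.length - maxChars;
  return `${text.slice(0, half)}\n…[${omitted} chars omitted]…\n${text.slice(-half)}`;
}
```

Apply it inside the tool's execute, e.g. `{ ...result, stdout: clampOutput(result.stdout) }`, so the sandbox still captures everything but the model only sees a bounded slice.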

Keeping State Across Tool Calls

For notebook-style interaction where the agent builds up state (loads a dataframe, runs multiple queries against it), you can keep a long-lived sandbox and expose two tools: one to write files, one to run code.

lib/sandbox/session.ts

```typescript
import { Sandbox } from '@vercel/sandbox';

export class SandboxSession {
  private sandbox!: Awaited<ReturnType<typeof Sandbox.create>>;

  async start() {
    this.sandbox = await Sandbox.create({
      runtime: 'python3.13',
      idleTimeoutMs: 10 * 60_000,
    });
  }

  writeFile(path: string, body: string) {
    return this.sandbox.writeFile(path, body);
  }

  exec(cmd: string[]) {
    return this.sandbox.exec({ cmd, timeoutMs: 60_000 });
  }

  close() {
    return this.sandbox.close();
  }
}
```
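If one server process handles many conversations, each chat needs its own long-lived session, created lazily and reused across tool calls. A minimal registry sketch — deliberately generic so it is not tied to any SDK type; the class name and the release shape are illustrative:

```typescript
// One live session per conversation id, created lazily by a factory
// and removed on release. Storing the Promise (not the resolved value)
// means two concurrent tool calls for the same id share one creation.
export class SessionRegistry<T> {
  private sessions = new Map<string, Promise<T>>();

  constructor(private factory: () => Promise<T>) {}

  // Returns the existing session for this id, or creates one.
  get(id: string): Promise<T> {
    let session = this.sessions.get(id);
    if (!session) {
      session = this.factory();
      this.sessions.set(id, session);
    }
    return session;
  }

  // Drops the session from the registry, then runs the caller's cleanup.
  async release(id: string, close: (s: T) => Promise<unknown>) {
    const session = this.sessions.get(id);
    this.sessions.delete(id);
    if (session) await close(await session);
  }
}
```

With a factory that calls `new SandboxSession()` plus `start()`, and a `release` callback that calls `close()`, each conversation gets exactly one sandbox for its lifetime.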

Do not forget to close(): Sandboxes cost real money per CPU-second. A sandbox kept alive because an error skipped your cleanup path will quietly burn through budget. Use try/finally or a context-manager style wrapper, and set a conservative idleTimeoutMs as a backstop.
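One way to make the cleanup path hard to skip is a small generic wrapper — a sketch, not part of the SDK — that works for SandboxSession or anything else with a close() method:

```typescript
// Generic "context manager": guarantees close() runs even when the
// body throws, so a crashed tool call cannot leak a billed sandbox.
export async function withResource<T extends { close(): Promise<unknown> }, R>(
  resource: T,
  body: (r: T) => Promise<R>,
): Promise<R> {
  try {
    return await body(resource);
  } finally {
    await resource.close();
  }
}
```

Callers then cannot forget the finally block: `await withResource(session, (s) => s.exec(['python', 'analysis.py']))` closes the session whether the body returns or throws.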

Security Knobs Worth Setting

  • Network policy. Default to allow: []. Opening the network to specific hosts (allow: ['api.yourdata.com']) is safer than leaving it wide open.
  • Timeouts. Set both a per-exec timeout and an overall sandbox idle timeout. A loop or an infinite generator should not run for hours.
  • Resource limits. CPU and memory limits are configurable at create time. A code interpreter exposed to the public should cap both low, so one abusive request cannot monopolize compute or turn your endpoint into free mining capacity.
  • No secrets in env. Whatever you put in env is readable from within the sandbox. If your AI-written code only needs to hit one API, pass the token as a runtime argument rather than a long-lived env var.
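The same allowlist mindset applies to environment variables. A small helper — the name is hypothetical, and whether your sandbox's create call accepts an env option is an assumption to check against the docs — that forwards only explicitly named variables instead of spreading process.env:

```typescript
// Forward only an explicit allowlist of variables into the sandbox;
// everything else on the host stays on the host.
export function pickSafeEnv(
  env: Record<string, string | undefined>,
  allow: string[],
): Record<string, string> {
  const safe: Record<string, string> = {};
  for (const key of allow) {
    const value = env[key];
    if (value !== undefined) safe[key] = value;
  }
  return safe;
}
```

Even if the sandbox runtime exposes no env option, the same function is useful for building the argument list you pass to the AI-written script.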

When To Pick Sandbox vs Alternatives

  • Sandbox — untrusted or AI-generated code, user-uploaded scripts, coding agents, ephemeral execution.
  • Vercel Functions — your own code. Sandbox adds isolation overhead you do not need here.
  • Self-hosted Docker/Firecracker — when you need specialized kernel modules, GPUs, or isolation tuning beyond what Sandbox exposes.
  • E2B / Modal / Daytona — similar positioning to Sandbox; pick based on region coverage and pricing.
