The prototype-to-production gap most teams hit in week one

Every team that picks up the Claude Agent SDK starts the same way. The quick-start runs in a notebook, a model picks tools, a task completes in eight turns, and the room agrees the agent is shipping next sprint. Then it goes near a real workload and the wheels come off in week one.

The pattern is consistent. A Bangalore-based startup wires the SDK into a customer-support flow and watches a single ambiguous ticket spin the agent into a 240-turn loop that burns through the month's API budget by Wednesday. A London fintech connects the same SDK to a research workflow and discovers the agent has shell-access scopes it should never have seen, because the prototype handed every tool to every session. A Mumbai analytics team ships the agent on Vercel Functions and watches half its long-running tasks die at the 60-second timeout, never finishing.

None of these are bugs in the SDK. They are the expected outcome of treating a prototype harness as a production harness. The fix is not more model intelligence; it is the boring, load-bearing infrastructure that wraps the model: budgets, permissions, hooks, and a clear deployment shape. This piece is the playbook that builders we trust actually use to close that gap.

Pro tip

If your agent goes from prototype to a paying customer without picking up at least three of: per-task budget, scoped tool permissions, a pre-tool hook, a recovery hook, and an explicit deployment shape — you are shipping a prototype with a production logo on it. The cost will catch up within two billing cycles.

Inside the harness — the five layers around the model

The harness is the bit of the Claude Agent SDK that sits between your code and the model. Most teams skim past it because the prototype works without thinking about it. In production, every interesting failure happens inside one of its layers.

There are five layers worth knowing by name, because when something goes wrong you need to know which one to instrument.

| Layer | What it does | Failure mode if missing |
|---|---|---|
| Permission pipeline | Decides which tool calls are allowed for the current session, tenant and task. | Agents accidentally use tools they were never meant to see. |
| Context-management system | Compacts, summarises and prunes turns as the conversation grows. | Context window blows out; latency and cost balloon mid-task. |
| Sandboxing layer | Confines tool execution — filesystem, network, shell — to a defined perimeter. | One prompt-injected tool call escapes into the host. |
| Tool router | Dispatches model-emitted tool calls to the right implementation, with the right credentials. | Wrong-tenant data leaks across calls; debug traces become unreadable. |
| Recovery infrastructure | Restarts, replays or skips after partial failure without losing the task. | A single transient error kills a 40-minute agent run. |

The five layers are not optional and they are not interchangeable. Skipping any one of them is the equivalent of shipping a web app with no rate limiter — it works until it doesn't, and the failure mode is loud.
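
To make the context-management layer in the table concrete, here is a minimal sketch of a compaction step. It assumes the conversation is a plain array of turns and that the caller supplies the summariser. The `Turn` type, `compactTurns` and its parameters are hypothetical names chosen for illustration, not SDK APIs.

```typescript
// Hypothetical turn shape; a real harness would also carry tool payloads and token counts.
interface Turn {
  role: "user" | "assistant" | "tool" | "summary";
  text: string;
}

// Keep the most recent `keepLast` turns verbatim and fold everything older
// into a single summary turn produced by the injected `summarise` function.
function compactTurns(
  turns: readonly Turn[],
  keepLast: number,
  summarise: (older: readonly Turn[]) => string,
): Turn[] {
  if (!Number.isInteger(keepLast) || keepLast < 0) {
    throw new RangeError("keepLast must be a non-negative integer");
  }
  if (turns.length <= keepLast) return turns.slice(); // nothing to compact
  const cut = turns.length - keepLast;
  const summary: Turn = { role: "summary", text: summarise(turns.slice(0, cut)) };
  return [summary, ...turns.slice(cut)];
}
```

In practice `summarise` would usually be a model call. Injecting it keeps the compaction policy testable without one.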

Budgets are the first thing you wire, not the last

The mistake most teams make is treating budgets like an observability concern — something the finance team will ask about after the bill arrives. By then it is too late. A runaway agentic loop can spend a month of allocated budget in an afternoon. The harness needs hard caps, applied at three levels, before a single production user touches the agent.

| Budget tier | Example cap | What triggers it |
|---|---|---|
| Per-task | $0.50 input + $1.00 output, max 25 tool calls | A single user request goes into runaway tool loops or context-blowout territory. |
| Per-user | $5.00 per hour, $30.00 per day | One enthusiastic user (or a compromised account) tries to grind the agent through a million tokens before lunch. |
| Per-tenant | $500 per day, $10,000 per month | A tenant's whole user base hammers the agent and the bill threatens margin. |

Hard caps protect margin before a runaway loop eats it. Soft caps — warnings, emails, dashboards — are useful only as a layered signal. They are not a substitute for a circuit breaker. If a per-task budget is exceeded, the harness must refuse to dispatch the next tool call. Not log it. Not warn. Refuse.
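
The refuse-don't-warn rule can be sketched as a small circuit breaker. This is a minimal illustration under stated assumptions: an in-memory ledger, one dollar cap per tier, and no time windows. `BudgetLedger`, `Caps`, `BudgetKey` and `assertWithinBudget` are hypothetical names, not part of the SDK.

```typescript
// Hypothetical in-memory ledger. A production version would sit in a shared
// store and reset its counters per hourly/daily window.
type Tier = "task" | "user" | "tenant";

interface BudgetKey { taskId: string; userId: string; tenantId: string; }

interface Caps {
  maxUsd: Record<Tier, number>; // hard dollar cap per tier
  maxToolCallsPerTask: number;  // hard cap on tool calls within one task
}

class BudgetLedger {
  private caps: Caps;
  private spentUsd = new Map<string, number>();
  private toolCalls = new Map<string, number>();

  constructor(caps: Caps) {
    this.caps = caps;
  }

  // Record one completed tool call and its cost against all three tiers.
  record(key: BudgetKey, usd: number): void {
    for (const [, id] of this.scopes(key)) {
      this.spentUsd.set(id, (this.spentUsd.get(id) ?? 0) + usd);
    }
    this.toolCalls.set(key.taskId, (this.toolCalls.get(key.taskId) ?? 0) + 1);
  }

  // Return the first tier whose cap has been reached, or null if none has.
  breachedTier(key: BudgetKey): Tier | null {
    if ((this.toolCalls.get(key.taskId) ?? 0) >= this.caps.maxToolCallsPerTask) return "task";
    for (const [tier, id] of this.scopes(key)) {
      if ((this.spentUsd.get(id) ?? 0) >= this.caps.maxUsd[tier]) return tier;
    }
    return null;
  }

  private scopes(key: BudgetKey): Array<[Tier, string]> {
    return [
      ["task", `task:${key.taskId}`],
      ["user", `user:${key.userId}`],
      ["tenant", `tenant:${key.tenantId}`],
    ];
  }
}

// Circuit breaker: call this before dispatching every tool call.
function assertWithinBudget(ledger: BudgetLedger, key: BudgetKey): void {
  const tier = ledger.breachedTier(key);
  if (tier !== null) throw new Error(`budget exceeded at ${tier} tier; refusing dispatch`);
}
```

Because cost is only known after a call completes, this sketch can overshoot a cap by at most one call. A stricter variant would reserve an estimated cost before dispatch.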

Recommended

Wire the budget check as a pre-tool hook in the permission pipeline, not as a wrapper around the SDK call. Hook-based budgets survive retries, recovery and tool-router edge cases that wrapper-based budgets quietly skip. A Bangalore agent startup we know moved their budget logic from wrapper to hook and watched their incident rate on cost overruns drop to zero in a fortnight.

Least-privilege tool scoping — session, tenant, task

Most prototypes pass the union of every capability to every session. Production demands the opposite: each session gets only the tools it strictly needs, scoped to the tenant making the request and the task it is currently working on. Three dimensions, not one.

The principle is straightforward, but the implementation is where teams slip up. Consider a customer-support agent that can read tickets, escalate to a human, and refund up to a fixed amount. In production, you want:

  • Session scope — this signed-in agent operator can use tools A and B, but tool C requires a manager session.
  • Tenant scope — even when tool A is allowed, it only reads data belonging to the tenant currently in scope. No cross-tenant reads, ever.
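  • Task scope — within this specific support task, only the refund tool that matches this customer's currency is callable; other-currency refunds are rejected.

// permission-pipeline.ts — tool scoping at three levels
// The ToolCall and Ctx shapes below are illustrative, not SDK types.
interface ToolCall {
  tool: { name: string; requiredRole: string };
  args: { currency?: string; amount?: number };
}
interface Ctx {
  session: { roles: string[] };
  tenant: { enabledTools: Set<string>; currency: string };
  task: { maxRefund: number };
}

export const allowTool = (call: ToolCall, ctx: Ctx): boolean => {
  // 1. Session scope: role check
  if (!ctx.session.roles.includes(call.tool.requiredRole)) return false;

  // 2. Tenant scope: tool must be enabled for this tenant
  if (!ctx.tenant.enabledTools.has(call.tool.name)) return false;

  // 3. Task scope: refund must use the tenant's currency and stay within the task's cap
  if (call.tool.name === 'refund') {
    if (call.args.currency !== ctx.tenant.currency) return false;
    // Reject missing, non-positive or NaN amounts as well as amounts over the cap.
    const amount = call.args.amount;
    if (typeof amount !== 'number' || !(amount > 0) || amount > ctx.task.maxRefund) return false;
  }
  return true;
};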

The reason for three dimensions is brutally practical. Session scope alone leaks data between tenants. Tenant scope alone gives every operator the same authority. Task scope alone is meaningless without the first two. You need all three composed in a single decision.

Watch out

Do not implement tool scoping by filtering the tool list shown to the model. A jailbroken or prompt-injected model can emit a tool call for a tool it never saw in its tool definitions. The permission check must run on the inbound tool call, not on what the model was shown. Filtering the tool list is presentation logic; gating tool execution is the security boundary.
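
The difference between filtering and gating is easiest to see side by side. The sketch below assumes a hypothetical in-process tool registry. `toolsShownToModel` is the presentation step and `dispatchToolCall` is the security boundary. Neither name comes from the SDK.

```typescript
interface InboundToolCall { name: string; args: Record<string, unknown>; }
type ToolImpl = (args: Record<string, unknown>) => string;

// Every tool the host process is able to execute (hypothetical registry).
const toolRegistry = new Map<string, ToolImpl>([
  ["read_ticket", (args) => `ticket ${String(args["id"])}`],
  ["run_shell", (args) => `ran ${String(args["cmd"])}`],
]);

// Presentation logic: decides what the model is told about. Not a security control.
function toolsShownToModel(allowed: ReadonlySet<string>): string[] {
  return Array.from(toolRegistry.keys()).filter((name) => allowed.has(name));
}

// Security boundary: runs on every inbound call, whatever the model was shown.
function dispatchToolCall(call: InboundToolCall, allowed: ReadonlySet<string>): string {
  if (!allowed.has(call.name)) {
    throw new Error(`permission denied: ${call.name}`);
  }
  const impl = toolRegistry.get(call.name);
  if (!impl) throw new Error(`unknown tool: ${call.name}`);
  return impl(call.args);
}
```

A model that was never shown `run_shell` can still emit a call for it. With the check in `dispatchToolCall`, that call is rejected before any implementation runs.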

Why permission enforcement is a separate code path from reasoning

This is the single most important architectural pattern in the harness, and the one most prototype-grade agents get wrong. The model decides what it wants to do. A different system decides whether it is allowed.

Why does this matter? Because every model is, by construction, jailbreakable. There is no prompt you can write that makes a model refuse to even consider an instruction. Sooner or later an attacker — or just an unlucky user input — will produce a prompt that talks the model into emitting a tool call it should never make. If the only check on that tool call is "did the model agree the call was safe?", then the entire safety story collapses the moment the model is wrong.

The right design is structural. The reasoning lives in the model. The permission check lives in a different process, with its own policy file, its own deployment cadence, and ideally its own engineering owner. A jailbroken model can emit any tool call it likes; the permission check will reject it because the check is a different code path that the model cannot influence.

Avoid

Asking the model to enforce its own permissions ("only call the refund tool if the user has authority"). This is theatre. The model will sometimes comply and sometimes not. The harness is the right place for enforcement, not the prompt.

This is the same separation-of-duties principle that database engineers learned three decades ago: the application decides what query to run, the database decides whether the user is allowed to run it. The agent SDK harness lets you reproduce that separation between model and policy. Use it.
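
One way to picture the separate code path is a pure policy evaluator that never sees the prompt or the model's reasoning. The sketch below is a minimal illustration, assuming a declarative policy document with one rule per tool. `Policy`, `PolicyRule`, `ToolRequest` and `evaluatePolicy` are hypothetical names.

```typescript
// Hypothetical policy document. In practice it would be loaded from its own
// file or service and deployed independently of prompts and agent code.
interface PolicyRule { tool: string; roles: string[]; maxAmount?: number; }
interface Policy { version: string; rules: PolicyRule[]; }

// Only facts established by the harness go in here, never model output about itself.
interface ToolRequest { tool: string; role: string; amount?: number; }

type Decision = { allowed: true } | { allowed: false; reason: string };

function evaluatePolicy(policy: Policy, req: ToolRequest): Decision {
  const rule = policy.rules.find((r) => r.tool === req.tool);
  if (!rule) return { allowed: false, reason: "no rule for tool (default deny)" };
  if (!rule.roles.includes(req.role)) return { allowed: false, reason: "role not permitted" };
  if (rule.maxAmount !== undefined) {
    if (req.amount === undefined || req.amount > rule.maxAmount) {
      return { allowed: false, reason: "amount missing or above policy limit" };
    }
  }
  return { allowed: true };
}
```

The role in `ToolRequest` comes from the authenticated session, so nothing the model says can raise its own privileges. Tools without a rule are denied by default.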

Hooks: the production primitive most prototypes skip

For production deployment, hooks are non-negotiable. They are the official extension points where you wire budgets, audit logs, content filters, tenant isolation and recovery logic. Every prototype skips them. Every production agent has at least four.

The minimum set of hooks for a production agent looks like this:

  • pre-tool — Budget check, permission check, content-safety check on tool arguments. Refuses or modifies the call before dispatch.
  • post-tool — Audit log entry, cost accounting, observability span emission. Runs after every tool result, success or failure.
  • on-context-overflow — Custom compaction policy, summarisation handoff, or task-end if compaction is not acceptable for the workload.
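  • on-error — Recovery strategy: retry with smaller context, switch tools, escalate to human, or hard-fail. Without this, transient errors become task failures.

// hooks.ts — minimum production hook configuration
// Hook names follow this article's terminology and may not match the SDK's own
// hook event names; map them onto whatever events your SDK version exposes.
// Assumed to be defined elsewhere in the app: allowTool (permission-pipeline.ts),
// the budget, audit, cost and summariser services, and the BudgetExceeded and
// PermissionDenied error classes.
export const hooks = {
  // Runs before dispatch: refuse the call on a budget or permission failure.
  preTool: async (call, ctx) => {
    if (!await budget.check(ctx.task.id, ctx.user.id, ctx.tenant.id)) {
      throw new BudgetExceeded(ctx);
    }
    if (!allowTool(call, ctx)) throw new PermissionDenied(call, ctx);
    return call;
  },
  // Runs after every tool result, success or failure: audit first, then cost.
  postTool: async (call, result, ctx) => {
    await audit.write({ call, result, ctx, ts: Date.now() });
    await cost.record(ctx.task.id, result.usage);
  },
  // Compaction policy: keep the last five turns verbatim, summarise the rest.
  onContextOverflow: async (turns, ctx) => {
    return summariser.compact(turns, { keepLast: 5, summariseRest: true });
  },
  // Retry retryable errors up to three times, then hand off to a human.
  onError: async (err, ctx) => {
    if (err.retryable && ctx.task.retries < 3) return { retry: true };
    return { humanEscalate: true };
  },
};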

These four hooks are the contract surface between the model loop and your business. Treat them as such — review them like you review your auth middleware, not like you review a debug print.
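
The on-error hook is where most of the judgment sits, so it helps to write the recovery decision as a pure function that can be tested on its own. The sketch below covers the four strategies listed for on-error. The error categories, the `Recovery` union, the attempt limit of three and the backoff values are all assumptions made for illustration.

```typescript
// Hypothetical error categories and recovery actions; not SDK types.
type ErrorKind = "transient" | "context_too_large" | "tool_unavailable" | "permission" | "unknown";

type Recovery =
  | { action: "retry"; delayMs: number }
  | { action: "retry_smaller_context"; keepLast: number }
  | { action: "switch_tool"; fallback: string }
  | { action: "escalate_to_human" }
  | { action: "hard_fail" };

interface FailureInfo {
  kind: ErrorKind;
  attempt: number;       // attempts already made for this step, starting at 0
  fallbackTool?: string; // alternative tool, if the task defines one
}

const MAX_ATTEMPTS = 3;

function decideRecovery(f: FailureInfo): Recovery {
  // Permission denials are never retried: retrying cannot change the policy decision.
  if (f.kind === "permission") return { action: "hard_fail" };
  if (f.attempt >= MAX_ATTEMPTS) return { action: "escalate_to_human" };
  switch (f.kind) {
    case "transient":
      // Exponential backoff: 500 ms, 1000 ms, 2000 ms.
      return { action: "retry", delayMs: 500 * 2 ** f.attempt };
    case "context_too_large":
      return { action: "retry_smaller_context", keepLast: 5 };
    case "tool_unavailable":
      return f.fallbackTool
        ? { action: "switch_tool", fallback: f.fallbackTool }
        : { action: "escalate_to_human" };
    default:
      return { action: "escalate_to_human" };
  }
}
```

Keeping the decision separate from the hook body means the retry policy can be reviewed and unit-tested without running an agent.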

Deployment shape matters more than people think

Different deployment shapes impose different constraints on state, timeouts and streaming. A long-running, tool-heavy agent forced into a 15-minute Lambda window will die in production. A short, stateless agent stuck inside a long-lived container is paying for idle compute it does not need. Match the shape to the workload.

| Shape | State | Timeout | Streaming | Best-fit agent profile |
|---|---|---|---|---|
| Vercel Functions | None between invocations | ~300s (Pro) | Yes, SSE-native | Short, stateless chat agents that respond to a single user turn. |
| AWS Lambda | Optional (DynamoDB / S3) | 15 minutes | Yes, with response streaming | Bursty, mid-length tasks with mild state — workflows under 10 minutes. |
| Long-running container | Native, in-process or sidecar | Unbounded | Native | Long-running, tool-heavy, multi-step agents — code generation, deep research, document review. |

Two regional notes worth flagging. A UK financial-services workload often needs data residency in-region — a long-running container in London on a private VPC frequently wins by default. An Indian SaaS workload with bursty consumer traffic often does better on AWS Lambda in Mumbai with response streaming on; the cost curve is more forgiving at scale and the local-region latency is acceptable.

Pick the shape that fits the agent's runtime profile rather than forcing the agent into a serverless box because that is what the rest of the stack uses. For a deeper deployment-shape walkthrough, see our companion piece "How to Build a Production AI Agent" and the broader landscape comparison in "Agent-SDK wars: OpenAI vs Google ADK vs Anthropic".
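
The matching rule can be written down as a small decision function. This is a sketch only: the timeouts are the ones quoted in the table, the 1.5x headroom factor is an assumption, and `pickShape` and `WorkloadProfile` are hypothetical names. Check the limits against your own plan before relying on them.

```typescript
type DeploymentShape = "vercel-function" | "aws-lambda" | "long-running-container";

interface WorkloadProfile {
  expectedMaxSeconds: number;   // worst-case wall-clock time for one task
  needsInProcessState: boolean; // true if state must stay in memory between steps
}

// Timeouts as quoted in the table; verify against your plan and provider limits.
const TIMEOUT_SECONDS = { "vercel-function": 300, "aws-lambda": 900 } as const;

// Assumed headroom: a shape qualifies only if its timeout is at least 1.5x the worst case.
const HEADROOM = 1.5;

function pickShape(p: WorkloadProfile): DeploymentShape {
  if (p.needsInProcessState) return "long-running-container";
  const needed = p.expectedMaxSeconds * HEADROOM;
  if (needed <= TIMEOUT_SECONDS["vercel-function"]) return "vercel-function";
  if (needed <= TIMEOUT_SECONDS["aws-lambda"]) return "aws-lambda";
  return "long-running-container";
}
```

With these numbers a ten-minute workflow still fits the 15-minute Lambda window, which lines up with the "under 10 minutes" guidance in the table. Anything longer falls through to a container.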

How self-hosted sandboxes change the perimeter story

Anthropic shipped self-hosted sandboxes at Code with Claude London on 19 May 2026. This is the piece that closes the loop on the harness story. The SDK gives you budgets, permissions and hooks. The self-hosted sandbox gives you a perimeter for tool execution.

For UK regulated industries — financial services, healthcare, public sector — the perimeter argument was the unblocker. The agent now runs inside your network, with your data residency, behind your firewall. The sandbox enforces the boundary between agent reasoning and your sensitive systems. The same logic applies to Indian regulated workloads where DPDP-mandated data localisation rules out a multi-tenant managed runtime.

The composition is the point. The harness gives you policy. The sandbox gives you isolation. Either one without the other is half a solution. With both, a London bank can put an agent on customer-support tickets without the compliance team losing sleep, and a Bangalore lender can put one on loan-decisioning workflows that touch DPDP-protected fields. See Claude Managed Agents Beta and Claude Code Agent View for adjacent primitives in the same release cycle.
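
To show what a perimeter means at the level of a single decision, here is a minimal allow-list sketch for filesystem, network and command access. It is not the configuration format of any real sandbox. `Perimeter` and the three check functions are hypothetical, and real isolation has to be enforced by the sandbox at the OS and network level rather than by application code like this.

```typescript
// Hypothetical perimeter description; not the configuration format of any real sandbox.
interface Perimeter {
  fsRoots: string[];  // absolute directories the agent may touch
  hosts: string[];    // exact lowercase hostnames the agent may connect to
  commands: string[]; // executables the agent may run, invoked without a shell
}

// Resolve "." and ".." segments so "/workspace/../etc" cannot pass a prefix check.
// Symlinks are not resolved here; a real sandbox must handle them.
function normalisePath(path: string): string {
  const parts: string[] = [];
  for (const part of path.split("/")) {
    if (part === "" || part === ".") continue;
    if (part === "..") parts.pop();
    else parts.push(part);
  }
  return "/" + parts.join("/");
}

function canAccessPath(p: Perimeter, path: string): boolean {
  if (!path.startsWith("/")) return false; // relative paths are rejected outright
  const target = normalisePath(path);
  return p.fsRoots.some((root) => {
    const r = normalisePath(root);
    const prefix = r === "/" ? "/" : r + "/";
    return target === r || target.startsWith(prefix);
  });
}

function canConnect(p: Perimeter, hostname: string): boolean {
  return p.hosts.includes(hostname.toLowerCase());
}

// Commands are checked as argv arrays so shell metacharacters are never interpreted.
function canRun(p: Perimeter, argv: readonly string[]): boolean {
  const exe = argv[0];
  return exe !== undefined && p.commands.includes(exe);
}
```

The path check normalises before comparing, so a traversal such as `/workspace/../etc/passwd` is evaluated as `/etc/passwd` and rejected.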

The week-one production checklist

If you remember nothing else from this piece, run through this checklist before your first production user touches the agent.

  • Per-task budget wired as a pre-tool hook with a hard cap on tokens, cost and tool-call count.
  • Per-user and per-tenant budgets implemented at hourly and daily windows, with circuit-breaker behaviour on breach.
  • Tool scoping on the inbound call, scoped by session, tenant and task — not by filtering the tool list shown to the model.
  • Permission enforcement in a different code path from reasoning, with its own policy file and deployment lifecycle.
  • Four hooks live: pre-tool, post-tool, on-context-overflow, on-error. Each one writes a structured audit log entry.
  • Deployment shape chosen deliberately, with timeouts and streaming behaviour matching the agent's runtime profile.
  • Sandbox perimeter defined — either self-hosted or vendor-managed — with explicit allow-lists for filesystem, network and shell access.
  • Observability emits one span per tool call, one event per hook trigger, one record per budget decision. If you cannot answer "what did the agent do for user X last Tuesday at 14:42?" in under a minute, you are not done.
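
The "last Tuesday at 14:42" test in the final item is easier to pass if every span, hook event and budget decision lands in one structured record type. The sketch below assumes such a record and an in-memory array standing in for the audit store. `AuditRecord`, `activityFor` and `minuteWindow` are hypothetical names.

```typescript
// Hypothetical audit record: one per tool call, hook trigger or budget decision.
interface AuditRecord {
  ts: number; // epoch milliseconds
  userId: string;
  tenantId: string;
  taskId: string;
  kind: "tool_call" | "hook" | "budget_decision";
  name: string;
  outcome: "ok" | "error" | "allowed" | "denied";
}

// Everything one user's agent did in the half-open window [fromMs, toMs), oldest first.
function activityFor(
  records: readonly AuditRecord[],
  userId: string,
  fromMs: number,
  toMs: number,
): AuditRecord[] {
  return records
    .filter((r) => r.userId === userId && r.ts >= fromMs && r.ts < toMs)
    .sort((a, b) => a.ts - b.ts);
}

// The one-minute window containing the given timestamp.
function minuteWindow(ts: number): [number, number] {
  const start = Math.floor(ts / 60000) * 60000;
  return [start, start + 60000];
}
```

In a real deployment the filter would be an indexed query on user and timestamp in the audit store. The point is that the record carries enough fields to answer the question without joining logs by hand.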

None of this is glamour work. None of it makes the demo more impressive. All of it is the difference between an agent that ships and an agent that becomes a quarterly incident. The Claude Agent SDK gives you the primitives. The harness is what you build with them.

Primary sources for further reading: the Anthropic platform docs on hosting at platform.claude.com and the Code with Claude docs at code.claude.com. Production-pattern field notes at digitalapplied.com and kenhuangus.substack.com. Harness architecture deep-dive at wavespeed.ai.