What you need to know

  • An agent is four parts — an LLM brain, memory, tools, and a run loop. Get those right and the rest is detail.
  • Scope ruthlessly — ship one workflow, make it reliable, then expand. A broad agent that does ten things badly helps nobody.
  • Tool access is a security surface — unsafe tool access is one of the top reasons first agents fail. Apply least-privilege from day one.
  • Evals are not optional — a measurable eval harness is what stops a prompt tweak quietly breaking last month's behaviour.

Plenty of teams in Bengaluru and London have a working agent demo. Far fewer have an agent that has survived three months of real users. The gap between the two is not model quality: frontier models are more than capable. The gap is engineering discipline: scoping the job tightly, wiring the loop correctly, granting tools carefully, and measuring behaviour continuously. This guide walks through those steps in the order you should actually build them.

The four core components of an agent

Strip away the framework branding and every production agent is the same four parts. Understand them as separate concerns and you can debug each one in isolation.

The LLM brain

This is the model that does the reasoning — interpreting the goal, deciding which tool to call, and judging whether the task is done. You do not need the largest model for every step. A common pattern is to route planning and hard reasoning to a strong model such as Claude Sonnet 4.6, and route cheap, high-volume steps — classification, short summaries, simple extraction — to a smaller, faster model such as GPT-5.4 mini or Gemini 2.5 Flash. Treat the model as a swappable component behind an interface, so you can change it when pricing or latency shifts.
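
One way to keep the model swappable is to hide it behind a small interface and route by step type. The sketch below is illustrative only: `ChatModel`, `StubModel`, `ModelRouter` and the `CHEAP_STEPS` set are hypothetical names, and the stub stands in for whatever provider SDK you actually wrap.

```python
from dataclasses import dataclass
from typing import Protocol


class ChatModel(Protocol):
    """Hypothetical interface: anything that turns a prompt into text."""

    def complete(self, prompt: str) -> str: ...


@dataclass
class StubModel:
    """Stand-in for a real provider client; swap in your own SDK wrapper."""

    name: str

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] {prompt[:40]}"


# Step kinds assumed cheap enough for the smaller model (illustrative list).
CHEAP_STEPS = {"classify", "summarise", "extract"}


@dataclass
class ModelRouter:
    strong: ChatModel
    fast: ChatModel

    def pick(self, step_kind: str) -> ChatModel:
        # Planning and hard reasoning fall through to the strong model.
        return self.fast if step_kind in CHEAP_STEPS else self.strong

    def complete(self, step_kind: str, prompt: str) -> str:
        return self.pick(step_kind).complete(prompt)


router = ModelRouter(strong=StubModel("strong"), fast=StubModel("fast"))
print(router.complete("classify", "Is this a refund request?"))
print(router.complete("plan", "Work out how to resolve ticket 17"))
```

Because the rest of the agent only sees `router.complete`, changing a model when pricing or latency shifts is a one-line change where the router is constructed.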

Memory

Agents need two kinds of memory. Short-term memory is the working context of the current session — the conversation so far, the last few tool results, the running plan. Long-term memory is a persistent store the agent reads from and writes to across sessions: past resolutions, user preferences, a knowledge base. Keep them separate. Short-term memory belongs in the prompt window; long-term memory belongs in a database or vector store that the agent queries through a tool.
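
The separation can be sketched in a few lines. This is a toy under stated assumptions: `ShortTermMemory`, `LongTermMemory` and `recall_tool` are hypothetical names, and the dictionary stands in for the database or vector store you would use in practice.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ShortTermMemory:
    """Working context for one session; rendered straight into the prompt."""

    max_items: int = 6
    items: deque = field(default_factory=deque)

    def add(self, entry: str) -> None:
        self.items.append(entry)
        while len(self.items) > self.max_items:
            self.items.popleft()  # oldest entries fall out of the window

    def render(self) -> str:
        return "\n".join(self.items)


@dataclass
class LongTermMemory:
    """Persistent store across sessions; a dict here, a real store in production."""

    notes: dict = field(default_factory=dict)  # user_id -> list of notes

    def write(self, user_id: str, note: str) -> None:
        self.notes.setdefault(user_id, []).append(note)

    def search(self, user_id: str, keyword: str) -> list:
        found = self.notes.get(user_id, [])
        return [n for n in found if keyword.lower() in n.lower()]


def recall_tool(memory: LongTermMemory, user_id: str, keyword: str) -> list:
    """The tool the agent calls; long-term memory is never pasted in wholesale."""
    return memory.search(user_id, keyword)
```

The point of the split is that the prompt window only ever holds the bounded short-term view, while anything older has to be fetched deliberately through a tool call.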

Tools and actions

Tools are how the agent affects the world beyond generating text — calling an internal API, running a web search, querying a database, reading or writing a file. Each tool is a typed function the model can invoke. The quality of your tool definitions — clear names, tight schemas, honest descriptions — matters as much as the prompt. A vague tool is a tool the model will misuse.
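
As a rough illustration of what "a typed function with a tight schema" can mean, the sketch below wraps a function with a name, a description and a parameter schema that is checked before the call runs. The `Tool` class, the `_ORDERS` data and the schema format are assumptions for this example, not any framework's API.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Tool:
    name: str
    description: str
    parameters: dict  # argument name -> required Python type
    fn: Callable[..., Any]

    def call(self, **kwargs: Any) -> Any:
        unknown = set(kwargs) - set(self.parameters)
        missing = set(self.parameters) - set(kwargs)
        if unknown or missing:
            raise ValueError(
                f"{self.name}: unknown={sorted(unknown)} missing={sorted(missing)}"
            )
        for key, expected in self.parameters.items():
            if not isinstance(kwargs[key], expected):
                raise TypeError(f"{self.name}: {key} must be {expected.__name__}")
        return self.fn(**kwargs)


# Fake data standing in for an internal orders API.
_ORDERS = {"A100": "shipped", "B200": "processing"}

get_order_status = Tool(
    name="get_order_status",
    description="Return the shipping status for one order ID. Read-only.",
    parameters={"order_id": str},
    fn=lambda order_id: _ORDERS.get(order_id, "not found"),
)

print(get_order_status.call(order_id="A100"))
```

The name and description are what the model reads when choosing a tool, so they deserve the same care as the prompt itself.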

The run loop

The loop is the controller that ties the other three together. It feeds the model an observation, lets it reason, executes the tool the model picked, feeds the result back, and repeats until the task is complete or a stopping condition fires. Without a well-designed loop you have a chatbot, not an agent.

The agent loop in detail

The core cycle is observe → reason → act → check. The agent observes the current state, reasons about what to do next, acts by calling a tool, then checks whether the goal is met. If not, it loops again. A slightly richer framing is plan, act, observe, refine: the agent forms a plan, takes one step, observes the result, and refines the plan in light of what it learned. Refinement is the part beginners skip — and skipping it is why agents charge ahead on a stale plan.
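
The plan, act, observe, refine cycle can be sketched as a plain loop. Everything here is a simplification: `LoopState`, `run_loop` and the toy callbacks are hypothetical, and in a real agent `act` would execute a tool while `refine` and `is_done` would be model calls.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class LoopState:
    goal: str
    plan: list
    observations: list = field(default_factory=list)


def run_loop(
    state: LoopState,
    act: Callable[[str], str],
    refine: Callable[[LoopState], list],
    is_done: Callable[[LoopState], bool],
    max_steps: int = 8,
) -> str:
    for _ in range(max_steps):
        if not state.plan:
            break  # nothing left to try
        action = state.plan.pop(0)           # take one step of the plan
        state.observations.append(act(action))  # act, then observe the result
        if is_done(state):                   # check the goal
            return "done"
        state.plan = refine(state)           # refine the plan before looping
    return "escalate"


def toy_act(action: str) -> str:
    results = {"lookup_order": "order A100 found", "check_policy": "refund allowed"}
    return results.get(action, "unknown action")


def toy_refine(state: LoopState) -> list:
    # Drop the rest of the plan if the last step made no sense.
    return [] if "unknown" in state.observations[-1] else state.plan


def toy_is_done(state: LoopState) -> bool:
    return "refund allowed" in state.observations


state = LoopState(goal="Is order A100 refundable?", plan=["lookup_order", "check_policy"])
print(run_loop(state, toy_act, toy_refine, toy_is_done))
```

Note that `refine` runs on every iteration, not only on failure; that is the step that stops the agent from charging ahead on a stale plan.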

Two control mechanisms keep the loop safe. The first is a retry path: when a tool call fails or returns nonsense, the loop should retry with a bounded number of attempts rather than crashing or hallucinating around the failure. The second is escalate-to-human: when the agent is stuck, low-confidence, or about to take an irreversible action, it should hand off to a person. A bank agent in Mumbai handling a disputed transaction, or an NHS-adjacent triage agent in Leeds, must escalate rather than guess. Build both paths before you build features.
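
A minimal sketch of the two paths working together: retry a bounded number of times, then hand off instead of guessing. `EscalateToHuman`, `call_with_retry` and `make_flaky` are hypothetical names, and backoff between attempts is left out to keep the example short.

```python
from typing import Any, Callable


class EscalateToHuman(Exception):
    """Raised when the loop should stop and hand the task to a person."""


def call_with_retry(
    tool: Callable[..., Any],
    args: dict,
    max_attempts: int = 3,
    is_valid: Callable[[Any], bool] = lambda result: result is not None,
) -> Any:
    last_error = "no attempts made"
    for _ in range(max_attempts):
        try:
            result = tool(**args)
        except Exception as exc:  # tool failed outright
            last_error = repr(exc)
            continue
        if is_valid(result):
            return result
        last_error = f"invalid result: {result!r}"  # tool returned nonsense
    raise EscalateToHuman(f"{max_attempts} attempts failed; last error: {last_error}")


def make_flaky(fail_times: int) -> Callable[..., dict]:
    """Toy tool that times out a fixed number of times before succeeding."""
    calls = {"n": 0}

    def tool(order_id: str) -> dict:
        calls["n"] += 1
        if calls["n"] <= fail_times:
            raise TimeoutError("upstream timeout")
        return {"order_id": order_id, "status": "shipped"}

    return tool


print(call_with_retry(make_flaky(2), {"order_id": "A100"}))
```

The caller catches `EscalateToHuman` at the top of the loop and routes the task, with the recorded error, to whoever owns the handoff queue.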

Pro tip

Cap the loop with a hard step limit — for example, no more than eight iterations per task. An uncapped loop that gets confused will happily burn money and hit rate limits. The step limit is your circuit breaker; combine it with an escalate-to-human handoff when the cap is reached.

Scoping: ship one workflow first

The single most useful decision you will make is to scope small. Pick one workflow — refund eligibility checks, invoice data extraction, first-line IT support — and make that one workflow genuinely reliable before you add a second. A narrow agent that resolves one job at 95 percent reliability is worth more than a broad agent that attempts ten jobs and is trusted for none.

The build approach you choose should follow from the workflow's stakes and your team's appetite for control. There are three broad options.

Approach | Best for | Control | Trade-off
No-code | Linear workflows and fast prototypes, such as validating an idea this week | Low | Quick to ship; hard to test rigorously or customise the loop
Semi-code | Most real production deployments: a framework plus your own logic | Balanced | Best speed-to-control ratio; needs engineering ownership
Full-code | Regulated domains, such as Indian banking under RBI or UK financial services under the FCA | High | Every decision path is auditable; slowest to build and maintain

For a prototype, no-code is the fastest way to learn whether the workflow is even worth automating. For most production deployments, semi-code wins: you take a framework for the loop and orchestration, and write your own code for the parts that matter. If you are weighing frameworks for that semi-code layer, our comparison of LangGraph, CrewAI, PydanticAI and Microsoft's stack breaks down which suits which workload. Full-code is the right call only when a regulator will eventually ask you to explain every decision the agent made.

Safe tool access — least-privilege from day one

Unsafe tool access is one of the top causes of agents failing in production, and it is the failure mode with the worst blast radius. If an agent can call a tool, assume that under some prompt it eventually will. So the question is not whether the agent behaves — it is what the worst plausible tool call can do.

Apply least-privilege. Give each tool the narrowest scope that still does the job. A support agent that needs to read order status should get a read-only query, not write access to the orders table. Scope database credentials to specific rows or tenants. Make irreversible actions — issuing refunds, deleting records, sending external emails — require either a human approval step or a tightly bounded, logged tool with hard limits.
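
One possible shape for this, sketched under assumptions: each tool declares whether it is irreversible and what its hard limit is, and a single `invoke` gate enforces both. `ScopedTool`, `ApprovalRequired`, `invoke` and the 500 limit are all made up for illustration.

```python
import logging
from dataclasses import dataclass
from typing import Any, Callable, Optional

log = logging.getLogger("agent.tools")


class ApprovalRequired(Exception):
    """Raised when an irreversible tool is called without human sign-off."""


@dataclass(frozen=True)
class ScopedTool:
    name: str
    fn: Callable[..., Any]
    irreversible: bool = False
    max_amount: Optional[float] = None  # hard limit, if the tool moves money


def invoke(tool: ScopedTool, approved_by: Optional[str] = None, **kwargs: Any) -> Any:
    if tool.irreversible and approved_by is None:
        raise ApprovalRequired(f"{tool.name} needs human approval")
    amount = kwargs.get("amount")
    if tool.max_amount is not None and amount is not None and amount > tool.max_amount:
        raise ValueError(f"{tool.name}: amount {amount} exceeds limit {tool.max_amount}")
    log.info("tool=%s approved_by=%s args=%r", tool.name, approved_by, kwargs)
    return tool.fn(**kwargs)


# Read-only lookup: no approval, no write path.
read_order_status = ScopedTool(
    name="read_order_status",
    fn=lambda order_id: {"A100": "shipped"}.get(order_id, "not found"),
)

# Irreversible action: approval required and capped at an illustrative limit.
issue_refund = ScopedTool(
    name="issue_refund",
    fn=lambda order_id, amount: f"refunded {amount} for {order_id}",
    irreversible=True,
    max_amount=500.0,
)
```

The support agent in the example above would only ever be handed `read_order_status`; `issue_refund` exists, but nothing reaches it without a named approver.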

Validate every input the agent passes to a tool. The model's tool arguments are untrusted input: check types, enforce ranges, reject anything outside an allowlist before the call executes. Never concatenate model output straight into a SQL string or a shell command. And log every tool call with its arguments and result, so that when something goes wrong you can reconstruct exactly what the agent did.
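
Here is a small sketch of those three habits together, using Python's built-in `sqlite3` and `logging` modules. The `orders` table, the order ID format in `ORDER_ID_PATTERN` and the function name are assumptions for the example.

```python
import logging
import re
import sqlite3
from typing import Optional

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent.tools")

# Hypothetical ID format: one capital letter followed by 3 to 8 digits.
ORDER_ID_PATTERN = re.compile(r"[A-Z]\d{3,8}")


def get_order_status(conn: sqlite3.Connection, order_id: object) -> Optional[str]:
    # Treat the model's argument as untrusted input: check type and shape first.
    if not isinstance(order_id, str) or not ORDER_ID_PATTERN.fullmatch(order_id):
        log.warning("rejected get_order_status order_id=%r", order_id)
        raise ValueError("order_id failed validation")
    # Parameterised query: the value is bound, never concatenated into the SQL.
    row = conn.execute(
        "SELECT status FROM orders WHERE id = ?", (order_id,)
    ).fetchone()
    result = row[0] if row else None
    log.info("get_order_status order_id=%s result=%r", order_id, result)
    return result


# In-memory database so the example runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO orders VALUES ('A100', 'shipped')")

print(get_order_status(conn, "A100"))
```

Validation and parameter binding are deliberately both present: the allowlist pattern rejects malformed input early, and the bound parameter protects the query even if the pattern is later loosened.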

Watch out

A tool described as "run a database query" with full credentials is a production incident waiting to happen. Prompt injection from user content or a retrieved document can steer the agent into calling it destructively. Split it into specific, read-scoped tools — get_order_status, list_user_invoices — each with validated arguments and no write path.

Building an eval harness

An eval harness is a repeatable test suite for your agent's behaviour. It is a fixed set of representative tasks, each with a known good outcome, that you run on every change. Without it you are flying blind: a prompt tweak that fixes one case can silently break five others, and you will only find out from an angry user.

Test prompt and context changes against the harness before they reach code review, not after. Treat the eval suite the way you treat unit tests — a change that drops the score does not ship. The KPIs that belong in the harness are measurable and few:

  • Task accuracy — did the agent produce the correct outcome? Target around 95 percent for a workflow you want users to trust.
  • Task-completion rate — did the agent finish without erroring out or escalating? Target around 90 percent.
  • Response time — wall-clock time per task. Slow agents quietly lose users even when they are accurate.
  • Cost per task — total token and tool spend per run. This is the number that decides whether the agent is viable at scale.

The reason this matters in one line: evals prevent regressions. They turn "the agent feels worse since Tuesday" into a number you can see drop in CI. Build the harness with twenty hand-picked cases before you write the agent's second feature — it is cheaper than debugging in production, whether your users are in Pune or Manchester.
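
A first harness does not need a framework. The sketch below runs a fixed list of cases through an agent callable and reports the four KPIs; `EvalCase`, `RunResult`, `run_evals`, `gate` and the toy agent are hypothetical, and the thresholds simply mirror the targets above.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class EvalCase:
    task: str
    expected: str


@dataclass
class RunResult:
    output: Optional[str]
    completed: bool  # False if the agent errored out or escalated
    cost: float      # token plus tool spend for this run


def run_evals(agent: Callable[[str], RunResult], cases: list) -> dict:
    correct = completed = 0
    total_cost = total_seconds = 0.0
    for case in cases:
        start = time.perf_counter()
        result = agent(case.task)
        total_seconds += time.perf_counter() - start
        completed += int(result.completed)
        correct += int(result.completed and result.output == case.expected)
        total_cost += result.cost
    n = len(cases)
    return {
        "task_accuracy": correct / n,
        "completion_rate": completed / n,
        "avg_seconds": total_seconds / n,
        "avg_cost": total_cost / n,
    }


def gate(report: dict, min_accuracy: float = 0.95, min_completion: float = 0.90) -> bool:
    """True if the change may ship; wire this into CI as a failing check."""
    return (
        report["task_accuracy"] >= min_accuracy
        and report["completion_rate"] >= min_completion
    )


def toy_agent(task: str) -> RunResult:
    answers = {"refund A100?": "eligible", "refund B200?": "not eligible", "refund D400?": "eligible"}
    if task not in answers:
        return RunResult(output=None, completed=False, cost=0.01)
    return RunResult(output=answers[task], completed=True, cost=0.02)


cases = [
    EvalCase("refund A100?", "eligible"),
    EvalCase("refund B200?", "not eligible"),
    EvalCase("refund C300?", "eligible"),       # agent gives up on this one
    EvalCase("refund D400?", "not eligible"),   # agent answers wrongly
]
report = run_evals(toy_agent, cases)
print(report, "ship" if gate(report) else "blocked")
```

With exact-match scoring like this, the cases need unambiguous expected outcomes, which is one more reason to scope the workflow narrowly first.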

Why first agents fail in production

The failure modes are predictable, and each maps to a step above:

  1. Unclear scope. An agent asked to handle "customer queries" in general has no reliable success criterion. Narrow the job until success is unambiguous.
  2. Unsafe tool access. Broad credentials and unvalidated tool arguments turn a model slip — or a prompt injection — into real damage. Least-privilege closes this.
  3. Missing evaluations. With no eval harness, you cannot tell whether today's change helped or hurt. Regressions ship unnoticed.
  4. Surprising costs at scale. An agent that costs a few rupees per run in testing can cost lakhs a month at production volume. An uncapped loop multiplies that. Measure cost per task in the harness from the start.
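
The cost point is easy to check with back-of-envelope arithmetic. Every number below is made up for illustration (half a rupee per loop step, 10,000 tasks a day); substitute the cost per task your own harness measures.

```python
def monthly_cost(
    cost_per_step: float, avg_steps: float, tasks_per_day: int, days: int = 30
) -> float:
    """Projected monthly spend, in the same currency as cost_per_step."""
    return cost_per_step * avg_steps * tasks_per_day * days


LAKH = 100_000

typical = monthly_cost(cost_per_step=0.5, avg_steps=4, tasks_per_day=10_000)
capped_worst = monthly_cost(cost_per_step=0.5, avg_steps=8, tasks_per_day=10_000)
confused = monthly_cost(cost_per_step=0.5, avg_steps=40, tasks_per_day=10_000)

print(f"typical run (4 steps):   {typical / LAKH:.0f} lakh rupees a month")
print(f"worst case at 8-step cap: {capped_worst / LAKH:.0f} lakh rupees a month")
print(f"uncapped, 40 steps:       {confused / LAKH:.0f} lakh rupees a month")
```

Under these assumed figures a run that costs two rupees in testing comes to six lakh a month at volume, and the step cap is what bounds the worst case at twice that instead of ten times.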

None of these are model problems. They are engineering-discipline problems, which is good news — discipline is something you control. As agent tooling matures, the same discipline scales to multi-session work; our look at running many agent sessions from one CLI shows where this goes once a single workflow is solid.

What to do this week

You do not need a quarter to make progress. This week: pick exactly one workflow and write its success criterion in a single sentence. Sketch the observe-reason-act-check loop on paper, including the retry path and the escalate-to-human handoff. List every tool the agent needs and write down the narrowest scope each can have. Then hand-pick twenty representative test cases with known good outcomes; that is your first eval harness. Build the agent against that harness, using no-code for the prototype if it helps you learn faster. Ship the one workflow, watch the four KPIs, and only then add the second. Reliability first, breadth later.