Why single-agent architectures hit a ceiling

A single agent running in a long context window can handle a surprising range of tasks, but three constraints reliably force builders toward multi-agent designs once workloads grow beyond toy complexity.

The first is context limits. Even with models offering 128k or 1M-token windows, shoving an entire workflow's history, tool outputs, and instructions into one context is expensive and fragile. Retrieval quality tends to degrade as context grows, and cost per task scales at least linearly with context size. A multi-agent design where each specialist receives only the context it needs is often both cheaper and more accurate.

The second is specialisation. A general-purpose agent prompted to "research, then draft, then critique, then execute" performs worse than a pipeline of purpose-built agents, each with a tightly scoped system prompt, a curated tool set, and a narrow responsibility. The same principle that makes human teams more effective than individual generalists applies to agent teams.

The third is parallelism. Some steps in a workflow are genuinely independent: validating two different sections of a legal document, querying three separate data sources, or running unit tests on separate modules of generated code. A single sequential agent cannot exploit this parallelism. A LangGraph pipeline with fan-out edges can dispatch multiple agents simultaneously and join their results, cutting wall-clock time substantially for parallelisable workloads.

The hierarchical retrieval work from April 2026 reinforces this: the biggest accuracy gains in retrieval-augmented systems came not from better retrieval algorithms but from routing queries to specialist agents with domain-specific retrieval configurations.
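The fan-out and join described under parallelism can be illustrated without any framework. The following is a minimal sketch in plain asyncio, not LangGraph's own API; validate_section and its governing-law check are hypothetical stand-ins for a specialist agent call.

```python
import asyncio

async def validate_section(name: str, text: str) -> dict:
    """Hypothetical specialist: checks one section for a governing-law clause."""
    await asyncio.sleep(0.01)  # stands in for an LLM or API call
    return {"section": name, "ok": "governing law" in text.lower()}

async def fan_out_and_join(sections: dict[str, str]) -> list[dict]:
    # Fan out: one coroutine per independent section.
    tasks = [validate_section(name, text) for name, text in sections.items()]
    # Join: gather waits for every branch and preserves input order.
    return await asyncio.gather(*tasks)

results = asyncio.run(fan_out_and_join({
    "clause_1": "Governing law: England and Wales.",
    "clause_2": "Payment is due within 30 days.",
}))
```

Because the branches overlap in time, wall-clock latency is roughly that of the slowest branch rather than the sum of all branches.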

Short-term memory: State objects and checkpointing

In LangGraph, short-term memory is the State object — a typed dictionary that flows through the graph and accumulates updates as each node writes to it. Every node receives the current state, does its work, and returns a partial update. LangGraph merges these updates using the reducer functions you define (or its defaults) to produce the next state.
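To make the merge step concrete, here is a plain-Python sketch of how per-key reducers combine a node's partial update with the current state. It imitates the idea only; REDUCERS and merge_update are illustrative names, not LangGraph functions.

```python
import operator
from typing import Any, Callable

# Per-key reducers: keys without a reducer are simply overwritten.
REDUCERS: dict[str, Callable[[Any, Any], Any]] = {"messages": operator.add}

def merge_update(state: dict, update: dict) -> dict:
    """Return a new state with a node's partial update merged in."""
    merged = dict(state)
    for key, value in update.items():
        if key in REDUCERS and key in merged:
            merged[key] = REDUCERS[key](merged[key], value)  # e.g. list concatenation
        else:
            merged[key] = value  # default behaviour: last write wins
    return merged

state = {"task": "review contract", "messages": ["user: start"]}
state = merge_update(state, {"messages": ["researcher: done"], "research_summary": "3 clauses"})
state = merge_update(state, {"research_summary": "4 clauses"})
```

The messages list accumulates across nodes, while research_summary is replaced by the latest write.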

Checkpointing persists this state to a backend between node executions, so the graph can survive restarts, honour human-in-the-loop pauses (covered in depth in our LangGraph v0.4 HITL guide), and allow replaying from any prior checkpoint. The state object is therefore both the message-passing mechanism between agents and the durable memory within a single run.
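As a rough mental model of what a checkpointer provides, the sketch below keeps an append-only history of state snapshots per thread in memory. InMemoryCheckpoints is a hypothetical toy class, not a LangGraph backend; a real deployment would use a persistent checkpointer.

```python
import copy

class InMemoryCheckpoints:
    """Toy checkpoint store: one append-only history of state snapshots per thread."""

    def __init__(self) -> None:
        self._history: dict[str, list[dict]] = {}

    def save(self, thread_id: str, state: dict) -> int:
        """Snapshot the state after a node runs; return the checkpoint index."""
        snapshots = self._history.setdefault(thread_id, [])
        snapshots.append(copy.deepcopy(state))
        return len(snapshots) - 1

    def load(self, thread_id: str, index: int = -1) -> dict:
        """Return a copy of a checkpoint (the latest by default) to resume or replay from."""
        return copy.deepcopy(self._history[thread_id][index])

store = InMemoryCheckpoints()
store.save("thread-1", {"task": "review", "draft": ""})
store.save("thread-1", {"task": "review", "draft": "v1"})
latest = store.load("thread-1")                # resume after a restart or a human pause
replay_from = store.load("thread-1", index=0)  # replay from an earlier checkpoint
```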

A key design discipline is keeping state schema deliberately narrow. Fields that only one agent needs should live in that agent's subgraph state, not in the global state. Global state should contain only what must be shared across agents: the original task, accumulated results, routing decisions, and error signals.

Long-term memory: vector stores, Redis, and episodic patterns

Short-term memory dies when the thread ends. Long-term memory is the mechanism for retaining knowledge across separate runs, separate users, and separate sessions. Three patterns dominate production deployments:

Vector store retrieval

Embed past interactions, domain documents, or learned preferences into a vector store — Pinecone, pgvector on Postgres, or Weaviate. At the start of each run, a memory-retrieval node queries the store for the k most semantically relevant past records and prepends them to the agent's context. This is the foundation of agentic RAG and is well-supported by LangGraph's ToolNode pattern: the retrieval is modelled as a tool call, keeping the graph structure clean.

For Indian customer-support automation teams, this pattern lets a support agent "remember" that a particular enterprise customer always prefers responses in a formal register, prefers WhatsApp escalation over email, and has an SLA of two-hour first response — without any of that being in the incoming query.
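The retrieval step itself reduces to a nearest-neighbour lookup over embeddings. The sketch below uses hand-written three-dimensional vectors and pure-Python cosine similarity so it runs anywhere; the records and vectors are invented for illustration, and a production system would get embeddings from a model and delegate the search to the vector store.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec: list[float], records: list[dict], k: int = 2) -> list[str]:
    """Return the text of the k records most similar to the query vector."""
    ranked = sorted(records, key=lambda r: cosine(query_vec, r["embedding"]), reverse=True)
    return [r["text"] for r in ranked[:k]]

# Illustrative memory records with toy embeddings.
MEMORY = [
    {"text": "Customer prefers a formal register", "embedding": [0.9, 0.1, 0.0]},
    {"text": "Escalate via WhatsApp, not email", "embedding": [0.1, 0.9, 0.0]},
    {"text": "SLA: two-hour first response", "embedding": [0.0, 0.1, 0.9]},
]

# A memory-retrieval node would prepend these to the agent's context.
context = top_k([0.8, 0.3, 0.0], MEMORY, k=2)
```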

Redis for fast key-value recall

For structured facts — user preferences, account status, recent actions — a Redis key-value store is faster and cheaper than vector retrieval. A memory node near the graph entry point fetches a set of keys from Redis (e.g. user:{id}:preferences, user:{id}:last_action) and writes them into the state before specialist agents run. TTL-based expiry keeps the store lean without requiring explicit invalidation logic.
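The key layout and TTL behaviour can be shown with an in-memory stand-in. TTLStore below is a hypothetical class that imitates only get, set, and expiry; it is not a Redis client, and the injected fake clock exists purely to make the example deterministic.

```python
import time
from typing import Any, Callable, Optional

class TTLStore:
    """In-memory stand-in for a key-value store with per-key expiry."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: dict[str, tuple[Any, float]] = {}

    def set(self, key: str, value: Any, ttl_seconds: float) -> None:
        self._data[key] = (value, self._clock() + ttl_seconds)

    def get(self, key: str) -> Optional[Any]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:  # expired: drop lazily on read
            del self._data[key]
            return None
        return value

def load_user_memory(store: TTLStore, user_id: str) -> dict:
    """What a memory node near the graph entry point would write into state."""
    keys = {"preferences": f"user:{user_id}:preferences",
            "last_action": f"user:{user_id}:last_action"}
    return {field: store.get(key) for field, key in keys.items()}

now = [0.0]  # fake clock so expiry is deterministic
store = TTLStore(clock=lambda: now[0])
store.set("user:42:preferences", {"register": "formal"}, ttl_seconds=900)
store.set("user:42:last_action", "opened_ticket", ttl_seconds=60)
fresh = load_user_memory(store, "42")
now[0] = 120.0  # two minutes later the short-lived key has expired
later = load_user_memory(store, "42")
```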

Episodic memory patterns

Episodic memory goes further: rather than retrieving raw past documents, a memory-consolidation agent periodically summarises past sessions into compact "episode" records and stores those summaries. At retrieval time, the agent loads episode summaries rather than raw conversation history, dramatically reducing token usage while preserving the semantic content of prior interactions. This pattern is particularly valuable for multi-turn workflows that span days — legal matter management in UK law firms, or ongoing code-review pipelines in Indian product companies.
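A consolidation step might look like the sketch below. Episode, consolidate, and build_context are illustrative names, and placeholder_summarise is a trivial stand-in for the LLM summarisation call a real memory-consolidation agent would make.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Episode:
    session_id: str
    summary: str
    turn_count: int

def consolidate(session_id: str, turns: list[str],
                summarise: Callable[[list[str]], str]) -> Episode:
    """Compress one session's raw turns into a compact episode record."""
    return Episode(session_id=session_id, summary=summarise(turns), turn_count=len(turns))

def placeholder_summarise(turns: list[str]) -> str:
    # Stand-in for an LLM call: keep only the first and last turn.
    return f"{turns[0]} ... {turns[-1]}" if len(turns) > 1 else "".join(turns)

def build_context(episodes: list[Episode]) -> str:
    """What the agent loads at retrieval time: summaries, not raw history."""
    return "\n".join(f"[{e.session_id}] {e.summary}" for e in episodes)

turns = [
    "Client asked for an NDA review",
    "Flagged clause 4",
    "Client approved the amended clause 4",
]
episode = consolidate("session-1", turns, placeholder_summarise)
context = build_context([episode])
```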

Role-based agent specialisation

The most common multi-agent architecture in enterprise LangGraph deployments is a four-role pipeline: researcher, drafter, critic, and executor. Each role maps to a LangGraph node or subgraph.

The researcher agent receives the task and a retrieval tool set. Its job is to gather relevant context — from vector memory, web search, or internal databases — and write a structured research summary to the shared state. It does not generate user-facing output.

The drafter agent receives the research summary and the original task, and produces a first draft of the output. It has access to a generation tool set but no retrieval tools — its job is focused synthesis, not information gathering.

The critic agent receives the draft and applies a domain-specific rubric. In a legal-document workflow (common in UK firms doing contract review), the critic checks for compliance with specific clauses, flags missing boilerplate, and scores the draft. In a code-review pipeline (common in Indian product teams), it checks for security vulnerabilities, style violations, and test coverage. The critic does not rewrite; it annotates and scores.

The executor agent receives the annotated draft and applies the critic's revisions to produce the final output. It runs only once the critic score meets the threshold; otherwise a conditional edge after the critic routes back to the drafter, forming a revision loop.

The orchestrator (supervisor) node sits above these four agents and routes state based on the current phase. It is a lightweight LLM call or, better, a deterministic rule-based router that reads a phase field from the state.
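A deterministic router of this kind can be a few lines of ordinary Python. In the sketch below, the phase values, the error field, and the error_handler node name are assumptions made for illustration.

```python
# Maps the phase recorded in state to the node that should run next.
PHASE_TO_NODE = {
    "research": "researcher",
    "draft": "drafter",
    "critique": "critic",
    "execute": "executor",
}

def route_by_phase(state: dict) -> str:
    """Rule-based supervisor: no LLM call, just a lookup on the phase field."""
    if state.get("error"):
        return "error_handler"
    # Unknown or missing phases go to the error handler instead of guessing.
    return PHASE_TO_NODE.get(state.get("phase", ""), "error_handler")

next_node = route_by_phase({"phase": "draft"})
```

A function like this could serve as the routing callable on a conditional edge; because it is deterministic, routing decisions are reproducible and cost no tokens.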

Builder perspective

"We replaced a single 80k-token context prompt with a four-agent pipeline — researcher, drafter, legal critic, and executor. Accuracy on compliance checks went from 67% to 91%, and cost per document dropped by 38% because the researcher and critic nodes use a smaller model. The orchestration overhead is negligible."

— A Verified Builder · London, UK

Orchestration patterns: sequential, parallel, hierarchical

LangGraph supports three orchestration topologies, and the right choice depends on task structure:

| Pattern | Structure | Best for | Example use case |
|---|---|---|---|
| Sequential | Linear chain of nodes | Dependent steps where each stage needs the prior output | Research → Draft → Review → Send |
| Parallel | Fan-out edges + join node | Independent sub-tasks that can run concurrently | Validate three contract clauses simultaneously |
| Hierarchical | Supervisor + specialist subgraphs | Complex workflows with multiple specialist domains | Customer support triage routing to billing, tech, and compliance subagents |

The hierarchical pattern is the most powerful and the most operationally demanding. A supervisor node receives the incoming task and routes it to one of several subgraph agents, each of which is itself a StateGraph compiled with its own nodes, edges, and tool set. The supervisor also handles error recovery: if a subagent returns an error state, the supervisor can retry with a different agent, escalate to a human-in-the-loop checkpoint, or fail gracefully.

For an Indian customer-support automation deployment at scale, a hierarchical orchestrator might route incoming queries to a billing subagent (with access to payment APIs), a technical subagent (with access to the product knowledge base), and a compliance subagent (with access to DPDP and consumer protection policy documents) — all within a single LangGraph application.
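The supervisor's error-recovery options (retry with a different agent, then escalate) can be sketched in plain Python. The agents here are hypothetical callables that return a dict with either an output or an error; in a real deployment each would be a compiled subgraph.

```python
from typing import Callable

Agent = Callable[[str], dict]

def supervise(task: str, agents: list[Agent]) -> dict:
    """Try each agent in order; escalate to a human if every one reports an error."""
    errors: list[str] = []
    for agent in agents:
        result = agent(task)
        if result.get("error") is None:
            return {"status": "ok", "output": result["output"], "errors": errors}
        errors.append(result["error"])  # keep the trail for the human reviewer
    return {"status": "needs_human", "output": None, "errors": errors}

def billing_agent(task: str) -> dict:
    return {"error": "payment API unavailable"}  # simulated failure

def fallback_agent(task: str) -> dict:
    return {"output": f"Handled: {task}", "error": None}

recovered = supervise("refund query", [billing_agent, fallback_agent])
escalated = supervise("refund query", [billing_agent])
```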

Human-in-the-loop checkpoints

Any production multi-agent system handling consequential decisions — loan approvals, legal document filing, code deployment — needs human-in-the-loop gates. LangGraph v0.4's interrupt/resume pattern is covered in detail in our dedicated HITL guide; here we note the multi-agent-specific consideration.

In a hierarchical pipeline, interrupts can originate from any subagent, not just the top-level graph. The interrupt bubbles up from the subagent through the supervisor, which surfaces it to the calling application. When the human responds, the application resumes the graph with Command(resume=...), and the response is routed back down to the originating subagent. LangGraph v0.4's multi-interrupt resume, which resolves multiple parallel interrupts in a single Command call, is essential for pipelines where parallel branches each require human approval before the join node can proceed.

UK legal-document workflows benefit here: a parallel-validation pipeline might run three specialist critics simultaneously, each raising an interrupt if they find a material compliance issue. The supervising lawyer reviews all three flagged issues in a single interface, approves or amends, and the pipeline resumes from all three branches simultaneously.
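The bookkeeping behind "review all flagged issues at once, then resume every branch" can be sketched independently of the framework. The functions and field names below are hypothetical and do not reproduce LangGraph's interrupt or Command API; they only show the shape of the data a review interface would handle.

```python
def collect_interrupts(critic_results: list[dict]) -> dict[str, str]:
    """Gather every pending issue from the parallel branches into one review payload."""
    return {r["branch"]: r["issue"] for r in critic_results if r.get("issue")}

def resume_all(pending: dict[str, str], decisions: dict[str, str]) -> dict[str, dict]:
    """Apply one batch of human decisions; refuse to resume if any branch is unanswered."""
    missing = set(pending) - set(decisions)
    if missing:
        raise ValueError(f"No decision for branches: {sorted(missing)}")
    return {b: {"issue": pending[b], "decision": decisions[b]} for b in pending}

critic_results = [
    {"branch": "liability", "issue": "Cap on liability is missing"},
    {"branch": "termination", "issue": None},  # nothing flagged on this branch
    {"branch": "data_protection", "issue": "No processor clause"},
]
pending = collect_interrupts(critic_results)
resumed = resume_all(pending, {"liability": "amend", "data_protection": "approve"})
```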

MCP integration: standard tool interfaces across agents

The Model Context Protocol (MCP) defines a standard interface for exposing tools, prompts, and resources to language models. In a multi-agent LangGraph system, MCP is the answer to a persistent integration problem: each agent needs tools, and writing bespoke integration code for every tool-agent combination does not scale.

With MCP, a tool server — a web-search service, a database, a code-execution sandbox, a vector retrieval endpoint — exposes a standard JSON-RPC interface. Any MCP-compliant client can connect to any MCP-compliant server without custom glue code. In LangGraph terms, you equip each agent's ToolNode with an MCP client, and that client discovers available tools from the server at initialisation time.

LangChain's load_mcp_tools() helper returns MCP tools as BaseTool instances compatible with LangGraph's ToolNode. This means a researcher agent and an executor agent running in the same pipeline can connect to the same MCP server — a company-internal knowledge base, say — without any shared code between them beyond the server address.
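Underneath the helper, tool discovery and invocation are ordinary JSON-RPC 2.0 messages. The sketch below builds a tools/list request and a tools/call request by hand; the tool name search_kb and its query argument are hypothetical, and a real client would also perform the protocol's initialisation handshake and send these over a transport.

```python
import json
from itertools import count
from typing import Optional

_request_ids = count(1)  # JSON-RPC requests carry a unique id

def jsonrpc_request(method: str, params: Optional[dict] = None) -> dict:
    message = {"jsonrpc": "2.0", "id": next(_request_ids), "method": method}
    if params is not None:
        message["params"] = params
    return message

# Discover which tools the server offers.
list_tools = jsonrpc_request("tools/list")

# Invoke one tool by name with structured arguments (hypothetical tool).
call_tool = jsonrpc_request(
    "tools/call",
    {"name": "search_kb", "arguments": {"query": "refund policy"}},
)

wire_payload = json.dumps(call_tool)  # what actually travels to the server
```

Because every agent speaks the same message format, the researcher and the executor need only the server address to share a tool.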

See our Claude Managed Agents guide for how Anthropic's managed agent runtime also exposes tools via MCP, making it straightforward to mix LangGraph-orchestrated agents with Claude-managed subagents in the same pipeline.

When multi-agent architectures outperform single-agent ones

Multi-agent pipelines with memory consistently outperform single-agent equivalents on complex, multi-step tasks that require recall of prior context — summarisation pipelines that must reference earlier findings, code-review workflows that accumulate per-PR heuristics, and customer support flows that need to remember a user's prior issue history. The gains are task-dependent: the more a task rewards continuity and specialisation, the greater the benefit.

These gains are generally attributed to two primary factors: specialist agents outperforming generalists on domain-specific sub-tasks (particularly critic and validation roles), and long-term memory retrieval compensating for context-limit degradation on complex, multi-document tasks. This aligns with what builders deploying agentic RAG have observed; see our coverage of hierarchical retrieval patterns for the retrieval-side details.

The caveat is that these gains apply to carefully designed pipelines, not naive multi-agent wrapping. Poorly designed orchestration — where agents duplicate work, state accumulates noise, or routing logic is ambiguous — can actually degrade performance versus a well-prompted single agent. The architecture matters as much as the framework.

Production considerations

Shipping a multi-agent LangGraph pipeline to production requires planning for three operational realities that do not surface in development:

State serialisation

LangGraph serialises State objects for checkpoint persistence. Any field holding an object the checkpointer's serialiser cannot handle, such as a database connection, an open file handle, or an arbitrary custom class, can cause a serialisation error or data loss at checkpoint time. Audit your state schema before production deployment: the safest fields are primitive types, lists, and dictionaries of primitives. For domain objects that must live in state, convert them to plain dictionaries at the node boundary or configure a custom serialiser on the checkpointer.
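One conservative way to run that audit is to check each field against the standard JSON encoder. This is a sketch, and it is stricter than LangGraph's own serialiser may be, so treat a flagged field as something to inspect, not as a guaranteed failure.

```python
import json

def non_serialisable_fields(state: dict) -> list[str]:
    """Return the names of state fields the standard JSON encoder rejects."""
    bad = []
    for key, value in state.items():
        try:
            json.dumps(value)
        except (TypeError, ValueError):
            bad.append(key)
    return bad

sample_state = {
    "task": "review contract",
    "critique": {"score": 0.9, "notes": "ok"},
    "seen_ids": {1, 2, 3},    # sets are not JSON-serialisable
    "db_handle": object(),    # stands in for a live connection
}
problems = non_serialisable_fields(sample_state)
```

Running a check like this in a unit test against a representative state catches problems before the first production checkpoint.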

Timeout handling

Each agent hop has a latency profile. A four-agent sequential pipeline where each node calls a hosted LLM can accumulate 10–30 seconds of wall-clock time per run. In parallel pipelines, the bottleneck is the slowest branch. Set per-node timeouts explicitly using RunnableConfig rather than relying on the LLM client's default timeout — LLM client defaults are often too generous for production SLAs. Implement retry logic at the supervisor level, not inside agent nodes, so retry state is visible in the checkpoint history.
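Whatever timeout settings the framework or LLM client exposes, an explicit deadline can also be enforced in plain asyncio. The sketch below wraps a node coroutine in asyncio.wait_for and keeps the retry loop, and its attempt log, at the supervisor level; the function names and the error-dict convention are assumptions for illustration.

```python
import asyncio

async def call_with_timeout(node, state: dict, timeout_s: float) -> dict:
    """Run one node with a hard deadline; report a timeout as an error update."""
    try:
        return await asyncio.wait_for(node(state), timeout=timeout_s)
    except asyncio.TimeoutError:
        return {"error": f"{node.__name__} timed out after {timeout_s}s"}

async def supervisor_retry(node, state: dict, timeout_s: float, max_attempts: int = 2) -> dict:
    """Retry at the supervisor level so every attempt is recorded in the result."""
    attempts = []
    for attempt in range(1, max_attempts + 1):
        result = await call_with_timeout(node, state, timeout_s)
        attempts.append({"attempt": attempt, "error": result.get("error")})
        if "error" not in result:
            return {**result, "attempts": attempts}
    return {"error": "all attempts timed out", "attempts": attempts}

async def slow_node(state: dict) -> dict:
    await asyncio.sleep(0.2)  # simulates a hosted LLM call that overruns
    return {"draft": "late"}

async def fast_node(state: dict) -> dict:
    return {"draft": "ok"}

slow_result = asyncio.run(supervisor_retry(slow_node, {}, timeout_s=0.01))
fast_result = asyncio.run(supervisor_retry(fast_node, {}, timeout_s=0.5))
```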

Cost per agent hop

Every LLM call in a multi-agent pipeline is a cost event. A four-node pipeline calling GPT-4o at each step can cost up to roughly 4x the token cost of a single-agent call for the same task if every node receives a comparable amount of context, plus the Postgres or Redis checkpoint overhead. Mitigation strategies used by production builders:

  • Use a smaller, cheaper model (GPT-4o-mini, Haiku) for routing decisions and critic rubric evaluation; reserve the larger model for the drafter and executor roles.
  • Cache frequent vector retrievals in Redis with a 15-minute TTL. Most enterprise workflows retrieve the same domain documents repeatedly across concurrent runs.
  • Set a token budget enforced at the supervisor level: if accumulated token spend for a run exceeds a threshold, the supervisor routes to a "summarise and stop" node rather than continuing to the next agent.
  • Run independent agent calls concurrently where the framework supports it. LangGraph's parallel branches execute concurrently; ensure your LLM provider client is configured for concurrent requests.
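The token-budget guard from the list above is straightforward to express as a routing function. This is a minimal sketch: the tokens_used field, the record_usage helper, and the summarise_and_stop node name are assumptions, and real token counts would come from the LLM client's usage metadata.

```python
def record_usage(state: dict, tokens: int) -> dict:
    """Partial update a node returns after an LLM call."""
    return {"tokens_used": state.get("tokens_used", 0) + tokens}

def route_with_budget(state: dict, next_node: str, budget_tokens: int) -> str:
    """Supervisor check: stop early once the run has spent its token budget."""
    if state.get("tokens_used", 0) >= budget_tokens:
        return "summarise_and_stop"
    return next_node

state = {"tokens_used": 0}
state.update(record_usage(state, 6_000))   # researcher call
first_hop = route_with_budget(state, "drafter", budget_tokens=10_000)
state.update(record_usage(state, 5_000))   # drafter call pushes the run over budget
second_hop = route_with_budget(state, "critic", budget_tokens=10_000)
```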

Code example: StateGraph with short- and long-term memory

The following example wires together a four-agent pipeline with both in-state short-term memory (via the State object and a Postgres checkpointer) and external long-term memory (via a Redis-backed user-preferences store and a pgvector retrieval tool).

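import json
import operator
from typing import Annotated, TypedDict

import redis.asyncio as aioredis
from langchain_core.messages import BaseMessage
from langgraph.checkpoint.postgres.aio import AsyncPostgresSaver
from langgraph.graph import StateGraph, END

# Assumed to be defined elsewhere in your application (not shown here):
#   retrieval_tool: a LangChain tool wrapping a pgvector similarity search
#   research_llm, draft_llm, critic_llm, executor_llm: chat model instances
#   DATABASE_URL: a Postgres connection string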

# ── State schema ────────────────────────────────────────────────────────────
class AgentState(TypedDict):
    task: str
    user_id: str
    user_prefs: dict            # populated by memory loader; short-term within run
    research_summary: str       # researcher agent output
    draft: str                  # drafter agent output
    critique: dict              # critic agent output: {"score": float, "notes": str}
    final_output: str           # executor agent output
    messages: Annotated[list[BaseMessage], operator.add]  # accumulates across nodes
    iteration: int              # tracks revision loops

# ── Memory loader ────────────────────────────────────────────────────────────
redis_client = aioredis.from_url("redis://localhost:6379")

async def memory_loader_node(state: AgentState) -> dict:
    """Fetch long-term user preferences from Redis before specialist agents run."""
    uid = state["user_id"]
    prefs_raw = await redis_client.hgetall(f"user:{uid}:prefs")
    prefs = {k.decode(): v.decode() for k, v in prefs_raw.items()} if prefs_raw else {}
    return {"user_prefs": prefs, "iteration": 0}

# ── Researcher agent ─────────────────────────────────────────────────────────
async def researcher_node(state: AgentState) -> dict:
    """Retrieve relevant context from pgvector and summarise."""
    # retrieval_tool is a LangChain tool wrapping a pgvector similarity search
    context_docs = await retrieval_tool.ainvoke(state["task"])
    summary = await research_llm.ainvoke(
        f"Summarise for task: {state['task']}\n\nContext:\n{context_docs}"
    )
    return {"research_summary": summary.content}

# ── Drafter agent ────────────────────────────────────────────────────────────
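async def drafter_node(state: AgentState) -> dict:
    """Produce a draft using research summary, user preferences, and any critic notes."""
    # On a revision pass the previous critique is in state; on the first pass it is absent.
    notes = (state.get("critique") or {}).get("notes", "")
    prompt = (
        f"Task: {state['task']}\n"
        f"Research: {state['research_summary']}\n"
        f"User preferences: {state['user_prefs']}\n"
        f"Critic notes from the previous pass (may be empty): {notes}\n"
        "Produce a complete draft."
    )
    draft = await draft_llm.ainvoke(prompt)
    # Count each drafting pass so route_after_critic can cap the revision loop.
    return {"draft": draft.content, "iteration": state["iteration"] + 1}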

# ── Critic agent ─────────────────────────────────────────────────────────────
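async def critic_node(state: AgentState) -> dict:
    """Score draft and provide revision notes."""
    evaluation = await critic_llm.ainvoke(
        "Evaluate this draft for quality and accuracy. Reply with JSON only, in the form "
        '{"score": <float between 0 and 1>, "notes": "<revision notes>"}.\n'
        f"Draft: {state['draft']}"
    )
    # Parse the structured score; fall back to a neutral score if the reply
    # is not a JSON object containing a "score" field.
    import json
    try:
        critique = json.loads(evaluation.content)
        if not isinstance(critique, dict) or "score" not in critique:
            raise ValueError("critic reply has no score")
    except (ValueError, TypeError):
        critique = {"score": 0.5, "notes": str(evaluation.content)}
    critique.setdefault("notes", "")  # the executor reads critique["notes"]
    return {"critique": critique}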

# ── Executor agent ───────────────────────────────────────────────────────────
async def executor_node(state: AgentState) -> dict:
    """Apply critic notes and produce final output."""
    prompt = (
        f"Original draft: {state['draft']}\n"
        f"Critic notes: {state['critique']['notes']}\n"
        "Produce the final revised version."
    )
    final = await executor_llm.ainvoke(prompt)
    return {"final_output": final.content}

# ── Routing logic ─────────────────────────────────────────────────────────────
def route_after_critic(state: AgentState) -> str:
    if state["critique"]["score"] >= 0.85 or state["iteration"] >= 2:
        return "executor"
    return "drafter"  # revision loop

# ── Graph assembly ────────────────────────────────────────────────────────────
builder = StateGraph(AgentState)
builder.add_node("memory_loader", memory_loader_node)
builder.add_node("researcher",    researcher_node)
builder.add_node("drafter",       drafter_node)
builder.add_node("critic",        critic_node)
builder.add_node("executor",      executor_node)

builder.set_entry_point("memory_loader")
builder.add_edge("memory_loader", "researcher")
builder.add_edge("researcher",    "drafter")
builder.add_edge("drafter",       "critic")
builder.add_conditional_edges("critic", route_after_critic, {
    "executor": "executor",
    "drafter":  "drafter",
})
builder.add_edge("executor", END)

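# ── Compile with Postgres checkpointer (short-term persistent memory) ─────────
# from_conn_string() is an async context manager, so compile and invoke inside it.
async def run_pipeline(task: str, user_id: str, thread_id: str) -> dict:
    async with AsyncPostgresSaver.from_conn_string(DATABASE_URL) as checkpointer:
        await checkpointer.setup()  # creates the checkpoint tables on first use
        graph = builder.compile(checkpointer=checkpointer)
        config = {"configurable": {"thread_id": thread_id}}
        return await graph.ainvoke(
            {"task": task, "user_id": user_id, "messages": []}, config
        )

Pro tip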

The revision loop via route_after_critic is capped by the iteration >= 2 check to prevent infinite cycles; the cap only takes effect if the iteration counter is incremented on every drafting pass. In production, always set a maximum iteration count and log when the bound is hit: a draft that repeatedly fails the critic is a signal that the task is ambiguous or the critic rubric needs calibration, not that more iterations will help.
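The bound is easy to verify in isolation. The sketch below reuses the routing rule from the listing and simulates the drafter and critic with plain dictionaries; it assumes the drafter increments iteration on each pass, and simulate is a hypothetical helper written for this check.

```python
def route_after_critic(state: dict) -> str:
    if state["critique"]["score"] >= 0.85 or state["iteration"] >= 2:
        return "executor"
    return "drafter"  # revision loop

def simulate(scores: list[float]) -> int:
    """Walk drafter -> critic until routing reaches the executor; return drafts made."""
    state = {"iteration": 0}
    remaining = iter(scores)
    while True:
        state["iteration"] += 1                         # the drafter bumps the counter
        state["critique"] = {"score": next(remaining)}  # the critic scores the draft
        if route_after_critic(state) == "executor":
            return state["iteration"]

drafts_when_always_failing = simulate([0.2, 0.3, 0.4])
drafts_when_first_passes = simulate([0.9])
```

Even when every draft scores below the threshold, the run reaches the executor after the second draft.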

Architecture diagram: orchestrator to specialist agents

The flow for a hierarchical multi-agent deployment looks like this:

Incoming task
      │
      ▼
Memory Loader ──── Redis (user prefs, long-term KV)
      │             pgvector (episodic summaries)
      ▼
  Supervisor / Orchestrator
      │
      ├──────────────────────────────────────┐
      ▼                                      ▼
Researcher Agent                   (parallel branch 2)
  └── pgvector retrieval tool         Additional specialist
  └── Web search MCP tool
      │
      ▼
Drafter Agent
  └── Generation LLM
      │
      ▼
Critic Agent
  └── Rubric evaluation LLM
  └── [HITL checkpoint if score < threshold]
      │
      ▼ (conditional: pass → executor / fail → drafter)
Executor Agent
  └── Finalise + write-back to long-term store
      │
      ▼
     END

The Memory Loader and Executor nodes bookend the pipeline: Memory Loader hydrates state from long-term stores before specialist agents run; Executor writes the completed output back to the long-term store (e.g. saving an episode summary to pgvector) so future runs can retrieve this interaction. This write-back step is the mechanism that makes memory accumulate over time rather than remaining static.