The memory problem: why agents forget and hallucinate
Picture a support agent that has been helping a customer through a complex insurance claim for forty minutes. Early in the conversation, the customer mentioned a policy number, a claim date, and a preferred callback time. Twenty turns later, the agent confidently asks for the policy number again. The model has not malfunctioned — it simply cannot see that far back. The context window filled with reasoning steps, tool outputs, and clarifying questions, and the original details fell off the edge.
This is the central memory problem in production agents, and it has a less obvious sibling: hallucination by confabulation. When an agent loses access to earlier context, it does not fail loudly — it fills the gap with plausible-sounding content generated from its training weights. It might produce a policy number that looks correct but is not, or confidently assert a fact it retrieved two sessions ago but has since been unable to re-retrieve. The failure mode is worse than a clean error because it is invisible without explicit verification.
Long-horizon tasks compound both problems. A background agent running a nightly compliance check, a multi-step coding agent working across a large codebase, a research agent that has been accumulating evidence over hours — all of these eventually exhaust naive context management. The agent that started with clear intent ends in a confused state, partly because of what it remembers and partly because of what it incorrectly thinks it remembers.
The most dangerous memory failure is not a hard crash — it is silent confabulation. When the context window fills and the agent loses earlier facts, it does not stop and say so. It continues confidently, filling gaps from training data. Instrument your agents with explicit memory boundaries and log what was summarised or evicted; otherwise confabulation looks identical to correct recall until a human checks the output.
The solution is not a bigger context window, though that helps at the margins. The solution is deliberate memory architecture: deciding what to keep in the window, what to externalise, and what to retrieve on demand. The cognitive science framing borrowed from human memory — working, episodic, semantic, and procedural — turns out to map cleanly onto agent implementation choices, and it gives a vocabulary for reasoning about tradeoffs that pure engineering terminology struggles to provide.
Four types of agent memory
The taxonomy is useful precisely because each type corresponds to a different storage location, retrieval mechanism, and cost profile. Understanding all four — and when each one applies — is the difference between an agent that degrades gracefully under long-horizon pressure and one that quietly confabulates its way to a wrong answer.
| Memory type | What it stores | Where it lives | Access pattern | Latency |
|---|---|---|---|---|
| Working memory | The live conversation: messages, tool outputs, intermediate reasoning | In the model's context window (in-process) | Automatic — everything in the window is visible; managed by compression and eviction | Zero (already in context) |
| Episodic memory | Past events: previous turns, prior sessions, decisions made, outcomes seen | External vector store or relational DB (Supabase, Redis, Postgres) | Explicit retrieval by recency or semantic similarity; injected into the prompt | 10–100 ms per retrieval |
| Semantic memory | Factual knowledge: domain facts, product data, policies, reference content | Vector database (Supabase pgvector, Pinecone, Weaviate) | Semantic search at query time; top-k chunks injected into prompt | 20–200 ms per query |
| Procedural memory | How to do things: available tools, prompt templates as skills, sub-agent patterns | Tool registry (Python dict, MCP server, prompt library) | Deterministic lookup by name; loaded at agent initialisation or on demand | Sub-millisecond (in-process) |
A production agent rarely uses just one type. A conversational assistant will lean heavily on working and episodic memory. A knowledge-intensive research agent will depend on semantic memory. A workflow orchestrator running over days will need all four. The art is knowing which combination fits your agent's task profile and where to invest engineering effort first.
Working memory: managing the context window
Working memory is the context window, and every message you add is a token cost. The naive approach of keeping the full conversation history forever breaks in two ways. First, the window eventually fills: depending on the provider and framework, the request is either rejected outright or the oldest messages are silently truncated before the call. Second, the growing context adds cost and latency to every turn, even when most of it is no longer relevant to the current step.
The concept of a context budget reframes this usefully. Instead of thinking about the context window as a passive container, treat it as a fixed resource that must be actively managed. Decide at the start of an agent session what fraction of the window you want to allocate to conversation history, tool outputs, system prompt, and injected memory from external stores. When history exceeds its budget, compress it. When tool outputs are large, summarise them before appending. When retrieved knowledge grows stale mid-task, evict and re-retrieve.
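One way to make the budget explicit is to write the allocation down as data and check usage against it each turn. The sketch below is illustrative only: the `DEFAULT_SPLIT` percentages, the function names, and the 200,000-token window are assumptions chosen for the example, not figures from any provider.

```python
# Hypothetical allocation, in percent of the context window.
# The unassigned 10% is headroom for the model's reply.
DEFAULT_SPLIT = {
    "system_prompt": 10,
    "history": 40,
    "tool_outputs": 25,
    "injected_memory": 15,
}


def part_limit(window_tokens: int, part: str, split: dict = DEFAULT_SPLIT) -> int:
    """Token ceiling for one named part of the window."""
    return window_tokens * split[part] // 100


def parts_over_budget(window_tokens: int, usage: dict, split: dict = DEFAULT_SPLIT) -> list:
    """Names of the parts whose current token usage exceeds their allocation."""
    return [
        part for part, used in usage.items()
        if used > part_limit(window_tokens, part, split)
    ]


usage = {"system_prompt": 3_000, "history": 95_000, "tool_outputs": 10_000}
print(parts_over_budget(200_000, usage))  # → ['history']
```

Any part that shows up in the result is a candidate for compression, summarisation, or eviction before the next model call.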
The most reliable compression technique is rolling summarisation: after each assistant turn, count the current token usage; when it crosses the budget threshold, call the model once more to compress the oldest segment of the history into a concise summary, then replace those messages with the summary. The conversation looks shorter but retains the key decisions and facts.
```python
import anthropic

client = anthropic.Anthropic()

CONTEXT_BUDGET_TOKENS = 60_000  # leave headroom for the system prompt and new turns
SUMMARISE_AT = 0.75             # compress when 75% of budget is used
COMPRESS_OLDER_THAN = 10        # keep the 10 most recent messages verbatim; summarise the rest


def estimate_tokens(messages: list[dict]) -> int:
    """
    Rough token estimate: 4 chars ≈ 1 token.
    For accuracy, use your provider's token counter (for Claude, the SDK's
    client.messages.count_tokens); tiktoken targets OpenAI models.
    """
    return sum(len(str(m.get("content", ""))) for m in messages) // 4


def rolling_summarise(messages: list[dict], system_prompt: str) -> list[dict]:
    """
    Compress the oldest messages into a single summary when the context budget is near.
    Returns a new message list: [summary_message] + messages[-COMPRESS_OLDER_THAN:]
    """
    if len(messages) <= COMPRESS_OLDER_THAN:
        return messages

    older = messages[:-COMPRESS_OLDER_THAN]
    recent = messages[-COMPRESS_OLDER_THAN:]

    # Render the older segment as a readable transcript rather than a Python repr.
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in older)

    # Ask a cheaper model to compress the older segment; Haiku suits this task.
    summary_response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=512,
        system=(
            "You are a conversation summariser. Produce a concise factual summary "
            "of the conversation below, preserving all decisions made, key facts stated, "
            "and any unresolved questions. Write in third person, past tense."
        ),
        messages=[
            {
                "role": "user",
                "content": f"Summarise this conversation segment:\n\n{transcript}",
            }
        ],
    )
    summary_text = summary_response.content[0].text

    summary_message = {
        "role": "user",
        "content": f"[Conversation summary: {len(older)} earlier messages compressed]\n{summary_text}",
    }
    return [summary_message] + recent


def agent_turn(messages: list[dict], user_input: str, system_prompt: str) -> tuple[str, list[dict]]:
    """
    Single agent turn with rolling summarisation.
    Returns (assistant_reply, updated_messages).
    """
    messages = messages + [{"role": "user", "content": user_input}]

    # Check the budget before calling the main model.
    if estimate_tokens(messages) > CONTEXT_BUDGET_TOKENS * SUMMARISE_AT:
        messages = rolling_summarise(messages, system_prompt)

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=system_prompt,
        messages=messages,
    )
    assistant_reply = response.content[0].text
    messages = messages + [{"role": "assistant", "content": assistant_reply}]
    return assistant_reply, messages
```
A few notes on this pattern. Using claude-haiku-4-5-20251001 for summarisation is intentional — it is faster and cheaper than the main model, and summarisation is a well-scoped task that does not need frontier capability. The main conversation runs on claude-sonnet-4-6. Keep the most recent turns verbatim and only summarise the older segment; the model needs the immediate context intact to reason coherently about the current step. Finally, log every summarisation event — you will want to audit what was compressed when debugging a confabulation.
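The logging advice can be made concrete with a small audit record written at each compression. This is a minimal sketch: the `summarisation_record` and `log_summarisation` helpers, the logger name, and the field names are all hypothetical, and a real system would likely archive the raw messages alongside the digest.

```python
import hashlib
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("agent.memory")  # logger name is an assumption


def summarisation_record(older: list, summary_text: str) -> dict:
    """Build an audit record for one compression event (field names are illustrative)."""
    raw = json.dumps(older, sort_keys=True, default=str)
    return {
        "event": "summarisation",
        "at": datetime.now(timezone.utc).isoformat(),
        "messages_compressed": len(older),
        "chars_before": len(raw),
        "chars_after": len(summary_text),
        # The digest lets you match this record to archived raw messages later.
        "source_sha256": hashlib.sha256(raw.encode("utf-8")).hexdigest(),
        "summary": summary_text,
    }


def log_summarisation(older: list, summary_text: str) -> dict:
    """Emit the record as one JSON log line and return it for further use."""
    record = summarisation_record(older, summary_text)
    logger.info(json.dumps(record))
    return record
```

Calling `log_summarisation(older, summary_text)` just before `rolling_summarise` returns would give you one log line per compression to inspect when an agent later misremembers something.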
Working memory is cheap but lossy. Summarisation preserves meaning but discards exact wording, timestamps, and the texture of earlier exchanges. For anything that might be referenced precisely later — a quoted policy clause, a specific number, a decision with a rationale — write it to episodic memory before it risks being compressed.
Episodic memory: external stores and retrieval
Episodic memory solves the durability problem that working memory cannot: facts and events that need to survive across turns, sessions, or even agent restarts. The implementation pattern is straightforward — write important events to an external store as they happen, retrieve relevant ones at the start of each turn and inject them into the prompt.
The storage substrate is typically a vector database, because retrieval by semantic similarity ("what have I learned about this customer's billing preferences?") is more useful than retrieval by exact key in most agent scenarios. Supabase with pgvector is the natural choice for teams already running Postgres — you get vector similarity search alongside your relational tables, no additional service to manage, and availability in both AWS Mumbai (ap-south-1) and London (eu-west-2) regions. This is especially practical for Indian and UK teams who need data residency compliance alongside agent infrastructure in the same cloud region.
A well-designed episodic memory store combines two retrieval axes: recency (what happened most recently?) and relevance (what is semantically similar to the current query?). Recency alone misses important but older facts; relevance alone may surface a memory from a very different context that happens to share vocabulary. A weighted combination of both — or a simple recency-filtered semantic search — works well for most agent use cases.
```python
import os
from datetime import datetime, timezone

import anthropic
import openai  # swap for any embedding provider
from supabase import create_client, Client

client = anthropic.Anthropic()

# Supabase setup: set SUPABASE_URL and SUPABASE_KEY in your environment
supabase: Client = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_KEY"])

# OpenAI embedding model with 1536 dimensions, matching vector(1536) in the SQL below.
# Anthropic does not provide an embeddings endpoint, so a separate provider is needed.
EMBEDDING_MODEL = "text-embedding-3-small"


def embed_text(text: str) -> list[float]:
    """Get embeddings for a text string. Use your preferred embedding provider."""
    response = openai.embeddings.create(model=EMBEDDING_MODEL, input=text)
    return response.data[0].embedding


def write_memory(agent_id: str, content: str, memory_type: str = "episodic") -> None:
    """
    Write an event or fact to the episodic memory store.
    Call this after each significant agent action: decisions made, facts confirmed,
    user preferences stated, errors encountered.
    """
    embedding = embed_text(content)
    supabase.table("agent_memories").insert({
        "agent_id": agent_id,
        "content": content,
        "memory_type": memory_type,
        "embedding": embedding,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }).execute()


def retrieve_memories(
    agent_id: str,
    query: str,
    top_k: int = 5,
    recency_weight: float = 0.3,
) -> list[dict]:
    """
    Retrieve the most relevant episodic memories for the current query.
    Uses pgvector cosine similarity in Supabase.
    The recency_weight blends semantic score with recency rank; adjust to taste.
    """
    query_embedding = embed_text(query)

    # Supabase RPC call to a stored procedure that does the similarity search.
    # The SQL for agent_memory_search is shown in the comment below.
    result = supabase.rpc(
        "agent_memory_search",
        {
            "query_embedding": query_embedding,
            "agent_id_filter": agent_id,
            "match_count": top_k * 3,  # over-fetch, then re-rank by blended score
        },
    ).execute()
    memories = result.data or []

    # Assign a recency rank to each memory (0 = most recent).
    for i, mem in enumerate(sorted(memories, key=lambda m: m["created_at"], reverse=True)):
        mem["recency_rank"] = i

    # Blend the similarity score with the recency rank.
    for mem in memories:
        mem["blended_score"] = (
            (1 - recency_weight) * mem.get("similarity", 0)
            + recency_weight * (1 / (1 + mem["recency_rank"]))
        )

    memories.sort(key=lambda m: m["blended_score"], reverse=True)
    return memories[:top_k]


# SQL for the Supabase stored procedure (create once in your migration).
# Columns are qualified with the alias "m" because the RETURNS TABLE names
# (id, content, created_at) would otherwise be ambiguous inside plpgsql.
#
# CREATE OR REPLACE FUNCTION agent_memory_search(
#     query_embedding vector(1536),
#     agent_id_filter text,
#     match_count int DEFAULT 20
# )
# RETURNS TABLE (id uuid, content text, created_at timestamptz, similarity float)
# LANGUAGE plpgsql AS $$
# BEGIN
#     RETURN QUERY
#     SELECT m.id, m.content, m.created_at,
#            1 - (m.embedding <=> query_embedding) AS similarity
#     FROM agent_memories AS m
#     WHERE m.agent_id = agent_id_filter
#     ORDER BY m.embedding <=> query_embedding
#     LIMIT match_count;
# END; $$;
```
Episodic memory adds latency — typically 20–80 ms per retrieval call — but this is usually acceptable given what it buys: persistent context across sessions and the ability to reference decisions made hours or days ago without keeping the entire history in the context window. The key discipline is writing selectively. Do not write every assistant turn to episodic memory; write the events that matter — decisions, stated preferences, confirmed facts, encountered errors. Noisy memory stores degrade retrieval quality just as much as noisy context windows.
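Selective writing can be enforced with a small gate in front of the store. The sketch below is one possible shape, not a prescribed one: the `AgentEvent` class, the kind labels, and the `min_chars` threshold are hypothetical, with the kinds loosely mirroring the list above (decisions, preferences, confirmed facts, errors).

```python
from dataclasses import dataclass

# Event kinds worth persisting; the label strings are assumptions.
PERSISTED_KINDS = {"decision", "preference", "confirmed_fact", "error"}


@dataclass
class AgentEvent:
    kind: str
    content: str


def should_persist(event: AgentEvent, min_chars: int = 12) -> bool:
    """Keep only significant, non-trivial events."""
    return event.kind in PERSISTED_KINDS and len(event.content.strip()) >= min_chars


def select_for_memory(events: list) -> list:
    """Filter events and drop repeats (compared by normalised content) before writing."""
    seen = set()
    kept = []
    for event in events:
        key = " ".join(event.content.lower().split())
        if should_persist(event) and key not in seen:
            seen.add(key)
            kept.append(event)
    return kept
```

Only the events returned by `select_for_memory` would then be passed to `write_memory`, which keeps small talk and repeated statements out of the store.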
Semantic memory: vector databases and knowledge graphs
Semantic memory is the agent's knowledge of the world: domain facts, product catalogues, policy documents, reference data. Unlike episodic memory (what happened), semantic memory is about what is true — the kind of knowledge that does not change with each conversation but must be current and accurate when queried.
The implementation pattern is identical to retrieval-augmented generation (which we cover in depth in our production RAG guide): chunk the knowledge corpus, embed it, store it in a vector database, and retrieve the top-k most relevant chunks at query time. The choice of vector database has real tradeoffs in a production agent context:
| Option | Latency (p50) | Cost model | Managed vs self-host | Best fit |
|---|---|---|---|---|
| Supabase pgvector | 10–40 ms | Included with Postgres plan; no per-query fee | Fully managed; available in ap-south-1 (Mumbai) and eu-west-2 (London) | Teams already on Postgres; <1M vectors; data residency requirements |
| Pinecone | 5–20 ms | Per-vector-stored + per-query; serverless tier available | Fully managed; no self-hosting option | Scale to tens of millions of vectors; want zero infrastructure management |
| Weaviate | 15–60 ms | Self-host free; Weaviate Cloud priced by resource | Self-hostable (Docker/K8s) or managed cloud | Multi-modal data; schema-enforced knowledge graphs; compliance needs requiring on-prem |
For most Indian and UK teams building their first production agent, Supabase pgvector is the pragmatic starting point: it sits in the same Postgres instance as the rest of your data, region selection aligns with data localisation requirements (important for both the Digital Personal Data Protection Act in India and UK GDPR), and there is no separate service to provision or operate. When vector count grows past the comfortable pgvector range (roughly one to two million vectors with an HNSW index), Pinecone's managed scaling starts to pay for itself in operational simplicity.
Semantic memory introduces a cold-start cost that working and episodic memory do not: you must index the knowledge corpus before any agent can use it. For a team building a customer-support agent over a product catalogue, that means chunking, embedding, and indexing every document before launch, then maintaining a refresh pipeline to keep the index current as the catalogue changes. The cost is front-loaded, but the payoff is that every agent instance has access to the full knowledge base from turn one, without needing to store it in the context window.
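The chunking step of that indexing pipeline can be sketched in a few lines. This is a deliberately simple word-based splitter with overlap; the function name and the default sizes are assumptions, and production pipelines often split on headings or sentences instead.

```python
def chunk_text(text: str, max_words: int = 200, overlap_words: int = 40) -> list:
    """Split text into word-based chunks; consecutive chunks share overlap_words words."""
    if overlap_words >= max_words:
        raise ValueError("overlap_words must be smaller than max_words")
    words = text.split()
    step = max_words - overlap_words
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once a chunk has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
print(chunk_text(sample, max_words=4, overlap_words=1))
# → ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```

Each chunk would then be embedded and inserted into the vector store, and the same function can be re-run on changed documents as part of the refresh pipeline.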
Attach a last_updated timestamp to every chunk at index time and filter retrievals to exclude chunks older than your acceptable staleness threshold. An agent that confidently answers from a policy that changed last month is a compliance risk. The filter is a single SQL condition — WHERE last_updated > NOW() - INTERVAL '30 days' — and it is far cheaper than discovering stale facts after they have reached a user.
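If the filter cannot be pushed into the query, the same rule can be applied to retrieved chunks in application code. A minimal sketch, assuming each chunk is a dict carrying a timezone-aware `last_updated` datetime; the function name and the injectable `now` parameter are illustrative choices.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def fresh_chunks(chunks: list, max_age_days: int = 30, now: Optional[datetime] = None) -> list:
    """Drop chunks whose last_updated is older than the staleness threshold."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return [c for c in chunks if c["last_updated"] > cutoff]


now = datetime(2026, 6, 1, tzinfo=timezone.utc)
chunks = [
    {"text": "Current refund policy", "last_updated": datetime(2026, 5, 20, tzinfo=timezone.utc)},
    {"text": "Old refund policy", "last_updated": datetime(2026, 3, 1, tzinfo=timezone.utc)},
]
print([c["text"] for c in fresh_chunks(chunks, now=now)])  # → ['Current refund policy']
```

Passing `now` explicitly keeps the function deterministic in tests; in production the default of the current UTC time applies.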
Procedural memory: tool registries and skill caches
Procedural memory is the most overlooked of the four types, yet it is the one that most directly determines what an agent is capable of doing. Procedural memory stores the agent's knowledge of how to act: what tools are available, under what conditions to use them, and what the standard operating procedures are for common tasks.
The simplest implementation is a tool registry: a Python dictionary mapping tool names to their callable functions, input schemas, and descriptions. The model reads the descriptions to decide when to invoke a tool; the registry dispatches the call. This mirrors the shape of the Anthropic tool-use API, which accepts a name, description, and input schema for each tool and returns a tool_use block that your own code must dispatch. Keeping your registry structured and well-documented is free architecture documentation for the model as well as for your team.
```python
import json
from dataclasses import dataclass
from typing import Any, Callable

import anthropic


@dataclass
class Tool:
    name: str
    description: str
    input_schema: dict
    fn: Callable[..., Any]

    def to_api_spec(self) -> dict:
        """Convert to the format expected by the Anthropic messages API."""
        return {
            "name": self.name,
            "description": self.description,
            "input_schema": self.input_schema,
        }


class ToolRegistry:
    """
    Procedural memory: a registry of callable tools with their descriptions.
    Tools registered here are surfaced to the model as its available skills.
    Register new tools at startup; load domain-specific skills dynamically.
    """

    def __init__(self):
        self._tools: dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        self._tools[tool.name] = tool

    def api_specs(self) -> list[dict]:
        """Return tool specs in the format expected by the Anthropic API."""
        return [t.to_api_spec() for t in self._tools.values()]

    def dispatch(self, tool_name: str, tool_input: dict) -> Any:
        """Execute a tool by name. Raises KeyError if tool not found."""
        if tool_name not in self._tools:
            raise KeyError(f"Unknown tool: {tool_name!r}. Available: {list(self._tools)}")
        return self._tools[tool_name].fn(**tool_input)


# Example: build a registry for a customer-support agent
registry = ToolRegistry()

registry.register(Tool(
    name="lookup_order_status",
    description=(
        "Look up the current status of a customer order by order ID. "
        "Use when the customer asks about their order, delivery, or shipment."
    ),
    input_schema={
        "type": "object",
        "properties": {
            "order_id": {"type": "string", "description": "The order identifier, e.g. ORD-123456"}
        },
        "required": ["order_id"],
    },
    fn=lambda order_id: {"status": "dispatched", "eta": "2026-06-12"},  # replace with real impl
))

registry.register(Tool(
    name="escalate_to_human",
    description=(
        "Escalate the conversation to a human support agent. "
        "Use when the customer is distressed, when the issue requires account-level access, "
        "or when you have attempted two resolutions without success."
    ),
    input_schema={
        "type": "object",
        "properties": {
            "reason": {"type": "string", "description": "Brief reason for escalation"},
            "priority": {"type": "string", "enum": ["normal", "urgent"]},
        },
        "required": ["reason", "priority"],
    },
    fn=lambda reason, priority: {"ticket_id": "TKT-7890", "queued": True},  # replace with real impl
))

# Use the registry in an agent loop
agent_client = anthropic.Anthropic()


def run_agent(user_message: str) -> str:
    messages = [{"role": "user", "content": user_message}]
    response = agent_client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        tools=registry.api_specs(),
        messages=messages,
    )

    # Handle tool use in a simple loop: dispatch each requested tool,
    # send the results back, and repeat until the model stops asking for tools.
    while response.stop_reason == "tool_use":
        tool_uses = [b for b in response.content if b.type == "tool_use"]
        tool_results = []
        for tu in tool_uses:
            result = registry.dispatch(tu.name, tu.input)
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": tu.id,
                "content": json.dumps(result),
            })
        messages = messages + [
            {"role": "assistant", "content": response.content},
            {"role": "user", "content": tool_results},
        ]
        response = agent_client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            tools=registry.api_specs(),
            messages=messages,
        )

    # Join the text blocks of the final response.
    return "".join(b.text for b in response.content if b.type == "text")
```
The Model Context Protocol (MCP) takes the tool registry pattern one step further: instead of a Python dictionary scoped to a single process, tools are registered in a standard MCP server that any compatible agent can discover and call, regardless of language or framework. This is procedural memory made portable. If your agent needs to call tools hosted by other services — a payments provider, a CRM, an internal data platform — an MCP server lets you add those tools to any agent's registry without code changes to the agent itself. Our FastMCP tutorial covers the full implementation.
Treat tool descriptions as first-class documentation. The model's decision about when and how to call a tool depends entirely on your description. Be specific about preconditions ("use only when the customer has provided an order ID"), side effects ("this will charge the customer's card"), and failure modes ("returns null if the order ID is not found"). A well-described tool registry is the single cheapest reliability improvement available to a tool-calling agent.
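One way to keep descriptions consistent is to assemble them from named parts so that preconditions, side effects, and failure modes are never forgotten. The `describe_tool` helper below is hypothetical, as is the refund example; the resulting string is what you would pass as a tool's `description`.

```python
def describe_tool(purpose: str, preconditions=(), side_effects=(), failure_modes=()) -> str:
    """Assemble a tool description from its purpose and optional caveats."""
    sections = [purpose.strip()]
    if preconditions:
        sections.append("Use only when: " + "; ".join(preconditions) + ".")
    if side_effects:
        sections.append("Side effects: " + "; ".join(side_effects) + ".")
    if failure_modes:
        sections.append("Failure modes: " + "; ".join(failure_modes) + ".")
    return " ".join(sections)


description = describe_tool(
    "Refund an order.",
    preconditions=["the customer has provided an order ID"],
    side_effects=["this will credit the customer's card"],
    failure_modes=["returns null if the order ID is not found"],
)
print(description)
```

A helper like this also makes it easy to lint a registry, for example by flagging any tool that has side effects but lists no preconditions.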
Choosing the right pattern for your agent
The four memory types are not always used together, and the right combination depends on the agent's task profile, interaction pattern, and durability requirements. The framework below maps the four common agent archetypes — conversational chatbot, task agent, background worker, and multi-agent system — to their recommended memory architecture.
Before reaching for the full stack, ask three questions: How long does a single agent session last? Does context need to survive across sessions? Does the agent need external knowledge that changes independently of the conversation? The answers determine which memory types earn their implementation cost.
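The three questions translate directly into a checklist. The function below is a rough sketch rather than a rule: its name and parameters are hypothetical, and the 15-turn threshold is simply the figure used for chatbots in the table that follows.

```python
def memory_types_needed(
    session_turns: int,
    survives_sessions: bool,
    needs_external_knowledge: bool,
    summarise_after_turns: int = 15,
) -> dict:
    """Map the three scoping questions to the memory components worth building."""
    return {
        # Long sessions need active working-memory management.
        "rolling_summarisation": session_turns > summarise_after_turns,
        # Context that must outlive a session belongs in episodic memory.
        "episodic": survives_sessions,
        # Knowledge that changes independently of the conversation belongs in semantic memory.
        "semantic": needs_external_knowledge,
    }


print(memory_types_needed(session_turns=40, survives_sessions=True, needs_external_knowledge=False))
# → {'rolling_summarisation': True, 'episodic': True, 'semantic': False}
```

Procedural memory is left out of the checklist because any tool-calling agent needs at least a small registry regardless of the answers.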
| Agent type | Working memory | Episodic memory | Semantic memory | Procedural memory |
|---|---|---|---|---|
| Conversational chatbot (support, FAQ, booking) | Essential — rolling summarisation when sessions > 15 turns | High value — persist user preferences and prior issues across sessions | High value — retrieval over product docs, policies, FAQs | Moderate — lookup and escalation tools; simple registry |
| Task agent (coding, writing, analysis — single session) | Essential — compression required for long tasks; protect intermediate outputs | Low — task typically starts fresh; write a summary at completion | Moderate — project-specific knowledge if codebase or docs are large | High — rich tool registry (file read/write, search, code execution) |
| Background worker (scheduled jobs, event-driven pipelines) | Minimal — each invocation is short; no rolling summarisation needed | Essential — deduplication, state tracking across runs, audit log | High — current reference data (prices, policies, compliance rules) | Essential — stable tool registry between invocations; MCP for external services |
| Multi-agent system (orchestrator + sub-agents) | Per-agent — each agent manages its own window; orchestrator compresses sub-agent summaries | Shared store — sub-agents write events; orchestrator reads across all agents | Shared index — one semantic store queried by multiple agents | Federated — orchestrator exposes MCP server; sub-agents discover tools dynamically |
For a conversational chatbot the priority order is: episodic first (without it, every session starts from zero and users repeat themselves), then semantic (without it, the model answers from training weights rather than your current product data), then working memory management as session length grows. Procedural memory is usually simple enough that a handful of hard-coded tools suffices at launch.
For a background worker the order inverts: procedural and semantic are load-bearing from day one, because the agent must know what tools it has and must retrieve current data to do useful work. Episodic memory is essential for deduplication and state tracking — without it a nightly job that failed halfway through will re-process every item on the next run. Working memory is rarely the constraint because each invocation is short.
For multi-agent systems, shared episodic and semantic memory stores are the coordination mechanism. Sub-agents that cannot read each other's event history will duplicate work and contradict each other. Design the shared memory interface early — what events sub-agents write, what keys they use, and how the orchestrator aggregates — because retrofitting it is expensive. See our LangGraph state management guide and the agent observability guide for implementation detail on multi-agent coordination.
Common pitfalls and debug patterns
Even with all four memory types in place, agents fail in predictable ways. Knowing the failure signatures saves hours of debugging.
The stale episodic memory trap. Episodic stores accumulate without a TTL and eventually surface memories from contexts that are no longer relevant — a user preference from a cancelled account, a decision superseded by a later one. Add a TTL or relevance decay to your episodic store, and write a superseded flag to memories that have been explicitly overridden. Query-time filters on TTL and supersession status are cheaper than a large memory store returning noise.
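Both safeguards fit in a short filter. The sketch below works on in-memory dicts for clarity; the field names (`id`, `created_at`, `superseded`) and the 90-day default are assumptions, and in a real store the same conditions would sit in the query's WHERE clause.

```python
from datetime import datetime, timedelta, timezone


def active_memories(memories: list, now: datetime, ttl_days: int = 90) -> list:
    """Keep memories that are within the TTL and have not been overridden."""
    cutoff = now - timedelta(days=ttl_days)
    return [
        m for m in memories
        if not m.get("superseded", False) and m["created_at"] >= cutoff
    ]


def supersede(memories: list, old_id: str, replacement: dict) -> list:
    """Flag the overridden memory and append its replacement; returns a new list."""
    updated = [
        {**m, "superseded": True} if m["id"] == old_id else m
        for m in memories
    ]
    return updated + [replacement]
```

Marking a memory as superseded rather than deleting it preserves the audit trail while keeping it out of retrieval results.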
Symptom: agent contradicts itself between turns. The cause is almost always a broken working memory boundary: either the context was silently truncated somewhere between your agent loop and the model (by a framework, a middleware layer, or your own trimming code), or a summarisation step dropped a key commitment. Fix: add explicit logging of token counts at the start of every turn, and log what was compressed in every summarisation call. The discrepancy will be visible in the diff between what the agent said three turns ago and what the summary retained.
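A cheap automated version of that diff is to check whether precise tokens from the compressed messages survive in the summary. The sketch below is a heuristic under stated assumptions: the `PRECISE_TOKEN` pattern (any token containing a digit) and the `dropped_facts` helper are illustrative, and they will miss facts that contain no digits.

```python
import re

# Illustrative pattern: tokens containing a digit, e.g. "POL-88213", "2026-06-12", "17:30".
PRECISE_TOKEN = re.compile(r"\b[\w:/.-]*\d[\w:/.-]*\b")


def dropped_facts(older: list, summary_text: str) -> list:
    """Return digit-bearing tokens present in the compressed messages but absent from the summary."""
    summary_tokens = set(PRECISE_TOKEN.findall(summary_text))
    dropped = []
    for message in older:
        for token in PRECISE_TOKEN.findall(str(message.get("content", ""))):
            if token not in summary_tokens and token not in dropped:
                dropped.append(token)
    return dropped


older = [
    {"role": "user", "content": "My policy number is POL-88213 and the claim date is 2026-03-04."},
    {"role": "user", "content": "Call me back at 17:30."},
]
summary = "The customer gave policy POL-88213 and asked for a callback."
print(dropped_facts(older, summary))  # → ['2026-03-04', '17:30']
```

A non-empty result is a signal to either regenerate the summary or write the missing values to episodic memory before the original messages are discarded.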
Symptom: agent retrieves the right memories but still gives wrong answers. This is the hardest failure because the memory system appears to be working. The usual cause is that retrieved content was injected into the prompt but the model failed to attend to it — either because the injection was buried too deep in a long prompt or because it conflicted with stronger priors from training. Fix: inject retrieved memories close to the user message, not in the system prompt; use explicit framing ("Based on the following retrieved context, …"); and evaluate faithfulness with a golden set as you would for a RAG pipeline. The production RAG guide covers faithfulness evaluation tooling that applies directly to episodic and semantic retrieval.
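The placement and framing advice can be sketched as a small prompt-assembly helper. The `build_user_turn` function, the `<retrieved_context>` tag name, and the fallback instruction are assumptions made for illustration; the point is only that retrieved memories travel in the user turn, next to the question.

```python
def build_user_turn(user_input: str, memories: list) -> dict:
    """Build a user message that places retrieved memories directly before the question."""
    if not memories:
        return {"role": "user", "content": user_input}
    context = "\n".join(f"- {m}" for m in memories)
    content = (
        "Based on the following retrieved context, answer the question below. "
        "If the context does not contain the answer, say so.\n\n"
        f"<retrieved_context>\n{context}\n</retrieved_context>\n\n"
        f"Question: {user_input}"
    )
    return {"role": "user", "content": content}


turn = build_user_turn(
    "What time should we call the customer?",
    ["Customer prefers callbacks after 17:30", "Customer's policy number is POL-88213"],
)
print(turn["content"])
```

The returned dict can be appended to the message list in place of the raw user input, leaving the system prompt free for stable instructions.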
Symptom: background worker re-processes items it already handled. The episodic deduplication store is either not being written to on completion, or the query is not matching the stored ID format. Write a structured dedup record immediately on successful completion of each item, and verify the exact query that the next run will use against a recent dedup record in a unit test. For cost tradeoffs, see our LLM cost optimisation guide on caching strategies that compound well with memory architecture.
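The ID-format mismatch is easiest to avoid when writes and lookups go through the same normalisation function. The sketch below uses an in-memory stand-in for the dedup store; `dedup_key`, `DedupStore`, and `process_batch` are hypothetical names, and the normalisation rule (strip, uppercase, remove spaces) is only an example.

```python
def dedup_key(item_id: str) -> str:
    """Normalise an item ID so that writes and lookups always agree on format."""
    return item_id.strip().upper().replace(" ", "")


class DedupStore:
    """In-memory stand-in for an episodic dedup table."""

    def __init__(self):
        self._done = {}

    def mark_done(self, item_id: str, run_id: str) -> None:
        self._done[dedup_key(item_id)] = run_id

    def already_done(self, item_id: str) -> bool:
        return dedup_key(item_id) in self._done


def process_batch(items: list, store: DedupStore, run_id: str, handler) -> list:
    """Process unseen items; the dedup record is written only after the handler succeeds."""
    processed = []
    for item_id in items:
        if store.already_done(item_id):
            continue
        handler(item_id)
        store.mark_done(item_id, run_id)
        processed.append(item_id)
    return processed
```

A unit test that runs two batches with differently formatted copies of the same ID, and asserts the second run skips it, covers exactly the failure described above.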
Symptom: latency spikes on episodic or semantic retrieval. Two causes: index has grown large enough that query time degraded, or the embedding call is happening synchronously in the hot path. For the first, add an HNSW index to pgvector (CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops)); for the second, pre-embed queries where possible or run embedding and retrieval concurrently with the context management step. The skills gap in production AI engineering is most acute precisely here — builders who can tune retrieval latency under real load are genuinely rare.
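Moving the embedding call off the critical path can be as simple as running it concurrently with context management. The sketch below uses `asyncio.gather` with stand-in coroutines: `embed_query` and `compress_history` are placeholders that sleep instead of calling a real provider, so the names, return values, and timings are all assumptions.

```python
import asyncio


async def embed_query(query: str) -> list:
    """Stand-in for an async embedding call."""
    await asyncio.sleep(0.05)
    return [0.0, 0.0, 0.0]


async def compress_history(messages: list) -> list:
    """Stand-in for context management such as rolling summarisation."""
    await asyncio.sleep(0.05)
    return messages[-2:]


async def prepare_turn(query: str, messages: list) -> tuple:
    """Run embedding and context management concurrently rather than back to back."""
    embedding, trimmed = await asyncio.gather(
        embed_query(query),
        compress_history(messages),
    )
    return embedding, trimmed
```

With `asyncio.run(prepare_turn(query, messages))`, the two steps overlap, so the turn waits for the slower of the two rather than for their sum.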
Summary: the memory stack at a glance
Agent memory is not a single problem — it is four interlocking problems, each with a different implementation, a different cost profile, and a different failure mode. Working memory is cheap but ephemeral; manage it with rolling summarisation and a context budget. Episodic memory is durable but adds retrieval latency; write selectively and query by recency-weighted relevance. Semantic memory unlocks knowledge-intensive agents but requires an indexed corpus and a refresh pipeline. Procedural memory is often overlooked but is the lever that determines what the agent can actually do; invest in your tool registry and descriptions early.
The builders shipping reliable long-horizon agents in 2026 are not using any particular framework — they are using all four memory types deliberately, instrumenting the boundaries between them, and measuring what their agents actually remember versus what they confabulate. Start with the type that addresses your agent's most urgent failure mode, then expand the stack from there.
For observability tooling that makes memory failures visible in production, see our guide on instrumenting agents with OpenTelemetry. For RAG-specific retrieval optimisation that feeds into semantic memory, the hybrid retrieval guide covers the full pipeline. For LangGraph-specific state management patterns that implement some of what is described here in a graph-based agent framework, see our LangGraph step-by-step guide.