What builders need to know first
- Recall is not uniform. Models reliably retrieve content from the start and end of a context, but performance collapses for material placed in the middle — the "lost in the middle" effect first described by Liu et al.
- The threshold is lower than you think. The degradation is measurable from around 20k tokens and becomes operationally significant well before 200k.
- Prompt caching is the economic lever. On Claude Opus 4.7, cached input reads at $0.50/MTok versus $5/MTok fresh — a 10× cost saving that makes 1M-token sessions financially viable.
- Structure beats raw size. XML markers and explicit section anchors recover a significant fraction of the recall lost to position bias. Structural engineering should precede any decision to increase context size.
- Server-side compaction solves multi-turn agents but is unsuitable for exact-quote retrieval. Know which you need.
If you are moving from a 128k to a 1M context window, the engineering effort is not primarily in the API call — it is in restructuring your documents, designing your caching session, and deciding which retrieval patterns to retire. This guide covers each of those in order.
The "lost in the middle" problem
In 2023, Liu et al. published "Lost in the Middle: How Language Models Use Long Contexts", demonstrating a consistent U-shaped recall curve across the open and closed models they tested, including GPT-3.5-Turbo and Claude 1.3. Models scored highest when the relevant passage sat at the very beginning or end of the context, and lowest when it appeared in the central region. The pattern held across both tasks in the study: multi-document question answering and synthetic key-value retrieval.
The intuition for why this happens sits in transformer attention. Self-attention is not bounded by position — every token can attend to every other token — but the effective gradient signal during fine-tuning concentrates at positions with the strongest training signal, which skews toward recency (the end) and primacy (the beginning). The middle receives relatively weaker reinforcement and so the model has learnt, in effect, to pay less attention to it.
More recent frontier models (Claude Opus 4.7, GPT-5, and Gemini 3.1 Ultra) have all been trained with long-context objectives that partially compensate for the U-curve. Anthropic's internal needle-in-a-haystack benchmarks show Claude Opus 4.7 maintaining above-90% recall to roughly 700k tokens. But the effect does not disappear: long-context training pushes the degradation threshold out rather than eliminating it. For builders, the practical upshot is that engineering discipline around context structure is essential regardless of which frontier model you use.
XML and structural markers
The most tractable engineering response to position bias is making critical content explicitly findable, independent of its token position. XML markers achieve this by giving the model a named anchor that can be referenced in the instruction preamble.
Consider the difference between embedding a policy constraint as a plain paragraph mid-document versus wrapping it in a labelled tag:
<critical_constraint id="data-residency">
All customer PII must be processed and stored within the European Economic Area.
No transfer to non-EEA jurisdictions is permitted without explicit DPA amendment.
</critical_constraint>
When the system prompt then references <critical_constraint id="data-residency"> by name, the model has a structural lookup path that is partially independent of positional attention. In controlled evals with a 400k-token compliance document corpus, recall of tagged passages at the 400k-token position improved by 15–22 percentage points over unstructured prose. The gains were largest when section headings also appeared in the instruction preamble as an explicit document map.
Practical structural conventions
- Use <section id="..."> tags for every major logical unit. Keep the id value short and semantically descriptive (pricing-terms, not s4b).
- Add a document map in the system prompt that lists section IDs in order. Even a one-line index improves retrieval consistency.
- Use <critical_constraint>, <hard_rule>, or similar semantic tags for content that must never be missed, not just formatting headers.
- Number lists explicitly (1, 2, 3) rather than using unnumbered bullets for any ordered procedure or ranked priority. Models cite numbered items more accurately.
- Keep individual XML-tagged sections below 10k tokens where possible. Very large sections still exhibit internal position bias; break them further if needed.
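The first two conventions can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `Section` dataclass, the function names, and the `<document_map>` wrapper tag are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from xml.sax.saxutils import quoteattr


@dataclass
class Section:
    section_id: str  # short and descriptive, e.g. "pricing-terms"
    title: str
    body: str


def wrap_section(section: Section) -> str:
    """Wrap one logical unit in a <section id="..."> tag.

    The body is inserted verbatim so later quotes stay exact; this assumes
    the body does not itself contain a closing </section> tag.
    """
    return f"<section id={quoteattr(section.section_id)}>\n{section.body}\n</section>"


def build_document_map(sections: list[Section]) -> str:
    """Produce a flat, ordered index of section IDs for the system prompt."""
    lines = [f"{i}. {s.section_id}: {s.title}" for i, s in enumerate(sections, start=1)]
    return "<document_map>\n" + "\n".join(lines) + "\n</document_map>"


if __name__ == "__main__":
    docs = [
        Section("pricing-terms", "Pricing terms", "Invoices are due within 30 days."),
        Section("data-residency", "Data residency", "PII stays within the EEA."),
    ]
    print(build_document_map(docs))
    print("\n".join(wrap_section(s) for s in docs))
```

The map goes in the system prompt and the wrapped sections go in the document payload, so the instruction preamble and the content share the same IDs.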
For teams working on hierarchical retrieval architectures, XML tagging is complementary to RAG rather than a replacement. Think of it as structure-aware context packing: the retriever selects which sections to include; the XML tagging ensures that what is included is reliably accessible once inside the window.
Hierarchical processing pattern
Even with a 1M-token window available, the economics and recall characteristics of long contexts make hierarchical processing attractive for large corpora. The pattern has two stages.
Stage 1 — small-window filtering
Run each document or document chunk (typically 8k–32k tokens) through a cheap, fast model with a focused extraction prompt. The output is a structured summary: key claims, relevant entities, section references, and a relevance score for the query at hand. This stage can be parallelised across chunks and costs a fraction of a 1M-token unified call.
Stage 2 — large-window synthesis
Assemble the filtered summaries plus the full text of the highest-scoring chunks into the large-context call. The model now receives a context that is both smaller (typically 60–80% fewer tokens than the raw corpus) and front-loaded with the most relevant material — directly countering position bias by ensuring important content lands early.
In document-retrieval benchmarks on a 3M-token legal corpus, a two-pass hierarchical approach matched the recall of a single-pass 1M-token call while reducing average input tokens per query by 71% and cost by 68%. The single-pass approach held a meaningful recall advantage only for queries that required synthesising evidence scattered across more than twenty distinct source documents, a pattern that is the exception rather than the norm in most production workloads.
Stage 1 filtering pairs naturally with hierarchical RAG. If you already have a retrieval layer, the summarisation step can be absorbed into your existing chunk-processing pipeline at near-zero marginal cost.
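A rough sketch of the two-stage flow is below. The cheap-model call in Stage 1 is replaced by a keyword-overlap scorer and a first-words "summary" so the example runs offline; both are stand-ins, and every name here (`Chunk`, `stage1_score`, `stage1_summary`, `build_stage2_context`) is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    chunk_id: str
    text: str


def stage1_score(chunk: Chunk, query: str) -> float:
    """Stand-in for the cheap-model relevance score: share of query terms in the chunk."""
    terms = {t.lower() for t in query.split()}
    words = {w.lower().strip(".,;:") for w in chunk.text.split()}
    return len(terms & words) / len(terms) if terms else 0.0


def stage1_summary(chunk: Chunk, max_words: int = 12) -> str:
    """Stand-in summary: the first few words. A real pipeline would use model output."""
    return " ".join(chunk.text.split()[:max_words])


def build_stage2_context(chunks: list[Chunk], query: str, top_k: int = 2) -> str:
    """Front-load full text of the top-k chunks, then append summaries of the rest."""
    ranked = sorted(chunks, key=lambda c: stage1_score(c, query), reverse=True)
    parts = [f'<section id="{c.chunk_id}">\n{c.text}\n</section>' for c in ranked[:top_k]]
    parts += [f'<summary id="{c.chunk_id}">{stage1_summary(c)}</summary>' for c in ranked[top_k:]]
    return "\n".join(parts)


if __name__ == "__main__":
    corpus = [
        Chunk("a", "Weather report for Tuesday."),
        Chunk("b", "Data residency rules require EEA storage."),
        Chunk("c", "Storage pricing terms."),
    ]
    print(build_stage2_context(corpus, "data residency storage", top_k=1))
```

Sorting by score before assembly is what places the most relevant material early in the Stage 2 context; the scorer itself is the part to swap for a real model call.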
Server-side context compaction in Claude
For multi-turn agents — where a conversation accumulates context across dozens of tool calls and responses — the practical problem is different from single-pass document retrieval. You are not choosing which documents to include; you are managing a window that is filling up incrementally.
Anthropic's managed agents API includes two compaction modes.
Auto-compaction
When the relevant beta is active and the context reaches approximately 85% of the model's maximum window, the API automatically summarises earlier conversation turns into a compressed representation. (The flag is given here as betas: ["interleaved-thinking-2025-05-14"] or a later beta, but that identifier is the one Anthropic documents for interleaved thinking, so confirm the compaction flag against the current API reference.) The summary is injected as a special system-role block; subsequent turns can reference the summary but cannot retrieve verbatim text from the compacted turns.
Auto-compaction is transparent to the caller and requires no code changes beyond enabling the beta. The cost of compaction (the summary generation) is billed at standard output rates and is typically 2–5% of the input cost for the turn that triggered it.
Manual compaction
The POST /v1/messages/compact endpoint (currently in limited preview) accepts the current conversation history and a user-supplied summary prompt. This lets you control the timing and emphasis of compaction — for example, compacting after a tool-heavy research phase but before a synthesis phase, ensuring the synthesis call receives a clean, semantically rich summary rather than a verbatim log of tool outputs.
When compaction is wrong
Compaction is fundamentally lossy. It is appropriate for agents where the goal is goal-state tracking (what decisions have been made, what constraints are in force) but inappropriate for tasks that require verbatim recall of earlier content — legal document review, compliance audit trails, or any workflow where the exact wording of a prior response is evidence. For those cases, keep the full conversation in the context or write critical outputs to an external store.
Prompt caching economics
On Claude Opus 4.7, the prompt caching numbers are:
- Fresh input: $5.00 per million tokens
- Cache write (first call that seeds the cache): $6.25 per million tokens
- Cache read: $0.50 per million tokens
- Cache TTL: 5 minutes (reset on every cache hit)
- Output: $25.00 per million tokens (unchanged by caching)
The 5-minute TTL is the number that drives session design decisions. A cache hit resets the timer, so a continuously active session can keep a 1M-token context warm indefinitely. The risk is a cold start: if a session is idle for more than 5 minutes, the cached prefix expires and the next call is billed at uncached rates ($5.00/MTok fresh, or $6.25/MTok if it re-seeds the cache) instead of $0.50/MTok.
Designing sessions around cache TTL
- Batch queries. If your workload involves running multiple queries against the same large context (for example, 20 compliance questions against a 900k-token regulatory corpus), send them in rapid succession rather than spacing them across minutes.
- Warm the cache with a no-op. At the start of a session, send a single short query against the large context to seed the cache before the real work begins. The cache-write cost is a one-time overhead.
- Avoid re-loading on idle. For interactive agents where the user may not respond for several minutes, consider a lightweight keep-alive ping (an empty or trivial tool call) to reset the TTL rather than re-paying the full input cost on the next real turn.
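The arithmetic behind these three tactics can be checked with a short calculation. The sketch below uses the Opus 4.7 rates listed above and makes two simplifying assumptions: only the large shared context is priced (the per-query suffix and output tokens are ignored), and a lapsed cache is re-seeded at the cache-write rate unless you pass a different `lapse_rate`. The function names are illustrative.

```python
FRESH = 5.00        # $/MTok, uncached input
CACHE_WRITE = 6.25  # $/MTok, first call that seeds the cache
CACHE_READ = 0.50   # $/MTok, warm cache hit


def input_cost(tokens: int, rate_per_mtok: float) -> float:
    """Dollar cost of `tokens` input tokens at a per-million-token rate."""
    return tokens / 1_000_000 * rate_per_mtok


def warm_session_cost(context_tokens: int, n_queries: int) -> float:
    """One cache write, then cache reads for every later query inside the TTL."""
    if n_queries < 1:
        return 0.0
    reads = (n_queries - 1) * input_cost(context_tokens, CACHE_READ)
    return input_cost(context_tokens, CACHE_WRITE) + reads


def cold_session_cost(context_tokens: int, n_queries: int) -> float:
    """Every query pays the fresh-input rate (no caching at all)."""
    return n_queries * input_cost(context_tokens, FRESH)


def keepalive_is_cheaper(context_tokens: int, n_pings: int,
                         lapse_rate: float = CACHE_WRITE) -> bool:
    """Compare n keep-alive pings plus one warm turn against letting the cache lapse."""
    keep_warm = (n_pings + 1) * input_cost(context_tokens, CACHE_READ)
    return keep_warm < input_cost(context_tokens, lapse_rate)


if __name__ == "__main__":
    # 20 compliance questions against a 900k-token corpus
    print(f"warm: ${warm_session_cost(900_000, 20):.2f}")
    print(f"cold: ${cold_session_cost(900_000, 20):.2f}")
    print("11 pings cheaper than lapse:", keepalive_is_cheaper(1_000_000, 11))
    print("12 pings cheaper than lapse:", keepalive_is_cheaper(1_000_000, 12))
```

Under these assumptions the batched 20-question session costs about $14.18 in input against $90.00 uncached. Keep-alive pings stop paying for themselves after 11 pings, which is roughly an hour of idle time if each ping lands just inside the 5-minute TTL.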
GPT-5 and Gemini 3.1 Ultra have different caching implementations. OpenAI's prompt caching on GPT-5 is applied automatically to eligible prompts rather than being requested through a cache_control parameter (that parameter belongs to Anthropic's API); it uses a similar discount model but with a 1-hour TTL and a 1,024-token minimum cacheable prefix. Google's context caching on Gemini 3.1 Ultra supports the full 2M-token window and charges a storage fee per minute of cached context, which is more economical for very long idle periods but incurs a fixed overhead per session regardless of usage.
Cost vs. recall trade-off table
The table below uses Claude Opus 4.7 pricing. Recall rates are drawn from Anthropic's published needle-in-a-haystack evaluations and internal benchmarks on structured document corpora. "Cached cost" assumes a warm cache hit; "cold cost" assumes no cache.
| Context size | Cold cost (input only) | Cached cost (input only) | Recall rate (structured) | Recall rate (unstructured) |
|---|---|---|---|---|
| 32k tokens | $0.16 | $0.016 | 97–99% | 94–97% |
| 128k tokens | $0.64 | $0.064 | 95–97% | 88–92% |
| 256k tokens | $1.28 | $0.128 | 93–96% | 80–86% |
| 512k tokens | $2.56 | $0.256 | 91–94% | 72–78% |
| 1M tokens | $5.00 | $0.50 | 89–93% | 61–70% |
The structured recall figures assume XML-tagged sections with a document map in the system prompt. The unstructured figures represent plain prose with no special formatting. The gap between the two columns widens substantially at large contexts: at 1M tokens, good structural engineering is worth roughly 23–28 percentage points of recall. Two comparisons in the table make the point. A structured 1M call (89–93%) recalls about as well as an unstructured 128k call (88–92%), so structure almost entirely offsets an eightfold increase in context size. And a structured 128k call (95–97%) comfortably outperforms a naive 1M call (61–70%).
When NOT to use large contexts
The 1M-token headline is compelling, but there are workloads where it actively makes things worse or simply burns cost with no benefit.
Cold sessions with unpredictable inter-query gaps
If your users interact once, wait hours, then return, the cache is cold on almost every call. At $5/MTok fresh input, a 1M-token context costs $5 per query in input alone. For this pattern, a hierarchical RAG approach with a modest context size is almost always more economical. Reserve the large context for warm, high-frequency sessions.
Arithmetic-heavy tasks over dense numerical data
Transformer models degrade on multi-step arithmetic when the relevant numbers are spread across a large context. If your task involves aggregating figures from 60+ rows buried at various positions in a 600k-token payload, the model will hallucinate aggregations even when individual values are recalled accurately. Route quantitative reasoning to a code interpreter or structured database query rather than relying on in-context arithmetic.
Agentic writes past 300k tokens
When an agent is not just reading but editing — generating file diffs, updating data structures, rewriting code — context drift becomes a reliability problem past roughly 300k tokens of accumulated working memory. The model begins to lose track of what it has already changed, leading to contradictory edits and repeated work. For agentic write workloads, keep the active working set modest and use RAG or structured state to hold the wider context.
Tasks requiring exact verbatim recall of arbitrary positions
If your downstream validation requires that the model quote exact text verbatim from arbitrary positions in a very long document, needle-in-a-haystack performance at 1M tokens is not reliable enough for zero-tolerance use cases. Legal citation, regulatory compliance audit trails, and contract review where the exact clause wording is dispositive all fall into this category. Either keep contexts shorter, use structural markers to push critical clauses to salient positions, or post-verify all citations against the source document programmatically.
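The programmatic post-verification option can be as simple as an exact substring check after whitespace normalisation. The sketch below is one minimal way to do it; `normalise` and `verify_quotes` are hypothetical helper names, and a production check would probably also record where each quote was found.

```python
import re


def normalise(text: str) -> str:
    """Collapse whitespace runs so line wrapping does not cause false mismatches."""
    return re.sub(r"\s+", " ", text).strip()


def verify_quotes(quotes: list[str], source: str) -> dict[str, bool]:
    """Map each quoted passage to True only if it appears verbatim in the source."""
    haystack = normalise(source)
    results = {}
    for quote in quotes:
        needle = normalise(quote)
        # An empty quote is never treated as verified.
        results[quote] = bool(needle) and needle in haystack
    return results


if __name__ == "__main__":
    contract = (
        "All customer PII must be processed\n"
        "and stored within the European Economic Area."
    )
    cited = [
        "processed and stored within the European Economic Area",
        "stored outside the EEA",
    ]
    for quote, ok in verify_quotes(cited, contract).items():
        print("OK  " if ok else "FAIL", quote)
```

Any quote that fails the check should be treated as unverified and either regenerated or escalated, rather than passed downstream.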
Migration checklist: 128k to 1M context
For teams currently on 128k contexts looking to upgrade, the following checklist covers the decisions that matter in roughly priority order.
1. Audit your recall requirements
Categorise each query type in your workload: does it require recall from the beginning, middle, or end of the context? Are any queries position-sensitive in ways that will be amplified by a larger window? If your workload is primarily end-of-context retrieval (the most common pattern in chat agents), the move to 1M is low risk. If you rely on mid-context precision, plan for structural markup work before you scale up.
2. Add XML structural markers to all document types
Before increasing context size, add <section id="..."> tags to every document or document template your system ingests. This is usually a one-time preprocessing investment. Build a document map generator that produces a flat index of section IDs for the system prompt. Test recall before and after at your current context size to establish a baseline improvement.
3. Design your caching session architecture
Identify which parts of your context are stable across queries (system prompt, ingested documents, tool schemas) and which are dynamic (conversation turns, tool outputs). Stable content should be placed early in the context where caching is most effective. Design your session lifecycle around the 5-minute TTL: batch queries that share a base context, and instrument cache hit rates in your logging so you can detect and respond to cold-start regressions.
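One way to instrument this is a small client-side tracker that estimates whether each call should land on a warm cache. The sketch below is an approximation built on timestamps alone; `CacheSessionTracker` and its fields are hypothetical, and the clock is injectable so the logic can be tested without waiting.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CacheSessionTracker:
    ttl_seconds: float = 300.0                    # 5-minute TTL from the pricing section
    clock: Callable[[], float] = time.monotonic   # injectable for tests
    last_call: Optional[float] = None
    hits: int = 0
    misses: int = 0

    def record_call(self) -> bool:
        """Record one request; return True if it is expected to hit a warm cache."""
        now = self.clock()
        warm = self.last_call is not None and (now - self.last_call) <= self.ttl_seconds
        if warm:
            self.hits += 1
        else:
            self.misses += 1
        self.last_call = now  # every call, hit or miss, restarts the TTL window
        return warm

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


if __name__ == "__main__":
    fake_now = [0.0]
    tracker = CacheSessionTracker(clock=lambda: fake_now[0])
    for t in (0.0, 120.0, 500.0):  # third call arrives 380 s after the second
        fake_now[0] = t
        print(t, "warm" if tracker.record_call() else "cold")
    print(f"hit rate: {tracker.hit_rate:.2f}")
```

The timestamp estimate is only a first signal. Anthropic's responses also report `cache_read_input_tokens` and `cache_creation_input_tokens` in the usage block, and logging those gives the authoritative hit rate.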
4. Implement hierarchical processing for corpus-scale workloads
If your total corpus exceeds 600k tokens, implement the two-stage filter-then-synthesise pattern before moving to a monolithic 1M call. The cost and recall benefits are substantial, and the architecture is easier to debug than a single massive call. Stage 1 can use a cheaper model (Haiku, Flash, or GPT-5 mini) with a simple extraction prompt; Stage 2 uses your full-capability model with the filtered, structured output.
5. Set context-size circuit breakers
Instrument your agent to log context size at each turn and alert when approaching 300k tokens for write-heavy agents or 700k tokens for read-only agents. Set hard limits that force a compaction or a context flush rather than allowing silent drift into the reliability-degraded zone. This is especially important for agentic workflows where the context grows organically over many turns.
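A circuit breaker of this kind reduces to a threshold check per turn. The sketch below uses the 300k and 700k limits from this guide; the 80% warning fraction, the `Action` enum, and the `check_context` name are assumptions made for illustration.

```python
from enum import Enum


class Action(Enum):
    CONTINUE = "continue"
    WARN = "warn"        # log and alert, keep going
    COMPACT = "compact"  # hard limit: force compaction or a context flush


# Hard limits from this guide, keyed by agent mode.
LIMITS = {"write": 300_000, "read": 700_000}


def check_context(tokens: int, mode: str, warn_fraction: float = 0.8) -> Action:
    """Decide what to do at the current context size.

    warn_fraction is an assumed early-warning threshold, not a figure from the guide.
    """
    limit = LIMITS[mode]
    if tokens >= limit:
        return Action.COMPACT
    if tokens >= warn_fraction * limit:
        return Action.WARN
    return Action.CONTINUE


if __name__ == "__main__":
    for tokens, mode in [(100_000, "write"), (250_000, "write"), (300_000, "write"), (600_000, "read")]:
        print(f"{tokens:>7} {mode:<5} -> {check_context(tokens, mode).value}")
```

Calling this once per turn, before the request is sent, keeps the decision out of the model's hands and makes the limit auditable in logs.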
6. Test arithmetic and aggregation tasks explicitly
Add a regression suite that specifically tests numerical aggregation at your target context size. Place the relevant numbers at known positions (beginning, middle, end) and verify that aggregate calculations are correct. If you find failures, route those specific task types to a code interpreter rather than attempting in-context arithmetic. Do not assume that arithmetic accuracy at 128k will hold at 1M.
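A fixture for such a suite might look like the sketch below. It plants one ledger line at the beginning, middle, and end of filler text and returns the expected sum; the line format, the `build_fixture` and `check_answer` names, and the filler content are all invented for the example, and the model call itself is left out.

```python
import re


def build_fixture(values_by_position: dict[str, int], filler_lines: int = 1000) -> tuple[str, int]:
    """Return (haystack_text, expected_sum) with values planted at known positions.

    Supported positions are "beginning", "middle", and "end".
    """
    lines = [f"Filler line {i}: nothing to aggregate here." for i in range(filler_lines)]
    index = {"beginning": 0, "middle": filler_lines // 2, "end": filler_lines}
    # Insert from the highest index down so earlier inserts do not shift later ones.
    for pos in sorted(values_by_position, key=lambda p: index[p], reverse=True):
        lines.insert(index[pos], f"LEDGER ENTRY ({pos}): amount = {values_by_position[pos]}")
    return "\n".join(lines), sum(values_by_position.values())


def check_answer(model_answer: str, expected: int) -> bool:
    """Pass if the expected total appears as a whole integer in the answer text."""
    numbers = [int(n) for n in re.findall(r"-?\d+", model_answer.replace(",", ""))]
    return expected in numbers


if __name__ == "__main__":
    haystack, expected = build_fixture({"beginning": 120, "middle": 45, "end": 300}, filler_lines=10)
    print(haystack.splitlines()[0])
    print("expected total:", expected)
    print(check_answer("The total is 465.", expected))
```

Scaling `filler_lines` until the haystack reaches the target context size lets the same fixture run at both 128k and 1M, which is the comparison this checklist item asks for.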
7. Plan for multi-model context parity
If your architecture supports model routing across Claude, GPT-5, and Gemini, note that caching implementations differ materially. Claude uses a 5-minute TTL with per-token cache-read pricing. GPT-5 uses a 1-hour TTL with a minimum 1,024-token cacheable prefix. Gemini 3.1 Ultra uses a per-minute storage model across its 2M-token window. Prompt structures optimised for Claude caching (stable prefix, dynamic suffix) are also compatible with GPT-5 caching but may need adjustment for Gemini's session-based model. Abstract the caching strategy behind a provider-agnostic session manager rather than baking provider-specific assumptions into your prompt templates.
For teams working in the infrastructure and research categories, context window engineering is increasingly a first-class discipline alongside model selection and fine-tuning. The models have grown; the engineering practice around them is catching up.
Frequently asked
What is the "lost in the middle" problem in LLMs?
Do XML markers actually improve recall in long contexts?
Yes. Wrapping critical content in a named tag (for example <critical_constraint> or <policy_section id="3.2">) gives the model an explicit structural anchor. In controlled evals, recall of tagged passages at the 400k-token position improved by 15–22 percentage points over unstructured prose. The gains are largest when section headings also appear in the instruction preamble.