What shifted in April
Three substantive papers landed on arXiv this month, and together they argue that the dominant pattern of the last two years — embed the query, fetch top-k chunks, stuff them into the prompt — is finally being retired in serious systems. The replacement is not a single new technique. It is a category shift: retrieval-augmented generation moves from a fixed pre-processing step to a set of tools the model can invoke, plan around, and re-issue.
The headline contributions:
- A-RAG (arXiv 2602.03442) shows that exposing three retrieval tools at three granularities (keyword, semantic, chunk-read) outperforms strong baselines while spending the same or fewer retrieved tokens.
- InfoDeepSeek (arXiv 2505.15872) provides the first benchmark targeted at agentic information seeking inside a RAG loop, and the framework it proposes is genuinely modular.
- SoK Agentic RAG (arXiv 2603.07379) is the first attempt to systematise the design space along axes like agent cardinality, control structure, autonomy and knowledge representation.
The practical takeaway: flat-chunk retrieval is now a legacy pattern for any non-trivial query workload. Multi-hop, comparative, and aggregate queries break it. The cure that consistently wins is to expose retrieval as a small set of typed tools and let the model orchestrate, while evaluating retrieval trajectories rather than just final answers. A related survey, Towards Agentic RAG with Deep Reasoning (arXiv 2507.09477), reaches similar conclusions across the wider RAG-Reasoning literature.
Before you migrate anything, instrument your existing RAG pipeline to log the full retrieval trajectory — query, retrieved doc IDs, scores, generated answer. You cannot evaluate trajectory quality without this telemetry, and you cannot prove a hierarchical refactor was worth the engineering cost without comparing trajectories side by side.
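A minimal sketch of that telemetry, assuming a JSON Lines file is an acceptable sink. The `TrajectoryRecord` class, its field names, and both helper functions are illustrative rather than taken from any of the papers; adapt them to whatever your pipeline already emits.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class TrajectoryRecord:
    """One pass through the existing flat pipeline (field names are assumptions)."""
    query: str
    retrieved_doc_ids: List[str]
    scores: List[float]
    answer: str
    timestamp: float = field(default_factory=time.time)


def append_record(path: str, record: TrajectoryRecord) -> None:
    """Append one record as a single JSON line so the log can be replayed later."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(record)) + "\n")


def load_records(path: str) -> List[TrajectoryRecord]:
    """Read the log back into records for side-by-side comparison."""
    with open(path, encoding="utf-8") as fh:
        return [TrajectoryRecord(**json.loads(line)) for line in fh if line.strip()]
```

Writing one line per query keeps the log append-only and easy to diff between the old and new pipelines.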
Why flat-chunk RAG breaks at scale — three failure modes
The April papers do not invent the criticisms of flat RAG, but they do organise them. Three failure modes recur across the empirical sections.
Failure 1: granularity mismatch. A user asks "which clauses in our supplier contracts most often trigger renegotiation?" — an aggregate question that requires scanning many documents at low resolution before drilling into a few at high resolution. Flat RAG fetches 8 chunks of 512 tokens and either misses the breadth or burns the context window. No single granularity is correct.
Failure 2: missing the iteration loop. Multi-hop queries — "find the portfolio company that recently raised a Series B, then tell me their last quarter's churn" — require the second retrieval to be conditioned on the first. Flat RAG does one retrieval. Bolt-on recursive retrievers tend to fetch too eagerly because they do not know when to stop.
Failure 3: opaque trajectories. When a flat RAG answer is wrong, you cannot easily tell whether the retriever missed, the reranker reordered badly, or the generator hallucinated despite good context. The pipeline is a debugging black box, which makes incremental improvement painful.
Paper 1 — A-RAG: hierarchical retrieval interfaces
A-RAG (arXiv 2602.03442) is the most directly actionable of the three. Its proposal is simple to describe and the empirical results are strong. Instead of a single retrieval step, the LLM is given three tools, each operating at a different level of granularity:
- Keyword search — fast, sparse, returns document titles and short summaries. Cheap. Useful for the model to scan the corpus and orient itself.
- Semantic search — dense embedding, returns chunk-level passages. The classic RAG primitive, but now used selectively rather than as the universal entry point.
- Chunk read — given a specific document or chunk ID, fetch its full text. Used after the other two have narrowed the search space.
The behaviour the paper reports is that capable models learn to use the cheap tool first, narrow the candidate set, and only spend on chunk reads when needed. The reported gains are not on every benchmark — single-hop fact lookup is barely affected — but on multi-hop and aggregation queries the numbers move noticeably while retrieved-token counts stay flat or drop.
The implementation pattern transfers cleanly. If your retriever already supports both keyword (BM25 or similar) and dense modes, exposing them as separate tools is a half-day of work. The harder part is teaching the model how to use them — which is largely a system-prompt and few-shot problem at the frontier-model tier, and a fine-tuning problem below it.
Paper 2 — InfoDeepSeek: benchmarking the trajectory, not just the answer
InfoDeepSeek (arXiv 2505.15872) makes a different but complementary contribution. It proposes a generalisable agentic RAG framework with four modular components — planning, memory, tool use, and generation — augmented with a reflection step. More importantly, it ships a benchmark designed for agentic information seeking, which is something the field has been short on.
The reason the benchmark matters: most existing RAG evaluations measure final-answer accuracy against a gold reference. That metric hides cases where the model retrieved wastefully but still landed on the correct answer, and it hides cases where a wrong answer came from a retrieval path that was nearly correct. InfoDeepSeek scores the trajectory — were the right tools called, in a sensible order, with reasonable arguments? Was the model's eventual stopping decision warranted by the evidence it had gathered?
If you are building an evaluation harness for a production RAG system, the InfoDeepSeek framing is the right one. Log every tool call. Score path quality alongside answer quality. Treat trajectory regressions as bugs even when answer accuracy is unchanged — they are predictive of future failures.
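One way to treat path quality as a first-class signal is sketched below. These are simple heuristics chosen for illustration, not the metric InfoDeepSeek defines; `ToolCall`, `score_trajectory`, and `is_regression` are hypothetical names, and the sketch assumes you have gold document IDs for each evaluation query.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class ToolCall:
    name: str                      # e.g. "keyword_search"
    argument: str                  # the query or ID passed to the tool
    returned_ids: Tuple[str, ...]  # doc or chunk IDs the tool returned


def score_trajectory(calls: List[ToolCall], gold_ids: Set[str]) -> Dict[str, float]:
    """Heuristic path-quality signals for one trajectory."""
    seen = set()
    redundant = 0
    first_hit = None
    for step, call in enumerate(calls, start=1):
        key = (call.name, call.argument)
        if key in seen:
            redundant += 1  # the identical call was already issued
        seen.add(key)
        if first_hit is None and gold_ids & set(call.returned_ids):
            first_hit = step
    return {
        "n_calls": len(calls),
        "redundant_calls": redundant,
        "found_gold": 1.0 if first_hit is not None else 0.0,
        "calls_to_first_hit": float(first_hit) if first_hit is not None else float("inf"),
    }


def is_regression(baseline: Dict[str, float], candidate: Dict[str, float]) -> bool:
    """Flag a worse path even when the final answer is unchanged."""
    return (
        candidate["found_gold"] < baseline["found_gold"]
        or candidate["redundant_calls"] > baseline["redundant_calls"]
        or candidate["calls_to_first_hit"] > baseline["calls_to_first_hit"]
    )
```

Running `is_regression` over a fixed evaluation set on every change gives a path-level check that sits alongside answer accuracy.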
Resist the temptation to expose ten retrieval tools just because tool-use APIs make it easy. Smaller models — anything under roughly 8B parameters — get visibly worse with tool counts above three or four. The model burns budget choosing between near-duplicate tools and your trajectory quality collapses. Three tools at distinct granularities is the sweet spot the April papers consistently land on.
Paper 3 — SoK Agentic RAG: the taxonomy your architecture sits inside
The SoK paper (arXiv 2603.07379) is not building a system; it is mapping the field. The taxonomy proposes four orthogonal axes:
- Agent cardinality — single-agent loops versus multi-agent systems where a planner agent delegates to retriever agents.
- Control structure — fixed pipelines versus model-driven loops versus hybrid approaches with deterministic guardrails.
- Autonomy — how much budget the model controls (number of tool calls, when to stop) versus how much is enforced externally.
- Knowledge representation — flat chunks, hierarchical structures, knowledge graphs, or hybrid stores.
The value here is diagnostic. If you cannot place your current RAG system on each of these axes in under a minute, you have implicit design choices baked in that you have not consciously made. The SoK paper also argues, alongside InfoDeepSeek, for trajectory-level evaluation as the field-wide default — partly because the static answer metrics popularised by early benchmarks systematically favour the simplest architectures regardless of generalisation.
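To make the one-minute exercise concrete, the four axes can be written down as a small typed record. This is only a sketch: the enum members paraphrase the options listed above, the two-valued `Autonomy` enum flattens what the paper describes as a matter of degree, and classifying plain top-k retrieval as "single-agent" is an assumption made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Cardinality(Enum):
    SINGLE_AGENT = "single-agent"
    MULTI_AGENT = "multi-agent"


class Control(Enum):
    FIXED_PIPELINE = "fixed pipeline"
    MODEL_DRIVEN = "model-driven loop"
    HYBRID = "hybrid with deterministic guardrails"


class Autonomy(Enum):
    EXTERNALLY_ENFORCED = "budget enforced externally"
    MODEL_CONTROLLED = "budget controlled by the model"


class Knowledge(Enum):
    FLAT_CHUNKS = "flat chunks"
    HIERARCHICAL = "hierarchical"
    KNOWLEDGE_GRAPH = "knowledge graph"
    HYBRID = "hybrid store"


@dataclass(frozen=True)
class RagProfile:
    """Where one system sits on each of the four axes."""
    cardinality: Cardinality
    control: Control
    autonomy: Autonomy
    knowledge: Knowledge

    def describe(self) -> str:
        axes = (self.cardinality, self.control, self.autonomy, self.knowledge)
        return ", ".join(axis.value for axis in axes)


# A classic flat top-k pipeline, under the assumptions stated above.
flat_top_k = RagProfile(
    Cardinality.SINGLE_AGENT,
    Control.FIXED_PIPELINE,
    Autonomy.EXTERNALLY_ENFORCED,
    Knowledge.FLAT_CHUNKS,
)
print(flat_top_k.describe())
```

Keeping a record like this next to the system's configuration makes each design choice explicit and reviewable.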
The pattern that emerges across all three
Read together, the papers converge on a tight design pattern. Expose retrieval as a small, typed set of tools at distinct granularities. Let the model drive the loop. Evaluate trajectories. Keep tool count modest. The architectural comparison below summarises how the three families compare in practice.
| Pattern | Latency | Token cost | Multi-hop quality | Observability |
|---|---|---|---|---|
| Flat-chunk RAG | Lowest — single retrieval | Lowest per query, but inflated by irrelevant chunks | Poor — no iteration | Simple but opaque on failures |
| Agentic RAG, single tool | Medium — model chooses when to fetch | Medium — fewer wasted retrievals | Moderate — model can re-query | Better — every call logged |
| Agentic RAG, hierarchical tools | Medium-high — more decisions per turn | Comparable to single-tool, often lower | Strong — granularity matched to query | Best — typed tools make trajectories analysable |
A minimal hierarchical retrieval tool definition
The implementation footprint is smaller than people expect. Here is the pattern in Anthropic tool-use format — the OpenAI equivalent is a one-line transformation. Define each retrieval mode as a tool with a clear description; the description is what the model conditions on.
```python
# Three retrieval tools at distinct granularities, in Anthropic tool-use format.
tools = [
    {
        # Coarse: scan titles and summaries to orient in the corpus.
        "name": "keyword_search",
        "description": "Fast sparse search over document titles and summaries. Use first to scan the corpus and identify candidate documents. Returns up to 20 (doc_id, title, summary) tuples.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Keywords or short phrase."},
                "limit": {"type": "integer", "default": 20}
            },
            "required": ["query"]
        }
    },
    {
        # Medium: chunk-level dense retrieval, optionally scoped to candidate docs.
        "name": "semantic_search",
        "description": "Dense embedding search at chunk level. Use after keyword_search has narrowed the candidate set, or for queries with no obvious keywords. Returns up to 8 chunks with scores.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "doc_ids": {"type": "array", "items": {"type": "string"},
                            "description": "Optional — restrict search to these docs."},
                "limit": {"type": "integer", "default": 8}
            },
            "required": ["query"]
        }
    },
    {
        # Fine: read the full text of one chunk once it has been identified.
        "name": "chunk_read",
        "description": "Fetch the full text of a specific chunk by ID. Use only after the other tools have identified relevant chunks.",
        "input_schema": {
            "type": "object",
            "properties": {
                "chunk_id": {"type": "string"}
            },
            "required": ["chunk_id"]
        }
    }
]
```
The tool descriptions are doing real work. They tell the model when each tool is appropriate, which is most of how A-RAG-style behaviour emerges without fine-tuning.
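For readers on the OpenAI side, the transformation mentioned earlier can be sketched as follows, assuming the Chat Completions `tools` format, where each entry is wrapped as `{"type": "function", "function": {...}}` and the JSON Schema moves from `input_schema` to `parameters`. The helper name `to_openai_tool` is hypothetical, and the single-tool list exists only to keep the example self-contained.

```python
from typing import Any, Dict, List


def to_openai_tool(tool: Dict[str, Any]) -> Dict[str, Any]:
    """Wrap an Anthropic-style tool dict in the OpenAI function-tool envelope."""
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool["description"],
            "parameters": tool["input_schema"],  # the JSON Schema is reused unchanged
        },
    }


# One Anthropic-style definition, repeated here so the snippet runs on its own.
example_tools: List[Dict[str, Any]] = [
    {
        "name": "chunk_read",
        "description": "Fetch the full text of a specific chunk by ID.",
        "input_schema": {
            "type": "object",
            "properties": {"chunk_id": {"type": "string"}},
            "required": ["chunk_id"],
        },
    }
]

openai_tools = [to_openai_tool(t) for t in example_tools]
```

Because the description string is carried over verbatim, any tuning done on the descriptions applies to both providers.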
"We migrated a legal-document RAG from flat top-k to a three-tool hierarchical setup over a single sprint. The win was not raw accuracy — that moved a few points. It was that our trajectory logs finally told us why each answer happened. Debugging went from archaeology to engineering. The flat pipeline had been a black box for a year and we had not realised it."
— Sophie, Rising Builder · Birmingham, UK
Implementation playbook — refactoring flat RAG to hierarchical-tool RAG
If you are convinced enough to plan a sprint, the order of operations matters. Three steps, each independently shippable.
Step 1: instrument trajectories on your existing pipeline
Before changing anything, log the full retrieval trajectory — query string, retrieved chunk IDs, similarity scores, the final answer, and any reranker outputs. Keep this for two weeks of production traffic. You will use it as a baseline and as a held-out evaluation set. This is also the moment to define your trajectory-quality metric, even if it is initially as simple as "did the right document appear in the retrieved set?" — you will refine it later.
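The simple starting metric described above can be sketched in a few lines. The sketch assumes each logged record is a dict with `query` and `retrieved_ids` keys and that you have hand-labelled relevant IDs for a subset of queries; both the record shape and the function names are illustrative.

```python
from typing import Any, Dict, List, Set


def hit(retrieved_ids: List[str], relevant_ids: Set[str]) -> bool:
    """True if at least one relevant ID appears in the retrieved set."""
    return any(item_id in relevant_ids for item_id in retrieved_ids)


def hit_rate(log: List[Dict[str, Any]], labels: Dict[str, Set[str]]) -> float:
    """Fraction of labelled queries whose retrieved set contains a relevant ID."""
    scored = [record for record in log if record["query"] in labels]
    if not scored:
        return 0.0  # nothing labelled yet, so there is nothing to score
    hits = sum(hit(r["retrieved_ids"], labels[r["query"]]) for r in scored)
    return hits / len(scored)
```

Unlabelled queries are skipped rather than counted as misses, so the number only moves when labelled traffic changes.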
Step 2: add a typed tool layer over your retriever
Wrap your existing retriever in two or three tools — keyword and semantic at minimum, chunk read if you have document-level identifiers. Do not change the underlying retriever yet; this is purely a packaging change. Switch your generation pipeline to a tool-use loop and run it side-by-side with the flat pipeline on the held-out set. Compare trajectory metrics. Expect that for single-hop queries the new pipeline is at parity with the old, and for multi-hop queries it pulls ahead.
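A sketch of that packaging layer follows. The `Retriever` protocol is an assumption about what an existing retriever might expose (`keyword`, `dense`, and `read` are hypothetical method names); the tool names match the three-tool setup discussed earlier.

```python
from typing import Any, Callable, Dict, List, Optional, Protocol


class Retriever(Protocol):
    """Shape assumed for the existing retriever; adapt to what yours exposes."""
    def keyword(self, query: str, limit: int) -> List[Dict[str, Any]]: ...
    def dense(self, query: str, limit: int, doc_ids: Optional[List[str]]) -> List[Dict[str, Any]]: ...
    def read(self, chunk_id: str) -> str: ...


def build_handlers(retriever: Retriever) -> Dict[str, Callable[..., Any]]:
    """Map tool names to thin wrappers; the retriever itself is untouched."""
    return {
        "keyword_search": lambda query, limit=20: retriever.keyword(query, limit),
        "semantic_search": lambda query, doc_ids=None, limit=8: retriever.dense(query, limit, doc_ids),
        "chunk_read": lambda chunk_id: retriever.read(chunk_id),
    }


def dispatch(handlers: Dict[str, Callable[..., Any]], name: str,
             arguments: Dict[str, Any]) -> Dict[str, Any]:
    """Execute one model-issued tool call, returning errors as data."""
    if name not in handlers:
        return {"error": f"unknown tool: {name}"}
    try:
        return {"result": handlers[name](**arguments)}
    except TypeError as exc:
        # Typically a missing or unexpected argument from the model.
        return {"error": f"bad arguments for {name}: {exc}"}
```

Returning errors as values rather than raising lets the loop feed them back to the model, which can then correct the call instead of aborting the query.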
Step 3: tune tool descriptions and budgets
The single biggest lever you control is tool descriptions. Iterate on them based on the trajectory failures you see. Add an explicit budget — maximum five tool calls per query, say — to bound worst-case cost. Add a reflection step (InfoDeepSeek-style) where the model briefly evaluates the gathered evidence before answering. This is the step where the curve actually bends; expect to spend longer here than on the previous two combined.
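The budget and reflection step can be sketched as a small driver loop. Everything model-facing is passed in as a callable so the example stays provider-neutral: `propose`, `call_tool`, `reflect`, and `answer` are hypothetical hooks you would back with real model and retriever calls, and the cap of five mirrors the figure suggested above.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

MAX_TOOL_CALLS = 5  # hard cap on tool calls per query

Evidence = List[Dict[str, Any]]
ToolRequest = Tuple[str, Dict[str, Any]]  # (tool name, arguments)


def run_with_budget(
    question: str,
    propose: Callable[[str, Evidence], Optional[ToolRequest]],
    call_tool: Callable[[str, Dict[str, Any]], Any],
    reflect: Callable[[str, Evidence], str],
    answer: Callable[[str, Evidence, str], str],
    max_calls: int = MAX_TOOL_CALLS,
) -> Dict[str, Any]:
    """Drive the tool loop under a call budget, then reflect before answering."""
    evidence: Evidence = []
    stopped = "budget"
    for _ in range(max_calls):
        request = propose(question, evidence)  # None means "ready to answer"
        if request is None:
            stopped = "model"
            break
        name, arguments = request
        evidence.append(
            {"tool": name, "arguments": arguments, "result": call_tool(name, arguments)}
        )
    # Reflection: a short assessment of the evidence, produced before the answer.
    note = reflect(question, evidence)
    return {
        "answer": answer(question, evidence, note),
        "reflection": note,
        "n_calls": len(evidence),
        "stopped": stopped,
    }
```

Recording whether the loop ended by the model's choice or by the cap gives a direct signal for tuning: a high share of `"budget"` stops suggests the descriptions or the cap need another look.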
Open questions — what the papers do not yet solve
Three honest gaps:
- Cost variance: agentic RAG has higher tail latency than flat RAG because some queries trigger long tool-use loops, and none of the April papers has a satisfying answer beyond hard budget caps.
- Evaluation harness reliability: trajectory scoring is partially subjective, and inter-annotator agreement is not high.
- Observability at scale: a single trajectory is easy to inspect, a million is not, and the tooling for analysing trajectory distributions across a fleet does not yet exist off the shelf.
None of these block adoption. They do mean the gap between "this paper says it works" and "this works reliably in production" is non-trivial. Plan for an evaluation investment roughly the size of your retriever investment.
Primary sources: A-RAG (2602.03442), InfoDeepSeek (2505.15872), SoK Agentic RAG (2603.07379), Towards Agentic RAG with Deep Reasoning (2507.09477).