What builders shipping RAG need to know

  • Retrieval, not generation, is the failure point — when a RAG answer is wrong, the relevant passage was missing or buried roughly 73% of the time.
  • Chunking is the highest-ROI lever — it costs nothing extra at query time and decides what the retriever can even find.
  • Start with recursive character splitting at 512 tokens with 50-100 tokens of overlap — the benchmark-validated default for 2026.
  • Reach for semantic chunking selectively — it helps documents with abrupt topic shifts and earns its cost only when structure is uneven.

Retrieval-augmented generation has become the default pattern for grounding a language model in your own data — whether that is a corpus of Indian case law, a UK fintech's internal knowledge base, product documentation, or support tickets. The architecture is now well understood: split documents into chunks, embed them, store the vectors, retrieve the closest matches to a query, and hand them to the model as context.

Yet most RAG systems in production underperform, and teams reach for the wrong fix. They swap the language model, tune the prompt, or raise the temperature. They rarely look at the step that actually broke. This playbook is about that step — and specifically about chunking, the single decision that quietly determines whether your retriever can find the right passage at all.

Why chunking is the highest-ROI lever

Here is the uncomfortable finding that should reframe how you debug RAG. When a retrieval-augmented system produces a wrong or unsupported answer, the cause is the retrieval step — not the generation step — roughly 73% of the time. The model was perfectly capable of answering correctly. It simply never received the passage it needed, because the retriever did not surface it.

That changes where your effort should go. Swapping a frontier model for a slightly better one improves the 27% of failures that are genuinely generation problems. Fixing retrieval addresses the other 73%. And within retrieval, chunking is the cheapest lever you have: it is decided once, at ingestion time, and costs nothing extra on every query thereafter. A better reranker adds latency to every request. A better chunking strategy does not.

Think of a chunk as the smallest unit your system can ever retrieve. If a chunk is too large, a single relevant sentence is diluted by surrounding noise and its embedding drifts away from the query. If a chunk is too small, it loses the context that made it meaningful — a clause in a contract means little without the definitions around it. If a chunk splits mid-thought, the half that matched the query may not be the half that answers it. Every retrieval failure mode traces back, in part, to how you cut the text.

Pro tip

Before you touch the model or the prompt, log the retrieved chunks for your last 50 failed queries and read them by hand. If the right answer is not in the retrieved set, no amount of prompt engineering will save you — and you have just confirmed the failure is retrieval, not generation.
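That hand review can be backed by a quick tally. Here is a minimal sketch, assuming you have logged, for each failed query, the ID of the chunk known to contain the answer and the IDs the retriever actually returned. The `FailedQuery` record and its field names are hypothetical, not part of any particular framework.

```python
from dataclasses import dataclass


@dataclass
class FailedQuery:
    query: str
    gold_chunk_id: str              # chunk known to contain the answer (assumed labelled by hand)
    retrieved_chunk_ids: list[str]  # IDs the retriever returned for this query


def triage(failures: list[FailedQuery]) -> dict[str, int]:
    """Split failed queries by whether the right chunk ever reached the model."""
    counts = {"retrieval": 0, "generation": 0}
    for failure in failures:
        if failure.gold_chunk_id in failure.retrieved_chunk_ids:
            # The right passage was in context and the answer was still wrong.
            counts["generation"] += 1
        else:
            # The right passage was never retrieved.
            counts["retrieval"] += 1
    return counts
```

If the "retrieval" count dominates, chunking and retrieval are where to spend the next week, not the prompt.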

The chunking strategies, explained plainly

There are five strategies worth knowing. They sit on a spectrum from cheap and naive to expensive and structure-aware.

Fixed-size chunking

The simplest approach: cut the text every N characters or tokens, regardless of where sentences or paragraphs fall. It is fast and predictable, but it routinely splits sentences in half and severs ideas from their context. Use it only for a quick prototype, or for text that is already uniform and short.

Recursive character splitting

The pragmatic default. It tries to split on the largest natural boundary first — paragraph breaks — then falls back to sentence breaks, then word breaks, until each chunk fits the target size. The result respects the structure of the text far more than fixed-size cutting while remaining cheap and deterministic. This is what most production RAG systems should start with.
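The fallback order described above can be sketched in a few lines. This is an illustrative version, not the implementation of any specific library. It assumes whitespace-separated words as a stand-in for tokens, and the separator list is a hypothetical choice.

```python
SEPARATORS = ["\n\n", ". ", " "]  # paragraph, then sentence, then word (assumed order)


def recursive_split(text: str, max_tokens: int, separators=SEPARATORS) -> list[str]:
    """Split on the coarsest separator first; recurse only into pieces still too big."""
    # Token count is approximated by whitespace-separated words.
    if len(text.split()) <= max_tokens or not separators:
        return [text]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate.split()) <= max_tokens:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)  # the separator at this cut point is dropped
            current = ""
        if len(piece.split()) <= max_tokens:
            current = piece
        else:
            # This piece alone is too big: fall back to the next, finer separator.
            chunks.extend(recursive_split(piece, max_tokens, finer))
    if current:
        chunks.append(current)
    return chunks
```

Small paragraphs get packed together up to the limit, and only oversized ones are broken at sentence or word level. In production you would count tokens with your embedding model's own tokenizer, not a word count.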

Semantic chunking

Instead of counting tokens, this approach embeds sentences and looks for points where the embedding similarity between consecutive sentences drops sharply, a signal that the topic has shifted. Chunk boundaries are placed at those shifts. The chunks vary in length but each one holds a coherent idea. It costs more, because you embed every sentence to find the boundaries and then embed the finished chunks for the index, but it can meaningfully help messy content.
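To make the boundary rule concrete, here is a toy sketch. The `toy_embed` function is a bag-of-words stand-in for a real embedding model, and the 0.2 threshold is an arbitrary assumption. With a real model you would pass your own `embed` function and tune the threshold on your corpus.

```python
import math
import re
from collections import Counter


def toy_embed(sentence: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences: list[str], threshold: float = 0.2, embed=toy_embed) -> list[str]:
    """Start a new chunk wherever similarity between neighbouring sentences drops below threshold."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev_vec, vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev_vec, vec) < threshold:
            chunks.append(" ".join(current))  # topic shift: close the current chunk
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks
```

Real implementations often use a percentile of the observed similarity drops instead of a fixed threshold, and cap the maximum chunk length so one long topic does not become one enormous chunk.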

Document-specific chunking

This respects the native structure of a format. Markdown is split on headings; HTML on semantic tags; code on functions and classes; a legal document on sections and sub-clauses. If your corpus is one consistent format — say, a UK fintech's policy documents all authored in the same template, or Indian statutes with a regular section hierarchy — this often beats generic splitting outright.
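For Markdown, the idea reduces to cutting at heading lines and keeping the heading alongside the text. A minimal sketch, assuming ATX-style headings (`#` to `######`). It deliberately ignores edge cases such as `#` lines inside fenced code blocks.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_markdown_by_heading(markdown: str) -> list[dict]:
    """Return one section per heading, each as {'heading': ..., 'text': ...}."""
    sections, title, body = [], None, []

    def flush():
        # Emit a section if there is a heading or any non-blank body text.
        if title is not None or any(line.strip() for line in body):
            sections.append({"heading": title, "text": "\n".join(body).strip()})

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            title, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return sections
```

Keeping the heading with each section lets you prepend it to the chunk text or store it as metadata, so a chunk that says "Refunds take 3 days" still carries the context of the section it came from.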

Parent-child chunking

You index small chunks for precise matching but, once a small chunk is retrieved, you hand the model its larger parent chunk for context. The retriever gets the precision of a tight embedding; the model gets the surrounding paragraph it needs to reason. It adds a layer of bookkeeping but resolves the precision-versus-context tension elegantly.
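The bookkeeping is lighter than it sounds. Here is a sketch under simple assumptions: parents are pre-split passages, children are fixed windows of whitespace-separated words, and the vector search itself is out of scope. Both function names are hypothetical.

```python
def build_parent_child_index(parents: list[str], child_size: int) -> list[tuple[str, int]]:
    """Return (child_text, parent_index) pairs; only the child text gets embedded."""
    index = []
    for parent_id, parent in enumerate(parents):
        tokens = parent.split()
        for start in range(0, len(tokens), child_size):
            child = " ".join(tokens[start:start + child_size])
            index.append((child, parent_id))
    return index


def expand_to_parents(hits: list[tuple[str, int]], parents: list[str]) -> list[str]:
    """Map retrieved children back to their parents, de-duplicated, best hit first."""
    seen, context = set(), []
    for _child_text, parent_id in hits:
        if parent_id not in seen:
            seen.add(parent_id)
            context.append(parents[parent_id])
    return context
```

The de-duplication matters: two children of the same parent often both match, and you do not want the same paragraph in the prompt twice.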

| Strategy            | How it splits                                | Best for                                             | Cost                              |
|---------------------|----------------------------------------------|------------------------------------------------------|-----------------------------------|
| Fixed-size          | Every N tokens, ignoring structure           | Quick prototypes, uniform short text                 | Lowest                            |
| Recursive character | Largest natural boundary first, then smaller | General-purpose production default                   | Low                               |
| Document-specific   | Native structure: headings, tags, clauses    | Single consistent format (Markdown, code, statutes)  | Low to medium                     |
| Semantic            | Embedding-similarity drop between sentences  | Uneven content with abrupt topic shifts              | High (extra embedding pass)       |
| Parent-child        | Small chunks indexed, larger parent returned | Precision matching plus context for the model        | Medium (extra storage and lookup) |

The benchmark-validated default

If you want one starting point that works across most corpora, here it is: recursive character splitting at 512 tokens per chunk, with 50-100 tokens of overlap. Per 2026 benchmark testing — the largest evaluation of chunking strategies run on real-world documents this year — that configuration scored around 69% retrieval accuracy, ahead of fixed-size cutting and competitive with far more expensive approaches.

512 tokens is a deliberate middle ground. It is large enough to carry a complete idea with its immediate context, and small enough that a single relevant passage is not drowned by surrounding noise in the embedding. It also fits within the input limits of most mainstream embedding models, though some smaller models cap input at 256 or 512 tokens, so check your model's maximum sequence length before you commit to a size, or your chunks may be truncated.

This is a default, not a verdict. Treat 512 tokens as the number you start from and then measure against your own corpus. Dense reference material — a UK fintech's regulatory handbook, where each paragraph is self-contained — can often go smaller, to 256 tokens, and gain precision. Narrative or argumentative text, such as the reasoning in an Indian appellate judgment, usually needs more room, and 768 to 1,024 tokens preserves the thread better. The benchmark gives you a credible place to begin; your retrieval metrics tell you where to land.

Overlap strategy

Overlap means each chunk repeats the last slice of the previous one. It exists to solve a specific failure: a sentence that answers the query sits exactly on a chunk boundary, with half its meaning on each side, so neither chunk embeds cleanly enough to be retrieved. With overlap, any span no longer than the overlap itself appears whole in at least one chunk, which covers most boundary-straddling sentences.

A sensible starting point is 10-20% of chunk size, which works out to roughly 50-100 tokens of overlap for a 512-token chunk. That is enough to bridge boundaries without bloating your index. The payoff is real: overlap has been shown to improve recall in dense retrievers by up to around 14.5%, simply by ensuring that boundary-spanning answers are no longer lost.
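Mechanically, overlap is a sliding window whose step is the chunk size minus the overlap. A minimal sketch, assuming the text is already tokenized into a list. With the default above, `size=512` and `overlap=64` would give a step of 448 tokens.

```python
def chunk_with_overlap(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Slide a window of `size` tokens forward by `size - overlap` each time."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks
```

The step also tells you the index cost: at 20% overlap you store roughly 1.25 times as many chunks as with none, and at 50% you store about twice as many.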

Watch out

Do not over-correct. Pushing overlap past 30% inflates your vector store, slows ingestion, and returns near-duplicate chunks in the top results — which crowds out genuinely distinct passages and can lower answer quality. More overlap is not more recall; 10-20% is the band that pays off.

Semantic chunking — when it is worth it

Semantic chunking is the most talked-about strategy and the most over-applied. It detects topic boundaries using embedding similarity instead of fixed token counts: where two consecutive sentences are semantically far apart, it places a cut. The result is chunks that each hold one coherent idea, regardless of how long that idea runs.

It earns its cost when your documents have abrupt, uneven structure. A research report that swaps between methodology, results, and policy implications; a bundle of regulations stitched from different sources; a customer-call transcript that jumps between unrelated complaints — in all of these, a fixed 512-token window will routinely staple two unrelated ideas into one chunk and dilute both. Semantic boundaries fix that.

It is not worth it when your content is already uniform and well structured. If every document follows the same template and every section is a self-contained idea — much of a UK fintech's internal documentation, or product reference pages — recursive or document-specific splitting captures the same boundaries for free. In that case, semantic chunking adds an embedding pass and slower ingestion for an accuracy gain that is often too small to measure. Run the comparison on a held-out query set before committing; if recursive splitting is within a couple of points, keep the cheaper option.

Beyond chunking — the next three layers

Chunking gets you most of the way, but once it is tuned, three further layers recover the accuracy that chunk-size tinkering no longer will.

Metadata filtering. Attach structured fields to every chunk at ingestion — document type, date, jurisdiction, product line, author. Then filter the candidate set before the vector search runs. A query about Indian data-protection obligations should never be compared against UK-only chunks; a filter removes them outright and sharpens everything that follows.
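In code, the filter is just an equality check on stored fields, applied before any similarity scoring. This sketch assumes each chunk is a dict with a `metadata` dict, and the field names are illustrative. Most vector stores expose their own filter syntax that pushes this work into the index, so treat this as the logic, not the API.

```python
def filter_by_metadata(chunks: list[dict], **required) -> list[dict]:
    """Keep only chunks whose metadata matches every required field exactly."""
    return [
        chunk
        for chunk in chunks
        if all(chunk["metadata"].get(key) == value for key, value in required.items())
    ]
```

For example, `filter_by_metadata(chunks, jurisdiction="IN", doc_type="statute")` would hand only Indian statutes to the vector search.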

Hybrid retrieval. Vector search captures meaning but misses exact terms — error codes, statute numbers, product SKUs, named entities. Keyword search (BM25) catches those precisely but misses paraphrase. Running both and merging the results gives you the strengths of each; this is now the standard production setup rather than an optimisation. Your choice of vector store matters here too — our comparison of Pinecone, Qdrant, pgvector and Turbopuffer covers which engines make hybrid search straightforward.
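One common way to merge the two result lists is reciprocal rank fusion (RRF), which needs only the rank positions and not the raw scores. The article does not prescribe a merge method, so treat this as one reasonable option. The constant `k=60` is the conventional default, not a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of chunk IDs; each list contributes 1 / (k + rank) per item."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))
```

Because it works on ranks alone, RRF sidesteps the problem that BM25 scores and cosine similarities live on different scales. A chunk that both retrievers rank highly rises to the top.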

Cross-encoder reranking. The initial vector search is fast but approximate. A cross-encoder reads the query and each candidate chunk together and scores their genuine relevance, then reorders the top results. Retrieve 20 candidates cheaply, rerank them, keep the best 5. It adds latency, so apply it only to the shortlist — but it is often the single biggest accuracy gain after chunking. If you are wiring this into an agent loop, our overview of agent frameworks for 2026 shows where retrieval sits in the wider orchestration.
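The retrieve-then-rerank shape looks like this. The `score_pair` callable stands in for a real cross-encoder model, and `toy_overlap_score` is a deliberately crude placeholder used only so the sketch runs. Both names are hypothetical.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score_pair: Callable[[str, str], float],
    keep: int = 5,
) -> list[str]:
    """Score every (query, candidate) pair together and keep the best `keep`."""
    scored = [(score_pair(query, text), position, text) for position, text in enumerate(candidates)]
    # Highest score first; ties keep the original retrieval order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [text for _score, _position, text in scored[:keep]]


def toy_overlap_score(query: str, chunk: str) -> float:
    """Placeholder scorer: fraction of query words that appear in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words) / len(query_words) if query_words else 0.0
```

In a real pipeline you would pass the 20 candidates from the first-stage search and swap `toy_overlap_score` for a call into your cross-encoder. The shortlist size is what keeps the added latency bounded.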

Match the strategy to your document, measure, iterate

There is no universally best chunking strategy — there is only the strategy that fits your documents, proven against your own queries. The teams that ship reliable RAG are not the ones that picked the cleverest method; they are the ones that built a measurement loop and ran it.

Here is the checklist to work through:

  • Build an evaluation set first. Collect 50-100 real queries with known-correct source passages. Without this you are guessing, and you cannot tell an improvement from a regression.
  • Start with the validated default. Recursive character splitting at 512 tokens, 50-100 tokens of overlap. Measure retrieval accuracy before changing anything.
  • Match the strategy to the format. One consistent format — use document-specific splitting. Uneven, topic-shifting content — test semantic chunking against the default and keep it only if it clearly wins.
  • Tune chunk size to content density. Dense reference material can go smaller; narrative and argumentative text needs more room. Let the metrics decide, not intuition.
  • Add the next layers in order. Metadata filtering, then hybrid retrieval, then cross-encoder reranking — measuring after each so you know what actually moved the number.
  • Re-measure when the corpus changes. New document types or formats can shift the best configuration. Treat chunking as something you revisit, not set once.
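The measurement loop behind that checklist can start very small. Here is a sketch of hit rate at k: the share of evaluation queries whose known-correct chunk appears in the top k results. It assumes `retrieve` is your own function returning ranked chunk IDs, and that the evaluation set is a list of (query, gold chunk ID) pairs. Both are assumptions about your setup.

```python
from typing import Callable


def hit_rate_at_k(
    eval_set: list[tuple[str, str]],
    retrieve: Callable[[str], list[str]],
    k: int = 5,
) -> float:
    """Fraction of queries whose gold chunk ID appears in the top-k retrieved IDs."""
    if not eval_set:
        return 0.0
    hits = sum(1 for query, gold_id in eval_set if gold_id in retrieve(query)[:k])
    return hits / len(eval_set)
```

Run it once per configuration (chunk size, overlap, strategy, each added layer) against the same evaluation set, and change one thing at a time so you can attribute each movement in the number.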

The builders worth learning from here are not chasing the newest model. They are the ones who treat retrieval as an engineering problem with a feedback loop — and who fixed the 73% before touching the 27%.