- Which chunking strategy you use is the single highest-leverage decision in RAG retrieval quality: it sets the ceiling on everything downstream.
- Three production strategies compared with benchmark data: fixed chunking (the reliable baseline), semantic chunking (best for narrative text), and hierarchical chunking (reported at up to three to five times F1 improvement on structured documents).
- The "measure before you switch" rule: always run your golden eval set against each chunking strategy before deploying a change to production.
Why chunking is the highest-leverage RAG decision
Most RAG failures happen at retrieval, not at generation. The language model at the end of the pipeline can only reason from what retrieval hands it — if the right passage never reaches the context window, no amount of prompting, model size, or generation cleverness will produce a correct answer. And the single biggest determinant of retrieval quality is not which vector database you use, or whether you have a re-ranker, or how large your embedding model is. It is how you split your documents into chunks.
A chunk is the unit you embed and retrieve. Its boundaries determine what context lives together in a single vector, and those boundaries determine what the retrieval step can possibly surface. Chunk too coarsely and the embedding averages over dozens of ideas, representing none of them sharply — precise queries find the right document section but drag in three irrelevant paragraphs along with the useful sentence. Chunk too finely and precise matches become fragmented — the exact sentence the user needs arrives without the surrounding context that makes it actionable, and the model hallucinates the missing framing.
The uncomfortable truth is that a bad chunking strategy cannot be compensated for by a better embedding model. If a chunk boundary slices through the middle of a numbered clause, a causal argument, or a table row, the resulting chunks are incoherent at the embedding level. Upgrading from text-embedding-3-small to a larger model embeds that incoherence more precisely, but the incoherence is still there. The same applies to re-rankers: they reorder the candidates retrieval surfaces, but if the right chunk was split into two and neither half is useful alone, reordering a better half-answer above a worse one is not a win. Fix the split boundary and you remove the problem entirely.
This is why chunking strategy deserves the same systematic measurement as any other pipeline decision. The three strategies below represent the current production options as of mid-2026. Each has a clear best-fit document type, a cost profile, and a measurable F1 signature. Knowing which one fits your corpus — and measuring the answer with a golden eval set — is how you make the decision rationally rather than by copying a default.
Strategy 1 — Fixed chunking
Fixed chunking is the simplest and most widely deployed strategy. Split every document into windows of N tokens, advance by a stride that leaves M% of the previous window as overlap, and repeat. The overlap — typically 10 to 20 per cent — prevents splitting mid-thought: a sentence that begins at the end of one window and finishes at the start of the next appears in both, so at least one chunk holds it whole.
The recommended baseline is 400 to 600 tokens with 15% overlap. This figure is not a guess: it is the range that consistently holds up across benchmarks on homogeneous document corpora — wiki articles, news pieces, support tickets with consistent length — where chunks contain roughly one complete idea. Below 400 tokens, factual lookups sharpen but context bleeds away; above 600 tokens, coarse matching begins to dilute the embedding signal. The 15% overlap is a reasonable midpoint between preventing split-boundary losses and not wasting storage on duplicate content.
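The window-and-stride arithmetic described above can be sketched in a few lines. This is a minimal illustration, assuming the document has already been tokenised into a list; the function name `fixed_chunks` and the stand-in token list are hypothetical and not part of any library.

```python
def fixed_chunks(tokens, size=500, overlap=75):
    """Split a token list into windows of `size` tokens that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    stride = size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Stand-in for real tokenizer output (assumption for the demo).
tokens = [f"t{i}" for i in range(1200)]
chunks = fixed_chunks(tokens, size=500, overlap=75)
print([len(c) for c in chunks])
```

With a 500-token window and a 75-token overlap the stride is 425 tokens, so the last 75 tokens of each chunk reappear as the first 75 tokens of the next.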
Fixed chunking is the right default for any pipeline that does not yet have a golden eval set to justify a more complex approach. It is cheap at ingestion time (pure text splitting, no model calls), deterministic, and easy to reason about when a retrieval result looks wrong. Where it fails is equally predictable: structured documents (PDFs with section hierarchies, legal contracts, technical manuals, financial circulars) are chopped at arbitrary token counts regardless of their internal structure. A 512-token window that cuts through the middle of a numbered clause, a defined term, or a table row produces chunks that embed ambiguously and retrieve poorly. That failure mode is the motivation for hierarchical chunking.
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Recommended baseline: 500 tokens, 15% overlap (75 tokens).
# With length_function=len these sizes are measured in characters;
# they only mean tokens once length_function counts tokens.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,        # target size per chunk, in length_function units
    chunk_overlap=75,      # ~15% overlap to prevent split-boundary loss
    length_function=len,   # swap for a token-counting function in production
    separators=["\n\n", "\n", ". ", " ", ""],  # respect natural boundaries first
)

with open("document.txt") as f:
    text = f.read()

chunks = splitter.create_documents([text])
print(f"{len(chunks)} chunks created")

# Attach source metadata before embedding: essential for citations and filtering
for i, chunk in enumerate(chunks):
    chunk.metadata.update({
        "source": "document.txt",
        "chunk_index": i,
        "strategy": "fixed",
    })
One practical refinement: use RecursiveCharacterTextSplitter rather than a plain CharacterTextSplitter. The recursive variant tries a priority list of separators — paragraph breaks, then line breaks, then sentence-ending punctuation, then spaces — before falling back to a hard character cut. This means the overwhelming majority of splits land on natural boundaries even though the strategy is still "fixed size". A plain splitter cuts at exactly N characters regardless of what is there, producing half-sentences more often than it should.
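Use a token-counting function rather than len for length_function in production. Character counts and token counts diverge significantly for code, URLs, and non-ASCII text. Note that length_function must return an integer, so pass a function that returns the number of tokens, for example lambda text: len(enc.encode(text)) with enc = tiktoken.encoding_for_model("text-embedding-3-small"), rather than the encode method itself, which returns a list of token ids. Counting tokens keeps your chunk sizes accurate relative to the embedding model's actual context window.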
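A minimal sketch of such a length function follows. To keep it runnable without extra packages it uses whitespace splitting as a stand-in tokenizer, which is an assumption for illustration only; `make_length_function` is a hypothetical helper name, and in production you would hand it a real tokenizer's encode function.

```python
from typing import Callable, Sequence


def make_length_function(encode: Callable[[str], Sequence]) -> Callable[[str], int]:
    """Wrap a tokenizer's encode function so it returns a token count."""
    def length(text: str) -> int:
        return len(encode(text))
    return length


# Stand-in tokenizer so the sketch runs on its own (assumption).
# In production, pass a real tokenizer's encode function instead.
whitespace_encode = str.split

token_len = make_length_function(whitespace_encode)

url = "https://example.com/a/very/long/path?query=1"
print(len(url), token_len(url))  # character count vs. stand-in token count
```

The resulting `token_len` has the shape a splitter's length_function expects: one string in, one integer out.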
Strategy 2 — Semantic chunking
Semantic chunking replaces the fixed token window with an embedding-driven boundary detector. Rather than splitting at every N tokens, it embeds each sentence, computes cosine similarity between adjacent sentences, and places a split boundary wherever similarity drops below a threshold. The intuition is that a topic shift — the moment an argument concludes and a new one begins — shows up as a dip in the embedding similarity curve between consecutive sentences. Splits track meaning boundaries rather than counting characters.
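The boundary detection described above can be sketched without any model at all. The example below is a rough illustration under stated assumptions: `embed` is a toy bag-of-words stand-in for a real sentence-embedding model, and the 0.2 threshold is arbitrary. Only the control flow (embed each sentence, compare neighbours, split on a similarity dip) mirrors the technique.

```python
import math
import re
from collections import Counter


def embed(sentence):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences, threshold=0.2):
    """Start a new chunk wherever adjacent-sentence similarity drops below threshold."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks = [[sentences[0]]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            chunks.append([sentence])    # topic shift: open a new chunk
        else:
            chunks[-1].append(sentence)  # same topic: extend the current chunk
    return [" ".join(c) for c in chunks]


sentences = [
    "The refund policy covers all orders.",
    "Refund requests for orders must arrive within thirty days.",
    "Our office dog is called Biscuit.",
    "Biscuit the dog sleeps under the desk.",
]
for chunk in semantic_chunks(sentences):
    print(chunk)
```

The two refund sentences share vocabulary and stay together, while the similarity between the second and third sentences is zero, so the split lands there.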
This approach works best on narrative documents: long-form prose, research papers, support tickets with complex explanations, emails that weave between topics. These are documents where a fixed window frequently cuts through a single continuous argument, and where the conceptual boundaries the author intended are real but not marked by explicit headings. Semantic chunking finds those unmarked boundaries and splits there instead.
The tradeoff is ingestion cost. Every sentence requires an embedding pass at index time — typically using a lightweight model such as sentence-transformers/all-MiniLM-L6-v2 — which is meaningfully slower and more expensive than pure text splitting. For a large corpus, this can make initial ingestion two to four times slower than fixed chunking. For most production systems, this cost is paid once and then amortised across all queries, so it is usually acceptable; the exception is corpora that change continuously and require near-real-time re-indexing, where the ingestion overhead matters every cycle.
Semantic chunking also offers little benefit on already-structured documents. A legal contract with numbered clauses does not need an embedding model to find its natural splits — the clause numbers are right there. Running semantic chunking on a document whose structure is already explicit wastes ingestion compute without improving boundary quality. For those documents, hierarchical chunking is the correct choice.
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

# SemanticChunker from langchain_experimental uses embedding similarity
# to detect topic shifts between adjacent sentences.
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",  # split at the sharpest similarity drops
    breakpoint_threshold_amount=95,          # top 5% of drops become boundaries
)

with open("document.txt") as f:
    text = f.read()

chunks = chunker.create_documents([text])
print(f"{len(chunks)} semantic chunks created")

# Add metadata as with fixed chunking
for i, chunk in enumerate(chunks):
    chunk.metadata.update({
        "source": "document.txt",
        "chunk_index": i,
        "strategy": "semantic",
    })
If you prefer a lighter-weight embedding model for the boundary detection — reserving your production embedding model for retrieval — sentence-transformers/all-MiniLM-L6-v2 is a fast, open-source option that runs on CPU. It does not match the boundary precision of a larger model, but it handles the topic-shift detection task well at a fraction of the cost. Run both on a sample of your corpus and compare chunk quality before committing the ingestion budget to a larger model.
The breakpoint_threshold_type parameter controls the split sensitivity. The "percentile" option splits at the sharpest similarity drops across the whole document, which adapts to the specific document's similarity distribution. The "standard_deviation" and "interquartile" options derive the threshold from the spread of those same sentence-to-sentence distances (their standard deviation or interquartile range) instead of a rank cut-off. For varied corpora (a mix of long and short documents, dense and sparse prose) the percentile approach usually produces more consistent chunk sizes.
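To make the difference between threshold types concrete, here is a small sketch that computes a percentile cut-off and a standard-deviation cut-off over one document's adjacent-sentence distances. It illustrates the idea only and is not LangChain's code; the function names and the distance values are made up for the demo.

```python
import statistics


def percentile_threshold(distances, pct=95):
    """Value below which pct% of the distances fall (linear interpolation)."""
    if not distances:
        raise ValueError("need at least one distance")
    ordered = sorted(distances)
    rank = (len(ordered) - 1) * pct / 100
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def stdev_threshold(distances, n_std=3.0):
    """Mean distance plus n_std standard deviations."""
    return statistics.mean(distances) + n_std * statistics.pstdev(distances)


def breakpoints(distances, threshold):
    """Indices of the sentence gaps whose distance exceeds the threshold."""
    return [i for i, d in enumerate(distances) if d > threshold]


# Hypothetical cosine distances between 21 consecutive sentences.
distances = [0.10, 0.12, 0.08, 0.11, 0.09, 0.75, 0.10, 0.13, 0.07, 0.12,
             0.11, 0.09, 0.10, 0.08, 0.12, 0.11, 0.10, 0.09, 0.13, 0.68]

print(breakpoints(distances, percentile_threshold(distances, 95)))
print(breakpoints(distances, percentile_threshold(distances, 90)))
print(breakpoints(distances, stdev_threshold(distances, n_std=2.0)))
```

Lowering the percentile admits more boundaries, which is the main lever for producing smaller chunks on a given document.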
Strategy 3 — Hierarchical chunking
Hierarchical chunking stores two granularities of chunk from the same document simultaneously. Parent chunks — typically 1024 tokens — capture a wide context window that preserves surrounding argument and structure. Child chunks — typically 256 tokens — provide the precise, focused unit used for embedding and retrieval. When a child chunk matches a query, the pipeline fetches its parent for the generation step. The LLM receives broad context; the embedding model operates on a tight, specific unit.
This is the approach reported at up to 3 to 5 times F1 improvement over fixed chunking on structured documents: PDFs with section hierarchies, legal contracts, technical manuals, API reference documentation. The multiplier depends heavily on how badly the fixed baseline fragments the structure; the comparison table and the contract-corpus example later in this guide show more modest gains (for example, 0.61 to 0.84). The improvement is not magic. Fixed chunking on these documents cuts through the hierarchical structure at arbitrary points, creating chunks that embed ambiguously. Hierarchical chunking preserves the structure explicitly: child chunks align with leaf-level sections and paragraphs; parent chunks correspond to the containing section. The embedding space reflects the document's actual architecture, and retrieval benefits accordingly.
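Stripped of any library, the parent-child idea is a lookup from each small chunk back to the larger chunk that contains it. The sketch below assumes a pre-tokenised document and uses plain size-based splits; `build_hierarchy` and the simulated search hit are hypothetical and only illustrate the data structure.

```python
def split(tokens, size):
    """Cut a token list into consecutive, non-overlapping pieces of `size`."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def build_hierarchy(tokens, parent_size=1024, child_size=256):
    """Return (parents, children); each child records the id of its parent."""
    parents = {}
    children = []
    for parent_id, parent_tokens in enumerate(split(tokens, parent_size)):
        parents[parent_id] = parent_tokens
        for child_tokens in split(parent_tokens, child_size):
            children.append({"parent_id": parent_id, "tokens": child_tokens})
    return parents, children


tokens = list(range(2048))  # stand-in for a tokenised document
parents, children = build_hierarchy(tokens)

# Pretend vector search matched the child that contains token 1300.
hit = next(c for c in children if 1300 in c["tokens"])
context_for_llm = parents[hit["parent_id"]]

print(len(parents), len(children), hit["parent_id"], len(context_for_llm))
```

Only the children would be embedded and indexed; the parents live in a plain key-value store and are fetched by id after retrieval.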
The implementation in LlamaIndex uses HierarchicalNodeParser, which splits each document recursively at the configured chunk sizes and creates node objects at multiple granularities with explicit parent-child links. At query time, the pipeline retrieves child nodes and then resolves them to their parents before passing context to the LLM.
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.core.node_parser import (
    HierarchicalNodeParser,
    get_leaf_nodes,
)
from llama_index.core.retrievers import AutoMergingRetriever
from llama_index.core.storage.docstore import SimpleDocumentStore

# 1. Load documents
documents = SimpleDirectoryReader("./docs/").load_data()

# 2. Parse into hierarchical nodes: chunk_sizes defines the hierarchy levels
#    1024 = parent, 256 = child (leaf nodes used for retrieval)
node_parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[1024, 256]
)
nodes = node_parser.get_nodes_from_documents(documents)

# 3. Separate leaf nodes (child chunks) from all nodes
leaf_nodes = get_leaf_nodes(nodes)
print(f"Total nodes: {len(nodes)} | Leaf nodes (for retrieval): {len(leaf_nodes)}")

# 4. Store ALL nodes in docstore (parent resolution needs them at query time)
docstore = SimpleDocumentStore()
docstore.add_documents(nodes)

# 5. Index only the leaf nodes: retrieval operates on child chunks
storage_context = StorageContext.from_defaults(docstore=docstore)
index = VectorStoreIndex(leaf_nodes, storage_context=storage_context)

# 6. Query: AutoMergingRetriever fetches children, returns merged parent context
base_retriever = index.as_retriever(similarity_top_k=12)
retriever = AutoMergingRetriever(
    base_retriever,
    storage_context,
    verbose=True,
)
nodes_result = retriever.retrieve("What are the termination clauses?")
The AutoMergingRetriever handles the parent resolution automatically. When enough child chunks from the same parent are retrieved (more than a configurable fraction of that parent's children, set by simple_ratio_thresh), indicating that several sections of the parent are relevant to the query, the retriever merges them back into the parent and returns the full parent context rather than fragmented children. Children that do not reach the threshold are returned as they are. This is especially powerful for structured documents where a query might touch several sub-clauses of the same section.
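The merge decision can be pictured as a simple ratio test per parent. The following is a simplified sketch of that logic, not LlamaIndex's implementation: `auto_merge`, the identifiers, and the 0.5 ratio are assumptions chosen for illustration.

```python
from collections import defaultdict


def auto_merge(retrieved_children, children_per_parent, ratio_threshold=0.5):
    """Replace children with their parent when enough siblings were retrieved."""
    by_parent = defaultdict(list)
    for child_id, parent_id in retrieved_children:
        by_parent[parent_id].append(child_id)

    results = []
    for parent_id, child_ids in by_parent.items():
        ratio = len(child_ids) / children_per_parent[parent_id]
        if ratio > ratio_threshold:
            results.append(("parent", parent_id))  # return the whole parent once
        else:
            results.extend(("child", cid) for cid in child_ids)
    return results


# Hypothetical retrieval result: (child id, parent id) pairs.
retrieved = [("c1", "A"), ("c2", "A"), ("c3", "A"), ("c9", "B")]
children_per_parent = {"A": 4, "B": 4}

merged = auto_merge(retrieved, children_per_parent)
print(merged)
```

Three of parent A's four children were retrieved, so A is returned whole; parent B contributed a single child, which is passed through unchanged.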
For documents that lack explicit section headers — where the hierarchy must be inferred from layout rather than tagged structure — combine hierarchical chunking with a document parser such as Unstructured.io, which extracts headings, tables, and list items before chunking begins. Parsing at the element level and then applying the parent-child split on those elements produces a cleaner hierarchy than applying the splitter to raw text.
Build your golden eval set from real user queries before you touch your chunking strategy. Running RAGAS on 50 queries takes 10 minutes and will save you weeks of debugging retrieval failures. Without a baseline measurement, you cannot tell whether a strategy change was an improvement or a regression.
Benchmark comparison
The following table summarises the production trade-offs across all three strategies. F1 figures are approximate and drawn from benchmarks on realistic corpora; your numbers will vary depending on document type, query distribution, and embedding model. The table is intended as a decision aid, not a guarantee — run your own eval set to confirm.
| Strategy | Best document type | F1 — factual lookup | F1 — summarisation | Ingestion cost | Query latency |
|---|---|---|---|---|---|
| Fixed (400–600 tok, 15% overlap) | Homogeneous docs: wiki, news, support tickets | 0.62–0.71 | 0.55–0.65 | Low (text split only) | Low |
| Semantic (similarity-threshold splits) | Narrative text: emails, long-form prose, research papers | 0.68–0.76 | 0.67–0.78 | Medium (embed at ingestion) | Low (no extra query cost) |
| Hierarchical (1024 parent / 256 child) | Structured docs: PDFs with headers, legal contracts, manuals | 0.81–0.89 | 0.74–0.83 | Low–medium (two-level split) | Low–medium (parent fetch) |
Two patterns stand out. First, hierarchical chunking's advantage over fixed is largest on factual lookups against structured documents — exactly the query type that most enterprise RAG deployments need to handle reliably. A user asking "what is the termination notice period in clause 14.2?" is a factual lookup against a structured document. Fixed chunking that splits clause 14.2 across two windows will fail this query with high probability; hierarchical chunking that preserves the clause as a child chunk will surface it cleanly.
Second, semantic chunking's advantage is most visible on summarisation queries against narrative text — where the question requires synthesising several related arguments rather than locating a single fact. A support ticket that weaves a complaint through five paragraphs before reaching the actual error message benefits from semantic splitting that keeps related sentences together, rather than a fixed window that may separate the error message from its context.
The practical rule of thumb: start with fixed chunking as your baseline. If your evaluation shows poor F1 on structured documents, switch to hierarchical. If it shows poor F1 on narrative documents, try semantic. Run RAGAS on each and let the numbers decide. Our guide to hybrid retrieval for production RAG covers the retrieval and re-ranking steps that sit downstream of whichever chunking strategy you choose.
Embedding model selection
The embedding model choice interacts with your chunking strategy in two ways. First, the model's effective context window limits how large a chunk can be before the embedding starts to degrade — most production embedding models handle 512 to 8192 tokens well, but quality drops if you habitually push to the limit. Second, the model's training domain affects how well it handles your specific vocabulary — a model trained on web text may embed medical or legal terminology less precisely than one fine-tuned on domain data.
The three models most commonly benchmarked in production RAG pipelines as of mid-2026 are:
| Model | Dimensions | Context window | Licence | Best suited to |
|---|---|---|---|---|
| text-embedding-3-small (OpenAI) | 1536 (truncatable) | 8191 tokens | Commercial API | General English corpora; fast and cheap at scale |
| bge-large-en-v1.5 (BAAI) | 1024 | 512 tokens | MIT (open source) | Self-hosted pipelines; strong MTEB scores; free at inference |
| cohere-embed-v3 (Cohere) | 1024 | 512 tokens | Commercial API | Multilingual corpora; built-in compression support |
The critical rule is: benchmark embedding models on your domain data, not on leaderboard rankings. MTEB and similar benchmarks measure average performance across a wide range of tasks and domains. Your corpus is one specific domain, and the model that ranks highest on average may not be the best for your particular vocabulary, query distribution, or document structure. The practical approach is to run your golden eval set — the same 50 to 100 examples you use for chunking evaluation — against each candidate embedding model and compare context recall and faithfulness. This takes an afternoon and is more reliable than any leaderboard.
One technique worth knowing for large-scale deployments is Matryoshka embeddings. Of the models above, text-embedding-3-small supports this through its dimensions parameter; bge-large-en-v1.5 was not trained for it, so its vectors should not be truncated this way. Matryoshka-trained models concentrate the most important information in the leading components, so the first 256 or 768 values of a 1536-dimension vector are themselves usable embeddings once re-normalised. At query time, you can truncate to a lower dimension for approximate nearest-neighbour search (much faster and cheaper) and only expand to the full dimension for re-ranking the top candidates. This pattern can reduce vector search latency by 40 to 60 per cent with minimal quality loss, which matters when you are serving hundreds of simultaneous queries. For further guidance on embedding cost trade-offs in the broader context of LLM infrastructure, our guide to LLM cost optimisation covers the full picture.
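The truncate-then-rescore pattern can be sketched with plain lists. This is an illustration under stated assumptions: the four-dimensional vectors are toy data, `two_stage_search` is a hypothetical helper, and a real deployment would precompute the truncated vectors and store them in an approximate nearest-neighbour index rather than truncating on every query.

```python
import math


def truncate_and_normalise(vector, dims):
    """Keep the first `dims` components and rescale to unit length."""
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def two_stage_search(query, corpus, coarse_dims, shortlist_size, top_k):
    """Shortlist with truncated vectors, then re-score the shortlist at full size."""
    q_small = truncate_and_normalise(query, coarse_dims)
    shortlist = sorted(
        corpus,
        key=lambda doc_id: dot(q_small, truncate_and_normalise(corpus[doc_id], coarse_dims)),
        reverse=True,
    )[:shortlist_size]
    q_full = truncate_and_normalise(query, len(query))
    return sorted(
        shortlist,
        key=lambda doc_id: dot(q_full, truncate_and_normalise(corpus[doc_id], len(query))),
        reverse=True,
    )[:top_k]


# Toy 4-dimensional "embeddings" (assumption for the demo).
corpus = {
    "a": [0.9, 0.1, 0.0, 0.4],
    "b": [0.8, 0.2, 0.5, 0.1],
    "c": [0.0, 1.0, 0.2, 0.0],
}
query = [1.0, 0.0, 0.0, 0.5]

print(two_stage_search(query, corpus, coarse_dims=2, shortlist_size=2, top_k=1))
```

Re-normalising after truncation matters: a truncated vector is no longer unit length, so dot products between truncated vectors are not comparable until each is rescaled.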
The production reference architecture
The mature production RAG pipeline as of mid-2026 looks like this:
Documents (PDF, DOCX, HTML, plain text)
│
▼
[ Unstructured.io ] ← parse: extract text, tables, headings, metadata
│
▼
[ Chunking strategy ] ← fixed / semantic / hierarchical (your decision above)
│
▼
[ Embedding model ] ← text-embedding-3-small / bge-large-en-v1.5 / cohere-embed-v3
│
▼
[ Vector store ] ← pgvector (Postgres) / Qdrant / Pinecone
+ [ BM25 index ] ← Elasticsearch / OpenSearch / Postgres full-text
│
▼ (at query time)
[ Hybrid search ] ← dense + sparse in parallel
│
▼
[ RRF fusion ] ← Reciprocal Rank Fusion (k ≈ 60)
│
▼
[ Re-ranker ] ← Cohere rerank-v3 / cross-encoder (conditional)
│
▼
[ LLM ] ← generation with cited context
Tool choices at each stage follow predictable patterns. For parsing, Unstructured.io handles the widest range of input formats and extracts structural metadata (headings, table content, figure captions) that enables hierarchical chunking to work correctly. For chunking, LangChain's text splitters cover fixed and semantic strategies well; LlamaIndex's HierarchicalNodeParser is the best-maintained hierarchical implementation. For the vector store, the choice between pgvector, Qdrant, and Pinecone depends primarily on your existing infrastructure and scale: pgvector is the pragmatic default for teams already running Postgres and wanting to avoid a new service; Qdrant offers more sophisticated filtering and payload indexing for complex metadata queries; Pinecone is the fully managed option for teams who want zero operational overhead. Our news analysis of Pinecone, Qdrant, pgvector, and Turbopuffer covers the trade-offs in detail.
The re-ranking step — a cross-encoder that scores query-chunk pairs directly — is the most expensive per-query component. As covered in the hybrid retrieval guide, the production move is conditional re-ranking: skip it when the top fused candidate scores confidently, apply it only when retrieval looks uncertain. Cohere's rerank-v3 is the most widely used managed option; cross-encoder/ms-marco-MiniLM-L-6-v2 from Hugging Face is the standard self-hosted alternative.
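Reciprocal Rank Fusion and a conditional re-ranking gate are both short enough to sketch. The fusion formula below is the standard one with k = 60 as in the diagram; the gate is a hypothetical criterion (the margin between the top two fused scores) chosen for illustration, since the exact confidence test is a tuning decision for your own pipeline.

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: sum 1 / (k + rank) over every ranked list."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def needs_rerank(fused, min_margin=0.005):
    """Hypothetical gate: re-rank only when the top two fused scores are close."""
    if len(fused) < 2:
        return False
    return fused[0][1] - fused[1][1] < min_margin


dense = ["d3", "d1", "d7"]   # ids ranked by vector similarity
sparse = ["d3", "d7", "d9"]  # ids ranked by BM25
fused = rrf_fuse([dense, sparse])

print([doc_id for doc_id, _ in fused])
print("re-rank:", needs_rerank(fused))
```

Because fused scores are sums of small reciprocals, any margin threshold has to be calibrated against your own score distribution rather than copied from this sketch.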
Do not chunk at query time. Some early RAG implementations re-chunk the retrieved documents on every query before passing them to the LLM. This multiplies latency with no retrieval benefit — the chunk boundaries at query time are arbitrary and do not correspond to the embedding space your vector index was built on. Chunk at ingestion, index the chunks, and retrieve them directly. Re-chunking at query time is a performance anti-pattern with no upside.
How to evaluate your chunking strategy
The only way to know whether a chunking strategy is working is to measure it. The golden eval set approach is the production standard: assemble 50 to 100 (question, ground-truth answer) pairs drawn from real user queries or realistic synthetic queries that reflect your actual query distribution. These pairs must cover the range of query types your pipeline handles — factual lookups, summarisation requests, multi-hop questions — because a strategy that is excellent on one type may be poor on another.
Run each chunking strategy against this eval set using the RAGAS framework. The four metrics that matter are:
- Context Precision: of the chunks retrieved, what fraction are actually relevant to the query? A low score means too much noise is reaching the LLM — tighten chunking or retrieval.
- Context Recall: of the chunks needed to answer correctly, what fraction were retrieved? A low score means evidence is missing from the retrieved set — fix chunking boundaries or increase retrieval depth.
- Faithfulness: is the generated answer grounded in the retrieved context, or does the model add claims not present in the chunks? The most important metric for production safety.
- Answer Correctness: does the final answer match the reference? The end-to-end metric; diagnose failures with the three above.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    context_precision,
    context_recall,
    faithfulness,
    answer_correctness,
)

# Your golden set: 50–100 real or realistic queries with reference answers.
# The four lists below come from running your pipeline over that set.
golden = Dataset.from_dict({
    "question": questions,              # list[str]
    "answer": pipeline_answers,         # your RAG pipeline's generated answers
    "contexts": retrieved_chunks,       # list[list[str]]: what retrieval returned
    "ground_truth": reference_answers,  # known-correct reference answers
})

report = evaluate(
    golden,
    metrics=[
        context_precision,
        context_recall,
        faithfulness,
        answer_correctness,
    ],
)
print(report)

# Suggested minimum gates before switching strategies in production:
#   context_recall     >= 0.80  (the evidence must reach the LLM)
#   faithfulness       >= 0.88  (the model must not hallucinate beyond its sources)
#   answer_correctness >= 0.75  (end-to-end quality)
Context recall is the key diagnostic for chunking specifically. A low context recall score means the answer evidence is missing from the retrieved chunks — which almost always points to a boundary problem rather than a retrieval algorithm problem. If the relevant passage was split across two chunks and neither half is useful alone, no amount of BM25 tuning or RRF parameter adjustment will fix it. Context recall below 0.75 is a signal to revisit your chunking strategy before touching anything else downstream.
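If your golden set labels which chunks are needed for each question, you can compute a simplified, set-based version of the two context metrics directly. This sketch follows the plain-language definitions given above and is not how RAGAS computes its scores; the function names and chunk ids are hypothetical.

```python
def simple_context_precision(retrieved_ids, relevant_ids):
    """Fraction of retrieved chunks that are relevant."""
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for cid in retrieved_ids if cid in relevant_ids)
    return hits / len(retrieved_ids)


def simple_context_recall(retrieved_ids, relevant_ids):
    """Fraction of the needed chunks that were retrieved."""
    if not relevant_ids:
        return 1.0  # nothing was needed, so nothing is missing
    return len(set(retrieved_ids) & set(relevant_ids)) / len(relevant_ids)


retrieved = ["c4", "c9", "c12", "c30"]  # what retrieval returned
relevant = {"c4", "c5"}                 # what the answer actually needs

print(simple_context_precision(retrieved, relevant))
print(simple_context_recall(retrieved, relevant))
```

A low recall with the missing chunk sitting adjacent to a retrieved one (here c5 next to c4) is the typical signature of a boundary problem.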
Our guide to building an LLM evaluation suite with golden sets and judges covers the broader evaluation framework in depth, including how to construct golden sets from production logs and how to use LLM-as-judge approaches to scale annotation. The RAGAS metrics here sit within that larger framework and can be wired into CI to gate chunking changes automatically. For agent-based RAG pipelines where the retrieval step is embedded in a multi-step reasoning loop, the memory management patterns in our guide to agent memory management are relevant to how retrieved context is stored and accessed across turns.
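A CI gate over these metrics can be as small as a dictionary comparison. The sketch below reuses the suggested minimums from the evaluation code above; the candidate scores are invented, and `failed_gates` is a hypothetical helper that assumes you have already extracted per-metric averages from your evaluation report.

```python
# Minimum scores taken from the suggested gates in this guide.
GATES = {"context_recall": 0.80, "faithfulness": 0.88, "answer_correctness": 0.75}


def failed_gates(scores, gates=GATES):
    """Return the metrics that are missing or below their minimum."""
    return {
        name: (scores.get(name), minimum)
        for name, minimum in gates.items()
        if scores.get(name) is None or scores[name] < minimum
    }


# Hypothetical averages for a candidate chunking strategy.
candidate_scores = {"context_recall": 0.83, "faithfulness": 0.86, "answer_correctness": 0.78}

failures = failed_gates(candidate_scores)
if failures:
    print("Blocked:", failures)  # in CI, exit non-zero here to fail the job
else:
    print("All gates passed")
```

Treating a missing metric as a failure is a deliberate choice: a run that silently skipped a metric should not be able to pass the gate.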
"We were running fixed 512-token chunking on a corpus of 40,000 legal contracts and our factual F1 was sitting at 0.61. We switched to hierarchical chunking — 1024-token parents, 256-token children — and our factual F1 went to 0.84 in the same week, with no change to the embedding model, re-ranker, or LLM. The golden eval set told us immediately that context recall was the problem, and the switch confirmed it. I wish we had measured earlier rather than spending two months tuning the generation prompt."
The five-question audit before changing your strategy
Before committing to a chunking strategy switch in production, run through this diagnostic checklist. It will surface the most common misdiagnoses and save you from making an expensive infrastructure change that does not address the actual problem.
- What document types make up your corpus? Homogeneous and narrative-heavy corpora suit fixed or semantic chunking. Structured documents with clear hierarchies suit hierarchical chunking. If your corpus is mixed, you may need multiple strategies applied conditionally by document type.
- What query types dominate your traffic? Factual lookups benefit from smaller, precise child chunks. Summarisation queries benefit from wider context. If both types are common, hierarchical chunking's parent-child structure handles both better than either fixed or semantic alone.
- What is your current context recall score? If it is above 0.85, your chunk boundaries are not the problem — look at retrieval depth, fusion weighting, or re-ranking thresholds instead. If it is below 0.75, chunking is almost certainly the root cause.
- What is your latency budget? Semantic chunking adds ingestion cost but no query-time overhead. Hierarchical chunking adds a parent-fetch step at query time, typically 5 to 15ms depending on your store. If you are operating under strict query latency constraints, measure the overhead before committing.
- Do you have a golden eval set? If the answer is no, build it before changing anything. A strategy switch without a baseline measurement is a guess. A 50-query golden set takes two to three hours to assemble from production logs and gives you a reliable before-and-after signal for every subsequent change.
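For a mixed corpus, the audit above reduces to a small routing table plus a recall check. The sketch below encodes the document-type mapping and the 0.75 and 0.85 context recall thresholds from the checklist; the labels, function names, and return strings are hypothetical, and the fallback to fixed chunking for unknown types is an assumption.

```python
# Document-type labels are illustrative; use whatever your ingestion step emits.
STRATEGY_BY_DOC_TYPE = {
    "homogeneous": "fixed",
    "narrative": "semantic",
    "structured": "hierarchical",
}


def choose_strategy(doc_type):
    """Pick a chunking strategy per document, defaulting to the fixed baseline."""
    return STRATEGY_BY_DOC_TYPE.get(doc_type, "fixed")


def diagnose(context_recall, has_golden_set):
    """Turn the audit's recall thresholds into a next step."""
    if not has_golden_set:
        return "build a golden eval set first"
    if context_recall < 0.75:
        return "revisit chunking"
    if context_recall > 0.85:
        return "look downstream: retrieval depth, fusion weighting, re-ranking"
    return "inconclusive: compare strategies on the golden set"


print(choose_strategy("structured"))
print(diagnose(context_recall=0.70, has_golden_set=True))
```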
If you are building AI systems that implement these patterns — whether RAG pipelines, evaluation frameworks, or production LLM infrastructure — the AI Tech Connect Builder profiles are where that work gets visible to the people hiring and collaborating across India and the UK. Submitting your work to the community is how the directory grows and how builders get found.