Is hybrid retrieval really better than vector-only search?

For most production corpora, yes. Hybrid search combines dense vector search, which captures meaning and paraphrase, with sparse BM25, which captures exact keywords, product codes, error strings and proper nouns. As of June 2026 it is the single biggest quality jump over a naive vector-only pipeline and the default starting point for serious systems. Vector-only search quietly fails on the precise tokens a vector embedding tends to smooth over, and those are exactly the terms enterprise users search for.

What is the ideal chunk size for RAG?

There is no universal number, but 256 to 512 tokens is a common starting point. Larger chunks of around 1024 tokens carry more context per hit but make matching coarse and dilute the embedding; smaller chunks match precisely but can lose the surrounding context an answer needs. Chunking is the most under-estimated lever in a RAG pipeline, so the honest answer is to start around 256 to 512 and tune on your own corpus with an evaluation harness such as Ragas or DeepEval rather than trusting a default.

Do I always need a re-ranker in my RAG pipeline?

No. A cross-encoder re-ranker improves ordering but adds latency and cost on every query. A practical pattern is conditional re-ranking: only re-rank when the top fused candidate scores below a confidence threshold, which signals that retrieval is uncertain. On confident queries you skip the re-ranker entirely and save the compute. Many production systems start without a re-ranker, add one when evaluation shows ordering is the bottleneck, and then gate it behind a threshold to keep costs sensible.

How do I measure whether my RAG answers are faithful?

Use a retrieval-aware evaluation framework such as Ragas, DeepEval or Promptfoo over a golden set of 200 to 500 representative question-and-answer examples. The core metrics are context precision and context recall, which test the retriever, plus faithfulness, which checks that the generated answer is supported by the retrieved context rather than invented, and answer correctness against a known reference. Measuring faithfulness before you ship is what separates a demo from a production system, because it catches the case where retrieval is fine but generation hallucinates.

How often should I refresh or re-index my RAG corpus?

Match the refresh cadence to how fast the underlying content changes. Daily re-indexing suits dynamic content such as product catalogues and compliance documents, while near-real-time or hourly refresh suits support tickets and news. Static reference material can be re-indexed far less often. The failure mode to avoid is a stale index that confidently cites a policy or price that changed last week, so tie your refresh schedule to the document type rather than picking one global interval.

Hybrid Retrieval for Production RAG: BM25, Vectors and Re-ranking, Step by Step

What hybrid retrieval actually fixes

The first version of almost every RAG system looks the same: embed the documents, embed the question, retrieve the nearest vectors, stuff them into a prompt. It demos beautifully and then disappoints in production, and the reason is almost always retrieval rather than the model. Dense vector search is excellent at meaning — it understands that "how do I cancel my plan" and "stopping my subscription" are the same request — but it is quietly poor at the exact tokens enterprise users actually type. Product codes, error strings, an invoice number, a clause reference, the name of a specific NHS framework or a GST circular: these are the precise terms a vector embedding tends to smooth over, because embeddings are built to generalise, not to match characters.

Hybrid retrieval fixes this by running two retrievers in parallel and combining them. Dense vectors handle semantics; sparse BM25 keyword search handles exact lexical matches. As of June 2026, combining the two is the single biggest quality jump you can make over a naive pipeline, and it is the default shape for any system that has to survive real users rather than a curated demo. The intuition is that the two methods fail in different places, so their union covers far more ground than either alone — BM25 rescues the precise-token queries that vectors miss, and vectors rescue the paraphrased queries that keyword search misses.

It is worth being precise about what hybrid retrieval does and does not solve, because it is easy to over-claim. It will not make a weak generation model reason better, it will not rescue a corpus that simply lacks the answer, and it will not fix chunks that were split so badly the relevant text never sits together. What it reliably fixes is the recall gap at the very front of the pipeline — the queries where the right passage existed in your corpus but the retriever failed to surface it. Because every downstream step, including the language model itself, can only work with what retrieval hands it, closing that gap pays off more than almost any other single change. A faithful model answering from the wrong passage is still wrong; hybrid retrieval is how you raise the odds it gets the right passage in the first place.

The rest of this guide follows the mature production RAG shape end to end: corpus inventory and chunking, building the two retrievers, fusing them with Reciprocal Rank Fusion, re-ranking conditionally, measuring faithfulness on a golden set, and then the operational concerns — refresh cadence, citation UX and observability — that decide whether the thing keeps working a month after launch. If you want the broader landscape first, our overview of hybrid search RAG with BM25 and vector search in production and the wider RAG in production 2026 playbook both set the scene well. Everything below is meant to be implemented, not just read.

Pre-requisites: corpus inventory and chunking the right way

Before you write a line of retrieval code, do two unglamorous things, because they determine the ceiling on everything that follows. The first is a corpus inventory: an honest list of what you are actually indexing. How many documents, in what formats, how often do they change, who owns them, and which ones carry compliance weight. An Indian fintech indexing RBI circulars and its own product catalogue has a very different refresh profile from a UK NHS supplier indexing clinical guidance and contract templates — the catalogue moves weekly, the circulars and guidance must never be stale. Writing this down now prevents the most common late-stage surprise, which is discovering that a quarter of your corpus is a format your loader silently dropped.

The second is chunking, and it is the single most under-estimated lever in the whole pipeline. A chunk is the unit you embed and retrieve, and its size trades two things against each other. Chunks that are too large make matching coarse — the embedding has to average over too many ideas, so it represents none of them sharply, and a retrieved hit drags in paragraphs of irrelevant text. Chunks that are too small match precisely but lose the surrounding context an answer needs, so the model retrieves the right sentence with none of the framing that makes it usable. As of June 2026, a chunk size of 256 to 512 tokens is the common starting point, but the word that matters is starting: the right value is a property of your corpus, not a constant.

Chunk size	Recall behaviour	Context quality	Best suited to
256 tokens	High precision; sharp matches on specific facts, but a single chunk may miss surrounding context	Focused; little dilution, but answers spanning two chunks can fragment	FAQs, short policies, structured reference data, support snippets
512 tokens	Balanced; the common default that captures a complete idea without much noise	Good; usually holds one coherent point with its immediate context	Most general corpora — a sensible place to start tuning
1024 tokens	Lower precision; coarse matching as the embedding averages over many ideas	Rich per hit, but dilutes the signal and wastes prompt budget on filler	Long-form narrative, legal contracts, documents needing wide context

The discipline is to treat chunk size as a tuned parameter, not a guess. Pick 512 as a default, build a small evaluation set (we get to this in step four), then try 256 and 1024 and measure context recall and answer correctness on each. The numbers will tell you which way your particular corpus leans far more reliably than any rule of thumb. Our dedicated 2026 chunking playbook goes deeper on overlap, parent-child chunking and document-aware splitting if you want to push past a fixed window.

Two further habits separate careful corpus preparation from the tick-box kind. The first is to chunk along the document's own structure rather than against it. A fixed token window that slices through the middle of a table, a numbered clause or a code block produces chunks that read as nonsense to both retrievers, so prefer splitting on natural boundaries (headings, list items, clause numbers) and only then pack the pieces up to your target size. For an Indian fintech indexing structured circulars, or a UK NHS supplier indexing numbered clinical guidance, the clause is the meaningful unit, and respecting it often lifts retrieval quality more than fiddling with the window size. The second habit is to attach metadata to every chunk at index time: source document, section, last-modified date, and any access-control tags. That metadata is what lets you filter retrieval to the right tenant, show an honest citation later, and expire stale chunks on the right schedule. Skipping it is the kind of shortcut that is cheap on day one and expensive every day after.
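A minimal sketch of both habits, assuming blank-line-separated paragraphs are the natural boundary and using whitespace word counts as a crude stand-in for tokens. The names chunk_by_structure and Chunk, the metadata fields and the 120-word target are illustrative assumptions, not part of any library.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_by_structure(doc_text, source, last_modified, max_words=120):
    """Pack whole paragraphs into chunks of at most max_words words.

    Paragraphs are never split, so a paragraph longer than max_words
    becomes one oversized chunk rather than being cut mid-idea.
    Word counts stand in for tokens; swap in your tokenizer for real use.
    """
    paragraphs = [p.strip() for p in doc_text.split("\n\n") if p.strip()]

    groups, current, current_len = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        # Close the current chunk if adding this paragraph would overflow it.
        if current and current_len + n > max_words:
            groups.append(current)
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        groups.append(current)

    # Attach metadata at index time so filtering, citation and expiry work later.
    return [
        Chunk(
            text="\n\n".join(group),
            metadata={
                "source": source,
                "last_modified": last_modified,
                "position": i,
            },
        )
        for i, group in enumerate(groups)
    ]
```

In a real pipeline you would likely split on headings or clause numbers rather than blank lines, and add section and access-control tags to the metadata; the packing logic stays the same.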

Pro tip

Add a small overlap — roughly 10 to 15 per cent of the chunk size — so that an idea straddling a boundary is not severed in half. A sentence that begins at the end of one chunk and finishes at the start of the next is otherwise invisible to both retrievers. Overlap costs a little storage and almost nothing in latency, and it routinely lifts recall on documents with dense, continuous prose.
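As a rough illustration of the overlap idea, the sketch below slides a fixed window over a token list and repeats the tail of each window at the head of the next. The function name and the 12.5 per cent default are assumptions chosen to sit inside the 10 to 15 per cent range above; the tokens can come from whatever tokenizer you use.

```python
def window_with_overlap(tokens, size=512, overlap_ratio=0.125):
    """Split a token list into windows of `size` tokens with fractional overlap."""
    if size <= 0:
        raise ValueError("size must be positive")
    overlap = int(size * overlap_ratio)
    step = max(1, size - overlap)   # how far the window advances each time

    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break                   # this window already reached the end
    return windows
```

With size=512 the default ratio gives a 64-token overlap, so a sentence that straddles a boundary appears whole in at least one of the two neighbouring chunks as long as it is shorter than the overlap.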

Step 1 — Build the two retrievers: BM25 + dense vectors

With chunks in place, build the two retrievers. The dense side is a vector similarity search; the sparse side is a classic BM25 keyword search. The choice of where the vectors live matters less than the fact that both run and return comparable result sets — for a comparison of the options, our piece on Pinecone, Qdrant, pgvector and Turbopuffer lays out the trade-offs. The example below uses pgvector on Postgres, which keeps both retrievers in one database and is a pragmatic default for teams already running Postgres on an AWS Mumbai or London region.

The key idea is that the two queries run in parallel and each returns its own ranked list. We do not try to make the scores comparable yet — BM25 scores and cosine distances live on different scales, and forcing them onto one axis is exactly the mistake fusion is designed to avoid. Each retriever simply produces an ordered list of chunk IDs.

import asyncio
import asyncpg
from pgvector.asyncpg import register_vector  # from the pgvector Python package

# asyncpg allows only one query at a time per connection, so the retrievers
# take a pool: each pool.fetch() call borrows its own connection, which is
# what lets the two queries genuinely overlap.
async def make_pool(dsn):
    async def init(conn):
        # Teach each pooled connection to send list[float] as a pgvector value.
        await register_vector(conn)
    return await asyncpg.create_pool(dsn, init=init)

# Dense retriever: pgvector cosine distance (<=> operator).
# query_embedding is a list[float] from your embedding model.
async def dense_search(pool, query_embedding, k=20):
    rows = await pool.fetch(
        """
        SELECT id, content,
               1 - (embedding <=> $1) AS similarity
        FROM   chunks
        ORDER  BY embedding <=> $1
        LIMIT  $2
        """,
        query_embedding, k,
    )
    return [(r["id"], r["content"]) for r in rows]

# Sparse retriever: Postgres full-text ranking as a stand-in for BM25.
# ts_rank_cd is a cover-density rank, not true BM25; use a dedicated BM25
# extension or search engine when you need real BM25 scoring.
async def bm25_search(pool, query_text, k=20):
    rows = await pool.fetch(
        """
        SELECT id, content,
               ts_rank_cd(search_vector, plainto_tsquery('english', $1)) AS rank
        FROM   chunks
        WHERE  search_vector @@ plainto_tsquery('english', $1)
        ORDER  BY rank DESC
        LIMIT  $2
        """,
        query_text, k,
    )
    return [(r["id"], r["content"]) for r in rows]

async def retrieve(pool, query_text, query_embedding, k=20):
    # Run both retrievers concurrently; they are independent.
    dense, sparse = await asyncio.gather(
        dense_search(pool, query_embedding, k),
        bm25_search(pool, query_text, k),
    )
    return dense, sparse   # two ranked lists, fused in Step 2

A few practical notes. Retrieve more candidates per side than you ultimately need — fetching the top 20 from each retriever and narrowing later gives fusion and re-ranking room to work, whereas fetching the top 3 from each throws away exactly the borderline candidates those later stages exist to rescue. Run the two queries concurrently, as above, so the slower of the two sets your latency floor rather than their sum. And keep both retrievers reading from the same chunk table, so an ID returned by one always resolves to the same content as the other — that shared key is what makes fusion in the next step trivial.

If you are using a managed vector database rather than pgvector, many of them now expose a single hybrid endpoint that runs both retrievers for you. That is convenient, but understanding the two-list structure underneath still matters, because the fusion and re-ranking decisions are yours to make regardless of who runs the queries.

Step 2 — Fuse the results with Reciprocal Rank Fusion

Now you have two ranked lists whose scores are not comparable. The clean, nearly parameter-free way to merge them is Reciprocal Rank Fusion (RRF). Instead of trying to normalise a BM25 score against a cosine similarity (a fragile exercise that breaks the moment your data shifts), RRF ignores the raw scores entirely and uses only the rank of each document in each list. A document's fused score is the sum, across every list it appears in, of 1 / (k + rank), where rank starts at 1 for the best hit and k is a smoothing constant that dampens the influence of the very top positions. The standard value is k ≈ 60, and it works well enough across enough corpora that it has become the default.

The beauty of RRF is what it rewards. A document that ranks first in one retriever and is absent from the other gets a solid score. A document that ranks, say, third in both retrievers — present in both the semantic and the lexical view of relevance — often beats it, because two moderate agreements outweigh one strong-but-lonely signal. That is precisely the behaviour you want from hybrid retrieval: consensus across two different notions of relevance is a stronger signal than excellence in just one.
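The arithmetic behind that claim is easy to check by hand. The snippet below is only a worked example of the 1 / (k + rank) formula with k = 60; the helper name rrf_contribution is made up for illustration.

```python
K = 60  # the RRF constant discussed above


def rrf_contribution(rank, k=K):
    """Score one list contributes for a document at the given 1-based rank."""
    return 1.0 / (k + rank)


# Ranked first by one retriever, absent from the other.
first_in_one = rrf_contribution(1)                           # 1/61, about 0.0164

# Ranked third by both retrievers.
third_in_both = rrf_contribution(3) + rrf_contribution(3)    # 2/63, about 0.0317

# Worst case for a document found by both when each side returns 20 hits.
last_in_both = rrf_contribution(20) + rrf_contribution(20)   # 2/80 = 0.025
```

So with 20 candidates per side, any document found by both retrievers outranks every document found by only one, which is the consensus behaviour described above.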

def reciprocal_rank_fusion(ranked_lists, k=60, top_n=10):
    """Fuse multiple ranked lists of document IDs using RRF.

    ranked_lists: list of lists; each inner list is doc IDs in rank order
                  (best first), e.g. [dense_ids, sparse_ids].
    k:            RRF constant; 60 is the usual default and rarely needs tuning.
    Returns the top_n doc IDs by fused score.
    """
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)

    fused = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [doc_id for doc_id, _ in fused[:top_n]]

# Usage with the two lists from Step 1:
dense_ids  = [doc_id for doc_id, _ in dense]
sparse_ids = [doc_id for doc_id, _ in sparse]
fused_ids  = reciprocal_rank_fusion([dense_ids, sparse_ids], k=60, top_n=10)

Two things make RRF a pleasure to operate. First, it has effectively one parameter, and k ≈ 60 almost never needs touching, so there is no per-corpus score-normalisation to maintain as your data drifts. Second, it extends for free: if you later add a third retriever — a title-only index, a metadata filter, a different embedding model — you simply pass a third ranked list and the maths absorbs it. After this step you hold a single fused list of, say, the top 10 chunk IDs, ranked by agreement across both retrievers, and you are ready to decide whether they need re-ordering at all.
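To make the "extends for free" point concrete, here is a small sketch that fuses three lists and keeps the fused scores, which the conditional re-ranking step needs. The third title-only list and the chunk IDs are invented for illustration, and rrf_scores is a hypothetical variant of the function above that returns the score dictionary instead of only the IDs.

```python
def rrf_scores(ranked_lists, k=60):
    """Return a dict of doc_id -> fused RRF score across any number of lists."""
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores


# Hypothetical ranked lists (best first) from three retrievers.
dense_ids = ["c7", "c2", "c9"]
sparse_ids = ["c2", "c4", "c7"]
title_ids = ["c4", "c2"]          # assumed third retriever: a title-only index

scores = rrf_scores([dense_ids, sparse_ids, title_ids])
fused_ids = sorted(scores, key=scores.get, reverse=True)
# c2 appears in all three lists, so it comes out on top.
```

Adding the third list required no change to the maths, only one more entry in the input.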

Step 3 — Re-rank, and when to skip it

A re-ranker is a more expensive model — typically a cross-encoder — that reads the query and each candidate chunk together and scores how well they actually match, rather than relying on independently computed embeddings. It is the most accurate relevance judgement in the pipeline, and on hard queries it meaningfully improves the order of your fused list. The catch is cost: a cross-encoder runs the query against every candidate, adding latency and compute to every request if you apply it blindly.

The production move is conditional re-ranking: only re-rank when the system is uncertain. After fusion, look at the top candidate's score. If it is comfortably above a confidence threshold, retrieval is already confident — the same document topped both retrievers, fusion agrees, and re-ranking will almost certainly not change the answer — so you skip the re-ranker and save the compute. If the top score is below the threshold, retrieval is unsure, and that is exactly when the extra accuracy earns its cost.

Fused top-score signal	Interpretation	Action	Effect
Above threshold (confident)	Both retrievers strongly agree on the top chunk	Skip the re-ranker; pass the fused order straight through	Lower latency and cost; ordering already good
Below threshold (uncertain)	Weak or split agreement; the right chunk may be ranked low	Run the cross-encoder re-ranker over the fused candidates	Higher accuracy where it matters; cost spent deliberately
Near-empty result set	Neither retriever found much — likely out of scope	Trigger a fallback or an honest "I don't know"	Avoids confidently answering from irrelevant context

def maybe_rerank(query, candidates, fused_scores, reranker,
                 threshold=0.030, top_k=5):
    """Conditionally re-rank only when retrieval looks uncertain.

    candidates:   list of (doc_id, content) in fused order.
    fused_scores: dict of doc_id -> fused RRF score (the scores dict built
                  inside reciprocal_rank_fusion, kept alongside the IDs).
    reranker:     any object with score(query, texts) -> list[float];
                  a placeholder for your cross-encoder wrapper.
    threshold:    illustrative confidence cut-off; tune on your eval set.
                  With k=60 and two lists, the top raw RRF score always lies
                  between 1/61 (~0.0164) and 2/61 (~0.0328).
    """
    if not candidates:
        return []   # caller handles the empty / out-of-scope case

    top_score = fused_scores[candidates[0][0]]
    if top_score >= threshold:
        return candidates[:top_k]          # confident: skip the re-ranker

    # Uncertain: score each candidate against the query with a cross-encoder.
    scored = reranker.score(query, [c for _, c in candidates])
    reordered = [c for _, c in sorted(
        zip(scored, candidates), key=lambda x: x[0], reverse=True)]
    return reordered[:top_k]

The threshold of 0.030 above is illustrative. Mind the scale: with k ≈ 60 and two retrievers, the top fused RRF score can never exceed 2/61 ≈ 0.033 (first place in both lists) and never falls below 1/61 ≈ 0.016 (first place in just one), so a cut-off such as 0.30 would sit far outside that band and send every query to the re-ranker. The correct value is whatever your evaluation set says separates confident from uncertain retrieval for your corpus, and it is worth tuning: set it too high and you re-rank everything (paying full cost); set it too low and you skip cases that needed help. A sensible way to find it is to run your golden set both with and without re-ranking, look at the queries the re-ranker actually rescued, and place the threshold just above where rescues stop mattering. Done well, conditional re-ranking gives you most of the accuracy of always-on re-ranking at a fraction of the compute.
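The tuning procedure just described can be sketched as a simple sweep. This assumes you have already recorded, for each golden-set query, the top fused score and whether re-ranking turned a wrong answer into a right one; the function names, field names and the 95 per cent capture target are all illustrative.

```python
def sweep_thresholds(observations, thresholds):
    """observations: list of (top_fused_score, rescued_by_reranker) pairs."""
    total = len(observations)
    total_rescues = sum(1 for _, rescued in observations if rescued)

    rows = []
    for t in thresholds:
        # Queries that fall below the threshold are the ones sent to the re-ranker.
        reranked = [(s, r) for s, r in observations if s < t]
        captured = sum(1 for _, r in reranked if r)
        rows.append({
            "threshold": t,
            "rerank_rate": len(reranked) / total if total else 0.0,
            "rescues_captured": captured / total_rescues if total_rescues else 1.0,
        })
    return rows


def pick_threshold(rows, min_capture=0.95):
    """Cheapest threshold that still captures min_capture of the rescues."""
    ok = [r for r in rows if r["rescues_captured"] >= min_capture]
    return min(ok, key=lambda r: r["rerank_rate"])["threshold"] if ok else None
```

The trade-off is visible in the two columns: rerank_rate is the cost you pay, rescues_captured is the accuracy you keep.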

Watch out

Do not bolt a re-ranker on as a reflex. Re-ranking only re-orders the candidates retrieval already found — if the right chunk never made it into the fused list, no re-ranker can conjure it back. When evaluation shows poor results, check context recall first: if the answer chunk is missing from your candidates entirely, the fix is better chunking or retrieval, not a re-ranker. Adding a cross-encoder to paper over a recall problem just makes a broken pipeline slower and more expensive.

Step 4 — Measure faithfulness before you ship

Up to here you have been making engineering choices — chunk size, fusion, the re-rank threshold — and every one of them needs a number to justify it. That number comes from an evaluation harness, and building one is the step that most distinguishes a production RAG system from a demo. The tools to reach for as of June 2026 are Ragas, DeepEval and Promptfoo, which let you treat retrieval quality as a set of unit tests over a golden set. A production-ready golden set is typically 200 to 500 representative examples — real questions, with reference answers and, ideally, the passages that should support them.

Four metrics carry most of the weight, and they split cleanly into testing the retriever versus testing the generator. Context precision and context recall judge retrieval: did you fetch the right chunks, and did you fetch all of them. Faithfulness judges generation: is the answer actually grounded in the retrieved context, or did the model invent something plausible. Answer correctness judges the end-to-end result against a known reference. Faithfulness is the one teams most often skip and most need, because it catches the dangerous case where retrieval was fine but the model hallucinated anyway.

Metric	What it asks	What a low score tells you
Context precision	Are the retrieved chunks relevant, and ranked sensibly?	Too much noise in retrieval — tighten chunking or re-ranking
Context recall	Did retrieval fetch all the chunks needed to answer?	The answer evidence is being missed — fix chunking or retrieval
Faithfulness	Is the answer supported by the retrieved context?	The model is hallucinating beyond its sources — a citation and prompt problem
Answer correctness	Does the final answer match the reference?	End-to-end failure — diagnose with the three metrics above
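For intuition about the two retrieval metrics, here is a deliberately simplified version that works on chunk IDs alone. It is a stand-in, not how the evaluation frameworks compute their scores (those are typically rank-aware and judged by a model), and it assumes your golden set labels which chunk IDs support each answer.

```python
def context_precision_recall(retrieved_ids, relevant_ids):
    """ID-overlap approximation of context precision and context recall."""
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    hits = retrieved & relevant

    # Precision: what share of what we fetched was actually useful?
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: what share of the needed evidence did we fetch?
    recall = len(hits) / len(relevant) if relevant else 1.0
    return precision, recall
```

A query that retrieves four chunks, one of which is among the two labelled as supporting evidence, scores 0.25 precision and 0.5 recall: noisy and incomplete at once.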

from ragas import evaluate
from ragas.metrics import (
    context_precision, context_recall,
    faithfulness, answer_correctness,
)
from datasets import Dataset

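# Your golden set: 200 to 500 rows of question / answer / contexts / ground_truth.
# Column names follow the older Ragas 0.1-style schema; newer releases rename
# them, so check the version you have installed.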
golden = Dataset.from_dict({
    "question":      questions,        # list[str]
    "answer":        generated,        # your pipeline's answers
    "contexts":      retrieved_chunks, # list[list[str]] — what retrieval returned
    "ground_truth":  references,       # known-good reference answers
})

report = evaluate(
    golden,
    metrics=[context_precision, context_recall,
             faithfulness, answer_correctness],
)
print(report)   # per-metric scores; gate your release on these
# Illustrative target gate: faithfulness >= 0.90, context_recall >= 0.85

The thresholds in that final comment are illustrative starting points, not guarantees — pick your own gate values from a first baseline run, then refuse to ship a change that regresses them. Wire this harness into CI so that every change to chunk size, the fusion constant, the re-rank threshold or the prompt is scored automatically against the golden set. That turns RAG tuning from guesswork into measurement, and it is the foundation that the observability layer — covered in our guide to RAG observability in production — then extends from offline evaluation to live traffic.
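One way to express the "refuse to ship a regression" rule in CI is a small gate function. The sketch assumes you have already reduced the evaluation report to a plain dict of metric name to mean score; check_gate, the tolerance and the example numbers are hypothetical.

```python
def check_gate(scores, baseline, tolerance=0.01, floors=None):
    """Return a list of failure messages; an empty list means the gate passes.

    scores:   metric name -> score from the current run.
    baseline: metric name -> score from the accepted baseline run.
    floors:   optional metric name -> absolute minimum, regardless of baseline.
    """
    failures = []
    for metric, base in baseline.items():
        current = scores.get(metric)
        if current is None:
            failures.append(f"{metric}: missing from report")
        elif current < base - tolerance:
            failures.append(
                f"{metric}: {current:.3f} regressed from baseline {base:.3f}")

    for metric, floor in (floors or {}).items():
        if scores.get(metric, 0.0) < floor:
            failures.append(f"{metric}: below floor {floor:.2f}")
    return failures


# Example with made-up numbers: recall dropped after a chunk-size change.
baseline = {"faithfulness": 0.92, "context_recall": 0.86}
current = {"faithfulness": 0.93, "context_recall": 0.80}
problems = check_gate(current, baseline)
```

In a CI job you would exit non-zero when the returned list is non-empty, printing the messages so the author can see which metric moved.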

Common pitfalls: refresh cadence, citation UX, the retrieval-not-generation bug

A RAG system that passes evaluation on launch day can still rot, and three failure modes account for most of the rot. The first is stale data from the wrong refresh cadence. The right re-indexing interval is a property of the document, not a single global setting. Dynamic content — product catalogues, pricing, compliance documents like the RBI circulars or NHS framework agreements mentioned earlier — wants daily refresh, because a catalogue or a policy that changed last week and was confidently cited from the old version is worse than no answer. Truly real-time sources such as support tickets and news want near-real-time or hourly refresh. Static reference material can be re-indexed far less often. Tie the schedule to the content type and the most damaging staleness disappears.
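A per-type refresh policy can be as small as a lookup table plus a staleness check. The document-type labels below and the 30-day interval for static material are assumptions for illustration; the hourly and daily values follow the cadences suggested above.

```python
from datetime import datetime, timedelta, timezone

# Refresh interval per document type (labels are illustrative).
REFRESH_INTERVALS = {
    "support_ticket": timedelta(hours=1),
    "news": timedelta(hours=1),
    "product_catalogue": timedelta(days=1),
    "compliance": timedelta(days=1),
    "static_reference": timedelta(days=30),   # assumed value for static material
}

DEFAULT_INTERVAL = timedelta(days=1)          # unknown types fall back to daily


def is_stale(doc_type, last_indexed, now=None):
    """True when a document of this type is overdue for re-indexing."""
    if now is None:
        now = datetime.now(timezone.utc)
    interval = REFRESH_INTERVALS.get(doc_type, DEFAULT_INTERVAL)
    return now - last_indexed > interval
```

A scheduler can run this check over the last-modified metadata attached at index time and queue only the overdue documents, rather than re-indexing everything on one global timer.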

The second is weak citation UX. A production RAG answer should show its working: which chunk, from which document, supports each claim. This is not decoration. Citations let users verify a high-stakes answer themselves, they make faithfulness failures visible the moment one slips through, and in a regulated setting — a fintech or an NHS supplier — they are often a hard requirement rather than a nicety. An answer the user cannot trace is an answer they cannot trust, however correct it happens to be.

The third is the most insidious: the retrieval-not-generation bug. When an answer is wrong, the instinct is to blame the model and reach for a bigger one or a cleverer prompt. More often the fault is upstream — the right chunk was never retrieved, so the model answered faithfully from the wrong context. This is why the metric split in step four matters so much: if context recall is low, no amount of generation tuning will help, and the entire fix lives in chunking, retrieval and fusion. The practical diagnostic is quick: when an answer is wrong, first inspect the chunks that were actually retrieved for that query. If the supporting passage is absent from them, the problem is retrieval and you should look at chunking, the fusion list size and recall; if the passage is present but the answer ignored or contradicted it, the problem is generation, and a clearer prompt, a citation requirement or a stronger model is the right lever. Conflating the two is how teams spend a month tuning the wrong half of the pipeline.
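The diagnostic above reduces to a set comparison once your golden set records which chunks support each answer. The function name and return labels in this sketch are made up for illustration.

```python
def diagnose_wrong_answer(retrieved_ids, supporting_ids):
    """Classify a wrong answer as a retrieval or a generation failure.

    retrieved_ids:  chunk IDs actually passed to the model for this query.
    supporting_ids: chunk IDs the golden set says are needed for the answer.
    """
    retrieved, supporting = set(retrieved_ids), set(supporting_ids)
    if not supporting:
        return "unlabelled"     # no evidence labels, so we cannot tell

    missing = supporting - retrieved
    if missing:
        # The evidence never reached the model: look at chunking,
        # candidate list size and recall.
        return "retrieval"

    # The evidence was present but ignored or contradicted: look at the
    # prompt, citation requirements or the model itself.
    return "generation"
```

Running this over every failed golden-set query gives a quick count of which half of the pipeline deserves the next week of work.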

One more lever worth knowing as you scale: semantic caching, which serves a stored answer when a new query is close enough to a previous one. It can cut LLM costs by up to roughly 68.8 per cent, a best-case figure that depends on how repetitive your traffic is; support and FAQ traffic, which repeats heavily, is where caching pays off most. Treat the cache as a tunable rather than a switch, though: set the similarity threshold for a cache hit too loosely and you will serve a near-but-wrong answer, which is far costlier than the compute you saved. Pair caching with the same golden-set evaluation you built in step four, so a cache that starts returning subtly mismatched answers shows up as a faithfulness regression rather than a silent quality leak. Cost optimisation and correctness are not opposing goals here; the discipline that protects one protects the other.
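A toy in-memory version shows where the similarity threshold bites. This is a sketch under simplifying assumptions: a linear scan over stored embeddings, plain-Python cosine similarity, and an illustrative default threshold of 0.95. A real deployment would keep the embeddings in a vector index and add expiry.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(self, threshold=0.95):
        self.threshold = threshold   # minimum similarity for a cache hit
        self.entries = []            # list of (query_embedding, answer)

    def put(self, query_embedding, answer):
        self.entries.append((list(query_embedding), answer))

    def get(self, query_embedding):
        """Return the closest stored answer, or None if nothing is close enough."""
        best_score, best_answer = 0.0, None
        for stored, answer in self.entries:
            score = cosine(query_embedding, stored)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None
```

Lowering the threshold raises the hit rate and the savings, but it also starts returning answers to questions that only resemble the cached one, which is the failure the paragraph above warns about.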

Conclusion and next steps

The mature 2026 RAG pipeline is not exotic. It is a disciplined sequence: inventory the corpus, chunk thoughtfully around 256 to 512 tokens and tune from there, run BM25 and dense vectors in parallel, fuse them with Reciprocal Rank Fusion at k ≈ 60, re-rank only when retrieval is unsure, and gate every release on faithfulness and recall measured over a real golden set. Each step is small; together they are the difference between a demo that impresses once and a system that holds up under real users in a Mumbai or London region for months. For the wider research frontier — hierarchical and agentic retrieval — our summary of the recent agentic RAG and hierarchical retrieval papers is the natural next read.

If you are building exactly this kind of production RAG system, that work is worth showing. A Verified Builder profile on AI Tech Connect gathers your shipped projects in one place where the people hiring across India and the UK are already looking — and the same proof-of-work mindset that makes a good portfolio, covered in our guide to the AI engineer portfolio that gets you hired, is exactly what a strong RAG project demonstrates.
