The retrieval problem most RAG teams ignore

The majority of RAG failures are not hallucination failures. They are retrieval failures. Analyses of enterprise document Q&A deployments commonly attribute a clear majority of bad answers (figures of around 70% or higher are frequently cited) to retrieval: wrong documents returned, relevant documents missed, or rank order confused. The LLM never had a chance: it was reasoning over the wrong evidence from the start.

Vector-only retrieval, which has been the de facto default since embedding models became cheap and fast, is excellent at semantic similarity. Ask a question in plain language and a well-tuned embedding model will surface conceptually related passages even when the wording is completely different. That strength is real. The weakness is equally real, and it consistently catches teams out in production: vector search struggles with exact terms.

Product codes, regulation identifiers, model numbers, error strings, software version numbers, proper nouns in specialist domains, rare technical abbreviations: these are the backbone of enterprise document corpora in sectors from financial services to manufacturing. An embedding model trained on general web text does not reliably distinguish "ISO 27001" from "ISO 27002" or "Regulation (EU) 2016/679" from "UK GDPR". BM25, the roughly 30-year-old scoring function built on term frequency and inverse document frequency, handles these far more reliably because it scores exact token matches, provided the tokeniser keeps each identifier intact.

The fix is combining both. Hybrid search — running BM25 and dense vector retrieval in parallel, then merging the result lists — is not a new idea. What has changed is the engineering tooling around it. Libraries like rank_bm25, vector stores with native BM25 support (Weaviate, Elasticsearch, OpenSearch), and framework-level hybrid retrievers in LlamaIndex and LangChain have made this a day's work rather than a week's.

This guide covers the full picture: why the failure modes occur, how Reciprocal Rank Fusion (RRF) merges the lists, how re-ranking the merged candidates sharpens quality further, and what a production-ready hybrid pipeline looks like for Indian and UK enterprise teams building document Q&A systems.

Pro tip

Before refactoring your retrieval stack, add logging that captures, for every query: the top-10 BM25 candidates, the top-10 vector candidates, and the final top-k passed to the LLM. Run this for two weeks in production. The overlap — or absence of it — between the two lists will tell you exactly how much you stand to gain from hybrid fusion, and whether re-ranking is worth the additional latency.

Why vector-only RAG fails on exact terms

Dense embedding models map text into a high-dimensional vector space such that semantically similar passages cluster together. The cosine distance between "what is the penalty for GDPR non-compliance" and "fines under the General Data Protection Regulation" is small, correctly so. The embedding model has learned that these phrases carry the same meaning.

The failure mode is subtler. "Article 83(4) GDPR" and "Article 83(5) GDPR" describe penalties in different tiers: the former caps at €10m or 2% of global turnover, the latter at €20m or 4%. Their embeddings are very close — both are about GDPR penalties — but the distinction matters enormously to a legal team. BM25 treats "83(4)" and "83(5)" as different tokens and retrieves the correct article when the user quotes the identifier directly.
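To make the token-level distinction concrete, here is a minimal pure-Python sketch of Okapi BM25 scoring over two toy passages. The bm25_scores helper, the k1 and b defaults, the smoothed IDF variant, and the sample sentences are illustrative assumptions rather than any library's implementation. It also assumes plain whitespace tokenisation, which keeps "83(4)" and "83(5)" as separate tokens.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenised document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of documents containing each term
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Smoothed IDF variant that stays positive for common terms (assumption)
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy passages that differ only in the article identifier and the figures
docs = [
    "Article 83(4) GDPR fines up to 10m EUR or 2% of turnover",
    "Article 83(5) GDPR fines up to 20m EUR or 4% of turnover",
]
tokenised = [d.lower().split() for d in docs]
scores = bm25_scores("article 83(5) gdpr".split(), tokenised)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
```

Because the two passages share every other query term, the identifier token alone decides the ranking. A tokeniser that strips punctuation could break "83(5)" apart and weaken this effect, so tokenisation deserves the same care as the scoring function.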

The same dynamic recurs across domains. An engineer querying for "error code E0042" gets poor results from a vector retriever trained on documentation prose because the model has never learned that E0042 is meaningfully different from E0043. A procurement analyst searching for supplier part number "XZ-7712-B" is out of luck. A compliance officer at a UK financial firm looking for PS24/6 — the FCA policy statement — is searching for a precise identifier in a terminology space that embedding models compress into near-identical vectors.

The argument for keeping vector search alongside BM25 is equally strong. Natural language questions — "how does our leave policy handle parental bereavement?" — have no exact keyword anchors. The user does not know the section heading. Pure BM25 retrieval fails here because it only scores on term overlap, and the user's phrasing may share no exact tokens with the relevant passage. Vector search surfaces the right passage across paraphrase gaps that BM25 cannot bridge.

Reciprocal Rank Fusion: the formula and the k parameter

Reciprocal Rank Fusion is a rank-combination algorithm published by Cormack, Clarke, and Buettcher in 2009. It requires no score normalisation — the raw similarity scores from BM25 and from a cosine-distance vector search are not on the same scale and cannot be meaningfully averaged. RRF sidesteps this entirely by operating only on rank positions.

The formula for the RRF score of document d across retrieval systems R1 ... Rn is:

RRF(d) = sum over each ranker R of: 1 / (k + rank_R(d))

where:
  k     = smoothing constant, typically 60
  rank  = 1-based position of document d in ranker R's result list
         (a document absent from a ranker's list contributes 0 for that ranker)

Documents that appear near the top of multiple lists accumulate high scores. A document ranked 1st by BM25 and 2nd by vector search scores 1/(60+1) + 1/(60+2) ≈ 0.0325. A document ranked 3rd by only one ranker scores 1/(60+3) ≈ 0.0159. The fusion naturally promotes documents with multi-signal agreement.
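The same arithmetic can be reproduced on toy data. The sketch below fuses two short ranked lists of hypothetical document IDs; the rrf_fuse name and the IDs are assumptions made for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


bm25_list = ["doc_a", "doc_b", "doc_c"]    # doc_a ranked 1st by BM25
vector_list = ["doc_d", "doc_a", "doc_e"]  # doc_a ranked 2nd by vector search

fused = rrf_fuse([bm25_list, vector_list])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.4f}")
```

Here doc_a is the only document present in both lists, so it ends up first even though doc_d outranks it in the vector list.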

The k parameter controls the curvature of the scoring function. At low values of k (say, 1), the score difference between rank 1 and rank 5 is large (0.5 vs 0.17). At k=60, the difference is modest (0.0164 vs 0.0154). Higher k makes the function flatter, so small rank differences matter less. The default of 60 emerged from the original paper's experiments across IR benchmarks and holds well in practice. Adjust it only if you have a held-out evaluation set and evidence that your rank lists have a specific shape that favours a different value.
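The effect of k on the curve can be checked numerically. The sketch below (rrf_term is a hypothetical helper name) prints a single ranker's contribution at ranks 1 and 5 for k=1 and k=60, along with the ratio between them.

```python
def rrf_term(rank, k):
    """Contribution of one ranker to a document's RRF score."""
    return 1.0 / (k + rank)


ratios = {}
for k in (1, 60):
    top, fifth = rrf_term(1, k), rrf_term(5, k)
    ratios[k] = top / fifth
    print(f"k={k}: rank 1 = {top:.4f}, rank 5 = {fifth:.4f}, ratio = {ratios[k]:.2f}")
```

A ratio close to 1 means that position within a single list matters little and agreement across lists dominates the fused ranking.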

RRF implementation in Python

The core RRF merge is straightforward. The snippet below shows a standalone implementation using rank_bm25 for lexical retrieval and sentence-transformers for dense retrieval, with RRF fusion producing a unified ranked list.

from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
import numpy as np
from collections import defaultdict

# --- index construction ---
# documents: list of {"text": ...} dicts, loaded and chunked elsewhere
corpus = [doc["text"] for doc in documents]  # list of strings

tokenised = [text.lower().split() for text in corpus]
bm25 = BM25Okapi(tokenised)

model = SentenceTransformer("BAAI/bge-small-en-v1.5")
embeddings = model.encode(corpus, normalize_embeddings=True)

# --- hybrid retrieval with RRF ---
def hybrid_search(query: str, top_k: int = 50, rrf_k: int = 60) -> list[dict]:
    """Return top_k documents ranked by RRF over BM25 + vector scores."""

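    # BM25 returns scores for the entire corpus; keep only documents that
    # matched at least one query term so zero-score documents earn no RRF credit
    bm25_scores = bm25.get_scores(query.lower().split())
    bm25_top = [i for i in np.argsort(bm25_scores)[::-1][:top_k] if bm25_scores[i] > 0]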

    # Dense vector — cosine similarity (embeddings already normalised)
    q_emb = model.encode([query], normalize_embeddings=True)[0]
    cos_scores = embeddings @ q_emb
    vec_top = np.argsort(cos_scores)[::-1][:top_k]

    # RRF fusion
    rrf_scores: dict[int, float] = defaultdict(float)

    for rank, idx in enumerate(bm25_top, start=1):
        rrf_scores[int(idx)] += 1.0 / (rrf_k + rank)

    for rank, idx in enumerate(vec_top, start=1):
        rrf_scores[int(idx)] += 1.0 / (rrf_k + rank)

    # Sort by RRF score descending and keep the best top_k fused candidates
    ranked = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)[:top_k]
    return [{"index": idx, "rrf_score": score, "text": corpus[idx]}
            for idx, score in ranked]

results = hybrid_search("ISO 27001 audit requirements", top_k=50)
# pass results[:10] to re-ranker or directly to LLM context

This is the minimal viable implementation. In a production system you would persist the BM25 index to disk (using pickle or a dedicated BM25 server), store embeddings in a vector database rather than a NumPy array, and run the two retrievals concurrently using asyncio or ThreadPoolExecutor to minimise latency.

Using LlamaIndex and LangChain hybrid retrievers

If you are already using a RAG framework, both LlamaIndex and LangChain provide first-class hybrid retrieval. For teams building net-new pipelines, these abstractions save significant boilerplate.

LlamaIndex — QueryFusionRetriever with RRF:

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.core.retrievers import QueryFusionRetriever

# Build indexes
documents = SimpleDirectoryReader("./docs").load_data()
vector_index = VectorStoreIndex.from_documents(documents)

# BM25 retriever over the same node set
bm25_retriever = BM25Retriever.from_defaults(
    docstore=vector_index.docstore,
    similarity_top_k=10,
)
vector_retriever = vector_index.as_retriever(similarity_top_k=10)

# Hybrid retriever with RRF
hybrid_retriever = QueryFusionRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    similarity_top_k=5,
    num_queries=1,       # set >1 for query expansion
    mode="reciprocal_rerank",
    use_async=True,
)

LangChain — EnsembleRetriever:

from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS
from langchain.retrievers import EnsembleRetriever
from langchain_openai import OpenAIEmbeddings

# Vector store (docs is a list of LangChain Document objects, loaded and chunked elsewhere)
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(docs, embeddings)
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

# BM25
bm25_retriever = BM25Retriever.from_documents(docs)
bm25_retriever.k = 10

# Ensemble (weights scale each retriever's rank-based contribution, not its raw scores)
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.4, 0.6],  # tune on your eval set
)

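Note that LangChain's EnsembleRetriever applies a weighted form of Reciprocal Rank Fusion: each retriever's 1 / (c + rank) contribution is multiplied by its weight, with the constant c defaulting to 60. It produces the same ordering as the formula above only when the weights are equal. For unweighted RRF semantics, use equal weights, use LlamaIndex's mode="reciprocal_rerank", or implement the formula directly.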
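For comparison with the unweighted formula, a weighted form of rank fusion, in which each retriever's reciprocal-rank contribution is scaled by its weight, can be sketched as follows. This is an illustrative standalone function, not LangChain's code; the weighted_rrf name, the c parameter name, and the toy lists are assumptions.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, c=60):
    """Rank fusion where each list's 1 / (c + rank) term is scaled by a weight."""
    if len(rankings) != len(weights):
        raise ValueError("one weight per ranking is required")
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Two retrievers that disagree on the order of the same two documents
bm25_list = ["doc_a", "doc_b"]
vector_list = ["doc_b", "doc_a"]

favour_vector = weighted_rrf([bm25_list, vector_list], weights=[0.4, 0.6])
favour_bm25 = weighted_rrf([bm25_list, vector_list], weights=[0.6, 0.4])
print(favour_vector, favour_bm25)
```

When the two lists disagree, the weights act as the tie-breaker: whichever retriever carries more weight decides the fused order. That is why the weights are worth tuning on an evaluation set rather than guessing.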

Watch out

BM25 indices are not automatically updated when your document corpus changes. If you add or remove documents, you must rebuild the index. In production, treat BM25 index reconstruction as part of your document ingestion pipeline, not a one-time setup step. Search platforms with native BM25 support (Weaviate, Elasticsearch, OpenSearch) handle this transparently when you write documents; in-memory BM25 via rank_bm25 does not.

Re-ranking the merged top-N

RRF fusion gives you a strong merged list, but it does not look at semantic relevance between the query and the candidate passage at query time. It only knows rank positions. Re-ranking adds a second stage that scores each candidate in the merged list directly against the query using a more expensive model.

Two families of re-ranker are in common use:

Cross-encoders take the query and a single passage as a joint input and output a relevance score. They are highly accurate because the model sees both texts simultaneously — there is no representation bottleneck. The cost is that every candidate must be passed through the model independently, making them O(n) in the candidate count. For 50 candidates and a small cross-encoder (e.g. cross-encoder/ms-marco-MiniLM-L-6-v2), expect 80–200ms on CPU.

Bi-encoders embed the query and each document independently and score by cosine similarity. They are faster but less accurate than cross-encoders on relevance judgements. If you are already using a dense vector retriever, your embedding model is a bi-encoder. Re-ranking with a separate bi-encoder is rarely worth the extra step unless you want to use a stronger embedding model than the one used for indexing.

The recommended production pattern is: RRF fusion over top-50 candidates from BM25 + vector, then a cross-encoder re-ranker over the top-50, passing the top-5 or top-10 re-ranked results to the LLM context.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[dict], top_n: int = 5) -> list[dict]:
    pairs = [(query, c["text"]) for c in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)
    return [c for _, c in ranked[:top_n]]

# Full pipeline
query = "ISO 27001 audit requirements"  # example query
fused = hybrid_search(query, top_k=50)
final = rerank(query, fused, top_n=5)
context = "\n\n".join(c["text"] for c in final)

Comparison: pure vector vs BM25 vs hybrid

| Approach | Exact-term recall | Semantic recall | Indexing complexity | Latency | Best for |
| --- | --- | --- | --- | --- | --- |
| Vector-only | Poor: misses exact codes and rare terms | Excellent | Low: embed once, store | Fast (ANN query) | Semantic Q&A, paraphrase-heavy queries |
| BM25-only | Excellent | Poor: fails on paraphrase and synonyms | Low: inverted index | Very fast (in-memory) | Keyword search, legacy document systems |
| Hybrid (RRF) | Excellent: inherits from BM25 | Excellent: inherits from vector | Medium: maintain both indices | Medium: parallel retrieval + merge | Enterprise document Q&A, compliance search, code search |
| Hybrid + cross-encoder | Excellent | Excellent | Medium | Higher: cross-encoder adds 80–200ms | High-stakes answers where precision matters more than speed |

For Indian enterprise teams building document Q&A over regulatory filings, GST documentation, or contract repositories — and for UK teams working with FCA/PRA rulebooks, Companies House filings, or NHS clinical guidance — the hybrid + re-ranker pipeline is the practical default. The exact-term failures of vector-only retrieval cluster precisely in the document types these sectors work with.

For internal teams at product companies building semantic search over user-generated content (e.g. a support knowledge base, a community forum), vector-only is often adequate and operationally simpler.

From the editorial team

The agentic RAG research covered in the April agentic RAG papers roundup is complementary to hybrid search, not a replacement. Hierarchical tool-use retrieval and hybrid fusion solve different problems. A well-engineered agentic RAG system with only vector retrieval still fails on exact-term queries; adding BM25 fusion as one of its retrieval tools is the correct composition.

When to use each approach

The decision tree is simpler than it might appear.

Use hybrid search by default when building enterprise document Q&A, compliance assistants, contract analysis tools, code search, customer support over product documentation, or any system where users are likely to search by exact identifiers. This covers the majority of serious RAG deployments in financial services, legal, healthcare, manufacturing, and government sectors — whether the deployment is in Bengaluru, Mumbai, London, or Edinburgh.

Use vector-only retrieval when the query population is dominated by natural language paraphrase questions with no exact-term anchors, when operational simplicity is a hard constraint, or when you are prototyping quickly and do not yet have a BM25 index ready. It is a reasonable starting point, but plan to migrate once you observe exact-term failures.

Use BM25-only retrieval when you are building a traditional keyword search interface that is explicitly not attempting semantic understanding, when you are constrained to on-premise infrastructure without GPU support, or when your corpus is highly structured (e.g. a parts catalogue) and all queries are expected to contain the exact product identifiers.

Add a cross-encoder re-ranker when answer precision is more important than latency — legal research, medical guidance, regulatory compliance — and when your users will tolerate 200–400ms total retrieval time. Do not add it speculatively; measure whether it moves your evaluation metrics before committing to the latency budget.

For teams experimenting with LlamaIndex or LangGraph-based agentic pipelines, including those following the patterns in the LangGraph v0.4 guide, hybrid retrieval integrates cleanly as a tool exposed to the orchestrating agent. The agent can call the hybrid retriever when it needs broad document scanning and switch to a cross-encoder re-ranker call when it has a narrow candidate set and needs precise ranking.

Production checklist

Moving from a prototype hybrid pipeline to production involves more than getting the RRF formula right. The following areas catch teams out.

Indexing pipeline

Both the BM25 index and the vector index must be updated when the corpus changes. Treat document ingestion as a pipeline, not a script. Use a message queue or event-driven trigger so that when a document is added, updated, or removed, both indices update atomically. Partial updates, where the vector index has a new document but BM25 does not, produce retrieval inconsistencies that are difficult to diagnose in production logs.
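One way to avoid partial updates is to build both indices from the same document snapshot and publish them together. The sketch below is a simplified in-process illustration: HybridIndexManager, build_lexical, and build_vector are hypothetical names, the builders are stand-ins for real index construction, and a full rebuild on every write is assumed to be affordable, which holds only for small corpora.

```python
import threading
from typing import Callable


class HybridIndexManager:
    """Keeps a lexical and a vector index in step with one document store."""

    def __init__(self, build_lexical: Callable[[dict], object],
                 build_vector: Callable[[dict], object]):
        self._docs: dict[str, str] = {}
        self._build_lexical = build_lexical
        self._build_vector = build_vector
        self._lock = threading.Lock()  # serialises writers
        self._snapshot = (build_lexical({}), build_vector({}))

    def _apply(self, new_docs: dict) -> None:
        # Build both indices first; if either build raises, nothing is published
        lexical = self._build_lexical(new_docs)
        vector = self._build_vector(new_docs)
        self._docs = new_docs
        self._snapshot = (lexical, vector)  # readers see both or neither

    def upsert(self, doc_id: str, text: str) -> None:
        with self._lock:
            self._apply({**self._docs, doc_id: text})

    def delete(self, doc_id: str) -> None:
        with self._lock:
            self._apply({k: v for k, v in self._docs.items() if k != doc_id})

    def snapshot(self):
        """Return the (lexical, vector) pair built from the same documents."""
        return self._snapshot


def build_ids(docs: dict) -> frozenset:
    # Stand-in builder: a real one would construct a BM25 or vector index
    return frozenset(docs)


manager = HybridIndexManager(build_lexical=build_ids, build_vector=build_ids)
manager.upsert("policy-1", "Parental bereavement leave policy")
manager.upsert("policy-2", "ISO 27001 audit requirements")
manager.delete("policy-1")
lexical_ids, vector_ids = manager.snapshot()
print(lexical_ids == vector_ids)
```

With a managed search platform the same guarantee usually comes from writing the text and the embedding as fields of one document, so this pattern matters mainly when the two indices live in separate systems.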

Chunk size matters differently for BM25 and vector search. BM25 benefits from larger chunks because term-frequency statistics are more reliable over longer passages. Dense embedding models typically produce better results on chunks of 256–512 tokens. When using hybrid, prefer a chunk size that serves both — 300–500 tokens with a 10–15% overlap is a reasonable starting point.
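A sliding-window chunker with overlap can be sketched in a few lines. The chunk_tokens helper is a hypothetical name, and the example treats whitespace-separated words as a rough proxy for model tokens; in practice you would count tokens with the tokeniser of your embedding model.

```python
def chunk_tokens(tokens, chunk_size=400, overlap=50):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


# 1,000 placeholder tokens, 400-token chunks with a 50-token (12.5%) overlap
words = [f"w{i}" for i in range(1000)]
chunks = chunk_tokens(words, chunk_size=400, overlap=50)
print([len(c) for c in chunks])
```

Feeding the same chunks to both the BM25 index and the embedding model keeps document IDs aligned across the two retrievers, which is what RRF needs in order to recognise agreement.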

Latency targets

For interactive Q&A interfaces, target end-to-end retrieval (BM25 + vector, parallel) under 150ms at p95. Add 80–200ms for a cross-encoder re-ranker if you use one. LLM generation adds 500ms–2s depending on output length and model. Total response time for a well-tuned hybrid pipeline with re-ranking should be under 3 seconds at p95 for most deployments.
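Checking a p95 target against logged timings needs only the standard library. In the sketch below the p95 and within_budget helpers are hypothetical names and the latency samples are made-up values for illustration, not measurements.

```python
import statistics


def p95(samples_ms):
    """95th percentile of latency samples, using linear interpolation."""
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile
    return statistics.quantiles(samples_ms, n=100, method="inclusive")[94]


def within_budget(samples_ms, budget_ms):
    """True when the p95 latency is at or below the budget."""
    return p95(samples_ms) <= budget_ms


# Illustrative retrieval timings in milliseconds (not real measurements)
retrieval_ms = [42, 55, 61, 48, 73, 90, 66, 58, 120, 51,
                47, 69, 140, 64, 57, 49, 77, 83, 52, 60]
print(p95(retrieval_ms), within_budget(retrieval_ms, budget_ms=150))
```

Tracking retrieval, re-ranking, and generation as separate series makes it clear which stage is consuming the budget when the total drifts upward.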

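Run BM25 and vector retrieval in parallel. In Python, concurrent.futures.ThreadPoolExecutor works for blocking retriever calls, and asyncio.gather works when the retrievers are async or are wrapped with asyncio.to_thread. The two operations are independent, so running them concurrently brings retrieval latency down to roughly that of the slower retriever.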
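A minimal thread-based sketch of the parallel call is shown below. The retrieve_in_parallel helper and the two fake retrievers are hypothetical stand-ins; the sleeps simulate network or index latency. Threads help most when the retrievers spend their time in I/O or in native code that releases the GIL.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def retrieve_in_parallel(query, bm25_search, vector_search, top_k=50):
    """Run two independent retrievers concurrently and return both ranked lists."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        bm25_future = pool.submit(bm25_search, query, top_k)
        vector_future = pool.submit(vector_search, query, top_k)
        # result() blocks until each retriever finishes and re-raises its errors
        return bm25_future.result(), vector_future.result()


# Hypothetical stand-ins for real retriever calls
def fake_bm25_search(query, top_k):
    time.sleep(0.05)  # simulated lexical index latency
    return ["doc_a", "doc_b"][:top_k]


def fake_vector_search(query, top_k):
    time.sleep(0.05)  # simulated vector store latency
    return ["doc_b", "doc_c"][:top_k]


bm25_hits, vector_hits = retrieve_in_parallel(
    "ISO 27001 audit requirements", fake_bm25_search, fake_vector_search
)
print(bm25_hits, vector_hits)
```

In a long-running service, creating the executor once and reusing it across requests avoids paying thread start-up cost on every query.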

Monitoring retrieval quality

Log the following for every query in production: query text, BM25 top-10 document IDs and scores, vector top-10 document IDs and scores, RRF-merged top-10, re-ranker top-5 (if applicable), the final context passed to the LLM, and the LLM answer. This telemetry is the foundation of any improvement effort.
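The fields listed above map naturally onto one structured log record per query. The sketch below is one possible shape; RetrievalLogRecord and its field names are assumptions, and the sample values are invented for illustration.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class RetrievalLogRecord:
    """One JSON-serialisable telemetry record per query."""
    query: str
    bm25_top: list      # [(doc_id, score), ...]
    vector_top: list    # [(doc_id, score), ...]
    fused_top: list     # [(doc_id, rrf_score), ...]
    reranked_top: list = field(default_factory=list)  # empty if no re-ranker
    context_doc_ids: list = field(default_factory=list)
    answer: str = ""

    def to_json(self) -> str:
        return json.dumps(asdict(self), ensure_ascii=False)


record = RetrievalLogRecord(
    query="ISO 27001 audit requirements",
    bm25_top=[("doc_12", 14.2), ("doc_3", 9.8)],
    vector_top=[("doc_3", 0.81), ("doc_44", 0.79)],
    fused_top=[("doc_3", 0.0325), ("doc_12", 0.0164)],
    context_doc_ids=["doc_3", "doc_12"],
    answer="(LLM answer text)",
)
line = record.to_json()
print(line)
```

Writing one JSON line per query keeps the log easy to sample later when you build a labelled evaluation set.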

Define at least one automated retrieval quality signal. The simplest is "did any document from the ground-truth set for this query type appear in the top-5?" If you do not have a labelled evaluation set, build one: sample 50–100 representative queries from your production logs, have a domain expert label the correct source documents for each, and run your pipeline against this set weekly. This is the practice that separates teams who improve their RAG systems systematically from those who iterate on hunches.
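The "any ground-truth document in the top-5" signal is a hit rate at k. A minimal sketch, assuming the labelled set is a mapping from query text to the set of correct document IDs (hit_rate_at_k and the toy data are hypothetical):

```python
def hit_rate_at_k(retrieved, ground_truth, k=5):
    """Fraction of labelled queries with at least one correct document in the top k."""
    if not ground_truth:
        raise ValueError("evaluation set is empty")
    hits = 0
    for query, relevant_ids in ground_truth.items():
        top_k = retrieved.get(query, [])[:k]  # a missing query counts as a miss
        if any(doc_id in relevant_ids for doc_id in top_k):
            hits += 1
    return hits / len(ground_truth)


# Toy evaluation set: query -> IDs an expert labelled as correct sources
ground_truth = {
    "ISO 27001 audit requirements": {"doc_12"},
    "penalty under Article 83(5) GDPR": {"doc_40", "doc_41"},
    "parental bereavement leave": {"doc_7"},
}
# Pipeline output: query -> ranked document IDs
retrieved = {
    "ISO 27001 audit requirements": ["doc_3", "doc_12", "doc_9"],
    "penalty under Article 83(5) GDPR": ["doc_39", "doc_2", "doc_8"],
    "parental bereavement leave": ["doc_7"],
}
score = hit_rate_at_k(retrieved, ground_truth, k=5)
print(f"hit rate@5 = {score:.2f}")
```

Running the same function separately over the BM25, vector, and fused lists shows which retriever is responsible for each miss.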

Track the BM25-vector overlap rate: the fraction of queries for which the BM25 and vector top-5 share at least one document. High overlap (above 70%) means you may not need hybrid: the two retrievers largely agree, and vector-only might suffice. Low overlap (below 30%) means the two signals are largely complementary and hybrid fusion is delivering substantial value.
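Reading the overlap rate as the fraction of queries whose two top-5 lists share at least one document (the reading that the 70% and 30% thresholds imply), it can be computed directly from logged candidate lists. The overlap_rate helper and the toy lists below are hypothetical.

```python
def overlap_rate(bm25_lists, vector_lists, k=5):
    """Fraction of queries whose BM25 and vector top-k share at least one document."""
    if len(bm25_lists) != len(vector_lists):
        raise ValueError("need one BM25 list and one vector list per query")
    if not bm25_lists:
        raise ValueError("no queries to evaluate")
    overlapping = sum(
        1 for bm25_ids, vector_ids in zip(bm25_lists, vector_lists)
        if set(bm25_ids[:k]) & set(vector_ids[:k])
    )
    return overlapping / len(bm25_lists)


# One entry per logged query: ranked document IDs from each retriever
bm25_lists = [["a", "b", "c"], ["d", "e"], ["f"]]
vector_lists = [["c", "x"], ["y", "z"], ["f", "g"]]

rate = overlap_rate(bm25_lists, vector_lists, k=5)
print(f"overlap rate = {rate:.2f}")
```

Segmenting this metric by query type (identifier-heavy versus natural language) is usually more informative than a single global number.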

For teams building on top of models like Llama 4 Maverick or Scout, the retrieval quality improvements from hybrid search compound with the improved context-window utilisation of these models. A hybrid pipeline that surfaces the right documents into a long context window (up to 10M tokens on Scout) is a materially different system from a vector-only pipeline filling the same context length with weaker candidates.

Infrastructure considerations

For corpora under 500,000 documents, an in-memory BM25 index via rank_bm25 with embeddings stored in a lightweight vector store (FAISS, Chroma) is adequate and operationally simple. For larger corpora, or where you need document-level access controls, prefer a managed search platform with native hybrid support: Elasticsearch and OpenSearch both offer BM25 + vector hybrid search, with RRF available as a built-in fusion method in recent versions. Weaviate's hybrid search operator likewise runs both retrievers and fuses the results internally, using a rank-based or a relative-score fusion method depending on configuration.

If you are deploying on Indian cloud infrastructure (AWS Mumbai, GCP Mumbai, Azure Pune) or UK-based infrastructure (AWS London, Azure UK South), the latency profile is similar — BM25 and dense retrieval both operate in-region and the dominant latency factor is cross-encoder inference if you add it. Consider deploying the cross-encoder on the same instance as your retrieval service to avoid serialisation overhead on the candidate list.
