What you need to know
Most RAG systems that fail in production do not fail at the language model. They fail at retrieval — the right chunk was never fetched, or it was fetched and then buried at rank 40 where the generator never saw it — and the team has no trace to prove what went wrong. This guide is the architecture that fixes both problems. Three things to take away before we go deep:
- Hybrid retrieval is the 2026 baseline. Run lexical BM25 and dense semantic search in parallel and fuse the two ranked lists with reciprocal rank fusion. Published benchmarks put the lift at roughly +8 percentage points on Recall@5 over BM25 alone on text-and-table corpora.
- A reranker is your highest-leverage precision stage. A cross-encoder over the top 50 to 100 fused candidates lifts Recall@5 from around 0.70 to above 0.81 in financial-QA benchmarking — the relevant chunk moves into the few results the generator actually reads.
- You cannot operate what you cannot see. Wrap every retrieval, rerank, generation and tool call in an OpenTelemetry span. Trace before you ship, not after the incident.
This article deliberately does not re-teach chunking and embedding strategy. That is a deep topic in its own right, and we have a dedicated guide for it — see RAG chunking strategies: fixed, semantic and hierarchical. Here we assume your chunks already exist and focus on the two layers that separate a demo from a production system: hybrid retrieval with reranking, and agent observability.
Treat retrieval, reranking and generation as three independent, swappable modules behind clean interfaces. When recall drops next quarter you want to A/B a new reranker without touching the generator — and you want the trace to tell you which module regressed.
Prerequisites and the shape of the system
This guide assumes you already have a working corpus: documents split into chunks, each chunk embedded with a sentence-embedding model and stored in a vector index (pgvector on a Mumbai or London Postgres instance, Qdrant, or any equivalent), plus the raw text available for a lexical index. If you are not there yet, build that first and come back.
The reference architecture is a linear pipeline of modular stages:
query
  -> [ retriever ]  BM25 list + dense list (run in parallel)
  -> [ fusion    ]  reciprocal rank fusion -> single ranked list
  -> [ reranker  ]  cross-encoder scores top 50-100 candidates
  -> [ generator ]  LLM answers over top 3-8 reranked chunks
  -> answer + citations

every stage emits an OpenTelemetry span: latency, inputs, outputs, cost
Each box is a function with a stable signature, so any stage can be replaced or removed independently. The observability layer is not a box in the diagram because it wraps every box. That is the whole point: the reference architecture is the pipeline plus the trace.
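To make "a function with a stable signature" concrete, here is a minimal sketch of three stages composed behind plain callables. The type aliases, `make_pipeline` and the stand-in stages are all hypothetical names chosen for illustration; they are not part of any library, and a real system would plug in the retriever, reranker and generator it actually uses.

```python
from typing import Callable, Sequence

# Hypothetical stage signatures; the names are illustrative only.
Retriever = Callable[[str], list[str]]                 # query -> candidate chunk IDs
Reranker = Callable[[str, Sequence[str]], list[str]]   # query, candidates -> best IDs
Generator = Callable[[str, Sequence[str]], str]        # query, chunk IDs -> answer text


def make_pipeline(retrieve: Retriever, rerank: Reranker, generate: Generator):
    """Compose three independent stages into one callable."""
    def run(query: str) -> dict:
        candidates = retrieve(query)
        top_ids = rerank(query, candidates)
        return {"answer": generate(query, top_ids), "sources": top_ids}
    return run


# Stand-in stages so the sketch runs without an index or a model.
def fake_retrieve(query: str) -> list[str]:
    return ["c3", "c1", "c2"]


def keep_first_two(query: str, ids: Sequence[str]) -> list[str]:
    return list(ids)[:2]


def sorted_top_two(query: str, ids: Sequence[str]) -> list[str]:
    return sorted(ids)[:2]


def fake_generate(query: str, ids: Sequence[str]) -> str:
    return f"answer to {query!r} from {', '.join(ids)}"


# Two pipelines that differ only in the reranker: the shape of an A/B test.
pipeline_a = make_pipeline(fake_retrieve, keep_first_two, fake_generate)
pipeline_b = make_pipeline(fake_retrieve, sorted_top_two, fake_generate)
result_a = pipeline_a("reset password")
result_b = pipeline_b("reset password")
```

Because the retriever and generator are shared, any difference between `result_a` and `result_b` can only come from the reranker, which is the property you want when a trace has to tell you which module regressed.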
Why hybrid retrieval, and how fusion actually works
Dense embeddings are excellent at semantic similarity — they understand that "how do I cancel my plan" and "stop my subscription" mean the same thing. But they smooth over exactly the tokens that often matter most in production: product codes, error identifiers, statute numbers, the name of a specific API, a part number. Lexical BM25, the decades-old sparse method, matches those exact terms reliably but is blind to paraphrase. The two methods fail in opposite directions, which is precisely why combining them works.
As of mid-2026, hybrid retrieval is widely treated as the minimum viable retriever for a serious RAG deployment rather than an optimisation you bolt on later. The reason you fuse ranked lists rather than raw scores is that a BM25 score and a cosine similarity live on completely incompatible scales — adding or averaging them is statistically meaningless without careful normalisation. Reciprocal rank fusion (RRF) sidesteps this entirely by ignoring the scores and using only rank positions.
RRF gives each document a score of 1 / (k + rank) in each list, then sums those contributions across lists. The smoothing constant k defaults to 60, the value from the original 2009 RRF paper and still the community default. A document that ranks high in either the lexical or the dense list — or, better still, both — rises to the top of the fused list.
from rank_bm25 import BM25Okapi
import numpy as np

# Assume: `chunks` is a list of dicts {id, text, embedding}
# and `embed(text)` returns a normalised numpy vector.

# --- Build the lexical index once, at ingest time ---
tokenised_corpus = [c["text"].lower().split() for c in chunks]
bm25 = BM25Okapi(tokenised_corpus)
doc_matrix = np.stack([c["embedding"] for c in chunks])  # (N, dim), L2-normalised


def bm25_search(query: str, top_k: int = 50):
    scores = bm25.get_scores(query.lower().split())
    ranked = np.argsort(scores)[::-1][:top_k]
    return [chunks[i]["id"] for i in ranked]


def dense_search(query: str, top_k: int = 50):
    q = embed(query)                 # (dim,)
    sims = doc_matrix @ q            # cosine, since both are normalised
    ranked = np.argsort(sims)[::-1][:top_k]
    return [chunks[i]["id"] for i in ranked]


def reciprocal_rank_fusion(ranked_lists, k: int = 60, top_k: int = 100):
    """Fuse several ranked ID lists. k=60 is the canonical default."""
    scores: dict[str, float] = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    fused = sorted(scores, key=scores.get, reverse=True)
    return fused[:top_k]


def hybrid_retrieve(query: str, top_k: int = 100):
    lexical = bm25_search(query, top_k=50)
    dense = dense_search(query, top_k=50)
    return reciprocal_rank_fusion([lexical, dense], k=60, top_k=top_k)
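To see the fusion arithmetic on numbers small enough to check by hand, here is a toy run of the same 1 / (k + rank) rule. The chunk IDs and the helper name `rrf_scores` are made up for illustration; the helper returns the raw fused scores so you can inspect them.

```python
def rrf_scores(ranked_lists, k=60):
    """Return {doc_id: fused score} using 1 / (k + rank), with ranks starting at 1."""
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores


# Hypothetical chunk IDs: "b" is second in both lists, "a" and "d" each top one list.
lexical = ["a", "b", "c"]
dense = ["d", "b", "e"]

scores = rrf_scores([lexical, dense])
fused = sorted(scores, key=scores.get, reverse=True)

# "b" scores 1/62 + 1/62 (about 0.0323); "a" and "d" score 1/61 each (about 0.0164).
# Agreement between the two retrievers outweighs a single first place.
```

With k = 60 the gap between adjacent ranks is small, so a document both retrievers agree on comfortably beats one that only a single retriever ranked first.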
Two practical notes. First, retrieve a generous candidate pool from each method (50 each is a sensible starting point) because the reranker downstream needs raw material to work with. Second, in production you would push the BM25 stage into your search engine — Elasticsearch, OpenSearch or Postgres full-text — rather than holding the index in process; the logic is identical, only the call site changes.
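Once both searches are remote calls (a search engine for BM25, a vector store for the dense side), running them in parallel is worth doing. The sketch below is one way to do it with the standard library, assuming the two searches are I/O-bound; `lexical_search` and `vector_search` are stand-ins that return canned IDs, and `retrieve_in_parallel` is a hypothetical helper name.

```python
from concurrent.futures import ThreadPoolExecutor


# Stand-ins for the real BM25 and dense searches; each returns ranked chunk IDs.
def lexical_search(query: str, top_k: int = 50) -> list[str]:
    return ["c7", "c2", "c9"][:top_k]


def vector_search(query: str, top_k: int = 50) -> list[str]:
    return ["c2", "c4", "c7"][:top_k]


def retrieve_in_parallel(query: str, top_k: int = 50) -> tuple[list[str], list[str]]:
    """Run both searches concurrently and return (lexical, dense) ranked lists."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        lexical_future = pool.submit(lexical_search, query, top_k)
        dense_future = pool.submit(vector_search, query, top_k)
        # .result() blocks until each search finishes and re-raises its exceptions.
        return lexical_future.result(), dense_future.result()


lexical, dense = retrieve_in_parallel("error E1234 on login")
```

Threads help most when each search spends its time waiting on the network. For purely in-process, CPU-bound searches the gain is smaller, so measure before assuming the stage got faster.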
Some teams replace RRF with a weighted score combination (a convex blend such as 0.5 BM25 plus 0.5 dense after min-max normalisation). On some corpora that edges out RRF, but it is fragile: it needs per-corpus normalisation and weight tuning, and it breaks silently when a new document distribution arrives. RRF is the robust default precisely because it has one parameter and no score-scaling assumptions. Tune away from it only with an evaluation harness watching.
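For comparison, here is a minimal sketch of that weighted alternative, so the extra moving parts are visible. It assumes each retriever returns a dict of raw scores per chunk ID, and that a chunk missing from one list contributes zero for that list; both are illustrative choices, as are the function names and the sample scores.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1]; an empty or constant input maps to zeros."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def weighted_blend(bm25: dict[str, float], dense: dict[str, float], alpha: float = 0.5):
    """Convex blend: alpha * BM25 + (1 - alpha) * dense, after min-max scaling."""
    bm25_n, dense_n = min_max(bm25), min_max(dense)
    blended = {
        # A chunk absent from one list contributes 0 for that list (an assumption).
        doc_id: alpha * bm25_n.get(doc_id, 0.0) + (1 - alpha) * dense_n.get(doc_id, 0.0)
        for doc_id in set(bm25_n) | set(dense_n)
    }
    return sorted(blended.items(), key=lambda item: item[1], reverse=True)


# Hypothetical raw scores on their native, incompatible scales.
bm25_raw = {"a": 12.0, "b": 8.0, "c": 2.0}
dense_raw = {"b": 0.90, "c": 0.70, "d": 0.50}

ranking = weighted_blend(bm25_raw, dense_raw, alpha=0.5)
```

Note how much the result depends on the minimum and maximum of each list: one outlier score rescales everything else, which is the fragility described above and the reason RRF remains the safer default.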
The reranking stage: cross-encoders earn their latency
Fusion gives you a good list of 100 candidates. The problem is that your generator should only read a handful — typically three to eight chunks — or it drowns in noise and your token bill balloons. The reranker's job is to take that list of 100 and decide which few are genuinely most relevant.
The dense half of the retrieval stage used a bi-encoder: query and documents were embedded separately, which is fast and scales to millions of chunks but loses the fine-grained interaction between the query's words and the document's words. A cross-encoder reranker feeds the query and one candidate document into the model together, so every query token can attend to every document token. It is far more accurate and far slower, which is exactly why you run it only on the top 100 candidates and not on the whole corpus.
The numbers justify the latency. In financial-QA benchmarking, a two-stage pipeline of hybrid retrieval followed by neural reranking reached Recall@5 of 0.816, against 0.695 for hybrid RRF alone (a 17.4% relative lift), 0.644 for BM25 and 0.587 for dense retrieval. Separate benchmarking on text-and-table data shows Recall@5 climbing as the candidate pool grows — about 0.826 at 50 candidates and 0.888 at 100 — before degrading if you rerank far beyond that. The practical sweet spot reported across studies is reranking 50 to 200 candidates; push past it and recall starts to fall again while latency keeps rising.
from sentence_transformers import CrossEncoder

# Load once at startup. BGE / mxbai / Cohere-style cross-encoders all fit here.
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3", max_length=512)


def rerank(query: str, candidate_ids: list[str], top_n: int = 6):
    by_id = {c["id"]: c for c in chunks}
    pairs = [(query, by_id[cid]["text"]) for cid in candidate_ids]
    scores = reranker.predict(pairs)  # one score per (query, doc) pair
    order = sorted(range(len(candidate_ids)),
                   key=lambda i: scores[i], reverse=True)
    return [candidate_ids[i] for i in order[:top_n]]


def retrieve_for_generation(query: str, top_n: int = 6):
    fused = hybrid_retrieve(query, top_k=100)  # fusion over 100
    return rerank(query, fused, top_n=top_n)   # cross-encoder picks the best 6
| Configuration | Recall@5 (financial-QA benchmark) | Relative latency | Relative cost |
|---|---|---|---|
| Dense-only (bi-encoder) | 0.587 | Low (1×) | Low — one ANN lookup |
| BM25-only (lexical) | 0.644 | Low (1×) | Low — one index query |
| Hybrid + RRF | 0.695 | Low–medium (~2×) | Low — two lookups, cheap fusion |
| Hybrid + RRF + cross-encoder rerank | 0.816 | Medium–high (rerank dominates) | Medium — GPU/API call per query |
Figures are drawn from published 2026 retrieval benchmarking and will shift with your corpus, embedding model and reranker; treat them as the shape of the trade-off, not a guarantee for your data. The shape is consistent across the literature, though: each stage you add buys recall at the cost of latency, and reranking is where most of the recall lives.
Agent observability: the part that gets skipped
Here is the differentiator. Plenty of teams build the retrieval pipeline above and ship it blind. Then a user reports a bad answer, and nobody can say whether retrieval missed the chunk, the reranker buried it, the generator ignored it, or a tool call timed out. Without a trace you are debugging by re-running prompts and guessing. Agent observability is the discipline that makes the system legible.
The pattern is structured tracing. A single user request becomes one trace, and each step within it — retrieval, fusion, rerank, each tool call, the final generation — becomes a child span. Every span records its latency, its inputs and outputs, and for model calls the token counts and cost. When something goes wrong, you open the trace and read the timeline.
As of mid-2026, the portable way to do this is the OpenTelemetry GenAI semantic conventions. The OpenTelemetry project formed its GenAI special-interest group in April 2024, and the conventions now define a standard vocabulary for prompts, responses, token usage and tool and agent calls. Using the standard attribute names means the same instrumentation feeds Langfuse, Arize, Datadog, Honeycomb, New Relic or your own collector without rewrites — and frameworks such as LangChain, CrewAI and AutoGen already emit compliant spans. The attributes you will reach for most often:
- gen_ai.operation.name: the operation type, for example chat or embeddings.
- gen_ai.request.model and gen_ai.response.model: the model requested and the model that actually answered.
- gen_ai.usage.input_tokens and gen_ai.usage.output_tokens: the per-call token counts that drive your cost attribution.
- gen_ai.system: the provider (for example anthropic or openai). Recent versions of the conventions deprecate this name in favour of gen_ai.provider.name, so check which version your back-end expects.
Here is the retrieval-plus-generation path instrumented with OpenTelemetry. Note how each stage is its own span and how token usage is attached using the convention attributes, so a downstream tool can compute cost per request without bespoke parsing.
from opentelemetry import trace

tracer = trace.get_tracer("rag.pipeline")


def answer(query: str) -> dict:
    # One span per user request: the root of the trace.
    with tracer.start_as_current_span("rag.request") as root:
        root.set_attribute("rag.query", query)

        # --- Retrieval span ---
        with tracer.start_as_current_span("rag.retrieve") as s:
            fused = hybrid_retrieve(query, top_k=100)
            s.set_attribute("rag.candidates", len(fused))

        # --- Rerank span ---
        with tracer.start_as_current_span("rag.rerank") as s:
            top_ids = rerank(query, fused, top_n=6)
            s.set_attribute("rag.reranked_to", len(top_ids))

        # --- Generation span, using GenAI semantic-convention attributes ---
        with tracer.start_as_current_span("rag.generate") as s:
            s.set_attribute("gen_ai.operation.name", "chat")
            s.set_attribute("gen_ai.request.model", "claude-sonnet")
            context = build_context(top_ids)
            resp = call_llm(query, context)  # your provider SDK call
            s.set_attribute("gen_ai.usage.input_tokens", resp.usage.input_tokens)
            s.set_attribute("gen_ai.usage.output_tokens", resp.usage.output_tokens)
            s.set_attribute("gen_ai.response.model", resp.model)

        return {"answer": resp.text, "sources": top_ids}
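To illustrate what "compute cost per request without bespoke parsing" can look like downstream, here is a small sketch that sums cost from span attributes. It assumes spans arrive as plain attribute dicts (how your exporter delivers them will differ), and the price table is entirely hypothetical: substitute your provider's real rates. `span_cost` and `request_cost` are illustrative names.

```python
# Hypothetical prices in USD per million tokens; replace with your provider's rates.
PRICE_PER_MTOK = {
    "example-model": {"input": 3.00, "output": 15.00},
}


def span_cost(attrs: dict) -> float:
    """Cost of one generation span, read from GenAI convention attributes."""
    # Bill against the model that actually answered, falling back to the one requested.
    model = attrs.get("gen_ai.response.model") or attrs["gen_ai.request.model"]
    price = PRICE_PER_MTOK[model]
    return (
        attrs.get("gen_ai.usage.input_tokens", 0) * price["input"]
        + attrs.get("gen_ai.usage.output_tokens", 0) * price["output"]
    ) / 1_000_000


def request_cost(spans: list[dict]) -> float:
    """Sum the cost of every span in a trace that carries token usage."""
    return sum(span_cost(s) for s in spans if "gen_ai.usage.input_tokens" in s)


# Attribute dicts for one trace: retrieval, rerank, then one generation call.
trace_spans = [
    {"rag.candidates": 100},
    {"rag.reranked_to": 6},
    {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": "example-model",
        "gen_ai.response.model": "example-model",
        "gen_ai.usage.input_tokens": 4_000,
        "gen_ai.usage.output_tokens": 500,
    },
]

cost = request_cost(trace_spans)  # (4000 * 3.00 + 500 * 15.00) / 1e6 = 0.0195 USD
```

Because the calculation keys only on the standard attribute names, it should keep working when you swap the model or add more generation spans to the same trace.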
The same discipline extends to tool calls in an agentic loop. Wrap each tool invocation in its own span, and record whether it succeeded, how long it took and what it returned. The failure modes you most want to catch in production are the ones a single number on a dashboard hides: a tool that silently returns an empty result, a retrieval step whose latency has crept from 80 ms to 600 ms, or a generation that quietly switched to a fallback model.
def call_tool(name: str, args: dict):
    with tracer.start_as_current_span(f"tool.{name}") as s:
        s.set_attribute("tool.name", name)
        s.set_attribute("tool.args", str(args))
        try:
            result = TOOLS[name](**args)
            s.set_attribute("tool.status", "ok")
            s.set_attribute("tool.result_len", len(str(result)))
            return result
        except Exception:  # noqa: BLE001 (tag the span, then re-raise)
            s.set_attribute("tool.status", "error")
            # No explicit record_exception() here: with its default settings,
            # start_as_current_span records the exception and sets ERROR status
            # when it propagates, so recording it again would duplicate the event.
            raise
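The quiet failure modes named above can be checked mechanically once spans are exported. The sketch below assumes each span has been flattened to a dict with `name`, `duration_ms` and `attributes` keys, which is an illustrative shape and not a real exporter format. `tool.result_empty` is a hypothetical boolean attribute you would need to set yourself, and the 200 ms budget is an arbitrary example.

```python
def flag_span(span: dict, latency_budget_ms: float = 200.0) -> list[str]:
    """Return human-readable warnings for one exported span (a plain dict here)."""
    attrs = span.get("attributes", {})
    warnings = []

    # A tool that reported success but returned nothing useful.
    if attrs.get("tool.status") == "ok" and attrs.get("tool.result_empty", False):
        warnings.append("tool returned an empty result")

    # Latency creep relative to a fixed budget.
    duration = span.get("duration_ms", 0.0)
    if duration > latency_budget_ms:
        warnings.append(
            f"latency {duration:.0f} ms exceeds {latency_budget_ms:.0f} ms budget"
        )

    # Providers often answer with a dated variant of the requested name,
    # so compare by prefix and flag only a genuinely different model.
    requested = attrs.get("gen_ai.request.model")
    answered = attrs.get("gen_ai.response.model")
    if requested and answered and not answered.startswith(requested):
        warnings.append(f"model fallback: {requested} -> {answered}")

    return warnings


spans = [
    {"name": "tool.search", "duration_ms": 40.0,
     "attributes": {"tool.status": "ok", "tool.result_empty": True}},
    {"name": "rag.retrieve", "duration_ms": 600.0,
     "attributes": {"rag.candidates": 100}},
    {"name": "rag.generate", "duration_ms": 150.0,
     "attributes": {"gen_ai.request.model": "model-large",
                    "gen_ai.response.model": "model-small"}},
    {"name": "rag.rerank", "duration_ms": 90.0,
     "attributes": {"rag.reranked_to": 6}},
]

report = {s["name"]: flag_span(s) for s in spans}
```

A check like this belongs in an alerting job or a nightly report over sampled traces. In practice you would set a separate latency budget per stage; the single budget here just keeps the sketch short.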
For the deep treatment of OpenTelemetry instrumentation — tail-sampled traces, cost attribution and span-attribute design — see our companion guide, instrumenting agents for production with OpenTelemetry. The principle to carry into every project is simple: trace before you ship. Build the spans into the pipeline from the first commit, so that on the day something breaks in production the answer is already in the trace.
Tag every span with a stable trace_id and surface it to the user as a request reference. When a Mumbai support agent or a London on-call engineer gets "answer was wrong for request ABC123", they paste the ID into your tracing back-end and read the exact retrieval, rerank and generation steps — no re-running, no guessing.
Evaluating retrieval and answer quality
Observability tells you what happened; evaluation tells you whether it was any good. The two operate on different layers of the pipeline, and you need both. Split your metrics cleanly:
- Retrieval metrics answer "did we fetch and rank the right chunk?" Recall@k measures whether the relevant chunk appears anywhere in the top k. It is your floor metric, because if the chunk is not retrieved nothing downstream can save the answer. nDCG (normalised discounted cumulative gain) goes further and rewards ranking the relevant chunk near the top, which matters because the generator weights early context more heavily. nDCG is the preferred reranker metric precisely because it captures graded relevance rather than a binary hit-or-miss.
- Answer metrics answer "was the generated response grounded and useful?" Faithfulness (also called groundedness) checks whether each claim in the answer is supported by the retrieved context; a faithfulness evaluator extracts the discrete claims and runs a natural-language-inference check of each against the sources. Answer relevancy checks the response actually addresses the query, and context relevance checks the retrieved chunks were on-topic. Together these three form what some teams call the RAG triad.
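The two retrieval metrics are short enough to write out, which makes the difference between them easy to see. This is a minimal sketch, not a replacement for an evaluation library: `recall_at_k` returns the fraction of relevant chunks found in the top k (0 or 1 when there is a single relevant chunk), and `ndcg_at_k` uses the linear-gain form of DCG, which is one of two common variants. The chunk IDs and relevance grades are made up.

```python
import math


def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids: list[str], gains: dict[str, float], k: int) -> float:
    """nDCG@k with graded gains, using gain / log2(rank + 1) at each rank."""
    def dcg(values):
        return sum(g / math.log2(rank + 1) for rank, g in enumerate(values, start=1))

    actual = dcg(gains.get(doc_id, 0.0) for doc_id in ranked_ids[:k])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0


# Hypothetical golden-set entry: c1 is highly relevant (3), c2 marginally so (1).
gains = {"c1": 3.0, "c2": 1.0}
before_rerank = ["c9", "c2", "c1", "c4", "c5"]
after_rerank = ["c1", "c2", "c9", "c4", "c5"]

recall_before = recall_at_k(before_rerank, set(gains), k=5)
recall_after = recall_at_k(after_rerank, set(gains), k=5)
ndcg_before = ndcg_at_k(before_rerank, gains, k=5)
ndcg_after = ndcg_at_k(after_rerank, gains, k=5)
```

Both orderings have the same Recall@5, because both relevant chunks are in the top five either way. Only nDCG moves when the reranker lifts the best chunk to rank one, which is why it is the more informative metric for that stage.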
In practice you run these with an LLM-as-judge framework — Ragas is the common open-source choice — against a fixed golden set of question-and-ideal-context pairs that you version like code. Run the suite in CI on every retriever or reranker change, and watch the trend, not a single run. For building that golden set and wiring up the judges, see build your first LLM evaluation suite.
Do not optimise faithfulness in isolation. A system that only ever answers "I don't know" is perfectly faithful and completely useless. Track faithfulness alongside answer relevancy so you catch the model trading helpfulness for safety — and always evaluate against the same golden set across versions, or your numbers are not comparable.
Common pitfalls when you put it together
- Reranking too many candidates. More is not better past a point. Recall@k degrades when you push a cross-encoder over hundreds of documents, and latency rises linearly. Stay in the 50-to-200 candidate band reported across benchmarks.
- Treating RRF k as a tuning knob without evidence. The default of 60 is robust; changing it should be an evaluation-driven decision, not a vibe.
- Coupling the stages. If your reranker call is hard-wired inside the retriever, you cannot A/B a new model. Keep retriever, reranker and generator as independent modules; see the six agent design patterns every builder should know for the broader modular-agent picture.
- Shipping without traces. The first production incident is the worst time to discover you have no observability. Instrument from commit one.
- Re-embedding the world for a chunking change. Decouple chunking and embedding decisions from retrieval logic; the chunking guide covers how to evolve those without a full re-index, and long-running agents need their own memory strategy — see agent memory management patterns.
Put the four layers together — hybrid retrieval, fusion, reranking and a fully traced pipeline with an evaluation harness watching it — and you have a RAG system you can operate, debug and improve with evidence rather than guesswork. That is the difference between a demo and production, and as of mid-2026 it is fast becoming the baseline that hiring managers in Bengaluru and Manchester expect to see in a serious Builder's portfolio.