What you need to know
Once your product calls an LLM, you lose the comfort of deterministic tests. The same prompt can return a different answer each run, so an exact-match assertion is useless and a human reviewing every output does not scale. The pragmatic answer the industry has converged on is the LLM-as-a-judge: a second model, given a rubric, that scores or ranks the outputs of the first. As of mid-2026 this is the default evaluation pattern behind most production agent, RAG and chat systems — but it is also where a lot of teams quietly fool themselves.
- Pick the right shape. Pairwise (A-vs-B) is more reliable for comparing systems; pointwise (direct scoring) gives you the absolute number you need to gate CI.
- Write the rubric, not vibes. A good judge prompt has an explicit scale, concrete criteria, and a require-reasoning-before-score instruction so the verdict is auditable.
- Mitigate the four biases. Position, verbosity, self-preference and formatting bias each have a cheap, concrete fix. Skip them and your numbers are noise.
- Calibrate against humans. A judge you have not measured against a golden set is a vibe with extra steps. Cohen's kappa tells you whether to trust it.
- Gate in CI and watch the cost. An eval that does not block a regression on a pull request will not save you in production. A judge can add 30 to 50 per cent to your inference bill, so sample and cache.
Everything in this guide is deliberately model-agnostic. Whether you judge with a frontier closed model or a self-hosted open-weight model, the rubric, the bias controls and the kappa calibration are the same. That is what keeps an eval suite useful for years rather than weeks — when the leading model changes, only the model name in your config changes.
Prerequisites
Before you write a single judge prompt, you want three things in place. First, a task definition that a human could grade — if your team cannot agree on what a good answer looks like, no judge can. Second, a handful of real production inputs, ideally a few dozen, captured with their outputs; synthetic-only eval sets drift away from reality fast. Third, an access path to two or more model families so you are never forced to let a model grade its own homework. Beyond that you need somewhere to store results (a table or a JSONL file under version control is plenty) and a CI runner. None of this requires a vendor platform; the whole pattern works with a Python script and a few API keys.
Pairwise versus pointwise: when to use each
There are two fundamental judging modes, and choosing wrongly is the most common early mistake.
Pointwise (direct scoring) hands the judge one output and a rubric, and asks for an absolute score — say 1 to 5 on faithfulness. It is cheap (one call per output), it produces a number you can threshold in CI, and it lets you track a metric over time. Its weakness is calibration drift: absolute judgements are hard, and a judge's notion of what a "4" means can wander between model versions and even between runs.
Pairwise (comparative) hands the judge two outputs — typically a candidate and a baseline — and asks which is better. Humans and models are both far more reliable at relative judgements than absolute ones, so pairwise is the gold standard for model selection, prompt iteration and A/B decisions. The cost is that you get a win-rate, not an absolute score, and you need a fixed baseline to compare against.
| Dimension | Pointwise (direct scoring) | Pairwise (comparative) |
|---|---|---|
| Best for | CI gating, trend monitoring, dashboards | Model/prompt selection, A/B decisions |
| Output | Absolute score (e.g. 1–5) | Win / lose / tie vs a baseline |
| Reliability | Lower — absolute calibration drifts | Higher — relative judgement is easier |
| Cost per item | 1 judge call | 2 calls (swapped order) for position control |
| Main failure mode | Score inflation/drift over time | No absolute number; needs a fixed baseline |
Run both. Use pairwise against a frozen baseline when you are deciding whether to ship a new prompt or model, and use pointwise on a fixed rubric for the nightly regression number that gates your pull requests. They answer different questions, and most mature eval suites carry both.
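To make the pairwise output concrete, here is a minimal sketch of turning a list of verdicts into a win-rate against the frozen baseline. The verdict strings and the convention of counting a tie as half a win are assumptions for illustration; pick whichever convention your team agrees on and keep it fixed.

```python
from collections import Counter


def win_rate(verdicts):
    """Summarise pairwise verdicts for a candidate against a frozen baseline.

    Each verdict is "win", "lose" or "tie" from the candidate's point of view
    (hypothetical labels, not a standard).
    """
    if not verdicts:
        raise ValueError("no verdicts to summarise")
    counts = Counter(verdicts)
    # Assumption: a tie counts as half a win, so two indistinguishable
    # systems land on a win-rate of 0.5.
    rate = (counts["win"] + 0.5 * counts["tie"]) / len(verdicts)
    return {
        "win": counts["win"],
        "lose": counts["lose"],
        "tie": counts["tie"],
        "win_rate": rate,
    }


summary = win_rate(["win", "win", "tie", "lose", "win"])
print(summary)
```

Reporting the raw win, lose and tie counts alongside the rate makes it obvious when a healthy-looking number is really a pile of ties.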
Copy-paste judge-prompt templates
A judge is only as good as its prompt. The three non-negotiables: an explicit rubric with concrete criteria, a defined score scale with anchor descriptions, and a require-reasoning-before-score instruction so the model commits to an argument before it commits to a number. Forcing the reasoning first measurably improves agreement and gives you an audit trail when a score looks wrong.
Pointwise judge (direct scoring on a rubric)
This template scores a single answer for faithfulness to a provided context — the workhorse metric for RAG. Keep the scale short; 1 to 5 is easier to calibrate than 1 to 10.
You are a strict, impartial evaluator. Score the ANSWER for FAITHFULNESS
to the CONTEXT only. Do not reward style, length, or fluency.
RUBRIC — Faithfulness (1-5):
5 = Every claim in the answer is directly supported by the context.
4 = All claims supported; one minor unsupported but harmless aside.
3 = Mostly supported; one claim not grounded in the context.
2 = Several claims unsupported or contradicted by the context.
1 = Answer is largely fabricated relative to the context.
RULES:
- Judge ONLY against the CONTEXT, never your own knowledge.
- Length and confident tone do NOT earn points.
- Reason step by step BEFORE giving a score.
CONTEXT:
{{context}}
ANSWER:
{{answer}}
Respond as JSON only:
{"reasoning": "<2-4 sentences citing specific claims>", "score": <1-5>}
Pairwise judge (A-vs-B with a tie option)
This template compares two answers to the same question. Note the explicit permission to call a tie — without it, judges manufacture a winner and add noise. You will call this twice per pair with A and B swapped (see bias mitigation below).
You are an impartial evaluator comparing two assistant answers to the
same QUESTION. Decide which answer is more helpful, correct, and
grounded. Ignore length and superficial polish.
CRITERIA (in priority order):
1. Factual correctness and grounding in the question's intent.
2. Completeness — does it actually resolve the request?
3. Clarity — only as a tie-breaker.
RULES:
- A longer answer is NOT automatically better.
- If the two answers are of genuinely similar quality, return "tie".
- Reason BEFORE deciding.
QUESTION:
{{question}}
ANSWER A:
{{answer_a}}
ANSWER B:
{{answer_b}}
Respond as JSON only:
{"reasoning": "<2-4 sentences>", "winner": "A" | "B" | "tie"}
Always request JSON and parse it. Free-text verdicts are unparseable at scale and the score buried in a paragraph is exactly where formatting bugs hide. Set a low temperature (0 to 0.3) on the judge for reproducibility, and validate that the parsed score is in range before you trust it: a judge that returns a "6" on a 1 to 5 scale is telling you it has stopped following the rubric.
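The parse-and-validate step can be sketched as follows. This is one possible shape, assuming the judge replies with the JSON object requested by the pointwise template; the function name and error messages are hypothetical.

```python
import json


def parse_pointwise_verdict(raw, lo=1, hi=5):
    """Parse a judge reply and return (score, reasoning), or raise ValueError."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge did not return valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("judge reply is not a JSON object")

    score = data.get("score")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int):
        raise ValueError(f"score must be an integer, got {score!r}")
    if not lo <= score <= hi:
        raise ValueError(f"score {score} is outside the {lo}-{hi} scale")

    reasoning = data.get("reasoning")
    if not isinstance(reasoning, str) or not reasoning.strip():
        raise ValueError("reasoning is missing or empty")
    return score, reasoning


reply = '{"reasoning": "Both claims appear in the context.", "score": 5}'
print(parse_pointwise_verdict(reply))
```

Raising on a bad reply, rather than silently defaulting to a middle score, keeps parse failures visible; you can count them and treat a rising failure rate as a signal in its own right.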
The bias-mitigation checklist
An uncalibrated judge carries four well-documented biases. Each one silently inflates or distorts your numbers, and each has a concrete, cheap fix. As of mid-2026 these are the four that matter most in practice.
Position bias
In pairwise judging, models often favour an answer because of where it appears rather than what it says. The first slot is the usual beneficiary, with judges sometimes choosing the first-positioned answer up to roughly three-quarters of the time regardless of quality. Fix: run every comparison twice with the order swapped, and only count a win when the judge picks the same answer in both orders. If the verdict flips when you swap, score it a tie. This doubles pairwise cost but it is the single highest-leverage control you can add.
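The swap-and-compare control can be sketched in a few lines. Here `judge` is a hypothetical callable that wraps your pairwise prompt and returns "A", "B" or "tie"; the two toy judges exist only to show the behaviour.

```python
def debiased_pairwise(judge, question, candidate, baseline):
    """Run the comparison in both orders and keep only order-consistent verdicts.

    `judge(question, answer_a, answer_b)` is a hypothetical callable that
    returns "A", "B" or "tie".
    """
    first = judge(question, candidate, baseline)   # candidate shown as A
    second = judge(question, baseline, candidate)  # candidate shown as B
    # Translate slot labels back into which system won in each run.
    winner_first = {"A": "candidate", "B": "baseline", "tie": "tie"}[first]
    winner_second = {"A": "baseline", "B": "candidate", "tie": "tie"}[second]
    if winner_first == winner_second:
        return winner_first
    return "tie"  # the verdict flipped with the order, so it is not trusted


def position_biased_judge(question, answer_a, answer_b):
    """Toy judge with pure position bias: always picks the first slot."""
    return "A"


def content_judge(question, answer_a, answer_b):
    """Toy judge that prefers the answer mentioning the refund window."""
    a_hit, b_hit = "30 days" in answer_a, "30 days" in answer_b
    if a_hit == b_hit:
        return "tie"
    return "A" if a_hit else "B"


q = "What is the refund window?"
print(debiased_pairwise(position_biased_judge, q, "Refunds within 30 days.", "Contact support."))
print(debiased_pairwise(content_judge, q, "Refunds within 30 days.", "Contact support."))
```

The position-biased judge collapses to a tie, while the content-driven judge returns the same winner in both orders.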
Verbosity bias
Judges tend to reward longer answers even when the extra words add nothing. Fix: state explicitly in the prompt that length earns no points (as in the templates above), and as a backstop normalise for length — if your winner is consistently the longer answer, audit a sample by hand and consider a length-controlled win-rate that discounts pure verbosity.
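A quick way to decide whether a hand audit is due is to measure how often the judged winner is simply the longer answer. The sketch below is an illustrative check, not a length-controlled win-rate; it assumes character count as the length measure and a hypothetical tuple layout for results.

```python
def longer_answer_win_share(results):
    """Share of decided comparisons in which the longer answer won.

    `results` is an iterable of (answer_a, answer_b, winner) tuples with
    winner in {"A", "B", "tie"} (hypothetical layout). Length is measured
    in characters, which is an assumption; token counts work the same way.
    Returns None when no comparison had a clear winner and a length gap.
    """
    decided = 0
    longer_won = 0
    for answer_a, answer_b, winner in results:
        if winner == "tie" or len(answer_a) == len(answer_b):
            continue  # nothing to learn about length from these
        decided += 1
        longer = "A" if len(answer_a) > len(answer_b) else "B"
        if winner == longer:
            longer_won += 1
    return longer_won / decided if decided else None


sample = [
    ("short", "a much longer answer", "B"),
    ("long long long", "tiny", "A"),
    ("abc", "abcdef", "A"),
    ("x", "yy", "tie"),
]
share = longer_answer_win_share(sample)
print(share)
```

A share well above one half does not prove verbosity bias, since longer answers are sometimes genuinely better, but it tells you where to point the manual audit.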
Self-preference bias
A model rates text written in its own style more highly — measured at roughly 10 to 25 per cent inflation when a model judges its own family. Fix: never let a model judge its own outputs without controls. Use a different model family as the judge, or run an ensemble of judges from different vendors and take the majority verdict. When you must use the same family, calibrate hard against human labels and treat the score as a lower-confidence signal.
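The majority-verdict idea can be sketched as below. Requiring a strict majority and falling back to a tie otherwise is an assumption; other aggregation rules are equally defensible.

```python
from collections import Counter


def majority_verdict(verdicts):
    """Combine verdicts from several judges into one.

    `verdicts` holds one label per judge, for example "A", "B" or "tie".
    Assumption: a label needs a strict majority to win; anything less is
    reported as a tie.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    label, count = Counter(verdicts).most_common(1)[0]
    if count * 2 > len(verdicts):
        return label
    return "tie"


print(majority_verdict(["A", "A", "B"]))
print(majority_verdict(["A", "B", "tie"]))
```

An odd number of judges from different families keeps two-way splits from deadlocking.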
Formatting bias
Markdown tables, bullet lists and bold headers can earn a higher score purely for looking organised. Fix: tell the judge to ignore formatting and grade substance, and where possible strip or normalise markdown before judging so two answers compete on content, not presentation.
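Normalising markdown before judging can be approximated with a few regular expressions. This is a rough sketch that handles headings, bullet markers, bold and inline code only; tables, links and nested emphasis are left untouched, so treat it as a starting point rather than a complete markdown stripper.

```python
import re


def strip_markdown(text):
    """Remove common markdown decoration so answers compete on content."""
    # Heading markers at the start of a line: "## Title" -> "Title"
    text = re.sub(r"^[ \t]{0,3}#{1,6}[ \t]+", "", text, flags=re.MULTILINE)
    # Bullet markers at the start of a line: "- item" -> "item"
    text = re.sub(r"^[ \t]*[-*+][ \t]+", "", text, flags=re.MULTILINE)
    # Bold: "**word**" or "__word__" -> "word"
    text = re.sub(r"(\*\*|__)(.+?)\1", r"\2", text)
    # Inline code: "`name`" -> "name"
    text = re.sub(r"`([^`]*)`", r"\1", text)
    # Collapse runs of spaces and tabs left behind by the removals.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


raw = "## Summary\n- **Fast** setup\n- uses `cache`"
print(strip_markdown(raw))
```

Apply the same normalisation to both answers in a pair so neither gains or loses from the clean-up itself.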
Do not let your newest, most expensive model be both the thing under test and the judge of its own family in the same run. It is the fastest way to ship a regression while your dashboard shows green — self-preference bias will quietly hand it the win.
Building a golden dataset and measuring agreement
A judge you have not measured is a guess. The discipline that turns it into an instrument is the golden dataset: a fixed set of inputs with human-assigned labels that you treat as ground truth. Aim for variety over volume to start — 50 to 200 examples that span your easy, hard and adversarial cases beat thousands of near-duplicates. Have at least two humans label independently, reconcile disagreements, and freeze the result under version control. This is the same golden-set discipline that underpins a broader evaluation suite built on golden sets and judges.
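The two-annotator step can be supported with a small helper that separates the items both labellers agree on from the ones that need reconciling. The dictionary layout (item id mapped to a label) is a hypothetical choice for this sketch.

```python
def split_by_agreement(labels_a, labels_b):
    """Separate agreed labels from items that need reconciliation.

    `labels_a` and `labels_b` map item ids to labels from two independent
    annotators (hypothetical layout). Items that only one annotator
    labelled are treated as disputed.
    """
    agreed = {}
    disputed = []
    for item_id in sorted(labels_a.keys() | labels_b.keys()):
        if (
            item_id in labels_a
            and item_id in labels_b
            and labels_a[item_id] == labels_b[item_id]
        ):
            agreed[item_id] = labels_a[item_id]
        else:
            disputed.append(item_id)
    return agreed, disputed


annotator_1 = {"q1": "pass", "q2": "fail", "q3": "pass"}
annotator_2 = {"q1": "pass", "q2": "pass", "q3": "pass", "q4": "fail"}
agreed, disputed = split_by_agreement(annotator_1, annotator_2)
print(agreed, disputed)
```

Only the reconciled result goes into the frozen golden file; the disputed list is often the best source of rubric clarifications.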
Now run your judge over the same examples and compare its verdicts to the human labels. The right statistic is Cohen's kappa, not raw accuracy, because kappa corrects for the agreement you would expect by chance. The widely cited Landis and Koch interpretation bands are the standard reference:
| Cohen's kappa (κ) | Interpretation | What it means for your judge |
|---|---|---|
| < 0.00 | Poor | Worse than chance: the rubric or judge is broken |
| 0.00 – 0.20 | Slight | Do not trust any number from this judge |
| 0.21 – 0.40 | Fair | Rewrite the rubric; usually ambiguous criteria |
| 0.41 – 0.60 | Moderate | Usable for coarse trends, not for tight gating |
| 0.61 – 0.80 | Substantial | The minimum bar for production gating |
| 0.81 – 1.00 | Almost perfect | Trust the judge as a stand-in for a human |
A small worked example makes it concrete. Suppose two judge prompts are scored against the same 100-item golden set, each item labelled pass or fail by humans:
| Judge configuration | Raw agreement | Cohen's κ | Band | Verdict |
|---|---|---|---|---|
| Judge v1 (vague "is this good?" prompt) | 78% | 0.38 | Fair | Reject — rewrite rubric |
| Judge v2 (explicit rubric + reasoning-first) | 89% | 0.71 | Substantial | Ship — gate-eligible |
| Judge v2 + ensemble (3 model families) | 92% | 0.82 | Almost perfect | Trust for high-stakes calls |
Notice that raw agreement rose by only 11 points (78 to 89 per cent) while kappa nearly doubled, from 0.38 (fair) to 0.71 (substantial). That gap is exactly why you report kappa: when the classes are imbalanced, as they are when most answers pass, high raw accuracy can hide a judge that is barely better than always guessing "pass". Compute kappa with any stats library: in Python, sklearn.metrics.cohen_kappa_score(human_labels, judge_labels).
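To see why raw agreement misleads on imbalanced data, it helps to compute kappa by hand once. The sketch below implements the standard two-rater formula, kappa = (p_o - p_e) / (1 - p_e), in plain Python; the 90/10 label split is made-up illustration data, not a result from any real judge.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where the raters match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's label frequencies, summed.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        freq_a[c] * freq_b[c] for c in freq_a.keys() | freq_b.keys()
    ) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one label")
    return (observed - expected) / (1 - expected)


# Illustration data: humans pass 90 of 100 items; the judge passes everything.
human = ["pass"] * 90 + ["fail"] * 10
lazy_judge = ["pass"] * 100

raw_agreement = sum(h == j for h, j in zip(human, lazy_judge)) / len(human)
print(raw_agreement)                   # 0.9
print(cohen_kappa(human, lazy_judge))  # 0.0
```

A judge that never fails anything agrees with the humans 90 per cent of the time here and still scores a kappa of zero, which is the failure raw accuracy hides.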
Gating CI/CD on eval scores
An eval that lives in a notebook protects nobody. The payoff comes when it runs on every pull request and fails the build if quality drops below a threshold you have calibrated. Here is a minimal GitHub Actions workflow that runs your judge over the golden set and blocks the merge if the mean score falls under the bar.
name: llm-eval-gate
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r eval/requirements.txt
      - name: Run LLM-as-a-judge evals
        env:
          JUDGE_API_KEY: ${{ secrets.JUDGE_API_KEY }}
          # Pin the judge region for data residency (see dual-market note)
          JUDGE_REGION: ${{ vars.JUDGE_REGION }}
        run: python eval/run_judge.py --golden eval/golden.jsonl --out eval/result.json
      - name: Gate on score threshold
        run: |
          python - <<'PY'
          import json, sys
          r = json.load(open("eval/result.json"))
          THRESHOLD = 0.80   # mean faithfulness, 0-1 normalised
          MIN_KAPPA = 0.60   # judge must agree with humans
          print(f"score={r['mean_score']:.3f} kappa={r['kappa']:.3f}")
          if r["kappa"] < MIN_KAPPA:
              sys.exit("FAIL: judge no longer trustworthy (kappa below 0.60)")
          if r["mean_score"] < THRESHOLD:
              sys.exit(f"FAIL: quality regressed below {THRESHOLD}")
          print("PASS")
          PY
Gate on the judge's own kappa as well as the score. If a model upgrade silently degrades the judge's agreement with your golden labels, a pure score threshold will happily pass a build whose evaluator has stopped working. Re-checking kappa in the same job catches a broken judge before it waves through a broken product.
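The gate reads two keys, mean_score and kappa, from eval/result.json. As a sketch of what the producing side might write, the helper below rescales 1 to 5 rubric scores onto 0 to 1 and packages them with a kappa value. The linear rescaling, the extra n key and the function name are assumptions, and the kappa of 0.71 is a placeholder borrowed from the example table rather than a computed value.

```python
import json


def build_result(scores, kappa, lo=1, hi=5):
    """Package judge output in the shape the CI gate expects.

    Assumption: scores on the lo-hi rubric scale are mapped linearly onto
    0-1, so 1 becomes 0.0 and 5 becomes 1.0.
    """
    if not scores:
        raise ValueError("no scores to aggregate")
    normalised = [(s - lo) / (hi - lo) for s in scores]
    return {
        "mean_score": sum(normalised) / len(normalised),
        "kappa": kappa,
        "n": len(scores),
    }


result = build_result([5, 4, 4, 5, 3], kappa=0.71)
print(json.dumps(result))  # {"mean_score": 0.8, "kappa": 0.71, "n": 5}
```

Whatever mapping you choose, fix it once and record it next to the threshold, because a 0.80 bar means something different under every normalisation.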
Controlling the cost
Judging is not free. Because every output gets scored at least once — and pairwise or multi-criteria judging multiplies that — an LLM judge typically adds on the order of 30 to 50 per cent to your inference spend. That is affordable if you are deliberate and ruinous if you score everything with your most expensive model. Four levers, in order of impact:
- Sample, do not score everything. You rarely need to judge 100 per cent of production traffic. A representative sample — say 1 to 5 per cent, stratified across your important input types — gives you a stable quality signal at a fraction of the cost.
- Use a cheaper judge model. A mid-tier or small model often agrees with humans well enough once you have validated it on the golden set. Spend the frontier model only on the nightly or pre-release golden run.
- Cache judge verdicts. Key the cache on a hash of the judge prompt plus the input and output. Identical inputs in CI re-runs then cost nothing, which matters because the same golden set runs on every pull request.
- Reserve depth for where it counts. Run cheap pointwise scoring continuously and the expensive swapped-order pairwise ensemble only at release gates.
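The caching lever can be sketched as follows. The key hashes the judge prompt, input and output as described above; including the judge model name in the key is an added assumption, so that a model change never serves stale verdicts. The in-memory dictionary stands in for whatever persistent store your CI can reach, and `fake_judge` is a stand-in for a real API call.

```python
import hashlib
import json


def verdict_cache_key(judge_model, judge_prompt, item_input, item_output):
    """Stable SHA-256 key over everything that determines the verdict."""
    payload = json.dumps(
        [judge_model, judge_prompt, item_input, item_output],
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class VerdictCache:
    """In-memory verdict cache (swap the dict for a file or database in CI)."""

    def __init__(self):
        self._store = {}

    def get_or_judge(self, judge, judge_model, judge_prompt, item_input, item_output):
        key = verdict_cache_key(judge_model, judge_prompt, item_input, item_output)
        if key not in self._store:
            # Only pay for a judge call on a cache miss.
            self._store[key] = judge(judge_prompt, item_input, item_output)
        return self._store[key]


calls = []


def fake_judge(judge_prompt, item_input, item_output):
    """Stand-in for a real judge call; records how often it is invoked."""
    calls.append(item_input)
    return {"score": 4}


cache = VerdictCache()
for _ in range(3):
    verdict = cache.get_or_judge(
        fake_judge, "judge-model-v1", "RUBRIC...", "What is the refund window?", "30 days."
    )
print(len(calls))  # 1
```

Because the prompt text is part of the key, editing the rubric invalidates old verdicts automatically.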
These levers compose with the same caching and routing discipline you would apply to the generation side — see our guide on cutting LLM costs with prompt caching and model routing for the underlying techniques, which apply equally to the judge.
Dual-market practicalities: India and the UK
Where you run the judge matters as much as how. The moment your golden set contains real production inputs, it almost certainly contains personal data (names, support tickets, account details), and that brings data-residency obligations into scope in both markets.
In India, the Digital Personal Data Protection regime pushes teams towards keeping personal data processed within the country where feasible; running your judge against an API endpoint in the AWS Mumbai (ap-south-1) region, or self-hosting an open-weight judge on Indian infrastructure, keeps a golden set of customer data in-jurisdiction. In the UK and EU, UK GDPR and EU GDPR make a London-region endpoint (AWS eu-west-2) or an EU region the safer default for the same reason. The practical move is to make the judge region a configuration value (note the JUDGE_REGION variable in the CI workflow above) so the same eval code can target Mumbai for Indian data and London for UK data without a fork.
If your golden set contains personal data, pin the judge to a regional endpoint that matches where the data must live (AWS Mumbai for India, AWS London for the UK), or self-host an open-weight judge so nothing leaves your boundary. Treat the golden set itself as a regulated asset: access-controlled, region-locked, and never copied into a notebook on someone's laptop.
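One way to enforce the pinning is a guard that refuses to run when the configured region does not match where the data must live. This is a sketch only: the jurisdiction codes "IN" and "UK" and the function name are hypothetical, and the region identifiers are the two AWS regions named above.

```python
import os

# Regions named in this guide; the jurisdiction codes are hypothetical labels.
REGION_BY_JURISDICTION = {
    "IN": "ap-south-1",  # AWS Mumbai
    "UK": "eu-west-2",   # AWS London
}


def resolve_judge_region(jurisdiction, env=None):
    """Return the configured judge region, or fail if it breaks residency."""
    env = os.environ if env is None else env
    if jurisdiction not in REGION_BY_JURISDICTION:
        raise ValueError(f"unknown data jurisdiction: {jurisdiction!r}")
    required = REGION_BY_JURISDICTION[jurisdiction]
    configured = env.get("JUDGE_REGION")
    if configured != required:
        raise RuntimeError(
            f"JUDGE_REGION is {configured!r} but {jurisdiction} data requires {required!r}"
        )
    return configured


print(resolve_judge_region("IN", env={"JUDGE_REGION": "ap-south-1"}))
```

Failing loudly before any golden-set row is sent is cheaper than discovering after the fact that customer data left the region.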
Pitfalls: how LLM-judge evals silently break
The dangerous failures are the quiet ones — the eval keeps returning numbers, so nobody looks, while the numbers have stopped meaning anything. As of mid-2026 these are the recurring ways teams get burned:
- The judge model changed under you. A provider deprecates or auto-upgrades the model behind an alias and your judge's calibration shifts overnight. Pin exact model versions for judges and re-run kappa whenever the version moves.
- The golden set went stale. Your product evolved but the golden set did not, so you are gating on questions nobody asks any more. Refresh it on a schedule and add every production incident as a new case.
- Score drift masquerading as improvement. Pointwise scores creep up over months not because quality rose but because the judge's sense of "4" softened. Periodic pairwise checks against a frozen baseline catch this.
- Test-set contamination. Your golden inputs leaked into a prompt, a few-shot example or fine-tuning data, so the system "passes" by memorisation. Keep the golden set out of any context the system under test can see.
- Gaming the rubric. Teams optimise the prompt until the judge is happy rather than until users are. Anchor the judge to human labels with kappa, and re-label a fresh human sample periodically to make sure the judge still tracks reality.
- Single-judge monoculture. One judge model becomes the arbiter of truth and its blind spots become your blind spots. An occasional ensemble or human spot-check keeps you honest.
A worked example: gating a RAG answer pipeline
To tie it together, picture a retrieval-augmented support assistant. You assemble a 120-item golden set of real questions, each with the retrieved context and a human pass/fail on faithfulness. You write the pointwise faithfulness judge above, run it against the golden set, map its 1 to 5 score to pass/fail at a fixed cutoff so it can be compared with the human labels, and measure Cohen's kappa at 0.71 (substantial, gate-eligible). You wire it into the GitHub Actions workflow with a 0.80 mean-score threshold and a 0.60 kappa floor. A fortnight later, a pull request that "improves" the prompt sails through code review but drops the mean faithfulness score to 0.74 because it started pulling in tangential context. The gate fails, the regression never ships, and the eval has earned its keep. This is the same evaluation discipline that sits underneath a robust retrieval stack. See our guides on hybrid retrieval and agent observability for production RAG and on RAG chunking and embedding strategies for the upstream pieces a judge ultimately scores.
The bottom line
LLM-as-a-judge is the only realistic way to evaluate non-deterministic systems at scale, and it works — but only when you treat the judge as an instrument that must itself be calibrated, not as an oracle. Choose pairwise or pointwise deliberately, write a real rubric with reasoning-before-score, neutralise the four biases, prove the judge against a golden set with Cohen's kappa, gate it in CI, and keep the cost in check by sampling and caching. Do that and your eval suite stays useful for years, surviving every model swap, because the discipline lives in the rubric and the calibration rather than in any one model.