Why evaluation is the craft skill most builders skip

There is a common arc in LLM application development. A builder spends two weeks getting the first working demo. The prompt is tuned by eye, a few awkward examples get smoothed over, and the system goes live feeling good. Then — nothing bad happens immediately. The application works. Users seem roughly satisfied. And so the next two weeks go into shipping new features, not measuring the existing ones.

This pattern holds until one of three things breaks the spell. A model upgrade changes subtle behaviour. A prompt change intended to fix one edge case introduces a regression elsewhere. Or the application scales to enough users that the tail of bad outputs, invisible in testing, becomes a daily customer complaint. At that point the builder discovers they have no baseline to regress against, no automated harness to catch the change before it shipped, and no way to know whether fixing the new failure will reintroduce the old one.

Evaluation is how professional software teams catch regressions before users do. In traditional software, this is called a test suite. In LLM applications, the same principle applies but the mechanics are different: outputs are probabilistic, correctness is often subjective, and a rule-based assertion that the string "Paris" appears in the output is not always the right measure of quality. The evaluation discipline that has emerged across teams shipping LLM applications in 2026 combines two complementary approaches — deterministic checks where the answer can be verified mechanically, and LLM-as-judge where it cannot — anchored to a curated golden test set and wired into CI so regressions are caught at the pull request stage rather than in production.

This is not a research topic. It is an operational practice that determines whether your LLM application stays reliable as it evolves. Builders who invest in it early spend far less time firefighting later. The rest of this guide is a step-by-step walkthrough of the full workflow — starting with the test set itself, because that is what everything else is measured against.

Anatomy of a golden test case

A golden test set is a curated collection of input/output pairs that represent the behaviour your system should exhibit. Each case is a contract: given this input, produce output that satisfies these criteria. The word "golden" matters: these are not randomly sampled examples. They are chosen deliberately to cover the realistic distribution of inputs your system will see, including the edge cases and adversarial inputs that reveal where the system is brittle.

The minimum viable test case has four fields: an input, an expected output or reference, a tolerance specification, and metadata. In practice you will also want tags so you can slice results by category, and a source field so you can trace where the case came from.

| Field | Type | Purpose | Example value |
|---|---|---|---|
| id | str | Stable identifier across runs | "tc-summarise-001" |
| input | dict | The full input payload to the system | {"text": "...", "max_words": 50} |
| expected_output | str or None | Reference answer for deterministic checks; None if judge-only | "The quarterly results showed..." |
| eval_type | str | Which evaluator to use: exact / regex / schema / similarity / judge | "judge" |
| tolerance | float | Similarity threshold (0–1) for semantic checks; ignored for exact/judge | 0.85 |
| tags | list[str] | For slicing results by category | ["edge-case", "long-input"] |
| source | str | Provenance: production log, human-curated, or synthetic | "production-log-2026-05" |
| min_judge_score | int | Minimum acceptable 1–5 judge score for this case to pass | 4 |
| metadata | dict | Evaluator-specific extras, such as a regex pattern, a JSON schema, or a task description | {"task_description": "..."} |

In Python, a clean representation uses a dataclass so the fields are typed and serialisable to JSON:

from dataclasses import dataclass, field, asdict
from typing import Optional
import json

@dataclass
class GoldenTestCase:
    id: str
    input: dict
    eval_type: str  # "exact" | "regex" | "schema" | "similarity" | "judge"
    expected_output: Optional[str] = None
    tolerance: float = 0.85          # semantic similarity threshold
    min_judge_score: int = 4         # 1-5; case passes if judge score >= this
    tags: list[str] = field(default_factory=list)
    source: str = "human-curated"
    metadata: dict = field(default_factory=dict)

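    def to_json(self) -> str:
        # Compact single-line JSON: an indented dump would span several
        # lines and break the one-object-per-line JSONL format used below
        return json.dumps(asdict(self))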


# Example: an exact-match case for a structured extraction task
case_exact = GoldenTestCase(
    id="tc-extract-entity-001",
    input={"text": "Invoice #INV-2026-0441 issued to Acme Ltd on 9 June 2026."},
    eval_type="exact",
    expected_output='{"invoice_number": "INV-2026-0441", "client": "Acme Ltd"}',
    tags=["extraction", "structured-output"],
    source="human-curated",
)

# Example: a judge-evaluated case for open-ended summarisation
case_judge = GoldenTestCase(
    id="tc-summarise-earnings-001",
    input={
        "text": "Full earnings transcript...",  # long input
        "max_words": 80,
    },
    eval_type="judge",
    expected_output=None,   # no reference; judge evaluates quality directly
    min_judge_score=4,
    tags=["summarisation", "finance"],
    source="production-log-2026-05",
)

# Serialise your golden set to JSONL for version control
def save_golden_set(cases: list[GoldenTestCase], path: str) -> None:
    with open(path, "w") as f:
        for c in cases:
            f.write(c.to_json() + "\n")

Store your golden set in version control alongside your code. A JSONL file (one JSON object per line) is easy to diff and review in pull requests. Aim for 200 to 500 cases at launch, with deliberate coverage of edge cases, high-stakes inputs, and the specific failure modes you have already observed. Grow the set incrementally — every production incident that reveals a new failure mode should generate at least one new golden case before you close the ticket.

Pro tip

Source at least 20% of your initial golden set from real production logs, not synthetic examples. Synthetic cases look clean but miss the messy, ambiguous inputs that real users actually send. If you do not have production logs yet, seed the set with examples you wrote yourself and plan to replace them progressively as real traffic arrives.
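
To keep an eye on both the storage format and the production-log share, a small loader can read the JSONL file back and report where the cases came from. The sketch below is illustrative only: `load_golden_set`, `production_share`, and `tag_counts` are hypothetical helper names, and it assumes each case is written as one compact JSON object per line with `source` values that start with `production-log` for log-derived cases.

```python
import json
from collections import Counter


def load_golden_set(path: str) -> list[dict]:
    """Read a JSONL golden set: one JSON object per non-empty line."""
    cases = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between cases
            try:
                cases.append(json.loads(line))
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_no} is not valid JSON: {exc}") from exc
    return cases


def production_share(cases: list[dict], prefix: str = "production-log") -> float:
    """Fraction of cases whose source field starts with the given prefix."""
    if not cases:
        return 0.0
    hits = sum(1 for c in cases if str(c.get("source", "")).startswith(prefix))
    return hits / len(cases)


def tag_counts(cases: list[dict]) -> Counter:
    """How many cases carry each tag, for spotting thin coverage."""
    return Counter(tag for c in cases for tag in c.get("tags", []))
```

Running `production_share` in CI and warning when it drops below 0.2 is one way to keep the set anchored to real traffic as it grows.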

Deterministic evals vs LLM-as-judge: when to use each

The choice between deterministic checks and an LLM judge is not philosophical — it is practical. Deterministic checks are fast, cheap, and perfectly reproducible. A judge is slower, costs more, and introduces its own noise. The rule is simple: use a deterministic check whenever you can define the correctness criterion mechanically. Use a judge when you cannot.

| Eval type | When to use | Strengths | Limitations |
|---|---|---|---|
| Exact match | Structured extraction, classification, routing decisions | Zero cost, perfectly reproducible, instant | Fails on valid paraphrases; brittle to whitespace/casing |
| Regex | Format validation (email, date, invoice number, postcode) | Near-zero cost, covers format families | Cannot assess semantic correctness |
| JSON schema | Structured output compliance (tool call returns, API responses) | Catches missing fields, wrong types, range violations | Does not check whether the values are correct, only whether they are valid |
| Semantic similarity | Paraphrase-tolerant matching, short-answer QA | Handles restatements and synonyms | Requires embedding call; threshold tuning needed per task |
| LLM-as-judge | Open-ended generation, tone, reasoning quality, policy adherence | Handles any quality dimension a human can articulate | Slower, costs tokens, needs calibration to avoid verbosity/position bias |

In practice, most real evaluation suites use all five. A structured output task gets exact match plus JSON schema validation. A short-answer task gets semantic similarity. A summarisation or explanation task gets an LLM judge. A good heuristic: start with the cheapest check that covers the criterion, then escalate only when it fails to distinguish good outputs from bad ones.

Deterministic evals in Python are straightforward. Here is a minimal implementation of all four non-judge types:

import re
import json
import jsonschema  # third-party: pip install jsonschema

def eval_exact(actual: str, expected: str) -> bool:
    return actual.strip() == expected.strip()

def eval_regex(actual: str, pattern: str) -> bool:
    return bool(re.search(pattern, actual))

def eval_json_schema(actual: str, schema: dict) -> bool:
    try:
        data = json.loads(actual)
        jsonschema.validate(data, schema)
        return True
    except (json.JSONDecodeError, jsonschema.ValidationError):
        return False

def eval_semantic_similarity(
    actual: str,
    expected: str,
    threshold: float = 0.85,
) -> bool:
    """
    Embeds both strings and compares them by cosine similarity.
    The Anthropic API does not offer its own embedding model, so plug a
    separate embedding provider into embed() below.
    """
    from numpy import dot
    from numpy.linalg import norm

    def embed(text: str) -> list[float]:
        # Placeholder: swap in your embedding call
        # e.g. openai.embeddings.create or voyage.embed
        raise NotImplementedError("Replace with your embedding provider")

    a, b = embed(actual), embed(expected)
    similarity = dot(a, b) / (norm(a) * norm(b))
    return float(similarity) >= threshold
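
Plain string equality is brittle for structured outputs: two JSON documents with the same values but a different key order or spacing will fail `eval_exact`. One cheap step before escalating to similarity or a judge is to compare by value instead. The helpers below are a sketch, not part of the original harness; `eval_exact_json` and `eval_exact_normalised` are hypothetical names.

```python
import json


def eval_exact_json(actual: str, expected: str) -> bool:
    """Compare two JSON documents by value, ignoring key order and whitespace."""
    try:
        return json.loads(actual) == json.loads(expected)
    except json.JSONDecodeError:
        # Output that is not valid JSON cannot match a JSON reference
        return False


def eval_exact_normalised(actual: str, expected: str) -> bool:
    """Exact match after collapsing whitespace runs and ignoring case."""
    def norm(s: str) -> str:
        return " ".join(s.split()).casefold()
    return norm(actual) == norm(expected)


# Same values, different key order: passes by value, would fail as raw strings
reference = '{"invoice_number": "INV-2026-0441", "client": "Acme Ltd"}'
output = '{"client": "Acme Ltd", "invoice_number": "INV-2026-0441"}'
print(eval_exact_json(output, reference))
```

Whether case-insensitive matching is acceptable depends on the task; for identifiers such as invoice numbers you may prefer to keep case significant.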

Calibrating your LLM judge: prompt and rubric design

An uncalibrated LLM judge is worse than useless — it gives you the illusion of measurement without the accuracy. The three most common failure modes are verbosity bias (longer answers score higher regardless of quality), position bias (the first option in a comparison scores higher regardless of which is better), and sycophancy (the judge agrees with whatever framing the prompt implies). All three are fixable with deliberate prompt engineering and calibration examples.

A well-structured judge prompt has four parts: a role definition that establishes what quality means for this task, a rubric that defines what each score means in concrete terms, calibration examples that show the judge what a 2/5 and a 4/5 look like (not just a 5/5), and an output format instruction that forces structured output so you can parse the score reliably.

import anthropic
import json

client = anthropic.Anthropic()

JUDGE_SYSTEM_PROMPT = """You are an expert evaluator assessing the quality of AI-generated responses.
Your task is to score the response on a 1-5 scale according to the rubric below.

## Scoring Rubric

5 — Excellent: Fully correct, appropriately concise, clear, and directly addresses the question. No unnecessary content.
4 — Good: Correct with minor gaps or slight verbosity. Would satisfy a careful reader.
3 — Acceptable: Mostly correct but missing a key point, or correct but significantly over-verbose.
2 — Poor: Contains a factual error or misses the main point, though some parts are correct.
1 — Failing: Wrong, harmful, off-topic, or refuses a reasonable request.

## Anti-bias Instructions

- Length is NOT a quality signal. A concise 2-sentence answer can score 5. A verbose 10-sentence answer can score 1.
- Do NOT be influenced by confident tone. An answer can sound authoritative and be wrong.
- Do NOT assume the response is good because it is grammatically correct.

## Output Format

Respond with a JSON object only. No prose outside the JSON.
{
  "score": <integer from 1 to 5>,
  "reasoning": "<one or two sentences explaining the score>",
  "issues": ["<specific issue>", "..."]
}
"""

def llm_judge(
    task_description: str,
    input_text: str,
    response: str,
    model: str = "claude-sonnet-4-6",
) -> dict:
    """
    Returns {"score": int, "reasoning": str, "issues": list[str], "passed": bool}
    """
    user_prompt = f"""## Task Description
{task_description}

## Input Given to the System
{input_text}

## System Response to Evaluate
{response}

Evaluate the response according to the rubric and return your JSON assessment."""

    result = client.messages.create(
        model=model,
        max_tokens=256,
        system=JUDGE_SYSTEM_PROMPT,
        messages=[{"role": "user", "content": user_prompt}],
    )
    raw = result.content[0].text.strip()

    # Strip markdown code fences if present
    if raw.startswith("```"):
        raw = raw.split("```")[1]
        if raw.startswith("json"):
            raw = raw[4:]

    parsed = json.loads(raw)
    parsed["passed"] = parsed["score"] >= 4  # configurable threshold
    return parsed


# Calibration: run the judge on known examples and verify scores match expectations
CALIBRATION_CASES = [
    {
        "task": "Summarise this earnings call in under 80 words.",
        "input": "Revenue grew 12% year-on-year to £480m...",
        "response": "Revenue rose 12% to £480m. Margins expanded 2pp. Management guided for continued growth next quarter.",
        "expected_min_score": 4,  # concise, accurate
    },
    {
        "task": "Summarise this earnings call in under 80 words.",
        "input": "Revenue grew 12% year-on-year to £480m...",
        "response": "The earnings call was very interesting and covered many topics. The company discussed its revenue, which went up. There was also discussion of margins and future plans. Overall it seemed positive.",
        "expected_max_score": 3,  # vague, padded, no numbers
    },
]

def check_judge_calibration() -> bool:
    all_pass = True
    for case in CALIBRATION_CASES:
        result = llm_judge(case["task"], case["input"], case["response"])
        if "expected_min_score" in case and result["score"] < case["expected_min_score"]:
            print(f"CALIBRATION FAIL: expected >= {case['expected_min_score']}, got {result['score']}")
            all_pass = False
        if "expected_max_score" in case and result["score"] > case["expected_max_score"]:
            print(f"CALIBRATION FAIL: expected <= {case['expected_max_score']}, got {result['score']}")
            all_pass = False
    return all_pass

Watch out

Run check_judge_calibration() whenever you change the judge system prompt. A prompt edit that looks like a small clarification can shift the score distribution by a full point, which means historical baselines are no longer comparable. Version your judge prompt alongside your golden set, and record which judge version was used for each eval run.
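
One lightweight way to record the judge version is to fingerprint the prompt text and stamp it onto every run. This is a sketch under assumptions: the function names and the shape of the run record are hypothetical, and a hash is only one of several reasonable versioning schemes.

```python
import hashlib
from datetime import datetime, timezone


def judge_prompt_version(prompt: str) -> str:
    """Short, stable fingerprint of the judge prompt text."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12]


def stamp_run(results: list[dict], judge_prompt: str, judge_model: str) -> dict:
    """Wrap per-case results with the metadata needed to compare runs later."""
    return {
        "judge_prompt_version": judge_prompt_version(judge_prompt),
        "judge_model": judge_model,
        "run_at": datetime.now(timezone.utc).isoformat(),
        "results": results,
    }


def baselines_comparable(run_a: dict, run_b: dict) -> bool:
    """Two runs are comparable only if the judge prompt and model match."""
    keys = ("judge_prompt_version", "judge_model")
    return all(run_a[k] == run_b[k] for k in keys)
```

If `baselines_comparable` returns False for the current run and the stored baseline, the safer move is to regenerate the baseline with the new judge rather than report a regression.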

For high-stakes tasks, add an agreement check: run the same case twice with position-swapped inputs (in pairwise comparisons) and verify the judge scores consistently. If swapping the order of two responses changes the winner, your calibration needs more examples. Agreement rates below 80% on your calibration set are a warning sign before you trust the judge on your golden set.
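
The position-swap check can be sketched independently of any particular judge. The code below assumes a hypothetical pairwise judge interface: a callable that receives two responses in presentation order and returns "first" or "second". That interface is an illustration, not something defined earlier in this guide.

```python
from typing import Callable

# Hypothetical interface: judge(first_response, second_response) returns
# "first" or "second" for whichever response it prefers.
PairwiseJudge = Callable[[str, str], str]


def is_position_consistent(judge: PairwiseJudge, a: str, b: str) -> bool:
    """True if the judge picks the same response in both presentation orders."""
    forward = judge(a, b)    # a shown first
    backward = judge(b, a)   # b shown first
    winner_forward = a if forward == "first" else b
    winner_backward = b if backward == "first" else a
    return winner_forward == winner_backward


def agreement_rate(judge: PairwiseJudge, pairs: list[tuple[str, str]]) -> float:
    """Share of pairs where swapping the order does not change the winner."""
    if not pairs:
        return 0.0
    consistent = sum(is_position_consistent(judge, a, b) for a, b in pairs)
    return consistent / len(pairs)


def always_first(first: str, second: str) -> str:
    """A maximally position-biased stand-in judge, for demonstration only."""
    return "first"


pairs = [("short answer", "a noticeably longer answer"), ("x", "yy")]
print(agreement_rate(always_first, pairs))  # a purely position-biased judge scores 0.0
```

Comparing the resulting rate against the 80% line gives you a concrete gate before trusting pairwise results.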

Regression loops: CI for your LLM app

The evaluation suite is only useful if it runs automatically. An eval harness that you run manually (when you remember, before a big release) misses most regressions, because most changes ship between those runs. The same harness wired into GitHub Actions and set to fail the build on a regression threshold checks every change before it merges, at least for the behaviours your golden set covers.

The pattern is straightforward: on every pull request, run the fast tier of evals (deterministic checks plus Haiku-based judges). On merge to main, run the full suite. Compare the pass rate against the golden baseline stored in the repository. If the regression exceeds the threshold, fail the workflow and post a summary as a PR comment.

# .github/workflows/llm-evals.yml
name: LLM Evaluation Suite

on:
  pull_request:
    paths:
      - "src/**"
      - "prompts/**"
      - "evals/**"
  push:
    branches: [main]

jobs:
  eval-fast-tier:
    name: Fast evals (deterministic + Haiku)
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write  # needed to post the results comment
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Run fast eval suite
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          python evals/run_evals.py \
            --tier fast \
            --golden-set evals/golden_set.jsonl \
            --baseline evals/baselines/main_baseline.json \
            --regression-threshold 0.05 \
            --output evals/results/pr_results.json

      - name: Post results comment
        # always() so the comment is still posted when the eval step fails the job
        if: ${{ always() && github.event_name == 'pull_request' }}
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const results = JSON.parse(
              fs.readFileSync('evals/results/pr_results.json', 'utf8')
            );
            const body = [
              '## Eval Results',
              `Pass rate: **${(results.pass_rate * 100).toFixed(1)}%** `
              + `(baseline: ${(results.baseline_pass_rate * 100).toFixed(1)}%)`,
              results.regression_detected
                ? `**REGRESSION DETECTED**: ${(results.regression_pct * 100).toFixed(1)}% drop`
                : 'No regression detected.',
              '',
              '| Category | Pass | Fail | Pass rate |',
              '|---|---|---|---|',
              ...results.by_category.map(c =>
                `| ${c.name} | ${c.pass} | ${c.fail} | ${(c.pass_rate * 100).toFixed(0)}% |`
              ),
            ].join('\n');
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body,
            });

  eval-full-suite:
    name: Full eval suite (Sonnet judges)
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    needs: eval-fast-tier
    permissions:
      contents: write  # needed to push the updated baseline
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Run full eval suite
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          python evals/run_evals.py \
            --tier full \
            --golden-set evals/golden_set.jsonl \
            --baseline evals/baselines/main_baseline.json \
            --regression-threshold 0.05 \
            --output evals/results/main_results.json

      - name: Update baseline on pass
        run: |
          cp evals/results/main_results.json \
             evals/baselines/main_baseline.json
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add evals/baselines/main_baseline.json
          git commit -m "chore: update eval baseline [skip ci]" || true
          git push

The regression threshold of 5% means: if the overall pass rate on the golden set drops by more than 5 percentage points compared to the stored baseline, the workflow fails. This blocks the pull request from merging. Teams with higher-stakes applications — healthcare, financial advice, legal summarisation — often set this to 2%. The threshold should be deliberate, documented, and reviewed periodically as the golden set grows and becomes more representative.
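
The workflow calls evals/run_evals.py, which is not shown in this guide. The comparison at its core might look like the sketch below. The function name is hypothetical; the returned keys are chosen to match the fields the PR comment script reads (pass_rate, baseline_pass_rate, regression_pct, regression_detected), and the threshold is treated as an absolute drop in pass rate.

```python
def compare_to_baseline(
    pass_rate: float,
    baseline_pass_rate: float,
    threshold: float = 0.05,
) -> dict:
    """Compare a run's pass rate with the stored baseline.

    All rates are fractions in [0, 1]. The threshold is an absolute drop,
    so 0.05 means five percentage points.
    """
    # Improvements count as zero regression rather than a negative drop
    drop = max(0.0, baseline_pass_rate - pass_rate)
    return {
        "pass_rate": pass_rate,
        "baseline_pass_rate": baseline_pass_rate,
        "regression_pct": drop,
        # Small epsilon so a drop of exactly the threshold is not tripped
        # by floating-point rounding
        "regression_detected": drop > threshold + 1e-9,
    }


report = compare_to_baseline(pass_rate=0.86, baseline_pass_rate=0.92)
exit_code = 1 if report["regression_detected"] else 0
print(report["regression_detected"], exit_code)
```

A runner would typically write the report to the --output path and then exit with that code, so the workflow step fails only after the results file exists for the comment step to read.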

Deployment context

When running evals in CI for applications deployed to AWS Mumbai (ap-south-1) or AWS London (eu-west-2), pin the model version explicitly in your eval runner. Cross-region latency in evaluation runs is acceptable; what is not acceptable is a model alias like claude-sonnet-4-6 silently resolving to a newer version mid-run, making your baseline comparison meaningless. Use the full model ID in eval scripts and update it intentionally.

Cost-aware evaluation: routing by complexity

The economics of evaluation are easy to get wrong in both directions. Running every eval case through Opus on every PR is prohibitively expensive and slow. Running everything through Haiku misses the nuanced quality dimensions that only a stronger judge can assess. The solution is tiered routing: match the model to the complexity of what you are measuring.

| Tier | Model | Cost (input / output) | Best for | Typical cost per 300-case run |
|---|---|---|---|---|
| Fast | claude-haiku-4-5-20251001 | $1.00 / $5.00 per MTok | Format checks, classification, on-topic binary, simple QA | ~$0.05 |
| Mid | claude-sonnet-4-6 | $3.00 / $15.00 per MTok | Summarisation quality, tone, instruction following, reasoning steps | ~$0.80 |
| Deep | claude-opus-4-5 (or latest Opus) | $5.00 / $25.00 per MTok | Complex multi-step reasoning, policy compliance, nuanced safety checks | ~$1.30 |
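
Per-run cost is simple arithmetic once you know how many cases reach a judge and roughly how long each prompt and verdict are. The sketch below uses the Sonnet row's $3.00 / $15.00 per MTok and assumed token counts (600 input and 100 output tokens per judge call); those counts are illustrative, so measure your own prompts before budgeting.

```python
def estimate_run_cost(
    n_judge_calls: int,
    input_tokens_per_call: int,
    output_tokens_per_call: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Estimated cost in dollars for one eval run's judge calls."""
    total_in = n_judge_calls * input_tokens_per_call
    total_out = n_judge_calls * output_tokens_per_call
    return (
        total_in * input_price_per_mtok + total_out * output_price_per_mtok
    ) / 1_000_000


# 300 judge calls, assumed 600 input + 100 output tokens each, Sonnet-tier prices
cost = estimate_run_cost(300, 600, 100, 3.00, 15.00)
print(f"${cost:.2f}")  # prints $0.99
```

Deterministic cases cost nothing, so a run in which only part of the golden set reaches a judge will come in below this kind of estimate.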

A practical cost-routing function selects the model based on the complexity tag of the test case. You set those tags when you build the golden set:

import anthropic

client = anthropic.Anthropic()

# Model identifiers — update when you intentionally upgrade
MODEL_FAST  = "claude-haiku-4-5-20251001"
MODEL_MID   = "claude-sonnet-4-6"
MODEL_DEEP  = "claude-opus-4-5"          # or latest Opus variant

def select_judge_model(case: "GoldenTestCase") -> str:
    """
    Route to the cheapest model that can reliably assess the eval dimension.

    Rules (applied in order):
    1. Deterministic eval types never need a model — caller should not call this.
    2. Cases tagged "safety", "policy", or "multi-step-reasoning" get Opus.
    3. Cases tagged "summarisation", "tone", "reasoning" get Sonnet.
    4. Everything else gets Haiku.
    """
    if case.eval_type != "judge":
        raise ValueError(f"Case {case.id} uses {case.eval_type}, not judge — no model needed.")

    deep_tags = {"safety", "policy", "multi-step-reasoning", "compliance", "legal"}
    mid_tags  = {"summarisation", "tone", "reasoning", "explanation", "nuance"}

    case_tags = set(case.tags)
    if case_tags & deep_tags:
        return MODEL_DEEP
    if case_tags & mid_tags:
        return MODEL_MID
    return MODEL_FAST


def run_eval_case(case: "GoldenTestCase", system_response: str) -> dict:
    """
    Runs a single evaluation case and returns a result dict.
    Routes deterministic checks without any API call.
    Routes judge calls to the appropriate cost tier.
    """
    result = {
        "id": case.id,
        "eval_type": case.eval_type,
        "tags": case.tags,
        "model_used": None,
        "passed": False,
        "score": None,
        "reasoning": None,
    }

    if case.eval_type == "exact":
        result["passed"] = eval_exact(system_response, case.expected_output)
        result["score"] = 5 if result["passed"] else 1

    elif case.eval_type == "regex":
        pattern = case.metadata.get("pattern", "")
        result["passed"] = eval_regex(system_response, pattern)
        result["score"] = 5 if result["passed"] else 1

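    elif case.eval_type == "schema":
        schema = case.metadata.get("schema", {})
        result["passed"] = eval_json_schema(system_response, schema)
        result["score"] = 5 if result["passed"] else 1

    elif case.eval_type == "similarity":
        # Paraphrase-tolerant match against the reference, using the case's threshold
        result["passed"] = eval_semantic_similarity(
            system_response, case.expected_output, threshold=case.tolerance
        )
        result["score"] = 5 if result["passed"] else 1

    elif case.eval_type == "judge":
        model = select_judge_model(case)
        result["model_used"] = model
        task_desc = case.metadata.get("task_description", "Complete the given task.")
        judgment = llm_judge(
            task_description=task_desc,
            input_text=str(case.input),
            response=system_response,
            model=model,
        )
        result["passed"]    = judgment["score"] >= case.min_judge_score
        result["score"]     = judgment["score"]
        result["reasoning"] = judgment["reasoning"]

    else:
        # Fail loudly instead of silently recording an unknown type as a failure
        raise ValueError(f"Case {case.id} has unknown eval_type {case.eval_type!r}")

    return result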

With this routing in place, the fast PR tier routes almost all cases to Haiku, making it feasible to run on every commit. The full Sonnet suite runs on merges to main. Opus is reserved for the small subset of safety and policy cases that genuinely require it — typically 10 to 15% of a golden set at most.

For teams running evaluations from AWS Mumbai (ap-south-1) or AWS London (eu-west-2), use a regional endpoint where your access path offers one (for example, a cloud provider's hosted Claude service in that region) to reduce latency on eval runs. Evaluation throughput, meaning how quickly you can complete a 300-case run, matters when the CI build is blocking a deploy. Parallel eval execution with asyncio and rate-limit-aware concurrency controls can bring a 300-case Haiku run down to under 30 seconds, depending on your rate limits.
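
Bounded concurrency is the core of that speed-up. The sketch below caps the number of in-flight judge calls with an asyncio.Semaphore; `run_bounded` is a hypothetical helper and `fake_judge` is a stand-in that sleeps instead of calling an API. Retry and backoff on rate-limit errors are deliberately left out, and a real runner would need them.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def run_bounded(
    items: list,
    worker: Callable[[Any], Awaitable[dict]],
    max_concurrency: int = 8,
) -> list[dict]:
    """Run worker over items with at most max_concurrency calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(item: Any) -> dict:
        async with semaphore:
            return await worker(item)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(guarded(i) for i in items))


async def fake_judge(case_id: str) -> dict:
    """Stand-in for an async judge call; sleeps instead of hitting an API."""
    await asyncio.sleep(0.01)
    return {"id": case_id, "score": 4}


ids = [f"tc-{n:03d}" for n in range(20)]
results = asyncio.run(run_bounded(ids, fake_judge, max_concurrency=5))
print(len(results), results[0]["id"])
```

Set max_concurrency from your account's rate limits rather than from the runner's CPU count, since the work is network-bound.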

Reporting and action thresholds

Raw pass/fail counts are useful during development. In production, you need structured reporting that tells you not just whether the suite passed, but which categories regressed, by how much, and whether the regression is within a known distribution or a step-change that suggests a systemic problem.

| Metric | Definition | Fail threshold | Warn threshold |
|---|---|---|---|
| Overall pass rate | % of cases where passed == True | > 5% regression | > 2% regression |
| Mean judge score | Average score across judge-evaluated cases (1–5 scale) | Drop > 0.3 points | Drop > 0.1 points |
| Category pass rate | Pass rate per tag group (e.g. "edge-case", "safety") | Any category > 10% regression | Any category > 5% regression |
| Fail concentration | % of fails concentrated in a single tag | N/A (flag for investigation) | > 60% fails in one tag |
| Judge model agreement | % agreement between fast and full-suite scores on overlapping cases | < 75% | < 85% |
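
The judge model agreement row leaves room for interpretation. One reading, assumed here, is that two tiers agree on a case when they reach the same pass/fail verdict at a given pass mark. The sketch below computes that rate over the cases both tiers scored and maps it onto the fail and warn thresholds from the table; the function names are hypothetical.

```python
def judge_agreement(
    fast_scores: dict[str, int],
    full_scores: dict[str, int],
    pass_mark: int = 4,
) -> float:
    """Share of overlapping cases where both tiers give the same pass/fail verdict."""
    shared = fast_scores.keys() & full_scores.keys()
    if not shared:
        return 0.0
    same = sum(
        (fast_scores[cid] >= pass_mark) == (full_scores[cid] >= pass_mark)
        for cid in shared
    )
    return same / len(shared)


def classify_agreement(agreement: float) -> str:
    """Apply the table's thresholds: below 75% fails, below 85% warns."""
    if agreement < 0.75:
        return "fail"
    if agreement < 0.85:
        return "warn"
    return "ok"


fast = {"tc-1": 5, "tc-2": 3, "tc-3": 4, "tc-9": 1}
full = {"tc-1": 4, "tc-2": 4, "tc-3": 4, "tc-7": 5}
rate = judge_agreement(fast, full)
print(round(rate, 2), classify_agreement(rate))
```

A stricter definition, such as requiring identical 1 to 5 scores, would produce lower agreement numbers, so document which one your reports use.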

The fail concentration metric is particularly useful for diagnosing the nature of a regression. If 80% of your failures are concentrated in the "long-input" tag after a prompt change, the prompt change broke long-input handling specifically. You do not need to revert the entire change; you need to fix how it handles long inputs. Without this breakdown, a regression looks like a global quality drop when it is actually a targeted failure mode.
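
Both the per-category breakdown and fail concentration fall out of the per-case result dicts, provided each one carries passed and tags fields. The sketch below is one possible implementation; the helper names are hypothetical, and the row shape (name, pass, fail, pass_rate) is chosen to match what the PR comment script expects.

```python
from collections import Counter, defaultdict
from typing import Optional


def by_category(results: list[dict]) -> list[dict]:
    """Pass/fail counts and pass rate for every tag seen in the results."""
    counts = defaultdict(lambda: {"pass": 0, "fail": 0})
    for r in results:
        key = "pass" if r["passed"] else "fail"
        for tag in set(r["tags"]):
            counts[tag][key] += 1
    rows = []
    for name in sorted(counts):
        p, f = counts[name]["pass"], counts[name]["fail"]
        rows.append({"name": name, "pass": p, "fail": f, "pass_rate": p / (p + f)})
    return rows


def fail_concentration(results: list[dict]) -> tuple[Optional[str], float]:
    """The tag carrying the largest share of failed cases, and that share."""
    failed = [r for r in results if not r["passed"]]
    tag_fails = Counter(tag for r in failed for tag in set(r["tags"]))
    if not tag_fails:
        return None, 0.0
    tag, n = tag_fails.most_common(1)[0]
    return tag, n / len(failed)


results = (
    [{"passed": False, "tags": ["long-input"]}] * 4
    + [{"passed": False, "tags": ["finance"]}]
    + [{"passed": True, "tags": ["long-input"]}] * 6
    + [{"passed": True, "tags": ["finance"]}] * 9
)
print(fail_concentration(results))  # ('long-input', 0.8)
```

Because a case can carry several tags, the shares across tags can sum to more than 1; the metric is meant as a pointer for investigation, not a partition of failures.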

Store every eval run result as a JSON artefact in your CI system and ship the run ID, pass rate, and baseline delta to your observability platform. This gives you a quality timeline alongside your deployment history, which makes it possible to answer the question: "Was it the 14 May deploy or the 18 May prompt change that caused the quality drop we noticed on 20 May?"

For a deeper look at the observability instrumentation layer that sits alongside your eval loop, see our guide on agent observability with OpenTelemetry — the two practices are complementary and share the same tracing infrastructure.

Eval patterns for common use cases

The general framework above applies to any LLM application. Two specific patterns come up often enough to deserve concrete examples: evaluating a RAG pipeline and evaluating an agent that takes multi-step actions.

RAG pipeline evaluation

A RAG system has two failure modes: retrieval failure (the right passage was not retrieved) and generation failure (the right passage was retrieved but the model gave a wrong or hallucinated answer). A good eval suite tests both separately, because they have different fixes. If you want to go deeper on the retrieval side specifically, our guide on production RAG with hybrid retrieval covers the retrieval-quality metrics in detail. The eval below focuses on the generation side — given a set of retrieved passages, does the answer faithfully reflect them?

import anthropic
import json
from dataclasses import dataclass, field  # used by RAGGoldenCase below

client = anthropic.Anthropic()

RAG_FAITHFULNESS_JUDGE_PROMPT = """You are evaluating whether an AI answer is faithful to the retrieved context.
Faithful means every claim in the answer is supported by the context — the model must not add facts not present in the context.

Score on a 1-5 scale:
5 — Fully faithful: every claim is directly supported by the context.
4 — Mostly faithful: one minor point is not in the context but is not materially wrong.
3 — Partially faithful: some claims lack support; hedging language used but key facts are unsupported.
2 — Mostly hallucinated: several claims are invented or contradict the context.
1 — Fully hallucinated: the answer ignores the context entirely.

Length is not a quality signal. A one-sentence faithful answer beats a long hallucinated one.

Output JSON only: {"score": int, "unsupported_claims": ["..."], "reasoning": "..."}
"""

def eval_rag_faithfulness(
    question: str,
    context_passages: list[str],
    answer: str,
    model: str = "claude-sonnet-4-6",
) -> dict:
    context_block = "\n\n---\n\n".join(
        f"[Passage {i+1}]\n{p}" for i, p in enumerate(context_passages)
    )
    user_prompt = f"""## Question
{question}

## Retrieved Context
{context_block}

## Answer to Evaluate
{answer}"""

    result = client.messages.create(
        model=model,
        max_tokens=256,
        system=RAG_FAITHFULNESS_JUDGE_PROMPT,
        messages=[{"role": "user", "content": user_prompt}],
    )
    raw = result.content[0].text.strip()
    if raw.startswith("```"):
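        # removeprefix drops only a leading "json" tag; lstrip("json") would
        # strip any run of the characters j, s, o, n
        raw = raw.split("```")[1].removeprefix("json").strip()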
    return json.loads(raw)


# RAG golden case structure
@dataclass
class RAGGoldenCase:
    id: str
    question: str
    context_passages: list[str]
    reference_answer: str          # human-verified correct answer
    expected_faithfulness: int = 4  # minimum acceptable faithfulness score
    tags: list[str] = field(default_factory=list)
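
The faithfulness judge covers the generation side only. To attribute a failing case to retrieval or to generation, you also need a retrieval check. The sketch below assumes each golden case records the IDs of the passages that should have been retrieved; passage IDs and the helper names are hypothetical additions, since RAGGoldenCase as written stores passage text rather than IDs.

```python
def retrieval_hit(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> bool:
    """True if any relevant passage appears in the top-k retrieved results."""
    return any(pid in relevant_ids for pid in retrieved_ids[:k])


def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of the relevant passages found in the top-k retrieved results."""
    if not relevant_ids:
        return 0.0
    found = relevant_ids & set(retrieved_ids[:k])
    return len(found) / len(relevant_ids)


def diagnose(retrieval_ok: bool, faithfulness_score: int, min_faithfulness: int = 4) -> str:
    """Attribute a case outcome to the stage that failed, checking retrieval first."""
    if not retrieval_ok:
        return "retrieval_failure"
    if faithfulness_score < min_faithfulness:
        return "generation_failure"
    return "ok"


retrieved = ["doc-7", "doc-2", "doc-9"]
relevant = {"doc-2", "doc-4"}
print(recall_at_k(retrieved, relevant, k=3))           # 0.5
print(diagnose(retrieval_hit(retrieved, relevant), 2))  # generation_failure
```

Checking retrieval first matters: an answer built on the wrong passages can still score as faithful to them, which would otherwise hide the real fault.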

Agent evaluation

Agent evaluation is harder than single-turn evaluation because the output is a sequence of actions rather than a single response. The key metrics are task completion rate (did the agent achieve the goal?), step efficiency (did it take the minimal reasonable number of steps?), and tool call correctness (did each tool call have the right arguments?). For multi-step agents, use a trajectory judge that evaluates the full sequence rather than the final output alone.

import anthropic
import json
from dataclasses import dataclass, field

client = anthropic.Anthropic()

@dataclass
class AgentTrajectory:
    """Records what an agent actually did on a task."""
    task: str
    steps: list[dict]   # each: {"tool": str, "args": dict, "result": str}
    final_output: str
    total_steps: int

AGENT_TRAJECTORY_JUDGE_PROMPT = """You are evaluating an AI agent's trajectory for completing a task.
Assess three dimensions and return a JSON object.

## Dimensions

1. task_completion (1-5): Did the agent successfully complete the task?
   5 = fully completed, correct result
   3 = partially completed or correct result via incorrect path
   1 = task failed or wrong result

2. step_efficiency (1-5): Did the agent use a reasonable number of steps?
   5 = optimal or near-optimal path
   3 = some unnecessary steps but reasonable
   1 = excessive redundant steps or loops

3. tool_correctness (1-5): Were tool calls made with correct arguments?
   5 = all tool calls correct
   3 = minor argument errors, recovered
   1 = critical tool call errors

Output JSON only:
{
  "task_completion": int,
  "step_efficiency": int,
  "tool_correctness": int,
  "overall": float,  // average of the three
  "issues": ["..."],
  "reasoning": "..."
}
"""

def eval_agent_trajectory(
    trajectory: AgentTrajectory,
    max_expected_steps: int,
    model: str = "claude-sonnet-4-6",
) -> dict:
    steps_summary = "\n".join(
        f"Step {i+1}: Called {s['tool']}({json.dumps(s['args'])}) → {s['result'][:120]}"
        for i, s in enumerate(trajectory.steps)
    )
    user_prompt = f"""## Task
{trajectory.task}

## Agent Trajectory ({trajectory.total_steps} steps; expected max: {max_expected_steps})
{steps_summary}

## Final Output
{trajectory.final_output}"""

    result = client.messages.create(
        model=model,
        max_tokens=300,
        system=AGENT_TRAJECTORY_JUDGE_PROMPT,
        messages=[{"role": "user", "content": user_prompt}],
    )
    raw = result.content[0].text.strip()
    if raw.startswith("```"):
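        # removeprefix drops only a leading "json" tag; lstrip("json") would
        # strip any run of the characters j, s, o, n
        raw = raw.split("```")[1].removeprefix("json").strip()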
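    parsed = json.loads(raw)
    # Recompute the average in code rather than trusting the model's arithmetic
    dims = ("task_completion", "step_efficiency", "tool_correctness")
    parsed["overall"] = sum(parsed[d] for d in dims) / len(dims)
    parsed["passed"] = parsed["overall"] >= 4.0
    return parsed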

For agent evals, golden cases should include both the task and the expected trajectory shape: which tools should be called, in what order, and with what class of arguments. Exact argument matching is often too brittle (UUIDs and timestamps vary), but you can assert tool call sequence and argument types. Combine trajectory judging with final-output judging for the best coverage.
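
Asserting trajectory shape needs no model call at all. The sketch below checks tool order and argument types against an expected shape. The format of the expected list (a tool name plus an arg_types mapping) and the tool names in the example are hypothetical; adapt them to however your golden cases describe trajectories.

```python
def check_trajectory_shape(steps: list[dict], expected: list[dict]) -> list[str]:
    """Return a list of mismatches; an empty list means the shape matches.

    steps:    [{"tool": str, "args": dict, "result": str}, ...]
    expected: [{"tool": str, "arg_types": {arg_name: type}}, ...]
    """
    actual_tools = [s["tool"] for s in steps]
    expected_tools = [e["tool"] for e in expected]
    if actual_tools != expected_tools:
        # Argument checks are meaningless once the sequence itself differs
        return [f"tool sequence {actual_tools} != expected {expected_tools}"]

    problems = []
    for i, (step, exp) in enumerate(zip(steps, expected), start=1):
        for arg, typ in exp.get("arg_types", {}).items():
            if arg not in step["args"]:
                problems.append(f"step {i}: missing argument {arg!r}")
            elif not isinstance(step["args"][arg], typ):
                problems.append(f"step {i}: {arg!r} should be {typ.__name__}")
    return problems


expected = [
    {"tool": "search_orders", "arg_types": {"customer_id": str}},
    {"tool": "issue_refund", "arg_types": {"order_id": str, "amount": float}},
]
steps = [
    {"tool": "search_orders", "args": {"customer_id": "c-881"}, "result": "1 order"},
    {"tool": "issue_refund", "args": {"order_id": "o-17", "amount": "12.50"}, "result": "ok"},
]
print(check_trajectory_shape(steps, expected))  # ["step 2: 'amount' should be float"]
```

Running this before the trajectory judge lets cheap structural failures short-circuit the more expensive model call.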

The skills that underpin a strong evaluation practice — knowing how to write calibrated judge prompts, structure regression loops, and reason about cost routing — are the same skills that distinguish experienced AI engineers from those who have only built demos. If you want to understand where the talent gap sits in 2026, the piece on the agentic AI skills gap maps the landscape well. Evaluation discipline is near the top of what separates builders who ship reliable systems from those who are still fighting fires in production.

For the cost side of your broader LLM infrastructure, our guide on LLM cost optimisation with caching, routing, and compression covers the adjacent techniques — prompt caching, model routing by complexity, and context compression — that make it economical to run evals at the scale production demands. Evaluation and cost optimisation are the two practices that separate a robust production system from an expensive prototype.