Cost Optimisation Guide · 14 June 2026 · 10 min read

LLM Cost Optimisation 2026: Distillation, Semantic Caching, and Smart Model Routing

Q: What is the difference between semantic caching and Anthropic's prompt caching?

They operate at completely different layers. Anthropic's prompt caching is a provider-side feature that stores the processed form of a static prefix in your request (the system prompt, tool schemas) so that subsequent calls sharing that same prefix are billed at a discounted cached-input rate. It only matches exact prefixes. Semantic caching lives entirely in your application: you embed the incoming query, search a vector store for previous queries whose cosine similarity exceeds a threshold (typically 0.92–0.95), and return the stored response without making an API call at all. Semantic caching catches queries that are phrased differently but mean the same thing; prompt caching does not. The two techniques are complementary and stack safely.

Q: When does model distillation make financial sense?

Distillation has a meaningful upfront cost — GPU time for the fine-tuning run, engineering time to collect and curate training data, and evaluation effort before you can gate traffic to the student. As a rule of thumb, distillation becomes financially viable when you are serving more than roughly 100,000 requests per day on a stable, well-defined task. Below that volume the API cost you are avoiding is likely smaller than the engineering and compute cost of building and maintaining the distilled model. The task also needs to be stable enough that the student model you train today still covers the queries you are serving in three months, which rules out tasks where the underlying knowledge or query distribution shifts rapidly.

Q: How do I decide which model to route each request to?

Start with a simple heuristic: classify requests by task type and token length, and map each combination to the cheapest model whose quality bar you can verify. A short classification or entity-extraction query goes to Haiku 4.5; a coding or explanation task goes to Sonnet 4.6; a complex multi-step reasoning task goes to Opus 4.8; a task requiring frontier capability goes to Fable 5. Once you have production data, replace the heuristic with a lightweight classifier — a fine-tuned small model or a prompt sent to Haiku itself — that scores each incoming query. The key discipline is evaluation: widen the budget tier only as fast as a held-out eval set proves quality holds. An escalation path that retries borderline answers on the next tier up is essential insurance.

Q: Can I use distillation with proprietary models like Fable 5?

You can use a proprietary model as the teacher to generate labelled outputs only if the provider's terms of service allow it. Anthropic's terms, like those of most providers, place restrictions on using outputs to train competing models, so review the current terms before proceeding. The student model must be an open-weight model you control (for example Qwen 3.5, Llama 4, or Mistral), because you cannot fine-tune the proprietary model itself. The result is a student that runs on your own infrastructure at a fraction of the API cost, but that requires you to manage hosting, scaling, and model updates yourself. For most teams at sufficient scale, that operational overhead is well worth the 10–100× inference cost reduction.

Q: How long does it take to see savings from these three levers?

The three levers have very different implementation timelines. Model routing — even a simple heuristic version — can ship in roughly one week and immediately cuts your average cost per query wherever cheaper models can handle the traffic. Semantic caching takes two to three weeks to build properly, including embedding infrastructure, vector store setup, similarity threshold tuning, and cache invalidation logic. Distillation is the longest investment: collecting training data, running the fine-tuning, evaluating quality, and gating production traffic typically takes four to eight weeks for a first run, and you should plan for iteration. The practical advice is to ship routing first — you see savings almost immediately — and then add semantic caching. Reserve distillation for when you have the traffic volume to justify it and the evaluation discipline to gate it safely.

As of June 2026, the difference between an LLM product that scales profitably and one that haemorrhages budget is rarely the model — it is the engineering layer around the model. This guide covers three advanced levers that remain widely underused: semantic caching (which catches the same question phrased differently), model distillation (which trains a cheap student from an expensive teacher), and 2026-tier model routing (with updated pricing across Fable 5, Opus 4.8, Sonnet 4.6, and Haiku 4.5). Applied together, these three compound to 50–90% reductions on realistic production workloads. Code, decision trees, and a worked cost model throughout.

AI Tech Connect editorial Published 14 June 2026

Why 50–90% of Your Inference Bill Is Avoidable

Most teams ship their first LLM feature with a single model wired to every code path. That model is almost always a frontier model, because it was the easiest thing to validate during the prototype. The bill that arrives at month-end is treated as the cost of doing business. It is not. Industry analysis consistently puts the avoidable share of inference spend at 50–90% for enterprises without a deliberate cost strategy, not because the models are expensive, but because the models are used carelessly.

The root cause is always some combination of three patterns: routing everything to the most capable and most expensive model regardless of query difficulty; calling the model fresh for queries you have answered before; and running inference at API scale with a large proprietary model when a smaller, purpose-built model on your own infrastructure would serve just as well for the specific task. Each of these patterns has a well-understood fix, and the fixes compound.

This article covers the three missing levers that our companion piece does not address. If you have not yet implemented prompt caching and prompt compression, start with The Cache, Route, Compress Playbook — those techniques come first because they are faster to ship and carry less risk. What follows here goes a layer deeper: semantic caching that catches "same question, different words"; updated 2026 model routing tiers with concrete routing logic; and model distillation, the training-time technique that takes you off the API pricing curve entirely for well-defined tasks.

The 2026 Model Pricing Landscape — Your Routing Foundation

Before you can route intelligently, you need a clear picture of what you are routing between. As of June 2026, the Anthropic model family spans a 12.5× cost spread from the most capable frontier model to the fastest budget tier. That gap is the foundation of every routing strategy.

Model	Input (per MTok)	Output (per MTok)	Context window	Best for
Claude Fable 5	$10	$50	200K	Frontier reasoning, complex multi-step agents, creative synthesis
Claude Opus 4.8	$5	$25	200K	Hard analysis, nuanced judgement, long-document processing
Claude Sonnet 4.6	$3	$15	200K	Coding, explanation, structured extraction, everyday chat
Claude Haiku 4.5	$0.80	$4	200K	Classification, short generation, high-throughput low-latency tasks

Pricing as of June 2026. Check the Fable 5 launch article for the most current figures before modelling your own bill.

The arithmetic is stark: traffic that costs $10 per million input tokens on Fable 5 costs $0.80 on Haiku, a 12.5× difference. If 1 million requests per day at 750 tokens average all flow to Fable 5, you are paying roughly $7,500 per day in input costs alone. Route 70% of that same traffic to Haiku and the input cost for that share drops from $5,250 to $420. The routing decision is not a nuance; it is the largest single lever available at the API layer.
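
The arithmetic above can be checked in a few lines. This is a minimal sketch that counts input tokens only; the prices are copied from the pricing table, and the short model keys and the `daily_input_cost` helper are illustrative names, not part of any SDK.

```python
# Input prices in USD per million tokens, copied from the pricing table above.
INPUT_PRICE_PER_MTOK = {
    "fable-5": 10.00,
    "opus-4.8": 5.00,
    "sonnet-4.6": 3.00,
    "haiku-4.5": 0.80,
}


def daily_input_cost(requests_per_day: int, avg_input_tokens: int, model: str) -> float:
    """Daily input-token cost in USD for traffic sent entirely to one model."""
    mtok = requests_per_day * avg_input_tokens / 1_000_000
    return mtok * INPUT_PRICE_PER_MTOK[model]


all_fable = daily_input_cost(1_000_000, 750, "fable-5")
share_on_fable = daily_input_cost(700_000, 750, "fable-5")
share_on_haiku = daily_input_cost(700_000, 750, "haiku-4.5")

print(f"All traffic on Fable 5:      ${all_fable:,.0f}/day")
print(f"70% share on Fable 5:        ${share_on_fable:,.0f}/day")
print(f"Same 70% share on Haiku 4.5: ${share_on_haiku:,.0f}/day")
```

Output tokens are deliberately left out here; add a second price table and an average output length if your workload generates long responses.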

The routing principle is simple: send each request to the cheapest model that still passes your quality bar. The engineering challenge is deciding what "passes quality" means and measuring it reliably. Here is a starting-point routing function in Python that uses task type and query length as heuristics — adequate for many teams and easy to instrument before you replace it with a classifier.

def route_model(query: str, task_type: str) -> str:
    """
    Heuristic model router; replace with a fine-tuned classifier in production.
    Returns an Anthropic model ID for the cheapest tier that should handle this query.
    Explicit task types are checked before the length fallback, so a short but
    hard query is not sent to the budget tier just because it is short.
    """
    if task_type == "classification":
        # Well-defined labelling tasks: budget tier
        return "claude-haiku-4-5-20251001"
    if task_type == "complex_reasoning":
        # Hard analysis requiring depth: upper-mid tier
        return "claude-opus-4-8"
    if task_type == "coding" or "explain" in query.lower():
        # Structured reasoning and explanation: mid tier
        return "claude-sonnet-4-6"
    if len(query) < 200:
        # Short queries of unknown type (length in characters, not tokens): budget tier
        return "claude-haiku-4-5-20251001"
    # Frontier tasks only; default is intentionally conservative
    return "claude-fable-5"

A production routing layer will be more sophisticated: a lightweight classifier (possibly Haiku itself) that scores each incoming query for complexity, a model registry that maps scores to model IDs, and a shadow evaluation pipeline that compares routed answers to a frontier baseline on a held-out set. The heuristic above is a safe starting point. Do not widen your budget tier faster than your evaluation data proves it is safe — a 5% quality regression from over-routing is likely to cost far more than the API saving that triggered it.
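
One possible shape for the model registry described above is a sorted list of score cut-offs. This is a sketch under stated assumptions: the complexity score is taken to be a float in [0, 1] produced by whatever classifier you use, and the cut-off values are hypothetical placeholders to be tuned against your own held-out eval set. The model IDs are the ones used in the heuristic router.

```python
from bisect import bisect_right

# Hypothetical score cut-offs; tune these against your own evaluation data.
TIER_UPPER_BOUNDS = [0.30, 0.60, 0.85]
TIER_MODELS = [
    "claude-haiku-4-5-20251001",
    "claude-sonnet-4-6",
    "claude-opus-4-8",
    "claude-fable-5",
]


def model_for_score(complexity: float) -> str:
    """Map a complexity score in [0, 1] to the cheapest tier allowed for it."""
    if not 0.0 <= complexity <= 1.0:
        raise ValueError(f"complexity must be in [0, 1], got {complexity}")
    # bisect_right sends a score that sits exactly on a cut-off to the
    # higher tier, which is the conservative choice.
    return TIER_MODELS[bisect_right(TIER_UPPER_BOUNDS, complexity)]
```

Keeping the cut-offs in data rather than in branching code means widening the budget tier is a one-line, reviewable change that can be tied to an eval run.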

Pro tip: Build your routing decision table before writing the classifier. Map your query types to model tiers manually on 100 production examples first. That exercise almost always reveals that a larger share of traffic is genuinely simple than you expected — which makes the routing investment easier to justify.

The table below frames the routing decision across the four dimensions that matter in practice.

Query complexity	Latency budget	Quality requirement	Recommended model
Low (classification, short generation)	Tight (<300 ms)	Functional	Haiku 4.5
Medium (explanation, coding, structured extraction)	Moderate (300–2,000 ms)	High	Sonnet 4.6
High (nuanced analysis, long documents)	Flexible (2–10 s)	Very high	Opus 4.8
Frontier (complex agents, creative synthesis)	Flexible (5–30 s)	Maximum	Fable 5

Semantic Caching — Skip the API Call Entirely

Semantic caching is one of the most impactful and most commonly misunderstood optimisations in the LLM cost toolkit. It is frequently confused with Anthropic's prompt caching — they are not the same thing, and understanding the difference determines where each is worth implementing.

Anthropic's prompt caching is a provider-side feature. It stores the processed form of a static prefix in your request — typically the system prompt and tool schemas — and charges a discounted rate for subsequent calls that share that exact prefix. It saves you from paying to re-process tokens the model has already seen. It only works on exact prefix matches, and it is useful for agent loops and chat sessions that resend the same large static context repeatedly. If you have not yet implemented it, see the prompt caching deep-dive.

Semantic caching is an application-layer technique that you build yourself. It embeds the incoming user query into a vector, searches a vector store for previous queries with high cosine similarity, and, if the similarity is above a threshold, returns the stored response without calling the model at all. Its power is that it catches semantically equivalent queries that are phrased differently: "how do I reset my password?" and "I've forgotten my password, what do I do?" land near each other in embedding space even though their wording differs and an exact-prefix match would never pair them.

The expected saving is 40–60% on workloads with overlapping query patterns — customer support, FAQ chatbots, RAG systems over a fixed document corpus. The latency benefit is often just as compelling: a cache hit returns in milliseconds rather than seconds.

Here is a working implementation using sentence-transformers and FAISS for local use, easily swappable for Redis with vector search or Qdrant in a hosted setup.

import json
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer
from typing import Optional

class SemanticCache:
    """
    Simple semantic cache using sentence-transformers for embedding and FAISS for ANN search.
    Swap the index for a Redis/Qdrant client in production.
    """

    def __init__(self, model_name: str = "all-MiniLM-L6-v2", threshold: float = 0.92):
        self.encoder = SentenceTransformer(model_name)
        self.threshold = threshold
        # Read the dimension from the encoder so other models work too (384 for all-MiniLM-L6-v2)
        self.dimension = self.encoder.get_sentence_embedding_dimension()
        self.index = faiss.IndexFlatIP(self.dimension)  # inner-product = cosine on normalised vectors
        self.entries: list[dict] = []  # parallel list: {query, response, doc_hash}

    def _embed(self, text: str) -> np.ndarray:
        vec = self.encoder.encode([text], normalize_embeddings=True)
        return vec.astype("float32")

    def lookup(self, query: str) -> Optional[str]:
        """Return a cached response if a similar query exists above threshold, else None."""
        if self.index.ntotal == 0:
            return None
        vec = self._embed(query)
        distances, indices = self.index.search(vec, k=1)
        best_score = float(distances[0][0])
        best_idx   = int(indices[0][0])
        if best_score >= self.threshold:
            return self.entries[best_idx]["response"]
        return None

    def store(self, query: str, response: str, doc_hash: str = "") -> None:
        """Add a query-response pair to the cache, keyed by document hash for invalidation."""
        vec = self._embed(query)
        self.index.add(vec)
        self.entries.append({"query": query, "response": response, "doc_hash": doc_hash})

    def invalidate_by_doc(self, doc_hash: str) -> None:
        """
        Remove all entries associated with a given document hash.
        In production, rebuild the FAISS index from the remaining entries
        (or use a filterable vector store like Qdrant that supports deletion by metadata).
        """
        surviving = [e for e in self.entries if e["doc_hash"] != doc_hash]
        self.entries = surviving
        # Rebuild index from surviving entries
        self.index = faiss.IndexFlatIP(self.dimension)
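        if surviving:
            # Re-embed all surviving queries in one batched call
            queries = [e["query"] for e in surviving]
            vecs = self.encoder.encode(queries, normalize_embeddings=True).astype("float32")
            self.index.add(vecs)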


# Usage in a request handler
cache = SemanticCache(threshold=0.93)

def get_answer(query: str, model_client, model_id: str, doc_hash: str = "") -> str:
    cached = cache.lookup(query)
    if cached is not None:  # an empty string is still a valid cached answer
        return cached  # no API call made

    response = model_client.messages.create(
        model=model_id,
        max_tokens=1024,
        messages=[{"role": "user", "content": query}],
    )
    answer = response.content[0].text
    cache.store(query, answer, doc_hash=doc_hash)
    return answer

Watch out — cache invalidation: A semantic cache that returns a stored answer for a query about data that has since changed will serve stale information confidently and cheaply. Always associate each cached entry with the hash of the underlying document(s) that informed the answer. When a document is updated, call invalidate_by_doc(doc_hash) immediately. If you cannot track document hashes, set a time-to-live (TTL) that is shorter than your content refresh cycle.
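
The TTL fallback mentioned above can be sketched as a freshness check applied to each hit before it is returned. The `CachedAnswer` wrapper and `is_fresh` helper are hypothetical names for illustration; they are not part of the `SemanticCache` class shown earlier, which would need a stored timestamp per entry to use them.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CachedAnswer:
    """A cached response plus the wall-clock time it was stored."""
    response: str
    stored_at: float = field(default_factory=time.time)


def is_fresh(entry: CachedAnswer, ttl_seconds: float, now: Optional[float] = None) -> bool:
    """True while the entry is younger than the TTL; `now` is injectable for tests."""
    current = time.time() if now is None else now
    return (current - entry.stored_at) < ttl_seconds
```

In a lookup path, a hit that fails `is_fresh` would be treated as a miss, so the model is called again and the entry is overwritten with a current answer.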

Choosing your similarity threshold requires tuning against your specific query distribution. A threshold that is too low causes false hits — returning a stored answer to a query that was similar but not equivalent, which quietly serves wrong responses. A threshold too high makes the cache ineffective. The range 0.92–0.95 is a reasonable starting point for most English-language support and FAQ workloads. Measure your false-hit rate on a labelled sample before moving to production; if a false hit would embarrass you or mislead a user, be conservative.
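
Measuring the false-hit rate on a labelled sample might look like the sketch below. It assumes you have, for each sampled query, the similarity score of its nearest cached neighbour and a human label saying whether the two queries were truly equivalent. The sample values are placeholders, not measurements.

```python
from typing import Dict, List, Sequence, Tuple


def sweep_thresholds(
    labelled: Sequence[Tuple[float, bool]],
    thresholds: Sequence[float],
) -> Dict[float, Tuple[float, float]]:
    """
    labelled: (similarity, is_equivalent) pairs, one per sampled query.
    Returns {threshold: (hit_rate, false_hit_rate)}, where false_hit_rate
    is the share of cache hits whose matched query was NOT equivalent.
    """
    if not labelled:
        raise ValueError("labelled sample is empty")
    results = {}
    for t in thresholds:
        hits: List[bool] = [equiv for sim, equiv in labelled if sim >= t]
        hit_rate = len(hits) / len(labelled)
        false_hit_rate = hits.count(False) / len(hits) if hits else 0.0
        results[t] = (hit_rate, false_hit_rate)
    return results


# Placeholder sample for illustration only.
sample = [(0.97, True), (0.95, True), (0.94, False),
          (0.93, True), (0.91, False), (0.88, False)]
report = sweep_thresholds(sample, [0.90, 0.92, 0.95])
for threshold, (hit_rate, false_rate) in report.items():
    print(f"threshold {threshold:.2f}: hit rate {hit_rate:.0%}, false hits {false_rate:.0%}")
```

Reading the two rates side by side makes the trade-off explicit: each step up in threshold gives away hit rate in exchange for fewer wrong answers served from the cache.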

For workloads that span a rapidly changing knowledge base — news summarisation, live pricing, real-time status updates — semantic caching is not appropriate unless you have cache invalidation nailed. Apply it where the underlying data is stable: policy documents, product FAQs, technical documentation, internal knowledge bases with infrequent updates.

Pro tip: Log every semantic cache hit and the actual similarity score. Review a sample of hits weekly for the first month. The false-hit patterns that emerge will tell you whether to raise your threshold, narrow your cache scope, or add query pre-processing to normalise the inputs before embedding.

Model Distillation — Training a Cheap Student From an Expensive Teacher

Model distillation is categorically different from the API-layer techniques above. It does not change how you call a model; it changes which model you are calling. The goal is to train a small, fast, cheap open-weight model — the student — to replicate the behaviour of a large expensive model — the teacher — on a specific, narrow task distribution. If it works, you move that task's inference off the API pricing curve entirely and onto your own infrastructure at a fraction of the cost.

It is worth being precise about what distillation is and is not. It is not general knowledge compression: you cannot meaningfully distil GPT-4 or Fable 5's breadth of knowledge into a 7B model. What you can do is train a 7B model to match a frontier model's behaviour on your specific task distribution — the exact queries your users send, formatted the way your application formats them, with the exact type of outputs your system expects. On that narrow slice, an 85–95% quality match to the teacher is achievable. Outside that slice, the student may fail badly.

That distinction determines when distillation is viable: it requires a stable, well-defined task. If your query distribution shifts substantially over time (if new products, new use cases, or new user populations change what people are asking), the student you trained last month will gradually fall out of step with live traffic and degrade silently. Plan for periodic re-distillation if you ship it on a dynamic workload.

The distillation pipeline has five steps:

1. Sample your production distribution. Collect 5,000–50,000 representative queries from your production logs. Deduplicate, clean, and hold out 10–15% as an evaluation set that will not be used for training.
2. Generate teacher completions. Send each training query to your teacher model (Fable 5, Opus 4.8, or another frontier model) and collect the outputs. For reasoning tasks, use chain-of-thought prompting in the teacher: the student learns better reasoning patterns when the teacher's outputs include explicit reasoning steps rather than just final answers.
3. Fine-tune the student with LoRA. Train a small open-weight model (Qwen 3.5, Llama 4, Mistral, or similar) on the teacher-labelled data using supervised fine-tuning and LoRA for parameter-efficient training. Tools: HuggingFace transformers + peft, or Unsloth for faster iteration on a single GPU.
4. Evaluate on the held-out set. Compare the student's outputs to the teacher's outputs on queries the student did not see during training. Measure task-specific quality metrics, not just perplexity.
5. Gate by quality threshold. Only route traffic to the student if it passes your quality bar on the held-out eval. Do not gate by a single aggregate metric: break the eval down by query sub-type so you know where the student underperforms.
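
The final gating step can be sketched as a per-sub-type check rather than a single aggregate. The function below assumes each held-out eval row has already been scored as pass or fail by your task-specific metric; the `min_pass_rate` default and the sub-type labels are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def gate_by_subtype(
    eval_rows: Iterable[Tuple[str, bool]],
    min_pass_rate: float = 0.90,
) -> Tuple[bool, Dict[str, float]]:
    """
    eval_rows: (sub_type, passed) pairs from the held-out eval.
    Returns (approved, pass_rate_by_subtype). The student is approved only
    if every sub-type clears min_pass_rate on its own.
    """
    totals: Dict[str, int] = defaultdict(int)
    passes: Dict[str, int] = defaultdict(int)
    for sub_type, passed in eval_rows:
        totals[sub_type] += 1
        passes[sub_type] += int(passed)
    if not totals:
        raise ValueError("no eval rows supplied; refusing to approve on no data")
    rates = {s: passes[s] / totals[s] for s in totals}
    return all(r >= min_pass_rate for r in rates.values()), rates
```

A student that scores well in aggregate but fails one sub-type is rejected here, which is the point: the weak sub-type can keep going to the API while the rest moves to the student.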

Here is a minimal teacher sampling script and a LoRA fine-tuning sketch using PEFT.

# Step 1 + 2: Sample production queries and generate teacher completions
import anthropic
import json

client = anthropic.Anthropic()

def generate_teacher_completions(queries: list[str], output_path: str) -> None:
    """
    Generate teacher completions for a list of queries.
    Uses chain-of-thought prompting so the student learns reasoning steps.
    """
    results = []
    for q in queries:
        response = client.messages.create(
            model="claude-fable-5",  # teacher
            max_tokens=2048,
            messages=[
                {
                    "role": "user",
                    "content": (
                        f"Think step-by-step before giving your final answer.\n\n{q}"
                    ),
                }
            ],
        )
        results.append({"query": q, "completion": response.content[0].text})

    with open(output_path, "w") as f:
        json.dump(results, f, indent=2)
    print(f"Saved {len(results)} teacher completions to {output_path}")

# Step 3: LoRA fine-tuning with PEFT (HuggingFace)
# Argument names in TRL have changed between releases; check your installed version.
import json

import torch
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training
from trl import SFTConfig, SFTTrainer

# Load teacher-labelled data
with open("teacher_completions.json") as f:
    raw = json.load(f)

data = [
    {"text": f"### Query:\n{ex['query']}\n\n### Response:\n{ex['completion']}"}
    for ex in raw
]
dataset = Dataset.from_list(data)

# Load student model in 4-bit (replace with your chosen open-weight model)
MODEL_ID = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", quantization_config=bnb_config
)
# PEFT helper that prepares a quantised base model for training
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

training_args = SFTConfig(
    output_dir="./student-distilled",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    warmup_ratio=0.03,
    logging_steps=50,
    save_strategy="epoch",
    bf16=True,  # matches the bfloat16 compute dtype above; use fp16=True on GPUs without bf16
    dataset_text_field="text",
    max_length=2048,  # named max_seq_length in older TRL releases
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
trainer.save_model("./student-distilled/final")
tokenizer.save_pretrained("./student-distilled/final")  # keep the tokenizer with the adapter

Watch out — chain-of-thought in teacher outputs: For reasoning tasks, the student's quality degrades noticeably if you train it on teacher outputs that contain only final answers. A student trained on bare answers learns to mimic the teacher's style and tone, but not its reasoning process. Always include chain-of-thought steps in the teacher's outputs for anything that requires multi-step reasoning, extraction with justification, or structured decision-making. The training data volume required for reliable reasoning transfer is typically higher — aim for 20K+ examples for reasoning tasks versus 5–10K for simpler classification or extraction.

Once the student passes your quality gate, the inference cost story changes fundamentally. A 7B model on a single GPU instance can serve tens to hundreds of requests per second with batched inference, depending on sequence length, and the per-request cost falls to infrastructure cost only. Compared to Fable 5 at $10/$50 per MTok, a self-hosted 7B model typically runs at 10–100× lower cost per request depending on your GPU utilisation rate and the length of your inputs and outputs.

Account for the full cost picture before claiming the win: the fine-tuning run itself costs GPU time (typically a few hundred dollars for a well-equipped A100 run on a 7B model with 20K examples); model hosting requires a persistent GPU instance with the associated fixed cost; and re-distillation every few months adds ongoing engineering overhead. Distillation is the right choice when you are serving well above 100,000 requests per day on a task stable enough to justify the investment. Below that scale, routing and semantic caching are almost always the better ROI.
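
A rough break-even check can be sketched as follows. It assumes the student replaces Haiku-tier traffic, counts input tokens only at the 750-token average used earlier, and takes the roughly $2,000 per month GPU hosting figure from this article's cost model. The function name and the `upkeep_per_month` parameter (amortised fine-tuning and re-distillation effort) are illustrative.

```python
def break_even_requests_per_day(
    api_cost_per_request: float,
    gpu_hosting_per_month: float,
    upkeep_per_month: float = 0.0,
    days_per_month: int = 30,
) -> float:
    """Daily request volume at which self-hosting costs the same as the API."""
    if api_cost_per_request <= 0:
        raise ValueError("api_cost_per_request must be positive")
    fixed_per_day = (gpu_hosting_per_month + upkeep_per_month) / days_per_month
    return fixed_per_day / api_cost_per_request


# 750 input tokens on Haiku 4.5 at $0.80 per MTok, input cost only.
haiku_cost_per_request = 750 / 1_000_000 * 0.80
volume = break_even_requests_per_day(haiku_cost_per_request, gpu_hosting_per_month=2_000)
print(f"Break-even: about {volume:,.0f} requests/day")
```

Under these assumptions the break-even lands a little above 110,000 requests per day, in line with the rule of thumb above. Any non-zero upkeep figure pushes it higher, and counting output tokens on the API side pulls it lower.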

For a deeper treatment of distillation pipelines and student model selection, see our full distillation guide.

Stacking All Three — A Real Cost Model

The true power of these three levers comes from applying them in sequence, because each operates on a smaller bill than the previous one produced. The table below shows what that compounding looks like on a realistic scale: 1 million requests per day at 750 tokens average input (the workload from the pricing section, where input alone costs roughly $7,500 per day on Fable 5), all initially routed to Claude Fable 5.

Stage	What changes	Daily cost	Monthly cost	Cumulative saving
Baseline	All 1M requests to Fable 5 at $10/$50 per MTok	~$7,500	~$225,000	—
+ Routing	70% to Haiku, 20% to Sonnet, 10% to Fable 5	~$1,500	~$45,000	80%
+ Semantic cache	40% cache hit rate on remaining API calls	~$900	~$27,000	88%
+ Distillation	Replace Haiku API calls with self-hosted student model	~$400	~$12,000	95%

These figures are illustrative and rounded. Every dollar amount and percentage is based on the pricing and assumptions above; actual savings depend on your query distribution, cache hit rate, output length, task complexity, and GPU hosting costs for the distilled student. The $400/day distillation stage assumes GPU hosting of roughly $2,000/month for a well-utilised A100 instance serving the Haiku-equivalent traffic.

Reading the table down the saving column makes the compounding plain. Routing produces the largest single jump — 80% in this specific traffic mix — because the expensive frontier model now sees only 10% of traffic. In practice, routing savings range from 50–70% depending on your query distribution; this example uses an optimistic 70/20/10 split. Semantic caching adds another 8 percentage points on top: 40% of the remaining API calls hit the cache and return without touching a model. Distillation then moves the Haiku-routed 70% of traffic off the API entirely, converting a variable per-token cost into a fixed infrastructure cost that is substantially lower at this traffic volume.
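
To rerun the stacking with your own inputs, a minimal cost model might look like this. It is a sketch under explicit assumptions: input tokens only, 750 input tokens per request (the figure that reproduces the roughly $7,500 baseline at $10 per MTok), cache hits spread evenly across tiers, and GPU hosting as a flat monthly fee. The function and key names are illustrative.

```python
PRICE = {"haiku": 0.80, "sonnet": 3.00, "fable": 10.00}  # USD per MTok, input only
MIX = {"haiku": 0.70, "sonnet": 0.20, "fable": 0.10}     # routing split from the table


def stacked_daily_costs(requests_per_day, avg_input_tokens, cache_hit_rate, gpu_per_month):
    """Daily cost in USD after each lever is applied in sequence."""
    mtok = requests_per_day * avg_input_tokens / 1_000_000
    baseline = mtok * PRICE["fable"]
    per_tier = {m: mtok * share * PRICE[m] for m, share in MIX.items()}
    routed = sum(per_tier.values())
    cached = routed * (1 - cache_hit_rate)
    # Distillation: Haiku-tier API spend disappears, fixed GPU hosting appears.
    api_after_distill = (routed - per_tier["haiku"]) * (1 - cache_hit_rate)
    distilled = api_after_distill + gpu_per_month / 30
    return {"baseline": baseline, "routing": routed, "cache": cached, "distilled": distilled}


costs = stacked_daily_costs(1_000_000, 750, cache_hit_rate=0.40, gpu_per_month=2_000)
for stage, usd in costs.items():
    saving = 1 - usd / costs["baseline"]
    print(f"{stage:<10} ${usd:>8,.0f}/day  saving {saving:.0%}")
```

With exact arithmetic under these assumptions the stages come out at $7,500, $1,620, $972 and about $787 per day. That is close to the table's rounded figures for routing and caching and noticeably higher for the distillation row, which is sensitive to output-length and hosting assumptions that an input-only model does not capture.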

The implementation timeline matters too: routing is a week of work, semantic caching is two to three weeks, and distillation is four to eight weeks for the first run. If you are trying to reduce a painful bill quickly, ship routing first — it delivers the biggest absolute saving and takes the least time. The others compound on the smaller bill that routing produces, which makes them easier to prioritise against other engineering work.

For the broader context of why these numbers matter at the infrastructure level, see our FinOps playbook on AI inference cost economics and the companion piece on building profitable AI products in 2026.

India & UK builder note

For teams in India, GPU hosting for distillation is most cost-effective on AWS Mumbai (ap-south-1) or through Neysa's domestic GPU cloud, where spot A100 instances are priced competitively in INR and avoid cross-border data transfer costs. For UK teams, Azure UK South offers H100 on-demand at prices competitive with US East; reserved 1-year pricing cuts this further by roughly 35%. In both markets, the routing and semantic caching levers pay off immediately at API prices, while distillation's ROI calculation should include the regional GPU hosting cost rather than a US-centric default.

What to Measure — Metrics That Catch Savings Erosion

The three levers are not one-time migrations. Each of them requires ongoing measurement to confirm that the savings you expected are materialising and that the quality you promised your users is holding. The metrics below are the minimum instrumentation for a system that uses all three techniques.

Cost per successful request is the north star metric. Not cost per token, not cost per API call — cost per request that returned an answer your application accepted. Token costs can drop while successful-request costs rise if the routing or quality gate starts failing more often and triggering expensive retries on stronger models. Track this metric at the tier level (budget / mid / frontier / cached / distilled) so you can see where erosion is happening.
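
Computing this metric per tier can be sketched as below. The record fields (`tier`, `cost_usd`, `accepted`) are hypothetical names for whatever your request log stores; `cost_usd` is assumed to be the total spend for the request across all attempts, including any retries on stronger models.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def cost_per_successful_request(records: Iterable[Mapping]) -> Dict[str, float]:
    """
    Total spend per tier divided by the number of ACCEPTED requests in that tier.
    Failed requests still add to spend, so failures raise the metric.
    A tier with spend but no accepted requests reports infinity.
    """
    spend: Dict[str, float] = defaultdict(float)
    accepted: Dict[str, int] = defaultdict(int)
    for r in records:
        spend[r["tier"]] += r["cost_usd"]
        accepted[r["tier"]] += int(r["accepted"])
    return {
        tier: spend[tier] / accepted[tier] if accepted[tier] else float("inf")
        for tier in spend
    }
```

Because rejected requests and escalations add to the numerator without adding to the denominator, this number rises when the router or quality gate starts failing, even while cost per token keeps falling.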

Cache hit rate is the operational heartbeat of your semantic cache. Set an alert if it falls more than 10 percentage points below its rolling 7-day average. A drop without a corresponding change in traffic patterns usually points to one of three things: an invalidation that evicted far more entries than the document change warranted, a change to the embedding model or similarity threshold, or a shift in what users are asking. The opposite failure, a document update that never triggers invalidation, does not move the hit rate at all; it shows up as stale answers, so pair this metric with periodic correctness sampling of cache hits.
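
The alert rule described above (more than 10 percentage points below the rolling 7-day average) is simple enough to sketch directly. Hit rates are assumed to be expressed as percentages, one value per day, oldest first; the function name is illustrative.

```python
from typing import Sequence


def hit_rate_alert(
    daily_hit_rates: Sequence[float],
    today_hit_rate: float,
    max_drop_pp: float = 10.0,
    window: int = 7,
) -> bool:
    """True if today's hit rate is more than max_drop_pp points below the rolling mean."""
    recent = list(daily_hit_rates)[-window:]
    if len(recent) < window:
        return False  # not enough history to form a baseline yet
    baseline = sum(recent) / len(recent)
    return (baseline - today_hit_rate) > max_drop_pp


history = [40.0, 42.0, 41.0, 39.0, 40.0, 43.0, 42.0]  # placeholder values
print(hit_rate_alert(history, today_hit_rate=30.0))
```

The check compares percentage points, not relative change, so a fall from 41% to 30% fires the alert while a fall from 41% to 35% does not.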

Student model quality vs teacher on production distribution requires periodic blind evaluation rather than a fixed benchmark. Once a month, sample 500–1,000 production queries that have been routed to the student, regenerate answers from the teacher, and evaluate both side-by-side on your task-specific quality metric. A quality gap that is growing over time signals that the production distribution has drifted away from the training distribution and re-distillation is overdue.

Model routing accuracy measures whether queries are landing in the right tier. Run a shadow evaluation: take a stratified sample of routed queries, run each one on the model it was routed to and on the next tier up, and check whether the cheaper model's answer was genuinely adequate. If the tier-match rate is falling — if you are seeing more cases where the budget-tier answer needed escalation — your classifier needs retraining or your thresholds need adjustment.
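
Drawing the stratified sample for a shadow evaluation could look like the sketch below. It assumes your routing log can be reduced to (query ID, tier) pairs; the function name and the fixed seed are illustrative choices that make the sample reproducible between runs.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Tuple


def stratified_sample(
    routed: Iterable[Tuple[Hashable, str]],
    per_tier: int,
    seed: int = 0,
) -> Dict[str, List[Hashable]]:
    """Pick up to `per_tier` query IDs from each routing tier, without replacement."""
    by_tier: Dict[str, List[Hashable]] = defaultdict(list)
    for query_id, tier in routed:
        by_tier[tier].append(query_id)
    rng = random.Random(seed)
    return {
        tier: rng.sample(ids, min(per_tier, len(ids)))
        for tier, ids in by_tier.items()
    }
```

Sampling per tier rather than across all traffic keeps low-volume tiers represented, so a regression in a tier that handles 10% of requests is not drowned out by the budget tier.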

For the observability infrastructure to collect and alert on these metrics, see the agent observability guide, which covers OpenTelemetry instrumentation for LLM pipelines.

Common Pitfalls and How to Avoid Them

Each of the three levers has a characteristic failure mode, and all of them share one dangerous property: they fail quietly. The bill drops, the dashboard looks clean, and a few days later users start complaining about wrong answers or degraded quality. Understanding the failure modes in advance is the difference between a safe optimisation and a production incident.

Semantic cache serving stale answers. This is the most common failure and it happens when cached responses outlive the data that made them accurate. A user asks about a policy that changed last week; the cache returns the answer you stored three weeks ago because the query is semantically similar; the user acts on outdated information. The fix is cache invalidation by document hash: every cached entry must be associated with the hash or version identifier of the document(s) that informed it, and invalidation must be triggered atomically when those documents change. If you are building on a knowledge base with frequent updates and you cannot instrument document-level invalidation, either use a short TTL or do not use semantic caching for that knowledge base.

Distillation student memorising style instead of reasoning. A student model trained on bare final-answer completions from the teacher will learn to produce text that looks like the teacher's answers without learning how to arrive at them correctly. This produces a model that sounds confident and well-written but makes systematic errors on query types that require reasoning rather than pattern-matching. The fix is always to include explicit reasoning steps — chain-of-thought — in the teacher's training outputs for any task that involves logic, multi-step extraction, or structured decision-making. The additional training data cost is worth it.

Over-routing to cheap models. Routing is seductive because the savings are immediate and large. The temptation, once the budget tier is handling 70% of traffic successfully, is to push it to 80% or 90% to squeeze more out of the bill. Past the quality threshold of the budget model, it starts returning plausible-sounding wrong answers — not errors that your monitoring detects, but incorrect answers that reach users. Guard against this by never widening the budget tier without running the quality gate first on a held-out set that represents the new traffic you are proposing to route there. An escalation path — where a low-confidence budget-tier answer is automatically retried on the next tier — adds safety at relatively low extra cost.
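
An escalation path of the kind described above can be sketched as a loop over tiers from cheapest to most capable. The `call_model` callable is a hypothetical stand-in for your own client wrapper, and how it derives a confidence score (a verifier model, a self-check prompt, a schema validation) is an assumption left to you; the 0.7 cut-off is a placeholder.

```python
from typing import Callable, Sequence, Tuple


def answer_with_escalation(
    query: str,
    tiers: Sequence[str],
    call_model: Callable[[str, str], Tuple[str, float]],
    min_confidence: float = 0.7,
) -> Tuple[str, str]:
    """
    Try each tier cheapest-first. Return (model_id, answer) for the first
    answer whose confidence clears the bar, or the top tier's answer if none do.
    """
    if not tiers:
        raise ValueError("tiers must contain at least one model ID")
    for position, model_id in enumerate(tiers):
        answer, confidence = call_model(model_id, query)
        is_last_tier = position == len(tiers) - 1
        if confidence >= min_confidence or is_last_tier:
            return model_id, answer
    raise AssertionError("unreachable: the last tier always returns")
```

Every escalated request pays for the failed cheap attempt as well as the retry, which is why the escalation rate belongs on the same dashboard as cost per successful request.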

Missing the GPU cost in the distillation ROI calculation. The API saving from distillation is large and visible in your billing dashboard. The infrastructure cost of running the student model is a separate line item — a persistent GPU instance — and it is fixed rather than variable. At low traffic volumes, the fixed GPU cost can exceed the variable API cost you are avoiding, making distillation a negative-ROI decision despite the per-request saving. Always model the full cost, including the fine-tuning run, the GPU instance cost, and the engineering overhead of re-distillation cycles, against the actual API cost you are replacing. The break-even point is typically around 100,000 requests per day for a well-optimised setup; below that, routing and semantic caching almost always have better ROI.
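The comparison is simple arithmetic once the fixed costs are written down. The sketch below assumes the student runs on a single persistent GPU instance whose cost does not vary with traffic, and that the fine-tuning run and re-distillation engineering effort have been amortised into a monthly figure. All parameter names are hypothetical and you should plug in your own billing numbers.

```python
def monthly_costs(requests_per_day, api_cost_per_request,
                  gpu_cost_per_month, amortised_build_cost_per_month, days=30):
    """Return (api_cost, student_cost) for one month at the given volume."""
    api_cost = requests_per_day * days * api_cost_per_request
    # Fixed: GPU instance plus amortised fine-tuning and engineering overhead.
    student_cost = gpu_cost_per_month + amortised_build_cost_per_month
    return api_cost, student_cost


def break_even_requests_per_day(api_cost_per_request, gpu_cost_per_month,
                                amortised_build_cost_per_month, days=30):
    """Daily volume at which the avoided API spend equals the fixed student cost."""
    fixed = gpu_cost_per_month + amortised_build_cost_per_month
    return fixed / (api_cost_per_request * days)
```

As a purely illustrative calculation with placeholder inputs (not measured figures): at $0.002 per API request, $2,500 per month for the GPU instance, and $3,500 per month of amortised build and maintenance cost, the break-even is 6,000 / (0.002 × 30) = 100,000 requests per day. Halve the per-request API cost and the break-even volume doubles, which is why the calculation is worth redoing whenever pricing or routing changes.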

Frequently asked

What is the difference between semantic caching and Anthropic's prompt caching?

They operate at completely different layers. Anthropic's prompt caching is a provider-side feature that stores the processed form of a static prefix in your request (the system prompt, tool schemas) so that subsequent calls sharing that same prefix are billed at a discounted cached-input rate. It only matches exact prefixes. Semantic caching lives entirely in your application: you embed the incoming query, search a vector store for previous queries whose cosine similarity to it exceeds a threshold (typically 0.92–0.95), and return the stored response without making an API call at all. Semantic caching catches queries that are phrased differently but mean the same thing; prompt caching does not. The two techniques are complementary and can be used together.
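The application-side lookup can be sketched in a few lines. This assumes the query has already been embedded by whatever embedding model you use, and it replaces the vector store with a linear scan over an in-memory list to stay self-contained. The function names and the default threshold of `0.92` (the low end of the range above) are illustrative choices.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_lookup(query_vec, cache, threshold=0.92):
    """Return the cached response most similar to the query, or None on a miss.

    cache is a list of (embedding, response) pairs. A real deployment would
    ask a vector store for the nearest neighbour instead of scanning.
    """
    best_score, best_response = 0.0, None
    for vec, response in cache:
        score = cosine_similarity(query_vec, vec)
        if score > best_score:
            best_score, best_response = score, response
    # Only a match at or above the threshold counts as a hit.
    return best_response if best_score >= threshold else None
```

A `None` result means you fall through to the API call, where provider-side prompt caching can still discount the shared prefix.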

When does model distillation make financial sense?

Distillation has a meaningful upfront cost — GPU time for the fine-tuning run, engineering time to collect and curate training data, and evaluation effort before you can gate traffic to the student. As a rule of thumb, distillation becomes financially viable when you are serving more than roughly 100,000 requests per day on a stable, well-defined task. Below that volume the API cost you are avoiding is likely smaller than the engineering and compute cost of building and maintaining the distilled model. The task also needs to be stable enough that the student model you train today still covers the queries you are serving in three months, which rules out tasks where the underlying knowledge or query distribution shifts rapidly.

How do I decide which model to route each request to?

Start with a simple heuristic: classify requests by task type and token length, and map each combination to the cheapest model whose quality bar you can verify. A short classification or entity-extraction query goes to Haiku 4.5; a coding or explanation task goes to Sonnet 4.6; a complex multi-step reasoning task goes to Opus 4.8; a task requiring frontier capability goes to Fable 5. Once you have production data, replace the heuristic with a lightweight classifier — a fine-tuned small model or a prompt sent to Haiku itself — that scores each incoming query. The key discipline is evaluation: widen the budget tier only as fast as a held-out eval set proves quality holds. An escalation path that retries borderline answers on the next tier up is essential insurance.
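A first-pass heuristic router might look like the sketch below. The task labels, the `4000`-token cut-off for bumping long inputs up a tier, and the middle-tier default for unrecognised tasks are all assumptions for illustration. The model strings are the display names used in this article, not API model identifiers.

```python
# Task type -> cheapest tier assumed to meet the quality bar for that task.
ROUTES = {
    "classification": "Haiku 4.5",
    "entity_extraction": "Haiku 4.5",
    "coding": "Sonnet 4.6",
    "explanation": "Sonnet 4.6",
    "multi_step_reasoning": "Opus 4.8",
    "frontier": "Fable 5",
}


def route(task_type, token_count, long_input_tokens=4000):
    """Pick a model from task type and input length.

    Unknown task types fall back to the middle tier (assumption), and
    long inputs are not sent to the budget tier (assumption).
    """
    model = ROUTES.get(task_type, "Sonnet 4.6")
    if model == "Haiku 4.5" and token_count > long_input_tokens:
        model = "Sonnet 4.6"
    return model
```

Keeping the mapping in a plain dictionary makes each routing decision easy to audit, and it gives you labelled production data to train the lightweight classifier that eventually replaces it.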

Can I use distillation with proprietary models like Fable 5?

You can use a proprietary model as the teacher to generate labelled outputs only if the provider's terms of service allow it. Providers often restrict using model outputs to train other models, so review the current terms for your provider, Anthropic included, and confirm your use case is permitted before proceeding. The student model must be an open-weight model you control (for example Qwen 3.5, Llama 4, or Mistral), because you cannot fine-tune the proprietary model itself. The result is a student that runs on your own infrastructure at a fraction of the API cost, but that requires you to manage hosting, scaling, and model updates yourself. For most teams at sufficient scale, that operational overhead is well worth the 10–100× inference cost reduction.

How long does it take to see savings from these three levers?

The three levers have very different implementation timelines. Model routing — even a simple heuristic version — can ship in roughly one week and immediately cuts your average cost per query wherever cheaper models can handle the traffic. Semantic caching takes two to three weeks to build properly, including embedding infrastructure, vector store setup, similarity threshold tuning, and cache invalidation logic. Distillation is the longest investment: collecting training data, running the fine-tuning, evaluating quality, and gating production traffic typically takes four to eight weeks for a first run, and you should plan for iteration. The practical advice is to ship routing first — you see savings almost immediately — and then add semantic caching. Reserve distillation for when you have the traffic volume to justify it and the evaluation discipline to gate it safely.
