Why 50–90% of Your Inference Bill Is Avoidable
Most teams ship their first LLM feature with a single model wired to every code path. That model is almost always a frontier model, because it was the easiest thing to validate during the prototype. The bill that arrives at month-end is treated as the cost of doing business. It is not. Industry analysis consistently shows that enterprises without a deliberate cost strategy routinely overspend 50–90% on inference — not because the models are expensive, but because the models are used carelessly.
The root cause is usually some combination of three patterns: routing everything to the most capable and most expensive model regardless of query difficulty; calling the model afresh for queries you have already answered; and paying API prices for a large proprietary model when a smaller, purpose-built model on your own infrastructure would serve the specific task just as well. Each of these patterns has a well-understood fix, and the fixes compound.
This article covers the three missing levers that our companion piece does not address. If you have not yet implemented prompt caching and prompt compression, start with The Cache, Route, Compress Playbook — those techniques come first because they are faster to ship and carry less risk. What follows here goes a layer deeper: semantic caching that catches "same question, different words"; updated 2026 model routing tiers with concrete routing logic; and model distillation, the training-time technique that takes you off the API pricing curve entirely for well-defined tasks.
The 2026 Model Pricing Landscape — Your Routing Foundation
Before you can route intelligently, you need a clear picture of what you are routing between. As of June 2026, the Anthropic model family spans a 12.5× cost spread from the most capable frontier model to the fastest budget tier. That gap is the foundation of every routing strategy.
| Model | Input (per MTok) | Output (per MTok) | Context window | Best for |
|---|---|---|---|---|
| Claude Fable 5 | $10 | $50 | 200K | Frontier reasoning, complex multi-step agents, creative synthesis |
| Claude Opus 4.8 | $5 | $25 | 200K | Hard analysis, nuanced judgement, long-document processing |
| Claude Sonnet 4.6 | $3 | $15 | 200K | Coding, explanation, structured extraction, everyday chat |
| Claude Haiku 4.5 | $0.80 | $4 | 200K | Classification, short generation, high-throughput low-latency tasks |
Pricing as of June 2026. Check the Fable 5 launch article for the most current figures before modelling your own bill.
The arithmetic is stark: a request that costs $10 per million input tokens on Fable 5 costs $0.80 on Haiku — a 12.5× difference. If 1 million requests per day at 750 tokens average all flow to Fable 5, you are paying roughly $7,500 per day in input costs alone. Route 70% of that same traffic to Haiku and the input cost for that share drops to $420. The routing decision is not a nuance; it is the largest single lever available at the API layer.
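The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch that counts input tokens only; the short model keys and the `daily_input_cost` helper are hypothetical names chosen for illustration, and the prices are copied from the table above.

```python
# Input prices in USD per million tokens, copied from the pricing table above.
# The short keys are illustrative labels, not API model IDs.
INPUT_PRICE_PER_MTOK = {
    "fable-5": 10.00,
    "opus-4.8": 5.00,
    "sonnet-4.6": 3.00,
    "haiku-4.5": 0.80,
}


def daily_input_cost(requests_per_day: int, avg_input_tokens: int, mix: dict[str, float]) -> float:
    """Daily input-token cost in USD for a traffic mix given as model -> share of requests."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1.0")
    total_mtok = requests_per_day * avg_input_tokens / 1_000_000
    return sum(total_mtok * share * INPUT_PRICE_PER_MTOK[model] for model, share in mix.items())


# Everything on the frontier model versus 70% of traffic moved to the budget tier.
all_frontier = daily_input_cost(1_000_000, 750, {"fable-5": 1.0})
routed = daily_input_cost(1_000_000, 750, {"haiku-4.5": 0.7, "fable-5": 0.3})
print(f"all frontier: ${all_frontier:,.0f}/day, 70% routed to Haiku: ${routed:,.0f}/day")
```

With these inputs the all-frontier case comes to $7,500 per day, and the routed case to $2,670: $420 for the Haiku share plus $2,250 for the 30% that stays on Fable 5.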
The routing principle is simple: send each request to the cheapest model that still passes your quality bar. The engineering challenge is deciding what "passes quality" means and measuring it reliably. Here is a starting-point routing function in Python that uses task type and query length as heuristics — adequate for many teams and easy to instrument before you replace it with a classifier.
```python
def route_model(query: str, task_type: str) -> str:
    """
    Heuristic model router; replace with a fine-tuned classifier in production.
    Returns an Anthropic model ID for the cheapest tier that should handle this query.
    """
    if task_type == "complex_reasoning":
        # Checked first so that a short or "explain"-style query that is explicitly
        # labelled as hard analysis is not sent to a cheaper tier by the rules below.
        return "claude-opus-4-8"
    elif task_type == "classification" or len(query) < 200:
        # Short, well-defined tasks: budget tier
        return "claude-haiku-4-5-20251001"
    elif task_type == "coding" or "explain" in query.lower():
        # Structured reasoning and explanation: mid tier
        return "claude-sonnet-4-6"
    else:
        # Frontier tasks only; the default is intentionally conservative
        return "claude-fable-5"
```
A production routing layer will be more sophisticated: a lightweight classifier (possibly Haiku itself) that scores each incoming query for complexity, a model registry that maps scores to model IDs, and a shadow evaluation pipeline that compares routed answers to a frontier baseline on a held-out set. The heuristic above is a safe starting point. Do not widen your budget tier faster than your evaluation data proves it is safe — a 5% quality regression from over-routing is likely to cost far more than the API saving that triggered it.
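The classifier-plus-registry design described above might be sketched as follows, assuming some classifier already produces a complexity score between 0 and 1. The score cut-offs and the `model_for_score` helper are hypothetical; the cut-offs in particular are placeholders to be tuned against your own evaluation data.

```python
from bisect import bisect_right

# Hypothetical registry: each bound is the exclusive upper score limit of a tier.
# A score below 0.25 goes to the first model, below 0.60 to the second, and so on.
SCORE_BOUNDS = [0.25, 0.60, 0.85]
MODEL_IDS = [
    "claude-haiku-4-5-20251001",
    "claude-sonnet-4-6",
    "claude-opus-4-8",
    "claude-fable-5",
]


def model_for_score(score: float) -> str:
    """Map a complexity score in [0, 1] to the cheapest tier allowed to handle it."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be in [0, 1]")
    # bisect_right counts how many bounds are <= score, which is the tier index.
    return MODEL_IDS[bisect_right(SCORE_BOUNDS, score)]
```

Keeping the bounds in one list means that widening or narrowing a tier is a one-line change that can be reviewed alongside the evaluation results that justify it.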
The table below frames the routing decision across the four dimensions that matter in practice.
| Query complexity | Latency budget | Quality requirement | Recommended model |
|---|---|---|---|
| Low (classification, short generation) | Tight (<300 ms) | Functional | Haiku 4.5 |
| Medium (explanation, coding, structured extraction) | Moderate (300–2,000 ms) | High | Sonnet 4.6 |
| High (nuanced analysis, long documents) | Flexible (2–10 s) | Very high | Opus 4.8 |
| Frontier (complex agents, creative synthesis) | Flexible (5–30 s) | Maximum | Fable 5 |
Semantic Caching — Skip the API Call Entirely
Semantic caching is one of the most impactful and most commonly misunderstood optimisations in the LLM cost toolkit. It is frequently confused with Anthropic's prompt caching — they are not the same thing, and understanding the difference determines where each is worth implementing.
Anthropic's prompt caching is a provider-side feature. It stores the processed form of a static prefix in your request — typically the system prompt and tool schemas — and charges a discounted rate for subsequent calls that share that exact prefix. It saves you from paying to re-process tokens the model has already seen. It only works on exact prefix matches, and it is useful for agent loops and chat sessions that resend the same large static context repeatedly. If you have not yet implemented it, see the prompt caching deep-dive.
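For contrast with the semantic cache that follows, here is a minimal sketch of how a request opts in to provider-side prompt caching: the static system prompt is sent as a content block marked with `cache_control`. The `build_cached_request` helper is a hypothetical name, and the `max_tokens` value is an arbitrary placeholder.

```python
def build_cached_request(system_prompt: str, user_query: str, model_id: str) -> dict:
    """Build keyword arguments for client.messages.create with a cacheable system prefix."""
    return {
        "model": model_id,
        "max_tokens": 1024,  # placeholder value
        # The system prompt is sent as a content block so it can carry cache_control.
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Only the user turn varies between calls; the prefix above stays identical.
        "messages": [{"role": "user", "content": user_query}],
    }
```

The result would be passed as `client.messages.create(**build_cached_request(...))`. The saving only applies when the prefix is byte-for-byte identical across calls, which is exactly the limitation semantic caching is designed to get around.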
Semantic caching is an application-layer technique that you build yourself. It embeds the incoming user query into a vector, searches a vector store for previous queries with high cosine similarity, and, if the similarity is above a threshold, returns the stored response without calling the model at all. Its power is that it catches semantically equivalent queries that are phrased differently: "how do I reset my password?" and "I've forgotten my password, what do I do?" will land near each other in embedding space even though they share few words and would never match as an exact prefix.
The expected saving is 40–60% on workloads with overlapping query patterns — customer support, FAQ chatbots, RAG systems over a fixed document corpus. The latency benefit is often just as compelling: a cache hit returns in milliseconds rather than seconds.
Here is a working implementation using sentence-transformers and FAISS for local use, easily swappable for Redis with vector search or Qdrant in a hosted setup.
```python
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer
from typing import Optional


class SemanticCache:
    """
    Simple semantic cache using sentence-transformers for embedding and FAISS for search.
    Swap the index for a Redis/Qdrant client in production.
    """

    def __init__(self, model_name: str = "all-MiniLM-L6-v2", threshold: float = 0.92):
        self.encoder = SentenceTransformer(model_name)
        self.threshold = threshold
        # Read the dimension from the encoder (384 for all-MiniLM-L6-v2) so that
        # changing model_name does not leave the index with the wrong size.
        self.dimension = self.encoder.get_sentence_embedding_dimension()
        # Inner product equals cosine similarity on normalised vectors.
        self.index = faiss.IndexFlatIP(self.dimension)
        # Parallel list: entries[i] belongs to vector i in the index.
        self.entries: list[dict] = []  # {query, response, doc_hash}

    def _embed(self, text: str) -> np.ndarray:
        vec = self.encoder.encode([text], normalize_embeddings=True)
        return vec.astype("float32")

    def lookup(self, query: str) -> Optional[str]:
        """Return a cached response if a similar query exists above threshold, else None."""
        if self.index.ntotal == 0:
            return None
        vec = self._embed(query)
        # With IndexFlatIP the returned values are similarity scores, not distances.
        scores, indices = self.index.search(vec, 1)
        best_score = float(scores[0][0])
        best_idx = int(indices[0][0])
        if best_score >= self.threshold:
            return self.entries[best_idx]["response"]
        return None

    def store(self, query: str, response: str, doc_hash: str = "") -> None:
        """Add a query-response pair to the cache, tagged with a document hash for invalidation."""
        vec = self._embed(query)
        self.index.add(vec)
        self.entries.append({"query": query, "response": response, "doc_hash": doc_hash})

    def invalidate_by_doc(self, doc_hash: str) -> None:
        """
        Remove all entries associated with a given document hash.
        This rebuilds the FAISS index from the remaining entries; in production,
        consider a filterable vector store like Qdrant that supports deletion by metadata.
        """
        surviving = [e for e in self.entries if e["doc_hash"] != doc_hash]
        self.entries = surviving
        # Rebuild index from surviving entries (re-embeds each surviving query)
        self.index = faiss.IndexFlatIP(self.dimension)
        if surviving:
            vecs = np.vstack([self._embed(e["query"]) for e in surviving])
            self.index.add(vecs)


# Usage in a request handler
cache = SemanticCache(threshold=0.93)


def get_answer(query: str, model_client, model_id: str, doc_hash: str = "") -> str:
    cached = cache.lookup(query)
    if cached is not None:  # an empty string is still a valid cached answer
        return cached  # no API call made
    response = model_client.messages.create(
        model=model_id,
        max_tokens=1024,
        messages=[{"role": "user", "content": query}],
    )
    answer = response.content[0].text
    cache.store(query, answer, doc_hash=doc_hash)
    return answer
```
When a source document changes, call invalidate_by_doc(doc_hash) immediately. If you cannot track document hashes, set a time-to-live (TTL) that is shorter than your content refresh cycle.
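The TTL fallback is not part of the cache class shown earlier, so here is one way it could be bolted on. This is a sketch under assumptions: `ExpiringEntries` is a hypothetical helper that records when each entry index was stored, and the caller is expected to treat a stale hit as a miss.

```python
import time
from typing import Optional


class ExpiringEntries:
    """Track when each cached entry was stored and report whether it is still fresh."""

    def __init__(self, ttl_seconds: float):
        self.ttl_seconds = ttl_seconds
        self._stored_at: dict[int, float] = {}  # entry index -> storage timestamp

    def mark_stored(self, entry_idx: int, now: Optional[float] = None) -> None:
        """Record the storage time for an entry; `now` can be injected for testing."""
        self._stored_at[entry_idx] = time.time() if now is None else now

    def is_fresh(self, entry_idx: int, now: Optional[float] = None) -> bool:
        """True if the entry exists and is no older than the TTL."""
        stored = self._stored_at.get(entry_idx)
        if stored is None:
            return False
        current = time.time() if now is None else now
        return (current - stored) <= self.ttl_seconds
```

In the cache, `store` would call `mark_stored` with the new entry's index, and `lookup` would return `None` whenever `is_fresh(best_idx)` is false, so that the request falls through to the model and refreshes the entry.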
Choosing your similarity threshold requires tuning against your specific query distribution. A threshold that is too low causes false hits — returning a stored answer to a query that was similar but not equivalent, which quietly serves wrong responses. A threshold too high makes the cache ineffective. The range 0.92–0.95 is a reasonable starting point for most English-language support and FAQ workloads. Measure your false-hit rate on a labelled sample before moving to production; if a false hit would embarrass you or mislead a user, be conservative.
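Measuring the false-hit rate on a labelled sample can be as simple as the sketch below. It assumes you have already collected pairs of (best similarity score, whether the matched query was truly equivalent) by hand-labelling; the sample values here are made up purely to show the mechanics.

```python
def hit_rate(labelled: list[tuple[float, bool]], threshold: float) -> float:
    """Share of all sampled queries that would be served from the cache at this threshold."""
    if not labelled:
        return 0.0
    return sum(1 for score, _ in labelled if score >= threshold) / len(labelled)


def false_hit_rate(labelled: list[tuple[float, bool]], threshold: float) -> float:
    """Share of cache hits at this threshold whose matched query was NOT truly equivalent."""
    hits = [equivalent for score, equivalent in labelled if score >= threshold]
    if not hits:
        return 0.0
    return sum(1 for equivalent in hits if not equivalent) / len(hits)


# Made-up labelled sample: (similarity of best match, was it really the same question?)
sample = [(0.97, True), (0.95, True), (0.94, False), (0.93, True), (0.91, False), (0.88, False)]

for threshold in (0.90, 0.92, 0.95):
    print(threshold, round(hit_rate(sample, threshold), 2), round(false_hit_rate(sample, threshold), 2))
```

Sweeping the threshold this way makes the trade-off explicit: raising it lowers the false-hit rate but also shrinks the share of traffic the cache can absorb.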
For workloads that span a rapidly changing knowledge base — news summarisation, live pricing, real-time status updates — semantic caching is not appropriate unless you have cache invalidation nailed. Apply it where the underlying data is stable: policy documents, product FAQs, technical documentation, internal knowledge bases with infrequent updates.
Model Distillation — Training a Cheap Student From an Expensive Teacher
Model distillation is categorically different from the API-layer techniques above. It does not change how you call a model; it changes which model you are calling. The goal is to train a small, fast, cheap open-weight model — the student — to replicate the behaviour of a large expensive model — the teacher — on a specific, narrow task distribution. If it works, you move that task's inference off the API pricing curve entirely and onto your own infrastructure at a fraction of the cost.
It is worth being precise about what distillation is and is not. It is not general knowledge compression: you cannot meaningfully distil GPT-4 or Fable 5's breadth of knowledge into a 7B model. What you can do is train a 7B model to match a frontier model's behaviour on your specific task distribution — the exact queries your users send, formatted the way your application formats them, with the exact type of outputs your system expects. On that narrow slice, an 85–95% quality match to the teacher is achievable. Outside that slice, the student may fail badly.
That distinction determines when distillation is viable: it requires a stable, well-defined task. If your query distribution shifts substantially over time — if new products, new use cases, or new user populations change what people are asking — the student you trained last month will slowly drift out of alignment and degrade silently. Plan for periodic re-distillation if you ship it on a dynamic workload.
The distillation pipeline has five steps:
- Sample your production distribution. Collect 5,000–50,000 representative queries from your production logs. Deduplicate, clean, and hold out 10–15% as an evaluation set that will not be used for training.
- Generate teacher completions. Send each training query to your teacher model (Fable 5, Opus 4.8, or another frontier model) and collect the outputs. For reasoning tasks, use chain-of-thought prompting in the teacher — the student learns better reasoning patterns when the teacher's outputs include explicit reasoning steps rather than just final answers.
- Fine-tune the student with LoRA. Train a small open-weight model (Qwen 3.5, Llama 4, Mistral, or similar) on the teacher-labelled data using supervised fine-tuning and LoRA for parameter-efficient training. Tools: HuggingFace transformers + peft, or Unsloth for faster iteration on a single GPU.
- Evaluate on the held-out set. Compare the student's outputs to the teacher's outputs on held-out queries the student never saw during training. Measure task-specific quality metrics, not just perplexity.
- Gate by quality threshold. Only route traffic to the student if it passes your quality bar on the held-out eval. Do not gate by a single aggregate metric — break the eval down by query sub-type so you know where the student underperforms.
Here is a minimal teacher sampling script and a LoRA fine-tuning sketch using PEFT.
```python
# Step 1 + 2: Sample production queries and generate teacher completions
import anthropic
import json

client = anthropic.Anthropic()


def generate_teacher_completions(queries: list[str], output_path: str) -> None:
    """
    Generate teacher completions for a list of queries.
    Uses chain-of-thought prompting so the student learns reasoning steps.
    """
    results = []
    for q in queries:
        response = client.messages.create(
            model="claude-fable-5",  # teacher
            max_tokens=2048,
            messages=[
                {
                    "role": "user",
                    "content": (
                        f"Think step-by-step before giving your final answer.\n\n{q}"
                    ),
                }
            ],
        )
        results.append({"query": q, "completion": response.content[0].text})

    with open(output_path, "w") as f:
        json.dump(results, f, indent=2)
    print(f"Saved {len(results)} teacher completions to {output_path}")
```
```python
# Step 3: LoRA fine-tuning with PEFT (HuggingFace)
import json

from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType
from trl import SFTTrainer

# Load teacher-labelled data
with open("teacher_completions.json") as f:
    raw = json.load(f)

data = [
    {"text": f"### Query:\n{ex['query']}\n\n### Response:\n{ex['completion']}"}
    for ex in raw
]
dataset = Dataset.from_list(data)

# Load student model (replace with your chosen open-weight model)
MODEL_ID = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# 4-bit loading is requested through a quantization config rather than a bare
# load_in_4bit flag, which recent transformers releases deprecate.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)
# Prepare the quantised base model for training before attaching adapters.
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

training_args = TrainingArguments(
    output_dir="./student-distilled",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    warmup_ratio=0.03,
    logging_steps=50,
    save_strategy="epoch",
    fp16=True,
)

# Note: dataset_text_field and max_seq_length are accepted here by older trl
# releases; newer releases expect them on an SFTConfig instead. Check the
# version you have installed.
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
)
trainer.train()
trainer.save_model("./student-distilled/final")
```
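Steps 4 and 5 of the pipeline, evaluating on the held-out set and gating per query sub-type, could be sketched like this. It assumes each held-out example has already been scored as pass or fail against the teacher by whatever task-specific metric you use; the `gate_by_subtype` name, the record shape, and the 0.90 pass-rate bar are illustrative assumptions.

```python
from collections import defaultdict


def gate_by_subtype(results: list[dict], min_pass_rate: float = 0.90) -> dict[str, bool]:
    """
    Decide, per query sub-type, whether the student may serve that traffic.
    Each result is {"subtype": str, "passed": bool}, where "passed" means the
    student's answer met the task-specific quality metric on that example.
    """
    counts: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # subtype -> [passed, total]
    for r in results:
        counts[r["subtype"]][1] += 1
        if r["passed"]:
            counts[r["subtype"]][0] += 1
    return {
        subtype: (passed / total) >= min_pass_rate
        for subtype, (passed, total) in counts.items()
    }
```

A sub-type that fails the gate stays on the API model while the rest moves to the student, which avoids the trap of a healthy aggregate score hiding one badly served category.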
Once the student passes your quality gate, the inference cost story changes fundamentally. A 7B model on a single GPU instance can serve up to hundreds of requests per second on short inputs and outputs when requests are batched, and the per-request cost drops to infrastructure cost only. Compared to Fable 5 at $10/$50 per MTok, a self-hosted 7B model typically runs at 10–100× lower cost per request, depending on your GPU utilisation rate and the length of your inputs and outputs.
Account for the full cost picture before claiming the win: the fine-tuning run itself costs GPU time (typically a few hundred dollars for a well-equipped A100 run on a 7B model with 20K examples); model hosting requires a persistent GPU instance with the associated fixed cost; and re-distillation every few months adds ongoing engineering overhead. Distillation is the right choice when you are serving well above 100,000 requests per day on a task stable enough to justify the investment. Below that scale, routing and semantic caching are almost always the better ROI.
For a deeper treatment of distillation pipelines and student model selection, see our full distillation guide.
Stacking All Three — A Real Cost Model
The true power of these three levers comes from applying them in sequence, because each operates on a smaller bill than the previous one produced. The table below shows what that compounding looks like on a realistic scale: 1 million requests per day at 750 tokens average input, the same workload as the pricing example above, all initially routed to Claude Fable 5.
| Stage | What changes | Daily cost | Monthly cost | Cumulative saving |
|---|---|---|---|---|
| Baseline | All 1M requests to Fable 5 at $10/$50 per MTok | ~$7,500 | ~$225,000 | — |
| + Routing | 70% to Haiku, 20% to Sonnet, 10% to Fable 5 | ~$1,500 | ~$45,000 | 80% |
| + Semantic cache | 40% cache hit rate on remaining API calls | ~$900 | ~$27,000 | 88% |
| + Distillation | Replace Haiku API calls with self-hosted student model | ~$400 | ~$12,000 | 95% |
These figures are illustrative. The dollar amounts and every percentage follow from the pricing and assumptions above; actual savings depend on your query distribution, cache hit rate, output length, task complexity, and GPU hosting costs for the distilled student. The $400/day distillation stage assumes GPU hosting of roughly $2,000/month for a well-utilised A100 instance serving the Haiku-equivalent traffic.
Reading the table down the saving column makes the compounding plain. Routing produces the largest single jump — 80% in this specific traffic mix — because the expensive frontier model now sees only 10% of traffic. In practice, routing savings range from 50–70% depending on your query distribution; this example uses an optimistic 70/20/10 split. Semantic caching adds another 8 percentage points on top: 40% of the remaining API calls hit the cache and return without touching a model. Distillation then moves the Haiku-routed 70% of traffic off the API entirely, converting a variable per-token cost into a fixed infrastructure cost that is substantially lower at this traffic volume.
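The compounding in the saving column follows a simple rule: each stage removes a fraction of whatever bill the earlier stages left behind. A minimal sketch, with `cumulative_saving` as a hypothetical helper name:

```python
def cumulative_saving(stage_savings: list[float]) -> float:
    """
    Combine per-stage savings, where each stage's saving is expressed as a
    fraction of the bill remaining after the stages before it.
    """
    remaining = 1.0
    for saving in stage_savings:
        if not 0.0 <= saving <= 1.0:
            raise ValueError("each stage saving must be a fraction in [0, 1]")
        remaining *= 1.0 - saving
    return 1.0 - remaining


# Routing cuts 80% of the bill; the cache then removes 40% of what is left.
print(f"{cumulative_saving([0.80, 0.40]):.0%}")
```

This is why the cache's 40% hit rate shows up as only 8 percentage points in the table: it is applied to the 20% of the bill that routing left behind, not to the original total.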
The implementation timeline matters too: routing is a week of work, semantic caching is two to three weeks, and distillation is four to eight weeks for the first run. If you are trying to reduce a painful bill quickly, ship routing first — it delivers the biggest absolute saving and takes the least time. The others compound on the smaller bill that routing produces, which makes them easier to prioritise against other engineering work.
For the broader context of why these numbers matter at the infrastructure level, see our FinOps playbook on AI inference cost economics and the companion piece on building profitable AI products in 2026.
For teams in India, GPU hosting for distillation is most cost-effective on AWS Mumbai (ap-south-1) or through Neysa's domestic GPU cloud, where spot A100 instances are priced competitively in INR and avoid cross-border data transfer costs. For UK teams, Azure UK South offers H100 on-demand at prices competitive with US East; reserved 1-year pricing cuts this further by roughly 35%. In both markets, the routing and semantic caching levers pay off immediately at API prices, while distillation's ROI calculation should include the regional GPU hosting cost rather than a US-centric default.
What to Measure — Metrics That Catch Savings Erosion
The three levers are not one-time migrations. Each of them requires ongoing measurement to confirm that the savings you expected are materialising and that the quality you promised your users is holding. The metrics below are the minimum instrumentation for a system that uses all three techniques.
Cost per successful request is the north star metric. Not cost per token, not cost per API call — cost per request that returned an answer your application accepted. Token costs can drop while successful-request costs rise if the routing or quality gate starts failing more often and triggering expensive retries on stronger models. Track this metric at the tier level (budget / mid / frontier / cached / distilled) so you can see where erosion is happening.
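One way to compute this metric from request logs is sketched below. The record shape (`tier`, `cost_usd`, `accepted`) is an assumption about what your logging captures; the key point is that spend on failed or retried attempts stays in the numerator while only accepted requests count in the denominator.

```python
from collections import defaultdict


def cost_per_successful_request(records: list[dict]) -> dict[str, float]:
    """
    Per-tier cost per accepted request.
    Each record is {"tier": str, "cost_usd": float, "accepted": bool}.
    All spend counts, including attempts whose answer was rejected.
    """
    spend: dict[str, float] = defaultdict(float)
    accepted: dict[str, int] = defaultdict(int)
    for r in records:
        spend[r["tier"]] += r["cost_usd"]
        if r["accepted"]:
            accepted[r["tier"]] += 1
    # A tier with spend but no accepted requests is reported as infinitely expensive.
    return {
        tier: (spend[tier] / accepted[tier]) if accepted[tier] else float("inf")
        for tier in spend
    }
```

A tier whose per-token price is unchanged but whose value here is climbing is paying for more rejected answers, which is the erosion this metric is meant to surface.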
Cache hit rate is the operational heartbeat of your semantic cache. Set an alert if it falls more than 10 percentage points below its rolling 7-day average. If it drops without a corresponding change in traffic patterns, the likely causes are invalidation that is flushing more entries than the changed documents justify, a change to the embedding model or similarity threshold, or a shift in how users phrase their queries. Note that missed invalidation does not show up in this metric: stale entries keep producing hits, so pair the hit-rate alert with a periodic spot check of cached answers against the current document versions.
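The alert rule for a drop of more than 10 percentage points below the rolling 7-day average can be expressed in a few lines. This sketch assumes you already record one hit-rate figure per day, in percent; `hit_rate_alert` is a hypothetical helper name.

```python
def hit_rate_alert(daily_hit_rates: list[float], today: float, max_drop_pp: float = 10.0) -> bool:
    """
    True when today's hit rate is more than max_drop_pp percentage points
    below the mean of the previous 7 days. Rates are expressed in percent.
    """
    window = daily_hit_rates[-7:]
    if len(window) < 7:
        return False  # not enough history to form a 7-day baseline
    baseline = sum(window) / len(window)
    return (baseline - today) > max_drop_pp
```

Comparing against a rolling baseline rather than a fixed target keeps the alert meaningful as the cache warms up or as traffic patterns shift gradually.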
Student model quality vs teacher on production distribution requires periodic blind evaluation rather than a fixed benchmark. Once a month, sample 500–1,000 production queries that have been routed to the student, regenerate answers from the teacher, and evaluate both side-by-side on your task-specific quality metric. A quality gap that is growing over time signals that the production distribution has drifted away from the training distribution and re-distillation is overdue.
Model routing accuracy measures whether queries are landing in the right tier. Run a shadow evaluation: take a stratified sample of routed queries, run each one on the model it was routed to and on the next tier up, and check whether the cheaper model's answer was genuinely adequate. If the tier-match rate is falling — if you are seeing more cases where the budget-tier answer needed escalation — your classifier needs retraining or your thresholds need adjustment.
For the observability infrastructure to collect and alert on these metrics, see the agent observability guide, which covers OpenTelemetry instrumentation for LLM pipelines.
Common Pitfalls and How to Avoid Them
Each of the three levers has a characteristic failure mode, and all of them share one dangerous property: they fail quietly. The bill drops, the dashboard looks clean, and a few days later users start complaining about wrong answers or degraded quality. Understanding the failure modes in advance is the difference between a safe optimisation and a production incident.
Semantic cache serving stale answers. This is the most common failure and it happens when cached responses outlive the data that made them accurate. A user asks about a policy that changed last week; the cache returns the answer you stored three weeks ago because the query is semantically similar; the user acts on outdated information. The fix is cache invalidation by document hash: every cached entry must be associated with the hash or version identifier of the document(s) that informed it, and invalidation must be triggered atomically when those documents change. If you are building on a knowledge base with frequent updates and you cannot instrument document-level invalidation, either use a short TTL or do not use semantic caching for that knowledge base.
Distillation student memorising style instead of reasoning. A student model trained on bare final-answer completions from the teacher will learn to produce text that looks like the teacher's answers without learning how to arrive at them correctly. This produces a model that sounds confident and well-written but makes systematic errors on query types that require reasoning rather than pattern-matching. The fix is always to include explicit reasoning steps — chain-of-thought — in the teacher's training outputs for any task that involves logic, multi-step extraction, or structured decision-making. The additional training data cost is worth it.
Over-routing to cheap models. Routing is seductive because the savings are immediate and large. The temptation, once the budget tier is handling 70% of traffic successfully, is to push it to 80% or 90% to squeeze more out of the bill. Once queries exceed what the budget model can handle, it starts returning plausible-sounding wrong answers: not errors that your monitoring detects, but incorrect answers that reach users. Guard against this by never widening the budget tier without first running the quality gate on a held-out set that represents the new traffic you propose to route there. An escalation path, where a low-confidence budget-tier answer is automatically retried on the next tier, adds safety at relatively low extra cost.
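An escalation path of this kind might look like the sketch below. It assumes each tier is wrapped in a callable that returns an answer together with a confidence score; how that score is produced (a verifier model, a rubric check, or something else) is left open, and the 0.7 cut-off is a placeholder.

```python
from typing import Callable

# A tier takes the query and returns (answer, confidence in [0, 1]).
Tier = Callable[[str], tuple[str, float]]


def answer_with_escalation(query: str, tiers: list[Tier], min_confidence: float = 0.7) -> tuple[str, int]:
    """
    Try tiers in order, cheapest first, and stop at the first confident answer.
    Returns the answer and the index of the tier that produced it.
    """
    if not tiers:
        raise ValueError("at least one tier is required")
    answer = ""
    for idx, call_tier in enumerate(tiers):
        answer, confidence = call_tier(query)
        if confidence >= min_confidence:
            return answer, idx
    # No tier was confident: return the last (strongest) tier's answer.
    return answer, len(tiers) - 1
```

Logging the returned tier index gives you the escalation rate for free, which is a useful early signal that the budget tier has been widened too far.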
Missing the GPU cost in the distillation ROI calculation. The API saving from distillation is large and visible in your billing dashboard. The infrastructure cost of running the student model is a separate line item — a persistent GPU instance — and it is fixed rather than variable. At low traffic volumes, the fixed GPU cost can exceed the variable API cost you are avoiding, making distillation a negative-ROI decision despite the per-request saving. Always model the full cost, including the fine-tuning run, the GPU instance cost, and the engineering overhead of re-distillation cycles, against the actual API cost you are replacing. The break-even point is typically around 100,000 requests per day for a well-optimised setup; below that, routing and semantic caching almost always have better ROI.
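The break-even point can be estimated with a back-of-the-envelope calculation. The sketch below is deliberately simplified: it treats the student's marginal cost per request as zero, counts input tokens only on the API side, and ignores engineering overhead. The function name and the amortisation parameters are illustrative assumptions.

```python
def break_even_requests_per_day(
    gpu_cost_per_month: float,
    api_cost_per_request: float,
    one_off_cost: float = 0.0,
    amortise_months: int = 6,
    days_per_month: int = 30,
) -> float:
    """
    Requests per day at which the fixed cost of self-hosting equals the
    API spend it replaces. One-off costs (such as the fine-tuning run)
    are spread evenly over amortise_months.
    """
    if api_cost_per_request <= 0:
        raise ValueError("api_cost_per_request must be positive")
    fixed_per_day = (gpu_cost_per_month + one_off_cost / amortise_months) / days_per_month
    return fixed_per_day / api_cost_per_request


# Haiku input cost for a 750-token request at $0.80 per million tokens,
# against the roughly $2,000/month GPU instance assumed in the cost model.
haiku_cost = 750 * 0.80 / 1_000_000
print(round(break_even_requests_per_day(2000.0, haiku_cost)))
```

With those inputs the break-even lands at roughly 111,000 requests per day, in line with the 100,000 figure above; adding the fine-tuning run or re-distillation cycles as one-off costs pushes it higher.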