Why most teams overspend 3–5× on LLM calls
The uncomfortable truth about most production LLM bills is that they are several times larger than they need to be, and the people paying them often cannot say why. As of June 2026, enterprise spend on model APIs has become a serious line item — it passed $8.4 billion across 2025 and is projected higher through 2026 — yet most teams still have no systematic cost strategy. They picked a strong default model during the prototype, wired it into every code path, and shipped. The bill that followed was treated as the cost of doing business rather than as something an engineer could halve in an afternoon.
The overspend comes from three habits, and they map cleanly onto the three pillars of this playbook. The first is resending the same context on every call — the same system prompt, the same tool schemas, the same retrieved documents — and paying full price to process tokens the provider has already seen. The second is routing everything to a frontier model, so a one-line classification that a budget model handles perfectly is billed at the same rate as a hard multi-step reasoning task. The third is sending bloated prompts stuffed with boilerplate, verbose few-shot examples and irrelevant retrieved chunks, every token of which you pay for whether the model needed it or not.
Each of these is fixable, and the fixes stack. A team that addresses only one typically sees a 40–70% reduction; a team that addresses all three consistently reaches the 70–90% range that this article promises in its title. Whether you run in the AWS Mumbai region or the London region, whether you are an Indian fintech scoring transactions or a UK SaaS summarising support tickets, the levers are the same and the maths is the same. The rest of this guide walks through each pillar with code, builds a worked cost model, and ends with the failure modes that turn a clever optimisation into an outage.
The three pillars: Cache, Route, Compress
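The 2026 cost-optimisation framework is deliberately small, because a framework you can hold in your head is one you will actually apply. It has three pillars, presented here from the safest to the riskiest. That is not the same as ordering them by size of saving: routing usually saves the most, which is why the worked cost model later in this guide applies it first.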
Cache removes work you have already paid for. Most production traffic is repetitive — the same system prompt on every request, the same tools described to an agent on every step of a loop, the same FAQ answered a thousand times a day. Caching, at both the prompt level and the response level, lets you stop paying full price for that repetition. It is the safest pillar because, done correctly, it changes the bill without changing the answer.
Route sends each query to the cheapest model that can handle it. The insight is that difficulty is not uniform: a large share of real traffic is simple, and simple traffic does not need a frontier model. A small classifier plus a model registry turns a flat "everything goes to the expensive model" policy into a graded one, where budget models absorb the easy majority and frontier models are reserved for the genuinely hard minority.
Compress shrinks what you send. After caching and routing have removed repeated and over-served work, compression trims the tokens that remain — collapsing boilerplate, pruning irrelevant retrieved context, batching small jobs together. It comes last because it carries the most quality risk: cut too far and answers degrade. Applied carefully and measured, it is the final multiplier on an already much smaller bill.
The pillars are independent — you can ship them one at a time — but they compound. Because each operates on the bill the previous one produced, a 60% cut from routing followed by a 50% cut from caching is not a 110% cut; it is a 60% cut and then half of the remainder again, which is how the headline range reaches 70–90%. We have covered the same three-lever structure from a news angle in our explainer on how to cut your LLM API bill 70–85% in 2026, which is a useful companion to the hands-on detail below.
Pillar 1 — Caching: prompt-level and semantic
Caching is the first pillar because it is the safest and often the fastest to ship. It works at two distinct levels, and a mature setup uses both.
The first level is prompt-level caching, which is provider-native. As of June 2026, Anthropic, OpenAI and Google all offer prompt caching: the provider stores the processed form of a prefix of your prompt, and on the next request that shares that prefix it skips reprocessing those tokens and bills them at a steep discount. The discipline is structural: order your prompt so the large, static content comes first. Put the system prompt and the tool schemas at the very front, mark a cache breakpoint after them where the provider's API asks for one (Anthropic's does, as the example below shows; others may cache a matching prefix automatically), and keep the variable user turn at the end. For an agent loop or a long chat session that resends the same instructions and tools on every step, this is transformative, because that unchanging prefix is exactly what gets cached.
Here is the pattern with the Anthropic SDK — static system prompt and tool schemas first, with an explicit cache breakpoint. (Code stays in US English, as is conventional.)
```python
from anthropic import Anthropic

client = Anthropic()

# Placeholder: substitute your real policies, style guide and schema docs.
LONG_STATIC_INSTRUCTIONS = "..."

# Large, static content goes FIRST so it can be cached across calls.
SYSTEM_PROMPT = [
    {
        "type": "text",
        "text": LONG_STATIC_INSTRUCTIONS,  # policies, style guide, schema docs
        # Mark a cache breakpoint: everything up to here is cached and
        # billed at the cheaper cached-input rate on subsequent calls.
        "cache_control": {"type": "ephemeral"},
    },
]

TOOLS = [
    # Tool schemas are static too; they are cached with the system prompt.
    # (Descriptions and schemas are elided here.)
    {"name": "search_docs", "description": "...", "input_schema": {...}},
    {"name": "run_query", "description": "...", "input_schema": {...}},
]


def ask(user_question: str):
    return client.messages.create(
        model="claude-haiku",  # placeholder id chosen by the router; see Pillar 2
        max_tokens=1024,
        system=SYSTEM_PROMPT,  # cached prefix
        tools=TOOLS,           # cached prefix
        messages=[
            # Only the variable part changes per request -> not cached.
            {"role": "user", "content": user_question},
        ],
    )
```
The second level is semantic caching, which lives in your application rather than the provider. Instead of matching an exact prefix, it embeds the incoming query, looks for a semantically similar query you have already answered, and — above a similarity threshold — returns the stored response without calling the model at all. For workloads with recurring questions (a support assistant, an internal knowledge bot, an FAQ over policy documents), semantic caching can cut LLM costs up to roughly 68.8% in typical production workloads, and it improves latency at the same time because a cache hit skips the model entirely.
The table below shows the kind of savings each caching level typically delivers. Treat these as ranges that depend on how repetitive your traffic is.
| Caching level | Where it lives | Typical saving | Best for |
|---|---|---|---|
| Prompt-level (cached input tokens) | Provider-native (Anthropic / OpenAI / Google) | 50–90% cheaper on the cached prefix | Agent loops and chat sessions that resend system prompts and tool schemas |
| Semantic (response-level) | Your application layer | 40–80% overall cost cut across workloads; up to ~68.8% in typical ones | Recurring or near-duplicate questions; FAQ and support traffic |
Pricing as of 2026-06. Provider discounts for cached input tokens and exact cache-retention windows change over time — confirm current rates with each provider before modelling your own numbers.
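The semantic-caching level described above can be sketched in a few lines. This is a minimal illustration, not a production design: `embed` is a toy bag-of-words stand-in for a real embedding model, and `SemanticCache`, its method names and the `threshold` default are all hypothetical choices made for this example.

```python
import math
from collections import Counter
from typing import List, Optional, Tuple


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class SemanticCache:
    """Hypothetical response cache keyed by query similarity."""

    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold
        self.entries: List[Tuple[Counter, str]] = []

    def get(self, query: str) -> Optional[str]:
        """Return the answer stored for the most similar query, if close enough."""
        q = embed(query)
        best_score, best_answer = 0.0, None
        for vec, answer in self.entries:
            score = cosine(q, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None

    def put(self, query: str, answer: str) -> None:
        """Store an answer after a real model call."""
        self.entries.append((embed(query), answer))
```

In use, you would call `get` first and only fall through to the model (followed by `put`) on a miss. A real deployment would likely swap the linear scan for a vector index and tune the threshold against its own traffic.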
Order matters more than anything else in prompt caching. The cache only helps for an unchanged prefix, so a single dynamic token near the front — a timestamp, a request ID, a freshly shuffled few-shot order — invalidates everything after it. Keep all static content (system prompt, tool schemas, long policy text) at the very top, push every variable element to the bottom, and your cache hit rate across an agent loop will climb dramatically. We go deeper on the cross-provider mechanics in our guide to prompt caching across Anthropic, OpenAI and Gemini.
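To see why one dynamic token near the front is so costly, the sketch below measures how much of two consecutive prompts is identical from the start. The helper `shared_prefix_len` and the sample strings are illustrative only. Real providers match on tokens rather than characters and apply their own minimum lengths, so treat this as an intuition aid rather than a model of any provider's cache.

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Stand-in for a long system prompt plus tool schemas.
STATIC = "You are a support assistant. Follow the policy document. " * 20

# Bad: a per-request id at the very front changes the prefix almost immediately.
bad_1 = "request_id=1001\n" + STATIC + "User: where is my order?"
bad_2 = "request_id=1002\n" + STATIC + "User: cancel my plan"

# Good: static content first, every variable element last.
good_1 = STATIC + "request_id=1001\nUser: where is my order?"
good_2 = STATIC + "request_id=1002\nUser: cancel my plan"

print("bad ordering shares ", shared_prefix_len(bad_1, bad_2), "characters")
print("good ordering shares", shared_prefix_len(good_1, good_2), "characters")
```

With the id at the front, the two prompts diverge inside the id itself; with the id at the end, the whole static block is shared and therefore eligible for caching.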
Pillar 2 — Model routing: budget → mid → frontier
Routing is usually the single biggest contributor to the headline saving, because it attacks the most common and most expensive mistake: paying frontier rates for traffic that a budget model handles perfectly. The principle is simple — send each query to the cheapest model that can answer it well — but it needs two moving parts to work safely: a complexity classifier that scores each incoming query, and a model registry that maps that score to a cost-appropriate model.
The classifier does not need to be clever to be useful. A cheap, fast heuristic or a small model that returns a difficulty score captures most of the value. Here is a deliberately simple version: a length-and-signal heuristic that buckets queries into three tiers, paired with a registry that maps each tier to a model.
```python
from dataclasses import dataclass


# --- Model registry: complexity tier -> cost-appropriate model ---
@dataclass(frozen=True)
class Tier:
    name: str
    model: str


REGISTRY = {
    "small": Tier("budget", "claude-haiku"),    # cheap, fast
    "medium": Tier("mid", "claude-sonnet"),     # balanced
    "large": Tier("frontier", "claude-opus"),   # most capable, priciest
}

REASONING_SIGNALS = ("prove", "step by step", "analyze", "trade-off",
                     "design", "debug", "why", "compare")


def classify(query: str) -> str:
    """Return 'small' | 'medium' | 'large' for a query."""
    q = query.lower()
    tokens = len(q.split())
    hits = sum(1 for s in REASONING_SIGNALS if s in q)
    if tokens < 30 and hits == 0:
        return "small"   # short, no reasoning signal
    if tokens > 200 or hits >= 2:
        return "large"   # long or clearly multi-step
    return "medium"


def route(query: str) -> str:
    tier = REGISTRY[classify(query)]
    return tier.model    # hand this id to your client call
```
In production you would replace the heuristic with a small classifier model fine-tuned on your own traffic, and you would add a confidence threshold so a low-confidence budget answer can be escalated to the next tier and retried. But even this naive version captures the core economics. The cost table below uses clearly-labelled illustrative tiers rather than vendor list prices, because exact prices move and vary by region.
| Tier | Relative cost per 1M tokens (illustrative) | Good for |
|---|---|---|
| Budget (small) | 1× (baseline) | Classification, extraction, short rewrites, routing decisions, simple FAQ |
| Mid (medium) | several × the budget tier | Summaries, multi-paragraph drafting, moderate reasoning, most chat |
| Frontier (large) | many × the budget tier | Hard multi-step reasoning, complex code, high-stakes analysis |
Pricing as of 2026-06. Figures are illustrative relative tiers, not vendor list prices; as of June 2026, frontier tiers run many times the price of budget tiers, and the exact multiples differ by provider, model and region. Confirm current rates before modelling.
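The confidence-threshold escalation mentioned above might look something like the following sketch. Everything here is an assumption for illustration: `call_model` is a hypothetical callable that returns an answer together with a confidence score in [0, 1] (how you obtain that score is up to you), and the model ids are the same placeholders used in the registry.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical signature: (model_id, query) -> (answer, confidence in [0, 1]).
ModelCall = Callable[[str, str], Tuple[str, float]]

# Placeholder ids, cheapest first, mirroring the registry tiers.
ESCALATION_ORDER = ("claude-haiku", "claude-sonnet", "claude-opus")


def answer_with_escalation(
    query: str,
    start_model: str,
    call_model: ModelCall,
    min_confidence: float = 0.7,
    order: Sequence[str] = ESCALATION_ORDER,
) -> Tuple[str, str]:
    """Try start_model, then each stronger tier, until confidence is high enough.

    Returns (answer, model_used). If no tier reaches the threshold, the
    strongest tier's answer is returned anyway.
    """
    start = order.index(start_model)
    answer, model = "", start_model
    for model in order[start:]:
        answer, confidence = call_model(model, query)
        if confidence >= min_confidence:
            break
    return answer, model
```

Each escalation pays for an extra call, so the threshold trades cost against quality; it is worth logging how often escalation fires before tuning it.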
The economics follow directly from your traffic mix. Suppose, as is common, that 60% of queries are simple, 30% are medium and 10% are genuinely hard. Routing the 60% to a budget tier and the 30% to a mid tier — instead of sending all 100% to a frontier model — typically drops the average cost per query by 50–70% on its own, before caching or compression touch the bill. That is the single largest lever most teams have, and it requires no change to answer quality on the hard 10% that still goes to the frontier model. For a structured way to choose the default model each tier points at, our piece on picking a default model in 2026 by cost-per-task is a good reference, and the broader trajectory of inference pricing is covered in Nvidia Rubin's 10× inference cost cut.
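The arithmetic behind that claim is a weighted average. The sketch below uses the 60/30/10 mix from the text with assumed relative prices of 1×, 3× and 5×; those multiples are invented for this example, not vendor prices, and the calculation also assumes every query uses the same number of tokens, which real traffic will not.

```python
from typing import Dict

# Assumed relative prices per query (budget tier = 1). Not vendor list prices.
RELATIVE_PRICE: Dict[str, float] = {"budget": 1.0, "mid": 3.0, "frontier": 5.0}

# Traffic mix from the text: 60% simple, 30% medium, 10% hard.
TRAFFIC_MIX: Dict[str, float] = {"budget": 0.60, "mid": 0.30, "frontier": 0.10}


def blended_cost(mix: Dict[str, float], price: Dict[str, float]) -> float:
    """Average relative cost per query for a given routing mix."""
    return sum(share * price[tier] for tier, share in mix.items())


def saving_vs_frontier_only(mix: Dict[str, float], price: Dict[str, float]) -> float:
    """Fraction saved compared with sending every query to the frontier tier."""
    return 1.0 - blended_cost(mix, price) / price["frontier"]


print(f"blended cost: {blended_cost(TRAFFIC_MIX, RELATIVE_PRICE):.2f}x budget")
print(f"saving:       {saving_vs_frontier_only(TRAFFIC_MIX, RELATIVE_PRICE):.0%}")
```

Under these assumptions the blended cost is 2× the budget tier against 5× for frontier-only, a saving of about 60%. A steeper price gap between tiers pushes the saving higher; hard queries that are also longer than average pull it lower.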
A routing layer fails silently. When the budget model gets a query it cannot really handle, it does not throw an error — it returns a confident, plausible, wrong answer, and your bill looks great while your quality quietly erodes. Never ship routing without an evaluation harness that compares routed answers against a frontier-only baseline on a held-out set, plus an escalation path so low-confidence cheap answers are retried on a stronger model. Optimise the bill only as fast as your evals prove the quality holds.
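One way to make that rule enforceable is a small gate in your release process. The sketch below assumes you already have routed answers and frontier-only baseline answers for the same held-out queries; `equivalent` is a hypothetical judge you supply (exact match, a rubric, or a model-based grader), and the 0.95 default is an arbitrary placeholder.

```python
from typing import Callable, Sequence


def pass_rate(
    routed: Sequence[str],
    baseline: Sequence[str],
    equivalent: Callable[[str, str], bool],
) -> float:
    """Share of held-out queries where the routed answer matches the baseline."""
    if len(routed) != len(baseline):
        raise ValueError("routed and baseline must cover the same queries")
    if not routed:
        raise ValueError("evaluation set is empty")
    matches = sum(1 for r, b in zip(routed, baseline) if equivalent(r, b))
    return matches / len(routed)


def safe_to_ship(rate: float, min_pass_rate: float = 0.95) -> bool:
    """Gate a routing change on the measured pass rate."""
    return rate >= min_pass_rate
```

The useful property is that the gate fails loudly where the routing layer itself fails silently: a drop in pass rate blocks the change before users see it.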
Pillar 3 — Compression: prompt trimming and batching
With caching and routing in place, the third pillar trims the tokens that remain. Compression is the highest-risk pillar — cut information the model actually needs and answers degrade — so it comes last and is measured most carefully. Applied to genuine redundancy, though, it is nearly free quality-wise and meaningfully reduces both input and output spend.
The techniques fall into a few families. Most are about removing tokens you are paying for but not benefiting from; batching is about amortising fixed per-call overhead across many items. The table below summarises the main ones and the typical reduction each delivers on the tokens it touches.
| Technique | What it removes | Typical token reduction |
|---|---|---|
| Boilerplate trimming | Repeated verbose instructions; restating context the model retains | 10–30% of input |
| Few-shot pruning | Surplus examples beyond the one or two that actually steer behaviour | 20–50% of example tokens |
| Retrieval re-ranking / filtering | Low-relevance retrieved chunks padded into RAG context | 30–60% of context |
| Output capping & format control | Rambling responses; enforce concise or structured output | 20–50% of output |
| Request batching | Per-call fixed overhead across many small jobs | Amortises overhead; raises throughput per call |
Two of these deserve a note. Retrieval filtering is the highest-yield compression in most RAG systems, because the lazy default — stuff the top-k chunks into the prompt and hope — pays for a great deal of context that does not bear on the question. A re-ranking pass that keeps only the chunks above a relevance threshold often halves the context tokens with no quality loss, and frequently improves the answer by reducing distraction. Batching is the one technique here that is about throughput rather than per-token price: grouping many small extraction or classification jobs into fewer calls amortises the fixed overhead and, where a provider offers a batch tier, can unlock a further discount. The layered view of how these stack with caching is set out in our explainer on cutting LLM API costs 70–90% with layered caching.
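Both techniques are simple to sketch. The example below assumes each retrieved chunk already carries a relevance score from a re-ranker of your choice (the scoring itself is out of scope here); `filter_chunks`, `batches` and their default values are hypothetical names and numbers chosen for illustration.

```python
from typing import Iterator, List, Sequence, Tuple


def filter_chunks(
    scored_chunks: Sequence[Tuple[str, float]],
    min_score: float = 0.5,
    max_chunks: int = 5,
) -> List[str]:
    """Keep the highest-scoring chunks at or above min_score, at most max_chunks."""
    kept = [(text, score) for text, score in scored_chunks if score >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in kept[:max_chunks]]


def batches(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of up to `size` items, one group per model call."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])
```

The relevance threshold is the quality-sensitive knob: set it against your evaluation set rather than by intuition, since too high a cut-off removes context the model genuinely needs.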
Putting it together: a worked cost model
The pillars are most convincing when you watch them stack on a single workload. Consider an illustrative service handling 1,000,000 requests per month — a UK SaaS support assistant, say, or an Indian fintech document-classification pipeline. Before any optimisation, every request goes to a frontier model, nothing is cached, and prompts carry full boilerplate and unfiltered context. We will index the baseline monthly cost at 100 units and apply each pillar in turn.
| Stage | What changes | Effect on remaining bill | Relative monthly cost (illustrative) |
|---|---|---|---|
| Baseline | All traffic to frontier; no cache; full prompts | — | 100 |
| + Routing | 60% budget / 30% mid / 10% frontier | ~60% lower average cost per query | 40 |
| + Caching | Prompt-level prefix cache + semantic cache on recurring queries | ~50% off the routed bill | 20 |
| + Compression | Boilerplate trimmed, retrieval filtered, output capped, batching | ~25% off the remaining bill | 15 |
Pricing as of 2026-06. The 100-unit index and every percentage here are illustrative and chosen to show how the pillars compound, not to predict your exact bill. Your mix of traffic difficulty, repetition and prompt bloat determines where in the 70–90% band you land.
Read down the right-hand column and the compounding is plain. Routing alone takes the bill from 100 to 40 — a 60% cut — because the expensive frontier model now sees only the hard 10% of traffic instead of all of it. Caching then removes roughly half of that reduced bill, not half of the original, taking 40 to 20: the agent loops and recurring questions stop paying full price for repeated context. Compression shaves another quarter off what is left, 20 down to 15. The end state is a bill that is 85% smaller than where it started — squarely in the 70–90% range that teams who implement all three layers consistently report in production.
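The table's right-hand column can be reproduced with a short calculation, which also makes it easy to substitute your own estimates. The percentages are the illustrative ones from the table; the function name `apply_stages` is just a label for this sketch.

```python
from typing import List, Tuple


def apply_stages(
    baseline: float, cuts: List[Tuple[str, float]]
) -> List[Tuple[str, float]]:
    """Return the running cost after each (name, fractional cut) stage."""
    costs = [("Baseline", baseline)]
    current = baseline
    for name, cut in cuts:
        current *= 1.0 - cut  # each cut applies to what is left, not the original
        costs.append((name, current))
    return costs


# Illustrative cuts from the worked table above.
STAGES = [("Routing", 0.60), ("Caching", 0.50), ("Compression", 0.25)]

for name, cost in apply_stages(100.0, STAGES):
    print(f"{name:<12} {cost:6.1f}")
```

Because the cuts multiply, the final index is 100 × 0.40 × 0.50 × 0.75 = 15, whatever order the stages are applied in; the order matters for engineering effort and risk, not for the arithmetic.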
The order is deliberate. Routing first means caching and compression operate on a workload that is already sending the right traffic to the right model, so their savings are clean rather than masked by frontier overspend. Caching second removes the repeated work before you spend effort trimming prompts that, on a semantic-cache hit, never reach the model at all. Compression last, on the smallest possible surface, keeps its quality risk contained. For the wider economic context of why these numbers matter so much in 2026, our FinOps playbook on AI inference cost economics sets out the spend trajectory that makes this engineering non-optional.
Pitfalls: poisoning, over-routing and eval drift
Each pillar has a characteristic failure mode, and all three are the kind that look fine on the dashboard until they do not.
Cache poisoning and staleness. A semantic cache that returns a stored answer for a query that is similar but not equivalent will serve a wrong answer cheaply and confidently. The classic case is a query that shares wording but differs in a decisive detail — a different account, a different date, a negation — yet lands above your similarity threshold. The defences are a conservative threshold, cache keys that include the variables that change the answer, and a time-to-live so stale answers expire. For prompt-level caching, the risk is subtler: a cached prefix that includes content that should have varied. Treat the cache as part of your correctness surface, not just your cost surface.
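Two of those defences, scoped keys and a time-to-live, can be sketched as follows. The class below is a hypothetical exact-match cache used only to show the mechanics; in a semantic cache the same `scope` value would partition the similarity search so that one account's answer can never be returned for another.

```python
import time
from typing import Dict, Optional, Tuple


class ScopedTTLCache:
    """Response cache keyed by (scope, normalised query), with expiry."""

    def __init__(self, ttl_seconds: float):
        self.ttl_seconds = ttl_seconds
        self._store: Dict[Tuple[str, str], Tuple[float, str]] = {}

    @staticmethod
    def _key(scope: str, query: str) -> Tuple[str, str]:
        # Scope carries whatever changes the answer: account, date, locale...
        return scope, " ".join(query.lower().split())

    def put(self, scope: str, query: str, answer: str,
            now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        self._store[self._key(scope, query)] = (now + self.ttl_seconds, answer)

    def get(self, scope: str, query: str,
            now: Optional[float] = None) -> Optional[str]:
        now = time.time() if now is None else now
        key = self._key(scope, query)
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, answer = entry
        if now >= expires_at:
            del self._store[key]  # stale: drop it and report a miss
            return None
        return answer
```

The optional `now` argument exists so expiry can be tested deterministically; production callers would normally omit it.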
Over-routing. The temptation, once routing works, is to push ever more traffic to the budget tier to chase the bill down. Past a point the budget model starts failing silently on queries that were marginally too hard, and because it returns a plausible answer rather than an error, nobody notices until users complain. Guard against it by widening the budget tier only as fast as your held-out evaluation proves quality holds, and by keeping an escalation path that retries low-confidence answers on a stronger model. A 5% quality regression can cost far more than the API saving that caused it.
Eval drift. All of the above depends on an evaluation set that reflects real traffic — and real traffic moves. The query mix that justified routing 60% to the budget tier in one quarter may shift as your product changes, your users change, or the world changes. An eval set that is never refreshed slowly stops measuring reality, and your cost decisions drift out from under it. Re-sample production traffic into your eval set on a schedule, re-run the routing and compression evaluations against it, and treat a falling cheap-tier pass rate as a prompt to re-tune the thresholds. The cost playbook is not a one-time migration; it is a control loop you keep running.
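One possible shape for that control loop is sketched below. Reservoir sampling is a choice made for this example, not something the playbook prescribes: it draws a uniform sample from a stream of production queries without holding the whole stream in memory. The `needs_retune` helper and its 0.02 tolerance are likewise hypothetical.

```python
import random
from typing import Iterable, List, Optional


def reservoir_sample(
    stream: Iterable[str], k: int, seed: Optional[int] = None
) -> List[str]:
    """Uniformly sample up to k items from a stream of unknown length."""
    rng = random.Random(seed)
    sample: List[str] = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)       # fill the reservoir first
        else:
            j = rng.randint(0, i)     # inclusive on both ends
            if j < k:
                sample[j] = item      # replace with probability k / (i + 1)
    return sample


def needs_retune(
    current_pass_rate: float, reference_pass_rate: float, tolerance: float = 0.02
) -> bool:
    """Flag when the cheap-tier pass rate has fallen by more than `tolerance`."""
    return reference_pass_rate - current_pass_rate > tolerance
```

Run on a schedule, the sampler refreshes the evaluation set from recent traffic, and the drift check turns a falling pass rate into an explicit signal to revisit the routing thresholds.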
Conclusion and next steps
The headline is achievable and the path is not mysterious: cache what repeats, route each query to the cheapest capable model, compress what you send, and measure all three against a held-out evaluation. Ship them in the order the worked cost model uses (routing for the biggest single win, caching for the safe compounding saving, compression last and carefully) and a workload that started at full frontier price lands 70–90% cheaper without a meaningful quality cost. Start with one pillar this week; instrument it; then add the next.
If you are the engineer who took a feature from a frightening monthly bill to a fraction of it, that is exactly the kind of shipped, measurable work that the people hiring and collaborating in AI want to see. A Verified Builder profile on AI Tech Connect is where you put it — a single page that shows what you built and what it saved, in front of an audience that understands why it matters. For more on framing shipped work for a portfolio, see the AI engineer portfolio that gets you hired.