What is the single biggest lever for cutting LLM costs?

There is no single lever that does all the work — the 70–90% reductions come from stacking three. That said, model routing is usually the biggest single contributor, because most production traffic is simple and a large share of queries do not need a frontier model. If 60% of queries go to a budget tier, 30% to a mid tier and 10% to a frontier tier, the average cost per query typically drops 50–70% on its own. Caching then layers on top, removing repeated work in agent loops and recurring prompts.

Does prompt caching work across providers?

Yes. As of June 2026, Anthropic, OpenAI and Google all offer prompt caching for their models, and the core technique is the same everywhere: put the large static content — system prompt and tool schemas — at the very front of the request so it can be cached, and keep the variable user content at the end. Cached input tokens are typically 50–90% cheaper than uncached input, so the savings are largest for agent loops and chat sessions that resend the same context many times. The exact API surface and discount differ by provider, so confirm each one's current pricing before you model the numbers.

How do I route between models safely without quality regressions?

Routing safely needs two parts: a complexity classifier that scores each query, and a model registry that maps the score to the cheapest capable model. The safety comes from instrumentation, not optimism. Run a shadow evaluation that compares the routed answer against a frontier-only baseline on a held-out set, watch the rate at which cheap-tier answers are escalated or corrected, and always allow an escalation path so a low-confidence budget-tier answer is retried on a stronger model. Start conservative — route only the obviously simple traffic to the budget tier — and widen the budget tier as your evaluation data proves it is safe.

Will prompt compression hurt answer quality?

It can, if you compress blindly, which is why compression is the lever to apply last and to measure most carefully. Safe compression removes genuine redundancy — boilerplate instructions repeated every call, verbose few-shot examples that can be trimmed, and retrieved context that is not relevant to the question. Aggressive compression that strips information the model actually needs will quietly degrade answers. Always run a held-out eval before and after, and treat any drop in your quality metric as a signal that you have compressed past the safe point.

What total cost saving is realistic in production?

Teams that implement all three layers — caching, routing and compression — consistently and measure them properly report 70–90% reductions in production LLM spend. The exact figure depends on the workload: high-repetition agent and chat workloads benefit most from caching, traffic with a wide spread of difficulty benefits most from routing, and verbose prompts benefit most from compression. A single lever in isolation tends to deliver 40–70%; the top of the range comes from stacking all three so each compounds on the smaller bill the previous one produced.

The Cache, Route, Compress Playbook: Cutting LLM Costs 70–90% in Production

Why most teams overspend 3–5× on LLM calls

The uncomfortable truth about most production LLM bills is that they are several times larger than they need to be, and the people paying them often cannot say why. As of June 2026, enterprise spend on model APIs has become a serious line item — it passed $8.4 billion across 2025 and is projected higher through 2026 — yet most teams still have no systematic cost strategy. They picked a strong default model during the prototype, wired it into every code path, and shipped. The bill that followed was treated as the cost of doing business rather than as something an engineer could halve in an afternoon.

The overspend comes from three habits, and they map cleanly onto the three pillars of this playbook. The first is resending the same context on every call — the same system prompt, the same tool schemas, the same retrieved documents — and paying full price to process tokens the provider has already seen. The second is routing everything to a frontier model, so a one-line classification that a budget model handles perfectly is billed at the same rate as a hard multi-step reasoning task. The third is sending bloated prompts stuffed with boilerplate, verbose few-shot examples and irrelevant retrieved chunks, every token of which you pay for whether the model needed it or not.

Each of these is fixable, and the fixes stack. A team that addresses only one typically sees a 40–70% reduction; a team that addresses all three consistently reaches the 70–90% range that this article promises in its title. Whether you run in the AWS Mumbai region or the London region, whether you are an Indian fintech scoring transactions or a UK SaaS summarising support tickets, the levers are the same and the maths is the same. The rest of this guide walks through each pillar with code, builds a worked cost model, and ends with the failure modes that turn a clever optimisation into an outage.

The three pillars: Cache, Route, Compress

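The 2026 cost-optimisation framework is deliberately small, because a framework you can hold in your head is one you will actually apply. It has three pillars, presented here in order of quality risk: caching is the safest, routing needs an evaluation harness behind it, and compression carries the most risk. Routing, the middle pillar, is usually the largest single saver.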

Cache removes work you have already paid for. Most production traffic is repetitive — the same system prompt on every request, the same tools described to an agent on every step of a loop, the same FAQ answered a thousand times a day. Caching, at both the prompt level and the response level, lets you stop paying full price for that repetition. It is the safest pillar because, done correctly, it changes the bill without changing the answer.

Route sends each query to the cheapest model that can handle it. The insight is that difficulty is not uniform: a large share of real traffic is simple, and simple traffic does not need a frontier model. A small classifier plus a model registry turns a flat "everything goes to the expensive model" policy into a graded one, where budget models absorb the easy majority and frontier models are reserved for the genuinely hard minority.

Compress shrinks what you send. After caching and routing have removed repeated and over-served work, compression trims the tokens that remain — collapsing boilerplate, pruning irrelevant retrieved context, batching small jobs together. It comes last because it carries the most quality risk: cut too far and answers degrade. Applied carefully and measured, it is the final multiplier on an already much smaller bill.

The pillars are independent — you can ship them one at a time — but they compound. Because each operates on the bill the previous one produced, a 60% cut from routing followed by a 50% cut from caching is not a 110% cut; it is a 60% cut and then half of the remainder again, which is how the headline range reaches 70–90%. We have covered the same three-lever structure from a news angle in our explainer on how to cut your LLM API bill 70–85% in 2026, which is a useful companion to the hands-on detail below.
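The compounding arithmetic above can be sketched in a few lines. This is a minimal illustration, and the function name `stacked_saving` is hypothetical; it simply applies each fractional cut to whatever bill the previous cut left behind.

```python
def stacked_saving(*cuts: float) -> float:
    """Combine fractional cuts that are applied one after another.

    Each cut acts on the bill left over by the previous one, so the
    remaining fractions multiply instead of the cuts adding up.
    """
    remaining = 1.0
    for cut in cuts:
        if not 0.0 <= cut <= 1.0:
            raise ValueError("each cut must be a fraction between 0 and 1")
        remaining *= 1.0 - cut
    return 1.0 - remaining


# 60% from routing, then 50% of what is left from caching.
print(f"{stacked_saving(0.60, 0.50):.0%}")  # 80%
```

Because the remaining fractions multiply, the combined saving can approach 100% but never exceed it, however many levers are stacked.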

Pillar 1 — Caching: prompt-level and semantic

Caching is the first pillar because it is the safest and often the fastest to ship. It works at two distinct levels, and a mature setup uses both.

The first level is prompt-level caching, which is provider-native. As of June 2026, Anthropic, OpenAI and Google all offer prompt caching: the provider stores the processed form of a prefix of your prompt, and on the next request that shares that prefix it skips reprocessing those tokens and bills them at a steep discount. The discipline is structural — order your prompt so the large, static content comes first. Put the system prompt and the tool schemas at the very front, mark a cache breakpoint after them, and keep the variable user turn at the end. For an agent loop or a long chat session that resends the same instructions and tools on every step, this is transformative, because that unchanging prefix is exactly what gets cached.

Here is the pattern with the Anthropic SDK — static system prompt and tool schemas first, with an explicit cache breakpoint. (Code stays in US English, as is conventional.)

from anthropic import Anthropic

client = Anthropic()

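# Placeholder: your real policies, style guide and schema docs go here.
LONG_STATIC_INSTRUCTIONS = "..."

# Large, static content goes FIRST so it can be cached across calls.
SYSTEM_PROMPT = [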
    {
        "type": "text",
        "text": LONG_STATIC_INSTRUCTIONS,   # policies, style guide, schema docs
        # Mark a cache breakpoint: everything up to here is cached and
        # billed at the cheaper cached-input rate on subsequent calls.
        "cache_control": {"type": "ephemeral"},
    },
]

TOOLS = [
    # Tool schemas are static too — cache them with the system prompt.
    {"name": "search_docs", "description": "...", "input_schema": {...}},
    {"name": "run_query",   "description": "...", "input_schema": {...}},
]

def ask(user_question: str):
    return client.messages.create(
        model="claude-haiku",          # routed model id; see Pillar 2
        max_tokens=1024,
        system=SYSTEM_PROMPT,           # cached prefix
        tools=TOOLS,                    # cached prefix
        messages=[
            # Only the variable part changes per request -> not cached.
            {"role": "user", "content": user_question},
        ],
    )

The second level is semantic caching, which lives in your application rather than the provider. Instead of matching an exact prefix, it embeds the incoming query, looks for a semantically similar query you have already answered, and, if the match clears a similarity threshold, returns the stored response without calling the model at all. For workloads with recurring questions (a support assistant, an internal knowledge bot, an FAQ over policy documents), semantic caching can cut LLM costs by up to roughly 68.8% in typical production workloads, and it improves latency at the same time because a cache hit skips the model entirely.
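A semantic cache along these lines might look like the sketch below. It is an illustration under stated assumptions, not a production design: `embed` is a toy word-count stand-in for a real embedding model, the linear scan stands in for a vector index, and `SemanticCache`, `answer` and the `call_model` callable are hypothetical names.

```python
import math
import re
from collections import Counter
from typing import Callable, List, Optional, Tuple


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.9) -> None:
        self.threshold = threshold
        self.entries: List[Tuple[Counter, str]] = []

    def lookup(self, query: str) -> Optional[str]:
        """Return the closest stored answer if it clears the threshold."""
        vector = embed(query)
        best_score, best_answer = 0.0, None
        for stored_vector, stored_answer in self.entries:
            score = cosine(vector, stored_vector)
            if score > best_score:
                best_score, best_answer = score, stored_answer
        return best_answer if best_score >= self.threshold else None

    def store(self, query: str, answer: str) -> None:
        self.entries.append((embed(query), answer))


def answer(query: str, cache: SemanticCache,
           call_model: Callable[[str], str]) -> str:
    """Serve from the cache when possible; otherwise call the model once."""
    cached = cache.lookup(query)
    if cached is not None:
        return cached          # cache hit: no model call, no token cost
    fresh = call_model(query)
    cache.store(query, fresh)
    return fresh
```

The threshold is the knob that trades hit rate against the risk of serving an answer to a question that only looks similar, so it is worth starting high and lowering it only with evaluation data.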

The table below shows the kind of savings each caching level typically delivers. Treat these as ranges that depend on how repetitive your traffic is.

Caching level	Where it lives	Typical saving	Best for
Prompt-level (cached input tokens)	Provider-native (Anthropic / OpenAI / Google)	50–90% cheaper on the cached prefix	Agent loops and chat sessions that resend system prompts and tool schemas
Semantic (response-level)	Your application layer	40–80% overall cost cut (up to ~68.8% in typical workloads)	Recurring or near-duplicate questions; FAQ and support traffic

Pricing as of 2026-06. Provider discounts for cached input tokens and exact cache-retention windows change over time — confirm current rates with each provider before modelling your own numbers.

Pro tip

Order matters more than anything else in prompt caching. The cache only helps for an unchanged prefix, so a single dynamic token near the front — a timestamp, a request ID, a freshly shuffled few-shot order — invalidates everything after it. Keep all static content (system prompt, tool schemas, long policy text) at the very top, push every variable element to the bottom, and your cache hit rate across an agent loop will climb dramatically. We go deeper on the cross-provider mechanics in our guide to prompt caching across Anthropic, OpenAI and Gemini.
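The effect of a dynamic value near the front can be seen with a small check. This sketch compares prompts character by character as a rough proxy; providers match on tokens rather than characters, and the helper names `shared_prefix_len` and `build_prompt` are hypothetical.

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters two prompts have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


def build_prompt(static: str, timestamp: str, question: str,
                 timestamp_first: bool) -> str:
    """Assemble a prompt with the timestamp before or after the static text."""
    stamp = f"Request time: {timestamp}\n"
    if timestamp_first:
        return stamp + static + question
    return static + stamp + question


# Stand-in for a long system prompt plus tool schemas.
STATIC = "You are a support assistant. Follow the refund policy below.\n" * 20

for timestamp_first in (True, False):
    first = build_prompt(STATIC, "2026-06-01T10:00:00",
                         "Where is my order?", timestamp_first)
    second = build_prompt(STATIC, "2026-06-01T10:00:05",
                          "Can I get a refund?", timestamp_first)
    print(f"timestamp_first={timestamp_first}: "
          f"shared prefix {shared_prefix_len(first, second)} chars, "
          f"static block {len(STATIC)} chars")
```

With the timestamp first, the two requests diverge before the static block even begins, so none of it is reusable; with the timestamp after the static block, the whole block sits inside the shared prefix.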

Pillar 2 — Model routing: budget → mid → frontier

Routing is usually the single biggest contributor to the headline saving, because it attacks the most common and most expensive mistake: paying frontier rates for traffic that a budget model handles perfectly. The principle is simple — send each query to the cheapest model that can answer it well — but it needs two moving parts to work safely: a complexity classifier that scores each incoming query, and a model registry that maps that score to a cost-appropriate model.

The classifier does not need to be clever to be useful. A cheap, fast heuristic or a small model that returns a difficulty score captures most of the value. Here is a deliberately simple version: a length-and-signal heuristic that buckets queries into three tiers, paired with a registry that maps each tier to a model.

from dataclasses import dataclass

# --- Model registry: complexity tier -> cost-appropriate model ---
@dataclass(frozen=True)
class Tier:
    name: str
    model: str

REGISTRY = {
    "small":  Tier("budget",   "claude-haiku"),    # cheap, fast
    "medium": Tier("mid",      "claude-sonnet"),   # balanced
    "large":  Tier("frontier", "claude-opus"),     # most capable, priciest
}

REASONING_SIGNALS = ("prove", "step by step", "analyze", "trade-off",
                     "design", "debug", "why", "compare")

def classify(query: str) -> str:
    """Return 'small' | 'medium' | 'large' for a query."""
    q = query.lower()
    tokens = len(q.split())
    hits = sum(1 for s in REASONING_SIGNALS if s in q)

    if tokens < 30 and hits == 0:
        return "small"            # short, no reasoning signal
    if tokens > 200 or hits >= 2:
        return "large"            # long or clearly multi-step
    return "medium"

def route(query: str) -> str:
    tier = REGISTRY[classify(query)]
    return tier.model             # hand this id to your client call

In production you would replace the heuristic with a small classifier model fine-tuned on your own traffic, and you would add a confidence threshold so a low-confidence budget answer can be escalated to the next tier and retried. But even this naive version captures the core economics. The cost table below uses clearly-labelled illustrative tiers rather than vendor list prices, because exact prices move and vary by region.

Tier	Relative cost per 1M tokens (illustrative)	Good for
Budget (small)	1× (baseline)	Classification, extraction, short rewrites, routing decisions, simple FAQ
Mid (medium)	several × the budget tier	Summaries, multi-paragraph drafting, moderate reasoning, most chat
Frontier (large)	many × the budget tier	Hard multi-step reasoning, complex code, high-stakes analysis

Pricing as of 2026-06. Figures are illustrative relative tiers, not vendor list prices; as of June 2026, frontier tiers run several × the price of budget tiers, and the exact multiples differ by provider, model and region. Confirm current rates before modelling.
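The confidence-threshold escalation mentioned before the cost table could be sketched as follows. The assumptions are loud ones: `call_model` is a hypothetical callable that returns an answer together with a confidence score between 0 and 1, and how you obtain that score (a verifier, a self-reported rating, a rubric) is left open. The model ids are the article's placeholders.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical signature: (model_id, query) -> (answer, confidence in [0, 1]).
ModelCall = Callable[[str, str], Tuple[str, float]]


def answer_with_escalation(query: str,
                           models: Sequence[str],
                           call_model: ModelCall,
                           min_confidence: float = 0.7) -> Tuple[str, str]:
    """Try models cheapest-first; return (model_used, answer).

    A tier's answer is accepted once its confidence clears the threshold.
    If no tier clears it, the strongest tier's answer is returned.
    """
    if not models:
        raise ValueError("at least one model is required")
    answer = ""
    for model in models:
        answer, confidence = call_model(model, query)
        if confidence >= min_confidence:
            return model, answer
    return models[-1], answer


# Cheapest to strongest, using the placeholder ids from the registry above.
TIERS = ("claude-haiku", "claude-sonnet", "claude-opus")
```

Each escalation pays for the failed cheap attempt as well as the retry, so the escalation rate itself is a number worth tracking alongside the bill.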

The economics follow directly from your traffic mix. Suppose, as is common, that 60% of queries are simple, 30% are medium and 10% are genuinely hard. Routing the 60% to a budget tier and the 30% to a mid tier — instead of sending all 100% to a frontier model — typically drops the average cost per query by 50–70% on its own, before caching or compression touch the bill. That is the single largest lever most teams have, and it requires no change to answer quality on the hard 10% that still goes to the frontier model. For a structured way to choose the default model each tier points at, our piece on picking a default model in 2026 by cost-per-task is a good reference, and the broader trajectory of inference pricing is covered in Nvidia Rubin's 10× inference cost cut.
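The blended-cost arithmetic behind that claim is easy to reproduce. In the sketch below the relative prices (1×, 3×, 5×) are hypothetical multipliers picked only to illustrate the calculation, not vendor prices, and the model assumes every tier sees queries of similar token length.

```python
from typing import Dict

# Hypothetical relative prices per query; replace with your own figures.
RELATIVE_COST: Dict[str, float] = {"budget": 1.0, "mid": 3.0, "frontier": 5.0}

# Traffic mix from the text: 60% simple, 30% medium, 10% hard.
TRAFFIC_MIX: Dict[str, float] = {"budget": 0.60, "mid": 0.30, "frontier": 0.10}


def blended_cost(mix: Dict[str, float], cost: Dict[str, float]) -> float:
    """Average cost per query when each share of traffic goes to its tier."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * cost[tier] for tier, share in mix.items())


def routing_saving(mix: Dict[str, float], cost: Dict[str, float],
                   baseline_tier: str = "frontier") -> float:
    """Fractional saving versus sending all traffic to the baseline tier."""
    return 1.0 - blended_cost(mix, cost) / cost[baseline_tier]


print(f"{routing_saving(TRAFFIC_MIX, RELATIVE_COST):.0%}")  # 60%
```

A wider price gap between the budget and frontier tiers pushes the saving higher for the same mix, which is why the realistic figure depends on the provider and models you actually route between.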

Watch out

A routing layer fails silently. When the budget model gets a query it cannot really handle, it does not throw an error — it returns a confident, plausible, wrong answer, and your bill looks great while your quality quietly erodes. Never ship routing without an evaluation harness that compares routed answers against a frontier-only baseline on a held-out set, plus an escalation path so low-confidence cheap answers are retried on a stronger model. Optimise the bill only as fast as your evals prove the quality holds.
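A shadow evaluation of the kind described here can start very small. The sketch below is one possible shape, with hypothetical names throughout: `routed_answer` and `baseline_answer` are callables you supply, and `is_equivalent` is whatever judge you trust (exact match, a rubric, a grader model).

```python
from typing import Callable, Iterable


def shadow_eval(queries: Iterable[str],
                routed_answer: Callable[[str], str],
                baseline_answer: Callable[[str], str],
                is_equivalent: Callable[[str, str], bool]) -> float:
    """Fraction of held-out queries where the routed answer matches
    the frontier-only baseline according to the supplied judge."""
    total = 0
    passed = 0
    for query in queries:
        total += 1
        if is_equivalent(routed_answer(query), baseline_answer(query)):
            passed += 1
    if total == 0:
        raise ValueError("held-out set is empty")
    return passed / total


def safe_to_widen(pass_rate: float, floor: float = 0.98) -> bool:
    """Only send more traffic to the budget tier while quality holds.

    The 0.98 floor is an illustrative default, not a recommendation.
    """
    return pass_rate >= floor
```

Running this on every routing change, and again on a schedule, turns "quality quietly erodes" into a number that can block a deploy.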

Pillar 3 — Compression: prompt trimming and batching

With caching and routing in place, the third pillar trims the tokens that remain. Compression is the highest-risk pillar — cut information the model actually needs and answers degrade — so it comes last and is measured most carefully. Applied to genuine redundancy, though, it is nearly free quality-wise and meaningfully reduces both input and output spend.

The techniques fall into a few families. Most are about removing tokens you are paying for but not benefiting from; batching is about amortising fixed per-call overhead across many items. The table below summarises the main ones and the typical reduction each delivers on the tokens it touches.

Technique	What it removes	Typical token reduction
Boilerplate trimming	Repeated verbose instructions; restating context the model retains	10–30% of input
Few-shot pruning	Surplus examples beyond the one or two that actually steer behaviour	20–50% of example tokens
Retrieval re-ranking / filtering	Low-relevance retrieved chunks padded into RAG context	30–60% of context
Output capping & format control	Rambling responses; enforce concise or structured output	20–50% of output
Request batching	Per-call fixed overhead across many small jobs	Amortises overhead; raises throughput per call

Two of these deserve a note. Retrieval filtering is the highest-yield compression in most RAG systems, because the lazy default — stuff the top-k chunks into the prompt and hope — pays for a great deal of context that does not bear on the question. A re-ranking pass that keeps only the chunks above a relevance threshold often halves the context tokens with no quality loss, and frequently improves the answer by reducing distraction. Batching is the one technique here that is about throughput rather than per-token price: grouping many small extraction or classification jobs into fewer calls amortises the fixed overhead and, where a provider offers a batch tier, can unlock a further discount. The layered view of how these stack with caching is set out in our explainer on cutting LLM API costs 70–90% with layered caching.
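Both techniques are short to express in code. The sketch below assumes the relevance scores come from a re-ranker you already run; the threshold, the chunk cap and the function names `filter_chunks` and `batched` are illustrative choices, not fixed recommendations.

```python
from typing import Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def filter_chunks(scored_chunks: Sequence[Tuple[str, float]],
                  min_score: float = 0.5,
                  max_chunks: int = 5) -> List[str]:
    """Keep only chunks at or above the relevance threshold, best first."""
    kept = [(text, score) for text, score in scored_chunks
            if score >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in kept[:max_chunks]]


def batched(items: Sequence[T], size: int) -> Iterator[List[T]]:
    """Group small jobs so one call carries several of them."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


chunks = [("refund window is 30 days", 0.91),
          ("company history", 0.12),
          ("refunds go to the original payment method", 0.74)]
print(filter_chunks(chunks))
print(list(batched(["job1", "job2", "job3", "job4", "job5"], size=2)))
```

Filtering changes what the model sees, so it belongs under the same held-out evaluation as routing; batching changes only how the work is packaged.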

Putting it together: a worked cost model

The pillars are most convincing when you watch them stack on a single workload. Consider an illustrative service handling 1,000,000 requests per month — a UK SaaS support assistant, say, or an Indian fintech document-classification pipeline. Before any optimisation, every request goes to a frontier model, nothing is cached, and prompts carry full boilerplate and unfiltered context. We will index the baseline monthly cost at 100 units and apply each pillar in turn.

Stage	What changes	Effect on remaining bill	Relative monthly cost (illustrative)
Baseline	All traffic to frontier; no cache; full prompts	—	100
+ Routing	60% budget / 30% mid / 10% frontier	~60% lower average cost per query	40
+ Caching	Prompt-level prefix cache + semantic cache on recurring queries	~50% off the routed bill	20
+ Compression	Boilerplate trimmed, retrieval filtered, output capped, batching	~25% off the remaining bill	15

Pricing as of 2026-06. The 100-unit index and every percentage here are illustrative and chosen to show how the pillars compound, not to predict your exact bill. Your mix of traffic difficulty, repetition and prompt bloat determines where in the 70–90% band you land.

Read down the right-hand column and the compounding is plain. Routing alone takes the bill from 100 to 40 — a 60% cut — because the expensive frontier model now sees only the hard 10% of traffic instead of all of it. Caching then removes roughly half of that reduced bill, not half of the original, taking 40 to 20: the agent loops and recurring questions stop paying full price for repeated context. Compression shaves another quarter off what is left, 20 down to 15. The end state is a bill that is 85% smaller than where it started — squarely in the 70–90% range that teams who implement all three layers consistently report in production.

The order is deliberate. Routing first means caching and compression operate on a workload that is already sending the right traffic to the right model, so their savings are clean rather than masked by frontier overspend. Caching second removes the repeated work before you spend effort trimming prompts that a semantic cache hit would never send to the model anyway. Compression last, on the smallest possible surface, keeps its quality risk contained. For the wider economic context of why these numbers matter so much in 2026, our FinOps playbook on AI inference cost economics sets out the spend trajectory that makes this engineering non-optional.

Pitfalls: poisoning, over-routing and eval drift

Each pillar has a characteristic failure mode, and all three are the kind that look fine on the dashboard until they do not.

Cache poisoning and staleness. A semantic cache that returns a stored answer for a query that is similar but not equivalent will serve a wrong answer cheaply and confidently. The classic case is a query that shares wording but differs in a decisive detail — a different account, a different date, a negation — yet lands above your similarity threshold. The defences are a conservative threshold, cache keys that include the variables that change the answer, and a time-to-live so stale answers expire. For prompt-level caching, the risk is subtler: a cached prefix that includes content that should have varied. Treat the cache as part of your correctness surface, not just your cost surface.
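Two of those defences, answer-changing variables in the key and a time-to-live, can be sketched as below. This is an exact-match illustration with hypothetical names (`cache_key`, `TTLCache`); in a semantic cache the same variables would more likely act as a filter applied before the similarity search.

```python
import hashlib
import time
from typing import Callable, Dict, Optional, Tuple


def cache_key(query: str, **answer_variables: str) -> str:
    """Key on the normalised query plus every variable that changes the answer."""
    parts = [" ".join(query.lower().split())]
    parts += [f"{name}={value}"
              for name, value in sorted(answer_variables.items())]
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


class TTLCache:
    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        """Return the stored answer, or None if missing or expired."""
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, answer = entry
        if self.clock() - stored_at > self.ttl:
            del self._store[key]      # stale: drop it and treat as a miss
            return None
        return answer

    def put(self, key: str, answer: str) -> None:
        self._store[key] = (self.clock(), answer)
```

With the account id in the key, "what is my balance" for one account can never be served from another account's cached answer, no matter how similar the wording.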

Over-routing. The temptation, once routing works, is to push ever more traffic to the budget tier to chase the bill down. Past a point the budget model starts failing silently on queries that were marginally too hard, and because it returns a plausible answer rather than an error, nobody notices until users complain. Guard against it by widening the budget tier only as fast as your held-out evaluation proves quality holds, and by keeping an escalation path that retries low-confidence answers on a stronger model. A 5% quality regression can cost far more than the API saving that caused it.

Eval drift. All of the above depends on an evaluation set that reflects real traffic — and real traffic moves. The query mix that justified routing 60% to the budget tier in one quarter may shift as your product changes, your users change, or the world changes. An eval set that is never refreshed slowly stops measuring reality, and your cost decisions drift out from under it. Re-sample production traffic into your eval set on a schedule, re-run the routing and compression evaluations against it, and treat a falling cheap-tier pass rate as a prompt to re-tune the thresholds. The cost playbook is not a one-time migration; it is a control loop you keep running.
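One way to keep the eval set fresh is to draw a uniform sample from the production query stream on each refresh. The sketch below uses reservoir sampling for that, plus a simple check on the cheap-tier pass rate; the function names and the 0.95 floor are illustrative assumptions.

```python
import random
from typing import Iterable, List, Optional, Sequence


def reservoir_sample(stream: Iterable[str], k: int,
                     seed: Optional[int] = None) -> List[str]:
    """Uniformly sample up to k queries from a stream of unknown length."""
    rng = random.Random(seed)
    sample: List[str] = []
    for index, item in enumerate(stream):
        if index < k:
            sample.append(item)
        else:
            # Replace an existing entry with probability k / (index + 1).
            slot = rng.randint(0, index)
            if slot < k:
                sample[slot] = item
    return sample


def needs_retune(pass_rates: Sequence[float], floor: float = 0.95) -> bool:
    """Flag when the latest cheap-tier pass rate has fallen below the floor."""
    return bool(pass_rates) and pass_rates[-1] < floor
```

Sampling from the live stream keeps the eval set's difficulty mix tracking production, which is what makes a falling pass rate meaningful rather than an artefact of an outdated set.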

Conclusion and next steps

The headline is achievable and the path is not mysterious: cache what repeats, route each query to the cheapest capable model, compress what you send, and measure all three against a held-out evaluation. Ship them in the order the worked cost model uses (routing first for the biggest single win, caching next for the safe compounding saving, compression last and carefully) and a workload that started at full frontier price lands 70–90% cheaper without a meaningful quality cost. Start with one pillar this week; instrument it; then add the next.

If you are the engineer who took a feature from a frightening monthly bill to a fraction of it, that is exactly the kind of shipped, measurable work that the people hiring and collaborating in AI want to see. A Verified Builder profile on AI Tech Connect is where you put it — a single page that shows what you built and what it saved, in front of an audience that understands why it matters. For more on framing shipped work for a portfolio, see the AI engineer portfolio that gets you hired.
