What you need to know

  • Three levers do most of the work. Prompt caching, model routing with cascades, and batch processing each attack a different part of the bill. Stacked sensibly they reach a 50 to 70% reduction for typical workloads.
  • Caching is about reuse, not magic. A cache read costs roughly a tenth of the base input rate, but the first write costs a premium. It only pays off when the same prefix is reused several times within the time-to-live.
  • Routing means cheap-first, escalate-on-failure. Send each request to the cheapest tier that can plausibly handle it, validate the output deterministically, and escalate only when the validator fails.
  • Batching halves the rest. For anything that does not need an answer this second, asynchronous batch processing is roughly 50% cheaper, and it combines with caching.
  • The method outlives the prices. As of June 2026 the numbers below are accurate, but they will drift. Optimise the method, not the rate card.

Pro tip

Before you optimise anything, instrument it. You cannot cut a bill you cannot see. Log input tokens, output tokens, cache-read tokens and the model used on every call, tagged by feature. Most teams discover that 80% of their spend comes from two or three endpoints — and that is where caching and routing earn their keep.

Where LLM spend actually goes

Before reaching for tactics, it helps to understand the shape of a typical bill. Almost every production LLM workload spends money in three places, and they are not equal.

  • Repeated input tokens. The same system prompt, the same few-shot examples, the same retrieved documents, sent again and again on every request. This is usually the single largest and most wasteful line item — you are paying full price to re-process bytes that have not changed.
  • Over-powered model choice. Routing every request to a frontier model regardless of difficulty. A sentiment classification and a multi-file refactor do not need the same horsepower, yet many systems pay frontier rates for both.
  • Output tokens. Output is typically priced at three to five times the input rate, so verbose responses cost disproportionately. Tightening prompts to ask for concise output is the cheapest optimisation of all.

The good news is that the first two are structural, and structure is fixable. Caching attacks repeated input. Routing attacks model choice. Batching then takes a cut off whatever remains. None of these require a model change or a rewrite of your product — they are infrastructure decisions you bolt on around the call site.

To make the savings concrete we will anchor on published rates. The figures here use the Claude API as a reference point because its tiering is representative of the wider market; the same arithmetic applies to any provider with comparable input, output and cache pricing.

| Model tier (as of June 2026) | Input $/MTok | Output $/MTok | Cached read $/MTok |
|---|---|---|---|
| Opus tier: Claude Opus 4.8 (current flagship) | ~$5.00 | ~$25.00 | ~$0.50 |
| Sonnet tier: Claude Sonnet 4.6 | ~$3.00 | ~$15.00 | ~$0.30 |
| Haiku tier: Claude Haiku 4.5 | ~$1.00 | ~$5.00 | ~$0.10 |

Read across that table and the levers reveal themselves. The cached read column is roughly a tenth of the input column — that is caching. The gap between the Opus and Haiku rows is five-fold on input and five-fold on output — that is routing. We confirmed these figures against the current Anthropic model reference; the top Opus-tier model is Claude Opus 4.8 at approximately $5 input and $25 output per million tokens. Treat every number as a snapshot. Prices have fallen steadily, and the method is designed to keep working as they do.

Lever 1 — Prompt caching

Prompt caching is the highest-leverage change for most teams because it requires no quality trade-off whatsoever. You get the identical model output; you simply stop paying full price to re-process the unchanging part of your prompt.

How the economics work

The mechanism is a prefix match. The first time you send a given prefix, the provider writes it to a cache, charging a premium of roughly 1.25 times the base input rate for that write (under the default five-minute time-to-live). Every subsequent request that reuses the byte-identical prefix reads from the cache at roughly 0.1 times the base input rate — a saving of up to about 90% on those tokens. The default cache lifetime is five minutes, extendable to one hour for bursty traffic with longer gaps.

The break-even is worth internalising. With the five-minute TTL, you pay 1.25x once and 0.1x thereafter, so two requests cost 1.35x and already beat the 2x you would otherwise pay for two uncached reads. With the one-hour TTL the write costs about 2x, so two requests cost 2.1x and still lose; the third request (2.2x against 3x uncached) is where you come out ahead. The takeaway: caching rewards reuse. A prompt sent once should not be cached; a prompt sent fifty times within a session absolutely should.
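The break-even arithmetic can be checked with a few lines of Python. This is a minimal sketch that assumes every request after the first lands inside the TTL and that the multipliers are the approximate ones quoted above; `cached_cost` and `break_even_requests` are illustrative names, not part of any SDK.

```python
def cached_cost(n_requests: int, write_multiplier: float, read_multiplier: float = 0.1) -> float:
    """Prefix cost in units of one uncached read: one cache write, then cache reads."""
    if n_requests <= 0:
        return 0.0
    return write_multiplier + read_multiplier * (n_requests - 1)


def break_even_requests(write_multiplier: float, read_multiplier: float = 0.1) -> int:
    """Smallest request count at which caching is strictly cheaper than not caching."""
    if read_multiplier >= 1.0:
        raise ValueError("a cache read must be cheaper than an uncached read")
    n = 1
    # Uncached, n requests cost n units; keep going until the cached total drops below that.
    while cached_cost(n, write_multiplier, read_multiplier) >= n:
        n += 1
    return n


print(break_even_requests(1.25))  # five-minute TTL write premium
print(break_even_requests(2.0))   # one-hour TTL write premium
```

Under these assumptions the five-minute TTL pays off on the second request and the one-hour TTL on the third.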

What to cache

Cache the stable prefix and only the stable prefix:

  • System prompts. Usually large, almost always constant. The single best caching candidate.
  • Few-shot examples. If you prepend a handful of worked examples to steer the model, they belong in the cached region.
  • Tool and function definitions. These render at the front of the prompt and rarely change between requests.
  • Large retrieved context (RAG). When several questions are asked against the same retrieved document set within a session, cache the documents once and vary only the question after the cache boundary.

Watch out

Caching is a prefix match, so any byte change anywhere in the prefix invalidates everything after it. The classic silent killers are a datetime.now() stamped into the system prompt, a request UUID near the top of the content, or a JSON blob serialised without sorted keys. Put volatile content — timestamps, per-request IDs, the user's actual question — strictly after the last cache breakpoint, and verify cache hits via the usage field rather than assuming.
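One way to catch these invalidators before they reach production is to serialise the prefix deterministically and log a fingerprint of it. The sketch below is a local sanity check only: `stable_json` and `prefix_fingerprint` are hypothetical helpers, and the hash is not the provider's actual cache key, just a cheap signal that the bytes you send have changed.

```python
import hashlib
import json


def stable_json(obj) -> str:
    """Serialise with sorted keys and fixed separators so equal data gives equal bytes."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def prefix_fingerprint(system_prompt: str, tools: list) -> str:
    """SHA-256 of the intended cacheable prefix; log it and alert when it changes."""
    payload = system_prompt + "\n" + stable_json(tools)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Same tool definition, different key order: the fingerprint should not move.
tools_a = [{"name": "lookup", "description": "Find a record", "input_schema": {"type": "object"}}]
tools_b = [{"input_schema": {"type": "object"}, "description": "Find a record", "name": "lookup"}]

stable = prefix_fingerprint("You are a support assistant.", tools_a)
reordered = prefix_fingerprint("You are a support assistant.", tools_b)
stamped = prefix_fingerprint("You are a support assistant. Now: 2026-06-01T09:00:00", tools_a)

print(stable == reordered)  # key order no longer matters
print(stable == stamped)    # a timestamp in the prompt changes the prefix
```

If the fingerprint differs between two requests that should share a prefix, something volatile has leaked in ahead of the cache breakpoint.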

A code snippet

The pattern is to mark the last block of the stable prefix as cacheable and append the volatile question afterwards. In the Claude SDK that looks like this:

import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = "..."          # large, stable — the prefix worth caching
RETRIEVED_DOCS = "..."         # constant across this session's questions

def ask(question: str):
    return client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=[
            {"type": "text", "text": SYSTEM_PROMPT},
            {
                "type": "text",
                "text": RETRIEVED_DOCS,
                "cache_control": {"type": "ephemeral"},  # cache up to here
            },
        ],
        # The volatile part comes AFTER the cache boundary, so it never
        # invalidates the cached prefix.
        messages=[{"role": "user", "content": question}],
    )

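# First call writes the cache (~1.25x input on the prefix).
first = ask("Summarise the indemnity clause.")
print(first.usage.cache_creation_input_tokens)  # > 0 means the prefix was written

# Every later call within the TTL reads it (~0.1x input on the prefix).
second = ask("List the termination conditions.")
print(second.usage.cache_read_input_tokens)     # > 0 confirms a hit
# Both stay at 0 if the prefix is shorter than the model's minimum cacheable length.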

The verification line matters. If cache_read_input_tokens is zero across repeated identical-prefix calls, a silent invalidator is at work and you are paying full price while believing you are not. Check the number, do not trust the intention.

If you are building retrieval pipelines where the same corpus is queried repeatedly, the same caching discipline applies to your context window — see our guide on hybrid retrieval for production RAG for how to keep that retrieved context stable enough to cache.

Lever 2 — Model routing and cascades

Caching makes each call cheaper. Routing makes sure you are not making an expensive call in the first place. The core insight is that workloads are not uniform: most requests are easy, a minority are hard, and paying frontier rates for the easy majority is pure waste.

Bucket requests into complexity tiers

Start by classifying incoming requests into three or four tiers by difficulty, then map each tier to the cheapest model that handles it reliably:

  • Trivial — classification, extraction, short rewrites, yes/no judgements. A small open-weight model or the Haiku tier handles these.
  • Standard — summarisation, routine drafting, structured generation. The Sonnet tier is the sweet spot.
  • Hard — multi-step reasoning, long-context analysis, agentic work. This is where the Opus tier earns its rate.
  • Escalation — the small fraction that fails everything below and needs the frontier model with maximum effort.

Published routing data shows just how skewed real traffic is. In one widely cited production distribution, around 41% of requests were handled by a 7B open-weight model, about 36% by a mid-tier model, roughly 18% by a GPT-4-class model, and only about 5% escalated to the most capable tier. If you were paying frontier rates for all of that, more than three-quarters of your requests (the 77% served by the two cheapest tiers) were being billed at a rate they did not need.
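A first-pass router can be as simple as a lookup from task type to tier. The sketch below assumes the caller already knows the task type. The task labels, the `pick_model` helper and the 50,000-token long-context threshold are hypothetical; the model identifiers are the ones used elsewhere in this article. The escalation tier is left to the cascade described next.

```python
# Hypothetical task labels mapped to the tiers described above.
TASK_TIERS = {
    "classification": "trivial",
    "extraction": "trivial",
    "short_rewrite": "trivial",
    "summarisation": "standard",
    "drafting": "standard",
    "multi_step_reasoning": "hard",
    "agentic": "hard",
}

TIER_MODELS = {
    "trivial": "claude-haiku-4-5",
    "standard": "claude-sonnet-4-6",
    "hard": "claude-opus-4-8",
}


def pick_model(task_type: str, input_tokens: int, long_context_threshold: int = 50_000) -> str:
    """Return the cheapest model assumed to handle this request reliably."""
    if input_tokens >= long_context_threshold:
        # Long-context analysis is listed under the hard tier regardless of task type.
        return TIER_MODELS["hard"]
    tier = TASK_TIERS.get(task_type, "standard")  # unknown tasks default to the middle tier
    return TIER_MODELS[tier]


print(pick_model("classification", input_tokens=300))
print(pick_model("summarisation", input_tokens=4_000))
print(pick_model("classification", input_tokens=80_000))
```

Defaulting unknown task types to the middle tier is a design choice: it avoids both silent under-powering and paying the top rate for everything unlabelled.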

The cheap-first, escalate-on-failure pattern

A cascade is the disciplined version of routing. Rather than guessing the right tier up front, you try the cheapest one, then check the result with a fast, deterministic validator. The validator is the whole trick: it must be cheaper and faster than the model call it guards, or the cascade adds cost. Good validators include a JSON-schema check, a regex for a required format, a length or sanity bound, or a tiny classifier model that scores confidence.

Recommended

Make the validator deterministic wherever you can. A schema check or regex costs microseconds and never has an opinion; a tiny model judge costs a fraction of a frontier call. Reserve LLM-as-judge validation for genuinely subjective quality gates, and never let the validator cost approach the cost of the model it is gating.
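Deterministic validators compose well when each one is a plain function from text to bool. The following is a sketch of that idea using only the standard library; `matches`, `length_between`, `all_of` and the sentiment-label gate are illustrative names chosen for this example.

```python
import re
from typing import Callable

Validator = Callable[[str], bool]


def matches(pattern: str) -> Validator:
    """Pass only if the whole (stripped) output matches the regex."""
    compiled = re.compile(pattern)
    return lambda text: compiled.fullmatch(text.strip()) is not None


def length_between(lo: int, hi: int) -> Validator:
    """Pass only if the stripped output length is within [lo, hi] characters."""
    return lambda text: lo <= len(text.strip()) <= hi


def all_of(*validators: Validator) -> Validator:
    """Pass only if every validator passes; stops at the first failure."""
    return lambda text: all(v(text) for v in validators)


# Gate for a sentiment classifier that must answer with exactly one label.
is_sentiment_label = all_of(
    length_between(1, 16),
    matches(r"(positive|negative|neutral)"),
)

print(is_sentiment_label("positive\n"))
print(is_sentiment_label("I think it's positive"))
```

Each check runs in microseconds, so the gate stays far cheaper than the call it guards.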

Industry results for routing and cascades cluster around a 45 to 85% cost reduction while retaining roughly 95% of quality. The variance depends almost entirely on how skewed your traffic is and how good your validator is. Here is the pattern in code:

import json
import anthropic

client = anthropic.Anthropic()

# Cheapest first, most capable last.
CASCADE = ["claude-haiku-4-5", "claude-sonnet-4-6", "claude-opus-4-8"]

def validate(text: str) -> bool:
    """Fast, deterministic gate. Cheaper than any model call."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False
    return "answer" in data and isinstance(data["answer"], str)

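def answer(prompt: str) -> str:
    """Try each tier in turn. `prompt` must ask for a JSON object with an "answer" string."""
    text = ""
    for model in CASCADE:
        resp = client.messages.create(
            model=model,
            max_tokens=512,
            messages=[{"role": "user", "content": prompt}],
        )
        text = "".join(b.text for b in resp.content if b.type == "text")
        if validate(text):
            return text          # passed the gate: ship the cheapest result that works
        # failed the gate: escalate to the next tier up
    # Every tier failed: this is the top tier's UNVALIDATED output.
    # Log it, and consider raising instead if a bad answer is worse than none.
    return text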

This loop spends Haiku-tier money on the easy majority and only reaches for Opus on the genuinely hard tail. The deterministic validate step is what makes it safe: nothing from a cheaper tier ships unless it passes a check you control. Only when every tier fails the gate does the loop fall back to the top tier's unvalidated output, so log that case and decide whether to return it or reject it. For teams running agents rather than single calls, the same tiering logic governs which model each step of the agent loop uses; our notes on orchestrating multi-agent subagents in production cover how to keep the expensive model on the coordinating step and push cheap models to the leaf work.

Lever 3 — Batch processing

The third lever is the simplest and the most overlooked. Any workload that does not need its answer immediately — overnight enrichment, bulk classification, document back-processing, evaluation runs, content generation pipelines — can go through an asynchronous batch endpoint at roughly 50% of the synchronous price.

You submit a set of requests, the provider processes them within a window (typically under an hour, capped at 24 hours), and you collect the results. All the usual features still apply, including prompt caching. That is the important part: batching and caching compose. Layer a 50% batch discount on top of a shared cached prefix and the savings multiply rather than add.

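The published ceiling for the two combined is striking: batch plus caching can cut cost by up to about 95% versus standard synchronous calls for the right workload, one with a large shared prefix and no latency requirement. The figure follows from the two discounts multiplying on cached input tokens: 0.5 × 0.1 = 0.05 of the base input rate. Uncached input and output tokens only get the 50% batch discount, so the blended saving is lower. Few production features hit that ceiling, but bulk back-office jobs frequently do.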
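How close a job gets to that ceiling depends on how much of its input is served from cache. A small sketch of the input-side arithmetic, assuming the approximate 0.1x cache-read and 0.5x batch multipliers quoted in this article and ignoring the one-off cache write; `effective_input_multiplier` is an illustrative name.

```python
def effective_input_multiplier(
    cached_fraction: float,
    batch: bool,
    cache_read: float = 0.1,
    batch_discount: float = 0.5,
) -> float:
    """Blended input price as a fraction of the base synchronous input rate."""
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    # Cached tokens bill at the read rate, the rest at the full input rate.
    blended = cached_fraction * cache_read + (1.0 - cached_fraction)
    # The batch discount applies on top of whatever the blended rate is.
    return blended * (batch_discount if batch else 1.0)


for fraction in (1.0, 0.9, 0.5, 0.0):
    saving = 1.0 - effective_input_multiplier(fraction, batch=True)
    print(f"{fraction:.0%} of input cached -> {saving:.1%} saved on input")
```

Output tokens are never cached, so they only ever receive the batch discount; a job with long answers will sit well below the input-side figure.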

Pro tip

Split your traffic by latency tolerance, not by feature. The interactive chat box needs a synchronous call; the nightly job that re-tags six months of support tickets does not. Routing the latter through a batch endpoint with a cached system prompt is often a single afternoon's work for a permanent halving of that line item.
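For the ticket-tagging case, most of the work is building one request per ticket with a shared, cacheable system block. The sketch below only constructs the payload so it runs offline. The `custom_id` plus `params` shape is assumed to match the provider's batch request format, and `build_batch_requests` and the example prompt are hypothetical; check the current batch documentation before relying on either. A prompt this short would also fall below any minimum cacheable length, so treat it as a stand-in for a real, much longer one.

```python
SYSTEM_PROMPT = "You tag support tickets with one label: billing, bug, or other."  # illustrative


def build_batch_requests(tickets: dict[str, str], model: str = "claude-haiku-4-5") -> list[dict]:
    """One batch entry per ticket; every entry shares the same cacheable system block."""
    shared_system = [
        {"type": "text", "text": SYSTEM_PROMPT, "cache_control": {"type": "ephemeral"}}
    ]
    return [
        {
            "custom_id": ticket_id,  # used to match results back to tickets
            "params": {
                "model": model,
                "max_tokens": 16,
                "system": shared_system,
                "messages": [{"role": "user", "content": text}],
            },
        }
        for ticket_id, text in tickets.items()
    ]


requests = build_batch_requests(
    {"t-1": "I was charged twice this month.", "t-2": "The app crashes on login."}
)
print(len(requests), [r["custom_id"] for r in requests])
# Submission is omitted here. With the Anthropic Python SDK it would be along the
# lines of client.messages.batches.create(requests=requests) (assumed, not run).
```

Keeping the volatile ticket text in `messages` and the stable instructions in `system` follows the same prefix discipline as the synchronous caching example.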

Putting it together — a worked example

Consider a realistic mixed workload: a customer-support assistant handling 100,000 requests a day. Each request carries a 2,000-token stable system prompt plus retrieved context, a 200-token user question, and produces a 300-token answer. Naively, every request goes to the Opus tier synchronously. Let us cost the before and after.

Before: 100,000 requests × 2,200 input tokens at $5/MTok, plus 300 output tokens at $25/MTok. That is roughly $1,100/day input + $750/day output = $1,850/day, or about $55,000 a month.

Now we apply all three levers. Cache the 2,000-token prefix so it bills at the cached rate after the first write. Route by complexity: assume 60% of requests are trivial enough for the Haiku tier, 35% need the Sonnet tier, and 5% escalate to Opus. Push the 20% of traffic that is non-interactive (overnight summaries, ticket tagging) through the batch endpoint at half price.

| Stage | What changes | Approx. daily cost | Cumulative saving |
|---|---|---|---|
| Baseline | All Opus, synchronous, no caching | ~$1,850 | - |
| + Prompt caching | 2,000-token prefix billed at ~0.1x after the first write | ~$1,150 | ~38% |
| + Model routing | 60% Haiku, 35% Sonnet, 5% Opus by complexity | ~$780 | ~58% |
| + Batch (20% of traffic) | Non-interactive jobs at half price | ~$720 | ~61% |

That lands at roughly a 60% reduction — from about $55,000 a month to around $22,000 — with no change to the answers users see on the interactive path, because the cheap models only handle requests that pass a deterministic quality gate. The exact split depends on your traffic shape, but the order of operations holds: cache first (free quality), route second (small, controlled quality trade-off), batch third (latency trade-off only).
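To rerun this arithmetic against your own traffic, the stages can be scripted. The sketch below is a best-case floor rather than a forecast. It assumes every request hits the cache, ignores the write premium, assumes each request is answered by exactly one tier with no cascade retries, and assumes the batched 20% has the same tier mix as the rest. Under those idealised assumptions the later stages come out cheaper than the table's figures. One reading is that the table leaves headroom for cache misses, write premiums and escalations, but that is an assumption: the stated inputs do not include a hit rate or a retry rate.

```python
# $ per million tokens, taken from the rate table earlier in the article.
PRICES = {
    "opus": {"input": 5.00, "output": 25.00, "cached": 0.50},
    "sonnet": {"input": 3.00, "output": 15.00, "cached": 0.30},
    "haiku": {"input": 1.00, "output": 5.00, "cached": 0.10},
}
REQUESTS_PER_DAY = 100_000
PREFIX_TOKENS, QUESTION_TOKENS, ANSWER_TOKENS = 2_000, 200, 300


def per_request_cost(tier: str, cached: bool) -> float:
    """Dollar cost of one request on `tier`, with the prefix cached or not."""
    p = PRICES[tier]
    prefix_rate = p["cached"] if cached else p["input"]
    tokens_times_rate = (
        PREFIX_TOKENS * prefix_rate
        + QUESTION_TOKENS * p["input"]
        + ANSWER_TOKENS * p["output"]
    )
    return tokens_times_rate / 1_000_000


baseline = REQUESTS_PER_DAY * per_request_cost("opus", cached=False)
with_cache = REQUESTS_PER_DAY * per_request_cost("opus", cached=True)

mix = {"haiku": 0.60, "sonnet": 0.35, "opus": 0.05}
with_routing = REQUESTS_PER_DAY * sum(
    share * per_request_cost(tier, cached=True) for tier, share in mix.items()
)

# 20% of traffic goes through batch at half price.
with_batch = with_routing * (1 - 0.20 * 0.50)

for label, cost in [
    ("baseline", baseline),
    ("+ caching", with_cache),
    ("+ routing", with_routing),
    ("+ batch", with_batch),
]:
    print(f"{label:<10} ${cost:,.0f}/day  saving {1 - cost / baseline:.0%}")
```

Swap in your own token counts, tier mix and measured cache-hit rate to see where your workload sits between this floor and the table's more conservative figures.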

The dual-market angle matters for the absolute numbers. If your inference runs in an AWS Mumbai (ap-south-1) region serving Indian users, or a London (eu-west-2) region serving UK users, the API token prices above are the same, because they are set by the model provider, not the cloud region. What differs is the surrounding infrastructure cost and how you frame the saving internally. A 60% cut on a $55,000 monthly bill saves roughly $33,000 a month and leaves a bill of about $22,000, which is roughly £17,000 for a UK finance team or about ₹19 lakh for an Indian one. A saving of that size funds another engineer in either market.

Measuring and monitoring cost

Optimisation without measurement is guesswork, and guesswork regresses. Three habits keep the savings in place:

  • Tag every call. Record model, input tokens, output tokens, cache-read tokens and a feature label on each request. A simple per-feature daily rollup tells you instantly where money is going and whether a cache stopped hitting.
  • Track cache-hit rate as a first-class metric. A falling hit rate is the earliest sign that someone added a timestamp to a system prompt or reordered a tool list. Alert on it.
  • Watch the escalation rate in your cascade. If the share of requests escalating to the top tier creeps up, either traffic has genuinely got harder or a cheaper tier has regressed — both are worth knowing before the bill arrives.
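The three habits above reduce to a small rollup over per-call records. This is a minimal in-memory sketch: `CallRecord`, `rollup` and `escalation_rate` are hypothetical names, and the hit rate assumes `input_tokens` counts only uncached input, with cache reads reported separately. Check how your provider's usage fields are defined before trusting the ratio.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CallRecord:
    feature: str
    model: str
    input_tokens: int       # uncached input tokens (assumption, see above)
    output_tokens: int
    cache_read_tokens: int


def rollup(records: list[CallRecord]) -> dict[str, dict[str, float]]:
    """Per-feature totals plus the share of input tokens served from cache."""
    totals: dict[str, dict[str, float]] = defaultdict(
        lambda: {"calls": 0, "input": 0, "output": 0, "cache_read": 0}
    )
    for r in records:
        t = totals[r.feature]
        t["calls"] += 1
        t["input"] += r.input_tokens
        t["output"] += r.output_tokens
        t["cache_read"] += r.cache_read_tokens
    report = {}
    for feature, t in totals.items():
        all_input = t["input"] + t["cache_read"]
        hit_rate = t["cache_read"] / all_input if all_input else 0.0
        report[feature] = {**t, "cache_hit_rate": hit_rate}
    return report


def escalation_rate(records: list[CallRecord], top_model: str) -> float:
    """Share of calls that ended up on the most expensive model."""
    if not records:
        return 0.0
    return sum(r.model == top_model for r in records) / len(records)


records = [
    CallRecord("support_chat", "claude-haiku-4-5", 200, 300, 2_000),
    CallRecord("support_chat", "claude-opus-4-8", 2_200, 300, 0),
    CallRecord("ticket_tagging", "claude-haiku-4-5", 150, 10, 1_800),
]
for feature, stats in rollup(records).items():
    print(feature, f"calls={stats['calls']}", f"hit_rate={stats['cache_hit_rate']:.0%}")
print(f"escalation_rate={escalation_rate(records, 'claude-opus-4-8'):.0%}")
```

In production the same aggregation would usually live in a metrics store with alerts on the two rates, but the fields to capture are the same.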

For deeper cost economics across the inference stack — GPU rates, region effects and unit economics — our news analysis on AI inference cost economics in 2026 sets the wider context these tactics operate within.

Common pitfalls

  1. Caching content that never repeats. If a prefix is sent once, the 1.25x write premium is pure loss. Only cache what is genuinely reused within the TTL.
  2. A silent cache invalidator. A dynamic timestamp, a UUID or non-deterministic JSON serialisation in the prefix quietly defeats caching. Verify hits with the usage field; do not assume.
  3. An expensive validator. A cascade only saves money if the validator is much cheaper than the call it gates. An LLM-as-judge that costs nearly as much as a frontier call defeats the purpose.
  4. Routing without a quality gate. Sending hard requests to a weak model and shipping the result blind trades cost for silent failures. The deterministic check is non-negotiable.
  5. Batching latency-sensitive traffic. The discount is real, but a user staring at a spinner does not care. Split by latency tolerance, never force it.
  6. Optimising the rate card instead of the method. Prices fall; chasing the cheapest provider this quarter is a treadmill. Build caching, routing and batching once and they keep paying out as prices move.

Watch out

Do not over-engineer before you measure. A team can spend a week building an elaborate router for an endpoint that turns out to be 2% of the bill. Instrument first, find the two or three endpoints that dominate spend, and apply the levers there. Everything else can wait.

Next steps

Start in order. Turn on prompt caching for your largest, most stable prompts this week — it is free quality and the fastest win. Next, classify your traffic by complexity and build a two-tier cascade with a deterministic validator; expand to three tiers once it is stable. Finally, find the non-interactive jobs and move them to a batch endpoint. Measure cache-hit rate and escalation rate throughout, and re-check the provider rate card before quoting any figure — the method is durable, the numbers are not.