What this guide gets you

If you ship anything agentic, your LLM API bill has probably tripled in twelve months — not because prices rose, but because agents make far more calls than chatbots ever did. The good news: most of that spend is recoverable. Five levers, applied in the right order, routinely cut a production LLM bill by 70-85% without touching what the model actually produces.

  • Caching is the biggest single win — cache hits can save up to roughly 90%, and Anthropic cached reads cost about 10% of the base input price.
  • Batching is free money for any workload where the user is not waiting — a flat ~50% discount on Anthropic and OpenAI batch APIs.
  • Model routing sends easy prompts to cheap models and reserves premium models for hard tasks, typically saving 40-70%.
  • Context compaction and prompt trimming shrink every request before it is sent — 50-70% fewer tokens with no change in answer quality.

We will work through each lever with the numbers, then give you an implementation order and a checklist for your first week. Examples are deliberately drawn from both an Indian SaaS startup and a UK scale-up, because the maths is identical whether you bill in rupees or pounds.

Lever 1 — Prompt caching

Most agent requests repeat a large, stable block of text: a system prompt, a tool catalogue, a few-shot example set, retrieved documents. Re-sending and re-processing that block on every call is the most expensive habit in production AI. Prompt caching fixes it. On Anthropic, a cached read costs roughly 10% of the base input price — a 10x reduction on the cached portion of the prompt. Cache hits can save up to about 90% of input cost on a repetitive workload.

There are three approaches, and they are not mutually exclusive:

Exact prompt caching

The provider stores the processed prefix of your prompt and reuses it when the next request starts with byte-identical text. You mark a cache breakpoint after your stable block; everything before it is served from cache. This is the lowest-effort win available — frequently a single flag on the request.

Semantic caching

Instead of requiring an exact match, a semantic cache embeds the incoming prompt and serves a stored answer when a previous prompt is close enough. This catches the long tail of "same question, different wording" — invaluable for support bots and FAQ-style agents. It needs an embedding store and a similarity threshold, so it is more engineering than exact caching.

KV-cache optimisation

If you self-host on an inference engine, the key-value cache holds attention state for tokens already processed. Structuring prompts so the stable prefix never changes lets the engine reuse that state across requests. This is the self-hosted equivalent of exact prompt caching and pairs naturally with a managed serving stack.

Pro tip

Caching earns its keep when your system prompt is over roughly 1,024 tokens and you make thousands of requests a day. Put the stable content — instructions, tool schemas, retrieved context — at the very top of the prompt and the volatile content (the user's actual question) at the bottom. One Indian SaaS team that did exactly this cut their LLM costs by around 59-70% versus paying full input rates, with no change to outputs.

Lever 2 — Batching

A surprising share of LLM spend goes to work nobody is waiting for: overnight summarisation, classification back-fills, evaluation runs, document tagging, embedding refreshes. For all of it, real-time latency is wasted money. Both Anthropic and OpenAI offer batch APIs that process requests asynchronously at roughly a 50% discount, returning results within 24 hours instead of in real time.

The mental model is simple: if a human is staring at a loading spinner, use the standard API; if the output lands in a database, a report or a queue, use the batch API. For most teams this asynchronous tier covers 20-40% of total LLM spend, and moving it to batch halves that slice immediately. A UK scale-up running nightly enrichment over its customer records simply re-pointed those jobs at the batch endpoint and watched a meaningful fraction of the monthly bill disappear with one deployment.

Batching also composes with everything else here. A batch job can still use a cached prefix, and it can still be routed to a cheap model — the discounts stack.

Lever 3 — Model routing

The most common waste in agent systems is sending every prompt to the most capable, most expensive model. In practice, 60-80% of agent requests are routine: classify this, extract that field, format this list, decide which tool to call. They do not need a frontier model. A router sits in front of your model calls, classifies each prompt by difficulty, and dispatches accordingly — easy tasks to a cheap model, hard tasks reserved for a premium one.

On indicative 2026 pricing, a fast small model such as Haiku 4.5 sits around $1 per million tokens, while a frontier model such as Opus 4.6 sits around $5 per million tokens. If three-quarters of your traffic can safely move from the $5 tier to the $1 tier, the arithmetic is obvious. Teams that route well typically save 40-70% on total model spend.

Routing needs more engineering than caching — you have to build or adopt a classifier, define difficulty tiers, and add evaluation to confirm the cheap model is good enough on the tasks you send it. But it moves the needle hard, and it is the lever that scales best as your traffic grows. For a deeper comparison of where to actually run these models, see our breakdown of DeepInfra, Together, Fireworks and Groq as inference platforms.

Watch out

A router is only as good as its evaluation harness. If you down-route a task the cheap model quietly fails, you trade a small cost saving for a quality regression that reaches users. Before routing any task class to a cheaper model, run a few hundred real prompts through both tiers and compare outputs. Re-run that check whenever you change models or prompts.

Lever 4 — Context compaction

Long-running agents accumulate history: every tool call, every observation, every intermediate step stays in the conversation and gets re-sent on the next turn. By turn twenty, you may be paying to re-process tens of thousands of tokens of stale chatter. Context compaction removes the redundant tokens from conversation history before the request is sent.

Compaction is not the same as summarisation. Summarisation rewrites history into a shorter paraphrase — which costs an extra model call and risks losing detail. Compaction deletes verbatim: it strips out noise such as redundant tool output, repeated boilerplate and dead branches, while every sentence that survives is preserved character-for-character. Nothing is rewritten. In practice this yields a 50-70% token reduction on a busy agent transcript, and because surviving content is byte-identical, downstream behaviour does not change.

Compaction also stacks neatly with caching: a compacted, stable prefix is exactly what an exact-match cache wants. If you are pushing context windows hard, our piece on Claude Opus 4.7 and the 1M-token window covers why a bigger window is not a substitute for keeping the prompt lean.

Lever 5 — Prompt optimisation and trimming

The cheapest token is the one you never send. Most production prompts are bloated — verbose instructions written and never edited, redundant few-shot examples, politeness padding, restated rules. Prompt optimisation is the unglamorous work of editing that text down.

Concretely: cut few-shot examples to the minimum that holds quality, replace long prose instructions with terse bullet rules, remove duplicated guidance, and delete examples that no longer reflect your task. Trim retrieved context to the top passages that actually answer the question rather than dumping the whole retrieval set. None of this needs new infrastructure — it is editing — which is why it has the lowest implementation cost of any lever here. The savings are modest per request but compound across millions of calls, and a leaner prompt also caches and compacts better.

Treat prompt trimming as a recurring task, not a one-off. Prompts drift upward over time as people add "just one more rule"; schedule a quarterly prompt review the way you would a dependency audit.

Combined impact — what the five levers do together

Individually each lever is useful. Together they are transformative, and crucially they multiply rather than merely add, because they attack different parts of the bill. Caching cuts the cost of repeated input. Batching halves the cost of non-interactive work. Routing cuts the per-token rate on the majority of traffic. Compaction and trimming shrink the token count of every request before any of the above applies.

Layer them and a team typically cuts total LLM spend by 70-85% — without changing what the agent produces. That last clause matters: none of these levers degrade output quality when implemented correctly. They remove waste, not capability. The remaining 15-30% of the bill is the genuine, irreducible cost of the intelligence you are actually buying.

Lever Typical saving Implementation effort When to use
Prompt caching Up to ~90% on cached input Low (exact) to Medium (semantic) Stable prompt > ~1,024 tokens, thousands of requests/day
Batching ~50% flat discount Low Any workload where the user is not waiting
Model routing 40-70% Medium to High Mixed-difficulty traffic; 60-80% of requests routine
Context compaction 50-70% fewer tokens Medium Long-running agents with growing conversation history
Prompt trimming Modest, compounds at scale Low (editing only) Always — bloated, unedited prompts

Want to discuss this with other verified Builders?

Every article on AI Tech Connect is written by a Verified Builder. Browse profiles, shortlist who you want to hire or collaborate with.

Browse Builders →

Implementation order — start where the effort is lowest

Do not attempt all five at once. Sequence the work by effort-to-impact ratio, ship one lever, measure the bill, then move on.

  1. Provider-native prompt caching — the lowest implementation cost of all. Often a single cache breakpoint on the request. Restructure prompts so the stable block sits first, enable caching, done.
  2. Prompt compression and trimming — equally cheap because it is editing, not engineering. Cut the bloat while you are restructuring prompts for caching anyway.
  3. Batching — re-point every non-interactive job at the batch endpoint. Low effort, immediate ~50% on that slice of spend.
  4. Context compaction — more engineering, but a clean win for any long-running agent. Adopt a compaction layer in your agent loop.
  5. Model routing and semantic caching — the highest-effort levers. They take real engineering — a classifier, difficulty tiers, an evaluation harness — but they move the needle the most, so they are worth the investment once the cheap wins are banked.

The principle: provider-native caching and prompt compression have the lowest implementation cost, so start there. Routing and semantic caching take more engineering but deliver the largest sustained savings — schedule them deliberately rather than skipping them.

Your first week of cost cuts

A concrete five-day plan you can run starting tomorrow:

  • Day 1 — Measure. Pull last month's API usage and break it down by endpoint, model and interactive-versus-batch. You cannot cut what you have not measured.
  • Day 2 — Cache. Identify your largest stable prompt block, move it to the top of the prompt, and enable provider-native prompt caching. Confirm cache-hit rate in the response metadata.
  • Day 3 — Trim. Edit your top three prompts: cut redundant few-shot examples, condense instructions to bullet rules, drop politeness padding.
  • Day 4 — Batch. List every job where no user is waiting and re-point it at the batch API. Verify results land within the 24-hour window.
  • Day 5 — Plan routing. Sample 200 real prompts, tag each as easy or hard, and estimate the share that could safely move to a cheaper model. This becomes the business case for your routing sprint.

By the end of week one you should already see caching and batching savings in the dashboard, with a costed plan for routing and compaction. If you are also choosing where to run inference, our inference platform comparison is the natural next read. And if cost engineering is your craft, add your profile so teams hiring for exactly this skill can find you, or browse Builders who have shipped these patterns in production.