Why caching is the cost lever most teams miss
Walk into any growing AI team in Bengaluru or London and ask what their largest controllable infrastructure cost is. The honest answer, more often than not, is tokens — specifically, the same expensive context being re-sent on every call. A retrieval-augmented chatbot pasting the same 40k-token system prompt and tool schema into every turn. An evaluation harness re-uploading 200k tokens of test cases for each model variant. A coding agent that reads 800k tokens of repository context on every single edit.
Every major vendor now offers a way to make those repeated prefixes effectively free, or at least drastically cheaper. The mechanics differ, the surface area differs, and the gotchas differ — but the underlying idea is the same: the model has already seen this prefix, so don't make us re-process it from scratch. Anthropic calls it prompt caching. OpenAI calls it (implicit) prompt caching. Google calls it context caching.
The wins are not subtle. A team running an evaluation suite on a 100k-token context can routinely cut billed input cost by 80 to 90 per cent against the uncached price, depending on the vendor and TTL. A coding agent that re-reads the same monorepo for every edit lands in the same ballpark. And yet, in practice, most teams ship without caching enabled, or enable it incorrectly, and never actually measure their hit rate. That is the gap this article aims to close.
What follows is a vendor-by-vendor walk-through of the three caching surfaces, the prefix-stability rule that governs all of them, the prompt-layout patterns that actually move hit rate, and a cheat sheet for builders who need to ship next week — not in six months.
Anthropic — explicit cache_control, 5-min TTL, 1-hour beta
Anthropic's prompt cache is the most engineer-controllable of the three. You opt in to caching by tagging specific blocks of your prompt with cache_control. The cache stores the prefix from the start of the prompt up to and including any block marked with that flag, with a TTL of five minutes by default and a 1-hour beta tier available at the time of writing. See docs.anthropic.com for current pricing and the latest behaviour — rates and beta surface change.
The classic layout is to mark your system prompt, your tool definitions, and any large reference documents at the top of the conversation with cache_control, then leave the conversational user/assistant turns at the bottom uncached. On the next call within the TTL, the cached prefix is read at a steep discount and only the new user turn pays the full input rate.
// Anthropic: explicit cache_control on a long reference block
{
  "model": "claude-opus-4-7",
  "max_tokens": 1024,
  "system": [
    {
      "type": "text",
      "text": "You are a compliance assistant for UK and India financial services...",
      "cache_control": { "type": "ephemeral" }
    },
    {
      "type": "text",
      "text": "<document>...90,000 tokens of regulatory text...</document>",
      "cache_control": { "type": "ephemeral" }
    }
  ],
  "messages": [
    { "role": "user", "content": "Does Article 14 apply to retail brokers?" }
  ]
}
Two things to internalise. First, the cache hit covers everything from the start of the prompt up to the marked block — not just the marked block itself. Second, the cache write on the first call is more expensive than a normal input token, so the economics only pay off if you actually re-use the cached prefix at least once within the TTL. Single-shot calls with caching enabled are slightly more expensive than single-shot calls without it.
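To make the second point concrete, here is a minimal sketch of the break-even arithmetic. The multipliers (a 1.25x write premium and a 0.10x read rate, both relative to the normal input price) are illustrative assumptions rather than quoted rates, and `input_cost` is a hypothetical helper; substitute the numbers from the vendor's pricing page.

```python
def input_cost(prefix_tokens, calls, base_price=1.0,
               write_mult=1.25, read_mult=0.10, cached=True):
    """Cost of sending the same prefix `calls` times inside one TTL window.

    base_price is the normal price per input token (any unit).
    write_mult and read_mult are ASSUMED multipliers, not vendor quotes.
    """
    if calls < 1:
        raise ValueError("calls must be at least 1")
    if not cached:
        # Every call pays the full input rate for the whole prefix.
        return prefix_tokens * calls * base_price
    # First call writes the cache at a premium; later calls read it at a discount.
    write = prefix_tokens * write_mult * base_price
    reads = prefix_tokens * read_mult * base_price * (calls - 1)
    return write + reads


if __name__ == "__main__":
    for n in (1, 2, 10):
        plain = input_cost(100_000, n, cached=False)
        with_cache = input_cost(100_000, n, cached=True)
        print(n, plain, with_cache)
```

Under these assumed multipliers a single call costs more with caching than without, the second call already puts you ahead, and ten calls on a 100k-token prefix cost 215,000 token-units against 1,000,000 uncached.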
Place your cache_control breakpoints at the largest stable boundary you can find — typically the end of your system prompt and the end of your tool/document block. Anthropic supports multiple breakpoints; use them sparingly to avoid fragmenting the cache.
OpenAI — implicit auto-cache, what triggers a hit
OpenAI takes a different approach: there is no cache_control API. Caching is implicit. When a prompt is long enough (roughly 1,024 tokens or more) and its opening tokens exactly match the prefix of a recent call, the matching portion is automatically billed at a discount and the cache hit is reported in the response usage block. See platform.openai.com for current rates and minimum size, which the vendor adjusts from time to time.
This is enormously convenient — you do not need to change client code to benefit — but it shifts the engineering work from "configure caching" to "ensure your prefixes are actually stable". The model cannot tell you it would have cached if you had structured your prompt slightly differently. You have to design for it.
// OpenAI: same request shape, no cache hints needed
{
  "model": "gpt-5.5",
  "messages": [
    { "role": "system", "content": "<stable 8,000-token system prompt + tool schema>" },
    { "role": "user", "content": "What did the customer ask on Tuesday?" }
  ]
}
// Response usage block will surface cached_tokens when the prefix matches.
What this means in practice: any change to the system prompt, the tool definitions, or any other content before the user message invalidates the cache from the point of the change onwards. The first 1,024 tokens have to match exactly, byte-for-byte, so a change near the top loses the hit entirely. A change to the date string in your system header, a swap of two tool definitions, a re-ordering of fields in a structured schema: any of these will silently turn a 90 per cent hit rate into a 0 per cent hit rate without any error message.
Because OpenAI caching is implicit, regression is silent. You can ship a one-line "improvement" to your system prompt and not notice the bill has tripled until the end of the month. Log usage.prompt_tokens_details.cached_tokens on every call and alert when the ratio drops.
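One way to wire up that alert is a rolling-window monitor over the usage block. This is a minimal sketch: it assumes the usage object has already been converted to a plain dict, and `CacheRatioMonitor`, its window size and its threshold are hypothetical names and values chosen for illustration.

```python
from collections import deque


def cached_ratio(usage: dict) -> float:
    """Share of prompt tokens that were served from cache on one call."""
    prompt = usage.get("prompt_tokens", 0)
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0)
    return cached / prompt if prompt else 0.0


class CacheRatioMonitor:
    """Tracks the mean cached ratio over the last `window` calls."""

    def __init__(self, window: int = 100, threshold: float = 0.5):
        self.ratios = deque(maxlen=window)
        self.threshold = threshold

    def record(self, usage: dict) -> bool:
        """Record one call; return True when an alert should fire.

        The alert only fires once the window is full, so a cold start
        with a handful of cache misses does not page anyone.
        """
        self.ratios.append(cached_ratio(usage))
        mean = sum(self.ratios) / len(self.ratios)
        return len(self.ratios) == self.ratios.maxlen and mean < self.threshold
```

Call `record` with each response's usage dict and route a `True` result to whatever alerting channel you already use. The threshold is workload-specific, so treat the default as a placeholder.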
Gemini — explicit context caches you create and reference
Google's Gemini model exposes the most operational caching surface of the three. You explicitly create a cache object on the server side, get back a cache identifier, and reference that identifier in subsequent generation calls. The cache has its own lifecycle: a TTL, a cost for storage during its lifetime, and a discounted rate for cache-hit usage. See ai.google.dev for the current surface and pricing.
# Gemini: explicit cache, created once and then referenced by name
from google import genai

client = genai.Client()  # picks up the API key from the environment

# Placeholder for the large, stable reference text
large_document = "..."

cache = client.caches.create(
    model="gemini-2.5-pro",
    config={
        "contents": [large_document],
        "system_instruction": "You are a compliance assistant...",
        "ttl": "3600s",
    },
)

# Subsequent calls reference the cache by name
response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents="Does Article 14 apply to retail brokers?",
    config={"cached_content": cache.name},
)
This model is a better fit when you genuinely have one large stable context — a knowledge-base snapshot, a long PDF, a reference codebase — and you know you will hit it many times during the TTL window. It is a worse fit for a chatbot where the "stable" portion changes every release. Storage during the TTL is billed separately from the cache hits themselves, so creating a cache and only hitting it once or twice is rarely a win.
The prefix-stability rule — one changed token invalidates everything below
All three vendors share one fundamental constraint, and it is the single most common reason production cache hit rates are dismal. Caching works on exact prefix matches. The model's processing is autoregressive — it builds up state token by token from the start — so the cache is keyed on the sequence of tokens from position zero forward. Change any token, and the cache from that point onwards is invalid.
This sounds obvious, but the failure modes are subtle. Consider these patterns we have seen in production code from teams across India and the UK:
- Dynamic timestamps in the system prompt — "Today is 26 May 2026 at 11:42 UTC." That string changes every call. Every call is a cache miss.
- Request IDs threaded into the system prompt for tracing — same problem, different cause.
- Conditional tool definitions — feature-flagged tools that come and go based on user entitlement, sitting at the top of the prompt. Different users get different prefixes; no cross-user caching.
- JSON schema field re-ordering — Python dicts in older versions did not guarantee key order, so the serialised schema could differ between calls. A trap that has been quietly closed in modern runtimes, but worth checking.
- Personalisation prepended to the system prompt — "The user's name is Anita and her preferred language is Hindi." Personalisation belongs at the bottom of the prompt, not the top.
Do not include dynamic timestamps, request IDs, or per-user personalisation in your system prompt or tool definitions. Move all dynamic content into the latest user message. Putting today's date at the top of the prompt is the most common silent cache-killer in production AI apps.
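A cheap guard against several of the failure modes above is to serialise the stable part of the prompt deterministically and fingerprint it. The sketch below is an illustration under stated assumptions: `canonical_json` and `prefix_fingerprint` are hypothetical helpers, and each tool definition is assumed to be a dict with a "name" key.

```python
import hashlib
import json


def canonical_json(obj) -> str:
    """Serialise with sorted keys and fixed separators so output is byte-stable."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def prefix_fingerprint(system_prompt: str, tools: list) -> str:
    """Short hash of the stable prefix; it should be identical on every call.

    Tools are sorted by name (ASSUMES each tool dict has a "name" key) so
    that list order cannot change the serialised prefix.
    """
    ordered = sorted(tools, key=lambda tool: tool["name"])
    payload = system_prompt + "\n" + canonical_json(ordered)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
```

Log the fingerprint next to each request. If it differs between two calls that should share a prefix, something dynamic has leaked into the top of the prompt. The check is only meaningful if the same canonical serialisation is what you actually send to the API.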
Prompt layout that maximises hit rate
If there is one mental model that locks in good caching behaviour, it is this: design your prompt as concentric rings of stability. The innermost ring — the very start of the prompt — should change once a quarter at most. The outermost ring — the bottom of the prompt — changes every call. Everything in between sits on a spectrum of churn.
| Where in the prompt | Put here | Never put here |
|---|---|---|
| Very top (rarely changes) | System role, tone, output format rules | Today's date, session ID, user name |
| Top middle (changes weekly) | Tool definitions, function schemas | Feature-flag-conditional tools |
| Middle (changes per session) | Large reference documents, knowledge base snapshots | Real-time data, live metrics |
| Lower middle (per user) | User profile, preferences, conversation memory | Last-second corrections |
| Bottom (every turn) | The actual user message, retrieved RAG snippets, current timestamp if needed | Anything stable |
Here is the same prompt re-arranged to be cache-friendly. Note that the rewrite changes nothing about the model's behaviour — same instructions, same data, same answer — but the cache hit rate on the second call goes from 0 per cent to roughly 95 per cent of the input.
Before (cache-hostile):
System: Today is 2026-05-26 11:42 UTC. Session 8f3a-b41e.
The user is Anita from Bengaluru on the Pro plan.
You are a compliance assistant for UK and India financial services.
[80,000 tokens of regulatory reference text]
[3,000 tokens of tool definitions]
User: Does Article 14 apply to retail brokers?
After (cache-friendly):
System: You are a compliance assistant for UK and India financial services.
[3,000 tokens of tool definitions]
[80,000 tokens of regulatory reference text]
<-- cache_control breakpoint here on Anthropic -->
User: [Context: today is 2026-05-26 11:42 UTC. Session 8f3a-b41e.
The user is Anita from Bengaluru on the Pro plan.]
Does Article 14 apply to retail brokers?
Treat the first 4–8k tokens of your prompt as a write-once-per-release artefact. If you find yourself changing them per call, that change probably belongs in the user message instead.
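The same rearrangement can be enforced in code by making the request builder the only place where dynamic values enter the prompt. This is a minimal sketch that mirrors the shape of the Anthropic request shown earlier; `build_request` and its parameters are hypothetical, and the model and max_tokens fields are omitted for brevity.

```python
def build_request(role_rules: str, reference_text: str,
                  dynamic_context: str, question: str) -> dict:
    """Assemble a request with stable content first and dynamic content last."""
    system = [
        # Stable ring: role and format rules, changed once per release.
        {"type": "text", "text": role_rules},
        # Stable ring: large reference text, with the cache breakpoint at its end.
        {
            "type": "text",
            "text": reference_text,
            "cache_control": {"type": "ephemeral"},
        },
    ]
    # Dynamic ring: date, session and user details travel with the user turn.
    user_text = f"[Context: {dynamic_context}]\n{question}"
    return {
        "system": system,
        "messages": [{"role": "user", "content": user_text}],
    }
```

Because callers cannot pass a timestamp or user name into the system blocks, two requests built with different dynamic context still share an identical system prefix.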
A cross-vendor comparison
The three caching surfaces differ in shape but converge on the same economic outcome when used well. The table below captures the structural differences as of publication — see each vendor's docs for current rates, which all three move frequently.
| Capability | Anthropic | OpenAI | Gemini |
|---|---|---|---|
| Control model | Explicit (cache_control flags) | Implicit (auto on matching prefix) | Explicit (create cache object) |
| Default TTL | 5 minutes (1-hour beta available) | Vendor-managed, short-lived | Configurable per cache, up to hours |
| Minimum cache size | Per docs — small block minimum | ~1,024 token prefix | Per docs — substantial minimum |
| Billing model | Cache write + discounted cache read | Discounted cached tokens, no write fee | Storage during TTL + discounted cache-hit usage |
| Control surface | Per-request flags in API body | None — design your prefix to be stable | Separate cache lifecycle endpoints |
| Best fit | Mixed-stability prompts with clear breakpoints | Drop-in for any team with stable prefixes | Single large context reused across many calls |
Measurement — how to actually verify your hit rate
You cannot optimise what you do not measure. Every vendor surfaces cache statistics in the response, and every production AI app should be logging them. If you do not currently track cache hit rate, that is the highest-leverage observability change you can make this week.
- Anthropic: the response usage block returns cache_creation_input_tokens and cache_read_input_tokens. Log both. The ratio of cache_read to total input is your hit rate.
- OpenAI: the response usage.prompt_tokens_details.cached_tokens field tells you how many of the input tokens were billed at the cached rate.
- Gemini: cache usage is visible in the response metadata when you specify cached_content on the call.
The right alerting threshold depends on your workload. For a high-volume RAG chatbot with a stable knowledge base, anything under 80 per cent hit rate after a release suggests something at the top of the prompt has shifted. For an evaluation harness running variants, you should see near 100 per cent on the shared base prompt.
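To compare hit rates across vendors on one dashboard, the three usage shapes can be normalised into a single number. The sketch below takes usage blocks as plain dicts and rests on assumptions worth checking against each vendor's docs: that Anthropic's input_tokens excludes cached tokens, and that Gemini's usage metadata exposes cached_content_token_count and prompt_token_count, with the latter including cached tokens. `hit_rate` is a hypothetical helper.

```python
def hit_rate(vendor: str, usage: dict) -> float:
    """Fraction of input tokens served from cache, normalised across vendors."""
    if vendor == "anthropic":
        read = usage.get("cache_read_input_tokens", 0)
        # ASSUMPTION: input_tokens counts only the uncached remainder.
        total = (usage.get("input_tokens", 0)
                 + usage.get("cache_creation_input_tokens", 0)
                 + read)
    elif vendor == "openai":
        read = (usage.get("prompt_tokens_details") or {}).get("cached_tokens", 0)
        total = usage.get("prompt_tokens", 0)
    elif vendor == "gemini":
        # ASSUMPTION: field names as exposed in the usage metadata.
        read = usage.get("cached_content_token_count", 0)
        total = usage.get("prompt_token_count", 0)
    else:
        raise ValueError(f"unknown vendor: {vendor}")
    return read / total if total else 0.0
```

Emitting this one value per call, tagged with vendor and prompt template, is usually enough to drive the thresholds described above.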
When caching loses you money
Caching is not free, and there are workloads where enabling it costs more than disabling it. The two main loss patterns are small workloads and high-churn prefixes.
Small workloads: if your typical request only has a 2k-token prefix and you make fewer than a handful of calls per TTL window, the write/storage overhead can dominate the savings. Anthropic's cache write is more expensive per token than a normal input read, so a single cached call that never gets a follow-up is slightly more expensive than the same call uncached. Gemini's separate storage fee follows the same logic — if you create a cache and only hit it twice, you may not break even.
High-churn prefixes: if your system prompt or tool definitions change frequently (perhaps because you are iterating fast on prompt design, or because you have heavy per-user personalisation that you have not yet refactored to the bottom of the prompt), the cache misses outnumber the hits. Every cache write that is never read back is wasted money. Measure before you optimise.
If you are deep in prompt-engineering iteration and changing the system prompt several times an hour, turn caching off until the prompt stabilises. Then turn it back on. Otherwise you pay the write premium without ever banking the read savings.
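For cache objects with a separate storage fee, the same question becomes how many hits it takes to pay for the storage. A minimal sketch follows; every price in it is a made-up illustrative figure, `cache_object_saving` is a hypothetical helper, and the one-off cost of creating the cache is deliberately ignored.

```python
def cache_object_saving(prefix_mtok: float, hits: int, ttl_hours: float,
                        base_price: float, read_mult: float,
                        storage_price: float) -> float:
    """Saving (positive) or loss (negative) from using a stored cache object.

    prefix_mtok:   prefix size in millions of tokens
    base_price:    uncached input price per million tokens
    read_mult:     cached-read price as a fraction of base_price
    storage_price: storage price per million tokens per hour
    All prices are ASSUMED inputs; creation cost is not modelled.
    """
    uncached = prefix_mtok * hits * base_price
    cached_reads = prefix_mtok * hits * base_price * read_mult
    storage = prefix_mtok * ttl_hours * storage_price
    return uncached - (cached_reads + storage)


if __name__ == "__main__":
    for hits in (2, 3, 50):
        print(hits, cache_object_saving(0.1, hits, 1, 1.0, 0.25, 2.0))
```

With these illustrative inputs two hits in the hour lose money and three come out slightly ahead, which is why low-traffic caches deserve a quick calculation before they are created.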
A cross-vendor caching cheat sheet for India and UK builders
If you take only one set of action items away from this article, make it these.
- Audit the first 1,024 tokens of every distinct prompt template you ship. If anything in that block changes between calls, your hit rate is broken. Move dynamic content to the user message.
- Log cache statistics on every call. Anthropic surfaces cache_read_input_tokens; OpenAI surfaces cached_tokens; Gemini surfaces cache metadata. Pipe these to your observability stack and alert when hit rate drops.
- Layer your prompt by stability. Concentric rings: rarely-changing role and format rules at the top, tool definitions next, reference documents below that, per-user data lower, the user turn last.
- Choose your vendor's caching surface to match your workload. Implicit (OpenAI) for low-engineering drop-in; explicit cache_control (Anthropic) for mixed-stability prompts with clear breakpoints; explicit cache objects (Gemini) for one large context reused many times.
- Validate before you ship a prompt change. A "small wording tweak" at the top of the system prompt can wipe out your cache. Diff the first 1,024 tokens before any release.
- Read the vendor's pricing pages on the day you launch. All three vendors update rates frequently. Bake the current rates into your cost model and re-check quarterly.
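The pre-release diff from the checklist can be automated as a small check in CI. This is a minimal sketch, not a tokenizer-accurate test: it compares characters and assumes roughly four characters per token, which is a crude heuristic, and both function names are hypothetical.

```python
def first_divergence(old: str, new: str):
    """Index of the first differing character, or None if the strings are equal."""
    for i, (a, b) in enumerate(zip(old, new)):
        if a != b:
            return i
    # One string is a prefix of the other, or they are identical.
    return None if len(old) == len(new) else min(len(old), len(new))


def breaks_cached_prefix(old: str, new: str,
                         min_tokens: int = 1024,
                         chars_per_token: int = 4) -> bool:
    """True when the prompts diverge inside the first ~min_tokens tokens.

    chars_per_token is an ASSUMED average; use a real tokenizer for precision.
    """
    idx = first_divergence(old, new)
    return idx is not None and idx < min_tokens * chars_per_token
```

Run it against the currently deployed prompt template and the candidate one, and fail the build (or at least warn) when it returns True, so that an early-prefix edit is a deliberate decision rather than a surprise on the invoice.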
None of this is glamorous work. It is plumbing. But the teams in Bengaluru, Mumbai, London and Manchester that take caching seriously are routinely running long-context workloads at a fraction of the bill of teams that do not. That is the gap that gets you to break-even faster on any AI feature you ship.
Related reading on long-context economics: our explainer on Claude Opus 4.7 and the 1M-token window covers how caching multiplies the value of long context; the RAG production playbook for 2026 covers the retrieval side of the same problem; and AI inference costs in 2026 situates caching inside the broader unit-economics picture.