What prompt caching is and why it saves so much
Every time you call an LLM API, the provider's inference servers process every token in your prompt from scratch — parsing, attending, computing. That processing is where most of the cost lives. Prompt caching lets the provider store the key-value (KV) state for a portion of your prompt so that subsequent calls with the same prefix can skip the expensive recomputation entirely.
The result is dramatic. Anthropic reports up to 90% cost reduction and 85% latency reduction for long prompts when cache hits occur. Across all major providers, 2026 benchmarks show prompt caching reduces API costs by 41–80% and improves time-to-first-token (TTFT) by 13–31%.
The savings are so substantial because most production LLM workloads have a large, static prefix — a system prompt, a set of tool definitions, a retrieved document corpus — followed by a small, dynamic query. The static portion can be 95%+ of your total token budget. Cache that, and you pay almost nothing for the tokens that dominate your bill.
If you are already familiar with the high-level case for caching and want to explore broader cost strategies, see the LLM cost optimisation overview. This guide is the implementation deep-dive: how caching actually works under the hood, and precisely how to implement it on each major provider.
How caching works under the hood
All three major providers implement caching as prefix matching against a stored KV representation. The mechanics differ in important ways:
Prefix matching. The cache key is not a hash of your full prompt — it is a prefix of it. If the first N tokens of your current request match the first N tokens of a previously cached prompt, you get a cache hit on those N tokens. Everything after the matched prefix is processed normally. This is why structure matters enormously: the stable content must come first.
Cache entries, not whole prompts. On Claude, you mark specific content blocks with cache_control. The provider stores the KV state up to and including that block. On GPT-4o, the provider decides automatically what to cache — you have no explicit control. On Gemini, you create a named cache object that can be shared across requests.
TTL (time-to-live). Cached KV state is not stored forever. It expires after a provider-defined window. On expiry, the next request reprocesses the full prompt at standard pricing, then refreshes the cache. For Claude (as of January 2026), the TTL can be set to 5 minutes or 1 hour for Claude Haiku 4.5, Claude Sonnet 4.5, and Claude Opus 4.5. OpenAI does not expose TTL control. Gemini's Context Caching API accepts a configurable TTL.
Cache write vs. cache read pricing. Writing to the cache (the first call that populates it) typically costs slightly more than a normal input token. Cache reads are where the savings are: 10× cheaper on Claude, 2× cheaper on GPT-4o. Architect your sessions to write the cache once and read from it many times.
A published arXiv paper, "Don't Break the Cache" (arxiv.org/pdf/2601.06007), evaluates prompt caching behaviour for long-horizon agentic tasks in depth — recommended reading if you are building agents with long action sequences.
Claude: cache_control blocks, TTL options, and pricing
Claude's approach to caching is explicit and granular. You add a cache_control parameter to any content block you want the provider to cache. The cache checkpoint is created at that block's boundary — tokens before the checkpoint are eligible for caching on subsequent calls with the same prefix.
As of June 2026, Claude 3.5 Sonnet pricing is $3 per million input tokens for fresh (uncached) processing. Cached token reuse costs $0.30 per million tokens — 10× cheaper. Cache write costs a small premium over standard input. The 5-minute TTL is appropriate for interactive sessions; the 1-hour TTL is better for batch pipelines and agent loops that run over a longer horizon.
A team spending $5,000 per month on Claude API that adds prompt caching to a RAG pipeline with a large document context could realistically reduce to under $1,000 per month — an 80% saving — simply by moving the document corpus before the query and adding a single cache_control block.
Here is the minimal implementation using the Anthropic Python SDK. The cache_control block with "type": "ephemeral" instructs Claude to cache everything up to and including the system prompt block.
import anthropic
client = anthropic.Anthropic()
# System prompt with a large, stable context (e.g. RAG documents, tool definitions)
SYSTEM_PROMPT = """You are an expert analyst. You have access to the following documents:
[... large document corpus, 50k+ tokens ...]
Answer questions based only on these documents. Cite your sources."""
response = client.messages.create(
model="claude-sonnet-4-5",
max_tokens=1024,
system=[
{
"type": "text",
"text": SYSTEM_PROMPT,
"cache_control": {"type": "ephemeral"} # Cache everything up to this block
}
],
messages=[
{"role": "user", "content": "What were the key findings on page 12?"}
]
)
# Check cache usage in the response
usage = response.usage
print(f"Input tokens: {usage.input_tokens}")
print(f"Cache read tokens: {usage.cache_read_input_tokens}")
print(f"Cache write tokens: {usage.cache_creation_input_tokens}")
print(f"Output tokens: {usage.output_tokens}")
On the first call, cache_creation_input_tokens will be populated and cache_read_input_tokens will be zero. On subsequent calls within the TTL window, those numbers flip — and your billing drops by up to 90%.
You can also cache multiple blocks at different depths. For agent systems, cache the system prompt at one checkpoint and the tool definitions at a second, later checkpoint. This lets you invalidate tool definitions independently of the system prompt if your tool set changes.
import anthropic
client = anthropic.Anthropic()
SYSTEM_PROMPT = "You are a helpful coding assistant with expertise in Python and TypeScript."
TOOL_DEFINITIONS_TEXT = """
Available tools:
- run_python(code: str) -> str: Execute Python code and return output
- read_file(path: str) -> str: Read a file from the workspace
- write_file(path: str, content: str) -> bool: Write content to a file
- search_web(query: str) -> list[dict]: Search the web and return results
"""
response = client.messages.create(
model="claude-sonnet-4-5",
max_tokens=2048,
system=[
{
"type": "text",
"text": SYSTEM_PROMPT,
"cache_control": {"type": "ephemeral"} # Checkpoint 1: cache system prompt
},
{
"type": "text",
"text": TOOL_DEFINITIONS_TEXT,
"cache_control": {"type": "ephemeral"} # Checkpoint 2: cache tools on top
}
],
messages=[
{"role": "user", "content": "Write a function to parse CSV files."}
]
)
The TTL clock starts from the last cache hit, not the first cache write. A request that arrives 4 minutes 59 seconds after the last hit resets the TTL to a fresh 5 minutes. Design high-frequency agent loops to stay within the TTL naturally; for batch jobs with gaps, use the 1-hour TTL and issue a warm-up call before your batch begins.
GPT-4o: automatic caching, what triggers it, what does not
OpenAI's approach to prompt caching is the opposite of Anthropic's — it is entirely automatic. There is no cache_control parameter, no explicit cache creation step, and no TTL you can configure. The provider caches prompt prefixes transparently and bills cached tokens at 50% of the normal input price.
The cache triggers on prefix matches of 1,024 tokens or more. Requests shorter than 1,024 tokens will never hit cache. Requests where the first 1,024 tokens differ from any cached prefix will also miss. OpenAI does not guarantee cache hits — the cache is best-effort and cannot be explicitly controlled.
This design simplifies your code significantly: there are no SDK changes required. The trade-off is that you cede control, and you cannot inspect cache hit rates in the API response with the same granularity as Claude.
The practical implication is that you must be disciplined about prompt structure. The stable prefix — your system prompt, any fixed context, tool definitions — must always come first and must never change between calls that you want to share a cache entry. Here is a pattern that maximises cache hit rate for GPT-4o:
from openai import OpenAI
client = OpenAI()
# IMPORTANT: Stable content MUST come first, dynamic content MUST come last.
# Any change to content before the 1,024-token mark invalidates the cache.
SYSTEM_PROMPT = """You are a senior product manager assistant with deep expertise in
B2B SaaS. Your role is to help analyse customer feedback, prioritise feature requests,
and draft product specifications.
[... extended stable context, personas, framework definitions, style guides ...]
"""
def query_product_assistant(user_question: str, additional_context: str = "") -> str:
"""
Structure: stable system prompt first, then any fixed session context,
then the dynamic user question LAST. Never prepend timestamps or user IDs
to the beginning of messages — this breaks cache prefix matching.
"""
messages = [
# System prompt: large, stable, never changes between calls
{"role": "system", "content": SYSTEM_PROMPT},
]
# Fixed session context (e.g. current sprint goals) — still before the query
if additional_context:
messages.append({
"role": "user",
"content": f"Session context: {additional_context}"
})
messages.append({
"role": "assistant",
"content": "Understood. I'll keep this context in mind for your questions."
})
# Dynamic user query — always appended LAST
messages.append({"role": "user", "content": user_question})
response = client.chat.completions.create(
model="gpt-4o",
messages=messages,
max_tokens=1024,
)
# Usage shows cached_tokens when a cache hit occurred
usage = response.usage
if hasattr(usage, 'prompt_tokens_details'):
cached = usage.prompt_tokens_details.cached_tokens
print(f"Cache hit: {cached} tokens at 50% price")
return response.choices[0].message.content
# Usage
answer = query_product_assistant(
"What are the top three friction points from last month's NPS survey?"
)
A common GPT-4o mistake is injecting a timestamp or session ID into the system prompt for logging purposes — something like "Current time: {datetime.now()}". This changes the very first tokens of the prompt on every single call, guaranteeing a cache miss every time. Move any per-request metadata to a user message at the end of the conversation, never to the system prompt.
Gemini: Context Caching API, TTL, and cost
Google's approach to prompt caching is explicit like Claude's, but at a higher level of abstraction. Rather than marking individual content blocks, you create a named cache object via the API before your first request. Subsequent requests reference the cache by name, and the cached tokens are billed at a lower rate than regular input tokens. The TTL is configurable when you create the cache.
This design is particularly well-suited for multi-user workloads where many different users query against the same large document corpus. Create the cache once, then fan it out across thousands of user queries — each query pays only for the dynamic portion plus the reduced cached-token rate.
import google.generativeai as genai
from google.generativeai import caching
import datetime
genai.configure(api_key="YOUR_API_KEY")
# Step 1: Create the cache with your large, stable content
# This is done ONCE — the cache object is reused across many requests
LARGE_DOCUMENT_CORPUS = """
[Your large, stable document corpus — product documentation,
knowledge base, regulatory text, codebase context, etc.
Should be at least a few thousand tokens to justify caching overhead.]
"""
cache = caching.CachedContent.create(
model="models/gemini-1.5-flash-001",
display_name="product-knowledge-base-v3",
system_instruction="You are a helpful assistant with access to our product documentation.",
contents=[LARGE_DOCUMENT_CORPUS],
ttl=datetime.timedelta(hours=1), # Keep alive for 1 hour
)
print(f"Cache created: {cache.name}")
print(f"Cache expires: {cache.expire_time}")
# Step 2: Use the cache in your requests
# Reference the cache by name — cached tokens billed at reduced rate
model = genai.GenerativeModel.from_cached_content(cached_content=cache)
# Each call to generate_content pays only for the dynamic query tokens
# plus the lower cached-token rate for the corpus
response = model.generate_content(
"What does the documentation say about rate limits?"
)
print(response.text)
# You can reuse the same cache object for many different user queries
response2 = model.generate_content(
"How do I authenticate with the API?"
)
print(response2.text)
# Step 3: When done, delete the cache to stop incurring storage costs
cache.delete()
For production systems, store the cache name in your application state rather than creating a new cache on every cold start. Implement a simple check: if the cache exists and has not expired, reuse it; otherwise create a new one and update your stored reference.
import google.generativeai as genai
from google.generativeai import caching
import datetime
import json
import os
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
CACHE_STATE_FILE = "/tmp/gemini_cache_state.json"
def get_or_create_cache(document_corpus: str, cache_display_name: str):
"""
Returns a valid CachedContent object, reusing an existing cache
if it is still within its TTL, or creating a new one if needed.
"""
# Check for an existing, non-expired cache
if os.path.exists(CACHE_STATE_FILE):
with open(CACHE_STATE_FILE) as f:
state = json.load(f)
try:
existing = caching.CachedContent.get(state["cache_name"])
# Cache still valid — reuse it
return existing
except Exception:
pass # Cache expired or not found — fall through to create
# Create a new cache
cache = caching.CachedContent.create(
model="models/gemini-1.5-flash-001",
display_name=cache_display_name,
contents=[document_corpus],
ttl=datetime.timedelta(hours=1),
)
# Persist the cache name for future calls
with open(CACHE_STATE_FILE, "w") as f:
json.dump({"cache_name": cache.name}, f)
return cache
# Usage
corpus = open("knowledge_base.txt").read()
cache = get_or_create_cache(corpus, "knowledge-base-prod")
model = genai.GenerativeModel.from_cached_content(cached_content=cache)
response = model.generate_content("How does the refund policy work?")
Cross-provider cost comparison
The table below shows estimated costs (as of June 2026) for a typical RAG workload: 100,000 input tokens per request (of which 90,000 are stable document context) and 500 output tokens. Prices are indicative — always verify current pricing on the provider's official pricing page before committing to architecture decisions.
| Provider / Model | Fresh input (per MTok) | Cached input (per MTok) | Saving | Cost at 100k tokens (cached) | Cost at 1M tokens/day (cached) |
|---|---|---|---|---|---|
| Claude 3.5 Sonnet | $3.00 | $0.30 | 90% | ~$0.03 | ~$0.30 |
| GPT-4o | $2.50 | $1.25 | 50% | ~$0.13 | ~$1.25 |
| Gemini 1.5 Flash | $0.075 | $0.019 | ~75% | ~$0.002 | ~$0.019 |
| Gemini 1.5 Pro | $1.25 | $0.3125 | 75% | ~$0.031 | ~$0.31 |
A few observations worth drawing out:
- Claude's 90% reduction is the headline number, but Gemini Flash's absolute cost is lowest even before caching — the right choice depends on your quality requirements.
- GPT-4o's 50% saving is the floor across providers. Its automatic caching requires the least engineering effort to implement.
- At scale, the differences compound quickly. A workload hitting 10M cached tokens per day saves $270/day on Claude vs. $125/day on GPT-4o vs. $190/day on Gemini 1.5 Pro — purely from caching, with no model change.
- Cache write costs are not zero. Factor them into your economics, especially for workloads with low request volume per cached corpus (few cache reads per write).
If you are running a mixed workload — some long-context RAG queries, some short-turn chat — do not apply a single caching strategy across the board. Cache hit rate for short messages is low (little stable prefix to reuse), so the cache write overhead may cost more than the cache read saves. Measure your actual token distribution before enabling caching everywhere.
Production patterns: RAG, agents, multi-turn
Knowing the mechanics is one thing — knowing which architectural pattern fits your workload is another. Here are three concrete patterns that deliver the highest cache hit rates in production.
Pattern 1 — RAG system caches
The classic RAG use case: a large document corpus is retrieved and prepended to the prompt, then a question is asked against it. The corpus changes slowly (daily re-indexing, at most); the query changes every request. Cache the corpus, vary only the query.
import anthropic
client = anthropic.Anthropic()
def rag_query(document_corpus: str, user_query: str) -> str:
"""
RAG query with prompt caching.
The document corpus is cached after the first call.
Subsequent calls with the same corpus pay 10x less for those tokens.
The user query is always processed fresh (it changes every call).
"""
response = client.messages.create(
model="claude-sonnet-4-5",
max_tokens=1024,
system=[
{
"type": "text",
"text": (
"You are a document Q&A assistant. Answer questions based only "
"on the documents provided. Cite your sources by document title "
"and section. If the answer is not in the documents, say so.\n\n"
f"Documents:\n{document_corpus}"
),
"cache_control": {"type": "ephemeral"}
}
],
messages=[
{"role": "user", "content": user_query}
]
)
usage = response.usage
cache_hit = getattr(usage, 'cache_read_input_tokens', 0)
cache_miss = getattr(usage, 'cache_creation_input_tokens', 0)
if cache_hit > 0:
print(f"Cache HIT — {cache_hit} tokens served from cache")
else:
print(f"Cache MISS — {cache_miss} tokens written to cache")
return response.content[0].text
# Load your document corpus once per session / batch
corpus = open("retrieved_documents.txt").read()
# Multiple queries against the same corpus — only the first call
# pays full input price for the corpus tokens
answers = [
rag_query(corpus, "What are the key financial risks identified?"),
rag_query(corpus, "Who is the primary regulator mentioned?"),
rag_query(corpus, "What is the recommended remediation timeline?"),
]
Pattern 2 — Agent tool-call caches
Agentic systems typically have a large, stable system prompt defining the agent's role, reasoning approach, and output format, plus a set of tool definitions. Both are excellent cache candidates — they are fixed for the duration of a job, often running to 5,000–20,000 tokens before the actual task context is added.
import anthropic
import json
client = anthropic.Anthropic()
# These are fixed for the entire agent run — cache both
AGENT_SYSTEM_PROMPT = """You are an autonomous software engineering agent.
Your role is to complete coding tasks by using the available tools.
Reasoning approach:
1. Analyse the task requirements thoroughly before writing any code
2. Break complex tasks into discrete, testable subtasks
3. Write code incrementally, testing each component before proceeding
4. When you encounter an error, diagnose the root cause before attempting a fix
5. Always verify your changes produce the expected output before declaring completion
[... extended agent instructions, style guides, safety boundaries ...]
"""
TOOLS = [
{
"name": "read_file",
"description": "Read the contents of a file",
"input_schema": {
"type": "object",
"properties": {"path": {"type": "string"}},
"required": ["path"]
}
},
# ... more tool definitions ...
]
TOOLS_TEXT = f"Tool definitions:\n{json.dumps(TOOLS, indent=2)}"
def run_agent_turn(task: str, conversation_history: list) -> dict:
"""Single agent turn with cached system prompt and tool definitions."""
response = client.messages.create(
model="claude-sonnet-4-5",
max_tokens=4096,
system=[
{
"type": "text",
"text": AGENT_SYSTEM_PROMPT,
"cache_control": {"type": "ephemeral"} # Cache the agent instructions
},
{
"type": "text",
"text": TOOLS_TEXT,
"cache_control": {"type": "ephemeral"} # Cache the tool definitions too
}
],
tools=TOOLS,
messages=conversation_history + [
{"role": "user", "content": task}
]
)
return response
Pattern 3 — Multi-turn conversation caches
For chat applications with a large system prompt (personas, brand voice guidelines, knowledge base), cache the system prompt and let the conversation history grow. As the conversation extends, only the new messages are processed at full price; the stable system prompt continues to be served from cache.
import anthropic
client = anthropic.Anthropic()
# Large, stable persona/instructions — cache this
PERSONA_PROMPT = """You are Aria, a customer support specialist for Acme SaaS.
Your personality: patient, thorough, technically adept. You never say "I don't know"
without first searching for an answer. You escalate to a human when the issue
requires account access or billing changes.
Product knowledge:
[... 10,000 tokens of product documentation, FAQ, troubleshooting guides ...]
Escalation criteria:
- Billing disputes over £100
- Data deletion requests (GDPR/DPDP)
- Security incident reports
- Three or more failed resolution attempts
"""
def chat(conversation_history: list, user_message: str) -> tuple[str, list]:
"""
Multi-turn chat with cached system prompt.
The persona/knowledge base is cached on the first call.
Conversation history grows, but only new messages are processed fresh.
"""
new_history = conversation_history + [
{"role": "user", "content": user_message}
]
response = client.messages.create(
model="claude-sonnet-4-5",
max_tokens=512,
system=[
{
"type": "text",
"text": PERSONA_PROMPT,
"cache_control": {"type": "ephemeral"}
}
],
messages=new_history
)
assistant_reply = response.content[0].text
updated_history = new_history + [
{"role": "assistant", "content": assistant_reply}
]
return assistant_reply, updated_history
# Usage
history = []
reply, history = chat(history, "Hi, I can't log into my account.")
reply, history = chat(history, "I've tried resetting my password twice.")
reply, history = chat(history, "The reset email isn't arriving.")
Implementing this at your company?
Browse verified AI Builders with LLM infra and cost optimisation expertise on AI Tech Connect — or add your own profile if you specialise in this area.
Browse Builders →Common mistakes that kill your cache hit rate
Caching is straightforward in principle but easy to break in practice. These are the most common failure modes, in order of how often they appear in production codebases.
Inserting a timestamp, request ID, user name, or session variable into the system prompt before the cacheable content invalidates the cache on every single request. The prefix changes, so there is no prefix match. Move all per-request dynamic data to user messages at the end of the conversation, never to the system prompt header.
Cache prefix matching starts from the beginning of the prompt. If you prepend the user's query or any variable data before your large static context, the variable portion breaks the prefix and the static context never gets a cache hit. Order: system_prompt → tool_definitions → document_context → [CACHE CHECKPOINT] → user_query.
A trailing space, an extra newline, or a different quote style in your system prompt string will produce a different prefix and a cache miss. Use a single canonical constant for your system prompt rather than building it dynamically. If you must build it programmatically, strip and normalise whitespace before caching.
If your workload is bursty — high volume for 10 minutes, then silence for 20 — the cache may expire between bursts, and the first request of each burst pays cache write pricing. For Claude's 5-minute TTL, a warm-up call (or a lightweight "ping" with the same prefix) at the start of each burst keeps the cache warm and protects the high-volume requests that follow.
If your "stable" context is actually updated every few minutes (live pricing data, real-time inventory), you are paying cache write overhead without collecting cache read savings. Only cache content with a natural lifespan longer than your TTL. Separate stable content (product catalogue structure, UI copy) from dynamic content (current prices, stock levels) and cache only the former.
Prompt caching only saves money when you are actually hitting the cache. Instrument your calls to log cache_read_input_tokens vs cache_creation_input_tokens (on Claude) or prompt_tokens_details.cached_tokens (on OpenAI). A cache hit rate below 80% on a RAG workload is a signal that your prompt structure needs attention.
For a broader view of LLM cost strategies beyond caching — including model routing, output compression, and batching — see the LLM cost optimisation overview. For the RAG architecture that these caching patterns sit inside, see the production RAG guide. If you are evaluating whether to self-host vs. use the API in the first place, the self-host vs. API guide covers the full trade-off analysis.
And if you want to learn from Builders who have implemented these optimisations in production, browse the AI Tech Connect Builder directory — or follow our AI news feed for new provider announcements as pricing and caching APIs continue to evolve.