Most teams building on LLM APIs discover the same uncomfortable truth around the time their product starts getting real usage: the cost curve is not flat. Every user who asks a question incurs an API call. Every API call burns tokens. At 10 users, the bill is negligible. At 10,000 users, it is a spreadsheet item. At 100,000 users, it is a board discussion.

What surprises most teams is not that costs scale — of course they do — but that a significant fraction of those API calls are asking questions that have already been answered. The same support question, slightly reworded. The same code pattern, in a different file. The same onboarding query, from the forty-seventh new user. Every one of those is a full API round trip with the cost and latency of a fresh generation, when the answer already exists.

Caching is not a new idea. Every web engineer has built HTTP caches, database query caches, CDN edge caches. Applied to LLM APIs, the same principle compounds across three distinct tiers — and each tier catches what the previous one missed. Done properly, the result is a 70–90% reduction in API spend with no meaningful quality loss, and often a measurable latency improvement too.

This guide covers the full three-tier stack, with production hit rate benchmarks, a worked cost model, and honest notes on where each tier falls short.

Why Most Teams Skip Caching Until It Hurts

In the prototype phase, caching feels like premature optimisation. You have five users, you are iterating on the prompt, and adding a Redis layer feels like infrastructure theatre. The prompt changes every day anyway, which would invalidate a cache immediately. So caching gets deferred.

The problem is that caching discipline is significantly harder to retrofit than to build in from the start. Adding a cache after you have a live product means instrumenting every LLM call, auditing your prompt structure for cache-friendliness, and deploying infrastructure into a production system that is already under load. Teams who defer this until the bill becomes painful are doing the hardest possible version of the work.

The practical threshold for caching to be worth the engineering effort: above roughly 500 LLM API calls per day, some form of caching will have a positive ROI within the first month. Below that threshold, focus on product. Above it, caching should be in your infrastructure roadmap immediately.

Tier 1 — Exact-Match Caching: The Cheapest Cache Hit You Will Ever Get

Exact-match caching is the simplest form: hash the entire prompt (system prompt plus user message, normalised for whitespace), look it up in a key-value store, and return the stored response if found. If not found, call the API, store the response, return it. The cache key is the hash; the cache value is the full API response.

The hit rate depends entirely on workload type. For FAQ bots, customer support tools, and document-based Q&A systems where users repeatedly ask the same set of questions, exact-match hit rates of 45–80% are consistently achievable in production. The same question about pricing, the same request to summarise a standard document, the same query about business hours — these recur at high frequency and benefit directly from exact-match caching.

For open-ended generative workloads — creative writing, brainstorming, personalised recommendations — exact-match hit rates fall to near zero. Every prompt is unique by design. Exact-match caching still earns its infrastructure cost in these workloads if you have a fixed, reusable system prompt, but the user-message half of the cache key will rarely collide.

Setting Up LiteLLM with Redis for Exact-Match Caching

LiteLLM is the most popular open-source LLM gateway and the de facto standard starting point for exact-match caching in Python-based AI products. It provides an OpenAI-compatible proxy layer with built-in Redis caching, cost tracking, and provider switching — all configurable via a YAML file.

A minimal Redis-backed exact-match cache configuration in LiteLLM looks like this:

model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-5
      api_key: os.environ/ANTHROPIC_API_KEY

litellm_settings:
  cache: true
  cache_params:
    type: redis
    host: os.environ/REDIS_HOST
    port: 6379
    ttl: 3600          # seconds; tune per workload
    supported_call_types:
      - acompletion
      - completion

With this configuration, any two requests that produce an identical cache key will return the cached response. LiteLLM handles key construction, TTL expiry, and cache miss fallthrough automatically. For teams on AWS, ElastiCache for Redis in the same VPC as your API gateway eliminates network latency from the cache lookup — typically sub-millisecond round trips versus 200–800ms for a full LLM API call.

Pro tip

Set your cache TTL based on how frequently your underlying data changes, not on technical constraints. For a product whose knowledge base is updated weekly, a 7-day TTL on FAQ responses is safe. For a product connected to live data (stock prices, news, availability), short or zero TTL is correct. A stale cache hit is worse than a cache miss — the cost of a wrong answer is higher than the cost of a redundant API call.

Tier 2 — Semantic Caching: Catching the Paraphrased Query

The limitation of exact-match caching is obvious: "What is your refund policy?" and "How do I get a refund?" are different strings that hash to different keys. In a support bot, both questions should return the same answer. Exact-match caching misses both — unless the user happens to type the exact string that was cached previously.

Semantic caching solves this by embedding the incoming query into a vector space and searching for existing cache entries that are semantically close. When a new query arrives, it is embedded using a fast, cheap embedding model (text-embedding-3-small at $0.02/MTok, or a locally-hosted sentence-transformer at near-zero marginal cost). The embedding is compared against stored embeddings using cosine similarity. If the closest match exceeds a configurable threshold (typically 0.90–0.95), the cached response for that similar query is returned.

The incremental hit rate from semantic caching — on top of exact-match — ranges from roughly 10% for open-ended chat to 70% for narrow-domain FAQ workloads. For a support bot where "refund policy", "get a refund", "money back guarantee", and "cancel order" all map to the same answer cluster, semantic caching can capture the majority of query variations without any additional API calls.

GPTCache: Semantic Caching Without Building the Pipeline Yourself

GPTCache is the most mature open-source semantic caching library for LLM applications. It handles the full pipeline: embedding the query, storing embeddings in a vector backend (Faiss locally, Qdrant or Weaviate for production), similarity search, and cache retrieval. It integrates directly with LiteLLM and with the OpenAI Python SDK via a lightweight monkey-patch.

The critical configuration decision in GPTCache is the similarity threshold. Too low (below 0.85), and you risk returning semantically adjacent but factually wrong cached responses — a "refund policy" answer served for a "return shipping label" query, for example. Too high (above 0.98), and you lose most of the benefit over exact-match caching. Production-tested starting points: 0.92 for narrow-domain support bots, 0.88 for code assistants with well-defined task types, 0.85 for internal knowledge-base tools. Build an evaluation set of known-similar and known-different query pairs and validate your threshold against it before deploying.

Watch out

Semantic caching introduces a new failure mode that exact-match caching does not have: silent wrong answers. An exact-match cache either returns the right answer or misses and falls through to the API. A semantic cache at an overly low threshold can return a confidently wrong answer with no indication that it came from the cache. Log every semantic cache hit with its similarity score during the first month of production operation, and audit samples where the score is in the 0.85–0.90 range.

Tier 3 — Provider-Level Prompt Caching: The Highest-Leverage Tier

Provider-level prompt caching operates at the infrastructure layer of the LLM provider itself, below your application code. Rather than caching complete prompt-response pairs, it caches the internal KV (key-value) representations — the intermediate activations — of a repeated prompt prefix. When the same prefix appears in a subsequent request within the TTL window, the provider skips recomputing those activations and charges a sharply reduced rate for the cached portion.

Anthropic's implementation on Claude models is the most mature and economically significant. Claude Sonnet currently charges $3.00 per million input tokens for uncached content and $0.30 per million tokens for cache reads — a 90% discount. Cache writes (the first time a prefix is stored) cost $3.75 per million tokens. The TTL is five minutes: any request that reuses a cached prefix within five minutes of the last request containing that prefix benefits from cache-read pricing.

OpenAI's prompt caching operates on a similar principle with automatic cache management — no explicit API calls required — and offers roughly a 50% discount on cached input tokens for models that support it. The cache window and TTL mechanics differ from Anthropic's implementation; consult OpenAI's current documentation for the latest parameters.

Structuring Prompts to Maximise Provider Cache Hits

The fundamental rule for provider-level caching is: put the stable content first, the dynamic content last. The cache operates on prefixes — the longest matching prefix between the current request and a cached version. If your dynamic user message comes before your system prompt in the context, no prefix caching is possible. The correct structure is:

  1. System prompt — persona, instructions, tool definitions, knowledge base. Static or near-static. This should be the first content in your context.
  2. Conversation history — previous turns, ideally cached as they accumulate. Changes only as the conversation grows.
  3. Retrieved context — RAG documents, if used. Changes per query but can be cached if the same documents appear repeatedly.
  4. Current user message — dynamic, at the end. Never cached, but the smallest part of the context.

For products with a system prompt longer than 1,024 tokens — which includes most enterprise products with detailed personas, tool definitions, or embedded knowledge — prompt caching alone, applied to the system prompt, typically cuts input costs by 50–80% on active user sessions where the five-minute TTL window is maintained by ongoing conversation traffic.

For a full treatment of the economics of falling inference costs and how they interact with prompt caching strategies, see our companion piece on AI inference costs and unit economics in 2026.

The Combined Stack: Architecture and Data Flow

The three tiers work most effectively as a sequential pipeline. An inbound query passes through each layer in order; the first cache hit wins and terminates the chain. The architecture looks like this:

Inbound request → Gateway (LiteLLM / Kong / Bifrost) → Exact-match cache (Redis) → Semantic cache (GPTCache + vector DB) → Provider API with prompt caching enabled → Response returned and stored at all applicable tiers.

In concrete terms: a user query arrives at your application. The gateway layer normalises and hashes it. If the hash matches a Redis key, the cached response is returned immediately — zero latency beyond a Redis lookup. If it misses, the query is embedded and compared against the semantic cache. If a similar entry is found above the threshold, that cached response is returned. If it misses again, the query is sent to the LLM API with the system prompt structured to maximise provider-level cache hits. The response comes back, is stored in both the semantic and exact-match caches, and is returned to the user.

Each tier is independently optional. A team with a narrow FAQ bot should start at Tier 1 alone — the exact-match hit rate will be high enough to justify the infrastructure without adding the complexity of a vector database. A team with an open-ended assistant will get minimal benefit from Tier 1 but substantial benefit from Tier 3 alone. The full three-tier stack pays off for products that have both repetitive query patterns (justifying Tiers 1 and 2) and long system prompts (justifying Tier 3).

Production Hit Rates by Workload Type

Below are representative production hit rates observed across the three tiers for common AI product workload types. These figures reflect data from production deployments; your specific numbers will depend on query distribution, system prompt length, and cache TTL settings.

Workload Tier 1 hit rate Tier 2 hit rate Tier 3 saving Combined cost reduction
FAQ / support bot 45–80% 15–30% of misses 50–80% on system prompt tokens 65–85%
Code assistant 10–25% 20–40% of misses 60–80% (large context windows) 40–55%
Document Q&A (RAG) 20–35% 25–45% of misses 40–70% (stable docs in context) 50–65%
Open-ended chat 5–15% 5–20% of misses 50–80% on system prompt 20–40%
Data extraction / classification 30–60% 20–40% of misses 70–90% (template-heavy prompts) 60–80%

Cost Model: From £/$ 500/Month to £/$ 75

A worked example makes the compound effect concrete. Suppose your product is a customer support assistant running on Claude Sonnet (current pricing: $3.00/MTok input, $15.00/MTok output). Your usage pattern is 50,000 interactions per month, each consisting of a 2,000-token system prompt, 500 tokens of conversation history, 150 tokens of user message, and 400 tokens of response.

Before caching (baseline)

Input tokens per interaction: 2,650. Output tokens per interaction: 400.

Monthly input cost: 50,000 × 2,650 / 1,000,000 × $3.00 = $397.50

Monthly output cost: 50,000 × 400 / 1,000,000 × $15.00 = $300.00

Total baseline: $697.50/month (approximately £560 at current exchange rates)

After Tier 3 only — provider-level prompt caching on system prompt

The 2,000-token system prompt qualifies for cache-read pricing on repeated calls. Assuming 70% of requests arrive within the five-minute TTL window (reasonable for an active support product during business hours):

Cache-read input cost: 50,000 × 0.70 × 2,000 / 1,000,000 × $0.30 = $21.00

Uncached input cost (30% of calls + all non-system-prompt tokens): 50,000 × [0.30 × 2,000 + 650] / 1,000,000 × $3.00 = $187.50

Cache write cost (first occurrence of prefix per TTL window): approximately $10.50 additional.

Output cost unchanged: $300.00

After Tier 3: approximately $519/month — a 25% reduction from prompt caching alone.

After all three tiers

Adding Tiers 1 and 2 (combined hit rate of 65% for a support bot — exact-match catches 50%, semantic catches another 30% of misses):

Effective API call rate: 35% of original interactions reach the API (65% served from cache).

API-bound input cost: 35% × previous Tier-3-adjusted input cost = approximately $73.

API-bound output cost: 35% × $300 = $105.

Cache infrastructure cost: Redis t3.micro ~$15/month + vector DB (Qdrant self-hosted) ~$10/month = $25/month overhead.

Total after all three tiers: approximately $203/month — a 71% reduction from baseline.

With more aggressive exact-match hit rates (achieved with 80%+ repetition workloads), the combined reduction approaches 85–90%. The output-token cost — the largest single line item at $15/MTok — falls proportionally with hit rate, which is why Tier 1 and Tier 2 have such a large impact: they eliminate entire API calls, not just discount input tokens.

Scenario Monthly spend (USD) Reduction vs. baseline
Baseline (no caching) $697.50
Tier 3 only (prompt caching) ~$519 25%
Tier 1 + Tier 3 (exact + prompt) ~$380 45%
All three tiers (65% combined hit rate) ~$203 71%
All three tiers (80% hit rate — high-repetition workload) ~$85 88%

Tools: Which Gateway to Choose

Four tools dominate the LLM caching and gateway space in 2026. Each occupies a slightly different niche.

LiteLLM is the right starting point for the vast majority of teams. It provides an OpenAI-compatible proxy, built-in Redis exact-match caching, multi-provider support (Anthropic, OpenAI, Gemini, Mistral, Bedrock, Azure OpenAI, and more), cost tracking per call, and a self-hosted dashboard. The codebase is actively maintained and the documentation is production-quality. Use LiteLLM when you want caching, provider switching, and cost visibility in a single lightweight deployment.

GPTCache is not a standalone gateway — it is a caching library that adds semantic caching on top of exact-match. It integrates with LiteLLM (via a middleware hook) or directly with the OpenAI SDK. Its vector backend is configurable: Faiss for local development, Qdrant or Weaviate for production. Use GPTCache when your workload has paraphrased query patterns that exact-match caching will miss, and when you want control over the similarity threshold and embedding model used.

Kong AI Gateway is the enterprise choice for organisations that already run Kong as their API management layer. It adds LLM-specific plugins — semantic caching, rate limiting by model, cost tracking, PII scrubbing — on top of Kong's existing gateway infrastructure. If your engineering organisation has a Kong deployment and a platform team that manages it, the AI Gateway plugins are a natural extension. If you are starting from scratch, Kong's operational overhead is higher than LiteLLM for the same caching outcome.

Bifrost is the newest of the four, positioned as a unified LLM gateway with first-class cache TTL controls, per-request cache bypass, multi-provider failover, and a clean REST API. It is cloud-hosted with a self-hosted option, targets teams that want minimal ops overhead, and is worth evaluating for greenfield projects that do not want to manage a LiteLLM instance. The semantic caching integration is less mature than GPTCache at the time of writing, but Bifrost's provider coverage and failover logic are strong.

For most bootstrapped teams and early-stage products, the recommended stack is: LiteLLM + Redis (Tier 1) with GPTCache middleware (Tier 2) and Anthropic prompt caching enabled at the API call level (Tier 3). This combination requires no additional SaaS spend beyond a small Redis instance and minimal engineering overhead beyond the initial configuration.

Optimising LLM infrastructure costs? Find Builders who have done it.

AI Tech Connect's verified Builders include ML engineers and infrastructure leads who have shipped caching stacks in production. Browse profiles to find the right expertise for your team.

Browse Builders →

Pitfalls and How to Avoid Them

Cache staleness

The most dangerous failure mode in LLM caching is returning an outdated response as if it were current. If your product's underlying knowledge changes — pricing updates, policy changes, new product features — and cache entries from before the change are still being served, users receive incorrect information with no indication that it is stale. The fix is a cache invalidation strategy aligned with your data update frequency: a webhook from your CMS that flushes relevant Redis keys on content publish, a scheduled cache clear on a cadence matching your content update frequency, or a TTL set conservatively shorter than your update interval. Never set indefinite TTLs for content that can change.

Semantic threshold misconfiguration

As noted above, setting the semantic similarity threshold too low produces silent wrong answers. Set it too high and you forfeit most of the tier's benefit. The correct approach is empirical: build a labelled test set of query pairs — queries that should return the same answer and queries that should not — and measure precision and recall at different threshold values. A threshold that gives 95% precision on your test set is a safe starting point. Accept slightly lower recall (some valid cache hits missed) in exchange for near-zero false-positive rate (wrong answers served).

Embedding model mismatch

If you change your embedding model after deploying semantic caching — switching from text-embedding-3-small to a locally-hosted sentence-transformer, for example — existing cache embeddings are now in a different vector space from new query embeddings. Similarity scores become meaningless. Treat an embedding model change as a cache invalidation event: clear the vector store and rebuild. Document your embedding model version alongside your cache entries to make this manageable.

Cost of the caching layer itself

The caching stack is not free. Redis costs money. A vector database costs money. Embedding API calls cost money. For very low-volume products (under a few hundred API calls per day), these fixed costs can exceed the variable API cost savings. Model the break-even: if your vector DB embedding call costs $0.00002 per request and your semantic cache hit rate is 20%, the embedding overhead is $0.0001 per hit, which makes sense only if the avoided LLM call costs significantly more. At current Claude Sonnet pricing on a 3,000-token interaction, the avoided call costs roughly $0.015 — three orders of magnitude larger. The maths work at almost any scale above trivial.

Provider TTL cliff

Anthropic's five-minute TTL for prompt caching means that a user who pauses a conversation for six minutes will cause the next message to be served without the cached system prompt — at full input token cost. For products with predictable conversation patterns, this is manageable: active support conversations rarely have six-minute silences. For asynchronous products (email-style interactions, background agent tasks), the TTL may expire frequently, reducing Tier 3's effective saving. For these workloads, exact-match and semantic caching (Tiers 1 and 2) carry more weight; Tier 3 saves on system-prompt tokens only when interactions cluster within the TTL window.

For broader context on how inference costs and caching strategies interact at the model and hardware level, see our piece on Gemini's $0.025/MTok pricing and what it means for cost architecture and the detailed breakdown of NVIDIA B300 inference economics for teams considering self-hosted alternatives to managed APIs.

Dual-Market Notes: India and the UK

The caching stack described here is provider-agnostic and infrastructure-agnostic, but the economics land differently depending on where you are building.

For bootstrapped Indian teams, the primary benefit of aggressive caching is extending runway. At a conversion rate of approximately 84 INR to the dollar, a $500/month API bill is roughly INR 42,000 — a significant line item for a seed-stage product. Cutting it to INR 6,300 through a three-tier caching stack is the difference between burning through your budget in 18 months versus having the headroom to iterate to product-market fit. LiteLLM and GPTCache are both free and open-source; a Redis instance on AWS Mumbai runs under INR 1,700/month. The infrastructure cost is negligible relative to the saving.

For UK scale-ups, the benefit is a different kind: predictable cost at scale. Enterprise AI products with multi-year contracts need to model infrastructure costs at 10× and 100× current volume. A caching stack that keeps effective API call rate at 35% of gross usage creates a much flatter cost curve as volume grows. Combined with the data residency benefits of keeping cached responses within UK infrastructure (no personal data in the cache crosses borders), this is both a financial and a compliance argument for building the caching layer early.