For most of 2023 and 2024, the dominant concern for AI product builders was not features — it was the bill. A conversational product running GPT-4 class inference at meaningful scale could easily spend $50,000 a month on tokens before the product reached profitability. That reality shaped an entire generation of product decisions: shorter system prompts, aggressive caching, hybrid retrieval to avoid long contexts, and the constant hunt for cheaper model tiers.
That era is ending faster than most people realise. AI inference costs are falling at approximately 95% per year, a widely cited industry estimate based on tracked per-token price reductions across major providers since GPT-4's original pricing. At that rate, a workload that cost $1 in mid-2024 cost roughly $0.05 a year later and, compounded over two years, costs well under a cent today. That is a directional figure, not a precise audit, but the order-of-magnitude shift is real and changes the calculus for every product decision you are making right now.
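The relationship between an annual decline rate and a multi-year price change is easy to check. The sketch below is illustrative arithmetic only: the function names are our own, and the 95% input is the industry estimate quoted above, not a measured value.

```python
def cost_after(initial_cost: float, annual_decline: float, years: float) -> float:
    """Cost remaining after compounding a fractional annual decline."""
    return initial_cost * (1.0 - annual_decline) ** years


def implied_annual_decline(start_cost: float, end_cost: float, years: float) -> float:
    """Annual decline rate implied by a start and end cost over a period."""
    return 1.0 - (end_cost / start_cost) ** (1.0 / years)


if __name__ == "__main__":
    # A 95% annual decline takes $1.00 to about $0.05 after one year...
    print(f"after 1 year:  ${cost_after(1.00, 0.95, 1):.4f}")
    # ...and to about a quarter of a cent after two years.
    print(f"after 2 years: ${cost_after(1.00, 0.95, 2):.4f}")
    # Conversely, $1.00 -> $0.05 spread over two years implies roughly 78% per year.
    print(f"implied rate:  {implied_annual_decline(1.00, 0.05, 2):.1%}")
```

Running both directions is a useful sanity check whenever a headline rate and a before/after price are quoted together.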
The 95% Annual Decline: What Is Driving It
The cost collapse is not happening for a single reason. It is the product of at least four overlapping forces accelerating simultaneously.
Hardware generation leap. NVIDIA's Blackwell architecture — embodied in the B300 and its variants — delivers materially better inference throughput per dollar than the A100 generation that set the baseline for most 2023 pricing. Google's TPU 8i, announced at Google Cloud Next '26, delivers 80% better performance-per-dollar than the prior generation. Both improvements compound directly into lower cost per token for any workload running on these chips. For a detailed breakdown of NVIDIA's inference economics, see our earlier piece on the NVIDIA B300 inference economics.
Software-level efficiency. FP8 quantisation, speculative decoding, continuous batching, and flash attention have together cut the compute required for a given quality of output by half or more. These are not theoretical gains; they are production-deployed techniques on every major inference platform. Newer techniques such as KV-cache compression are still landing: Google's TurboQuant, reported to compress the KV cache by 6× with no measured accuracy loss, is a recent example covered in our Google TurboQuant deep dive.
Open-weight model proliferation. Llama 4, GLM-5.1, and the recently released gpt-oss-20b have created a viable tier of self-hosted open-weight inference that did not exist two years ago. Teams willing to operate their own inference stack — on cloud GPU spot instances or on-prem hardware — can achieve costs dramatically lower than any API endpoint. The trade-off is engineering overhead, but for high-volume workloads that overhead pays for itself quickly. See our open-weight models roundup for the current state of play.
Market competition. The AI inference chip market was valued at $13.7 billion in 2025 and is projected to reach $56.9 billion by 2035 (CAGR 15.3%, per MarketGenics Global Research, May 2026). That growth is drawing capital from every direction. Fleet Data Centers recently closed a $4.6 billion senior secured notes offering for Nevada AI infrastructure. BlackRock and MGX's consortium acquired Aligned Data Centers for $40 billion. General Compute's agent inference platform is launching on 15 May 2026. All of this new supply is driving spot GPU prices to historic lows, and the competition between providers is keeping them there.
Current GPU and Inference Cost Landscape
Below is the current spot and on-demand pricing landscape as of May 2026, sourced from competitive cloud providers (RunPod, JarvisLabs, GetDeploying). Prices vary significantly by provider and availability zone — treat these as representative best-available figures, not universal market rates. "Cost/MTok" figures assume FP8 precision and a 70B-parameter model class with continuous batching enabled. Actual cost varies by model size, batch fill rate, and provider.
| GPU | Price/hr (spot) | Price/hr (on-demand) | Est. Cost/MTok (FP8) | Best for |
|---|---|---|---|---|
| L40S | $0.72 | $1.10 | $0.09 (batch) / $0.23 (OD) | Cost-optimised inference, open-weight models |
| H100 SXM | $1.85 | $2.79 | $0.18 (batch) / $0.40 (OD) | Frontier model hosting, low-latency paths |
| A100 80GB | $1.20 | $1.80 | $0.28 (batch) / $0.55 (OD) | Legacy workloads, good spot availability |
| Google TPU 8i (Cloud) | N/A (reserved) | ~$1.40 equiv. | ~$0.15 (batch) | JAX-native workloads, Google Cloud commitment |
| AWS inf2 (Inferentia) | N/A | $0.76–$1.97 | $0.20–$0.35 | Steady-state AWS-native inference, Neuron SDK |
The L40S spot tier — $0.72/hr with FP8 batching — is currently the best cost-per-token option for teams comfortable with occasional spot preemptions. For production workloads that cannot tolerate interruption, on-demand L40S at $1.10/hr with a well-tuned batching configuration still beats H100 SXM on-demand for most model sizes below 70B.
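The link between an hourly GPU price and a cost-per-MTok figure is just sustained throughput. The sketch below shows the conversion in both directions. It assumes the table's "batch" figures pair with the spot hourly price, which the table does not state explicitly, and the function names are our own.

```python
def cost_per_mtok(price_per_hour: float, tokens_per_second: float) -> float:
    """Dollars per million tokens for a GPU at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return price_per_hour / tokens_per_hour * 1_000_000


def implied_tokens_per_second(price_per_hour: float, dollars_per_mtok: float) -> float:
    """Sustained throughput that a quoted $/MTok figure implies at an hourly price."""
    mtok_per_hour = price_per_hour / dollars_per_mtok
    return mtok_per_hour * 1_000_000 / 3600


if __name__ == "__main__":
    # L40S spot at $0.72/hr and $0.09/MTok implies 8 MTok per hour.
    tps = implied_tokens_per_second(0.72, 0.09)
    print(f"implied throughput: {tps:,.0f} tokens/s")
    # Feeding that throughput back in recovers the quoted cost.
    print(f"round trip: ${cost_per_mtok(0.72, tps):.2f}/MTok")
```

If your own measured throughput falls well short of the implied figure, your real cost per MTok will be proportionally higher than the table suggests.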
Do not benchmark GPU options on throughput alone. Benchmark them on cost-per-correct-output for your specific task. An L40S with a carefully quantised 34B model often produces better cost-per-output than an H100 running a 70B in BF16 — and the quality gap narrows to near-zero for most product use cases. Run your eval suite on both before committing.
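One way to make cost-per-correct-output concrete is to divide the cost of each generated answer by the fraction your eval suite judges correct. The sketch below is a minimal illustration; the class name, the configurations, and every number in the usage example are hypothetical placeholders, not benchmark results.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    name: str
    cost_per_output: float  # dollars per generated answer
    accuracy: float         # fraction of eval items judged correct (0 to 1)

    def cost_per_correct_output(self) -> float:
        """Dollars spent, on average, to obtain one correct answer."""
        if self.accuracy <= 0:
            raise ValueError("accuracy must be positive")
        return self.cost_per_output / self.accuracy


if __name__ == "__main__":
    # Hypothetical figures for illustration only; substitute your own eval results.
    candidates = [
        EvalResult("L40S + 34B quantised", cost_per_output=0.00020, accuracy=0.90),
        EvalResult("H100 + 70B BF16", cost_per_output=0.00045, accuracy=0.93),
    ]
    for c in sorted(candidates, key=lambda c: c.cost_per_correct_output()):
        print(f"{c.name}: ${c.cost_per_correct_output():.6f} per correct output")
```

Ranking by this metric rather than raw throughput makes the quality trade-off explicit: a cheaper configuration only wins if its accuracy holds up on your task.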
Unit Economics 101: Calculating Cost Per User Interaction
Theory is useful. Arithmetic is more useful. Here is the calculation chain that matters for AI product profitability.
Step 1: Measure your average interaction size. For a typical conversational AI product, an interaction comprises a system prompt (500–2,000 tokens), the conversation history included in context (0–3,000 tokens), the user's message (50–500 tokens), and the model's response (200–1,000 tokens). A reasonable baseline for a mid-complexity product is 3,000 tokens per interaction: roughly 2,200 in and 800 out.
Step 2: Apply your cost rate. At $0.09/MTok batch for input and $0.18/MTok for output (output tokens cost more because they are generated one at a time, autoregressively, while input tokens are processed in parallel):
- Input cost: 2,200 tokens × $0.09 / 1,000,000 = $0.000198
- Output cost: 800 tokens × $0.18 / 1,000,000 = $0.000144
- Raw inference cost per interaction: $0.000342 (approx. $0.00034)
Step 3: Add overhead. In practice, add 20–30% for retries, structured output parsing overhead, embedding calls, and system prompt overhead that grows with feature additions. Call it $0.00044 per interaction fully loaded.
Step 4: Map to your pricing model. At that cost, 1,000 interactions cost $0.44. A user who has 50 interactions per month costs you $0.022 in inference. At a $5/month subscription price, that is roughly a 225:1 revenue-to-inference-cost ratio. Even at $2/month, it is 90:1, far healthier than the ratios teams were working with in 2024.
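The four steps above can be sketched as a small calculator. The token counts and prices are the article's baseline; the 30% overhead is an assumption taken from the top of the 20–30% range, and the class and function names are our own.

```python
from dataclasses import dataclass


@dataclass
class InteractionProfile:
    input_tokens: int = 2_200
    output_tokens: int = 800
    input_price_per_mtok: float = 0.09   # dollars per million input tokens
    output_price_per_mtok: float = 0.18  # dollars per million output tokens
    overhead: float = 0.30               # assumed: retries, parsing, embeddings

    def raw_cost(self) -> float:
        """Steps 1 and 2: raw inference cost of one interaction in dollars."""
        return (
            self.input_tokens * self.input_price_per_mtok
            + self.output_tokens * self.output_price_per_mtok
        ) / 1_000_000

    def loaded_cost(self) -> float:
        """Step 3: raw cost plus the overhead fraction."""
        return self.raw_cost() * (1 + self.overhead)


def revenue_to_cost_ratio(
    monthly_price: float, profile: InteractionProfile, interactions_per_month: int
) -> float:
    """Step 4: subscription revenue divided by monthly inference cost per user."""
    return monthly_price / (profile.loaded_cost() * interactions_per_month)


if __name__ == "__main__":
    p = InteractionProfile()
    print(f"raw cost per interaction:    ${p.raw_cost():.6f}")
    print(f"loaded cost per interaction: ${p.loaded_cost():.6f}")
    for price in (5.0, 2.0):
        ratio = revenue_to_cost_ratio(price, p, 50)
        print(f"${price:.0f}/month at 50 interactions: {ratio:.0f}:1")
```

Swapping in your own measured token counts and contracted prices is the quickest way to see whether the ratios hold for your product.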
The picture changes for high-volume, low-ARPU products. If you are building a free tool funded by advertising, or a B2C product with an annual plan under $20, you need to think carefully about interaction depth and model tier selection. But for most B2B SaaS products with monthly subscription pricing above $30/seat, inference cost is no longer your binding constraint — it is product-market fit.
These economics assume good batching discipline. If your infrastructure sends each user request as an isolated call — no batching, no prompt caching, cold system prompts on every turn — your real cost may be 5–10× higher than the table above suggests. Most of the "AI is too expensive" complaints in 2024 came from teams doing exactly this. Batching and caching are not optional optimisations; they are table stakes for cost-positive products.
Architecture Choices That Affect Your Bill
The GPU price is only one input. Your architecture determines how efficiently you use it. Four choices dominate your inference bill:
Batching. Continuous batching, in which the server admits new requests into the running batch at every decoding step instead of waiting for the whole batch to finish, is the single highest-leverage optimisation available. With good batch fill rates (above 60%), you can achieve 3–5× better token throughput per dollar versus serving requests one at a time. vLLM, TGI, and SGLang all implement continuous batching. If you are self-hosting, you should be using one of these. If you are using a managed API, choose providers that explicitly state they batch, and avoid those that do not.
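To see why continuous batching helps, consider a toy model that only counts decode steps. The sketch below is not the scheduler used by vLLM, TGI, or SGLang; it ignores prefill, memory limits, and per-step latency, and the request lengths are hypothetical. It compares a static batch, which runs until its longest request finishes, with a scheduler that refills freed slots on the next step.

```python
from collections import deque


def static_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Decode steps when each fixed batch runs until its longest request ends."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Decode steps when a freed slot is refilled on the very next step."""
    waiting = deque(lengths)  # output tokens still owed to each queued request
    running: list[int] = []
    steps = 0
    while waiting or running:
        # Admit queued requests into any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One forward pass emits one token for every running request.
        steps += 1
        # Requests that just emitted their last token leave the batch.
        running = [r - 1 for r in running if r > 1]
    return steps


if __name__ == "__main__":
    # Hypothetical workload: two long responses mixed with short ones.
    lengths = [8, 2, 2, 2, 8, 2, 2, 2]
    print("static:    ", static_batching_steps(lengths, batch_size=4))
    print("continuous:", continuous_batching_steps(lengths, batch_size=4))
```

In this toy workload the static scheduler spends most of its steps with three of four slots idle, which is the waste that a high batch fill rate removes.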
Quantisation. FP8 inference halves the weight footprint relative to BF16, which cuts the memory bandwidth needed per token and frees GPU VRAM for larger batches and KV cache, increasing throughput without meaningful quality degradation on most production tasks. INT4 goes further but introduces quality loss that is task-dependent, so benchmark before deploying. AWQ and GPTQ are the dominant low-bit quantisation methods for open-weight models; most popular models are available pre-quantised on Hugging Face.
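The memory effect of precision is simple arithmetic on bytes per parameter. The sketch below covers weights only; it deliberately ignores KV cache, activations, and the small overhead of quantisation scales, so treat the results as lower bounds. The function name and the precision table are our own.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate weight footprint in gigabytes (10^9 bytes), weights only."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    for precision in ("bf16", "fp8", "int4"):
        gb = weight_memory_gb(70e9, precision)
        print(f"70B @ {precision}: {gb:.0f} GB")
```

A 70B model that needs two 80 GB cards for BF16 weights alone fits on one at FP8, which is where much of the per-token saving comes from.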
KV-cache and prompt caching. If your product has a consistent system prompt (a persona, a knowledge base preamble, a set of tool definitions), caching that context dramatically reduces the compute per interaction. Anthropic's prompt cache, for example, prices reads of cached input tokens at $0.30/MTok versus $3/MTok uncached, a tenfold reduction. Google's TurboQuant approach compresses the KV cache itself by 6×, extending how much context you can keep hot in memory. Architectural choices that make your system prompt stable and reusable pay dividends at scale.
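The effect of prompt caching on input cost can be estimated with a two-rate calculation. The sketch below uses the $0.30 and $3 per MTok figures quoted above. It assumes every cached token is a cache hit and ignores any cache-write surcharge or expiry, and the 2,000/200 token split is a hypothetical example.

```python
def input_cost(
    cached_tokens: int,
    fresh_tokens: int,
    cached_price_per_mtok: float = 0.30,
    base_price_per_mtok: float = 3.00,
) -> float:
    """Input cost in dollars when part of the prompt is served from cache."""
    return (
        cached_tokens * cached_price_per_mtok
        + fresh_tokens * base_price_per_mtok
    ) / 1_000_000


if __name__ == "__main__":
    # Hypothetical turn: 2,000-token stable system prompt plus 200 new tokens.
    with_cache = input_cost(cached_tokens=2_000, fresh_tokens=200)
    without_cache = input_cost(cached_tokens=0, fresh_tokens=2_200)
    saving = 1 - with_cache / without_cache
    print(f"with cache:    ${with_cache:.4f}")
    print(f"without cache: ${without_cache:.4f}")
    print(f"input saving:  {saving:.0%}")
```

The saving scales with the share of the prompt that is stable, which is why moving volatile content to the end of the prompt matters.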
Model routing. Not every query needs a 70B parameter model. Routing simpler, shorter queries to a 7B or 13B model — and reserving the large model for complex reasoning tasks — can cut your blended inference cost by 40–60% without measurable quality loss on the easy queries. Tools like Martian, RouteLLM, and custom classifier-based routers make this tractable to implement. The key is building the eval infrastructure to know which query classes benefit from the larger model.
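A crude binary router and the blended-cost arithmetic behind it can be sketched in a few lines. Everything here is hypothetical: the keyword heuristic stands in for a trained classifier, the prices are placeholders, and none of it reflects how Martian or RouteLLM work internally.

```python
REASONING_CUES = ("why", "explain", "compare", "step by step", "analyse", "analyze")


def route(query: str, max_simple_words: int = 30) -> str:
    """Hypothetical router: short queries without reasoning cues go to 'small'."""
    text = query.lower()
    is_long = len(text.split()) > max_simple_words
    needs_reasoning = any(cue in text for cue in REASONING_CUES)
    return "large" if is_long or needs_reasoning else "small"


def blended_cost_per_mtok(
    small_share: float, small_price: float, large_price: float
) -> float:
    """Average $/MTok when a share of traffic is served by the small model."""
    return small_share * small_price + (1 - small_share) * large_price


if __name__ == "__main__":
    print(route("What is your refund policy?"))
    print(route("Explain why my invoice differs from last month"))
    # Placeholder prices: $0.03/MTok small, $0.18/MTok large, 60% routed small.
    blended = blended_cost_per_mtok(0.6, small_price=0.03, large_price=0.18)
    print(f"blended: ${blended:.2f}/MTok, saving {1 - blended / 0.18:.0%}")
```

The heuristic is deliberately naive. In practice the routing decision should be validated against your eval suite, so that the savings do not come at the cost of quality on queries that needed the larger model.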
For a concrete example of how these choices interact in one model family's pricing, see our analysis of DeepSeek V4 Flash and Pro frontier cost.
India-Specific: Sovereign Inference Options
For Indian teams, inference cost is only half the consideration. Data residency is increasingly the other half, as the DPDP Act's phase two requirements take shape. Running inference on US-region infrastructure for Indian user data introduces compliance exposure that is manageable today but may become untenable as enforcement matures.
Neysa Cloud has emerged as the most credible sovereign GPU cloud for Indian inference workloads. Their GPU pricing is competitive with AWS Mumbai and Azure India Central — roughly at parity for L40S instances — with the added advantage of data residency within India and localised support. For bootstrapped teams, Neysa's spot-equivalent pricing on H100s has come in below AWS Mumbai in direct comparisons reported by early adopters.
AWS Mumbai (ap-south-1) remains the default for teams already in the AWS ecosystem. inf2 (Inferentia 2) instances offer cost-efficient inference for steady-state workloads compiled for the Neuron SDK. The engineering investment to port to Neuron is non-trivial, but for high-volume, fixed-model workloads, the economics are attractive. GPU instances (g5, p4d) are available but spot availability in Mumbai has historically been tighter than US regions.
C-DAC and National AI compute. India's National AI Mission is building out sovereign compute infrastructure under the C-DAC banner. Access is currently prioritised for academic and research institutions, but commercial access pathways are being established for 2026–2027. For startups in regulated sectors (healthcare, fintech, government tech), watching this programme and positioning for early access is worth the effort.
Google Cloud Mumbai / Hyderabad. Google Cloud's TPU 8i is not yet available in Indian regions, but GPU instances (A100, L4) are, and Google's announced commitment to Indian AI infrastructure investment suggests TPU availability in Indian regions within 12–18 months. For teams building on Gemini APIs, latency routing to the nearest region is already partly managed automatically.
"We migrated our document processing pipeline from AWS us-east-1 to Neysa last quarter. Token costs were within 8% of what we were paying on AWS, but we cut our legal review cycle for enterprise clients in half because we could credibly say user data never left India. That compliance story unlocked two enterprise deals that had stalled on data residency questions."
Karthik R., Co-founder, legal-tech startup · Bengaluru, IN
UK-Specific: GDPR-Compliant Inference and Where to Run It
For UK teams, the constraint is different but equally real. Post-Brexit, the UK GDPR and the Data Protection Act 2018 govern how personal data processed through inference must be handled. The core question: if user queries contain personal data — names, contact details, health information, financial data — where can that inference legitimately run?
UK and EU data residency is the safest path. Azure UK South (London), AWS eu-west-2 (London), and Google Cloud europe-west2 (London) all provide inference infrastructure with data residency in the UK. Pricing in UK regions typically runs 10–20% above equivalent US-region pricing, but for enterprise products where a data processing agreement (DPA) is a prerequisite for the sale, this premium is trivially justified.
Standard Contractual Clauses for US inference. Many UK SMBs and startups currently run inference on US-region APIs (OpenAI, Anthropic, etc.) under SCCs. This is legally viable but is coming under increasing scrutiny in regulated sectors. If your product touches healthcare, financial services, or legal data, assume that your enterprise customers' legal teams will eventually require EU/UK residency — and design your inference architecture to support it from day one.
Self-hosted open-weight models on UK-region GPU instances are increasingly the answer for UK enterprise AI teams. Running Llama 4 or a fine-tuned Mistral variant on spot H100 instances in eu-west-2 gives you both data residency and cost control. Our Llama 4 HuggingFace deployment guide covers the operational specifics for a production self-hosted stack.
Purpose limitation and inference logs. The UK GDPR's purpose limitation and storage limitation principles have implications beyond data residency. Inference logs containing user queries may constitute personal data. UK teams should review their log retention policies and ensure that inference providers' log retention settings align with their stated privacy policies. Most major inference APIs now offer log-free or zero-retention modes; use them for user-facing products.
What to Do Now: A Profitability Checklist
Whether you are a two-person bootstrapped team in Bengaluru or a twelve-person AI squad inside a Manchester enterprise, the actions that move the needle are the same:
- Measure your current cost per interaction. Instrument your inference calls to log token counts and costs. If you do not know your current cost per interaction, you cannot improve it. Most teams are surprised — in both directions — by their first measurement.
- Enable batching and prompt caching immediately. If you are using a managed API, turn on prompt caching. If you are self-hosting, verify that continuous batching is active in your serving framework. These two changes alone typically cut costs by 30–60%.
- Benchmark open-weight alternatives on your eval suite. Pick your top three use cases. Run Llama 4 Scout or Mistral Devstral 2 on them. Measure quality delta versus your current model. For many tasks, the quality gap is smaller than you expect, and the cost gap is large. See the open-weight models roundup for current model quality benchmarks.
- Implement model routing for query complexity. Classify your incoming queries by complexity — simple lookup, moderate reasoning, complex multi-step. Route simple queries to a smaller, cheaper model. Even a crude binary router (small/large) can cut blended inference costs by 30–50%.
- Apply FP8 quantisation to self-hosted models. If you are running open-weight models, ensure you are using FP8 or INT8 precision. The quality delta from BF16 is minimal for most production tasks; the cost saving is significant.
- Model your unit economics at 10× and 100× current volume. Build a simple spreadsheet: tokens per interaction × cost per token × interactions per user per month × users. Run it at current scale, at 10× scale, and at 100× scale. Identify at which scale inference becomes the dominant cost, and plan your architecture evolution accordingly.
- Address data residency now, not later. For Indian teams: check whether your user data can credibly stay in India on current infrastructure. For UK teams: verify your inference provider's data processing agreement covers your use case. Both are easier to fix before your first enterprise sale than after.
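The measurement and scale-modelling items in the checklist can be sketched together: log token counts per call, derive a mean cost per interaction, then project it across user counts. The class and function names are our own, the prices are the article's baseline, and the user counts are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class UsageTracker:
    input_price_per_mtok: float
    output_price_per_mtok: float
    records: list = field(default_factory=list)

    def log(self, input_tokens: int, output_tokens: int) -> float:
        """Record one inference call and return its cost in dollars."""
        cost = (
            input_tokens * self.input_price_per_mtok
            + output_tokens * self.output_price_per_mtok
        ) / 1_000_000
        self.records.append((input_tokens, output_tokens, cost))
        return cost

    def mean_cost_per_interaction(self) -> float:
        """Average cost across all logged calls."""
        if not self.records:
            raise ValueError("no interactions logged yet")
        return sum(cost for _, _, cost in self.records) / len(self.records)


def monthly_inference_cost(
    cost_per_interaction: float, interactions_per_user: int, users: int
) -> float:
    """Cost per interaction x interactions per user per month x users."""
    return cost_per_interaction * interactions_per_user * users


if __name__ == "__main__":
    tracker = UsageTracker(input_price_per_mtok=0.09, output_price_per_mtok=0.18)
    tracker.log(input_tokens=2_200, output_tokens=800)
    mean_cost = tracker.mean_cost_per_interaction()
    # Hypothetical user counts: current scale, 10x, and 100x.
    for users in (1_000, 10_000, 100_000):
        total = monthly_inference_cost(mean_cost, 50, users)
        print(f"{users:>7,} users: ${total:,.2f}/month")
```

In production the `log` call would wrap your actual inference client and read token counts from the provider's usage metadata instead of taking them as arguments.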
The window for building AI products with sustainable unit economics is open right now and is widening. The AI inference chip market's explosive growth — from $13.7 billion in 2025 to a projected $56.9 billion by 2035 — means that supply will keep increasing, prices will keep falling, and the advantage will accrue to teams who have built the infrastructure discipline to capture those savings automatically as they land.
Build for profitability now. The infrastructure is ready for it.