What you need to know
- Cheaper tokens, bigger bills. Per-token inference costs have fallen roughly 1,000x over three years for comparable capability (a widely reported estimate). Usage-based pricing means the savings rarely reach the invoice — demand grows to fill the headroom.
- Inference is the spend. It reportedly consumes around 80–90% of total compute dollars over a model's production life; one estimate puts inference at roughly $15–20 billion for every $1 billion spent on training.
- Utilisation is the elephant. A widely cited figure puts average GPU utilisation near 5%, framing a roughly $401 billion idle-infrastructure problem. Most deployed capacity simply sits there, metered and idle.
- Hardware keeps moving. New A5X bare-metal instances on NVIDIA Vera Rubin NVL72 rack-scale systems are claimed to deliver up to 10x lower inference cost per token and 10x higher throughput per megawatt versus the prior generation, with commercial Vera Rubin shipments expected in the second half of 2026.
Before you touch a single model parameter, instrument spend by task. Tag every request with a task label and log input tokens, output tokens, retries and tool calls. Within a week you will find that a small number of task types drive most of your bill — and those are the only ones worth optimising hard.
Measure cost-per-task, not cost-per-token
Per-token pricing is the number on the rate card, but it is the wrong unit for a FinOps decision. A production task is rarely one prompt and one completion. It is a system prompt, retrieved context, the model's reasoning, one or more tool calls, occasional retries on malformed output, and sometimes a second model pass to grade the first. Optimise the token and you save a rounding error; optimise the task and you save real money.
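One way to make that concrete is to log every model call against the task it served and then average per task, not per call. The sketch below is a minimal illustration, not a production ledger: the `CallRecord` fields and task labels are hypothetical, and the rates are the illustrative $5 / $15 per million tokens used in the worked example later in this article.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List

# Illustrative rates in dollars per million tokens (assumptions, not a quote).
INPUT_RATE = 5.0
OUTPUT_RATE = 15.0


@dataclass
class CallRecord:
    """One model call made while serving a task, retries included."""
    task_type: str          # hypothetical label, e.g. "support_ticket"
    task_id: str            # groups every call that belongs to one task
    input_tokens: int
    output_tokens: int
    is_retry: bool = False
    tool_calls: int = 0


def call_cost(rec: CallRecord) -> float:
    """Dollar cost of a single call at the illustrative rates."""
    return (rec.input_tokens * INPUT_RATE + rec.output_tokens * OUTPUT_RATE) / 1_000_000


def cost_per_task(records: List[CallRecord]) -> Dict[str, float]:
    """Average end-to-end cost per task, grouped by task type."""
    spend = defaultdict(float)   # total dollars per task type
    tasks = defaultdict(set)     # distinct task ids per task type
    for rec in records:
        spend[rec.task_type] += call_cost(rec)
        tasks[rec.task_type].add(rec.task_id)
    return {t: spend[t] / len(tasks[t]) for t in spend}


records = [
    CallRecord("support_ticket", "t1", 9_000, 700, tool_calls=1),
    CallRecord("support_ticket", "t1", 9_000, 700, is_retry=True),  # failed validation
    CallRecord("support_ticket", "t2", 9_000, 700, tool_calls=1),
    CallRecord("classification", "c1", 500, 10),
]
averages = cost_per_task(records)
for task_type, dollars in sorted(averages.items(), key=lambda kv: -kv[1]):
    print(f"{task_type}: ${dollars:.5f} per task")
```

Because the retry is charged to ticket `t1` instead of being counted as a separate task, the average support ticket comes out dearer than a single call would suggest.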
Consider an Indian SaaS team in Bengaluru running a customer-support assistant. The headline rate looks trivially cheap. But each resolved ticket triggers a retrieval call, a reasoning pass, a tool call to the order system, and, for roughly one in eight tickets, a retry because the first structured response failed validation. Counted end to end, the cost per resolved ticket is several times the figure the team first derived from the per-token rate and quoted to its board. The lesson is not that inference is expensive; it is that the unit of measurement was wrong.
A UK fintech in London hits the same wall from a different angle. Its fraud-triage agent runs a multi-step loop — classify, enrich, re-classify, escalate — and each step is a separate model call. Cost-per-token is flat across the loop; cost-per-decision varies tenfold depending on how many loop iterations a case needs. Once the team charted cost-per-decision by case complexity, the optimisation became obvious: cap the loop, and route simple cases to a cheaper model entirely.
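The effect of capping the loop can be sketched in a few lines. This is an illustration under stated assumptions: the flat `COST_PER_STEP` and the cap of four iterations are hypothetical values, and a capped case is treated as handed off for escalation rather than looped further.

```python
from typing import Optional, Tuple

COST_PER_STEP = 0.02  # hypothetical dollars per model call in the loop


def cost_per_decision(iterations_needed: int,
                      max_iterations: Optional[int] = None) -> Tuple[float, bool]:
    """Return (model cost, escalated?) for one case.

    Without a cap the case runs as many iterations as it needs.
    With a cap, the loop stops at max_iterations and the case is escalated.
    """
    if max_iterations is None or iterations_needed <= max_iterations:
        return iterations_needed * COST_PER_STEP, False
    return max_iterations * COST_PER_STEP, True


simple_case = cost_per_decision(1)                      # one pass through the loop
hard_case = cost_per_decision(10)                       # uncapped: ten times the spend
capped_case = cost_per_decision(10, max_iterations=4)   # stops early and escalates
print(simple_case, hard_case, capped_case)
```

The per-step price never changes; only the iteration count does, which is why the per-token view hides the tenfold spread.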
The demand-elasticity trap is real. Cheaper tokens do not automatically mean cheaper bills. When you cut unit cost, internal teams ship more AI features, raise context windows, and add agentic loops — and total usage can grow faster than price falls. Forecast demand alongside price, and put soft caps and per-team budgets in place before you celebrate a rate cut.
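The arithmetic behind the trap is simple enough to check in code. The figures below are invented for illustration (a price that halves while usage triples, and two hypothetical team budgets); the point is only that the bill is price times volume, and that a per-team cap is a cheap tripwire.

```python
from typing import Dict, List


def monthly_bill(price_per_million: float, million_tokens: float) -> float:
    """Bill in dollars: unit price times volume."""
    return price_per_million * million_tokens


def teams_over_budget(spend: Dict[str, float], budget: Dict[str, float]) -> List[str]:
    """Names of teams whose spend exceeds their soft cap (no budget means a cap of 0)."""
    return sorted(team for team, dollars in spend.items() if dollars > budget.get(team, 0.0))


old_bill = monthly_bill(10.0, 1_000)   # hypothetical: $10 per million, 1,000M tokens
new_bill = monthly_bill(5.0, 3_000)    # price halves, usage triples

flagged = teams_over_budget(
    spend={"support": 9_000.0, "search": 6_000.0},    # hypothetical team spend
    budget={"support": 8_000.0, "search": 7_000.0},   # hypothetical soft caps
)
print(old_bill, new_bill, flagged)
```

Here a 50% rate cut still produces a bill half as large again, because volume grew faster than the price fell.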
Fix utilisation before you tune anything
The biggest line item in most self-hosted inference setups is not the model — it is the GPU you are paying for while it sits idle. With average utilisation cited around 5%, the implication is blunt: nineteen out of twenty rupees or pence spent on capacity may be buying nothing. You generally pay for the GPU-hour whether the card is saturated or asleep, so the first and cheapest win is to keep it busy.
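A quick way to see what idle capacity costs is to divide the hourly rate by utilisation, which gives the price of one hour of useful work. The sketch below assumes a $2.00 GPU-hour (the low end of the H100 figure in the ladder further down) and treats utilisation as a simple fraction of billed time.

```python
def cost_per_useful_gpu_hour(hourly_rate: float, utilisation: float) -> float:
    """Effective dollars per hour of actual work at a given utilisation (0 to 1]."""
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in the range (0, 1]")
    return hourly_rate / utilisation


at_5_percent = cost_per_useful_gpu_hour(2.00, 0.05)    # about $40 per useful hour
at_50_percent = cost_per_useful_gpu_hour(2.00, 0.50)   # about $4 per useful hour
print(f"5% utilised:  ${at_5_percent:.2f} per useful GPU-hour")
print(f"50% utilised: ${at_50_percent:.2f} per useful GPU-hour")
```

Lifting utilisation from 5% to 50% cuts the effective rate tenfold without changing the rate card at all.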
Three moves do most of the work. First, batch — group concurrent requests so each forward pass serves many users; continuous (in-flight) batching on a modern serving stack can lift throughput several-fold without new hardware. Second, consolidate replicas — teams routinely run one model per service when a shared, well-batched endpoint would serve them all. Third, scale to demand — autoscale replicas down off-peak and lean on the time-zone reality that an Indian team's quiet hours are a UK team's busy ones, so a follow-the-sun deployment can share capacity across both markets.
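The follow-the-sun argument can be sanity-checked by comparing the replicas needed for two separate regional peaks with the replicas needed for the peak of their combined demand. Everything in this sketch is an assumption made for illustration: the demand curves, the hours, and the 20 requests per second that one well-batched replica is taken to serve.

```python
import math
from typing import List

RPS_PER_REPLICA = 20  # hypothetical throughput of one well-batched replica


def india_rps(hour_utc: int) -> int:
    """Hypothetical Indian demand, in requests per second."""
    if 4 <= hour_utc < 9:
        return 80   # working-day peak
    if 9 <= hour_utc < 12:
        return 40   # afternoon shoulder, overlapping the UK morning
    return 10       # overnight trickle


def uk_rps(hour_utc: int) -> int:
    """Hypothetical UK demand, in requests per second."""
    if 12 <= hour_utc < 17:
        return 60   # working-day peak
    if 9 <= hour_utc < 12:
        return 30   # morning ramp
    return 10       # overnight trickle


def replicas_for_peak(hourly_rps: List[int]) -> int:
    """Replicas needed to serve the busiest hour."""
    return math.ceil(max(hourly_rps) / RPS_PER_REPLICA)


india = [india_rps(h) for h in range(24)]
uk = [uk_rps(h) for h in range(24)]
combined = [a + b for a, b in zip(india, uk)]

separate = replicas_for_peak(india) + replicas_for_peak(uk)
shared = replicas_for_peak(combined)
print(f"separate deployments: {separate} replicas, shared: {shared} replicas")
```

The saving depends entirely on how little the two peaks overlap, so it is worth running this against real traffic logs before committing to a shared deployment.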
The GPU price ladder
Right-sizing the hardware is the next lever. You do not need flagship silicon to serve a quantised mid-size model, and you should not pay flagship rates for batch jobs that can tolerate interruption. Use this rough ladder as orientation — actual rates vary by provider, region and commitment, so treat every figure as an estimate to verify, not a quote.
| Tier | Typical purchase mode | Indicative $/GPU-hr | Best for |
|---|---|---|---|
| H100 SXM (specialised AI clouds) | On-demand | ~$2.00 and up | Steady mid-to-large serving baseline |
| Blackwell B300 (early cloud) | Spot / interruptible | ~$2.90 | Batch jobs, offline evals, fine-tuning |
| Blackwell B300 (early cloud) | On-demand range | ~$4.95–$18.00 | Latency-critical flagship inference |
| Vera Rubin NVL72 (A5X, H2 2026) | Bare-metal / rack-scale | Claimed up to 10x lower cost-per-token | Very high-throughput production at scale |
The spread is the point. A batch re-embedding job that runs overnight has no business on an $18/hr on-demand flagship when interruptible Blackwell capacity is roughly $2.90/hr and the job does not care if it pauses. Conversely, a customer-facing endpoint with a tight latency budget should not be on spot at all — an eviction mid-request is a failed user interaction, not a cost saving.
Blend three purchase modes. Size a committed-use baseline to your floor of steady traffic for the lowest hourly rate; burst into spot/interruptible for anything that can retry or wait; and keep a thin on-demand layer for unpredictable spikes. The blend, not any single mode, is what gets the average cost down.
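A rough blended-rate calculation shows why the mix matters. In the sketch below the spot and on-demand rates are taken from the Blackwell rows of the ladder above, while the committed-use rate and the monthly GPU-hour split are hypothetical placeholders to be replaced with your own figures.

```python
from typing import Dict, Tuple

RATES = {
    "committed": 2.50,   # hypothetical committed-use rate, dollars per GPU-hour
    "spot": 2.90,        # interruptible rate from the ladder
    "on_demand": 4.95,   # low end of the on-demand range from the ladder
}


def blended_cost(gpu_hours: Dict[str, float]) -> Tuple[float, float]:
    """Return (total dollars, average dollars per GPU-hour) for a purchase mix."""
    total = sum(hours * RATES[mode] for mode, hours in gpu_hours.items())
    return total, total / sum(gpu_hours.values())


# Hypothetical month: 10 committed GPUs running all 720 hours, plus bursts.
mix = {"committed": 7_200, "spot": 2_000, "on_demand": 300}
total, average = blended_cost(mix)
all_on_demand = sum(mix.values()) * RATES["on_demand"]
print(f"blended: ${total:,.0f} (${average:.2f}/GPU-hr) vs all on-demand: ${all_on_demand:,.0f}")
```

Committed capacity is billed whether or not it is used, so the baseline should be sized to the traffic floor; an oversized commitment simply recreates the idle-GPU problem at a lower rate.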
The software levers: caching, quantisation, right-sizing, routing
Once the hardware is right-sized and busy, four software levers compound on top.
Caching
Caching is the highest-return, lowest-risk lever in the box. Prompt-prefix caching means a stable system prompt and retrieved context are charged once and reused across turns, often at a fraction of fresh-input pricing. Response caching for deterministic or near-deterministic queries — FAQ-style lookups, repeated classifications — can take whole categories of request off the model entirely. For the Bengaluru SaaS team, caching the shared system prompt and product knowledge across a session cut input-token spend on the support assistant by a large margin, because the same 8,000-token preamble was no longer re-billed on every turn.
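Response caching for repeated queries can be sketched as a lookup keyed on the task type and a normalised prompt. This is a minimal in-memory illustration, not a production cache: the `ResponseCache` class and the `fake_model` stand-in are hypothetical, there is no expiry or size limit, and the normalisation (lower-casing and collapsing whitespace) is an assumption that only suits queries where such differences do not change the answer.

```python
import hashlib
from typing import Callable, Dict


class ResponseCache:
    """In-memory response cache keyed on task type plus normalised prompt."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(task_type: str, prompt: str) -> str:
        normalised = " ".join(prompt.lower().split())
        return hashlib.sha256(f"{task_type}\n{normalised}".encode("utf-8")).hexdigest()

    def get_or_call(self, task_type: str, prompt: str,
                    call_model: Callable[[str], str]) -> str:
        """Return the cached answer, or call the model once and store the result."""
        key = self._key(task_type, prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        answer = call_model(prompt)
        self._store[key] = answer
        return answer


model_calls = []


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call; records each invocation."""
    model_calls.append(prompt)
    return "Refunds are processed within five working days."


cache = ResponseCache()
first = cache.get_or_call("faq", "How long do refunds take?", fake_model)
second = cache.get_or_call("faq", "  how long do REFUNDS take? ", fake_model)
print(cache.hits, cache.misses, len(model_calls))
```

The second, differently formatted question is served from the cache, so the model is billed once rather than twice.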
Quantisation
Quantising a model from 16-bit to 8-bit, or to 4-bit for the right workloads, shrinks the memory footprint and lets the same model serve on cheaper, smaller GPUs — sometimes collapsing a multi-GPU deployment onto a single card. The trade-off is a measurable but often acceptable accuracy drop. The discipline is to quantise, then re-run your task-level evaluation suite: if cost-per-task falls and quality holds within tolerance, ship it; if quality slips on the tasks that matter, keep full precision for those and quantise the rest.
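Two small helpers capture both halves of that discipline: a back-of-envelope weight footprint, and a ship-or-keep gate on task-level results. The sketch assumes a hypothetical 70-billion-parameter model, counts weights only (KV cache and activations need extra memory on top), and uses an arbitrary one-point quality tolerance.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate weight footprint in GB: parameters times bits, divided by 8."""
    return params_billions * bits_per_weight / 8


def should_ship(cost_before: float, cost_after: float,
                quality_before: float, quality_after: float,
                tolerance: float = 0.01) -> bool:
    """Ship the quantised model only if cost per task falls and quality holds."""
    cheaper = cost_after < cost_before
    quality_holds = (quality_before - quality_after) <= tolerance
    return cheaper and quality_holds


# Hypothetical 70B-parameter model at three precisions.
footprints = {bits: weight_memory_gb(70, bits) for bits in (16, 8, 4)}
for bits, gb in footprints.items():
    print(f"{bits}-bit weights: ~{gb:.0f} GB")

# Hypothetical eval results (cost per task in dollars, quality as a 0-1 score).
ship_support = should_ship(0.050, 0.030, quality_before=0.92, quality_after=0.915)
ship_fraud = should_ship(0.050, 0.030, quality_before=0.92, quality_after=0.88)
print(ship_support, ship_fraud)
```

Running the gate per task type, not once per model, is what allows full precision to be kept for the tasks that matter while the rest are quantised.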
Right-sizing the model
The most expensive habit in production is routing every request to the largest available model "to be safe". Most tasks do not need a frontier model. Classification, extraction, routing, short rewrites and routine summarisation are well within reach of small open-weight models that you can serve cheaply on commodity GPUs. Reserve the flagship for genuinely hard reasoning.
Cheap-versus-expensive routing
Routing operationalises right-sizing. A lightweight classifier — or even a cheap small model acting as a triage layer — inspects each request and sends easy cases to a cheap model and hard cases to an expensive one. The UK fintech's fraud-triage agent is the canonical case: route the clear-cut transactions to a small in-house classifier, and escalate only the genuinely ambiguous ones to a frontier model. The cheap path handles the majority of volume at a fraction of the cost, and the expensive path is reserved for where it earns its keep.
Build the router around an explicit confidence threshold, not a vibe. Have the cheap model return a calibrated confidence score; escalate to the expensive model only below the threshold. Tune the threshold against your task-level evals so you can dial the cost-quality trade-off as a single number your finance team can reason about.
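A threshold router of this kind fits in a dozen lines. The sketch below is illustrative only: the two stub functions stand in for real model calls, the 0.85 threshold is an arbitrary starting point, and the cheap model is assumed to return a label together with a calibrated confidence between 0 and 1.

```python
from typing import Callable, Tuple

CheapModel = Callable[[str], Tuple[str, float]]   # returns (answer, confidence)
ExpensiveModel = Callable[[str], str]


def route(request: str, cheap_model: CheapModel, expensive_model: ExpensiveModel,
          threshold: float = 0.85) -> Tuple[str, str]:
    """Return (answer, path). Escalate only when cheap confidence is below threshold."""
    answer, confidence = cheap_model(request)
    if confidence >= threshold:
        return answer, "cheap"
    return expensive_model(request), "expensive"


def stub_cheap(request: str) -> Tuple[str, float]:
    """Stand-in triage model: confident on routine cases, unsure otherwise."""
    if "card present" in request:
        return "legitimate", 0.97
    return "unclear", 0.40


def stub_expensive(request: str) -> str:
    """Stand-in frontier model."""
    return "needs_review"


easy = route("card present, usual merchant, small amount", stub_cheap, stub_expensive)
hard = route("new device, foreign merchant, large amount", stub_cheap, stub_expensive)
print(easy, hard)
```

Raising the threshold sends more traffic to the expensive path and lowering it sends less, so the cost-quality trade-off is tuned through a single parameter.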
A worked example: the same workload, two bills
Picture a workload of one million support tasks a month, each averaging roughly 9,000 input tokens (system prompt plus retrieved context) and 700 output tokens. Take an illustrative blended rate of $5 per million input tokens and $15 per million output tokens — round numbers for the arithmetic, not a quote.
The naive setup routes every task to a frontier model, re-bills the full system prompt on each call, and runs on always-on on-demand GPUs at low utilisation. Input is 9 billion tokens, output is 700 million tokens: roughly $45,000 in input plus $10,500 in output, around $55,500 a month before the idle-GPU waste on top.
Now apply the playbook. Prefix-cache the 8,000-token preamble so only the ~1,000 dynamic input tokens are billed fresh per task; route the ~70% of easy tasks to a small model at roughly a tenth of the unit cost; and lift utilisation so the GPU baseline is committed-use rather than idle on-demand. Fresh input collapses from 9 billion to about 1 billion tokens, the expensive model now sees only the hard 30%, and the cached prefix is billed at a steep discount. Model spend falls to a small fraction of the naive figure, approaching an order of magnitude depending on how deep the cached-token discount is, and the utilisation fix separately removes most of the idle-capacity overhead. Same workload, same quality target, a fundamentally different bill.
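The two bills can be reproduced with a short calculation. The token counts, rates, easy-task share and small-model factor come from the example above; the one figure the example leaves open is the cached-token price, which this sketch assumes to be 10% of the fresh input rate. It also ignores any one-off cost of writing the prefix into the cache, and it covers model spend only, not GPU capacity.

```python
TASKS = 1_000_000
PREFIX_TOKENS = 8_000        # stable system prompt plus product knowledge
DYNAMIC_TOKENS = 1_000       # per-task input that cannot be cached
OUTPUT_TOKENS = 700
INPUT_RATE = 5.0             # dollars per million input tokens (illustrative)
OUTPUT_RATE = 15.0           # dollars per million output tokens (illustrative)
CACHED_DISCOUNT = 0.10       # ASSUMPTION: cached tokens billed at 10% of input rate
SMALL_MODEL_FACTOR = 0.10    # small model at roughly a tenth of the unit cost
EASY_SHARE = 0.70            # share of tasks routed to the small model


def task_cost(fresh_input: int, cached_input: int, output: int,
              model_factor: float = 1.0) -> float:
    """Dollar cost of one task for a given model price factor."""
    dollars = (
        fresh_input * INPUT_RATE
        + cached_input * INPUT_RATE * CACHED_DISCOUNT
        + output * OUTPUT_RATE
    ) / 1_000_000
    return dollars * model_factor


# Naive: every token billed fresh, every task on the frontier model.
naive = TASKS * task_cost(PREFIX_TOKENS + DYNAMIC_TOKENS, 0, OUTPUT_TOKENS)

# Optimised: cached prefix, easy tasks on the small model, hard tasks on the frontier.
hard = TASKS * (1 - EASY_SHARE) * task_cost(DYNAMIC_TOKENS, PREFIX_TOKENS, OUTPUT_TOKENS)
easy = TASKS * EASY_SHARE * task_cost(
    DYNAMIC_TOKENS, PREFIX_TOKENS, OUTPUT_TOKENS, model_factor=SMALL_MODEL_FACTOR
)
optimised = hard + easy

print(f"naive:     ${naive:,.0f} per month")
print(f"optimised: ${optimised:,.0f} per month ({naive / optimised:.1f}x lower)")
```

Under these assumptions the naive bill is about $55,500 and the optimised one about $7,215, a reduction of a little under 8x. Output tokens on the hard 30% are the largest remaining item, so a deeper cache discount alone does not close the gap to 10x; a larger easy-task share or shorter outputs would be needed.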
Watch the hardware curve — but do not wait for it
The hardware story is genuinely encouraging. A5X bare-metal instances on NVIDIA Vera Rubin NVL72 rack-scale systems are claimed to deliver up to 10x lower inference cost per token and 10x higher token throughput per megawatt versus the prior generation, with commercial Vera Rubin shipments expected in the second half of 2026. For very high-throughput production, that curve is worth tracking. But hardware that lands later this year does nothing for the bill you are paying this quarter. The levers above are available today, on the GPUs you already rent, and they compound — which is why the FinOps work, not the hardware roadmap, is where the savings live right now.
The playbook, in order
- Measure cost-per-task. Instrument spend by task type, including retries and tool calls. Optimise the few task types that dominate.
- Fix utilisation. Batch, consolidate replicas, autoscale off-peak, and share capacity across IN/UK time zones before tuning anything.
- Right-size the hardware. Use the GPU ladder; never run batch jobs on flagship on-demand rates.
- Cache aggressively. Prefix-cache stable context; response-cache deterministic queries.
- Quantise where quality holds. Re-run task-level evals after every quantisation step.
- Route cheap-versus-expensive. Confidence-threshold routing keeps the flagship for the hard minority.
- Blend purchase modes. Committed baseline, spot for batch, thin on-demand for spikes.
- Forecast demand. Cheaper tokens invite more usage — cap and budget so unit savings reach the invoice.
The teams winning on inference economics in 2026 are not the ones with the cheapest tokens. They are the ones who measure the right unit, keep their hardware busy, and route ruthlessly — in India and the UK alike.