What's actually happening
For most of the current AI cycle, the default answer to "what runs your model?" has been a single word: Nvidia. That is now loosening. Reporting indicates that three high-profile operators — Midjourney, Anthropic and Meta — are migrating inference workloads from Nvidia GPUs toward Google's tensor processing units. The same reporting attributes cost cuts of roughly 65% to the move.
That 65% figure deserves an immediate flag, and we will keep flagging it through this piece because it is the number most likely to be misused. It comes from secondary reporting, not from a primary disclosure by any of the three companies. Nobody has published an audited before-and-after. So read it as "cost cuts of roughly 65%, according to reporting" — an indicative figure for very large, heavily optimised deployments, not a verified guarantee and certainly not a number that transfers to an arbitrary workload.
With that caution in place, the underlying trend is real and worth understanding. Two things can be stated plainly. First, for a model that stays in production, inference cost tends to dominate lifetime economics: training is largely a one-off capital event, while inference is a recurring operating bill that grows with every user request. Second, GPU utilisation in many enterprise deployments is low; expensive accelerators sit idle between bursts of traffic. When a cost line is both large and inefficiently spent, large operators go looking for alternatives. That is what this story is: not the abandonment of Nvidia, but the diversification of inference silicon.
Why inference, and why now
The economics have shifted under the industry's feet. In 2023, the spending conversation was dominated by training: assembling the clusters, paying for the runs, racing for capability. Inference was the smaller, quieter line. That has inverted.
| AI infrastructure spend | 2023 | Early 2026 |
|---|---|---|
| Inference share of spend | ~33% | ~55% |
| Cost profile | Training-led, capital-heavy | Inference-led, recurring |
| What moves the bill | How big a model you trained | How many requests you serve, and how efficiently |
Inference now represents about 55% of AI infrastructure spending in early 2026, up from roughly 33% in 2023. Once the recurring bill is the majority of total spend, a percentage saving on inference is worth more than almost any one-off training optimisation. That is the structural reason a silicon shift is suddenly front-page infrastructure news rather than a procurement footnote.
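The arithmetic behind that point is short enough to sketch. The helper below is hypothetical, and the 20% inference saving is an illustrative number chosen for the example, not a figure from any reporting. It shows how the same percentage saving on inference is worth more of the total bill as the inference share grows.

```python
def total_spend_saving(inference_share: float, inference_saving: float) -> float:
    """Fraction of total AI infrastructure spend saved when only the
    inference portion of the bill gets cheaper."""
    if not (0.0 <= inference_share <= 1.0 and 0.0 <= inference_saving <= 1.0):
        raise ValueError("shares and savings must be fractions between 0 and 1")
    return inference_share * inference_saving


# Assumption: the same illustrative 20% inference saving in both years.
saving_2023 = total_spend_saving(inference_share=0.33, inference_saving=0.20)
saving_2026 = total_spend_saving(inference_share=0.55, inference_saving=0.20)

print(f"2023 mix: {saving_2023:.1%} of total spend")  # 6.6%
print(f"2026 mix: {saving_2026:.1%} of total spend")  # 11.0%
```

The saving on the total bill scales linearly with the inference share, which is why the shift from roughly a third to more than half of spend changes the priority of the work.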
It also explains Google's pitch. Google reports that its TPU 8i delivers about 80% better performance-per-dollar for inference than the prior TPU generation. That is Google's own claim — attribute it accordingly — but it is the right metric to be competing on. Performance-per-dollar, not raw throughput, is what determines whether serving a model is profitable. A chip that is slightly slower but materially cheaper per useful token can win the inference workload outright, even when it would lose a training benchmark.
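A minimal sketch of that comparison follows. The prices and throughputs are invented for illustration and do not describe any real chip; `cost_per_million_tokens` is a hypothetical helper, and it assumes the accelerator is fully busy.

```python
def cost_per_million_tokens(hourly_price_usd: float, tokens_per_second: float) -> float:
    """Serving cost per one million generated tokens, assuming the
    accelerator is busy for the whole hour."""
    if hourly_price_usd < 0 or tokens_per_second <= 0:
        raise ValueError("price must be >= 0 and throughput must be > 0")
    tokens_per_hour = tokens_per_second * 3600
    return hourly_price_usd / tokens_per_hour * 1_000_000


# Hypothetical chips: B is 20% slower than A but 40% cheaper per hour.
chip_a = cost_per_million_tokens(hourly_price_usd=10.0, tokens_per_second=5000)
chip_b = cost_per_million_tokens(hourly_price_usd=6.0, tokens_per_second=4000)

print(f"Chip A: ${chip_a:.2f} per million tokens")  # $0.56
print(f"Chip B: ${chip_b:.2f} per million tokens")  # $0.42
```

Under these made-up numbers the slower chip wins on cost per token, which is the sense in which performance-per-dollar can diverge from raw throughput.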
The roughly 65% saving will not transfer to your workload. That figure, per reporting, describes frontier-scale operators with bespoke serving stacks, steady high-volume traffic and dedicated infrastructure teams. A small startup with bursty traffic, an off-the-shelf serving framework and idle accelerators between requests starts from a completely different baseline. Treat the 65% as proof the direction is real — not as a forecast for your own bill.
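The difference in baseline is largely a utilisation effect, and a rough sketch makes it visible. The hourly price and utilisation levels below are assumptions chosen for illustration, and `effective_hourly_cost` is a hypothetical helper.

```python
def effective_hourly_cost(hourly_price_usd: float, utilisation: float) -> float:
    """Price paid per hour of useful work when the accelerator is only
    busy for a fraction `utilisation` of the time it is rented."""
    if not 0.0 < utilisation <= 1.0:
        raise ValueError("utilisation must be in the range (0, 1]")
    return hourly_price_usd / utilisation


# Assumption: the same $4/hour accelerator under two traffic patterns.
bursty = effective_hourly_cost(hourly_price_usd=4.0, utilisation=0.25)
steady = effective_hourly_cost(hourly_price_usd=4.0, utilisation=0.80)

print(f"Bursty startup: ${bursty:.2f} per useful hour")  # $16.00
print(f"Steady operator: ${steady:.2f} per useful hour")  # $5.00
```

A cheaper chip that still sits idle three quarters of the time leaves most of that gap in place, which is one reason a frontier-scale saving does not carry over to a bursty workload.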
What this means if you are not Anthropic-scale
Here is the honest core of this piece. Most Indian and UK builders reading this are not running inference at frontier scale. You are a startup, an agency or a product team with a cloud bill of perhaps a few thousand dollars a month, of which inference might be a meaningful slice but not a nine-figure cost centre. So what does a frontier-lab silicon shift actually mean for you?
The good news first: you do not need an Anthropic-style contract to touch this. TPUs are available on demand through Google Cloud, in the same way GPUs are available through any hyperscaler. Access is not the barrier. The barrier is software, and it is worth being concrete about it.
The mainstream GPU serving ecosystem is built on CUDA. The high-performance engines most teams reach for — covered in our comparison of vLLM, SGLang and TensorRT-LLM — assume that substrate. TPUs do not run CUDA. They run on Google's XLA compiler, and the most fluent path to them is JAX. Some PyTorch workloads can reach TPUs through PyTorch/XLA, but the experience is less mature than the native GPU path, and custom CUDA kernels do not come along for free. So the real question is not "are TPUs cheaper?" — it is "what does it cost me, in engineering time, to make my model run well on a non-CUDA accelerator, and is that cost smaller than the saving?"
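That trade-off can be framed as a payback period. The sketch below is illustrative only: the porting cost, monthly bills and the 30% saving are assumed values, not estimates for any real migration, and `payback_months` is a hypothetical helper.

```python
import math


def payback_months(porting_cost_usd: float,
                   monthly_inference_bill_usd: float,
                   expected_saving_fraction: float) -> float:
    """Months of savings needed to recover a one-off porting cost.
    Returns infinity when the migration saves nothing."""
    monthly_saving = monthly_inference_bill_usd * expected_saving_fraction
    if monthly_saving <= 0:
        return math.inf
    return porting_cost_usd / monthly_saving


# Assumptions: $30k of engineering time, 30% saving after the port.
small_team = payback_months(30_000, monthly_inference_bill_usd=2_000,
                            expected_saving_fraction=0.30)
large_team = payback_months(30_000, monthly_inference_bill_usd=50_000,
                            expected_saving_fraction=0.30)

print(f"$2k/month bill: {small_team:.1f} months to break even")   # 50.0
print(f"$50k/month bill: {large_team:.1f} months to break even")  # 2.0
```

The same porting effort pays back in weeks at one scale and in years at another, so the size of the existing bill matters more than the headline saving.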
For a managed-API customer, the answer is usually "this is not your problem yet". If you consume models through an inference provider — and we compared the field in our piece on inference platforms DeepInfra, Together, Fireworks, Groq and Cerebras — the silicon underneath is the provider's concern. Their margins improve if they adopt cheaper accelerators, and competition should pass some of that through to your per-token price over time. You benefit indirectly, without porting anything.
Before evaluating any new silicon, fix the cheap wins first. Caching, request batching and model routing routinely cut inference bills by a large margin with zero hardware change — we wrote a full playbook in how to cut LLM API costs with caching, batching and routing. A team still serving every request uncached and unbatched has more to gain from a week of software work than from a hardware migration. Sort the software, then look at the chip.
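As a minimal sketch of the caching idea, the example below wraps a stand-in model call in Python's standard `functools.lru_cache`. `call_model` is a hypothetical placeholder for a real billable request, and exact-match caching like this only suits prompts that repeat verbatim and responses that are meant to be deterministic.

```python
from functools import lru_cache

backend_calls = 0  # counts how many times the billable call actually runs


def call_model(prompt: str) -> str:
    """Hypothetical stand-in for a real, billable model request."""
    global backend_calls
    backend_calls += 1
    return f"response to: {prompt}"


@lru_cache(maxsize=1024)
def cached_completion(prompt: str) -> str:
    """Serve repeated prompts from memory instead of calling the model."""
    return call_model(prompt)


prompts = ["What is a TPU?", "What is a TPU?", "What is XLA?", "What is a TPU?"]
answers = [cached_completion(p) for p in prompts]

print(f"{len(prompts)} requests, {backend_calls} billable calls")  # 4 requests, 2 billable calls
```

A production cache would usually live outside the process and need an expiry policy, but the effect on the bill is the same in kind: repeated requests stop reaching the accelerator at all.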
Should you evaluate TPUs? A decision matrix
The question is not binary. It depends on the shape of your workload and your team. The matrix below is how we would frame the call.
| Your situation | Verdict | Why |
|---|---|---|
| You consume models via a managed API | Ignore for now | Silicon is the provider's problem; you benefit through price competition without porting |
| Self-hosted, inference under ~$2k/month | Ignore for now | Engineering time to port a non-CUDA stack will likely exceed the saving |
| Self-hosted, inference $10k+/month, bursty traffic | Fix utilisation first | Idle accelerators waste money on any silicon; batching and autoscaling come before a migration |
| Self-hosted, steady high-volume traffic, mainstream architecture | Worth a benchmark | This is the profile where TPU performance-per-dollar can genuinely pay back the porting cost |
| You depend on custom CUDA kernels or niche ops | Proceed with caution | Framework lock-in is real; budget significant time to re-implement on XLA paths |
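The matrix can be sketched as a small function for teams that want to record the call alongside their cost reviews. Several things here are assumptions rather than part of the matrix: the function name and parameters are hypothetical, the order in which rows are checked is one possible reading, the $10k/month line is reused as a stand-in for "high-volume", and bills between $2k and $10k a month are reported as not covered because the matrix does not address them.

```python
def tpu_verdict(managed_api: bool,
                monthly_inference_usd: float,
                steady_traffic: bool,
                custom_kernels: bool) -> str:
    """Map a workload profile onto the verdicts in the matrix above."""
    if managed_api:
        return "Ignore for now"
    if monthly_inference_usd < 2_000:
        return "Ignore for now"
    if monthly_inference_usd < 10_000:
        # Assumption: the matrix has no row for this range.
        return "Not covered by the matrix"
    if not steady_traffic:
        return "Fix utilisation first"
    if custom_kernels:
        return "Proceed with caution"
    return "Worth a benchmark"


print(tpu_verdict(managed_api=False, monthly_inference_usd=25_000,
                  steady_traffic=True, custom_kernels=False))
# Worth a benchmark
```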
One regional caveat that applies on both sides of our market: hyperscaler accelerator availability differs by region. The newest TPU generations land in a handful of Google Cloud regions first, and those are not always the regions closest to Indian or UK users. A UK team with data-residency obligations, or an Indian team optimising for latency to domestic users, may find the cheapest silicon is not yet available in a region it can legally or practically use. Check region availability before you build a business case on a headline price — the chip you benchmarked in one region may not be the chip you can deploy in another.
The framework lock-in trap
If there is one mistake to avoid, it is treating accelerator choice as a one-way door. The reason large operators can diversify silicon is partly that they invested early in keeping their serving stack portable — abstracted behind interfaces, not welded to one vendor's kernels. Most smaller teams have not done that, and discover the dependency only when they try to move.
The practical defence is to keep the migration question cheap to ask. Where you can, depend on serving frameworks and model formats that already support multiple back ends, rather than hand-optimised paths that assume one vendor. Keep your model architecture mainstream — exotic custom operations are exactly what fails to port cleanly to a new compiler. None of this means adopting TPUs today. It means making sure that if the economics shift further, switching is a tractable project rather than a rewrite. Optionality is the asset; the specific chip is not.
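One way to keep that question cheap is to have application code depend on a narrow interface rather than on a specific engine. The sketch below is a hypothetical illustration: `InferenceBackend`, `GpuBackend`, `TpuBackend` and `make_backend` are invented names, and the two back ends are stubs standing in for whatever CUDA-based or XLA-based serving path a team actually uses.

```python
from typing import Protocol


class InferenceBackend(Protocol):
    """The only surface the rest of the application is allowed to touch."""

    def generate(self, prompt: str, max_tokens: int) -> str: ...


class GpuBackend:
    """Stub: a real version would wrap a CUDA-based serving engine."""

    def generate(self, prompt: str, max_tokens: int) -> str:
        return f"[gpu] {prompt}"


class TpuBackend:
    """Stub: a real version would wrap an XLA-based serving path."""

    def generate(self, prompt: str, max_tokens: int) -> str:
        return f"[tpu] {prompt}"


BACKENDS = {"gpu": GpuBackend, "tpu": TpuBackend}


def make_backend(name: str) -> InferenceBackend:
    """Pick the back end from configuration, not from application code."""
    if name not in BACKENDS:
        raise ValueError(f"unknown backend: {name!r}")
    return BACKENDS[name]()


def answer(backend: InferenceBackend, prompt: str) -> str:
    # Application logic never imports a vendor-specific module directly.
    return backend.generate(prompt, max_tokens=64)


print(answer(make_backend("gpu"), "hello"))  # [gpu] hello
print(answer(make_backend("tpu"), "hello"))  # [tpu] hello
```

The interface does not make a port free, since the work of making the model run well on new silicon still sits inside the back end, but it confines that work to one module instead of spreading it through the product.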
It is also worth being clear-eyed about what the frontier-lab story does and does not tell you. It confirms that the inference-silicon market is becoming genuinely competitive, which is good news for every buyer — more competition should mean better performance-per-dollar across the board, whoever you buy from. It does not tell you that you personally should migrate. Those are different conclusions, and the gap between them is where teams waste quarters chasing a headline saving that was never theirs to capture.
The bottom line
The great TPU migration is real, but it is best understood as workload diversification by a small number of very large operators, not the end of the GPU era. The roughly 65% saving is, per reporting, a frontier-scale figure: useful as evidence the direction is real, useless as a forecast for a startup's bill. Google's claimed 80% gain for the TPU 8i is measured against its own prior TPU generation, not against Nvidia, but it is stated in performance-per-dollar, which is the metric that matters. That a performance-per-dollar number is the headline tells you the industry has shifted from a training-cost mindset to an inference-cost one.
For Indian and UK builders, the actionable version is short. If you use a managed API, do nothing and let provider competition work for you. If you self-host at modest scale, fix caching, batching, routing and utilisation before you think about silicon at all. If you self-host at real volume with steady traffic and a mainstream architecture, a TPU benchmark is a reasonable use of an engineer's week — go in with eyes open about JAX, XLA and the porting cost. And whatever you decide, keep your stack portable, because the one safe prediction is that the cheapest place to run inference will keep moving.