What is wafer-scale and how does it differ from a GPU?

A wafer-scale engine is a single piece of silicon roughly the size of an entire 300mm wafer, with hundreds of thousands of cores and on-die SRAM measured in gigabytes. A GPU is a single die cut from a wafer, with cores in the tens of thousands and HBM connected over an interposer. The trade-off: wafer-scale removes off-chip memory bandwidth as the bottleneck and excels at low-latency token streaming; GPUs remain the most flexible and best-supported option for training and mixed workloads.

Should Indian or UK teams move workloads from Nvidia to Cerebras?

Only the inference half, and only for models Cerebras hosts. If you serve a supported model (gpt-oss-120B, Llama-class, or similar) and your bottleneck is tokens per second per pound or rupee, the wafer-scale economics are worth a benchmark. Keep training and bespoke model work on Nvidia for now; the software stack and ecosystem support there are still unmatched.

What does the 750MW number actually mean?

It is a three-year commitment by OpenAI to purchase up to 750 megawatts of Cerebras compute capacity. At roughly the power draw of a mid-sized city, it is large enough to materially shift OpenAI's inference cost base and to fund Cerebras's manufacturing scale-up. It is a capacity-ceiling commitment, not an instant deployment.

How does Cerebras inference pricing compare to Groq?

Cerebras lists gpt-oss-120B at roughly $0.35 input / $0.75 output per million tokens. Groq sits around $0.26 per million blended, with Llama 3.3 70B output at about $0.79 per million. The headline takeaway: Groq is cheaper per token on a blended basis for similar-class open models, while Cerebras runs them noticeably faster, at about 3,000 tokens per second on gpt-oss-120B versus Groq's ~476 tokens per second on the same model. Choose on whichever axis your product is bottlenecked.

When does the OpenAI deal start delivering capacity?

The commitment runs for three years from signing. Capacity ramps as Cerebras stands up wafer-scale clusters at hosting partners; the full 750MW is a ceiling reached toward the back end of the term rather than a day-one number. Builders should expect gradual price and availability improvements on Cerebras-hosted inference through 2026 and 2027.

Cerebras IPO at $56B: Why OpenAI Bet 750MW on Wafer-Scale

The structural shift, not the IPO mechanics

Cerebras Systems went public on 14 May 2026, closing its first trading day at a fully diluted valuation of roughly $56 billion. The headline is easy to write and easy to misread. The actual story is not that a niche AI hardware company finally listed; it is what the company committed to in the weeks before pricing — a multi-year deal with OpenAI to deliver up to 750 megawatts of inference capacity over three years on Cerebras's wafer-scale engines. That single anchor customer is what the public-market valuation is really pricing.

And that contract is the visible edge of something larger: inference is splitting away from training as a separate market, with separate buyers, separate economics and increasingly separate silicon. Nvidia's GPUs still own training. Inference is now a three-way fight between Nvidia, Cerebras with its wafer-scale engines (WSE) and Groq with its language-processing units (LPUs). For builders in Bengaluru, Hyderabad, London and Manchester deciding where to spend their next inference rupee or pound, the menu just got materially more interesting.

$56B Cerebras day-one close on 14 May 2026 — the largest pure-play AI inference IPO to date.
750MW OpenAI commitment over three years — see the NetworkWorld coverage of the deal mechanics.
Nvidia's $20B Groq licensing deal earlier this year tells the same story from the other side, as VentureBeat framed it: even Nvidia is acknowledging that inference deserves its own silicon.
Inference cost is falling roughly 95% year-on-year, which compounds the case for picking the right silicon class today rather than overcommitting to one vendor.

Pro tip

Stop thinking of "AI compute" as one budget line. Split your spreadsheet now: training, fine-tuning, inference (latency-sensitive), inference (batch). The right silicon — and the right vendor — is different for each. Teams that still treat compute as a single line item are over-paying on at least one of them.

What 750 megawatts actually buys

750MW is the kind of number that sounds abstract until you sit with it. A mid-sized city draws roughly that. A modern hyperscale data-centre campus might be 100 to 300MW. OpenAI's commitment, ramped across three years, is on the order of three to five full-scale campuses' worth (at 150 to 250MW each) of Cerebras wafer-scale capacity dedicated to inference.
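
As a back-of-envelope check on that comparison, the sketch below divides the 750MW ceiling by a few campus sizes inside the 100 to 300MW range quoted above. The function name is ours, and it assumes the whole commitment maps one-to-one onto campus power, which ignores cooling overhead and ramp timing.

```python
def campus_equivalents(commit_mw: float, campus_mw: float) -> float:
    """How many campuses of a given size add up to the commitment."""
    if campus_mw <= 0:
        raise ValueError("campus_mw must be positive")
    return commit_mw / campus_mw


COMMIT_MW = 750  # ceiling quoted for the OpenAI deal

# Campus sizes sampled from the 100 to 300MW range in the text.
for campus_mw in (100, 150, 250, 300):
    n = campus_equivalents(COMMIT_MW, campus_mw)
    print(f"{campus_mw} MW campus -> {n:.1f} campuses")
```

Campuses of 150 to 250MW give five and three respectively; the extremes of the quoted range stretch that to 7.5 and 2.5.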

The economic logic is straightforward. OpenAI's inference bill is now its single largest variable cost. Every percentage point shaved off cost-per-million-tokens, multiplied across ChatGPT, the API, and the agentic products downstream, runs into nine figures over a multi-year window. Cerebras's wafer-scale engine reports roughly 3,000 tokens per second on the open gpt-oss-120B model — close to an order of magnitude faster than a typical GPU-served deployment of the same model, with corresponding savings on $/MTok for streaming workloads.

If you have followed our earlier piece on inference costs falling roughly 95% year-on-year, this deal fits the trajectory: the marginal cost of a token is collapsing, and the buyers who lock in the next generation of throughput-optimised silicon get to ride that curve hardest.

The three-way inference market in one table

The cleanest way to compare what is on offer today, with public pricing and benchmark numbers from each vendor:

Provider	Chip	Sample model	Tokens/sec	$/M input	$/M output	Best for
Cerebras	WSE (Wafer-Scale Engine)	gpt-oss-120B	~3,000	$0.35	$0.75	Latency-sensitive streaming, agent loops where tokens/sec is the bottleneck
Groq	LPU	gpt-oss-120B / Llama 3.3 70B	~476 (0.6–0.9s TTFT)	~$0.26 blended	~$0.79 (Llama 3.3 70B)	High-volume open-model inference where cost/MTok is the bottleneck
Nvidia H100	GPU	Any (broadest support)	Workload-dependent	Provider-dependent	Provider-dependent	Training + flexible inference, models not yet hosted elsewhere
Nvidia B200 / DGX B300	GPU	Frontier + mixed	Workload-dependent	Provider-dependent	Provider-dependent	New training runs, frontier-model deployment, mixed-workload clusters
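
To turn the table into per-request numbers, here is a minimal sketch. The prices and tokens-per-second figures come from the rows above; the request shape (2,000 input tokens, 500 output tokens) is a hypothetical we picked for illustration, and the timing ignores time to first token, queueing and network latency.

```python
def request_cost_usd(input_tokens: int, output_tokens: int,
                     usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Cost of one request given per-million-token list prices."""
    return (input_tokens * usd_per_m_input
            + output_tokens * usd_per_m_output) / 1_000_000


def generation_seconds(output_tokens: int, tokens_per_sec: float) -> float:
    """Time to stream the output once generation has started."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    return output_tokens / tokens_per_sec


# Hypothetical request shape, not taken from any vendor benchmark.
INPUT_TOKENS, OUTPUT_TOKENS = 2_000, 500

# Cerebras gpt-oss-120B list prices from the table.
cost = request_cost_usd(INPUT_TOKENS, OUTPUT_TOKENS, 0.35, 0.75)
print(f"Cerebras cost per request: ${cost:.6f}")

# Throughput figures from the table, both quoted on gpt-oss-120B.
for name, tps in (("Cerebras", 3000), ("Groq", 476)):
    print(f"{name}: {generation_seconds(OUTPUT_TOKENS, tps):.2f} s to stream")
```

On that request shape the Cerebras call costs about a tenth of a cent, and the stream takes about 0.17 seconds at 3,000 tokens per second versus about 1.05 seconds at 476. Swap in your own token counts before drawing conclusions.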

If you are comparing providers at the API-rental level rather than the silicon level, our coverage of DeepInfra vs Together vs Fireworks vs Groq vs Cerebras walks the platform-by-platform trade-offs in more detail. The short version: each of the wafer-scale and LPU vendors has a tight catalogue of supported models, optimised hard; the GPU-backed platforms host everything but rarely lead on $/MTok for the models the specialists do support.

Why GPUs are not going anywhere — they are just specialising

It would be tempting to read the Cerebras IPO and the Nvidia–Groq deal as bad news for GPUs. It is not. It is the inference half of the workload migrating, and Nvidia knows it. The B200 generation is being positioned for training and mixed-workload deployment; the headline-grabbing comparison work we covered in B200 vs H100 inference economics shows GPUs continuing to improve where they are strong.

The pricing context: the Nvidia H100 lists at $25K–$40K, the B200 at $30K–$50K, and a DGX B300 system at roughly $300K–$350K. Those are training-fleet numbers. On the rental side, our H100 price-decline guide showed cloud rentals continuing to soften through 2026 as supply catches up. So even on inference, GPUs remain competitive for models that the specialist silicon does not host, for mixed workloads where you want one cluster to do many things, and for any team whose CUDA-derived software stack is the binding constraint.

Watch out

Do not benchmark on the wrong axis. Wafer-scale and LPU providers will quote you a tokens-per-second number that looks crushing. That matters only if your product is latency-bound — chat UIs, voice, agent loops with tight think-act cycles. For a batch summarisation job that runs overnight, tokens-per-second is irrelevant; $/MTok and model availability are the only numbers that should move the decision.

What Indian and UK builders should actually do this quarter

For most teams in India and the UK, the right action is not "switch providers". It is "split the workload". Three concrete patterns we are seeing:

Bengaluru SaaS team running agentic copilots. Their bottleneck is tokens-per-second inside a multi-step agent loop. Slow tokens mean slow agents mean unhappy users. They are benchmarking Cerebras's gpt-oss-120B endpoint against their incumbent GPU-served Llama deployment. If the latency improvement holds at production load, the per-call economics work even at slightly higher $/MTok, because user-perceived latency drops by 4 to 5 seconds per agent turn.
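
A rough way to sanity-check a per-turn saving like that is to compare streaming time at two throughputs. In the sketch below, the 3,000 tokens per second figure is the Cerebras number quoted earlier; the 300 tokens per second baseline and the 1,500 tokens per agent turn are assumptions of ours, chosen only to show how a gap of that size can arise.

```python
def turn_seconds(output_tokens: int, tokens_per_sec: float,
                 ttft_sec: float = 0.0) -> float:
    """Wall-clock time for one agent turn: time to first token plus streaming."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    return ttft_sec + output_tokens / tokens_per_sec


def seconds_saved_per_turn(output_tokens: int, baseline_tps: float,
                           candidate_tps: float) -> float:
    """Latency saved per turn when moving from baseline to candidate throughput."""
    return (turn_seconds(output_tokens, baseline_tps)
            - turn_seconds(output_tokens, candidate_tps))


# Hypothetical workload: 1,500 generated tokens per turn, 300 tok/s incumbent.
saved = seconds_saved_per_turn(1_500, baseline_tps=300, candidate_tps=3_000)
print(f"Saved per agent turn: {saved:.1f} s")
```

With those inputs the saving is 4.5 seconds per turn. Turns that generate fewer tokens save proportionally less, which is why the benchmark has to run on the team's real traffic.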

UK fintech with high-volume document classification. Tens of millions of short classifications per month. Latency does not matter — these run in a queue. $/MTok is everything. Groq's pricing on Llama 3.3 70B is the obvious benchmark target; the team is comparing it against a self-hosted vLLM deployment on rented H100s. The break-even point is sensitive to utilisation: above ~60% sustained utilisation, self-host wins; below it, Groq's pay-per-token model wins.
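
The break-even logic can be sketched as a one-line ratio: a self-hosted cluster costs the same per hour whether it is busy or idle, so its effective $/MTok falls as utilisation rises. Every input below except the $0.79 API price is a hypothetical placeholder (a four-GPU cluster at $2.00 per GPU-hour, 4,700 tokens per second aggregate peak throughput); plug in your own quotes and measured throughput.

```python
def break_even_utilisation(cluster_usd_per_hour: float,
                           peak_tokens_per_sec: float,
                           api_usd_per_m_tokens: float) -> float:
    """Utilisation above which self-hosting beats pay-per-token pricing.

    A result above 1.0 means self-hosting never wins at these prices.
    """
    if min(cluster_usd_per_hour, peak_tokens_per_sec, api_usd_per_m_tokens) <= 0:
        raise ValueError("all inputs must be positive")
    peak_m_tokens_per_hour = peak_tokens_per_sec * 3600 / 1_000_000
    return cluster_usd_per_hour / (peak_m_tokens_per_hour * api_usd_per_m_tokens)


# Hypothetical cluster: 4 rented H100s at $2.00 per GPU-hour.
CLUSTER_USD_PER_HOUR = 4 * 2.00
# Hypothetical aggregate peak throughput of the vLLM deployment.
PEAK_TOKENS_PER_SEC = 4_700
# API price per million tokens (the ~$0.79 Llama 3.3 70B figure quoted above).
API_USD_PER_M_TOKENS = 0.79

u = break_even_utilisation(CLUSTER_USD_PER_HOUR, PEAK_TOKENS_PER_SEC,
                           API_USD_PER_M_TOKENS)
print(f"Break-even utilisation: {u:.0%}")
```

These placeholder inputs land at roughly 60%, but the result moves quickly with rental price and achieved throughput, and it leaves out engineering time and idle-hours overhead.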

Indian healthtech serving inference inside a data-residency boundary. Neither Cerebras nor Groq has an India region today. For this team, the choice collapses to GPU-hosted inference in an Indian-hosted Kubernetes cluster — likely on H100s rented from a domestic provider, or on the emerging Krutrim Bodhi-1 sovereign silicon stack as that ecosystem matures. The wafer-scale economics are simply not accessible to them yet.

Good move

If you are running an open-weight model (anything in the Llama, Qwen, DeepSeek or gpt-oss families), benchmark all three classes (WSE, LPU, GPU) before you renew your inference contract. The 90 minutes it takes to set up identical eval harnesses against each provider is the highest-ROI engineering time in your quarter. Vendors will quote different headline numbers; your own workload, run end to end, is the only number that matters.
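
One possible shape for such a harness is sketched below. It is provider-agnostic on purpose: `measure_stream` takes any function that yields streamed chunks, so the same timing code can wrap each vendor's client. `fake_provider` is a stand-in we made up so the example runs without credentials; it is not any vendor's API, and the harness counts streamed chunks, which may not equal tokens.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class StreamStats:
    ttft_sec: float   # time to first streamed chunk
    total_sec: float  # time until the stream is exhausted
    chunks: int       # streamed chunks (a proxy for tokens)

    @property
    def chunks_per_sec(self) -> float:
        return self.chunks / self.total_sec if self.total_sec > 0 else 0.0


def measure_stream(stream_fn: Callable[[str], Iterable[str]],
                   prompt: str) -> StreamStats:
    """Time one streamed completion end to end."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in stream_fn(prompt):
        if first is None:
            first = time.perf_counter()
        count += 1
    end = time.perf_counter()
    ttft = first - start if first is not None else float("nan")
    return StreamStats(ttft, end - start, count)


def fake_provider(prompt: str) -> Iterator[str]:
    """Hypothetical stand-in; replace with a wrapper around a real client."""
    time.sleep(0.05)  # simulated time to first token
    for word in prompt.split():
        time.sleep(0.001)  # simulated per-chunk delay
        yield word


stats = measure_stream(fake_provider, "one two three four five")
print(f"TTFT {stats.ttft_sec:.3f}s, {stats.chunks} chunks, "
      f"{stats.chunks_per_sec:.0f} chunks/s")
```

Run the same prompts, at the same concurrency, through one wrapper per provider, and compare medians over many calls instead of single runs.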

The open-weight angle — why this market split needs alt-silicon at all

None of this market division would matter if every serious model were locked behind a closed API. The reason wafer-scale and LPU providers can compete at all is that the open-weight model ecosystem has caught up. As we covered in GLM-4.7 running at zero hallucination rate on non-Nvidia silicon, the open-model frontier is no longer hardware-bound to CUDA. That is the precondition for Cerebras and Groq to have a real business — without portable open weights, there is no inference workload to chase.

For Indian and UK teams, that is the strategic read: the splitting of inference into a specialised silicon market is downstream of the open-weight thaw. Keep one eye on which open models are gaining adoption, because those are the models the specialist silicon vendors will optimise first. The team that picks an open model with broad multi-vendor support is buying itself optionality; the team that locks into a closed-API model is, by definition, locked to that vendor's silicon roadmap as well.

What the Cerebras IPO actually validates

Public markets are blunt instruments, but they are useful signals. A $56B day-one close on a company whose entire thesis is "inference deserves its own silicon" is the market voting on the structural-split argument. The 750MW OpenAI commitment is the proof point that the largest inference buyer in the world agrees. Nvidia's $20B Groq deal is the same vote cast from the other side of the room.

None of these three data points individually is conclusive. Together they form the same picture: the era when "AI compute" meant "more H100s" is over. Training will continue to live on Nvidia GPUs for the foreseeable future, with B200 and B300 succeeding H100 as the default. Inference is now a three-way decision — wafer-scale for latency-bound streaming, LPUs for cost-bound bulk, GPUs for everything else — and that decision is one that every team running production AI needs to revisit this quarter, not next year.
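
That three-way rule is simple enough to write down. The sketch below is our own simplification of it, not a vendor recommendation: it collapses "hosted by a specialist" into a single flag even though Cerebras and Groq each have their own catalogue, and it ignores constraints such as data residency.

```python
from enum import Enum


class Silicon(Enum):
    WSE = "wafer-scale"
    LPU = "LPU"
    GPU = "GPU"


def pick_silicon_class(latency_bound: bool,
                       hosted_by_specialist: bool,
                       is_training: bool = False) -> Silicon:
    """Apply the rule of thumb: WSE for latency-bound streaming,
    LPU for cost-bound bulk inference, GPU for everything else."""
    if is_training or not hosted_by_specialist:
        return Silicon.GPU
    return Silicon.WSE if latency_bound else Silicon.LPU


# Example workloads, labelled the way the budget split above suggests.
print(pick_silicon_class(latency_bound=True, hosted_by_specialist=True))
print(pick_silicon_class(latency_bound=False, hosted_by_specialist=True))
print(pick_silicon_class(latency_bound=True, hosted_by_specialist=False))
```

Treat the output as a shortlist for benchmarking, not a purchasing decision.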

The bottom line

Cerebras is now a public company worth $56 billion because it convinced the largest inference buyer in the world that wafer-scale is the right silicon for a 750-megawatt inference commitment. That is a strong validation, not a closing argument. The inference market has split, and the smartest teams in Bengaluru, Mumbai, London and Edinburgh are already running parallel benchmarks rather than picking sides.

Primary sources: NetworkWorld on the OpenAI–Cerebras 750MW deal and VentureBeat on the structural inference split.