The structural shift, not the IPO mechanics

Cerebras Systems went public on 14 May 2026, closing its first trading day at a fully diluted valuation of roughly $56 billion. The headline is easy to write and easy to misread. The actual story is not that a niche AI hardware company finally listed; it is what the company committed to in the weeks before pricing — a multi-year deal with OpenAI to deliver up to 750 megawatts of inference capacity over three years on Cerebras's wafer-scale engines. That single anchor customer is what the public-market valuation is really pricing.

And that contract is the visible edge of something larger: inference is splitting away from training as a separate market, with separate buyers, separate economics and increasingly separate silicon. Nvidia's GPUs still own training. Inference is now a three-way fight between Nvidia, Cerebras with its wafer-scale engines (WSE) and Groq with its language-processing units (LPUs). For builders in Bengaluru, Hyderabad, London and Manchester deciding where to spend their next inference rupee or pound, the menu just got materially more interesting.

  • $56B Cerebras day-one close on 14 May 2026 — the largest pure-play AI inference IPO to date.
  • 750MW OpenAI commitment over three years — see the NetworkWorld coverage of the deal mechanics.
  • Nvidia's $20B Groq licensing deal earlier this year tells the same story from the other side, as VentureBeat framed it: even Nvidia is acknowledging that inference deserves its own silicon.
  • Inference cost is falling roughly 95% year-on-year, which compounds the case for picking the right silicon class today rather than overcommitting to one vendor.

Pro tip

Stop thinking of "AI compute" as one budget line. Split your spreadsheet now: training, fine-tuning, inference (latency-sensitive), inference (batch). The right silicon — and the right vendor — is different for each. Teams that still treat compute as a single line item are over-paying on at least one of them.

What 750 megawatts actually buys

750MW is the kind of number that sounds abstract until you sit with it. A mid-sized city draws roughly that. A modern hyperscale data-centre campus might be 100 to 300MW, so 750MW is equivalent to somewhere between two and a half and seven and a half such campuses. OpenAI's commitment, ramped across three years, is therefore on the order of three to five full-scale campuses' worth of Cerebras wafer-scale capacity dedicated to inference.

The economic logic is straightforward. OpenAI's inference bill is now its single largest variable cost. Every percentage point shaved off cost-per-million-tokens, multiplied across ChatGPT, the API, and the agentic products downstream, runs into nine figures over a multi-year window. Cerebras reports roughly 3,000 tokens per second for its wafer-scale engine on the open gpt-oss-120B model, close to an order of magnitude faster than a typical GPU-served deployment of the same model, with corresponding savings on $/MTok for streaming workloads.

If you have followed our earlier piece on inference costs falling roughly 95% year-on-year, this deal fits the trajectory: the marginal cost of a token is collapsing, and the buyers who lock in the next generation of throughput-optimised silicon get to ride that curve hardest.
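To make the compounding concrete, here is a minimal sketch of what a steady 95% year-on-year decline would imply. It assumes the rate holds constant, which is an illustration rather than a forecast, and the $10 per million tokens starting price is hypothetical.

```python
def projected_cost(start_cost: float, annual_decline: float, years: int) -> float:
    """Cost after `years` if the price falls by `annual_decline` each year."""
    return start_cost * (1.0 - annual_decline) ** years


# Hypothetical starting price of $10 per million tokens (not a quoted figure).
for year in range(4):
    cost = projected_cost(10.0, 0.95, year)
    print(f"year {year}: ${cost:.5f} per MTok")
```

Under that assumption each year costs one-twentieth of the year before, which is why the timing of a long-term capacity commitment matters as much as the vendor.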

The three-way inference market in one table

The cleanest way to compare what is on offer today, with public pricing and benchmark numbers from each vendor:

| Provider | Chip | Sample model | Tokens/sec | $/M input | $/M output | Best for |
|---|---|---|---|---|---|---|
| Cerebras | WSE (Wafer-Scale Engine) | gpt-oss-120B | ~3,000 | $0.35 | $0.75 | Latency-sensitive streaming, agent loops where tokens/sec is the bottleneck |
| Groq | LPU | gpt-oss-120B / Llama 3.3 70B | ~476 (0.6–0.9s TTFT) | ~$0.26 blended | ~$0.79 (Llama 3.3 70B) | High-volume open-model inference where cost/MTok is the bottleneck |
| Nvidia H100 | GPU | Any (broadest support) | Workload-dependent | Provider-dependent | Provider-dependent | Training + flexible inference, models not yet hosted elsewhere |
| Nvidia B200 / DGX B300 | GPU | Frontier + mixed | Workload-dependent | Provider-dependent | Provider-dependent | New training runs, frontier-model deployment, mixed-workload clusters |
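
Per-token prices only become meaningful once they are multiplied through a real workload. The sketch below turns the Cerebras gpt-oss-120B prices from the comparison above into a monthly bill. The workload shape (10 million requests, 1,000 input and 300 output tokens each) is a hypothetical example, and the `TokenPrice` and `monthly_cost` names are illustrative, not part of any vendor SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenPrice:
    input_per_mtok: float   # USD per million input tokens
    output_per_mtok: float  # USD per million output tokens


def monthly_cost(price: TokenPrice, requests: int,
                 input_tokens: int, output_tokens: int) -> float:
    """Monthly spend in USD for a uniform workload at the given prices."""
    total_in = requests * input_tokens
    total_out = requests * output_tokens
    spend = total_in * price.input_per_mtok + total_out * price.output_per_mtok
    return spend / 1_000_000


# Cerebras gpt-oss-120B prices as quoted in the comparison above.
cerebras = TokenPrice(input_per_mtok=0.35, output_per_mtok=0.75)

# Hypothetical workload: 10M requests/month, 1,000 tokens in, 300 tokens out.
cost = monthly_cost(cerebras, 10_000_000, 1_000, 300)
print(f"${cost:,.2f} per month")
```

With these inputs the bill comes to about $5,750 a month. Swapping in another provider's prices, or your own input-to-output ratio, is a one-line change, and that ratio often matters more than the headline rate.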

If you are comparing providers at the API-rental level rather than the silicon level, our coverage of DeepInfra vs Together vs Fireworks vs Groq vs Cerebras walks the platform-by-platform trade-offs in more detail. The short version: each of the wafer-scale and LPU vendors has a tight catalogue of supported models, optimised hard; the GPU-backed platforms host everything but rarely lead on $/MTok for the models the specialists do support.

Why GPUs are not going anywhere — they are just specialising

It would be tempting to read the Cerebras IPO and the Nvidia–Groq deal as bad news for GPUs. It is not. It is the inference half of the workload migrating, and Nvidia knows it. The B200 generation is being positioned for training and mixed-workload deployment; the headline-grabbing comparison work we covered in B200 vs H100 inference economics shows GPUs continuing to improve where they are strong.

The pricing context: Nvidia's H100 lists at $25K–$40K per unit, the B200 at $30K–$50K, and a DGX B300 system at roughly $300K–$350K. Those are training-fleet numbers. On the rental side, our H100 price-decline guide showed cloud rentals continuing to soften through 2026 as supply catches up. So even on inference, GPUs remain competitive for models that the specialist silicon does not host, for mixed workloads where you want one cluster to do many things, and for any team whose CUDA-derived software stack is the binding constraint.

Watch out

Do not benchmark on the wrong axis. Wafer-scale and LPU providers will quote you a tokens-per-second number that looks crushing. That matters only if your product is latency-bound — chat UIs, voice, agent loops with tight think-act cycles. For a batch summarisation job that runs overnight, tokens-per-second is irrelevant; $/MTok and model availability are the only numbers that should move the decision.

What Indian and UK builders should actually do this quarter

For most teams in India and the UK, the right action is not "switch providers". It is "split the workload". Three concrete patterns we are seeing:

Bengaluru SaaS team running agentic copilots. Their bottleneck is tokens-per-second inside a multi-step agent loop. Slow tokens mean slow agents mean unhappy users. They are benchmarking Cerebras's gpt-oss-120B endpoint against their incumbent GPU-served Llama deployment. If the latency improvement holds at production load, the per-call economics work even at slightly higher $/MTok, because user-perceived latency drops by 4 to 5 seconds per agent turn.
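
A rough way to sanity-check a latency claim like this is to model each agent turn as a few sequential model calls. This is a sketch under stated assumptions: three calls per turn and 500 output tokens per call are hypothetical, 3,000 tokens per second is the Cerebras figure quoted earlier, and 300 tokens per second is an assumed GPU-served baseline (about an order of magnitude slower), not a measurement. Network time, tool calls and time to first token are ignored unless you pass `ttft_sec`.

```python
def generation_seconds(output_tokens: int, tokens_per_sec: float,
                       ttft_sec: float = 0.0) -> float:
    """Approximate wall-clock time: time to first token plus steady decoding."""
    return ttft_sec + output_tokens / tokens_per_sec


# Hypothetical agent turn: 3 sequential model calls, 500 output tokens each.
steps, tokens_per_step = 3, 500

fast = steps * generation_seconds(tokens_per_step, 3_000)  # quoted WSE figure
slow = steps * generation_seconds(tokens_per_step, 300)    # assumed GPU baseline
saved = slow - fast
print(f"fast: {fast:.2f}s, slow: {slow:.2f}s, saved: {saved:.2f}s")
```

Under these assumptions the turn drops from about 5 seconds of generation to about half a second, which is the same ballpark as the 4 to 5 second improvement described above. Your own step count and output lengths will move the result considerably.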

UK fintech with high-volume document classification. Tens of millions of short classifications per month. Latency does not matter — these run in a queue. $/MTok is everything. Groq's pricing on Llama 3.3 70B is the obvious benchmark target; the team is comparing it against a self-hosted vLLM deployment on rented H100s. The break-even point is sensitive to utilisation: above ~60% sustained utilisation, self-host wins; below it, Groq's pay-per-token model wins.
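
The break-even logic can be sketched in a few lines. The inputs here are assumptions chosen for illustration: $2.00 per hour for the rented H100 deployment and 1,200 tokens per second of aggregate peak throughput are hypothetical, and $0.79 per million tokens is the Groq Llama 3.3 70B figure from the comparison table. Engineering time and idle-capacity commitments are left out.

```python
def self_host_cost_per_mtok(hourly_rate: float, peak_tokens_per_sec: float,
                            utilisation: float) -> float:
    """Effective $/MTok for rented hardware that is busy `utilisation` of the time."""
    mtok_per_hour = peak_tokens_per_sec * 3600 * utilisation / 1_000_000
    return hourly_rate / mtok_per_hour


def break_even_utilisation(hourly_rate: float, peak_tokens_per_sec: float,
                           api_price_per_mtok: float) -> float:
    """Utilisation above which self-hosting is cheaper than paying per token."""
    peak_mtok_per_hour = peak_tokens_per_sec * 3600 / 1_000_000
    return hourly_rate / (peak_mtok_per_hour * api_price_per_mtok)


# Hypothetical: $2.00/hour rental, 1,200 tok/s aggregate peak throughput.
# $0.79/MTok is the Groq Llama 3.3 70B figure quoted in the table.
u = break_even_utilisation(2.00, 1_200, 0.79)
print(f"break-even utilisation: {u:.0%}")
```

With these particular inputs the break-even lands just under 60%. The threshold scales linearly with the hourly rate and inversely with throughput, so a cheaper rental or a better-tuned vLLM deployment pulls it down quickly.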

Indian healthtech serving inference inside a data-residency boundary. Neither Cerebras nor Groq has an India region today. For this team, the choice collapses to GPU-hosted inference in an Indian-hosted Kubernetes cluster — likely on H100s rented from a domestic provider, or on the emerging Krutrim Bodhi-1 sovereign silicon stack as that ecosystem matures. The wafer-scale economics are simply not accessible to them yet.

Good move

If you are running an open-weight model (anything in the Llama, Qwen, DeepSeek or gpt-oss families), benchmark all three classes (WSE, LPU, GPU) before you renew your inference contract. The 90 minutes it takes to set up identical eval harnesses against each provider is the highest-ROI hour and a half of engineering time in your quarter. Vendors will quote different headline numbers; your own workload, run end to end, is the only number that matters.
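
A minimal sketch of what an identical harness can look like follows. It treats each provider as a plain callable that takes a prompt and returns the number of output tokens generated. The `make_stub` providers are stand-ins that sleep instead of calling a real API, and the names `wse_stub` and `gpu_stub` are hypothetical. In practice you would wrap each vendor's client in the same callable shape and feed it your own prompts.

```python
import time
from statistics import median
from typing import Callable, Dict, List

# A provider is any callable: prompt in, number of output tokens out.
Provider = Callable[[str], int]


def benchmark(provider: Provider, prompts: List[str]) -> Dict[str, float]:
    """Run every prompt sequentially and report latency and throughput."""
    latencies: List[float] = []
    total_tokens = 0
    start = time.perf_counter()
    for prompt in prompts:
        t0 = time.perf_counter()
        total_tokens += provider(prompt)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    return {
        "median_latency_s": median(latencies),
        "tokens_per_sec": total_tokens / elapsed,
    }


def make_stub(delay_s: float, tokens: int) -> Provider:
    """Stand-in provider that sleeps instead of calling a real API."""
    def call(prompt: str) -> int:
        time.sleep(delay_s)
        return tokens
    return call


prompts = ["Summarise this ticket."] * 5
providers = {"wse_stub": make_stub(0.01, 200), "gpu_stub": make_stub(0.05, 200)}
results = {name: benchmark(p, prompts) for name, p in providers.items()}
for name, stats in results.items():
    print(name, stats)
```

The point of the design is that the prompts, the measurement code and the reporting are shared, so the only thing that varies between runs is the provider. A production version would also want concurrency, warm-up requests and percentile latencies.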

The open-weight angle — why this market split needs alt-silicon at all

None of this market division would matter if every serious model were locked behind a closed API. The reason wafer-scale and LPU providers can compete at all is that the open-weight model ecosystem has caught up. As we covered in GLM-4.7 running at zero hallucination rate on non-Nvidia silicon, the open-model frontier is no longer hardware-bound to CUDA. That is the precondition for Cerebras and Groq to have a real business — without portable open weights, there is no inference workload to chase.

For Indian and UK teams, that is the strategic read: the splitting of inference into a specialised silicon market is downstream of the open-weight thaw. Keep one eye on which open models are gaining adoption, because those are the models the specialist silicon vendors will optimise first. The team that picks an open model with broad multi-vendor support is buying itself optionality; the team that locks into a closed-API model is, by definition, locked to that vendor's silicon roadmap as well.

What the Cerebras IPO actually validates

Public markets are blunt instruments, but they are useful signals. A $56B day-one close on a company whose entire thesis is "inference deserves its own silicon" is the market voting on the structural-split argument. The 750MW OpenAI commitment is the proof point that the largest inference buyer in the world agrees. Nvidia's $20B Groq deal is the same vote cast from the other side of the room.

None of these three data points individually is conclusive. Together they form the same picture: the era when "AI compute" meant "more H100s" is over. Training will continue to live on Nvidia GPUs for the foreseeable future, with B200 and B300 succeeding H100 as the default. Inference is now a three-way decision — wafer-scale for latency-bound streaming, LPUs for cost-bound bulk, GPUs for everything else — and that decision is one that every team running production AI needs to revisit this quarter, not next year.

The bottom line

Cerebras is now a public company worth $56 billion because it convinced the largest inference buyer in the world that wafer-scale is the right silicon for a 750-megawatt inference build-out. That is a strong validation, not a closing argument. The inference market has split, and the smartest teams in Bengaluru, Mumbai, London and Edinburgh are already running parallel benchmarks rather than picking sides.

Primary sources: NetworkWorld on the OpenAI–Cerebras 750MW deal and VentureBeat on the structural inference split.