The inference economy in May 2026

The numbers behind serverless inference have moved further in the last twelve months than in the previous three years combined. Blended cost per million tokens on open-weight frontier models has fallen by roughly an order of magnitude since early 2025; H100 spot prices have softened by a comparable margin, as we tracked in our H100 pricing guide. Inference is now cheap enough to ship products that genuinely depend on it — a thesis we explored in why falling inference costs unlock profitable products.

Capital is following the curve. DeepInfra closed a $107M Series B in April, covered in our DeepInfra Series B write-up. Together AI continues to grow its multi-tenant fleet, Fireworks has tightened its training-plus-serving loop, and the custom-silicon vendors — Groq and Cerebras — have published throughput numbers that look fictional next to a standard GPU box. The sovereign-GPU layer is filling out around all of this: Yotta and Tata in India, Nscale and the new UK AI Research Resource clusters in Britain. Most of that capacity is still going to training, not serving, which keeps the structural problem from our utilisation crisis piece very much alive.

The market is bifurcating. Custom silicon — Groq's LPU, Cerebras's wafer-scale engine, SambaNova's reconfigurable dataflow — competes on raw speed. GPU platforms — DeepInfra, Together AI, Fireworks, Baseten, Modal — compete on model breadth, fine-tuning, batch, and developer experience. The decision is no longer "which API has the lowest latency"; it is "which side of the split fits my workload, and which vendor on that side fits my team".

The five at a glance

The compressed view first. Indicative pricing and performance figures below are as of April 2026; all five vendors publish current numbers on their pricing pages.

| Provider | Architecture | Best-known model | $/M tokens (blended) | TTFT | Throughput | Best for |
|---|---|---|---|---|---|---|
| Cerebras | WSE wafer-scale | gpt-oss-120B | ~$0.60 | ~1.0s | ~3,000 tok/s | Long-form generation, very large models |
| Groq | LPU custom silicon | gpt-oss-120B | ~$0.30 | 0.6–0.9s | ~476 tok/s | Voice agents, interactive chat |
| DeepInfra | GPU cloud | gpt-oss-120B, Kimi K2, Qwen3.5 | ~$0.08 | ~1.2s | ~120 tok/s | Cheapest serverless, widest open catalogue |
| Together AI | GPU cloud | Llama, Qwen3.5, DeepSeek | ~$0.12 | ~1.3s | ~140 tok/s | Fine-tune + serve, batch, volume discounting |
| Fireworks | GPU cloud | Llama, Qwen, DeepSeek V3.2 | ~$0.15 | ~1.0s | ~160 tok/s | Cleanest API, fastest new-model integration |

Pro tip

Treat headline tokens-per-second numbers like quoted top speeds on a car spec sheet. The figure you need for your architecture call is the p95 latency at your real concurrency, on your real prompt length, against your real output length. All five vendors will run a benchmark for you on request — ask for the percentile breakdown, not the mean.
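
To make the percentile point concrete, here is a minimal sketch of how a mean can hide the tail. The TTFT samples are hypothetical, not measurements from any vendor, and the nearest-rank percentile is one simple definition among several.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical time-to-first-token samples in seconds: nine fast calls, one slow one.
ttft_seconds = [0.71, 0.68, 0.74, 0.69, 0.72, 0.70, 0.73, 0.67, 0.75, 2.40]

mean_ttft = sum(ttft_seconds) / len(ttft_seconds)
p50_ttft = percentile(ttft_seconds, 50)
p95_ttft = percentile(ttft_seconds, 95)

print(f"mean={mean_ttft:.3f}s p50={p50_ttft:.2f}s p95={p95_ttft:.2f}s")
```

With these made-up samples the mean sits under 0.9 seconds while the p95 is 2.4 seconds, which is the gap a percentile breakdown exposes and a mean conceals.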

Cerebras: raw speed via wafer-scale chips

Cerebras's Wafer-Scale Engine collapses a whole rack of GPUs onto a single silicon wafer the size of a dinner plate. The trick is that the entire model lives on-chip: no cross-device chatter, no NVLink hops, no memory hierarchy gymnastics. For gpt-oss-120B that translates to roughly 3,000 output tokens per second, a number that changes which products feel possible. Long-form generation that would take twenty seconds on a standard GPU pod is over in two.

The right time to reach for Cerebras is when the dominant cost in your user experience is output length. A Mumbai startup generating long-form legal summaries from filings is a textbook fit: the user is happy to wait one second for first token but cannot wait thirty for a full eight-thousand-token output. The free tier with daily token caps makes it cheap to validate the shape of the workload first.

# Cerebras — OpenAI-compatible client
from openai import OpenAI

client = OpenAI(
    base_url="https://api.cerebras.ai/v1",
    api_key=CEREBRAS_KEY,
)

resp = client.chat.completions.create(
    model="gpt-oss-120b",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=8000,
    stream=True,
)
for chunk in resp:
    print(chunk.choices[0].delta.content or "", end="")

Groq: lowest first-token latency

Groq's Language Processing Unit takes a different bet. Where Cerebras optimises for sustained throughput on a huge model, the LPU is engineered for deterministic, low-latency inference on a fixed dataflow graph. The number that matters in practice is time-to-first-token: across the Groq catalogue, that figure sits between 0.6 and 0.9 seconds, more or less independently of model size. If your application is a voice agent — where the user hears silence until the first token arrives — that floor is the single most important specification on the page.

The pattern that keeps showing up: a London fintech building a customer-facing voice agent picks Groq not because it is the cheapest but because perceived latency is the difference between "magic" and "broken". A Bengaluru BPO offering bilingual Hindi–English voice support comes to the same conclusion. The free tier with daily caps lets a small team prototype an end-to-end loop in a day.

# Groq — OpenAI-compatible client, streaming
import time

from openai import OpenAI

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key=GROQ_KEY,
)

started_at = time.perf_counter()
stream = client.chat.completions.create(
    model="gpt-oss-120b",
    messages=conversation,    # list of {"role": ..., "content": ...} dicts
    stream=True,
    temperature=0.2,
)
ttft = None
for chunk in stream:
    # Some chunks carry no choices or an empty delta; skip them when timing
    if ttft is None and chunk.choices and chunk.choices[0].delta.content:
        ttft = time.perf_counter() - started_at    # 0.6–0.9s in practice
    handle(chunk)    # your own downstream handler

DeepInfra: cheapest per-token, widest open-weight catalogue

If your workload is dominated by token volume rather than peak latency, DeepInfra is hard to beat on price. Blended cost on gpt-oss-120B sits at roughly $0.08 per million tokens as of April 2026 — comfortably below the GPU-platform pack and dramatically below the closed frontier APIs. The catalogue is the real differentiator: Kimi K2 family, Qwen3.5 family, GLM-5, DeepSeek V3.2, MiniMax-M2, gpt-oss-120B, the NVIDIA Nemotron lineup. If a Chinese open-weight model lands on a Tuesday, it is usually on DeepInfra by Thursday — a point we made in our Chinese open-weight coding models round-up.

The pricing math gets interesting at scale. A team running ten million tokens a day on gpt-oss-120B is paying somewhere near $24 a month on DeepInfra at headline rates. The same workload on a closed frontier API at $3 per million blended is $900 a month — nearly forty times more. That gap pays for a lot of evaluation engineering and even more user-facing features. DeepInfra hands out signup credits rather than a permanent free tier, which is honestly fine for most teams: a few hundred dollars of credit will get you through a fortnight of serious testing.

# DeepInfra — OpenAI-compatible
from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepinfra.com/v1/openai",
    api_key=DEEPINFRA_KEY,
)

# Wide catalogue: swap the model id, keep the call shape
for model in ("openai/gpt-oss-120b",
              "moonshotai/Kimi-K2-Instruct",
              "Qwen/Qwen3.5-72B-Instruct",
              "deepseek-ai/DeepSeek-V3.2"):
    r = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    score(model, r)
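
The monthly figures quoted above follow from one line of arithmetic. A small sketch, assuming a flat 30-day month, no volume discounts, and the indicative April 2026 rates from the text; the function name is illustrative.

```python
def monthly_cost_usd(tokens_per_day, usd_per_million_tokens, days=30):
    """Monthly spend at a flat blended rate, with no discounts or free credits."""
    return tokens_per_day / 1_000_000 * usd_per_million_tokens * days


TOKENS_PER_DAY = 10_000_000

deepinfra = monthly_cost_usd(TOKENS_PER_DAY, 0.08)   # indicative rate from the text
closed_api = monthly_cost_usd(TOKENS_PER_DAY, 3.00)  # "closed frontier API" rate from the text
ratio = closed_api / deepinfra

print(f"DeepInfra: ${deepinfra:,.2f}/month")
print(f"Closed API: ${closed_api:,.2f}/month")
print(f"Ratio: {ratio:.1f}x")
```

Swapping in your own daily volume and the current rate from a vendor's pricing page gives a first-order budget; real bills will differ once input and output tokens are priced separately.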

Together AI: price leader at scale + fine-tuning

Together's positioning is closest to the "full-stack inference platform" archetype. The pricing is competitive at retail and aggressive at volume; the catalogue is broad enough; the genuine moat is that fine-tuning and serving share the same API. You upload a dataset, kick off a LoRA or full fine-tune, and the resulting weights are available behind a serving endpoint without any of the usual handoff pain.

For teams iterating on a domain model — say, a Pune-based legal-tech team tuning a 7B for Indian contract clauses, or a UK insurance carrier tuning a 13B on five years of broker emails — the same-API loop is the productivity win. The alternative is to fine-tune on RunPod or your own cluster, push the artefact somewhere, and write the serving wrapper yourself. That is a week of plumbing per iteration; with Together it is a function call. Together is also strong at batch and multi-tenant workloads, which makes it the default pick for the kind of overnight bulk-RAG job we describe later in this piece.

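# Together AI — fine-tune then serve, same API
# Sketch assuming the v1 `together` Python SDK; check names against the current docs.
import time

from together import Together

client = Together(api_key=TOGETHER_KEY)

job = client.fine_tuning.create(
    training_file="file-xxx",
    model="meta-llama/Llama-3.1-8B-Instruct",
    n_epochs=3,
    lora=True,
)

# Poll until the job reaches a terminal state
while True:
    job = client.fine_tuning.retrieve(job.id)
    if job.status in ("completed", "error", "cancelled"):
        break
    time.sleep(30)

# Once the job completes, the tuned model can be called by its output name
out = client.completions.create(
    model=job.output_name,
    prompt="Summarise this clause:\n" + clause,
)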

Fireworks: cleanest API, fastest new-model integration

Fireworks competes on developer experience. The API is the cleanest of the five — sensible defaults, good error messages, a Python SDK that does not surprise you — and the team is consistently among the first to host new open-weight releases. When a model drops on Hugging Face on a Friday, Fireworks frequently has it served by Monday with a properly documented endpoint. That speed matters more than it sounds: it is the difference between your team being two weeks ahead of the curve or two weeks behind it.

Like Together, Fireworks supports training and serving through the same API. Pricing sits slightly above Together at retail, but the DX tax is usually worth it for teams that do not yet have a dedicated platform engineer. If your team is three engineers shipping a product, Fireworks is the path of least resistance.

# Fireworks — OpenAI-compatible
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key=FIREWORKS_KEY,
)

r = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-v3p2",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=4096,
    response_format={"type": "json_object"},
)

Implementation patterns by workload

Three concrete patterns, each matched to a vendor with a one-line decision rule.

Pattern A — interactive chat agent

Single user, conversational turn, sub-second perceived latency. The user metric is time-to-first-token; total throughput is secondary because the human eye stops reading much past 30 tokens per second anyway. Pick Groq. The 0.6–0.9 second TTFT floor is the spec; everything else is decoration. If the model you need is not in the Groq catalogue, fall back to Fireworks for the cleanest API and acceptable latency. Use DeepInfra only if the dominant constraint is cost and the user will tolerate a second or so more lag.

Pattern B — RAG batch over a large corpus

Ten thousand documents, overnight job, build embeddings and pre-summaries. The user metric is dollars per processed document; latency per call is irrelevant as long as the batch finishes by morning. Pick DeepInfra, with Together as the alternative once you cross volume thresholds where its discounting kicks in. A Mumbai startup ingesting two years of regulatory filings, or a Manchester healthtech building a private-corpus assistant, will both arrive at the same answer. The decision rule is: if you have a procurement relationship with one already, stay; otherwise default to DeepInfra for the cheaper per-token rate and the wider catalogue.

Pattern C — fine-tuned domain model

You have a labelled dataset and an open-weight base model; you need to tune, evaluate, iterate, and serve. Pick Together or Fireworks. Both support the train-and-serve loop through one API; both will save you the week of plumbing the alternatives demand. Choose Together for slightly better price at scale; choose Fireworks for slightly better developer experience. If the model in question is one of the very large open-weight frontier models — Kimi K2, DeepSeek V3.2, GLM-5 — and you are doing LoRA rather than a full fine-tune, DeepInfra can serve the base model and let you BYO LoRA adapters.

Watch out

Multi-region failover and data residency are not free on serverless inference. None of the five vendors offers the same regional footprint as a hyperscaler. For Indian banks under RBI outsourcing guidance, or UK NHS Trusts under the UK GDPR data-processing constraints, the realistic answer is usually a hybrid: serverless for the bulk of traffic, plus a dedicated tenancy or self-hosted deployment for the regulated slice. Bake that into your architecture from day one, not as a panicked retrofit.
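
One way to picture the hybrid split is a router that decides per request which endpoint may see the payload. This is a sketch only: both base URLs are placeholders, the `regulated` flag stands in for whatever classification your compliance team defines, and treating an unclassified request as regulated is a design assumption, not a regulatory requirement.

```python
from dataclasses import dataclass
from typing import Optional

# Placeholder endpoints, not real services.
SERVERLESS_BASE_URL = "https://serverless.example.com/v1"
DEDICATED_BASE_URL = "https://inference.internal.example.com/v1"


@dataclass(frozen=True)
class InferenceRequest:
    prompt: str
    regulated: Optional[bool] = None  # None means "not yet classified"


def route(request: InferenceRequest) -> str:
    """Return the base URL allowed to handle this request.

    Fails closed: anything not explicitly marked unregulated stays on the
    dedicated deployment.
    """
    if request.regulated is False:
        return SERVERLESS_BASE_URL
    return DEDICATED_BASE_URL


print(route(InferenceRequest("Summarise this public filing", regulated=False)))
print(route(InferenceRequest("Summarise this patient note", regulated=True)))
```

Keeping the routing decision in one function means the regulated slice can grow or shrink later without touching the call sites.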

Cost worked-example: 10M tokens a day across five vendors

A realistic mid-stage workload — ten million tokens a day, blended input plus output on gpt-oss-120B-class models, no volume discounting assumed. Headline list prices, April 2026.

| Provider | $/M blended | $/day at 10M | $/month (~30 days) | Notes |
|---|---|---|---|---|
| DeepInfra | $0.08 | $0.80 | ~$24 | Cheapest serverless; widest catalogue |
| Together AI | $0.12 | $1.20 | ~$36 | Tighter still at committed volume |
| Fireworks | $0.15 | $1.50 | ~$45 | Best DX, slight premium |
| Groq | $0.30 | $3.00 | ~$90 | Pays for itself if TTFT matters |
| Cerebras | $0.60 | $6.00 | ~$180 | Pays for itself on long-output workloads |

Two things to read out of the table. The absolute numbers are tiny: even the most expensive option costs six dollars a day for a workload that two years ago was a luxury reserved for funded startups. That collapse is the reason consumer AI products are viable now; see our DeepSeek V4 Flash Pro cost analysis for more. But the ratios still matter: the gap between DeepInfra and Cerebras is 7.5×, and if traffic grows ten times this year that is the difference between roughly $240 a month and $1,800.

Builder says

"We started on a closed frontier API because that's what the demo tutorials used. Six weeks in we were spending more on inference than on the rest of our infrastructure combined. We moved the 'long tail' of cheap queries to DeepInfra on gpt-oss-120B and kept the premium API only for the top-of-funnel evaluation step. Bill dropped by about eighty per cent and our users didn't notice. The lesson: you almost never need the most expensive model for every call." — A Verified Builder · Bengaluru

The decision matrix — pick one

The compressed heuristic, in order of decision weight:

  • Voice or interactive chat with sub-second TTFT? Groq. If your model is not in their catalogue, Fireworks.
  • Long-form generation where output length dominates? Cerebras. Use the free tier first to validate the shape.
  • High-volume batch or RAG where cost per token dominates? DeepInfra. Together when committed-volume discounting closes the gap.
  • Fine-tuning loop on open weights? Together or Fireworks. Together for cost at scale, Fireworks for developer experience.
  • Need the widest open-weight catalogue and the cheapest per-token? DeepInfra by default.
  • Regulated workload with hard data-residency requirements? None of the above on their own; design for hybrid from day one.
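
The heuristic above can be written down as a small function, which is a handy way to make a team argue about the rules rather than the vendors. The field names are illustrative, and checking data residency first is an assumption on our part: the list puts it last, but a hard constraint overrides every preference.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    hard_data_residency: bool = False
    needs_subsecond_ttft: bool = False
    model_on_groq: bool = True
    long_outputs_dominate: bool = False
    cost_per_token_dominates: bool = False
    committed_volume_discount: bool = False
    fine_tuning_loop: bool = False
    prefer_dx: bool = False


def pick_vendor(w: Workload) -> str:
    """Encode the decision list; the first matching rule wins."""
    if w.hard_data_residency:
        return "hybrid"  # none of the five on their own
    if w.needs_subsecond_ttft:
        return "Groq" if w.model_on_groq else "Fireworks"
    if w.long_outputs_dominate:
        return "Cerebras"
    if w.cost_per_token_dominates:
        return "Together" if w.committed_volume_discount else "DeepInfra"
    if w.fine_tuning_loop:
        return "Fireworks" if w.prefer_dx else "Together"
    return "DeepInfra"  # widest catalogue, cheapest per token


print(pick_vendor(Workload(needs_subsecond_ttft=True)))
print(pick_vendor(Workload(fine_tuning_loop=True, prefer_dx=True)))
```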

Whichever you pick, build your code against your own thin client interface rather than directly against the vendor's SDK. All five expose OpenAI-compatible endpoints, which makes a one-line base URL switch genuinely possible — but only if your domain code has not absorbed any vendor-specific assumptions. The inference market is moving fast; the team that can move workloads between vendors in an afternoon will compound far ahead of the team that cannot.
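
A thin client interface can be as small as a provider record plus one request builder. The sketch below uses only the standard library and the base URLs and model ids from the snippets earlier in this piece; Together is left out because its base URL was not shown above. The `Provider` and `build_chat_request` names are illustrative, and the `/chat/completions` path is the standard OpenAI-compatible route.

```python
import json
from dataclasses import dataclass
from urllib.request import Request


@dataclass(frozen=True)
class Provider:
    name: str
    base_url: str
    model: str


# Base URLs and model ids as used in the vendor snippets above.
PROVIDERS = {
    "cerebras": Provider("cerebras", "https://api.cerebras.ai/v1", "gpt-oss-120b"),
    "groq": Provider("groq", "https://api.groq.com/openai/v1", "gpt-oss-120b"),
    "deepinfra": Provider("deepinfra", "https://api.deepinfra.com/v1/openai", "openai/gpt-oss-120b"),
    "fireworks": Provider("fireworks", "https://api.fireworks.ai/inference/v1",
                          "accounts/fireworks/models/deepseek-v3p2"),
}


def build_chat_request(provider: Provider, api_key: str, messages: list, **params) -> Request:
    """Build (but do not send) an OpenAI-compatible chat completion request."""
    body = {"model": provider.model, "messages": messages, **params}
    return Request(
        url=f"{provider.base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
        method="POST",
    )


# Domain code names a provider key; nothing else changes between vendors.
req = build_chat_request(PROVIDERS["groq"], "sk-placeholder",
                         [{"role": "user", "content": "Hello"}], max_tokens=64)
print(req.get_method(), req.full_url)
```

Sending the request is then a call to `urllib.request.urlopen(req)` or the HTTP client of your choice. The point of the shape is that moving a workload means changing a dictionary key, not rewriting call sites.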

Primary sources for this piece: deepinfra.com/pricing, groq.com, cerebras.ai, together.ai, and fireworks.ai.