TL;DR: when does self-hosting beat the API in 2026?

Three quick rules, then the tables.

  • Self-host wins when your inference workload is sustained, predictable and large enough to keep at least one B200 above ~50% utilisation 24/7. Cost per million tokens falls towards $0.02 as the GPU approaches full throughput on a small open-weight model.
  • API wins when traffic is bursty, when you swap models more than once a quarter, or when you cannot dedicate a platform engineer to the serving stack.
  • Hybrid wins for everyone else. Run the steady-state slice on a reserved B200; spill the long tail to an API or to TPU spot capacity.

Pro tip

Run the utilisation calculation before the price calculation. A B200 reserved at $2.25/hr ($1,650/month, 24/7) only beats a per-token API once you can keep it meaningfully busy. Idle silicon is the most expensive silicon there is.
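
As a rough sketch of that utilisation-first calculation: the function names below are illustrative, the 730-hour month is an averaging assumption, and the ~110 Mtok/hr peak throughput is the figure this article quotes for a small open-weight model on a B200, not a measurement of your workload.

```python
HOURS_PER_MONTH = 730  # assumption: 8,760 hours per year / 12


def monthly_cost(rate_per_hour: float) -> float:
    """Cost of holding one GPU 24/7 for a month, whatever the load."""
    return rate_per_hour * HOURS_PER_MONTH


def cost_per_mtok(rate_per_hour: float, peak_mtok_per_hour: float,
                  utilisation: float) -> float:
    """Effective $/Mtok when the GPU is only busy `utilisation` of the time."""
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    return rate_per_hour / (peak_mtok_per_hour * utilisation)


print(f"Monthly cost at $2.25/hr: ${monthly_cost(2.25):,.2f}")
for u in (0.95, 0.50, 0.12):
    # 110 Mtok/hr is the article's quoted B200 figure, used here as a placeholder.
    print(f"{u:.0%} utilised: ${cost_per_mtok(2.25, 110, u):.3f}/Mtok")
```

Swap in your own measured throughput before trusting the output; the shape of the curve matters more than the absolute numbers.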

The hardware price ladder

If you are buying outright — colocated cage, on-prem GPU pod, or a tower under a desk for the research team — the 2026 ladder is now well-defined. B200 is the volume part. B300 is the premium part. DGX systems are the full reference platforms NVIDIA expects enterprises to standardise on.

| Part | Configuration | Street price (USD) | Best for |
|---|---|---|---|
| H100 SXM | Single GPU module | ~$27,000–$40,000 | Mature inference stacks already optimised for Hopper |
| B200 | Single GPU module | ~$35,000 | The 2026 default for new inference clusters |
| B300 | Single GPU module | $40,000–$50,000 | Larger memory footprint, premium throughput |
| DGX B300 | 8x B300, fully integrated chassis | ~$300,000–$350,000 | Production inference pods, training fine-tunes |
| DGX Station | B300 + Grace CPU, desktop tower | $80,000–$125,000 | Research workstations, single-team dev pods |
| H200 8-GPU server | Reference 8x H200 | ~$315,000 | Mature, available now from any OEM |

Two things to notice. First, a single B200 at $35k is not very far above a single H100 SXM: the premium is modest and the inference throughput multiplier is several-fold. Second, a DGX B300 lists at roughly 3x a DGX Station but carries eight B300 GPUs, so the price gap is far smaller than the silicon gap. What the larger chassis adds on top of the GPUs is NVIDIA's full-fat networking, NVLink fabric, and the warranty halo that risk-averse procurement teams pay for.

The cloud-rental ladder

If you are not buying (and most teams should not be), the rental market is where the real decisions get made. The May 2026 picture is messy because there are at least three distinct tiers of cloud, each with its own pricing model.

| Provider tier (B200 unless stated) | On-demand | Reserved (12 mo) | Reserved (36 mo) | Notes |
|---|---|---|---|---|
| Hyperscaler (AWS / Azure / GCP) | $10–$14.24/hr | $5.50–$7/hr | $4–$5.50/hr | Predictable, audited, premium |
| Specialised AI cloud (CoreWeave, Lambda, Crusoe) | $4.50–$6/hr | $3.50–$4.50/hr | $2.65–$3.50/hr | Largest pool of Blackwell capacity |
| Neocloud (Thunder, Vast, RunPod tier) | $2.99–$4/hr | $2.50–$3.50/hr | $2.25–$2.65/hr | Cheapest, fewer SLAs, variable availability |
| GB200 NVL72 (full rack) | $7–$8/hr per GPU (term not stated) | | | Frontier-scale training workloads |
| H100 SXM (reference floor) | $2.00–$3.50/hr | $1.80–$2.50/hr | $1.50–$2.20/hr | Plentiful, mature, the price-anchor |

The headline: a B200 reserved at $2.25/hr on a 36-month commit gets you a single dedicated GPU for roughly $1,650 per month. That is the number to keep in your head — because it is the threshold that decides everything that follows.

Watch out

The 36-month commit at $2.25/hr is a contract. If you walk away after 6 months, early-termination charges can leave you with an effective rate of double the headline or worse, depending on how much of the remaining term you still owe. Reserved pricing is only cheap if your roadmap really does need that GPU for three years, which, given how fast the field moves, is a non-trivial bet.
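
To make the early-exit penalty concrete, here is a minimal sketch. The `remainder_owed` parameter is a hypothetical stand-in for whatever your contract's termination clause says (the fraction of the unused months you still pay for); real contracts vary, so treat this as a way to explore scenarios, not a model of any provider's terms.

```python
def effective_hourly_rate(headline_rate: float, term_months: int,
                          months_used: int, remainder_owed: float) -> float:
    """Effective $/hr for the hours actually used after an early exit.

    remainder_owed: assumed fraction (0..1) of the unused months' fees
    that is still payable on termination.
    """
    if not 0 < months_used <= term_months:
        raise ValueError("months_used must be between 1 and term_months")
    if not 0 <= remainder_owed <= 1:
        raise ValueError("remainder_owed must be in [0, 1]")
    unused_months = term_months - months_used
    billed_months = months_used + remainder_owed * unused_months
    return headline_rate * billed_months / months_used


# Walking away from a 36-month commit after 6 months, under three
# hypothetical termination clauses.
for owed in (0.0, 0.2, 1.0):
    rate = effective_hourly_rate(2.25, 36, 6, owed)
    print(f"{owed:.0%} of remainder owed: ${rate:.2f}/hr effective")
```

Under these assumptions, owing a fifth of the remaining term is enough to double the effective rate, and full liability for the unused months multiplies it sixfold.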

Cost per million tokens, plainly

The number that actually matters to your finance team is not $/GPU-hr. It is $/million-tokens-served. NVIDIA's own Blackwell economics analysis puts the gap at roughly $0.02 per million tokens on B200 versus $0.14 on H100 for a representative open-weight 8B-parameter inference workload — a 7× reduction. The intermediate throughput numbers vary by serving engine and batch shape; the per-token cost is the durable comparison.

| Option | $/hour | Throughput (Mtok/hr) | $/Mtok | Notes |
|---|---|---|---|---|
| H100 self-host (reserved) | $2.00 | ~14 | ~$0.14 | Mature, well-tooled |
| B200 self-host (reserved) | $2.25 | ~110+ | ~$0.02 | 7x cheaper per token than H100 |
| B200 self-host (on-demand) | $5.00 | ~110+ | ~$0.045 | Still beats H100 reserved |
| Anthropic Claude Sonnet API | | | $3.00 in / $15 out | Frontier quality, no ops |
| OpenAI GPT-5.5 API | | | $2.50 in / $10 out | Frontier quality, no ops |
| Gemini 3.5 Flash API | | | $0.30 in / $2.50 out | Cheap API floor |

Two important caveats. First, the $0.02/Mtok number is for an 8B open-weight model — not a frontier model. If the workload genuinely needs Claude or GPT-class quality, you are not actually choosing between $0.02 and $3.00; you are choosing between two different products. Second, even Gemini 3.5 Flash at $0.30/$2.50 is well above a self-hosted Qwen-class open-weight model on B200 — but the API has zero ops, zero capacity planning, and zero idle-GPU risk.
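
API prices are quoted separately for input and output tokens, so comparing them with a single self-host $/Mtok figure needs a blend. The sketch below uses the list prices from the table above; the 25% output share is purely an assumption for illustration, and your own prompt-to-completion ratio will move the result considerably.

```python
def blended_api_cost_per_mtok(input_price: float, output_price: float,
                              output_share: float) -> float:
    """Weighted $/Mtok given the share of all tokens that are output tokens."""
    if not 0 <= output_share <= 1:
        raise ValueError("output_share must be in [0, 1]")
    return input_price * (1 - output_share) + output_price * output_share


# (input $/Mtok, output $/Mtok) as listed in the table above.
API_PRICES = {
    "Claude Sonnet": (3.00, 15.00),
    "GPT-5.5": (2.50, 10.00),
    "Gemini 3.5 Flash": (0.30, 2.50),
}

OUTPUT_SHARE = 0.25  # assumption: one output token for every three input tokens

for name, (price_in, price_out) in API_PRICES.items():
    blended = blended_api_cost_per_mtok(price_in, price_out, OUTPUT_SHARE)
    print(f"{name}: ${blended:.2f}/Mtok blended")
```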

When self-host clearly wins

If you are serving more than ~500M tokens per month of an open-weight model you have already validated, on traffic that does not vary more than 3x between peak and trough, B200 self-hosted on a 12-month reserve will be 5–10x cheaper than any frontier API. That is the case worth building infrastructure for.
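
One way to sanity-check a threshold like this is to compute the monthly volume at which a reserved GPU's fixed cost equals what the same tokens would cost on an API. The sketch below is GPU rental only: it deliberately ignores staffing and assumes the volume fits on the GPUs you reserve. The $2.50/hr rate is the low end of the neocloud 12-month range quoted earlier, and the $6.00/Mtok API figure is a hypothetical blended price, not a quote.

```python
HOURS_PER_MONTH = 730  # assumption: 8,760 hours per year / 12


def break_even_mtok_per_month(gpu_rate_per_hour: float,
                              api_cost_per_mtok: float,
                              gpus: int = 1) -> float:
    """Monthly volume (in Mtok) above which the reserved GPUs cost less
    than paying the API per token. Excludes engineering time."""
    if api_cost_per_mtok <= 0:
        raise ValueError("api_cost_per_mtok must be positive")
    monthly_gpu_cost = gpu_rate_per_hour * HOURS_PER_MONTH * gpus
    return monthly_gpu_cost / api_cost_per_mtok


volume = break_even_mtok_per_month(gpu_rate_per_hour=2.50, api_cost_per_mtok=6.00)
print(f"Break-even: about {volume:,.0f} Mtok per month")
```

A cheaper API pushes the break-even volume up in direct proportion, which is why the comparison against a low-cost API tier looks very different from the comparison against a frontier one.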

The self-host vs API decision matrix

Combine workload type, team size, and throughput, and you get a defensible recommendation. We have checked this matrix against roughly two dozen Indian and UK teams that have run the experiment in the last six months.

| Workload | Team size | Throughput | Recommendation |
|---|---|---|---|
| Always-on, predictable | 10+ engineers, 1 platform lead | >500M tok/month | Self-host on B200 reserved |
| Always-on, predictable | <5 engineers, no platform lead | >500M tok/month | API, or managed inference (Together, Fireworks) |
| Bursty (10x peak-to-trough) | Any | Any | API: idle silicon kills the maths |
| Mixed (70% steady, 30% bursty) | 10+ engineers | >1B tok/month | Hybrid: reserved B200 baseline, API spillover |
| Research / experimental | Any | <100M tok/month | API: the model will change anyway |
| Data sovereignty / regulated | 10+ engineers, security lead | Any | Self-host, often on-prem rather than cloud |
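
The matrix can be encoded as a small lookup, which is handy if you want to run it over a portfolio of workloads. This is a sketch: the `Workload` fields and pattern labels are names invented here, and any combination the matrix does not list (a six-person team, say) is reported as uncovered instead of guessed at.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    pattern: str            # "steady", "bursty", "mixed", "research" or "regulated"
    engineers: int
    mtok_per_month: float   # millions of tokens served per month
    has_platform_lead: bool = False
    has_security_lead: bool = False


def recommend(w: Workload) -> str:
    """Return the matrix row that matches, or say that none does."""
    if w.pattern == "bursty":
        return "API"
    if w.pattern == "research" and w.mtok_per_month < 100:
        return "API"
    if w.pattern == "regulated" and w.engineers >= 10 and w.has_security_lead:
        return "Self-host, often on-prem"
    if w.pattern == "mixed" and w.engineers >= 10 and w.mtok_per_month > 1000:
        return "Hybrid: reserved B200 baseline, API spillover"
    if w.pattern == "steady" and w.mtok_per_month > 500:
        if w.engineers >= 10 and w.has_platform_lead:
            return "Self-host on B200 reserved"
        if w.engineers < 5 and not w.has_platform_lead:
            return "API or managed inference"
    return "Not covered by the matrix"


print(recommend(Workload("steady", engineers=12, mtok_per_month=800, has_platform_lead=True)))
print(recommend(Workload("steady", engineers=3, mtok_per_month=800)))
print(recommend(Workload("steady", engineers=6, mtok_per_month=800)))
```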

When you absolutely should NOT self-host yet

The cost-per-token table is seductive. It is also, on its own, deeply misleading. Here are the costs the table does not show.

  1. The platform-engineering burn. A serving stack (vLLM, SGLang or TensorRT-LLM, plus a load balancer, an autoscaler and observability) is a 2–4 engineer-month build the first time, and a 0.3–0.5 FTE forever after. At a UK senior MLE loaded cost of £100k per year, that is £30–50k per year, every year, before you have served a token.
  2. The model swap-out cost. Every time a better open-weight model lands — and they land every six weeks now — the team has to re-benchmark, re-quantise, re-tune the serving config and re-run regression evals. Teams using APIs change the model string and move on. Teams self-hosting spend a week.
  3. The cold-start operational tax. GPU drivers, NVLink topology, NCCL tuning, kernel versions, container drift, the eternal CUDA-vs-PyTorch version dance. None of this is hard, exactly. All of it is a steady drumbeat of small fires. If your platform team is fewer than three people, this drumbeat is the entire job.
  4. Bursty traffic + reserved GPU = burn. A B200 at 12% utilisation costs the same as a B200 at 95% utilisation. The only thing that changes is your $/Mtok, which goes up by 8x.

Do not self-host

If your traffic is unpredictable, if your model will change in the next quarter, if your platform team is one person, or if your finance team will not commit to a 12-month reserve — stay on an API. The $0.02/Mtok number on the table is a number for steady, large, predictable workloads only. Misapply it and you will burn through six months of runway on idle silicon.
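
To see how the hidden costs listed above change the picture, fold the ongoing platform-engineering time into the monthly bill. In this sketch the 0.3 FTE and £100k loaded cost come from the list above, while the exchange rate is a placeholder assumption that you should replace with a current figure; build-time and model-swap costs are left out entirely.

```python
HOURS_PER_MONTH = 730  # assumption: 8,760 hours per year / 12


def monthly_all_in_cost_usd(gpu_rate_per_hour: float, fte_fraction: float,
                            loaded_salary_gbp: float, usd_per_gbp: float,
                            gpus: int = 1) -> float:
    """GPU rental plus the ongoing share of an engineer, per month, in USD."""
    gpu_cost = gpu_rate_per_hour * HOURS_PER_MONTH * gpus
    staff_cost = fte_fraction * loaded_salary_gbp / 12 * usd_per_gbp
    return gpu_cost + staff_cost


def all_in_cost_per_mtok(monthly_cost_usd: float, mtok_per_month: float) -> float:
    """Spread the all-in monthly cost over the tokens actually served."""
    if mtok_per_month <= 0:
        raise ValueError("mtok_per_month must be positive")
    return monthly_cost_usd / mtok_per_month


USD_PER_GBP = 1.25  # placeholder exchange rate, not a quoted figure

monthly = monthly_all_in_cost_usd(2.25, fte_fraction=0.3,
                                  loaded_salary_gbp=100_000,
                                  usd_per_gbp=USD_PER_GBP)
print(f"All-in monthly cost: ${monthly:,.2f}")
for volume in (500, 5_000, 50_000):
    print(f"{volume:>6,} Mtok/month: ${all_in_cost_per_mtok(monthly, volume):.3f}/Mtok")
```

At low volumes the staffing line dominates the GPU line, which is the point the list above is making.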

What the cost picture looks like by Q4 2026

The supply side of this story is the part most procurement decks miss. TSMC is mid-way through ramping Blackwell production at CoWoS-L; the bottleneck has been advanced packaging, not silicon. As that packaging capacity comes online through Q3 and Q4, two things happen.

First, B200 on-demand pricing at the major specialised AI clouds is expected to stabilise in the $2.50–$3.00/hr range, a drop of roughly 45–50% from the $4.50–$6/hr May 2026 on-demand levels. Second, the 36-month reserved floor will tighten further; expectations from procurement teams we have spoken to put it at $1.90–$2.10/hr by year-end on neocloud tier.

The implication for your buy-versus-rent decision is straightforward: do not buy B200 in May 2026 if you can wait six months. The street price will hold; the rental price will fall. Renting becomes increasingly the dominant choice as Blackwell supply normalises.
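
A quick payback calculation shows why a falling rental price weakens the case for buying. This sketch compares the ~$35,000 B200 street price with the rental rates discussed above; it assumes the purchased card would otherwise be rented 24/7, and it ignores power, hosting, the rest of the server, and resale value, all of which a real comparison would need.

```python
HOURS_PER_MONTH = 730  # assumption: 8,760 hours per year / 12


def payback_months(purchase_price: float, rental_rate_per_hour: float,
                   rented_hours_per_month: float = HOURS_PER_MONTH) -> float:
    """Months of rental spend needed to equal the purchase price."""
    if rental_rate_per_hour <= 0 or rented_hours_per_month <= 0:
        raise ValueError("rate and hours must be positive")
    return purchase_price / (rental_rate_per_hour * rented_hours_per_month)


# $2.25/hr is today's reserved floor; $2.00/hr sits inside the projected
# year-end range of $1.90-$2.10/hr.
for rate in (2.25, 2.00):
    print(f"${rate:.2f}/hr: {payback_months(35_000, rate):.1f} months to match purchase price")
```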

The implication for your self-host-versus-API decision is subtler. As B200 rental drops to $2.50–$3.00/hr on-demand, the API floor — already at Gemini Flash's $0.30/$2.50 — will be defended by the frontier labs cutting prices. We expect another round of API price cuts before Christmas, particularly from Google and the open-weight commercial labs. The cost gap between self-host and API narrows; the operational gap does not.

Serving-engine choice changes the maths more than you expect

One subtlety that finance teams almost always miss: the cost-per-token on a B200 is a strong function of which serving engine you run on top of it. TensorRT-LLM, vLLM and SGLang are not interchangeable.

  • TensorRT-LLM typically wins on raw tokens-per-second for FP8 inference on Blackwell, given that NVIDIA tunes it themselves. Early Blackwell-era benchmarks reported by NVIDIA's own performance team and independent neoclouds suggest a meaningful throughput edge over vLLM on well-shaped FP8 workloads — typically described in the 20–40% range — though the exact number is highly workload-dependent.
  • vLLM wins on flexibility, model coverage and the fastest path from "new open-weight model dropped" to "serving it in production". Throughput is competitive, not best-in-class.
  • SGLang wins on structured-output and tool-calling workloads. The RadixAttention prefix cache is materially better for repeated-prompt patterns.

The same B200 can swing $/Mtok by a factor of two depending on which engine you pick. That is a bigger lever than choosing between two different GPU rental providers. See our companion piece on serving engine choice when self-hosting for the full benchmark.
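
Because the hourly rate is fixed, any throughput gain translates directly into a lower cost per token. The sketch below shows only that arithmetic: the gains passed in are illustrative inputs, not benchmark results for any engine.

```python
def relative_cost_per_mtok(throughput_gain: float) -> float:
    """Cost per Mtok relative to baseline after a fractional throughput gain.

    A gain of 0.25 means 25% more tokens per hour on the same GPU at the
    same hourly rate.
    """
    if throughput_gain <= -1:
        raise ValueError("throughput_gain must be greater than -1")
    return 1 / (1 + throughput_gain)


for gain in (0.20, 0.40, 1.00):
    saving = 1 - relative_cost_per_mtok(gain)
    print(f"+{gain:.0%} throughput: {saving:.0%} lower $/Mtok")
```

Seen this way, a 20–40% throughput edge is a 17–29% cost reduction; a full factor-of-two swing in $/Mtok implies one configuration delivering twice the throughput of the other.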

Indian and UK Builder takeaway

The B200 economics story plays differently in the two markets we cover, and the difference is worth being honest about.

For Indian shops, the rupee-denominated cost pressure is the dominant variable. A B200 reserved at $2.25/hr is roughly ₹1.4 lakh per month per GPU — a serious commitment for a sub-50-person studio, but a transformational unit-cost shift for any team running sustained open-weight inference. The opportunity is real: combine a reserved B200 with a self-hosted Qwen-class coding model or a sovereign Indian-language model, and your cost-per-customer for AI features drops by a factor of 10. Indian neoclouds — including the early Bharat-stack offerings — are pricing B200 aggressively to attract this cohort. Expect rupee-denominated reserved rates to look favourable through 2026.

For UK shops, the new variable is the Sovereign AI Fund and the Isambard-AI / DRI compute allocations. If your team qualifies, you may be able to access H100 / H200 / B200 capacity through national infrastructure at rates well below the commercial neocloud floor. The catch is access friction: the application processes are real, the queues are real, and the constraints on commercial use are real. For research-adjacent workloads and pre-product startups, the public compute route is genuinely competitive with the cheapest neocloud. For purely commercial production inference, the commercial market is still the right answer — but it is worth running the comparison.

The cross-market takeaway is the same: rent before you buy, reserve only what you can keep above 50% utilised, and do not let the $0.02/Mtok headline number seduce you into an infrastructure project your team is not sized for.

Source pricing references: Thunder Compute B200 pricing, Thunder Compute GPU rental market trends, Inworld B200 GPU cloud, IntuitionLabs pricing guide, GMI Cloud inference cost, NVIDIA Perspectives, Tech Insider Blackwell pricing.