The headline numbers, and the trap
Two numbers landed on the desks of AI infra leads this week and they appear, at first glance, to contradict each other. The first comes from NVIDIA and SemiAnalysis: a GB300 NVL72 system running agentic AI workloads delivers up to 35× lower cost-per-token than the Hopper generation it replaces, alongside 50× higher throughput per megawatt. The second comes from the cloud price boards: on-demand B300 capacity has risen from roughly $5.00/hr in November 2025 to $9.16/hr on certain providers — an 83% jump in six months.
If the chip is so much cheaper to run, why has renting one become so much more expensive? That question sits at the centre of every infrastructure decision being taken in Bengaluru, Hyderabad, London and Manchester right now. Cost-per-token is a measurement of the silicon at full saturation. Cost-per-hour is a measurement of the market. The two agree only if you, the buyer, organise your workload to keep the GPU at the utilisation level where the 35× number actually shows up. Most teams will not.
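The gap between the two measurements can be made concrete with a small sketch. The function name and the 1,000 tokens-per-second peak throughput are illustrative assumptions, not measured B300 figures; only the $9.16/hr rate comes from the price boards.

```python
def cost_per_million_tokens(hourly_rate_usd, peak_tokens_per_sec, utilisation):
    """Effective cost of one million tokens on a GPU billed by the hour.

    peak_tokens_per_sec is throughput at full saturation (hypothetical here);
    utilisation is the fraction of each billed hour spent serving requests.
    """
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    useful_tokens_per_hour = peak_tokens_per_sec * 3600 * utilisation
    return hourly_rate_usd / useful_tokens_per_hour * 1_000_000


# Same GPU, same hourly price, different utilisation.
saturated = cost_per_million_tokens(9.16, 1000, 1.0)
mostly_idle = cost_per_million_tokens(9.16, 1000, 0.25)
print(f"saturated: ${saturated:.2f}/M tokens")
print(f"25% busy:  ${mostly_idle:.2f}/M tokens")
```

The hourly bill does not change between the two calls; only the number of useful tokens it is spread over does, which is why the per-token cost scales inversely with utilisation.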
This piece is the maths. We will walk through what B300 actually costs to own and to rent, where the 35× figure comes from, the decision framework for when Blackwell Ultra saves you money and when it bleeds you, two worked examples — one Indian SaaS, one UK fintech — and a practical migration plan from Hopper.
Lock in a reserved B300 contract this quarter. The 83% on-demand spike from $5/hr to $9.16/hr is the cloud providers signalling persistent supply tightness. Reserved capacity at roughly $3.40/hr is currently being negotiated against the old baseline; once procurement teams catch up, that gap will close. Buy the option now, exercise it through 2027.
What B300 actually costs to own — list, DGX, station
B300 is NVIDIA's Blackwell Ultra GPU — the second-half-of-Blackwell refresh that pushes HBM3e capacity, FP4 throughput and NVLink scale-out further than B200, and is the silicon underneath the GB300 NVL72 system the hyperscalers are now racking. There are three ways to own one.
The first is the loose GPU itself. List pricing sits in the $40,000–$50,000 band per unit, a meaningful premium over B200's roughly $35,000 street price. In practice no one outside an OEM buys a single B300 — supply is allocated to system integrators — so the list price functions as a reference for amortisation maths rather than a real purchase route.
The second is the DGX B300 system: eight B300 GPUs, ConnectX-8 networking, the full NVLink fabric, in a single 8U chassis. List sits in the $300,000–$350,000 band depending on RAM and storage spec. This is the production unit for any team that intends to run inference at scale on-premises or in colocation.
The third is the DGX Station — a tower form factor pairing a single B300 with a Grace CPU, designed for an AI engineer's desk rather than a data centre rack. List $80,000–$125,000. It exists for two markets: research labs that want a personal training box, and enterprises that want a pilot rig before committing to a full DGX.
Hardware cost — B300 vs B200 vs H100 (street estimates per GetDeploying / Tech-insider)
| Form factor | B300 (Blackwell Ultra) | B200 (Blackwell) | H100 (Hopper) |
|---|---|---|---|
| Single GPU (list) | $40,000–$50,000 | ~$35,000 | ~$25,000 |
| 8-GPU DGX system | $300,000–$350,000 | ~$275,000 | ~$200,000 |
| Workstation / Station | $80,000–$125,000 | n/a (DGX-only) | n/a |
| Cost per PFLOP (FP8/FP4 mixed) | Lower (improved over B200) | ~$4,167 | ~$5,053 |
The cost-per-PFLOP line is the one that matters. B200 already delivered a 17.5% improvement on H100 ($4,167 vs $5,053). B300 nudges that further — but the real gain is not raw FLOPs, it is FP4 throughput plus HBM3e bandwidth, which is precisely the combination that long-context agentic inference is bottlenecked on. The headline 35× cheaper-per-token figure is essentially the cost-per-PFLOP improvement multiplied by the agentic-workload memory-bandwidth advantage, integrated over a full NVL72 rack.
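The 17.5% figure follows directly from the two table values, as this short check shows. The helper name is an assumption made for illustration.

```python
def percent_improvement(old_cost, new_cost):
    """Percentage by which new_cost undercuts old_cost."""
    return (old_cost - new_cost) / old_cost * 100


h100_cost_per_pflop = 5053  # USD, from the table
b200_cost_per_pflop = 4167  # USD, from the table

improvement = percent_improvement(h100_cost_per_pflop, b200_cost_per_pflop)
print(f"B200 vs H100 cost per PFLOP: {improvement:.1f}% lower")
```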
What B300 actually costs to rent — on-demand vs reserved vs spot
Cloud pricing for B300 has bifurcated dramatically. The headline rates from providers tracked by GetDeploying and ComputePrices look like this.
B300 cloud pricing — May 2026
| Tier | Per-GPU rate | Notes |
|---|---|---|
| On-demand (recent baseline) | $4.50–$5.80/hr | Median across smaller GPU specialists |
| On-demand (current spike, certain providers) | $9.16/hr | Up 83% from $5.00/hr (November 2025) |
| Reserved (12-month) | ~$3.40/hr | Subject to commit volume; multi-year lower |
| Spot / interruptible | From $4.59/hr | Limited capacity; agentic-loop unfriendly |
| November 2025 baseline | $5.00/hr | For comparison only |
The maths to internalise: the on-demand spike is currently about 2.7× the reserved rate ($9.16/hr against roughly $3.40/hr), and the spread is wider than at any point in the H100 era. Whether your provider charges $4.50/hr or $9.16/hr depends almost entirely on how desperate the next ten customers in their queue are. Both vendors and end users are reporting supply constraints, and the cloud price is the visible artefact of that constraint.
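For readers who want to rerun the spread with their own quotes, here is a minimal sketch using the per-GPU rates from the table. The 30-day, 24/7 month is an assumption made to turn hourly rates into a monthly figure.

```python
HOURS_PER_MONTH = 24 * 30  # assumed 30-day month, GPU held around the clock

# Per-GPU hourly rates from the pricing table (USD).
rates = {
    "on-demand, November 2025": 5.00,
    "on-demand, current spike": 9.16,
    "reserved, 12-month": 3.40,
    "spot, from": 4.59,
}

spike = rates["on-demand, current spike"]
spike_increase_pct = (spike / rates["on-demand, November 2025"] - 1) * 100
spike_vs_reserved = spike / rates["reserved, 12-month"]

print(f"on-demand increase since November: {spike_increase_pct:.0f}%")
print(f"spike rate / reserved rate:        {spike_vs_reserved:.1f}x")
for tier, rate in rates.items():
    print(f"{tier:26s} ${rate * HOURS_PER_MONTH:,.0f}/month")
```

Swapping in a provider's actual quote for any of the four entries shows immediately how far it sits from the reserved rate.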
The FRT (fault-tolerant runtime) software stack for B300 is still maturing. TensorRT-LLM, NIM containers and the GB300-specific CUDA primitives have shipped meaningful behaviour changes month-to-month through Q1 2026. OEM supply for non-hyperscaler buyers remains thin — quoted lead times of 16–24 weeks are common. Build a 90-day test programme on rented B300 capacity before signing any multi-year colocation deal.
Where 35× cheaper tokens comes from
The SemiAnalysis InferenceX dataset that NVIDIA cites in its agentic-AI economics post is a careful bit of work. The 35× figure is not a marketing flourish: it is the product of three compounding effects when a GB300 NVL72 rack is fully utilised on agentic workloads.
First, raw FP4 throughput. Blackwell Ultra raises dense FP4 throughput, so a single B300 delivers a meaningful uplift over B200 on FP4-quantised workloads, per NVIDIA's published B300 spec sheet. In agentic inference, where every step is a forward pass, that uplift is collected on every step.
Second, HBM3e bandwidth. Long-context agentic loops — the sort of multi-turn, tool-calling workloads that GPT-5.5 and Claude Opus 4.7 enable — spend most of their wall-clock time moving KV cache around, not crunching matmuls. B300's memory bandwidth uplift turns what was a memory-bound workload on H100 into a compute-bound one on B300, which is the regime where silicon advantage actually shows up.
Third, NVLink scale-out at the rack level. The GB300 NVL72 lets 72 B300s behave as a single 72-way coherent accelerator, which means very large models — and very large agent populations — can run without going off-rack. Off-rack hops are the thing that murders Hopper-era inference economics.
Throughput per megawatt and cost-per-token (qualitative)
| Generation | Throughput / MW (vs H100 baseline) | Cost-per-token (vs H100 baseline) | Best regime |
|---|---|---|---|
| H100 (Hopper) | 1.0× | 1.0× | General-purpose inference |
| B300 (Blackwell Ultra) | Up to 50× | Up to 35× cheaper | Agentic AI, long-context, NVL72 rack-scale |
The honest caveat: the 35× and 50× figures are upper bounds. Real workloads sit somewhere between H100 baseline and B300 ceiling depending on context length, batch size, quantisation and rack topology. A back-of-envelope rule we use: a moderately optimised B300 deployment delivers 8–15× cost-per-token improvement over H100 in production. That is still the largest single-generation jump in GPU economics in a decade.
When B300 saves you money — and when it doesn't
The decision framework comes down to four numbers about your workload: utilisation, context length, batch size, and the burstiness of demand.
B300 saves you money when (a) your GPU runs at 60% or higher utilisation, (b) average context length is over 32k tokens, (c) you can batch at least 16 requests in flight, and (d) demand is roughly steady week-to-week. This is the regime where the 35×-cheaper-tokens figure starts to manifest. It describes most production LLM inference for serious SaaS and most agentic platform workloads.
B300 costs you money when (a) utilisation drops below 35%, (b) context is short and bursty (chatbots, autocomplete), (c) you cannot batch above four in flight, or (d) demand is so spiky you must provision for peak. Here, B300 on-demand at $9.16/hr — even at 35× efficiency — loses to a fleet of H100s at $2/hr that you can scale down to zero. Idle Blackwell Ultra is the most expensive way to run nothing.
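The two regimes above can be sketched as a simple screening function. The class, field names and the "benchmark first" label for workloads that fall between the two regimes are assumptions made for illustration; the thresholds are the ones stated in the text. The text gives no numeric cut-off for "short" context, so that condition is left out.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    utilisation: float             # fraction of billed hours doing work, 0..1
    avg_context_tokens: int        # average prompt length in tokens
    max_batch_in_flight: int       # concurrent requests you can batch
    steady_demand: bool            # roughly flat week-to-week
    must_provision_for_peak: bool  # demand so spiky you size for the peak


def b300_verdict(w: Workload) -> str:
    # Any single red flag puts the workload in the "costs you money" regime.
    if (w.utilisation < 0.35
            or w.max_batch_in_flight <= 4
            or w.must_provision_for_peak):
        return "costs money"
    # All four green flags are needed for the "saves you money" regime.
    if (w.utilisation >= 0.60
            and w.avg_context_tokens > 32_000
            and w.max_batch_in_flight >= 16
            and w.steady_demand):
        return "saves money"
    # Hypothetical middle band: the text does not rule either way.
    return "benchmark first"


doc_pipeline = Workload(0.70, 48_000, 32, True, False)
chatbot = Workload(0.20, 2_000, 4, False, True)
print(b300_verdict(doc_pipeline), "|", b300_verdict(chatbot))
```

A screen like this is only a first filter; a workload that lands in the middle band needs a measured benchmark rather than a rule.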
The dual-market lens matters here. A cost-sensitive Indian SaaS optimising for unit-economics margin should treat B300 as a precision instrument: deploy it on the steady, high-margin slice of the workload (long-context document processing, agentic flows) and keep H200 or B200 as the elastic peak-handler. A UK fintech optimising for peak-burst — say, a compliance review pipeline that runs 4 hours a day across the month-end — should rent B300 reserved-hourly rather than on-demand, and run synthetic load through it overnight to amortise the slot.
Worked example: an Indian SaaS running 50M tokens/day
Take a realistic Indian B2B SaaS. Daily inference load is 50 million tokens, split roughly 70% input (32k average context, document-heavy) and 30% output (long-form summarisation). Steady demand across business hours, lighter overnight. Today they run on H100 reserved at roughly $2.20/hr (reserved-rate benchmark per GetDeploying H100 listings), with two GPUs and 75% combined utilisation, costing ~$1,584/month per GPU, ~$3,168/month total.
Migrate the workload to a single B300 reserved at $3.40/hr. Throughput-per-GPU on this exact context-length-plus-batch profile is roughly 10× the H100 baseline (well within the 8–15× rule of thumb above). One B300 absorbs the load at 65% utilisation, leaving 35% headroom for growth. Monthly cost: $3.40 × 24 × 30 = $2,448/month. That is a roughly 23% reduction on infrastructure cost against the $3,168 H100 baseline, while widening headroom from 25% to 35% and halving p99 latency on long-context calls.
If they had instead rented B300 on-demand at the $9.16/hr spike: $9.16 × 24 × 30 = $6,595/month. More expensive than the H100 baseline, despite the 10× silicon advantage. The lesson: it is the contract, not the silicon, that determines whether you win.
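The three monthly figures in this example can be reproduced with a few lines, using the article's rates and its 30-day month. The helper name is an assumption, and the break-even rate at the end is derived here rather than quoted from any provider.

```python
HOURS_PER_MONTH = 24 * 30


def monthly_cost(hourly_rate_usd, gpus=1, hours=HOURS_PER_MONTH):
    """Cost of holding `gpus` GPUs for `hours` at a flat hourly rate."""
    return hourly_rate_usd * hours * gpus


h100_baseline = monthly_cost(2.20, gpus=2)   # two reserved H100s
b300_reserved = monthly_cost(3.40)           # one reserved B300
b300_on_demand = monthly_cost(9.16)          # one B300 at the spike rate

saving_pct = (h100_baseline - b300_reserved) / h100_baseline * 100
# Hourly B300 rate at which one B300 costs exactly what two H100s cost.
break_even_rate = h100_baseline / HOURS_PER_MONTH

print(f"H100 baseline:  ${h100_baseline:,.0f}/month")
print(f"B300 reserved:  ${b300_reserved:,.0f}/month ({saving_pct:.1f}% lower)")
print(f"B300 on-demand: ${b300_on_demand:,.0f}/month")
print(f"break-even B300 rate: ${break_even_rate:.2f}/hr")
```

The break-even line is the useful one in a negotiation: under these assumptions, any B300 quote above it makes the migration cost more than staying on Hopper.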
Worked example: a UK fintech running batch RAG every 4 hours
A London-based fintech runs a RAG-over-regulation pipeline every four hours: pull the new EU AI Act, FCA, and Bank of England updates, embed, retrieve, summarise into a structured compliance report. Each run takes 90 minutes on current H100 hardware. Six runs a day, nine GPU-hours/day, 270 GPU-hours/month. The rest of the day, the GPU sits idle.
On H100 reserved at $2.20/hr full-time (reserved-rate benchmark per GetDeploying H100 listings): $2.20 × 24 × 30 = $1,584/month, but actual utilisation is only 270/720 = 37.5%. Effective rate per useful hour: $5.87.
Move to B300 reserved-hourly at $3.40/hr, paying only for the 270 useful hours: $3.40 × 270 = $918/month. We estimate, conservatively, that the B300 roughly halves the runtime; the 8–15× cost-per-token rule of thumb above need not translate into an equal wall-clock gain on a pipeline that also embeds and retrieves. Each batch then finishes in approximately 45 minutes and you pay for around 135 hours: $459/month. Against the $1,584 H100 baseline, that is an estimated 71% saving on cost and a 2× speed-up on cycle time, which means compliance reports land within the regulator-relevant window.
The same workload on on-demand B300 at $9.16/hr × 135 hours = $1,237/month — still cheaper than the H100 baseline, but the reserved-hourly contract is a clear winner if your provider offers it. Negotiate this aggressively.
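The fintech numbers can be checked the same way. The variable names are illustrative, and the 2× speed-up is the article's estimate rather than a benchmark result.

```python
HOURS_PER_MONTH = 24 * 30
DAYS = 30

runs_per_day = 6
h100_hours_per_run = 1.5   # 90 minutes per run on H100
b300_speedup = 2.0         # estimated, not measured

h100_busy_hours = runs_per_day * h100_hours_per_run * DAYS
h100_monthly = 2.20 * HOURS_PER_MONTH            # reserved, billed 24/7
h100_utilisation = h100_busy_hours / HOURS_PER_MONTH
h100_effective_rate = h100_monthly / h100_busy_hours

b300_busy_hours = h100_busy_hours / b300_speedup
b300_reserved_hourly = 3.40 * b300_busy_hours    # pay only for busy hours
b300_on_demand = 9.16 * b300_busy_hours
saving_pct = (h100_monthly - b300_reserved_hourly) / h100_monthly * 100

print(f"H100: ${h100_monthly:,.0f}/month at {h100_utilisation:.1%} utilisation "
      f"(${h100_effective_rate:.2f} per useful hour)")
print(f"B300 reserved-hourly: ${b300_reserved_hourly:,.0f}/month ({saving_pct:.0f}% lower)")
print(f"B300 on-demand:       ${b300_on_demand:,.0f}/month")
```

Changing `b300_speedup` shows how sensitive the saving is to that single estimate, which is the number worth measuring first on rented capacity.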
What to negotiate with your cloud provider this quarter
The 83% on-demand spike is your single best piece of leverage in a procurement conversation. Three asks to bring to the table.
First, reserved-hourly billing. Standard reserved contracts assume 24/7 utilisation. For most enterprise workloads, that is fictional. Push for the right to commit hours-per-month rather than capacity-blocks. Some specialist providers will agree; the hyperscalers will resist but can be moved on multi-year deals.
Second, price-protection clauses. With on-demand having moved 83% in six months, a reserved contract without a price-cap guarantee is risky. Ask for a clause that holds your committed rate for 12 months and reopens it only if the provider's published list price drops by 20% or more.
Third, cross-generation portability. Reserved contracts that lock you to a specific SKU for two years are a trap when B400 lands. Negotiate the right to move your committed spend to whichever Blackwell-or-later generation the provider offers at the time of consumption, with a published conversion ratio.
Three caveats: supply, lock-in, FRT software stack churn
Three things will determine whether your B300 economics survive contact with reality.
Supply. Both vendors and end users are reporting constraints. NVIDIA's allocation favours hyperscalers and Tier-1 system integrators; if you are buying through a smaller channel, expect 16–24 week lead times and upward price pressure. Plan procurement on a six-month horizon, not a six-week one.
Lock-in. The Blackwell software stack (TensorRT-LLM, NIM, the GB300-specific CUDA primitives) is far enough ahead of AMD and Intel that there is, in practice, no alternative today. This is not a complaint; it is a fact to plan around. Build your inference layer on vendor-portable serving abstractions (vLLM, TGI, Triton) so that if 2027 brings a credible MI400 or Gaudi-4 alternative, you can move.
FRT software stack churn. The fault-tolerant runtime, the scheduler primitives, the NCCL configurations for NVL72 — all of this has shipped breaking changes through Q1 2026 and will continue to. Pin versions in production. Run a parallel staging rack on the bleeding-edge stack. Do not let your provider dictate the upgrade cadence.
A practical migration plan (H100/H200 → B200 → B300)
Three-stage migration, executed over six to nine months for most teams.
Stage 1 — validate on B200 (months 1–3). Move one production model and one staging environment to B200, ideally on rented hourly capacity. Measure end-to-end p50 / p95 / p99 latency, tokens-per-second-per-watt, and cost-per-token against your H100 baseline. Iron out the TensorRT-LLM and quantisation issues here. B200 has roughly six months more software-stack maturity than B300, so the bugs you hit on B200 are bugs you would also hit on B300, plus more. Catch them on B200.
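The Stage 1 measurements can be collected with nothing more than the Python standard library. This is a minimal sketch: the function names are assumptions, and the sample latencies and power figures are synthetic placeholders, not B200 or H100 measurements.

```python
import statistics


def latency_percentiles(latencies_ms):
    """Return p50/p95/p99 from per-request latencies (needs at least 2 samples)."""
    # n=100 yields 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def tokens_per_second_per_watt(tokens, seconds, avg_power_watts):
    """Energy efficiency of a run: generated tokens per second, per watt drawn."""
    return tokens / seconds / avg_power_watts


# Synthetic latencies in milliseconds, for illustration only.
sample_latencies = list(range(1, 101))
stats = latency_percentiles(sample_latencies)
efficiency = tokens_per_second_per_watt(tokens=1_200_000, seconds=600,
                                        avg_power_watts=1000)

print(stats)
print(f"{efficiency:.2f} tokens/s/W")
```

Running the same script against the H100 baseline and the B200 candidate, with identical prompts and batch settings, gives a like-for-like comparison before any contract is signed.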
Stage 2 — steady-state on B300 reserved (months 4–6). Once B200 is clean, move the workload that benefits most from long context and agentic execution (your highest-margin or most latency-sensitive pipeline) to B300 on a 12-month reserved contract. Keep B200 capacity in parallel as both fallback and elastic peak-handler. This is also the window in which to negotiate the contract terms above.
Stage 3 — retire Hopper (months 7–9). Wind down H100/H200 reservations as they expire. Do not rush this — Hopper is a perfectly good elastic-burst tier for at least another year, and the residual value of your existing reservations is non-trivial. By month nine you should have B300 carrying the steady-state agentic workload, B200 carrying mid-context inference and elastic peaks, and Hopper retired or sold to a regional cloud.
The teams that win the next 12 months of inference economics will be the ones that treat B300 as a precision tool rather than a wholesale replacement. The 35× figure is real. So is the $9.16/hr trap. Pick which side of that gap you want to be on.
Primary references: NVIDIA's Blackwell Ultra agentic-AI economics post, Spheron's B300 specs and pricing guide, GetDeploying's B300 cloud pricing comparison, ComputePrices' GB300 tracker, Tech-insider's Blackwell pricing roundup, Silicon Data's 2026 GPU pricing trends, and GMI Cloud's cost-of-inference primer.
For the model side of this equation, see our coverage of GPT-5.5's API economics, DeepSeek V4's cost-effective frontier approach, Google TurboQuant's 6× KV-cache compression, and the broader open-weight April release wave. For market context on where the capital is moving, our pieces on SpaceX's $60B Cursor option, Sarvam's multilingual stack, and Anthropic's finance-agent push at Goldman, Blackstone and FIS round out the picture.