In the first half of 2023, if you wanted to train or fine-tune a model seriously, you queued for H100s at above $8 per hour on the spot market — and you were grateful to get them. Demand from the post-ChatGPT frenzy had stripped every major cloud of available capacity. Startups ran waitlists. Infrastructure teams competed with hyperscalers on a thin inventory of Ampere and early Hopper chips. The cost of experimentation was high enough that it meaningfully constrained which teams could afford to build.

Three years later, that world is gone. H100 and H200 on-demand rental has settled at approximately $2.00–$2.50 per hour across major cloud providers, per SpendArk analysis published in May 2026. That is not a temporary dip. It is the result of a structural supply surge that has been building for 18 months and shows no sign of reversing. If anything, the next wave of supply is about to make the current prices look expensive by comparison.

This article is the practical guide to what this shift means: where the money went, what experiments and workloads are now economically viable for a team running on a budget, how Indian builders and UK startups should be thinking about reservations versus on-demand, and what the GB300 (Blackwell Ultra) wave means for the rest of 2026.

The Pricing Journey: From $8/hr Peak to $2/hr Today

The H100 SXM shortage of 2023 was genuine. NVIDIA's supply chain constrained production of the SXM modules that most serious training workloads required, and demand from hyperscalers — who were all scrambling to build out AI infrastructure simultaneously — absorbed the available supply before the retail spot market could access it. The result was scarcity pricing: spot H100s above $8/hr on major GPU clouds, and in some cases considerably higher for guaranteed on-demand access.

Several things then happened in sequence. First, NVIDIA significantly ramped production of H100 SXM and PCIe variants through 2023 and into 2024. Second, hyperscalers — having made their initial capacity purchases — began releasing excess inventory onto cloud marketplaces as their internal utilisation rates varied. Third, AMD's MI300X entered the market as a credible alternative for inference workloads, introducing competitive pressure that had not previously existed. Fourth, a new tier of specialist GPU cloud providers — RunPod, JarvisLabs, CoreWeave, and others — built out substantial infrastructure and competed aggressively on price to capture the developer and startup segment that the hyperscalers were not serving well.

The cumulative result is the $2–2.50/hr range we see today for on-demand H100. The ~70% fall from peak is not a correction to fair value — it is a structural re-pricing driven by supply growth that continues to outpace demand growth. And crucially, the next generation of supply is already in the ground.

Why GPU Prices Are Falling: The Supply Wave

Three deals announced or completed in the first half of 2026 illustrate the scale of new supply entering the market:

IREN and Microsoft: $9.7B, GB300 at scale. On 7 May 2026, IREN — one of the largest GPU cloud operators by capacity — announced a multi-year $9.7B agreement with Microsoft to deploy NVIDIA GB300 GPUs at its Childress, Texas data centre (reported by CNBC). This is not a modest infrastructure commitment. It is a single-site deployment that, when complete, will represent one of the largest concentrations of next-generation compute outside the hyperscalers' own campuses. When GB300 supply of this magnitude comes online, it does two things: it directly increases GB300 availability at competitive prices, and it pushes H100 inventory — no longer the newest chip — onto the spot market at even lower prices.

AWS: over one million NVIDIA GPUs. AWS's expanded NVIDIA partnership involves deploying over one million NVIDIA GPUs across its global infrastructure. The scale of this commitment means that AWS GPU capacity — across H100, H200, and Blackwell variants — will be substantially larger in H2 2026 than it is today. AWS reserved instances for GPU workloads are already pricing competitively; expect further movement as this capacity comes online.

Meta: $50B over multiple years. Meta's multi-year deal with NVIDIA, committed in February 2026, involves Blackwell and Rubin GPUs plus Grace and Vera CPUs, an estimated $50B commitment over the life of the agreement (CNBC). Meta is building this infrastructure for its own research and product needs, not as a cloud provider. But the knock-on effect is significant: commitments of this size give NVIDIA and its suppliers the demand certainty to expand production capacity, and that larger pipeline indirectly increases the supply available to cloud providers and new entrants.

AMD MI300X and the competitive dynamic. AMD's MI300X has established itself as a viable alternative for inference workloads, with competitive memory bandwidth and a pricing strategy that undercuts NVIDIA on comparable performance tiers. Microsoft Azure, Oracle Cloud, and several specialist providers now offer MI300X instances. This competitive presence is not displacing NVIDIA for training (where CUDA ecosystem lock-in remains strong), but it is putting a ceiling on inference pricing that NVIDIA providers cannot easily exceed.

New market entrants. Neysa — India's first sovereign GPU cloud at scale — has deployed 20,000-plus NVIDIA GPUs after closing a $1.2B Series B. We cover the Neysa deployment in detail in our dedicated piece on Neysa's $1.2B round and India sovereign GPU cloud. Neysa's pricing is broadly competitive with AWS Mumbai for H100 instances, and for Indian teams its data-residency proposition — compute that never leaves India — adds a compliance value that hyperscaler pricing does not capture.

What the Pricing Drop Means for Indian Builders

The practical impact is best understood through concrete experiments. Consider what $800 bought a serious Indian AI builder in early 2023 versus what it buys today.

In early 2023, $800 at $8/hr gave you 100 hours of H100 compute. That is enough time to fine-tune a 7B model once, run some evaluations, and iterate on a dataset — but not much more. For a bootstrapped Indian startup, $800 was a meaningful budget decision. You ran the experiment once, carefully, and hoped it worked.

Today, $800 at $2/hr gives you 400 hours of H100 compute: four times the experimental capacity for the same spend. The 100-hour workload that used to cost $800 now costs $200. An overnight QLoRA fine-tuning run on a 7B model with a dataset of 50,000 examples (start it at 6pm, collect results at 8am) costs between $15 and $30 at current H100 rates, depending on how many of those 14 hours the job actually uses. That is within the discretionary budget of an individual developer, not just a funded startup.
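The budget arithmetic in the last two paragraphs can be reproduced with a short Python sketch. The function names are illustrative, and the rates are this article's round figures ($8/hr in early 2023, $2/hr today), not quotes from any provider.

```python
def gpu_hours(budget_usd: float, rate_per_hour: float) -> float:
    """Hours of single-GPU time that a budget buys at a flat hourly rate."""
    if rate_per_hour <= 0:
        raise ValueError("rate_per_hour must be positive")
    return budget_usd / rate_per_hour


def run_cost(hours: float, rate_per_hour: float) -> float:
    """Cost of keeping one GPU busy for a given number of hours."""
    return hours * rate_per_hour


BUDGET_USD = 800.0
hours_2023 = gpu_hours(BUDGET_USD, 8.0)  # early-2023 rate used in the article
hours_2026 = gpu_hours(BUDGET_USD, 2.0)  # current rate used in the article

# 6pm to 8am is 14 wall-clock hours; a job that finishes sooner costs less.
overnight_cost = run_cost(14, 2.0)

print(f"2023: {hours_2023:.0f} h, 2026: {hours_2026:.0f} h")
print(f"Overnight (14 h at $2/hr): ${overnight_cost:.2f}")
```

Swapping in a quoted rate from your own provider is the only change needed to reuse this for a real budget.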

This is the unlock that matters. When the cost of a serious training experiment falls to a few tens of dollars, the experimentation rate explodes. Teams that were running one or two fine-tuning experiments per month can now run ten or twenty. The feedback loop from hypothesis to validated model compresses from weeks to days. For Indian builders working in domains where open-weight fine-tuning has direct product application (legal document processing, regional language NLP, code generation for Indian enterprise stacks), this is the most significant infrastructure development of 2026.

Pro tip

At $2/hr on a single H100, an overnight 8-hour QLoRA fine-tuning run on a 7B model costs $16. You can run this every night of the week for $112 — less than most team subscriptions to productivity software. Structure your experimentation cadence around this: daily hypothesis, overnight run, morning evaluation. The economics now support it.

What the Pricing Drop Means for UK Startups

For UK AI startups, the implications run through cloud budget allocation, on-premises ROI calculations, and the build-versus-buy decision on GPU infrastructure.

Cloud budgets go further. A UK startup that budgeted £50,000 for GPU cloud compute in 2024 can now run approximately twice the workload for the same spend — or run the same workload and redirect the savings to engineering time, data acquisition, or evaluation infrastructure. For teams using H100 on-demand for production inference, the savings are immediate and require no architectural change: the same cluster costs less.

The on-premises ROI equation shifts. At $8/hr H100, the case for buying on-premises GPU hardware was compelling for teams with consistent, predictable compute needs. The payback period on owned hardware was often under 18 months. At $2/hr cloud prices, the calculus changes materially. For inference-only workloads, cloud on-demand now competes favourably with owned hardware on a total-cost-of-ownership basis once you factor in power, cooling, space, and engineering maintenance. The use case for owned hardware narrows to: very high utilisation (above 80% 24/7), specific compliance requirements that mandate on-premises processing, or ultra-low-latency edge inference where network round-trip to cloud is unacceptable.
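To see how the payback period moves with the cloud rate, here is a minimal sketch. The purchase cost and running cost per GPU are hypothetical placeholders, not figures from this article or any vendor, and the model simplifies by charging running costs only for utilised hours.

```python
import math

HOURS_PER_MONTH = 730  # average hours in a calendar month


def payback_months(capex_usd, cloud_rate, own_opex_rate, utilisation):
    """Months until owned hardware has saved its purchase price.

    Savings accrue only during utilised hours, at
    (cloud_rate - own_opex_rate) per hour. Returns math.inf when
    owning never beats renting.
    """
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    saving_per_hour = cloud_rate - own_opex_rate
    if saving_per_hour <= 0:
        return math.inf
    monthly_saving = saving_per_hour * HOURS_PER_MONTH * utilisation
    return capex_usd / monthly_saving


# Hypothetical inputs, chosen only to show the shape of the calculation.
CAPEX_PER_GPU = 30_000.0  # assumed all-in purchase cost per GPU, USD
OWN_OPEX_RATE = 0.50      # assumed power, cooling, space, staff per GPU-hour
UTILISATION = 0.80

at_2023_rate = payback_months(CAPEX_PER_GPU, 8.0, OWN_OPEX_RATE, UTILISATION)
at_2026_rate = payback_months(CAPEX_PER_GPU, 2.0, OWN_OPEX_RATE, UTILISATION)
print(f"Payback at $8/hr cloud: {at_2023_rate:.1f} months")
print(f"Payback at $2/hr cloud: {at_2026_rate:.1f} months")
```

Under these assumed inputs the payback stretches from well under 18 months to almost three years, which is the shift the paragraph above describes. Replace the placeholders with real quotes before drawing any conclusion.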

Azure PTU economics. For UK teams heavily committed to the Azure ecosystem running GPT-4o class workloads at volume, Azure's Provisioned Throughput Units (PTUs) still deliver up to 40% savings versus pay-as-you-go at high volume, per Azure pricing guidance. PTUs are a reservation mechanism: you commit to a throughput level and pay a blended rate that undercuts the per-call API price at scale. For production workloads with predictable call volume, this is still the right choice — the savings are real and do not depend on spot availability. For experimental or variable-volume workloads, standard pay-as-you-go remains more flexible.

GDPR and data residency. UK startups handling personal data through inference face an additional constraint that does not apply to most Indian teams in the same way. Azure UK South, AWS eu-west-2 (London), and Google Cloud europe-west2 provide GPU infrastructure with data residency in the UK — important for GDPR compliance in enterprise contracts. UK-region GPU pricing typically runs 10–20% above equivalent US-region pricing, but for enterprise sales where a data processing agreement (DPA) is a prerequisite, this premium is well justified. The good news: even with the UK-region premium, H100 on-demand in London is still dramatically cheaper than it was 18 months ago.

Practical Cost Table: What You Can Run at $2/hr

The following table gives concrete estimates for common training and inference workloads at current H100 on-demand pricing of $2/hr. All training figures assume a single H100 80GB with QLoRA or LoRA fine-tuning (full fine-tuning requires more GPU memory and time); inference figures assume a well-configured vLLM server with FP8 quantisation and good batch fill rates. Actual costs vary by dataset size, batch configuration, and provider.

Workload | 1 hour ($2) | 8 hours ($16) | 24 hours ($48)
7B model fine-tuning (QLoRA, ~50K examples) | Partial run (setup + ~30% of dataset) | Full fine-tuning run with evaluation | 3 full fine-tuning runs; dataset ablation
13B model inference (vLLM, FP8) | ~400K–600K tokens generated; demo or small eval | ~3.5M tokens; mid-size production eval suite | ~10M tokens; large-scale batch processing pipeline
70B model inference (vLLM, FP8, single H100) | ~80K–120K tokens; prototype smoke test | ~700K tokens; reasonable eval or batch job | ~2M tokens; nightly document processing pipeline
Embedding generation (sentence-transformers, 768-dim) | ~5M–10M embeddings; full document corpus for a mid-size SaaS | ~50M embeddings; large enterprise document library | ~150M embeddings; full re-index of a large-scale RAG corpus

Note: 70B model fine-tuning (LoRA) needs multiple GPUs. It requires 4–8× H100s ($8–16/hr), so an 8-hour run costs $64–128. That is still ~70% cheaper than the 2023 equivalent.
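The inference rows of the table can be turned into a cost per million generated tokens. The sketch below uses only the 1-hour throughput ranges listed in the table and the $2/hr rate; the function name is illustrative.

```python
def cost_per_million_tokens(rate_per_hour: float, tokens_per_hour: float) -> float:
    """USD per one million generated tokens at a given hourly GPU rate."""
    if tokens_per_hour <= 0:
        raise ValueError("tokens_per_hour must be positive")
    return rate_per_hour / tokens_per_hour * 1_000_000


RATE = 2.0  # USD per H100-hour, the article's on-demand figure

# (low, high) tokens per hour, taken from the 1-hour column of the table.
THROUGHPUT = {
    "13B inference": (400_000, 600_000),
    "70B inference": (80_000, 120_000),
}

for name, (low, high) in THROUGHPUT.items():
    worst = cost_per_million_tokens(RATE, low)   # lowest throughput
    best = cost_per_million_tokens(RATE, high)   # highest throughput
    print(f"{name}: ${best:.2f} to ${worst:.2f} per million tokens")
```

Measured throughput on your own model and batch configuration should replace the table's ranges before using this for pricing decisions.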

The embedding row is worth highlighting for RAG builders. At $2/hr, re-indexing an entire enterprise document corpus — tens of millions of documents — is a weekend job that costs under $50. In 2023, that same task might have cost $150–200. For teams iterating on their retrieval architecture — changing chunking strategies, embedding models, or document schemas — the cost of a full re-index is no longer a deterrent to experimentation.
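A rough re-index estimate follows from the table's 24-hour embedding figure. In the sketch below the corpus size and the chunks-per-document ratio are hypothetical; only the throughput (~150M embeddings in 24 hours) and the $2/hr rate come from this article.

```python
def reindex_cost(num_documents, chunks_per_document, embeddings_per_hour, rate_per_hour):
    """Return (hours, cost_usd) for embedding every chunk of a corpus once."""
    if embeddings_per_hour <= 0:
        raise ValueError("embeddings_per_hour must be positive")
    total_embeddings = num_documents * chunks_per_document
    hours = total_embeddings / embeddings_per_hour
    return hours, hours * rate_per_hour


# ~150M embeddings in 24 hours (the table's figure) is 6.25M per hour.
EMBEDDINGS_PER_HOUR = 150_000_000 / 24

hours, cost = reindex_cost(
    num_documents=30_000_000,  # hypothetical corpus size
    chunks_per_document=5,     # hypothetical chunking ratio
    embeddings_per_hour=EMBEDDINGS_PER_HOUR,
    rate_per_hour=2.0,
)
print(f"Full re-index: {hours:.0f} GPU-hours, ${cost:.2f}")
```

Changing the chunking strategy changes `chunks_per_document`, so the same function shows how much each retrieval experiment costs to re-run.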

For a deeper look at how inference economics interact with product profitability, see our companion piece on AI inference costs and building profitable AI products.

Watch out

Spot GPU availability remains uneven. The $2/hr figure reflects on-demand pricing at well-supplied providers; spot prices may be 20–40% lower but are subject to preemption. Enterprise deals — particularly for multi-node H100 clusters — are negotiated at rates that can differ significantly from retail pricing in both directions. If your workload requires guaranteed multi-node capacity, get quotes directly from providers rather than relying on published rate cards. A100 instances are now available at near-commodity pricing ($1.20–1.50/hr on-demand) and represent strong value for inference workloads that do not require the H100's memory bandwidth.
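Whether a spot discount survives preemption depends on how much work has to be redone. The sketch below is a simplified model: `rework_fraction` is a hypothetical parameter for the extra GPU time lost to restarts, expressed as a fraction of the useful hours, and should be measured on your own jobs.

```python
def spot_rate(on_demand_rate: float, discount: float) -> float:
    """Hourly spot rate given an on-demand rate and a fractional discount."""
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return on_demand_rate * (1 - discount)


def effective_spot_cost(on_demand_rate, discount, useful_hours, rework_fraction):
    """Cost of a job on spot when preemptions force some work to be redone."""
    billed_hours = useful_hours * (1 + rework_fraction)
    return spot_rate(on_demand_rate, discount) * billed_hours


def breakeven_rework_fraction(discount: float) -> float:
    """Rework overhead at which spot costs the same as on-demand."""
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return 1 / (1 - discount) - 1


for discount in (0.20, 0.40):  # the 20-40% range quoted above
    limit = breakeven_rework_fraction(discount)
    print(f"{discount:.0%} discount: spot wins while rework stays below {limit:.0%}")
```

Frequent checkpointing keeps `rework_fraction` small, which is what makes the lower end of the discount range worth taking.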

Reservation vs On-Demand: When to Commit Now

The falling price environment creates a genuine strategic tension around reservations. On one hand, reserved instances and committed-use contracts still deliver meaningful savings for steady-state workloads. On the other hand, committing to a 1–3 year reservation now means locking in today's price — which, based on the supply trajectory, may look expensive relative to H2 2026 rates when GB300 capacity comes online at scale.

The right framework depends on your workload profile:

Commit now if: your production inference workload runs at above 60% utilisation on a consistent basis; your cloud provider offers PTU-style reservations (Azure) that lock in throughput rather than raw GPU time; or your compliance requirements mandate a guaranteed data-residency SLA that spot instances cannot provide. Azure PTUs at up to 40% savings versus pay-as-you-go are still a strong proposition for high-volume GPT-4o inference. At the full 40% discount you pay 60% of today's list price, so even if list prices fall 20% in H2 2026 (to 80% of today's), the committed rate remains cheaper.

Stay on-demand if: your workload is experimental or variable; you are in the process of evaluating model tiers and expect to change your stack within 6 months; or you are a small team running GPU compute only for fine-tuning and evaluation (rather than continuous inference). At $2/hr, the break-even between on-demand and reserved has shifted significantly — you now need very high utilisation to justify a multi-year commitment.

Prefer shorter commitments. If you do want to capture reservation savings, favour 3–6 month terms over 1–3 year terms in the current environment. The risk of being locked into a high rate while GB300-driven pricing falls further in Q3–Q4 2026 is real. A 6-month reserved H100 at a negotiated 15% discount captures meaningful savings while preserving the option to renegotiate or switch tiers when the next capacity wave lands.
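The commit-or-wait framework above reduces to two ratios. This is a simplified sketch: it assumes reserved capacity is billed for every hour at a discounted rate while on-demand is billed only for hours used, and it ignores contract details such as minimum terms or upfront payments.

```python
def breakeven_utilisation(discount: float) -> float:
    """Utilisation above which a reservation beats on-demand.

    Reserved cost per hour is (1 - discount) x on-demand rate, paid always;
    on-demand cost per hour is utilisation x on-demand rate.
    """
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return 1 - discount


def committed_vs_future_list(discount: float, future_price_fall: float) -> float:
    """Committed rate divided by the future list rate.

    Both are expressed relative to today's list price. A result below 1
    means the commitment is still cheaper after the price fall.
    """
    if not 0 <= future_price_fall < 1:
        raise ValueError("future_price_fall must be in [0, 1)")
    return (1 - discount) / (1 - future_price_fall)


# 40% PTU-style discount: break-even at 60% utilisation.
print(f"Break-even utilisation at 40% off: {breakeven_utilisation(0.40):.0%}")
# 40% discount against a 20% list-price fall: commitment still cheaper.
print(f"40% off vs 20% fall: {committed_vs_future_list(0.40, 0.20):.2f}")
# 15% discount against a fall from $2.00 to $1.50 (25%): commitment loses.
print(f"15% off vs 25% fall: {committed_vs_future_list(0.15, 0.25):.2f}")
```

The last case is why shorter terms are attractive for small discounts: a 15% saving is wiped out if on-demand rates fall by more than 15% during the term.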

Pro tip

For Indian teams, Neysa's pricing structure is worth benchmarking alongside AWS Mumbai and Azure India before making any reservation decision. Data residency within India — combined with competitive pricing on H100 instances — may shift the optimal reservation provider for teams where DPDP Act compliance is a factor. Do not assume AWS Mumbai is automatically the benchmark; get a quote from Neysa directly before committing.

The GB300 Wave: Why Prices May Fall Further in H2 2026

The NVIDIA GB300 (Blackwell Ultra) comes from the Blackwell generation that follows the H100's Hopper architecture, and it is beginning to arrive in data centres in H2 2026. The IREN-Microsoft $9.7B deployment is one of the larger announced rollouts, but it is not alone. Every major cloud provider has committed to GB300 deployments in its public roadmap.

The GB300 delivers materially better performance than the H100 on training workloads: higher memory bandwidth, higher FP8 throughput, and better multi-node interconnect through NVLink 5. For frontier model training at large scale, it is the clear choice over Hopper. For inference, the picture is more nuanced: the H100's performance is already very strong for most model sizes up to 70B, and the GB300's advantages become most pronounced at 100B-plus scale.

What does GB300 availability mean for H100 pricing? As enterprises and hyperscalers migrate their primary training workloads to GB300, H100 inventory gets freed up and increasingly flows onto the spot and on-demand retail market. This is the same dynamic that made A100 instances cheap in 2024 and 2025 as H100 adoption scaled. A100 on-demand is now available at $1.20–1.50/hr, a price that would have seemed implausible in 2022. H100 following a similar trajectory toward $1.50–2.00/hr in H2 2026 is a forecast rather than a certainty, but it is the logical consequence of the supply dynamics already in motion.

For a detailed look at what GB300's inference economics mean for builders running large-scale workloads, see our analysis of NVIDIA B300 inference economics for 2026.

Should you wait? For production infrastructure decisions with multi-year implications, there is a reasonable case for shorter commitments now — preserving the option to benefit from further price falls in Q4 2026 and 2027. For experimentation and fine-tuning budgets, do not wait. The economics of running experiments today at $2/hr are excellent, and the opportunity cost of delaying experiments while waiting for $1.50/hr is almost certainly larger than the savings you would capture by waiting.

Cloud Provider Comparison: Where to Run in May 2026

Not all GPU cloud providers are equally positioned in the current market. Below is a representative snapshot of H100 on-demand pricing and key considerations by provider as of May 2026. Prices are directional; always request current quotes directly.

Provider | H100 On-Demand (approx.) | Best for | Key consideration
AWS (us-east-1) | $2.20–$2.80/hr (p5) | Teams in the AWS ecosystem; large-scale distributed training | 1M+ GPU deployment underway; capacity improving
Azure (East US) | $2.40–$3.00/hr (ND H100 v5) | GPT-4o inference with PTUs; enterprises on Azure AD | PTUs offer up to 40% savings at high volume; IREN GB300 deal adds future capacity
Google Cloud (us-central1) | $2.50–$3.20/hr (a3-highgpu) | JAX/TPU teams; Vertex AI pipeline integration | TPU 8i often better value for compatible workloads
CoreWeave (US East) | $2.00–$2.40/hr | Cost-optimised training and inference; startup-friendly billing | Good spot availability; strong multi-node SLA options
RunPod / JarvisLabs | $1.89–$2.20/hr (on-demand) | Experimentation, fine-tuning, individual developers | Spot below $1.50/hr available; preemption risk; excellent for dev workloads
Neysa (India) | Approx. AWS Mumbai parity | Indian teams; DPDP Act compliance; sovereign data residency | 20,000+ GPU fleet; $1.2B Series B; data never leaves India
AWS Mumbai (ap-south-1) | $2.40–$3.00/hr (p5) | Indian teams already on AWS; latency-sensitive inference for IN users | Spot availability tighter than US regions; Neysa worth benchmarking alongside

The specialist providers — CoreWeave, RunPod, JarvisLabs — consistently undercut hyperscaler pricing by 15–30% for comparable on-demand H100 access. For teams whose workloads do not require tight integration with hyperscaler services (managed databases, identity, monitoring pipelines), these providers are worth evaluating seriously. The engineering overhead of working outside the hyperscaler ecosystem is real but manageable for teams with dedicated infra engineers.
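The 15–30% figure can be sanity-checked against the US rows of the provider table. The sketch below compares each specialist's midpoint price with the average midpoint of the three hyperscalers; using midpoints is a simplifying assumption, since real quotes vary by region and commitment.

```python
from statistics import mean

# (low, high) USD per H100-hour, copied from the provider table above.
H100_ON_DEMAND = {
    "AWS (us-east-1)": (2.20, 2.80),
    "Azure (East US)": (2.40, 3.00),
    "Google Cloud (us-central1)": (2.50, 3.20),
    "CoreWeave (US East)": (2.00, 2.40),
    "RunPod / JarvisLabs": (1.89, 2.20),
}
HYPERSCALERS = {"AWS (us-east-1)", "Azure (East US)", "Google Cloud (us-central1)"}


def midpoint(price_range):
    """Midpoint of a (low, high) price range."""
    low, high = price_range
    return (low + high) / 2


hyperscaler_mid = mean(
    midpoint(r) for name, r in H100_ON_DEMAND.items() if name in HYPERSCALERS
)

# Fractional discount of each specialist relative to the hyperscaler average.
discounts = {
    name: 1 - midpoint(r) / hyperscaler_mid
    for name, r in H100_ON_DEMAND.items()
    if name not in HYPERSCALERS
}

for name, discount in discounts.items():
    print(f"{name}: {discount:.0%} below the hyperscaler midpoint")
```

On the table's figures both specialist rows land inside the 15–30% band; rerun it with current quotes before using it in a procurement decision.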

A100: Nearly Free, and Excellent for Inference

While this article focuses on H100 pricing, the A100's current market position deserves a mention. A100 80GB on-demand is available at $1.20–1.50/hr from specialist providers. For inference workloads running models up to approximately 30B parameters, the A100 remains a strong option. Its memory bandwidth is lower than the H100's, but the gap matters less for inference than for training, and the price difference is significant. One caveat: the A100 has no native FP8 tensor-core support, so the FP8 throughput gains assumed elsewhere in this article apply to H100-class and newer GPUs.

For Indian bootstrapped teams running inference for production products at small to mid scale, A100 spot instances in the $0.80–1.00/hr range represent arguably the best cost-per-output-token available on commodity hardware today. If your production model is a fine-tuned 7B or 13B, the A100 is more than sufficient — and the economics are extremely favourable.
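Whether the A100 actually wins on cost per token depends on how much slower it is for your model. Because cost per token is the hourly rate divided by throughput, the break-even is a simple ratio. The sketch below uses the price ranges quoted in this article and makes no assumption about real A100 or H100 throughput.

```python
def breakeven_throughput_ratio(cheaper_rate: float, pricier_rate: float) -> float:
    """Fraction of the pricier GPU's throughput the cheaper GPU must reach
    for both to cost the same per output token."""
    if cheaper_rate <= 0 or pricier_rate <= 0:
        raise ValueError("rates must be positive")
    return cheaper_rate / pricier_rate


H100_RATE = 2.00       # USD/hr, on-demand figure used throughout the article
A100_ON_DEMAND = 1.35  # midpoint of the $1.20-1.50 range
A100_SPOT = 0.90       # midpoint of the $0.80-1.00 range

on_demand_ratio = breakeven_throughput_ratio(A100_ON_DEMAND, H100_RATE)
spot_ratio = breakeven_throughput_ratio(A100_SPOT, H100_RATE)
print(f"On-demand A100 wins above {on_demand_ratio:.1%} of H100 throughput")
print(f"Spot A100 wins above {spot_ratio:.1%} of H100 throughput")
```

Benchmark your own model on both GPUs and compare the measured throughput ratio with these thresholds.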

For Builders in Both Markets: What to Do Now

Whether you are a two-person bootstrapped team in Bengaluru or a twelve-person AI squad at a Manchester scale-up, the actions that move the needle are the same:

Run the experiments you have been deferring. If there is a fine-tuning experiment, evaluation study, or dataset ablation you have been putting off because the compute cost felt prohibitive, run it now. At $2–2.50/hr, the mental model of "is this experiment worth the cost" should be recalibrated significantly downward. A weekend experiment that answers a meaningful architecture question for $50–80 is worth running even if the answer is negative.

Revisit your on-premises versus cloud calculation. If you made a buy-versus-rent decision 18 months ago at $5–6/hr cloud pricing, that calculation may have changed. Run it again with current rates. For inference-only workloads with moderate utilisation, cloud on-demand is increasingly competitive with owned hardware on total cost of ownership.

Benchmark Neysa (India) and CoreWeave (UK/US). If you have not already evaluated the specialist GPU cloud tier, do so before your next reservation decision. The pricing gap versus hyperscalers is significant and often not justified by the additional services unless you are tightly integrated with hyperscaler tooling.

Use QLoRA and FP8 as defaults. The days of justifying full-parameter BF16 fine-tuning as the "safe" default are over. QLoRA and LoRA reduce GPU memory requirements by 60–80% relative to full fine-tuning with minimal quality loss for most downstream tasks. FP8 inference halves the weight memory footprint, and the memory traffic that goes with it, relative to FP16 or BF16. Both are production-ready in 2026. If you are not using them, you are paying 2–4× more than necessary for equivalent output quality. See our deep-dive on DoRA fine-tuning with weight-decomposed LoRA for the current state of parameter-efficient fine-tuning techniques.
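A back-of-envelope estimate shows where the memory savings come from. This sketch counts model state only, using the common rule of thumb of 16 bytes per trainable parameter for mixed-precision Adam (2 for weights, 2 for gradients, 12 for the FP32 master copy and two optimizer moments). The 1% trainable fraction is a hypothetical adapter size. Activation memory, quantisation constants, and framework overhead are excluded, so end-to-end savings will be smaller than these numbers suggest.

```python
GB = 1e9
BYTES_PER_TRAINABLE_PARAM = 16  # rule of thumb for mixed-precision Adam


def full_finetune_gb(params: float) -> float:
    """Model-state memory when every parameter is trained."""
    return params * BYTES_PER_TRAINABLE_PARAM / GB


def adapter_finetune_gb(params, base_bytes_per_param, trainable_fraction):
    """Model-state memory with a frozen base model plus trainable adapters."""
    frozen = params * base_bytes_per_param
    trainable = params * trainable_fraction * BYTES_PER_TRAINABLE_PARAM
    return (frozen + trainable) / GB


PARAMS_7B = 7e9
TRAINABLE_FRACTION = 0.01  # hypothetical adapter size, for illustration only

full = full_finetune_gb(PARAMS_7B)
lora = adapter_finetune_gb(PARAMS_7B, 2.0, TRAINABLE_FRACTION)   # BF16 base
qlora = adapter_finetune_gb(PARAMS_7B, 0.5, TRAINABLE_FRACTION)  # 4-bit base

for name, gb in (("Full fine-tune", full), ("LoRA", lora), ("QLoRA", qlora)):
    print(f"{name}: ~{gb:.1f} GB of model state")
```

The estimate explains why a 7B QLoRA run fits comfortably on a single 80GB GPU while full-parameter training of the same model does not.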

Plan your data residency now. For Indian teams: DPDP Act phase two is approaching, and the window for retrofitting data residency into existing architectures is narrowing. Neysa gives you a credible sovereign GPU cloud option today. For UK teams: GDPR data residency expectations in enterprise contracts are hardening. Audit your inference stack against your privacy policy and data processing agreements before the next enterprise renewal cycle.

The fundamental shift of 2026 is that GPU compute has moved from a scarce, expensive resource that constrained which teams could afford to build seriously to an abundant, cheap resource whose economics now resemble commodity cloud infrastructure. The constraint on AI product quality has moved from compute budget to product insight, data quality, and evaluation discipline. For builders who have been watching and waiting, that is the signal to act.

For a broader view of how VC capital is flowing into AI infrastructure in this environment, see our coverage of Q1 2026's record $300B AI startup funding — and what it means for where compute capacity will be built next.
