What builders need to know before anything else

  • Rubin NVL72 is not a better H100 — it is an entirely different form factor. You rent a slice of a 72-GPU rack, not an individual GPU instance. Unit economics are different from anything you have signed today.
  • H2 2026 is the realistic window for access on AWS, Google Cloud, Azure, and OCI. CoreWeave, Lambda, and Nebius are also deploying. Plan your roadmap accordingly.
  • The inference-to-training ratio is flipping — specialised GPU cloud providers are currently running 70% training / 30% inference capacity, but this ratio is expected to reverse by end of 2026. Rubin accelerates that shift.
  • H100 spot prices have already collapsed to around $2/hr (from $8/hr in 2024). Any reservation contract you sign today on H200 hardware carries meaningful Rubin-arrival risk if the term extends past Q3 2026.

What is the Vera Rubin NVL72?

Named after the American astronomer Vera Rubin, whose measurements of galaxy rotation curves provided some of the strongest evidence for dark matter, NVIDIA's Vera Rubin platform is the successor to the Blackwell generation (which in turn followed Hopper, the architecture behind the H100 and H200). The NVL72 is the flagship form factor: 72 GPUs in a single rack-scale unit, tightly coupled by NVIDIA's sixth-generation NVLink fabric.

The NVLink interconnect is the architectural differentiator. In a conventional H100 or H200 server cluster, GPUs in different servers are connected via InfiniBand or Ethernet switches: fast, but with measurable inter-node latency and bandwidth constraints. In the NVL72, all 72 GPUs sit in a single NVLink domain and can address one another's memory directly, at per-GPU bandwidth well above what PCIe or a network fabric provides. The result is that a 72-GPU Rubin system behaves more like one very large GPU than 72 loosely connected ones.

This matters most for:

  • Frontier training runs: large-scale transformer pre-training that needs to parallelise models with hundreds of billions of parameters across many GPUs without hitting inter-node communication bottlenecks.
  • Long-context inference — serving models with very large key-value (KV) caches, where moving cache data between GPUs would otherwise dominate latency.
  • Multi-modal workloads — video understanding and generation jobs that need massive simultaneous memory bandwidth for vision and language streams.
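
To get a feel for why interconnect bandwidth matters when cache data has to move between GPUs, note that the time to ship a block of KV cache over a link is roughly its size divided by the link's usable bandwidth. The sketch below is illustrative only: the 40 GB cache size and both link speeds are assumed placeholder values, not measured or published figures for any specific system, and the function name is hypothetical.

```python
def transfer_seconds(payload_bytes: float, link_bytes_per_second: float) -> float:
    """Idealised transfer time: ignores latency, protocol overhead and contention."""
    if link_bytes_per_second <= 0:
        raise ValueError("link bandwidth must be positive")
    return payload_bytes / link_bytes_per_second


GB = 1e9

# Hypothetical inputs, chosen only to show how the result scales.
kv_cache_bytes = 40 * GB        # assumed KV cache for one long-context request
network_link = 50 * GB          # assumed: one 400 Gb/s network link (= 50 GB/s)
rack_fabric_link = 1000 * GB    # assumed: a 1 TB/s intra-rack GPU link

network_s = transfer_seconds(kv_cache_bytes, network_link)
fabric_s = transfer_seconds(kv_cache_bytes, rack_fabric_link)

print(f"network link: {network_s * 1000:.0f} ms")
print(f"rack fabric : {fabric_s * 1000:.0f} ms")
```

Swap in the figures your provider publishes; the ratio between the two results matters more than either absolute number.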

Most startups will not rent raw NVL72 racks directly. The economics and operational complexity favour hyperscale operators. Instead, you will access Rubin-class compute via managed inference endpoints, the same way you call a hosted model such as Claude or GPT-4 today without knowing which GPU your request lands on.

When and where: cloud availability timeline

The following deployment schedule is based on confirmed announcements from NVIDIA Newsroom, Google Cloud Blog, and partner press releases as of May 2026. Exact regional availability will be confirmed closer to launch.

Provider | Type | Expected Region(s) | Estimated Launch
--- | --- | --- | ---
AWS | Hyperscaler | us-east-1, eu-west-2 (UK) | Q3–Q4 2026
Google Cloud | Hyperscaler | us-central1, europe-west2 | Q3–Q4 2026
Microsoft Azure | Hyperscaler | East US, UK South | Q3–Q4 2026
OCI (Oracle Cloud) | Hyperscaler | US Midwest, UK Gov | Q4 2026
CoreWeave | Specialised GPU cloud | US East, US West | Q3 2026
Lambda | Specialised GPU cloud | US (multiple AZs) | Q3–Q4 2026
Nebius | Specialised GPU cloud | EU (Amsterdam) | Q4 2026
Nscale | Specialised GPU cloud | EU (UK, Nordic) | Q4 2026
Neysa (India) | Regional GPU cloud | Mumbai, Hyderabad | 2027 (projected)

One notable commercial signal: IREN Ltd — a publicly listed AI infrastructure operator — has signed a five-year, $3.4 billion AI cloud services contract with NVIDIA to deploy Rubin-based systems. Deals of this scale indicate that NVIDIA is locking in hyperscale rack-level commitments well ahead of the H2 2026 public launch window.

Google Cloud's announcement also included a noteworthy companion capability: Google Cloud Managed Lustre now delivers 10 TB/s of storage bandwidth — a 10× improvement and reportedly 20× faster than comparable hyperscaler offerings. The Inference Gateway has also been updated to cut time-to-first-token (TTFT) latency by more than 70% for certain model sizes. Both improvements compound the Rubin memory bandwidth advantage for inference workloads.

Memory bandwidth: what it means for inference

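If you have been optimising inference pipelines on H100 or H200 hardware, you will be familiar with the two dominant latency constraints: prefill time (processing the input tokens) and decode time (generating each output token). Memory bandwidth is a physical limit on both: it dominates decode, where every new token requires re-reading the weights and the KV cache, and it caps how quickly prefill can stream weights and cache once compute is no longer the constraint.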

In a transformer model, every forward pass reads the model weights and the KV cache from GPU memory. For a 70-billion-parameter model, the weights alone occupy roughly 140 GB at fp16 precision. The KV cache for a 128k-token context adds tens of gigabytes more. If you are running four such models on a GPU cluster to serve concurrent requests, you are constantly saturating memory bandwidth — and that saturation directly translates into latency.
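
The figures above can be reproduced with a few lines of arithmetic. This is a minimal sketch, not a sizing tool: the layer count, KV-head count and head dimension are assumed values for a typical 70B grouped-query-attention model, and `ModelShape`, `weight_bytes` and `kv_cache_bytes` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    params: int               # total parameter count
    layers: int               # transformer layers
    kv_heads: int             # key/value heads (grouped-query attention)
    head_dim: int             # dimension per attention head
    bytes_per_value: int = 2  # fp16 / bf16


def weight_bytes(m: ModelShape) -> int:
    """Memory needed to hold the weights at the given precision."""
    return m.params * m.bytes_per_value


def kv_cache_bytes(m: ModelShape, context_tokens: int) -> int:
    """KV cache for one sequence: keys and values, per layer, per KV head, per token."""
    per_token = 2 * m.layers * m.kv_heads * m.head_dim * m.bytes_per_value
    return per_token * context_tokens


# Assumed shape for illustration; check your model's config for real values.
model = ModelShape(params=70_000_000_000, layers=80, kv_heads=8, head_dim=128)

print(f"weights : {weight_bytes(model) / 1e9:.0f} GB")
print(f"KV cache: {kv_cache_bytes(model, 128_000) / 1e9:.1f} GB for a 128k-token context")
```

Under these assumptions the weights come to 140 GB and a single 128k-token sequence adds roughly 42 GB of cache, which is consistent with the "tens of gigabytes" estimate. Models without grouped-query attention store far more cache per token.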

Time-to-first-token (TTFT) is particularly sensitive to this. TTFT measures the delay between a user sending a prompt and receiving the first output token. For a 32k-token system prompt plus 2k-token user query, the prefill phase must read the entire 34k-token context through memory bandwidth before generating a single output token. On an H100 with 3.35 TB/s of memory bandwidth, this is a measurable bottleneck. The Vera Rubin architecture significantly increases this figure.
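
A rough floor on how long one full pass over the weights can take is the number of bytes read divided by the available memory bandwidth. The sketch below uses the per-GPU bandwidth figures quoted in this article and the 140 GB weight size from the previous section; the 8-GPU even sharding is an assumption, and the helper name `min_pass_ms` is hypothetical. Real latency will be higher, because prefill on long contexts is often limited by compute rather than bandwidth and because perfect sharding is never achieved.

```python
def min_pass_ms(bytes_read: float, per_gpu_bandwidth: float, gpus: int = 1) -> float:
    """Lower bound (ms) for reading `bytes_read` once, spread evenly across `gpus`."""
    if per_gpu_bandwidth <= 0 or gpus < 1:
        raise ValueError("bandwidth must be positive and gpus must be >= 1")
    return bytes_read / (per_gpu_bandwidth * gpus) * 1000


TB = 1e12
WEIGHT_BYTES = 140e9  # 70B parameters at fp16, as estimated above

# Per-GPU bandwidth figures as quoted in this article.
bandwidths = {"H100": 3.35 * TB, "H200": 4.8 * TB}

for name, bandwidth in bandwidths.items():
    floor = min_pass_ms(WEIGHT_BYTES, bandwidth, gpus=8)  # assumed 8-way sharding
    print(f"{name}: at least {floor:.2f} ms per pass over the weights")
```

Treat the result as a sanity check on vendor claims, not a prediction: if a quoted per-token latency is below this floor for your model size, something in the comparison is not like for like.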

For builders running long-context workloads — legal document review, code-analysis agents, customer-conversation pipelines — this is not a marginal improvement. A model that returns the first token in 400 ms rather than 1,200 ms is qualitatively different in a chat interface. It is the difference between a product that feels responsive and one that feels slow, regardless of how fast the subsequent tokens stream.

Pro tip

If your inference workload is TTFT-sensitive — chatbots, coding assistants, live document editors — benchmark explicitly against the managed Rubin endpoints when they become available in Q3 2026. Do not assume the improvement will be proportional to the raw bandwidth specs; test your specific context length and model size. The gains are real but depend heavily on your batching strategy and KV cache management.
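
A benchmark of this kind can be very small. The sketch below times any lazily evaluated token stream; it assumes your client library exposes the response as a Python iterator that only starts the request when first iterated. `measure_ttft` and `fake_stream` are hypothetical names, and `fake_stream` merely simulates an endpoint with sleeps so the example runs without network access.

```python
import time
from typing import Iterable, Iterator


def measure_ttft(stream: Iterable[str]) -> tuple[float, float, int]:
    """Return (ttft_seconds, total_seconds, token_count) for one streamed response."""
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in stream:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        count += 1
    end = time.perf_counter()
    if first_token_at is None:
        raise ValueError("stream produced no tokens")
    return first_token_at - start, end - start, count


def fake_stream(prefill_s: float, per_token_s: float, tokens: int) -> Iterator[str]:
    """Stand-in for a real endpoint: one prefill delay, then evenly spaced tokens."""
    time.sleep(prefill_s)
    for i in range(tokens):
        if i:
            time.sleep(per_token_s)
        yield f"tok{i}"


ttft, total, n = measure_ttft(fake_stream(prefill_s=0.05, per_token_s=0.01, tokens=5))
print(f"TTFT {ttft * 1000:.0f} ms, total {total * 1000:.0f} ms, {n} tokens")
```

Run it many times per configuration and compare percentiles, not single samples, using the same prompt length and model size you serve in production.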

Training vs inference: the fleet flip coming in late 2026

Specialised GPU cloud providers — CoreWeave, Lambda, Crusoe, and similar operators — built their businesses primarily on training workloads. As of early 2026, the typical fleet composition across these providers is approximately 70% training capacity, 30% inference capacity. This asymmetry reflects where GPU cloud revenue has historically come from: large foundation model labs running multi-month pre-training runs.

That ratio is expected to invert by end of 2026, and Vera Rubin is a significant catalyst for the flip. Several dynamics are converging:

  1. The foundation model pre-training market is concentrating — a smaller number of very large labs are running the biggest training runs. Commodity training compute is increasingly competed away by H100 price declines (now around $2/hr on spot, down from $8/hr in 2024). This undercuts the revenue model for providers who competed primarily on training cost.
  2. Inference demand is growing faster than training demand — every production AI application is an inference workload. As enterprise adoption accelerates through 2026, the volume of inference requests dwarfs training compute needs across the ecosystem.
  3. Rubin's rack-scale architecture is better suited to large inference clusters than to individual training runs. A single NVL72 rack can serve inference traffic that would have required a loosely coupled cluster of H100s, with lower operational complexity and better utilisation.

The practical implication for builders: the GPU cloud market is restructuring around inference as the primary use case. If you are evaluating a long-term GPU reservation with a specialised provider, ask explicitly how they are repositioning their fleet for inference. A provider still optimised for training throughput may not be the right partner for a production inference workload in 2027.

Watch out

Providers that do not adapt to inference-centric workloads will face margin compression as H100 spot prices remain depressed and training revenue consolidates around a handful of large lab customers. Choose partners with clear inference-oriented roadmaps — look for Rubin deployment timelines, managed inference endpoint offerings, and SLA commitments on TTFT, not just raw throughput.

Should you commit to H200 contracts now?

This is the question that matters most to builders who are currently evaluating GPU reservations. The honest answer is: it depends on your workload, your timeline, and your tolerance for lock-in risk. Here is the framework.

First, the Rubin context. H100 spot prices have already fallen to around $2/hr — a 75% decline from the $8/hr peak in 2024. H200 reserved pricing (1-year terms) from major providers currently sits in the $3–5/hr range for a standard 8-GPU instance. When Rubin NVL72 capacity becomes available in H2 2026, it will initially carry a significant premium — early Rubin pricing will be higher than H200, not lower, because demand will outstrip supply. So the immediate economics do not automatically favour waiting.

The risk of committing today is not that Rubin will be cheaper — it is that Rubin will be qualitatively better for your specific workload, making H200 reservations feel expensive relative to what you could accomplish with Rubin-class inference endpoints at comparable cost. If your product depends on long-context inference quality or latency, you may find that H200 contracts signed in mid-2026 look like expensive legacy commitments by Q1 2027.

Scenario | Recommendation | Rationale
--- | --- | ---
Short-context inference (<8k tokens), latency not critical | Sign H200 6-month reservation now | Rubin offers no meaningful advantage; H200 is available and cost-competitive
Long-context inference (32k+ tokens), TTFT-sensitive product | Wait for Rubin managed endpoints (Q3–Q4 2026) | Memory bandwidth improvement directly reduces TTFT; worth a 3–6 month wait
Training run <1B parameters | Use H100 spot (currently ~$2/hr) | Rubin is overkill; spot H100 is excellent value for small runs
Training run 10B+ parameters, ongoing | Negotiate 6-month H200 term with renewal option | Avoid 12-month lock-in; preserve optionality for Rubin migration
Multi-modal (video + language) at scale | Wait for Rubin; use managed APIs in the interim | NVL72 architecture is specifically suited to multi-modal bandwidth demands
Compliance-sensitive (UK data residency required) | AWS eu-west-2 or Azure UK South H200 now; migrate to Rubin Q4 2026 | Rubin UK regions confirmed; bridge with H200 while waiting

Pro tip

When negotiating reserved GPU contracts today, push hard for 6-month terms rather than 12-month terms — even if the provider offers a discount for the longer commitment. The discount rarely justifies the lock-in risk when a fundamentally new architecture is arriving within the reservation window. If a provider will not offer 6-month terms, treat that inflexibility as a signal about their confidence in Rubin-era pricing.
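
One way to test whether a 12-month discount justifies the lock-in is to ask what hourly rate you would need to pay in months 7 to 12 for two 6-month legs to cost the same as the single 12-month term. The sketch below does that arithmetic; both hourly rates are hypothetical values picked from within the range quoted earlier, and the function names are introduced here for illustration only.

```python
HOURS_PER_MONTH = 730  # average hours in a month (8,760 / 12)


def term_cost(rate_per_hour: float, months: int) -> float:
    """Total cost of running one instance continuously for `months` months."""
    return rate_per_hour * months * HOURS_PER_MONTH


def breakeven_followon_rate(rate_12m: float, rate_6m: float) -> float:
    """Hourly rate for months 7-12 at which 6 + 6 months costs the same as 12 months.

    From 12 * rate_12m = 6 * rate_6m + 6 * x, it follows that x = 2 * rate_12m - rate_6m.
    """
    return 2 * rate_12m - rate_6m


# Hypothetical quotes: discounted 12-month rate versus an undiscounted 6-month rate.
rate_12m, rate_6m = 3.50, 4.00

follow_on = breakeven_followon_rate(rate_12m, rate_6m)
print(f"12-month term          : ${term_cost(rate_12m, 12):,.0f}")
print(f"6-month term, then 6 mo: ${term_cost(rate_6m, 6) + term_cost(follow_on, 6):,.0f}")
print(f"Break-even follow-on   : ${follow_on:.2f}/hr")
```

If you expect to pay less than the break-even rate for equivalent work after the first six months, the shorter term wins on cost alone, before counting the value of being free to migrate.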

GPU architecture comparison: H100 / H200 / Vera Rubin NVL72

Spec / Factor | H100 (SXM5) | H200 (SXM5) | Vera Rubin NVL72
--- | --- | --- | ---
GPU memory | 80 GB HBM3 | 141 GB HBM3e | 72× GPU unified (NVLink pool)
Memory bandwidth | 3.35 TB/s per GPU | 4.8 TB/s per GPU | Significantly higher (rack-scale)
Form factor | Single GPU / 8-GPU server | Single GPU / 8-GPU server | 72-GPU rack-scale unit
Connectivity | NVLink 4 / InfiniBand | NVLink 4 / InfiniBand | NVLink 6 (unified memory fabric)
Best for | Training, standard inference | Long-context inference, training | Frontier training, large-scale inference, multi-modal
Cloud price (May 2026) | ~$2/hr spot (8-GPU) | ~$3–5/hr reserved | TBC, premium at launch
Availability | Widely available, spot surplus | Available on major clouds | H2 2026 on AWS, GCP, Azure, OCI
Rental unit | Individual GPU / server | Individual GPU / server | Rack slice or managed endpoint

India and UK builder playbook

Access to Vera Rubin hardware will not be uniform across geographies, and the timing gap matters to builders who need to plan infrastructure decisions today.

India

India's GPU cloud ecosystem has developed rapidly through 2025 and early 2026. Neysa — India's leading GPU cloud operator — closed a $1.2 billion financing round earlier in 2026, giving it the capital to pursue aggressive hardware upgrades. Neysa currently operates H100-class hardware across Mumbai and Hyderabad, and a Rubin NVL72 deployment in 2027 is plausible given the financing scale and NVIDIA's stated focus on emerging-market hyperscale partnerships.

The IndiaAI Mission's national compute programme is a second potential pathway. The Mission has been allocating compute capacity to domestic AI labs and startups, and any Rubin procurement under that programme would give Indian builders access to Rubin-class infrastructure without routing through a US-domiciled hyperscaler. Watch for IndiaAI Mission procurement announcements in H2 2026 as a signal of whether Rubin will reach Indian sovereign compute infrastructure before 2027.

For Indian builders who need GPU compute now: the H100 spot market is excellent value at current prices. For inference workloads that are not extremely latency-sensitive, Neysa's H100-class instances are cost-competitive and keep data within Indian jurisdictions — relevant for applications handling user data subject to India's Digital Personal Data Protection Act.

For Indian builders who need long-context inference at scale and cannot wait for domestic Rubin availability: Google Cloud's asia-south1 (Mumbai) region is the most viable bridging option. Google's Managed Lustre announcement and the Inference Gateway latency improvements compound meaningfully even on current H200-class hardware in that region.

United Kingdom

UK data residency requirements are well catered for in the Rubin roadmap. AWS eu-west-2 (London) and Azure UK South (London) are both expected to be among the first European regions to receive Rubin NVL72 deployments, likely in Q4 2026. This is consistent with NVIDIA's history of prioritising UK regions given the concentration of AI labs and financial services AI investment in London.

Nscale — a UK-headquartered GPU cloud provider — has been building out European capacity and is listed among the early Rubin deployment partners. For builders who prefer a UK-domiciled provider over a US hyperscaler for data residency reasons, Nscale is worth evaluating. Their Nordic and UK capacity is expected to include Rubin hardware in Q4 2026.

UK builders evaluating GPU contracts today should note that the UK's current regulatory environment — including the AI Safety Institute's ongoing model evaluations and the draft Frontier AI Bill — may introduce compliance requirements around which infrastructure providers are permissible for certain classes of AI workload. If your application touches regulated sectors (financial services, healthcare, legal), build data residency and provider flexibility into your infrastructure design before committing to multi-year contracts.

Watch out

Do not commit to 12-month H200 reserved contracts at current pricing if your workload is long-context inference and your contract term runs past Q2 2027. By that point, managed Rubin inference endpoints will likely offer meaningfully better TTFT at comparable or lower per-token cost. The window for a 6-month H200 bridge while Rubin ramps is reasonable; a 12-month commitment that overlaps significantly with the Rubin availability window is not.
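
To see how much of a proposed term overlaps the Rubin availability window, count the contract months that fall on or after the date you expect Rubin endpoints to be usable. This is a minimal sketch at month granularity: the contract start date and the window start of 1 October 2026 are assumptions for illustration (the article's estimate is Q3 to Q4 2026), and the helper names are hypothetical.

```python
from datetime import date


def add_months(start: date, months: int) -> date:
    """First day of the month that is `months` after the month containing `start`."""
    index = start.year * 12 + (start.month - 1) + months
    return date(index // 12, index % 12 + 1, 1)


def overlap_months(contract_start: date, term_months: int, window_start: date) -> int:
    """Whole contract months on or after `window_start` (a first-of-month date)."""
    end = add_months(contract_start, term_months)
    if end <= window_start:
        return 0
    begin = max(date(contract_start.year, contract_start.month, 1), window_start)
    return (end.year - begin.year) * 12 + (end.month - begin.month)


contract_start = date(2026, 6, 1)  # assumed signing date
rubin_window = date(2026, 10, 1)   # assumed start of general Rubin availability

for term in (6, 12):
    months = overlap_months(contract_start, term, rubin_window)
    print(f"{term}-month term: {months} months overlap the assumed Rubin window")
```

With these assumed dates, a 6-month term overlaps the window by two months and a 12-month term by eight, which is the gap the warning above is pointing at.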

Related reading on AITC

For context on how H100 spot prices reached their current level, see our earlier guide, "H100 GPU Price Decline 2026: A Builder's Guide". If you are evaluating the economics of NVIDIA's B300 architecture alongside Rubin, see "NVIDIA B300 Inference Economics 2026". For a broader view of inference cost trajectories and how to build profitable AI products at current pricing, see "AI Inference Costs 2026: Building Profitable Products".

On the India-specific GPU cloud landscape, our Neysa coverage is essential context: "Neysa $1.2B Series B: India's GPU Cloud Comes of Age". For comparison with the specialised inference cloud market, see "DeepInfra $107M Series B: The Inference Cloud Consolidation".