Is the Vera Rubin NVL72 designed for inference or training?

Both, but with different access patterns. Rubin is primarily designed for frontier-scale training runs and very large inference deployments. Most startups will access it as a managed endpoint rather than renting raw hardware. The high memory bandwidth is particularly valuable for long-context inference and multi-modal workloads.

Can a startup afford Vera Rubin NVL72 compute?

Not directly — the NVL72 is a rack-scale unit rented as a whole or in large blocks. Startups will access Rubin-class inference via managed API endpoints on AWS, Google Cloud, and Azure in H2 2026, similar to how they access H100 today. Expect per-token API pricing rather than per-GPU-hour pricing for most use cases.

When will Vera Rubin arrive in India and the UK?

UK data residency is expected via AWS eu-west-2 (London) and Azure UK South in H2 2026, most likely Q4 2026. India deployment is less certain: Neysa, which has secured $1.2B in financing, is a strong candidate for a 2027 Rubin deployment in India. The IndiaAI Mission compute programme may also fast-track access.

How does Vera Rubin NVL72 compare to the H200?

Rubin NVL72 is architecturally different: it is a rack-scale system with 72 GPUs tightly coupled by NVLink, designed for workloads that cannot be sharded efficiently across independent GPUs. The H200 is a single-GPU upgrade from the H100. Inference workloads that fit within a single H200's memory gain little from rack-scale coupling, and Rubin-class systems will not be reachable via managed APIs until late 2026 at the earliest.

NVIDIA Vera Rubin NVL72: A Builder's Guide to the Next GPU Era

What builders need to know before anything else

Rubin NVL72 is not a better H100 — it is an entirely different form factor. You rent a slice of a 72-GPU rack, not an individual GPU instance. Unit economics are different from anything you have signed today.
H2 2026 is the realistic window for access on AWS, Google Cloud, Azure, and OCI. CoreWeave, Lambda, and Nebius are also deploying. Plan your roadmap accordingly.
The inference-to-training ratio is flipping — specialised GPU cloud providers are currently running 70% training / 30% inference capacity, but this ratio is expected to reverse by end of 2026. Rubin accelerates that shift.
H100 spot prices have already collapsed to around $2/hr (from $8/hr in 2024). Any reservation contract you sign today on H200 hardware carries meaningful Rubin-arrival risk if the term extends past Q3 2026.

What is the Vera Rubin NVL72?

Named after the American astronomer Vera Rubin, whose observations of galaxy rotation curves provided some of the most influential evidence for dark matter, NVIDIA's Vera Rubin platform is the successor to Blackwell, which in turn followed the Hopper generation (H100 and H200). The NVL72 is the flagship form factor: 72 GPUs in a single rack-scale unit, tightly coupled by NVIDIA's sixth-generation NVLink fabric.

The NVLink interconnect is the architectural differentiator. In a conventional H100 or H200 server cluster, GPUs in different servers are connected via InfiniBand or Ethernet switches: fast, but with measurable inter-node latency and bandwidth constraints. In the NVL72, all 72 GPUs sit in a single NVLink domain, so any GPU can reach any other GPU's memory at NVLink speeds that are far higher than PCIe or a conventional network fabric. The result is that a 72-GPU Rubin system behaves more like one very large GPU than 72 loosely connected ones.

This matters most for:

Frontier training runs: large-scale transformer pre-training that needs to parallelise across tens of thousands of GPUs without hitting inter-node communication bottlenecks.
Long-context inference: serving models with very large key-value (KV) caches, where moving cache data between GPUs would otherwise dominate latency.
Multi-modal workloads: video understanding and generation jobs that need massive simultaneous memory bandwidth for vision and language streams.

Most startups will not rent raw NVL72 racks directly. The economics and operational complexity favour hyperscale operators. Instead, you will access Rubin-class compute via managed inference endpoints, the same way you call a hosted model such as Claude or GPT-4 today without knowing which accelerator your request lands on.

When and where: cloud availability timeline

The following deployment schedule is based on confirmed announcements from NVIDIA Newsroom, Google Cloud Blog, and partner press releases as of May 2026. Exact regional availability will be confirmed closer to launch.

Provider	Type	Expected Region(s)	Estimated Launch
AWS	Hyperscaler	us-east-1, eu-west-2 (UK)	Q3–Q4 2026
Google Cloud	Hyperscaler	us-central1, europe-west2	Q3–Q4 2026
Microsoft Azure	Hyperscaler	East US, UK South	Q3–Q4 2026
OCI (Oracle Cloud)	Hyperscaler	US Midwest, UK Gov	Q4 2026
CoreWeave	Specialised GPU cloud	US East, US West	Q3 2026
Lambda	Specialised GPU cloud	US (multiple AZs)	Q3–Q4 2026
Nebius	Specialised GPU cloud	EU (Amsterdam)	Q4 2026
Nscale	Specialised GPU cloud	Europe (UK, Nordics)	Q4 2026
Neysa (India)	Regional GPU cloud	Mumbai, Hyderabad	2027 (projected)

One notable commercial signal: IREN Ltd — a publicly listed AI infrastructure operator — has signed a five-year, $3.4 billion AI cloud services contract with NVIDIA to deploy Rubin-based systems. Deals of this scale indicate that NVIDIA is locking in hyperscale rack-level commitments well ahead of the H2 2026 public launch window.

Google Cloud's announcement also included a noteworthy companion capability: Google Cloud Managed Lustre now delivers 10 TB/s of storage bandwidth — a 10× improvement and reportedly 20× faster than comparable hyperscaler offerings. The Inference Gateway has also been updated to cut time-to-first-token (TTFT) latency by more than 70% for certain model sizes. Both improvements compound the Rubin memory bandwidth advantage for inference workloads.

Memory bandwidth: what it means for inference

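If you have been optimising inference pipelines on H100 or H200 hardware, you will be familiar with the two dominant latency constraints: prefill time (processing the input tokens) and decode time (generating each output token). Memory bandwidth bears on both. Decode is typically bandwidth-bound, because every generated token re-reads the weights and the KV cache. Prefill is more compute-heavy, but it still moves a growing volume of KV data as context lengths increase.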

In a transformer model, every forward pass reads the model weights and the KV cache from GPU memory. For a 70-billion-parameter model, the weights alone occupy roughly 140 GB at fp16 precision. The KV cache for a 128k-token context adds tens of gigabytes more. If you are running four replicas of such a model on a GPU cluster to serve concurrent requests, you are constantly saturating memory bandwidth, and that saturation translates directly into latency.
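
The arithmetic behind those figures can be sketched in a few lines. The layer count, KV-head count, and head dimension below are assumptions modelled on a typical 70B-class architecture with grouped-query attention, not figures for any specific model; substitute your own configuration.

```python
def weight_bytes(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory for model weights; 2 bytes per parameter at fp16."""
    return n_params * bytes_per_param


def kv_cache_bytes(
    n_tokens: int,
    n_layers: int = 80,        # assumption: typical 70B-class depth
    n_kv_heads: int = 8,       # assumption: grouped-query attention
    head_dim: int = 128,       # assumption
    bytes_per_value: int = 2,  # fp16
) -> int:
    """KV cache for one sequence: one key and one value vector
    per layer, per KV head, per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * n_tokens


GB = 1e9
weights_gb = weight_bytes(70e9) / GB
kv_gb = kv_cache_bytes(128_000) / GB
print(f"weights: {weights_gb:.0f} GB, KV cache at 128k tokens: {kv_gb:.1f} GB")
```

Under these assumptions a single 128k-token sequence needs roughly 42 GB of KV cache on top of 140 GB of weights. A model with more KV heads, or several concurrent long sequences, scales that figure up proportionally.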

Time-to-first-token (TTFT) is particularly sensitive to this. TTFT measures the delay between a user sending a prompt and receiving the first output token. For a 32k-token system prompt plus a 2k-token user query, the prefill phase must process the entire 34k-token context, writing and re-reading its KV cache, before generating a single output token. On an H100 with 3.35 TB/s of memory bandwidth, this is a measurable bottleneck. The Vera Rubin architecture significantly increases this figure.
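
One rough way to see why bandwidth matters is to compute how long it takes just to stream a given number of bytes through GPU memory once. The sketch below is a lower bound only: it assumes the step is purely bandwidth-bound and that the bytes are split evenly across GPUs, and it ignores compute, batching, and interconnect traffic. The 3.35 TB/s and 4.8 TB/s figures are the per-GPU H100 and H200 numbers used in this guide; the four-GPU sharding is an illustrative assumption.

```python
def min_read_time_ms(total_bytes: float, bandwidth_tb_s: float, n_gpus: int = 1) -> float:
    """Lower bound on the time to read total_bytes once,
    split evenly across n_gpus, at the given per-GPU bandwidth."""
    bytes_per_gpu = total_bytes / n_gpus
    seconds = bytes_per_gpu / (bandwidth_tb_s * 1e12)
    return seconds * 1e3


# Illustrative: 140 GB of fp16 weights sharded over 4 GPUs (assumed).
weights = 140e9
for name, bw in [("H100", 3.35), ("H200", 4.8)]:
    ms = min_read_time_ms(weights, bw, n_gpus=4)
    print(f"{name}: at least {ms:.1f} ms per full pass over the weights")
```

With these inputs the floor is about 10 ms per pass on H100 and about 7 ms on H200. Measured latency will be higher, but the ratio shows how directly a bandwidth increase lowers the floor.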

For builders running long-context workloads — legal document review, code-analysis agents, customer-conversation pipelines — this is not a marginal improvement. A model that returns the first token in 400 ms rather than 1,200 ms is qualitatively different in a chat interface. It is the difference between a product that feels responsive and one that feels slow, regardless of how fast the subsequent tokens stream.

Pro tip

If your inference workload is TTFT-sensitive — chatbots, coding assistants, live document editors — benchmark explicitly against the managed Rubin endpoints when they become available in Q3 2026. Do not assume the improvement will be proportional to the raw bandwidth specs; test your specific context length and model size. The gains are real but depend heavily on your batching strategy and KV cache management.
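
A minimal sketch of such a benchmark, assuming your client exposes the response as an iterator of streamed tokens. `fake_stream` is a hypothetical stand-in so the example runs on its own; in practice you would replace it with your provider's streaming call.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def measure_ttft(start_stream: Callable[[], Iterable[str]]) -> Tuple[float, float, int]:
    """Return (ttft_seconds, total_seconds, n_tokens) for one streamed response."""
    t0 = time.perf_counter()
    ttft = None
    n_tokens = 0
    for _ in start_stream():
        if ttft is None:
            ttft = time.perf_counter() - t0  # first token has arrived
        n_tokens += 1
    total = time.perf_counter() - t0
    if ttft is None:
        raise ValueError("stream produced no tokens")
    return ttft, total, n_tokens


def fake_stream(prefill_s: float = 0.05, per_token_s: float = 0.005, n: int = 20) -> Iterator[str]:
    """Hypothetical endpoint: one prefill delay, then steady decode."""
    time.sleep(prefill_s)
    for i in range(n):
        yield f"tok{i}"
        time.sleep(per_token_s)


ttft, total, n = measure_ttft(fake_stream)
print(f"TTFT {ttft * 1000:.0f} ms, total {total * 1000:.0f} ms, {n} tokens")
```

Run it across the context lengths and batch sizes you actually serve, and compare percentiles rather than a single run, since TTFT varies with load.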

Training vs inference: the fleet flip coming in late 2026

Specialised GPU cloud providers — CoreWeave, Lambda, Crusoe, and similar operators — built their businesses primarily on training workloads. As of early 2026, the typical fleet composition across these providers is approximately 70% training capacity, 30% inference capacity. This asymmetry reflects where GPU cloud revenue has historically come from: large foundation model labs running multi-month pre-training runs.

That ratio is expected to invert by end of 2026, and Vera Rubin is a significant catalyst for the flip. Several dynamics are converging:

The foundation model pre-training market is concentrating — a smaller number of very large labs are running the biggest training runs. Commodity training compute is increasingly competed away by H100 price declines (now around $2/hr on spot, down from $8/hr in 2024). This undercuts the revenue model for providers who competed primarily on training cost.
Inference demand is growing faster than training demand — every production AI application is an inference workload. As enterprise adoption accelerates through 2026, the volume of inference requests dwarfs training compute needs across the ecosystem.
Rubin's rack-scale architecture is better suited to large inference clusters than to individual training runs. A single NVL72 rack can serve inference traffic that would have required a loosely coupled cluster of H100s, with lower operational complexity and better utilisation.

The practical implication for builders: the GPU cloud market is restructuring around inference as the primary use case. If you are evaluating a long-term GPU reservation with a specialised provider, ask explicitly how they are repositioning their fleet for inference. A provider still optimised for training throughput may not be the right partner for a production inference workload in 2027.

Watch out

Providers that do not adapt to inference-centric workloads will face margin compression as H100 spot prices remain depressed and training revenue consolidates around a handful of large lab customers. Choose partners with clear inference-oriented roadmaps — look for Rubin deployment timelines, managed inference endpoint offerings, and SLA commitments on TTFT, not just raw throughput.

Should you commit to H200 contracts now?

This is the question that matters most to builders who are currently evaluating GPU reservations. The honest answer is: it depends on your workload, your timeline, and your tolerance for lock-in risk. Here is the framework.

First, the Rubin context. H100 spot prices have already fallen to around $2 per GPU-hour, a 75% decline from the $8 peak in 2024. H200 reserved pricing (1-year terms) from major providers currently sits in the $3–5 per GPU-hour range on standard 8-GPU instances. When Rubin NVL72 capacity becomes available in H2 2026, it will initially carry a significant premium: early Rubin pricing will be higher than H200, not lower, because demand will outstrip supply. So the immediate economics do not automatically favour waiting.

The risk of committing today is not that Rubin will be cheaper — it is that Rubin will be qualitatively better for your specific workload, making H200 reservations feel expensive relative to what you could accomplish with Rubin-class inference endpoints at comparable cost. If your product depends on long-context inference quality or latency, you may find that H200 contracts signed in mid-2026 look like expensive legacy commitments by Q1 2027.

Scenario	Recommendation	Rationale
Short-context inference (<8k tokens), latency not critical	Sign H200 6-month reservation now	Rubin offers no meaningful advantage; H200 is available and cost-competitive
Long-context inference (32k+ tokens), TTFT-sensitive product	Wait for Rubin managed endpoints (Q3–Q4 2026)	Memory bandwidth improvement directly reduces TTFT; worth a 3–6 month wait
Training run <1B parameters	Use H100 spot (currently ~$2/hr)	Rubin is overkill; spot H100 is excellent value for small runs
Training run 10B+ parameters, ongoing	Negotiate 6-month H200 term with renewal option	Avoid 12-month lock-in; preserve optionality for Rubin migration
Multi-modal (video + language) at scale	Wait for Rubin; use managed APIs in the interim	NVL72 architecture is specifically suited to multi-modal bandwidth demands
Compliance-sensitive (UK data residency required)	AWS eu-west-2 or Azure UK South H200 now; migrate to Rubin Q4 2026	Rubin UK regions expected in Q4 2026; bridge with H200 while waiting
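
The table above can be read as a small decision function. The sketch below is one possible encoding, not a definitive rule set: the `Workload` fields are hypothetical names, checking UK residency first is an assumption about precedence, and the function returns `None` for cases the table does not cover (for example, 8k to 32k token contexts or 1B to 10B parameter training runs).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Workload:
    kind: str                     # "inference", "training", or "multimodal"
    context_tokens: int = 0       # used for inference only
    ttft_sensitive: bool = False  # used for inference only
    params_billions: float = 0.0  # used for training only
    uk_residency: bool = False


def recommend(w: Workload) -> Optional[str]:
    """Map a workload onto the table's rows; None where the table is silent."""
    if w.uk_residency:  # assumption: compliance takes precedence
        return "H200 in AWS eu-west-2 or Azure UK South now; migrate to Rubin later"
    if w.kind == "multimodal":
        return "Wait for Rubin; use managed APIs in the interim"
    if w.kind == "inference":
        if w.context_tokens >= 32_000 and w.ttft_sensitive:
            return "Wait for Rubin managed endpoints"
        if w.context_tokens < 8_000 and not w.ttft_sensitive:
            return "Sign H200 6-month reservation now"
    if w.kind == "training":
        if w.params_billions < 1:
            return "Use H100 spot"
        if w.params_billions >= 10:
            return "Negotiate 6-month H200 term with renewal option"
    return None


print(recommend(Workload(kind="inference", context_tokens=64_000, ttft_sensitive=True)))
```

The `None` cases are worth noticing: they mark the workloads where the framework gives no clear answer and a benchmark is the better guide.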

Pro tip

When negotiating reserved GPU contracts today, push hard for 6-month terms rather than 12-month terms — even if the provider offers a discount for the longer commitment. The discount rarely justifies the lock-in risk when a fundamentally new architecture is arriving within the reservation window. If a provider will not offer 6-month terms, treat that inflexibility as a signal about their confidence in Rubin-era pricing.
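
To put a number on that trade-off, compare a discounted 12-month commitment with a 6-month term at the undiscounted rate followed by six months at whatever the market rate is by then. The sketch below is illustrative only: the $4 per GPU-hour rate and the 10% discount are assumed inputs, not quotes, and it ignores migration cost and capacity risk.

```python
HOURS_PER_MONTH = 730  # approximation


def twelve_month_cost(rate: float, discount: float) -> float:
    """Cost per GPU of a 12-month term at a discounted hourly rate."""
    return rate * (1 - discount) * 12 * HOURS_PER_MONTH


def six_plus_six_cost(rate: float, later_rate: float) -> float:
    """Cost per GPU of 6 months at `rate`, then 6 months at `later_rate`."""
    return (rate + later_rate) * 6 * HOURS_PER_MONTH


def break_even_later_rate(rate: float, discount: float) -> float:
    """Second-half rate at which both paths cost the same.
    From 12*r*(1-d) = 6*r + 6*r_later, so r_later = r*(1 - 2*d)."""
    return rate * (1 - 2 * discount)


rate, discount = 4.0, 0.10  # assumed inputs
later = break_even_later_rate(rate, discount)
print(f"12-month deal wins only if the later rate stays above ${later:.2f}/GPU-hr")
```

The break-even is a price fall of twice the discount. With a 10% discount, the shorter term comes out ahead if comparable capacity is more than 20% cheaper in the second half of the year.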

GPU architecture comparison: H100 / H200 / Vera Rubin NVL72

Spec / Factor	H100 (SXM5)	H200 (SXM5)	Vera Rubin NVL72
GPU memory	80 GB HBM3	141 GB HBM3e	72× GPU unified (NVLink pool)
Memory bandwidth	3.35 TB/s per GPU	4.8 TB/s per GPU	Significantly higher (rack-scale)
Form factor	Single GPU / 8-GPU server	Single GPU / 8-GPU server	72-GPU rack-scale unit
Connectivity	NVLink 4 / InfiniBand	NVLink 4 / InfiniBand	NVLink 6 (rack-scale NVLink domain)
Best for	Training, standard inference	Long-context inference, training	Frontier training, large-scale inference, multi-modal
Cloud price (May 2026)	~$2/GPU-hr spot	~$3–5/GPU-hr reserved	TBC (premium at launch)
Availability	Widely available, spot surplus	Available on major clouds	H2 2026 on AWS, GCP, Azure, OCI
Rental unit	Individual GPU / server	Individual GPU / server	Rack slice or managed endpoint

India and UK builder playbook

Access to Vera Rubin hardware will not be uniform across geographies, and the timing gap matters to builders who need to plan infrastructure decisions today.

India

India's GPU cloud ecosystem has developed rapidly through 2025 and early 2026. Neysa — India's leading GPU cloud operator — closed a $1.2 billion financing round earlier in 2026, giving it the capital to pursue aggressive hardware upgrades. Neysa currently operates H100-class hardware across Mumbai and Hyderabad, and a Rubin NVL72 deployment in 2027 is plausible given the financing scale and NVIDIA's stated focus on emerging-market hyperscale partnerships.

The IndiaAI Mission's national compute programme is a second potential pathway. The Mission has been allocating compute capacity to domestic AI labs and startups, and any Rubin procurement under that programme would give Indian builders access to Rubin-class infrastructure without routing through a US-domiciled hyperscaler. Watch for IndiaAI Mission procurement announcements in H2 2026 as a signal of whether Rubin will reach Indian sovereign compute infrastructure before 2027.

For Indian builders who need GPU compute now: the H100 spot market is excellent value at current prices. For inference workloads that are not extremely latency-sensitive, Neysa's H100-class instances are cost-competitive and keep data within Indian jurisdictions — relevant for applications handling user data subject to India's Digital Personal Data Protection Act.

For Indian builders who need long-context inference at scale and cannot wait for domestic Rubin availability: Google Cloud's asia-south1 (Mumbai) region is the most viable bridging option. Google's Managed Lustre announcement and the Inference Gateway latency improvements compound meaningfully even on current H200-class hardware in that region.

United Kingdom

UK data residency requirements are well catered for in the Rubin roadmap. AWS eu-west-2 (London) and Azure UK South (London) are both expected to be among the first European regions to receive Rubin NVL72 deployments, likely in Q4 2026. This is consistent with NVIDIA's history of prioritising UK regions given the concentration of AI labs and financial services AI investment in London.

Nscale — a UK-headquartered GPU cloud provider — has been building out European capacity and is listed among the early Rubin deployment partners. For builders who prefer a UK-domiciled provider over a US hyperscaler for data residency reasons, Nscale is worth evaluating. Their Nordic and UK capacity is expected to include Rubin hardware in Q4 2026.

UK builders evaluating GPU contracts today should note that the UK's current regulatory environment, including the AI Security Institute's (formerly the AI Safety Institute) ongoing model evaluations and proposed frontier AI legislation, may introduce compliance requirements around which infrastructure providers are permissible for certain classes of AI workload. If your application touches regulated sectors (financial services, healthcare, legal), build data residency and provider flexibility into your infrastructure design before committing to multi-year contracts.