Why 5% utilisation is normal — and why it shouldn't be

The figure seems almost impossible. Enterprises have collectively committed hundreds of billions of dollars to GPU infrastructure — H100 clusters, reserved cloud instances, co-location racks — yet the average utilisation rate across the installed base sits at roughly 5%, according to VentureBeat's analysis of GPU cloud market data. The "$401B problem" framing from that analysis is blunt but accurate: if your hardware is busy 5% of the time, 95% of your capital expenditure is producing nothing.

To understand why this happened, it helps to trace the decision chain. In 2023 and 2024, the fear of missing out on AI was genuinely rational. GPU lead times stretched to six months. NVIDIA allocations required executive relationships. Enterprises that did not commit early risked being locked out entirely. So procurement teams went big, often buying for projected peak load rather than actual need. The hardware arrived. The software to fully utilise it often did not.

Compare this with how hyperscalers operate. Google, Microsoft, and Amazon achieve 70–85% GPU utilisation across their inference fleets through continuous batching, dynamic routing, and model multiplexing — techniques that most enterprise AI teams have not yet implemented. The gap is not a hardware problem. It is an operational maturity problem, and it is solvable.

The workload shift compounds the issue. Specialised GPU cloud providers such as CoreWeave, Lambda, and Crusoe currently see approximately 70% of their capacity consumed by training workloads and 30% by inference. By end of 2026, that ratio is widely expected to invert — 70% inference, 30% training — as enterprises complete their initial training runs and move to production deployment. The infrastructure many organisations bought for training is poorly suited to continuous inference serving, and few teams have bridged that gap.

The six root causes of idle GPU capacity

Idle GPU capacity in enterprise environments is rarely caused by a single failure. It is almost always the compound result of several interacting problems. Understanding each one separately is the first step to addressing them systematically.

1. Overprovisioning for peak load. Enterprise procurement is built around worst-case scenarios. A team estimates its busiest conceivable day — perhaps the first week after a product launch, or end-of-quarter compliance runs — and buys hardware to service that load without queuing. The result is a cluster that runs at full capacity for three days a year and sits idle the other 362. Hyperscalers solve this through statistical multiplexing across thousands of tenants; a single enterprise cannot replicate that without a fundamentally different approach to capacity sharing.

2. Batch workload mismatch. Training runs are periodic, not continuous. A fine-tuning job that runs for 72 hours consumes GPUs intensively during that window, then releases them. Unless another workload is scheduled to fill that gap immediately, the hardware sits idle. Most enterprise ML teams do not yet have job schedulers that pack workloads tightly, so large gaps open up between training jobs, during which GPUs sit at near-zero utilisation.

3. Fragmented management without central orchestration. As AI projects proliferated inside large organisations, different teams provisioned their own GPU resources independently. The computer vision team has a cluster. The NLP team has a different one. The data science platform team has a third. None of these clusters can see each other's queue depths. None can lend idle capacity to a neighbour under load. This fragmentation means that even organisations whose individual clusters are reasonably well utilised have a collective utilisation rate far below what unified scheduling would achieve.

4. Model staleness — GPUs waiting for models that haven't shipped. A surprisingly common cause of idle hardware is procurement getting ahead of deployment. A cluster is bought and provisioned. The model it was meant to serve is still in fine-tuning, undergoing safety review, or blocked on an internal approval process. Weeks pass. The GPUs sit idle waiting for work that has not materialised. This is particularly common in regulated industries such as banking and healthcare, where model approval timelines extend to months.

5. Shadow AI and ungoverned provisioning. Individual teams, frustrated by slow central procurement, used cloud credits or departmental budgets to provision their own GPU instances. This shadow AI infrastructure is invisible to central IT. It runs at low utilisation because it was sized for a single team's needs. When that team's project winds down, the instances often keep running — no one thinks to terminate them because no one centrally knows they exist. The billing arrives at the end of the month, attributed to an opaque cost centre, and no one investigates.

6. Naive inference deployment without optimisation. Even organisations that do deploy models to production frequently do so in the most straightforward way: a single model loaded onto a single GPU, handling one request at a time. This naive pattern achieves 3–8% utilisation on an A100 under typical enterprise request patterns. Without continuous batching, dynamic routing, or KV cache sharing, the GPU spends most of its time waiting between requests rather than computing.
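Root causes 1 and 3 compound each other, and a few lines of arithmetic show why. The sketch below compares per-team clusters, each sized for its own peak, with a single pool sized for the largest combined demand. The team names and demand figures are invented for illustration and are not drawn from any survey.

```python
# Hypothetical GPU demand per team across four time slots in a day.
demand = {
    "vision":   [2, 30, 4, 1],
    "nlp":      [40, 3, 2, 5],
    "platform": [1, 2, 14, 2],
}


def per_team_sizing(demand):
    """Each team buys enough GPUs to cover its own peak."""
    return sum(max(slots) for slots in demand.values())


def pooled_sizing(demand):
    """A shared pool only has to cover the largest combined demand."""
    return max(sum(slot) for slot in zip(*demand.values()))


def average_utilisation(demand, gpus):
    """Mean busy GPUs across slots, divided by GPUs provisioned."""
    slots = list(zip(*demand.values()))
    mean_busy = sum(sum(slot) for slot in slots) / len(slots)
    return mean_busy / gpus


fragmented = per_team_sizing(demand)
pooled = pooled_sizing(demand)
print(f"fragmented: {fragmented} GPUs at {average_utilisation(demand, fragmented):.1%}")
print(f"pooled:     {pooled} GPUs at {average_utilisation(demand, pooled):.1%}")
```

Because the teams' peaks fall in different slots, the pooled fleet serves the same work with roughly half the hardware. The effect shrinks when peaks coincide, which is why measuring real demand curves comes before any consolidation decision.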

The governance gap: shadow AI meets enterprise procurement

The fragmentation and shadow AI problems are fundamentally governance problems, not technology problems. They emerge when an organisation's AI investment decisions are distributed across many teams but accountability for the resulting infrastructure costs is unclear.

Microsoft's Agent 365 platform, launched on 1 May 2026 at $15 per user per month, explicitly targets this gap. One of its stated design goals is to convert shadow AI — ungoverned GPU spend and unregistered model deployments — into a governed asset class. The platform provides a unified inventory of AI workloads across an enterprise, maps them to underlying compute resources, and surfaces utilisation metrics at an organisational level rather than a team level.

The concept is sound even if you are not adopting Agent 365. The underlying insight, that GPU utilisation is a governance metric before it is a technology metric, applies universally. If no one in the organisation has a complete view of which GPUs are running, what workloads they are running, and at what utilisation rate, no amount of technical optimisation will fix the structural problem. Shadow deployments will continue to proliferate. Procurement will continue to overprovision. The 5% average will persist.

The governance architecture that works in practice has three components: a centralised inventory of all GPU resources (including cloud instances provisioned by individual teams), a unified scheduler or request router that can direct workloads across the full pool, and utilisation visibility at both the team level (for accountability) and the organisational level (for capacity planning). Tools such as Microsoft Agent 365, Weights and Biases compute tracking, and open-source alternatives like Volcano scheduler for Kubernetes provide starting points. The technology is less important than the governance commitment — someone in the organisation must own GPU utilisation as a metric and have the authority to act on it.
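As a rough illustration of the first and third components, the sketch below models a centralised inventory and reports utilisation at team and organisation level. The record fields, resource IDs, and figures are hypothetical; a real inventory would be populated from cloud billing exports and cluster APIs rather than hard-coded.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuResource:
    resource_id: str
    team: str
    gpus: int
    avg_utilisation: float  # 0.0 to 1.0 over the reporting window
    registered: bool        # False if discovered only through billing data


def utilisation_by_team(inventory):
    """GPU-weighted average utilisation per team (accountability view)."""
    gpus = defaultdict(int)
    busy = defaultdict(float)
    for r in inventory:
        gpus[r.team] += r.gpus
        busy[r.team] += r.gpus * r.avg_utilisation
    return {team: busy[team] / gpus[team] for team in gpus}


def organisation_utilisation(inventory):
    """GPU-weighted average across the whole estate (capacity-planning view)."""
    total = sum(r.gpus for r in inventory)
    return sum(r.gpus * r.avg_utilisation for r in inventory) / total


def unregistered(inventory):
    """Shadow resources that no central owner has claimed."""
    return [r.resource_id for r in inventory if not r.registered]


# Hypothetical inventory entries.
inventory = [
    GpuResource("onprem-a100-01", "nlp", 8, 0.06, True),
    GpuResource("cloud-p4d-7", "nlp", 8, 0.02, False),
    GpuResource("onprem-a100-02", "vision", 16, 0.10, True),
]

print(utilisation_by_team(inventory))
print(f"organisation: {organisation_utilisation(inventory):.1%}")
print("shadow resources:", unregistered(inventory))
```

Weighting by GPU count matters: a simple average of per-cluster percentages would let a small, busy cluster mask a large, idle one.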

Watch out

Buying reserved GPU capacity before you have a utilisation baseline is the fastest way to guarantee a poor return on investment. Establish your actual average load using spot or on-demand instances first. Reserve only after you have three months of utilisation data showing consistent demand.

A six-step optimisation playbook

The following techniques are ordered by implementation effort, starting with the changes that deliver the fastest improvement for the least engineering work. A team that implements all six should expect to reach 40–60% utilisation from a starting point of 5%, an 8× to 12× improvement in hardware return.

Step 1 — Enable continuous batching with vLLM

A naive inference server handles one request at a time. The GPU computes the response, returns it, then sits idle until the next request arrives. Continuous batching changes this: the server fills the GPU with multiple concurrent requests and dynamically adds new sequences as slots free up within the batch. The result is near-constant GPU occupancy during serving hours.
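The scheduling difference can be seen in a toy, step-level simulation. This is not how vLLM is implemented; it is a minimal sketch that assumes every forward pass takes the same time regardless of batch size (real passes slow down somewhat as the batch grows), and the request list is invented.

```python
from collections import deque


def simulate(requests, max_slots):
    """Toy scheduler.

    requests:  list of (arrival_step, decode_steps) tuples.
    max_slots: sequences allowed in flight at once (1 = naive serving).
    Returns (steps_until_all_done, mean_active_sequences_per_step).
    """
    pending = deque(sorted(requests))
    active = []        # remaining decode steps for each in-flight sequence
    step = 0
    active_total = 0
    while pending or active:
        # Admit arrived requests into free slots before each forward pass.
        while pending and pending[0][0] <= step and len(active) < max_slots:
            active.append(pending.popleft()[1])
        active_total += len(active)
        # One forward pass: every in-flight sequence emits one token,
        # and sequences that just emitted their last token leave the batch.
        active = [remaining - 1 for remaining in active if remaining > 1]
        step += 1
    return step, active_total / step


requests = [(0, 4), (0, 4), (1, 2), (2, 3)]   # hypothetical workload
print("naive:     ", simulate(requests, max_slots=1))
print("continuous:", simulate(requests, max_slots=4))
```

Both runs generate the same 13 tokens. The naive server needs 13 forward passes with one sequence each; the batched server finishes in 5 passes with an average of 2.6 sequences in flight.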

vLLM implements continuous batching out of the box. The critical configuration parameters are:

# vllm_config.yaml
model: meta-llama/Llama-3-70B-Instruct
tensor_parallel_size: 4          # split across 4 GPUs
max_num_seqs: 256                 # concurrent sequences in flight
max_num_batched_tokens: 32768     # tokens processed per forward pass
gpu_memory_utilization: 0.90     # fraction of VRAM vLLM may use (weights, activations, KV cache)
enable_chunked_prefill: true      # overlap prefill and decode phases
quantization: awq                 # 4-bit weights; needs an AWQ-quantised checkpoint of the model

With this configuration on a 4×A100-80GB node, a 70B model serving typical enterprise request patterns (128–512 token outputs) achieves 35–55% MFU versus 4–7% for a naive single-request server. The configuration change alone — no code rewrite required — represents the largest single utilisation improvement available.

Pro tip

Set gpu_memory_utilization to 0.90 (vLLM's default) rather than a more conservative 0.85. Whatever VRAM remains after weights and activations goes to KV cache, so the extra 5% directly increases the number of concurrent sequences you can hold. On an 80GB A100 where the weights already occupy most of the memory, this can yield approximately 30% more concurrent capacity.

Step 2 — Model multiplexing with LoRA adapters

Many enterprise deployments run one fine-tuned model per GPU — a customer service model on GPU A, a document summarisation model on GPU B, a code assistant on GPU C. This is extraordinarily wasteful. LoRA (Low-Rank Adaptation) fine-tuning produces small adapter weights (typically 10–500MB) that can be swapped dynamically on top of a shared base model. A single A100 can host one base model and dozens of LoRA adapters, serving requests for any of them by loading the relevant adapter on demand.

# Example: vLLM LoRA multiplexing
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Load a single base model
llm = LLM(
    model="meta-llama/Llama-3-8B-Instruct",
    enable_lora=True,
    max_lora_rank=64,
    max_loras=16,                      # up to 16 adapters in VRAM simultaneously
    max_cpu_loras=32,                  # additional adapters on CPU for fast swap
)

# Adapter registry: adapter name, unique integer ID, path to the adapter weights
ADAPTERS = {
    "customer-service": LoRARequest("cs-adapter", 1, "/adapters/customer-service"),
    "summarisation":    LoRARequest("summ-adapter", 2, "/adapters/summarisation"),
    "code-assist":      LoRARequest("code-adapter", 3, "/adapters/code"),
}

# Route requests to the appropriate adapter
def generate_with_adapter(prompt: str, task: str) -> str:
    # Unknown tasks map to None, so the request runs on the base model without an adapter
    lora_request = ADAPTERS.get(task)
    outputs = llm.generate(
        [prompt],
        SamplingParams(max_tokens=512),
        lora_request=lora_request,
    )
    return outputs[0].outputs[0].text

This consolidation pattern is one of the highest-leverage changes available to teams running multiple fine-tuned models. Three GPUs running one model each at 8% utilisation becomes one GPU running three models at 24% — before any other optimisation. Combined with continuous batching, the same GPU can reach 50–70% utilisation.

Step 3 — Spot instances for training workloads

Training runs are interruptible by design: you checkpoint and resume. This makes them ideal candidates for spot (preemptible) instances, which cost 60–70% less than on-demand equivalents on most GPU cloud platforms. The only requirement is a robust checkpointing strategy: save model state every 15–30 minutes so that an interruption loses at most one checkpoint interval's worth of compute.
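A minimal sketch of the checkpoint-and-resume pattern follows. It uses a JSON file and a placeholder "training" update so that it runs anywhere; the function names and the preempt_at parameter are illustrative, and a real job would save model and optimiser state with its framework's own serialisation.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write atomically so a preemption mid-write cannot corrupt the file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic rename within one filesystem


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if none exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "loss": None}


def train(path, total_steps, checkpoint_every, preempt_at=None):
    """Stand-in training loop; preempt_at simulates a spot interruption."""
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        if preempt_at is not None and state["step"] == preempt_at:
            raise InterruptedError("spot instance reclaimed")
        state["step"] += 1
        state["loss"] = 1.0 / state["step"]  # placeholder for a real update
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state


ckpt = os.path.join(tempfile.mkdtemp(), "run.json")
try:
    train(ckpt, total_steps=100, checkpoint_every=10, preempt_at=37)
except InterruptedError:
    pass
print("resuming from step", load_checkpoint(ckpt)["step"])
print("finished at step", train(ckpt, total_steps=100, checkpoint_every=10)["step"])
```

The interrupted run loses only the work done since the last checkpoint (steps 31 to 37 here), which is the trade-off the checkpoint interval controls.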

On AWS, a p4d.24xlarge (8×A100-40GB) costs approximately $32/hr on-demand and $9–12/hr on spot. A 72-hour training run costs $2,304 on-demand and $648–864 on spot. For a team running four training jobs per month, spot scheduling saves $5,760–$6,624 per month before any other change.
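The arithmetic behind these figures is reproduced below so that it can be rerun with your own rates. The hourly prices are the approximate ones quoted above, and the calculation ignores any compute repeated after an interruption.

```python
HOURS_PER_RUN = 72
RUNS_PER_MONTH = 4
ON_DEMAND_PER_HOUR = 32.0      # approximate p4d.24xlarge on-demand rate
SPOT_PER_HOUR = (9.0, 12.0)    # approximate spot range: low, high


def run_cost(rate_per_hour, hours=HOURS_PER_RUN):
    """Cost of one training run at a given hourly rate."""
    return rate_per_hour * hours


on_demand_run = run_cost(ON_DEMAND_PER_HOUR)
spot_run = tuple(run_cost(rate) for rate in SPOT_PER_HOUR)

# Smallest saving comes from the highest spot price, and vice versa.
monthly_saving = tuple(
    RUNS_PER_MONTH * (on_demand_run - cost) for cost in reversed(spot_run)
)

print(f"on-demand run: ${on_demand_run:,.0f}")
print(f"spot run:      ${spot_run[0]:,.0f} to ${spot_run[1]:,.0f}")
print(f"monthly saving: ${monthly_saving[0]:,.0f} to ${monthly_saving[1]:,.0f}")
```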

Pro tip

Use AWS's Fault-Tolerant Training on EKS or GCP's TPU preemption-aware checkpointing to automate checkpoint-and-resume. The engineering overhead is one to two days. The payback, at enterprise training volumes, is typically under a week.

Step 4 — Capability-based request routing

Not all inference requests require a 70B model. A request to classify a support ticket into one of twelve categories can be handled by a 7B or even a 3B model with near-identical accuracy. A request to draft a complex legal clause genuinely requires a larger model. Routing requests based on estimated complexity — using a lightweight classifier or simple rule engine — can cut inference costs by 40–60% while maintaining quality on demanding tasks.

Google's Cloud Inference Gateway implements ML-driven capacity-aware routing and has demonstrated TTFT (time to first token) reductions of over 70% by selecting the lowest-latency available model for each request. The same principle applies on-premise: a lightweight router in front of your inference fleet that directs simple requests to smaller, faster models is one of the highest-leverage architectural investments available.
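A rule-based router of the kind described can be very small. The sketch below is one possible shape, not a reference implementation: the model names, task labels, keyword hints, and length threshold are all hypothetical and would need tuning against your own traffic and quality evaluations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str
    reason: str


SMALL_MODEL = "small-7b"     # hypothetical deployment names
LARGE_MODEL = "large-70b"

SIMPLE_TASKS = {"classify", "extract", "tag"}
COMPLEX_HINTS = ("draft", "contract", "clause", "analyse", "explain why")


def route(task: str, prompt: str, max_small_prompt_chars: int = 2000) -> Route:
    """Pick a model tier from the task label and simple prompt features."""
    text = prompt.lower()
    if any(hint in text for hint in COMPLEX_HINTS):
        return Route(LARGE_MODEL, "prompt contains a complexity hint")
    if task in SIMPLE_TASKS and len(prompt) <= max_small_prompt_chars:
        return Route(SMALL_MODEL, "simple task with a short prompt")
    return Route(LARGE_MODEL, "default to the larger model when unsure")


print(route("classify", "Which of the twelve categories fits this ticket: printer offline?"))
print(route("generate", "Draft an indemnity clause for a supplier agreement."))
```

Defaulting to the larger model when no rule matches keeps quality safe at the expense of cost; the savings come from the share of traffic that the rules can confidently send to the smaller tier.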

Step 5 — KV cache compression and optimisation

The KV (key-value) cache stores intermediate attention computations and is the primary consumer of GPU VRAM during inference. A full KV cache enables the GPU to serve longer sequences without recomputation, but an oversized or poorly managed cache crowds out batch capacity.

Two complementary techniques improve KV cache efficiency significantly. Google's TurboQuant compresses KV cache entries from FP16 to INT8 or INT4 precision, which cuts the cache's VRAM consumption by roughly 2× at INT8 and 4× at INT4, with minimal quality impact. FlashAttention-3 restructures the attention computation to be IO-aware: it tiles the work so that far less data moves between high-bandwidth memory and on-chip SRAM, and it avoids materialising the full attention matrix. This reduces the working memory of each forward pass rather than the size of the cache itself. Together, these allow the same VRAM to hold a larger batch of sequences, directly improving throughput and utilisation.
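The effect of value width on cache capacity follows from a standard sizing formula: two tensors (keys and values) per layer, per KV head, per head dimension. The sketch below applies it to assumed dimensions for a 13B-class decoder without grouped-query attention and an assumed 40 GiB cache budget; it ignores the small overhead of quantisation scales.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value):
    """Keys and values (the factor of 2) for every layer and KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def max_cached_tokens(cache_budget_gib, bytes_per_token):
    """How many tokens of KV cache fit in the given VRAM budget."""
    return int(cache_budget_gib * 1024**3 // bytes_per_token)


# Assumed model dimensions and cache budget, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 40, 40, 128
CACHE_BUDGET_GIB = 40

for name, width in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    per_token = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, width)
    tokens = max_cached_tokens(CACHE_BUDGET_GIB, per_token)
    print(f"{name}: {int(per_token):>7,} bytes/token, {tokens:>7,} tokens cached")
```

Dividing the cached-token figure by a typical sequence length gives a rough ceiling on concurrent sequences, which is the number that continuous batching can actually exploit.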

Step 6 — Autoscale inference clusters to zero

Enterprise AI workloads have pronounced daily and weekly patterns. A document processing pipeline might peak at 9am–5pm on weekdays and sit idle overnight and at weekends. An internal coding assistant has near-zero demand outside business hours. Keeping GPUs running continuously for these workloads means paying for 168 hours of GPU time to serve perhaps 40–50 hours of actual demand.

Kubernetes-based autoscaling, combined with KEDA (Kubernetes Event-Driven Autoscaling) and GPU node pools configured with fast cold-start images, can scale inference clusters down to zero replicas during idle periods and back up within 90–120 seconds when requests arrive. For workloads that can tolerate a brief cold-start latency — most internal tools and non-customer-facing pipelines — this is straightforward to implement and eliminates idle overnight GPU spend entirely.
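The size of the saving is simple to estimate before touching any cluster configuration. The sketch below assumes billing is proportional to the hours a replica is up and adds a small allowance for cold starts; the cold-start counts are illustrative.

```python
HOURS_PER_WEEK = 24 * 7


def weekly_gpu_hours(active_hours, cold_starts_per_week=0, cold_start_seconds=0):
    """Billed GPU-hours per replica when scaled to zero outside active hours."""
    return active_hours + cold_starts_per_week * cold_start_seconds / 3600


def saving_fraction(active_hours, cold_starts_per_week=0, cold_start_seconds=0):
    """Share of always-on GPU-hours avoided by scaling to zero."""
    billed = weekly_gpu_hours(active_hours, cold_starts_per_week, cold_start_seconds)
    return 1 - billed / HOURS_PER_WEEK


for active in (40, 50):
    # Assume one 120-second cold start per weekday.
    saved = saving_fraction(active, cold_starts_per_week=5, cold_start_seconds=120)
    print(f"{active} active hours/week: {saved:.1%} of GPU-hours avoided")
```

For a workload with 40 to 50 hours of real demand per week, roughly 70% to 76% of always-on GPU-hours disappear, and the cold-start allowance barely moves the result.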

GPU waste by deployment pattern

The following table illustrates how utilisation and cost-per-request change as optimisation techniques are layered on top of a baseline naive deployment. Figures are based on an A100-80GB serving a 13B parameter model at typical enterprise request volumes (500 requests/hour sustained).

| Deployment pattern | GPU utilisation | Requests/hr (A100) | Cost per 1k requests | Monthly GPU spend (500 req/hr) |
|---|---|---|---|---|
| Naive (single-request, no batching) | 4–7% | ~180 | $4.20 | $15,120 |
| + Continuous batching (vLLM) | 35–50% | ~1,400 | $0.54 | $1,944 |
| + LoRA multiplexing (4 adapters) | 55–65% | ~1,800 | $0.42 | $1,512 |
| + KV cache compression (INT8) | 60–70% | ~2,100 | $0.36 | $1,296 |
| + Capability routing (small model for 40% of requests) | 65–75% (combined fleet) | ~2,800 (effective) | $0.21 | $756 |
| + Autoscale to zero (nights/weekends off) | 65–75% (during active hours) | ~2,800 (effective) | $0.14 | $504 |

The journey from naive deployment to a fully optimised stack cuts cost per 1k requests from $4.20 to $0.14, which reduces monthly GPU spend by approximately 97% for the same request volume (from $15,120 to $504 in the table above). At enterprise scale, across dozens of GPUs, the numbers become material to infrastructure budgets.
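The cumulative reduction can be recomputed directly from the cost-per-1k-requests column, which is useful when substituting your own measured figures. The values below are copied from the table; the helper name is illustrative.

```python
# Cost per 1k requests after each technique is layered on, from the table.
COST_PER_1K = [
    ("naive", 4.20),
    ("continuous batching", 0.54),
    ("LoRA multiplexing", 0.42),
    ("KV cache compression", 0.36),
    ("capability routing", 0.21),
    ("autoscale to zero", 0.14),
]


def reduction_vs_baseline(cost, baseline):
    """Fractional cost reduction relative to the naive deployment."""
    return 1 - cost / baseline


baseline = COST_PER_1K[0][1]
for name, cost in COST_PER_1K:
    print(f"{name:22s} ${cost:5.2f}  {reduction_vs_baseline(cost, baseline):6.1%} below naive")
```

Most of the reduction arrives with the first step: continuous batching alone accounts for about 87 points of the final 97.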

India and UK enterprise patterns

The 5% utilisation problem is global, but the path to fixing it looks different in different markets. Understanding the specific context in India and the UK matters for builders advising organisations in those regions.

India: IT services firms and the IndiaAI Mission. TCS, Infosys, Wipro, and HCL have all announced significant GPU cluster investments in 2024–2025 to support AI services offerings for their clients. These investments were driven partly by competitive pressure — a credible AI services pitch now requires demonstrable GPU infrastructure — and partly by the IndiaAI Mission's push to build domestic AI compute capacity. The challenge is that much of this infrastructure was procured before the workloads to fill it were clearly defined. Training runs for client engagement models are periodic. Inference deployments for client-facing products are still nascent. The result is large GPU clusters in Mumbai and Bangalore data centres running at single-digit utilisation.

The opportunity for builders advising these firms is substantial. An IT services firm with 200 GPU nodes at 5% utilisation is doing the work of about 10 fully utilised nodes; the equivalent of 190 nodes is idle. A utilisation improvement programme recovers capacity from that idle pool at zero capex cost, where buying new nodes only adds to it. The conversation to have with a CTO at one of these organisations is not "how do you buy more GPUs" but "what is your current utilisation baseline, and what would 30% utilisation mean for your margins."

The IndiaAI Mission's National AI Compute initiative is adding shared GPU infrastructure accessible to startups and research institutions. This pooled model inherently achieves higher utilisation than enterprise-owned clusters and is worth factoring into advice for Indian founders who do not yet need dedicated infrastructure.

UK: public sector and financial services. UK public sector bodies — NHSX successor programmes, HMRC, the Ministry of Defence's AI Centre — committed to GPU capacity in 2024 ahead of deployment plans. UKRI infrastructure investments, similarly, were sized for anticipated demand that has materialised more slowly than expected. The UK financial services sector (Barclays, HSBC, Lloyds) bought GPU capacity aggressively in anticipation of regulatory approval for AI models in customer-facing applications; those approvals have moved at regulatory pace rather than technology pace, leaving clusters underutilised.

The UK regulatory environment — FCA guidance on AI model risk, ICO requirements on automated decision-making — means that model approval timelines extend the period between GPU procurement and production deployment. This is the "model staleness" root cause in its UK-specific form. Builders advising UK financial institutions should flag this dynamic explicitly: buying GPU capacity before the regulatory approval timeline is understood commits capital that will sit idle for months.

From a verified Builder

"We were brought in to review a UK insurer's AI infrastructure. They had 48 A100s running at under 4% utilisation. The model they bought the hardware for was still in their internal risk committee. Enabling continuous batching on the two models that were in production took us three days and took utilisation to 38%. The hardware was never the problem."

— Arjun, Verified Builder · London, UK / Bengaluru, IN

Measuring your own utilisation: the three metrics that matter

Before implementing any of the optimisation techniques above, you need a baseline. Without measurement, you cannot know which root cause is dominant in your environment, and you cannot demonstrate improvement to the stakeholders who control the infrastructure budget. Three metrics provide a complete picture of GPU utilisation health.

1. MFU (Model FLOP Utilisation). MFU measures the fraction of the GPU's theoretical peak floating-point throughput that is actually being used for model computation. An A100-80GB has a theoretical peak of 312 TFLOPS (BF16, dense). If your inference server is delivering 15 TFLOPS of useful computation, your MFU is 15 / 312 = 4.8%. MFU is the gold standard metric because it directly reflects hardware efficiency rather than merely whether a kernel happens to be resident. Useful TFLOPS are estimated from token throughput and model size; NVIDIA's DCGM (Data Center GPU Manager) exporter with a Prometheus/Grafana stack supplies the hardware-side activity counters to track alongside it.

2. Queue depth: average requests waiting per GPU. A queue depth of zero means the GPU is idle. A queue depth persistently above 5–10 means you are capacity-constrained and requests are being delayed. The ideal operating point is a queue depth of 1–3, where the GPU is always busy but requests are not waiting excessively. Queue depth data is available from vLLM's metrics endpoint at /metrics in Prometheus format, for example via the vllm:num_requests_waiting gauge.

3. Idle time percentage. What fraction of each 24-hour period do your GPUs sit at under 5% MFU? This metric surfaces the batch-gap and autoscaling opportunity. If GPUs are idle from 10pm to 8am every day, that is 10 hours (42% of each day) where autoscaling to zero would eliminate cost entirely. Measure using nvidia-smi dmon or the DCGM exporter's DCGM_FI_DEV_GPU_UTIL gauge.
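The three metrics reduce to a few lines of arithmetic once the raw numbers are in hand. The sketch below is illustrative: the 2-FLOPs-per-parameter-per-token estimate is a common rule of thumb that ignores attention cost, the "watch" band between 3 and 5 is an assumption that fills the gap between the thresholds above, and the hourly readings are invented.

```python
A100_PEAK_TFLOPS_BF16 = 312.0


def achieved_tflops(tokens_per_second, params_billion):
    """Rule of thumb: about 2 FLOPs per parameter per generated token."""
    return 2 * params_billion * 1e9 * tokens_per_second / 1e12


def mfu(useful_tflops, peak_tflops=A100_PEAK_TFLOPS_BF16):
    """Fraction of theoretical peak spent on useful model computation."""
    return useful_tflops / peak_tflops


def classify_queue_depth(avg_waiting):
    """Map average waiting requests per GPU onto the bands described above."""
    if avg_waiting == 0:
        return "idle"
    if avg_waiting <= 3:
        return "healthy"
    if avg_waiting <= 5:
        return "watch"  # assumed band between the stated thresholds
    return "capacity-constrained"


def idle_fraction(utilisation_samples, threshold_pct=5.0):
    """Share of samples in which utilisation is below the idle threshold."""
    idle = sum(1 for u in utilisation_samples if u < threshold_pct)
    return idle / len(utilisation_samples)


print(f"MFU at 15 TFLOPS: {mfu(15.0):.1%}")
print(f"13B model at 577 tok/s: {achieved_tflops(577, 13):.1f} TFLOPS")
print("queue depth 2:", classify_queue_depth(2))

# Hypothetical hourly GPU utilisation readings: idle from 10pm to 8am.
hourly = [0] * 8 + [40] * 14 + [0] * 2
print(f"idle time: {idle_fraction(hourly):.1%}")
```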

A Prometheus scrape config that captures all three:

# prometheus.yml — scrape vLLM metrics and DCGM GPU metrics
scrape_configs:
  - job_name: vllm
    static_configs:
      - targets: ['inference-server:8000']
    metrics_path: /metrics
    scrape_interval: 15s

  - job_name: dcgm
    static_configs:
      - targets: ['dcgm-exporter:9400']
    scrape_interval: 15s
    # Key gauges: DCGM_FI_DEV_GPU_UTIL, DCGM_FI_DEV_MEM_COPY_UTIL,
    #             DCGM_FI_PROF_GR_ENGINE_ACTIVE

Run this stack for two weeks before making any infrastructure changes. The data will tell you which root cause to address first — overprovisioning, batch gaps, naive inference, or shadow AI — more reliably than any external benchmark.

For broader context on where GPU economics are heading as the market matures, see our guides on H100 GPU price trends, NVIDIA B300 inference economics, and the NVIDIA Vera Rubin architecture. On the software side, the LLM cost reduction techniques guide covers prompt caching and semantic KV cache patterns that complement the hardware utilisation improvements above.