What does PagedAttention actually do differently?

PagedAttention splits the KV cache into fixed-size blocks (pages) and maps logical positions to non-contiguous physical blocks. That reduces fragmentation and lets you pack more concurrent sequences into the same GPU memory footprint — which is why vLLM became the operational default in the first place.

What value should I set for --gpu-memory-utilization on an H100?

Start at the default of 0.9 and walk it up incrementally — 0.92, 0.94, 0.95 — while monitoring for OOM at peak concurrency. The bigger your KV cache headroom needs to be (long contexts, bursty concurrency), the lower you stay. Anything above 0.95 risks OOM on cold cache misses or sudden concurrency spikes.

Should I use vLLM or SGLang on H100 in 2026?

Use vLLM as the default — it has the largest operator surface, most documentation, broadest model coverage, and the cleanest production tooling. Reach for SGLang when your workload is prefix-heavy (long shared system prompts, agentic loops with shared scaffolding) and you have engineering bandwidth to operate it. PremAI's measurements show SGLang at roughly 16,200 tok/s versus vLLM at 12,500 tok/s on smaller models, but raw throughput is rarely the constraint that matters in production.

Do I really need KEDA, or is the Horizontal Pod Autoscaler enough?

HPA scales on CPU and memory, which are the wrong signals for an LLM serving pod — the bottleneck is GPU utilisation, queue depth, and request rate. KEDA lets you scale on Prometheus metrics (queued requests, time-to-first-token, GPU utilisation) which are the signals that actually correlate with user-visible latency. If you only have a couple of pods and traffic is steady, HPA is fine; once you have bursty traffic or multiple model variants, KEDA pays for itself.

vLLM 0.9 on H100: PagedAttention + Docker/KEDA Stack

Why vLLM is still the operational default in 2026

vLLM is not the throughput leader anymore. PremAI's measurements on H100 put SGLang at roughly 16,200 tokens per second against vLLM's 12,500 on the smaller-model benchmarks that get quoted in launch posts. TensorRT-LLM goes further still when you accept the engine-rebuild tax. So why does vLLM remain the default that most teams in Bengaluru and London actually run in production?

Because raw throughput is almost never the constraint that matters. What matters is the operator surface: how many models you can serve without a code change, how predictable scaling behaviour is, how good the documentation is when a junior engineer takes the pager at 2 a.m., and how easily the stack composes with the rest of your platform — ingress, autoscaler, observability, certificate rotation. On every one of those axes vLLM is in front, and that compounds into less downtime and faster iteration than a 25 per cent throughput gap can claw back.

The other reason is consolidation. Most of the open-source inference ecosystem — model adapters, quantisation toolkits, structured-output schemas, speculative decoding plumbing — lands in vLLM first. If you pick vLLM you inherit a year-long head start on tooling. If you pick SGLang you inherit a faster engine and a smaller pool of off-the-shelf solutions to operational problems.

Pro tip

Pick the engine that minimises your operator workload, not the one that wins the throughput benchmark on a blog post. The cost of an extra engineer-week is larger than the cost of a few extra GPU-hours per month, especially on rented H100s at $2–$3 per hour.

The four prerequisites that bite teams in week one

vLLM is forgiving on the happy path and unforgiving when one of the prerequisites is even slightly off. We see four mismatches over and over in incident reports — driver too old, CUDA toolchain wrong, container toolkit missing, Compose version too low. Pin the floor versions before you do anything else.

Component	Minimum version	Why it matters
NVIDIA driver	525	FP8 paths and recent CUDA features fail silently on older drivers
CUDA toolkit	12.1	Required by vLLM 0.9 wheels and FlashAttention-3 kernels
NVIDIA Container Toolkit	1.14	Older versions break GPU passthrough on Compose V2
Docker Engine + Compose V2	23.0	Earlier Compose lacks `device_requests` GPU shorthand
Kubernetes	1.27	Required for the NVIDIA GPU Operator features vLLM relies on
KEDA	v2.x	ScaledObject CRD and Prometheus scaler stability
cert-manager	latest stable	With a `letsencrypt-prod` ClusterIssuer for ingress TLS

Watch out

Cloud-provider GPU images often ship with a driver pinned older than 525 for stability. Check nvidia-smi on first boot — if the driver is older than 525, request a newer image rather than try to upgrade in place. We have lost more than one weekend to driver upgrades that broke containerd.

PagedAttention — the actual reason vLLM exists

The attention mechanism behind every transformer keeps a key-value cache for every token in every active sequence. In the naive implementation that cache lives in one contiguous chunk per sequence. The problem with contiguous allocations is fragmentation: when sequences have different lengths and finish at different times, the GPU memory map ends up looking like a Swiss cheese, and you cannot start a new sequence even when the total free memory would otherwise allow it.

PagedAttention takes the same trick the operating system uses for virtual memory and applies it to the KV cache. The cache is split into fixed-size blocks — pages — and the runtime maintains a logical-to-physical mapping. A sequence's pages do not need to be contiguous in physical memory; they only need to be tracked. The result is that you can pack many more concurrent sequences into the same GPU memory than a naive implementation would allow, and the system degrades gracefully under load instead of OOMing the moment fragmentation crosses a threshold.

The visible effect for a Builder is a higher steady-state batch size, more predictable tail latency, and a lower probability that a single long request will block shorter ones. None of that comes from raw kernel speed — it comes from better memory accounting. That accounting is what you are buying when you pick vLLM, and it is the bedrock that all of the tuning below sits on.

Tuning gpu-memory-utilization without OOMing

The single most important knob on a vLLM deployment is --gpu-memory-utilization. It controls the fraction of GPU memory vLLM is allowed to claim for the KV cache and the model weights. The default is 0.9, which is conservative enough to survive most workloads. It is also the value that leaves the most performance on the table.

The right approach is a controlled walk-up. Start at 0.9 and run your real workload, not a synthetic benchmark — the latter rarely captures the burst patterns or the long-tail context lengths that cause OOMs in production. Watch nvidia-smi for memory headroom and the vLLM logs for cache-eviction events. Then bump the value to 0.92, repeat, and continue until you are within one OOM event of the ceiling. For most teams the sweet spot lands between 0.92 and 0.95.

Recommended

Combine the gpu-memory-utilization walk-up with explicit --max-num-seqs and --max-model-len flags. These three knobs together let you trade KV cache footprint, concurrency ceiling, and context length explicitly — rather than letting vLLM guess from defaults. Document the chosen values in your repo's README so the next on-call engineer does not have to rediscover them.

Tensor parallel + FP8 on a single H100 node

An 8×H100 SXM5 node has 640 GB of HBM3, which is enough to host a 70B-class model at FP16, or a 405B-class model at FP8, with reasonable KV cache headroom. The two knobs you want are --tensor-parallel-size (how many GPUs the model is sharded across) and the quantisation mode (FP16 by default, FP8 if you accept a marginal quality hit for a roughly 2× memory saving).

A typical serve command for a 70B model on an 8×H100 node with FP8 quantisation looks like this:

vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 8 \
  --quantization fp8 \
  --gpu-memory-utilization 0.92 \
  --max-num-seqs 256 \
  --max-model-len 16384 \
  --enable-prefix-caching \
  --host 0.0.0.0 \
  --port 8000

A few notes on the flags. --tensor-parallel-size 8 spreads the weights across all eight GPUs on the node; smaller models do not benefit from full TP and pay a communication overhead instead. --quantization fp8 only works on Hopper (H100, H200) and newer — it is silently rejected on Ampere. --enable-prefix-caching is free latency for any workload with shared system prompts, agentic scaffolding, or repeated few-shot examples; turn it on unless you have a specific reason not to.

Watch out

FP8 is not a free lunch on every model. Some fine-tuned variants degrade noticeably on reasoning benchmarks. Always re-run your eval suite against the quantised model before promoting it to production. We have seen 3–5 per cent regressions on chain-of-thought tasks that did not surface on perplexity checks.

Docker Compose stack — vLLM + nginx + persistent model cache

Most production deployments do not run vLLM bare. They wrap it with a reverse proxy for rate limiting, header rewriting, and TLS termination; a persistent volume for the HuggingFace model cache so a pod restart does not re-download 140 GB of weights; and a health-check endpoint that the orchestrator can probe. Compose makes the wiring explicit and repeatable.

services:
  vllm:
    image: vllm/vllm-openai:v0.9.0
    container_name: vllm
    runtime: nvidia
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    volumes:
      - ./models:/root/.cache/huggingface
    ports:
      - "8000:8000"
    command: >
      --model meta-llama/Llama-3.1-70B-Instruct
      --tensor-parallel-size 8
      --quantization fp8
      --gpu-memory-utilization 0.92
      --enable-prefix-caching
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  nginx:
    image: nginx:1.27
    depends_on:
      - vllm
    ports:
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
      - ./certs:/etc/nginx/certs:ro

The nginx service needs a real nginx.conf in the working directory and TLS certificates in ./certs — whether issued by your internal PKI, an ACME client like acme.sh, or simply self-signed for staging. The ./models volume keeps weights between restarts, which on rented infrastructure where boot disks are slow can save 20 minutes per cold start.

Kubernetes path — KEDA autoscaling + cert-manager TLS

Compose works fine for a single node. The moment you need elasticity, multiple model variants, or rolling updates without dropping requests, you graduate to Kubernetes. Three CNCF projects do most of the lifting: the NVIDIA GPU Operator handles driver and device-plugin installation, KEDA handles event-driven autoscaling, and cert-manager handles TLS renewal.

The shape of the manifests is conventional: a Deployment with one vLLM container per pod, a Service in front of it, an Ingress for external access, and a ScaledObject from KEDA that watches a Prometheus metric — typically GPU utilisation, queue depth, or time-to-first-token — and scales the deployment between a minimum and maximum replica count. cert-manager provisions and rotates the certificate behind the Ingress; a letsencrypt-prod ClusterIssuer handles the ACME dance.

Recommended

Scale on queue depth, not GPU utilisation. A vLLM pod can sit at 95 per cent GPU utilisation while still serving traffic with acceptable latency; what actually breaks the user experience is the queue growing faster than it drains. A KEDA scaler keyed off the vLLM-exported queue-depth metric is far more responsive than one keyed off GPU utilisation alone.

When you should reach for SGLang instead

There is one workload class where the operational simplicity argument does not win, and that is prefix-heavy serving. If your traffic looks like a long shared system prompt plus a small per-request delta — for example, an agent loop with elaborate scaffolding, a chatbot with a large knowledge-base preamble, or a code-completion service with a fixed style guide — SGLang's RadixAttention can deliver substantially better tail latency than vLLM's prefix caching. PremAI's measurements show SGLang at roughly 16,200 tokens per second on H100 against vLLM's 12,500 on smaller-model benchmarks; that gap widens when you isolate prefix-heavy workloads.

The trade-off is operator surface. SGLang has fewer turnkey integrations, smaller community documentation, and a smaller pool of off-the-shelf solutions to incidents. If your team has the engineering bandwidth to operate it, the throughput wins are real. If you are a small platform team in Mumbai or Manchester juggling six on-call rotations, vLLM is still the right answer.

Quick comparison at a glance

Engine	H100 throughput (tok/s, small models)	Prefix cache	Ops complexity	When to pick
vLLM 0.9	~12,500	Yes (prefix-caching flag)	Low	Default choice; broadest tooling
SGLang	~16,200	RadixAttention (best-in-class)	Medium	Prefix-heavy workloads; have ops bandwidth
TensorRT-LLM	Often highest in vendor benchmarks	Engine-dependent	High	Single model, locked architecture, throughput-critical

For a deeper head-to-head, see our earlier piece on picking an LLM serving engine in 2026.

Want to discuss this with other verified Builders?

Every article on AI Tech Connect is written by a Verified Builder. Browse profiles, shortlist who you want to hire or collaborate with.

Browse Builders →

The H100 capacity-planning cheat sheet for IN + UK teams

The economics of self-hosting versus calling an API have shifted noticeably in the last six months. On the Indian side, providers like Neysa and E2E Networks now list H100 SXM5 hours in the $2.20–$2.80 range; in the UK, Crusoe, Lambda, and a few specialist providers run from roughly £1.80 to £2.40 per H100-hour. Both regions are now well below the $4–$5 levels that made self-hosting a hard sell against API providers in 2024.

The break-even point depends on your token volume and how aggressively you can pack the GPU. The following table is a rough planning aid, not a benchmark — actual numbers depend heavily on workload shape, context lengths, and quantisation choices.

Model size	Tensor-parallel rank (single 8×H100 node)	KV cache footprint (per seq, 4k ctx)	Expected concurrent sequences
8B (FP16)	1	~1.0 GB	60–80
70B (FP16)	4	~2.5 GB	40–60
70B (FP8)	2	~2.5 GB	80–120
405B (FP8)	8	~6.0 GB	20–40

Two regional notes. In India, GPU availability is now generally good in Bengaluru, Hyderabad, Chennai, and Mumbai zones; the bigger constraint is the cost of network egress to non-Indian endpoints, which can dominate your bill if you are serving global users from an Indian node. In the UK, the constraint is more often power and data-centre slot availability than GPU silicon — providers who quote attractive hourly rates sometimes have multi-week onboarding queues.