Why vLLM is still the operational default in 2026
vLLM is not the throughput leader anymore. PremAI's measurements on H100 put SGLang at roughly 16,200 tokens per second against vLLM's 12,500 on the smaller-model benchmarks that get quoted in launch posts. TensorRT-LLM goes further still when you accept the engine-rebuild tax. So why does vLLM remain the default that most teams in Bengaluru and London actually run in production?
Because raw throughput is almost never the constraint that matters. What matters is the operator surface: how many models you can serve without a code change, how predictable scaling behaviour is, how good the documentation is when a junior engineer takes the pager at 2 a.m., and how easily the stack composes with the rest of your platform (ingress, autoscaler, observability, certificate rotation). On every one of those axes vLLM is in front, and that compounds into less downtime and faster iteration than a throughput gap of roughly 30 per cent can claw back.
The other reason is consolidation. Most of the open-source inference ecosystem — model adapters, quantisation toolkits, structured-output schemas, speculative decoding plumbing — lands in vLLM first. If you pick vLLM you inherit a year-long head start on tooling. If you pick SGLang you inherit a faster engine and a smaller pool of off-the-shelf solutions to operational problems.
Pick the engine that minimises your operator workload, not the one that wins the throughput benchmark on a blog post. The cost of an extra engineer-week is larger than the cost of a few extra GPU-hours per month, especially on rented H100s at $2–$3 per hour.
The four prerequisites that bite teams in week one
vLLM is forgiving on the happy path and unforgiving when one of the prerequisites is even slightly off. We see four mismatches over and over in incident reports — driver too old, CUDA toolchain wrong, container toolkit missing, Compose version too low. Pin the floor versions before you do anything else.
| Component | Minimum version | Why it matters |
|---|---|---|
| NVIDIA driver | 525 | FP8 paths and recent CUDA features fail silently on older drivers |
| CUDA toolkit | 12.1 | Required by vLLM 0.9 wheels and FlashAttention-3 kernels |
| NVIDIA Container Toolkit | 1.14 | Older versions break GPU passthrough on Compose V2 |
| Docker Engine + Compose V2 | 23.0 | Earlier Compose lacks device_requests GPU shorthand |
| Kubernetes | 1.27 | Required for the NVIDIA GPU Operator features vLLM relies on |
| KEDA | v2.x | ScaledObject CRD and Prometheus scaler stability |
| cert-manager | latest stable | With a letsencrypt-prod ClusterIssuer for ingress TLS |
Cloud-provider GPU images often ship with a driver pinned older than 525 for stability. Check nvidia-smi on first boot — if the driver is older than 525, request a newer image rather than try to upgrade in place. We have lost more than one weekend to driver upgrades that broke containerd.
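The first-boot driver check can be scripted. The sketch below is a minimal illustration, assuming `nvidia-smi --query-gpu=driver_version --format=csv,noheader` is available on the host; the helper names and the `MIN_DRIVER_MAJOR` constant are hypothetical and simply encode the 525 floor from the table above.

```python
import subprocess

MIN_DRIVER_MAJOR = 525  # floor taken from the prerequisites table (assumption: major version is enough)


def driver_major(version: str) -> int:
    """Extract the major component from a driver string such as '535.129.03'."""
    return int(version.strip().split(".")[0])


def driver_ok(version: str, floor: int = MIN_DRIVER_MAJOR) -> bool:
    """Return True when the driver's major version meets the floor."""
    return driver_major(version) >= floor


def installed_driver_versions() -> list[str]:
    """Ask nvidia-smi for the driver version; it prints one line per GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]
```

On a GPU host, calling `all(driver_ok(v) for v in installed_driver_versions())` gives a single pass/fail answer that a provisioning script can act on before anything else is installed.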
PagedAttention — the actual reason vLLM exists
The attention mechanism behind every transformer keeps a key-value cache for every token in every active sequence. In the naive implementation that cache lives in one contiguous chunk per sequence. The problem with contiguous allocations is fragmentation: when sequences have different lengths and finish at different times, the GPU memory map ends up looking like Swiss cheese, and you cannot start a new sequence even when the total free memory would otherwise allow it.
PagedAttention takes the same trick the operating system uses for virtual memory and applies it to the KV cache. The cache is split into fixed-size blocks — pages — and the runtime maintains a logical-to-physical mapping. A sequence's pages do not need to be contiguous in physical memory; they only need to be tracked. The result is that you can pack many more concurrent sequences into the same GPU memory than a naive implementation would allow, and the system degrades gracefully under load instead of OOMing the moment fragmentation crosses a threshold.
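The logical-to-physical mapping described above can be illustrated with a toy allocator. This is a sketch of the idea only, not vLLM's implementation: the class name, the block size, and the free-list policy are all assumptions made for the example.

```python
class PagedKVCache:
    """Toy block-table allocator: each sequence owns a list of physical block ids."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))     # physical blocks not in use
        self.block_tables: dict[str, list[int]] = {}   # sequence id -> physical block ids
        self.seq_lens: dict[str, int] = {}             # sequence id -> tokens stored

    def append_token(self, seq_id: str) -> bool:
        """Reserve room for one more token; False means no free block is left."""
        n = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.get(seq_id, [])
        if n == len(table) * self.block_size:  # last block is full, or none allocated yet
            if not self.free_blocks:
                return False
            table.append(self.free_blocks.pop(0))
            self.block_tables[seq_id] = table
        self.seq_lens[seq_id] = n + 1
        return True

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the pool for reuse."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=2)
for seq in ("a", "b", "c"):
    for _ in range(2):
        cache.append_token(seq)      # a -> block 0, b -> block 1, c -> block 2
cache.free("b")                      # block 1 goes back to the pool
for _ in range(4):
    cache.append_token("a")          # "a" grows into whatever blocks are free
print(cache.block_tables["a"])       # [0, 3, 1]
```

Sequence "a" ends up spread across blocks 0, 3, and 1: the hole left by "b" is reused immediately, which is exactly what a contiguous allocator could not do.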
The visible effect for a Builder is a higher steady-state batch size, more predictable tail latency, and a lower probability that a single long request will block shorter ones. None of that comes from raw kernel speed — it comes from better memory accounting. That accounting is what you are buying when you pick vLLM, and it is the bedrock that all of the tuning below sits on.
Tuning gpu-memory-utilization without OOMing
The single most important knob on a vLLM deployment is --gpu-memory-utilization. It sets the fraction of each GPU's memory that vLLM may claim for model weights, activations, and the KV cache. The default is 0.9, which is conservative enough to survive most workloads. It also leaves performance on the table, because every gigabyte withheld from the KV cache is concurrency you cannot use.
The right approach is a controlled walk-up. Start at 0.9 and run your real workload, not a synthetic benchmark, which rarely captures the burst patterns or the long-tail context lengths that cause OOMs in production. Watch nvidia-smi for memory headroom and the vLLM logs for cache-eviction events. Then raise the value to 0.92, repeat, and continue in small steps until you see the first OOM or sustained cache eviction; settle one step below that point. For most teams the sweet spot lands between 0.92 and 0.95.
Combine the gpu-memory-utilization walk-up with explicit --max-num-seqs and --max-model-len flags. These three knobs together let you trade KV cache footprint, concurrency ceiling, and context length explicitly — rather than letting vLLM guess from defaults. Document the chosen values in your repo's README so the next on-call engineer does not have to rediscover them.
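How the three knobs interact can be seen with some back-of-envelope arithmetic. The sketch below is a simplification, not vLLM's actual memory profiler: it assumes a standard attention layout (one key and one value vector per layer per KV head), and every input in the example, including the 4 GiB overhead allowance, is a hypothetical figure.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    """Bytes of KV cache one token occupies: a key and a value per layer per KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def full_length_seqs(
    gpu_mem_gib: float,
    gpu_memory_utilization: float,
    weights_gib: float,
    overhead_gib: float,
    bytes_per_token: int,
    max_model_len: int,
) -> int:
    """Worst-case number of sequences that fit if every one reaches max_model_len."""
    budget_bytes = (gpu_mem_gib * gpu_memory_utilization - weights_gib - overhead_gib) * GIB
    return max(0, int(budget_bytes // (bytes_per_token * max_model_len)))


# Hypothetical model: 32 layers, 8 KV heads, head_dim 128, 2-byte cache entries.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128, dtype_bytes=2)
print(per_token)  # 131072 bytes, i.e. 128 KiB per token

# Hypothetical single 80 GiB GPU with 16 GiB of weights and a 16k context ceiling.
print(full_length_seqs(80, 0.92, 16, 4, per_token, 16384))  # 26
```

In practice most sequences finish well short of --max-model-len, so a --max-num-seqs well above this worst-case figure is usually safe; the calculation is mainly useful for spotting combinations that cannot work at all.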
Tensor parallel + FP8 on a single H100 node
An 8×H100 SXM5 node has 640 GB of HBM3, which is enough to host a 70B-class model at FP16, or a 405B-class model at FP8, with reasonable KV cache headroom. The two knobs you want are --tensor-parallel-size (how many GPUs the model is sharded across) and the quantisation mode (FP16 by default, FP8 if you accept a marginal quality hit for a roughly 2× memory saving).
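The sizing claim above is simple arithmetic: weight footprint is parameter count times bytes per parameter. A minimal sketch, assuming 80 GB per H100 and counting weights only (KV cache and activations come out of the remaining headroom); the function names are illustrative.

```python
NODE_HBM_GB = 8 * 80  # eight H100s at 80 GB each = 640 GB

BYTES_PER_PARAM = {"fp16": 2, "fp8": 1}


def weights_gb(params_billion: float, dtype: str) -> float:
    """Approximate weight footprint in GB: billions of parameters x bytes per parameter."""
    return params_billion * BYTES_PER_PARAM[dtype]


def headroom_gb(params_billion: float, dtype: str, node_gb: float = NODE_HBM_GB) -> float:
    """Memory left on the node for KV cache and activations; negative means it does not fit."""
    return node_gb - weights_gb(params_billion, dtype)


for size, dtype in [(70, "fp16"), (405, "fp8"), (405, "fp16")]:
    print(f"{size}B {dtype}: weights {weights_gb(size, dtype)} GB, headroom {headroom_gb(size, dtype)} GB")
# 70B fp16: weights 140 GB, headroom 500 GB
# 405B fp8: weights 405 GB, headroom 235 GB
# 405B fp16: weights 810 GB, headroom -170 GB
```

The last line shows why FP8 is what makes a 405B-class model viable on a single node: at FP16 the weights alone exceed the node's 640 GB.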
A typical serve command for a 70B model on an 8×H100 node with FP8 quantisation looks like this:
```bash
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 8 \
  --quantization fp8 \
  --gpu-memory-utilization 0.92 \
  --max-num-seqs 256 \
  --max-model-len 16384 \
  --enable-prefix-caching \
  --host 0.0.0.0 \
  --port 8000
```
A few notes on the flags. --tensor-parallel-size 8 spreads the weights across all eight GPUs on the node; smaller models do not benefit from full TP and pay a communication overhead instead. --quantization fp8 gets hardware-accelerated FP8 compute only on Ada Lovelace, Hopper (H100, H200), and newer. On Ampere, vLLM falls back to weight-only FP8 through its Marlin kernels, so you keep the memory saving but get no FP8 compute speed-up. --enable-prefix-caching is free latency for any workload with shared system prompts, agentic scaffolding, or repeated few-shot examples; turn it on unless you have a specific reason not to.
FP8 is not a free lunch on every model. Some fine-tuned variants degrade noticeably on reasoning benchmarks. Always re-run your eval suite against the quantised model before promoting it to production. We have seen 3–5 per cent regressions on chain-of-thought tasks that did not surface on perplexity checks.
Docker Compose stack — vLLM + nginx + persistent model cache
Most production deployments do not run vLLM bare. They wrap it with a reverse proxy for rate limiting, header rewriting, and TLS termination; a persistent volume for the HuggingFace model cache so a container restart does not re-download 140 GB of weights; and a health-check endpoint that the orchestrator can probe. Compose makes the wiring explicit and repeatable.
```yaml
services:
  vllm:
    image: vllm/vllm-openai:v0.9.0
    container_name: vllm
    runtime: nvidia
    # Tensor-parallel workers exchange data through host shared memory
    ipc: host
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    volumes:
      # Persist the HuggingFace cache so restarts do not re-download weights
      - ./models:/root/.cache/huggingface
    ports:
      - "8000:8000"
    command: >
      --model meta-llama/Llama-3.1-70B-Instruct
      --tensor-parallel-size 8
      --quantization fp8
      --gpu-memory-utilization 0.92
      --enable-prefix-caching
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
  nginx:
    image: nginx:1.27
    depends_on:
      - vllm
    ports:
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
      - ./certs:/etc/nginx/certs:ro
```
The nginx service needs a real nginx.conf in the working directory and TLS certificates in ./certs — whether issued by your internal PKI, an ACME client like acme.sh, or simply self-signed for staging. The ./models volume keeps weights between restarts, which on rented infrastructure where boot disks are slow can save 20 minutes per cold start.
Kubernetes path — KEDA autoscaling + cert-manager TLS
Compose works fine for a single node. The moment you need elasticity, multiple model variants, or rolling updates without dropping requests, you graduate to Kubernetes. Three open-source projects do most of the lifting: the NVIDIA GPU Operator handles driver and device-plugin installation, KEDA handles event-driven autoscaling, and cert-manager handles TLS renewal.
The shape of the manifests is conventional: a Deployment with one vLLM container per pod, a Service in front of it, an Ingress for external access, and a ScaledObject from KEDA that watches a Prometheus metric — typically GPU utilisation, queue depth, or time-to-first-token — and scales the deployment between a minimum and maximum replica count. cert-manager provisions and rotates the certificate behind the Ingress; a letsencrypt-prod ClusterIssuer handles the ACME dance.
Scale on queue depth, not GPU utilisation. A vLLM pod can sit at 95 per cent GPU utilisation while still serving traffic with acceptable latency; what actually breaks the user experience is the queue growing faster than it drains. A KEDA scaler keyed off the vLLM-exported queue-depth metric is far more responsive than one keyed off GPU utilisation alone.
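The arithmetic behind queue-depth scaling is worth seeing once. The sketch below is a simplified model of an average-value target, which is how a KEDA Prometheus scaler with a threshold is commonly configured; it ignores the autoscaler's tolerance band and stabilisation windows. The metric is assumed to be the summed waiting-request gauge that vLLM exports (`vllm:num_requests_waiting`), and the target of 10 waiting requests per replica is a hypothetical value.

```python
import math


def desired_replicas(
    total_waiting: float,
    target_per_replica: float,
    min_replicas: int,
    max_replicas: int,
) -> int:
    """Replicas needed so each one holds at most target_per_replica queued requests."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    raw = math.ceil(total_waiting / target_per_replica)
    # Clamp to the bounds configured on the scaler.
    return max(min_replicas, min(max_replicas, raw))


print(desired_replicas(45, 10, min_replicas=1, max_replicas=8))   # 5
print(desired_replicas(0, 10, min_replicas=1, max_replicas=8))    # 1
print(desired_replicas(500, 10, min_replicas=1, max_replicas=8))  # 8
```

Because the input is the queue itself, the replica count moves as soon as requests start to back up, whereas a GPU-utilisation signal can sit near its ceiling whether or not anyone is waiting.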
When you should reach for SGLang instead
There is one workload class where the operational simplicity argument does not win, and that is prefix-heavy serving. If your traffic looks like a long shared system prompt plus a small per-request delta — for example, an agent loop with elaborate scaffolding, a chatbot with a large knowledge-base preamble, or a code-completion service with a fixed style guide — SGLang's RadixAttention can deliver substantially better tail latency than vLLM's prefix caching. PremAI's measurements show SGLang at roughly 16,200 tokens per second on H100 against vLLM's 12,500 on smaller-model benchmarks; that gap widens when you isolate prefix-heavy workloads.
The trade-off is operator surface. SGLang has fewer turnkey integrations, smaller community documentation, and a smaller pool of off-the-shelf solutions to incidents. If your team has the engineering bandwidth to operate it, the throughput wins are real. If you are a small platform team in Mumbai or Manchester juggling six on-call rotations, vLLM is still the right answer.
Quick comparison at a glance
| Engine | H100 throughput (tok/s, small models) | Prefix cache | Ops complexity | When to pick |
|---|---|---|---|---|
| vLLM 0.9 | ~12,500 | Yes (prefix-caching flag) | Low | Default choice; broadest tooling |
| SGLang | ~16,200 | RadixAttention (best-in-class) | Medium | Prefix-heavy workloads; have ops bandwidth |
| TensorRT-LLM | Often highest in vendor benchmarks | Engine-dependent | High | Single model, locked architecture, throughput-critical |
For a deeper head-to-head, see our earlier piece on picking an LLM serving engine in 2026.
The H100 capacity-planning cheat sheet for IN + UK teams
The economics of self-hosting versus calling an API have shifted noticeably in the last six months. On the Indian side, providers like Neysa and E2E Networks now list H100 SXM5 hours in the $2.20–$2.80 range; in the UK, Crusoe, Lambda, and a few specialist providers run from roughly £1.80 to £2.40 per H100-hour. Both regions are now well below the $4–$5 levels that made self-hosting a hard sell against API providers in 2024.
The break-even point depends on your token volume and how aggressively you can pack the GPU. The following table is a rough planning aid, not a benchmark — actual numbers depend heavily on workload shape, context lengths, and quantisation choices.
| Model size | Tensor-parallel size (GPUs used on one 8×H100 node) | KV cache footprint (GB per sequence at 4k context) | Expected concurrent sequences |
|---|---|---|---|
| 8B (FP16) | 1 | ~1.0 GB | 60–80 |
| 70B (FP16) | 4 | ~2.5 GB | 40–60 |
| 70B (FP8) | 2 | ~2.5 GB | 80–120 |
| 405B (FP8) | 8 | ~6.0 GB | 20–40 |
Two regional notes. In India, GPU availability is now generally good in Bengaluru, Hyderabad, Chennai, and Mumbai zones; the bigger constraint is the cost of network egress to non-Indian endpoints, which can dominate your bill if you are serving global users from an Indian node. In the UK, the constraint is more often power and data-centre slot availability than GPU silicon — providers who quote attractive hourly rates sometimes have multi-week onboarding queues.
Before committing to a self-hosted deployment, run the workload for a fortnight on an API provider and capture the token volume distribution. Then model self-hosted economics: convert your median throughput into tokens per H100-hour (tokens per second × 3,600, discounted by the share of each hour the GPU will actually be busy) and divide the hourly rate by that figure. If your blended cost per million tokens lands below half the API price, self-hosting on vLLM is justified. If it does not, stay on the API and revisit in six months.
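The break-even rule above reduces to a few lines of arithmetic. This is a planning sketch under stated assumptions, not a pricing model: the $2.50 hourly rate sits inside the range quoted earlier, 12,500 tokens per second is the small-model vLLM figure cited in this article, and the 30 per cent utilisation and $0.60 API price are hypothetical placeholders for your own measurements.

```python
def cost_per_million_tokens(
    hourly_rate: float,
    tokens_per_second: float,
    utilisation: float = 1.0,
    gpus: int = 1,
) -> float:
    """Self-hosted cost per million tokens for a deployment billed by the GPU-hour."""
    tokens_per_hour = tokens_per_second * 3600 * utilisation
    return hourly_rate * gpus / tokens_per_hour * 1_000_000


def self_hosting_justified(self_hosted_cost: float, api_price: float) -> bool:
    """Apply the rule of thumb: self-host only below half the API price."""
    return self_hosted_cost < 0.5 * api_price


cost = cost_per_million_tokens(hourly_rate=2.50, tokens_per_second=12_500, utilisation=0.3)
print(round(cost, 3))                                   # 0.185
print(self_hosting_justified(cost, api_price=0.60))     # True
```

Utilisation is the input that moves the answer most: halving it doubles the cost per million tokens, which is why the fortnight of real traffic data matters more than the benchmark throughput figure.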
For more context on the wider GPU pricing environment, see our pieces on H100 at $2/hr and B200 versus H100 inference economics. Primary sources for this article: vLLM documentation, Spheron's 2026 production deployment guide, SitePoint's 2026 vLLM walkthrough, inference.net on Docker deployments, and MorphLLM's benchmark notes.