Why vLLM is still the operational default in 2026

vLLM is not the throughput leader anymore. PremAI's measurements on H100 put SGLang at roughly 16,200 tokens per second against vLLM's 12,500 on the smaller-model benchmarks that get quoted in launch posts. TensorRT-LLM goes further still when you accept the engine-rebuild tax. So why does vLLM remain the default that most teams in Bengaluru and London actually run in production?

Because raw throughput is almost never the constraint that matters. What matters is the operator surface: how many models you can serve without a code change, how predictable scaling behaviour is, how good the documentation is when a junior engineer takes the pager at 2 a.m., and how easily the stack composes with the rest of your platform — ingress, autoscaler, observability, certificate rotation. On every one of those axes vLLM is in front, and that compounds into less downtime and faster iteration than a 25 per cent throughput gap can claw back.

The other reason is consolidation. Most of the open-source inference ecosystem — model adapters, quantisation toolkits, structured-output schemas, speculative decoding plumbing — lands in vLLM first. If you pick vLLM you inherit a year-long head start on tooling. If you pick SGLang you inherit a faster engine and a smaller pool of off-the-shelf solutions to operational problems.

Pro tip

Pick the engine that minimises your operator workload, not the one that wins the throughput benchmark on a blog post. The cost of an extra engineer-week is larger than the cost of a few extra GPU-hours per month, especially on rented H100s at $2–$3 per hour.

The four prerequisites that bite teams in week one

vLLM is forgiving on the happy path and unforgiving when one of the prerequisites is even slightly off. We see four mismatches over and over in incident reports — driver too old, CUDA toolchain wrong, container toolkit missing, Compose version too low. Pin the floor versions before you do anything else.

Component Minimum version Why it matters
NVIDIA driver 525 FP8 paths and recent CUDA features fail silently on older drivers
CUDA toolkit 12.1 Required by vLLM 0.9 wheels and FlashAttention-3 kernels
NVIDIA Container Toolkit 1.14 Older versions break GPU passthrough on Compose V2
Docker Engine + Compose V2 23.0 Earlier Compose lacks device_requests GPU shorthand
Kubernetes 1.27 Required for the NVIDIA GPU Operator features vLLM relies on
KEDA v2.x ScaledObject CRD and Prometheus scaler stability
cert-manager latest stable With a letsencrypt-prod ClusterIssuer for ingress TLS
Watch out

Cloud-provider GPU images often ship with a driver pinned older than 525 for stability. Check nvidia-smi on first boot — if the driver is older than 525, request a newer image rather than try to upgrade in place. We have lost more than one weekend to driver upgrades that broke containerd.

PagedAttention — the actual reason vLLM exists

The attention mechanism behind every transformer keeps a key-value cache for every token in every active sequence. In the naive implementation that cache lives in one contiguous chunk per sequence. The problem with contiguous allocations is fragmentation: when sequences have different lengths and finish at different times, the GPU memory map ends up looking like a Swiss cheese, and you cannot start a new sequence even when the total free memory would otherwise allow it.

PagedAttention takes the same trick the operating system uses for virtual memory and applies it to the KV cache. The cache is split into fixed-size blocks — pages — and the runtime maintains a logical-to-physical mapping. A sequence's pages do not need to be contiguous in physical memory; they only need to be tracked. The result is that you can pack many more concurrent sequences into the same GPU memory than a naive implementation would allow, and the system degrades gracefully under load instead of OOMing the moment fragmentation crosses a threshold.

The visible effect for a Builder is a higher steady-state batch size, more predictable tail latency, and a lower probability that a single long request will block shorter ones. None of that comes from raw kernel speed — it comes from better memory accounting. That accounting is what you are buying when you pick vLLM, and it is the bedrock that all of the tuning below sits on.

Tuning gpu-memory-utilization without OOMing

The single most important knob on a vLLM deployment is --gpu-memory-utilization. It controls the fraction of GPU memory vLLM is allowed to claim for the KV cache and the model weights. The default is 0.9, which is conservative enough to survive most workloads. It is also the value that leaves the most performance on the table.

The right approach is a controlled walk-up. Start at 0.9 and run your real workload, not a synthetic benchmark — the latter rarely captures the burst patterns or the long-tail context lengths that cause OOMs in production. Watch nvidia-smi for memory headroom and the vLLM logs for cache-eviction events. Then bump the value to 0.92, repeat, and continue until you are within one OOM event of the ceiling. For most teams the sweet spot lands between 0.92 and 0.95.

Recommended

Combine the gpu-memory-utilization walk-up with explicit --max-num-seqs and --max-model-len flags. These three knobs together let you trade KV cache footprint, concurrency ceiling, and context length explicitly — rather than letting vLLM guess from defaults. Document the chosen values in your repo's README so the next on-call engineer does not have to rediscover them.

Tensor parallel + FP8 on a single H100 node

An 8×H100 SXM5 node has 640 GB of HBM3, which is enough to host a 70B-class model at FP16, or a 405B-class model at FP8, with reasonable KV cache headroom. The two knobs you want are --tensor-parallel-size (how many GPUs the model is sharded across) and the quantisation mode (FP16 by default, FP8 if you accept a marginal quality hit for a roughly 2× memory saving).

A typical serve command for a 70B model on an 8×H100 node with FP8 quantisation looks like this:

vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 8 \
  --quantization fp8 \
  --gpu-memory-utilization 0.92 \
  --max-num-seqs 256 \
  --max-model-len 16384 \
  --enable-prefix-caching \
  --host 0.0.0.0 \
  --port 8000

A few notes on the flags. --tensor-parallel-size 8 spreads the weights across all eight GPUs on the node; smaller models do not benefit from full TP and pay a communication overhead instead. --quantization fp8 only works on Hopper (H100, H200) and newer — it is silently rejected on Ampere. --enable-prefix-caching is free latency for any workload with shared system prompts, agentic scaffolding, or repeated few-shot examples; turn it on unless you have a specific reason not to.

Watch out

FP8 is not a free lunch on every model. Some fine-tuned variants degrade noticeably on reasoning benchmarks. Always re-run your eval suite against the quantised model before promoting it to production. We have seen 3–5 per cent regressions on chain-of-thought tasks that did not surface on perplexity checks.

Docker Compose stack — vLLM + nginx + persistent model cache

Most production deployments do not run vLLM bare. They wrap it with a reverse proxy for rate limiting, header rewriting, and TLS termination; a persistent volume for the HuggingFace model cache so a pod restart does not re-download 140 GB of weights; and a health-check endpoint that the orchestrator can probe. Compose makes the wiring explicit and repeatable.

services:
  vllm:
    image: vllm/vllm-openai:v0.9.0
    container_name: vllm
    runtime: nvidia
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    volumes:
      - ./models:/root/.cache/huggingface
    ports:
      - "8000:8000"
    command: >
      --model meta-llama/Llama-3.1-70B-Instruct
      --tensor-parallel-size 8
      --quantization fp8
      --gpu-memory-utilization 0.92
      --enable-prefix-caching
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  nginx:
    image: nginx:1.27
    depends_on:
      - vllm
    ports:
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
      - ./certs:/etc/nginx/certs:ro

The nginx service needs a real nginx.conf in the working directory and TLS certificates in ./certs — whether issued by your internal PKI, an ACME client like acme.sh, or simply self-signed for staging. The ./models volume keeps weights between restarts, which on rented infrastructure where boot disks are slow can save 20 minutes per cold start.

Kubernetes path — KEDA autoscaling + cert-manager TLS

Compose works fine for a single node. The moment you need elasticity, multiple model variants, or rolling updates without dropping requests, you graduate to Kubernetes. Three CNCF projects do most of the lifting: the NVIDIA GPU Operator handles driver and device-plugin installation, KEDA handles event-driven autoscaling, and cert-manager handles TLS renewal.

The shape of the manifests is conventional: a Deployment with one vLLM container per pod, a Service in front of it, an Ingress for external access, and a ScaledObject from KEDA that watches a Prometheus metric — typically GPU utilisation, queue depth, or time-to-first-token — and scales the deployment between a minimum and maximum replica count. cert-manager provisions and rotates the certificate behind the Ingress; a letsencrypt-prod ClusterIssuer handles the ACME dance.

Recommended

Scale on queue depth, not GPU utilisation. A vLLM pod can sit at 95 per cent GPU utilisation while still serving traffic with acceptable latency; what actually breaks the user experience is the queue growing faster than it drains. A KEDA scaler keyed off the vLLM-exported queue-depth metric is far more responsive than one keyed off GPU utilisation alone.

When you should reach for SGLang instead

There is one workload class where the operational simplicity argument does not win, and that is prefix-heavy serving. If your traffic looks like a long shared system prompt plus a small per-request delta — for example, an agent loop with elaborate scaffolding, a chatbot with a large knowledge-base preamble, or a code-completion service with a fixed style guide — SGLang's RadixAttention can deliver substantially better tail latency than vLLM's prefix caching. PremAI's measurements show SGLang at roughly 16,200 tokens per second on H100 against vLLM's 12,500 on smaller-model benchmarks; that gap widens when you isolate prefix-heavy workloads.

The trade-off is operator surface. SGLang has fewer turnkey integrations, smaller community documentation, and a smaller pool of off-the-shelf solutions to incidents. If your team has the engineering bandwidth to operate it, the throughput wins are real. If you are a small platform team in Mumbai or Manchester juggling six on-call rotations, vLLM is still the right answer.

Quick comparison at a glance

Engine H100 throughput (tok/s, small models) Prefix cache Ops complexity When to pick
vLLM 0.9 ~12,500 Yes (prefix-caching flag) Low Default choice; broadest tooling
SGLang ~16,200 RadixAttention (best-in-class) Medium Prefix-heavy workloads; have ops bandwidth
TensorRT-LLM Often highest in vendor benchmarks Engine-dependent High Single model, locked architecture, throughput-critical

For a deeper head-to-head, see our earlier piece on picking an LLM serving engine in 2026.

Want to discuss this with other verified Builders?

Every article on AI Tech Connect is written by a Verified Builder. Browse profiles, shortlist who you want to hire or collaborate with.

Browse Builders →

The H100 capacity-planning cheat sheet for IN + UK teams

The economics of self-hosting versus calling an API have shifted noticeably in the last six months. On the Indian side, providers like Neysa and E2E Networks now list H100 SXM5 hours in the $2.20–$2.80 range; in the UK, Crusoe, Lambda, and a few specialist providers run from roughly £1.80 to £2.40 per H100-hour. Both regions are now well below the $4–$5 levels that made self-hosting a hard sell against API providers in 2024.

The break-even point depends on your token volume and how aggressively you can pack the GPU. The following table is a rough planning aid, not a benchmark — actual numbers depend heavily on workload shape, context lengths, and quantisation choices.

Model size Tensor-parallel rank (single 8×H100 node) KV cache footprint (per seq, 4k ctx) Expected concurrent sequences
8B (FP16) 1 ~1.0 GB 60–80
70B (FP16) 4 ~2.5 GB 40–60
70B (FP8) 2 ~2.5 GB 80–120
405B (FP8) 8 ~6.0 GB 20–40

Two regional notes. In India, GPU availability is now generally good in Bengaluru, Hyderabad, Chennai, and Mumbai zones; the bigger constraint is the cost of network egress to non-Indian endpoints, which can dominate your bill if you are serving global users from an Indian node. In the UK, the constraint is more often power and data-centre slot availability than GPU silicon — providers who quote attractive hourly rates sometimes have multi-week onboarding queues.

Pro tip

Before committing to a self-hosted deployment, run the workload for a fortnight on an API provider and capture the token volume distribution. Multiply the median throughput by 8 hours per H100-hour to model self-hosted economics — if your blended cost per million tokens lands below half the API price, self-hosting on vLLM is justified. If it does not, stay on the API and revisit in six months.

For more context on the wider GPU pricing environment, see our pieces on H100 at $2/hr and B200 versus H100 inference economics. Primary sources for this article: vLLM documentation, Spheron's 2026 production deployment guide, SitePoint's 2026 vLLM walkthrough, inference.net on Docker deployments, and MorphLLM's benchmark notes.