The state of self-hosted LLM serving in May 2026

Twelve months ago, "serve a Llama" meant a Hugging Face Text Generation Inference container on a single GPU and a hope that traffic would not spike. In May 2026 it means something else entirely. Three serving engines now own the open-source conversation — vLLM, SGLang, TensorRT-LLM — and the decision between them shapes per-call cost, p95 latency, hardware portability and the size of your on-call rota.

Three forces are pulling self-hosted inference back in-house. Cost — even after the H100 price decline of Q1 2026, hosted-API token bills cross the rent-versus-own line at a few hundred million tokens a month (we worked the maths in AI inference costs in 2026). Data residency — RBI's outsourcing guidance, the DPDP Act, NHS Trust procurement and the draft UK Frontier AI Bill all push regulated workloads toward in-country inference; Yotta in Noida and Bengaluru, Tata Communications in Mumbai, and Nscale in Glasgow and Manchester have answered with sovereign-cloud GPU capacity that did not exist a year ago. And utilisation — our coverage of the enterprise GPU utilisation crisis showed median fleets running at five to seven percent.

The managed alternative is real — see DeepInfra's $107M Series B — but for teams with their own H100s, the question is no longer whether to self-host. It is which engine to standardise on.

The three at a glance

Before the benchmarks, the one-line summary of each contender.

Engine Best throughput Best latency Hardware Best for
vLLM ~12,500 tok/s (Llama-3 8B, H100) 120ms TTFT p50 @ 10 concurrent NVIDIA, AMD, TPU, Trainium, Gaudi Quickest path to production; multi-model fleets; non-NVIDIA hardware
SGLang ~16,200 tok/s (Llama-3 8B, H100) 112ms TTFT p50 @ 10 concurrent NVIDIA-optimised (AMD experimental) Multi-turn chat, agents, RAG with shared prefixes
TensorRT-LLM 15–30% higher than vLLM on H100 105ms TTFT p50; 1,280ms p95 @ 100 concurrent NVIDIA-only Single model in long-term production; latency-paramount
Pro tip

Score the engines against your traffic shape before reading any leaderboard. A 29% throughput advantage on benchmark prompts can vanish on a workload with no shared prefixes; a 105ms TTFT win is invisible if your real p95 is 1.4 seconds because of cold-start KV-cache misses. Replay a day of your own production logs through each engine for an honest comparison.

vLLM: the safe default, broadest hardware

vLLM is what most teams should run on day one. It earns the default status with the broadest hardware coverage in the field — NVIDIA H100, H200 and B200; AMD MI300X; Google TPU v5e and v5p; AWS Trainium and Inferentia; Intel Gaudi 2 and 3 — a model catalogue that tracks new releases within days, and a server that boots in a single command. PagedAttention, the original innovation that put vLLM on the map, remains a strong default for general-purpose serving; throughput on a Llama-3 8B sits around 12,500 tokens per second on an H100, with TTFT p50 of about 120ms at ten concurrent requests.

Where vLLM wins decisively is the cases the other two cannot serve. A UK broker piloting open-weight coding models on Trainium spot capacity cannot use TensorRT-LLM at all. A Mumbai platform team running ten fine-tunes for ten internal customers does not want ten TensorRT-LLM build pipelines.

Install and serve are deliberately boring:

# vLLM — pip install, then serve.
pip install vllm

# OpenAI-compatible server on port 8000
vllm serve meta-llama/Llama-3-8B-Instruct \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.92 \
  --port 8000

The same command works on H100, MI300X or a TPU v5p slice with only the driver layer changing. For most teams in IN or UK whose first deployment is a small-to-mid model on a handful of GPUs, this is where to start.

SGLang: throughput + multi-turn shine

SGLang earned its 29% throughput lead through one architectural decision: RadixAttention. Where vLLM's PagedAttention manages KV cache per request, RadixAttention indexes the cache as a radix tree across all requests. Two requests that share a prefix — a system prompt, a tool schema, a RAG context, the first ten turns of a chat — share the cached computation. The longer and more common the prefix, the bigger the win.

That makes SGLang the right answer for any workload where prefixes are structurally shared. Multi-turn chat behaviour is the obvious one: every follow-up turn reuses the conversation so far. Agent loops with stable tool catalogues and stable system prompts are the next. RAG pipelines that pin a retrieval window across a session are the third. On a Llama-3 8B on H100, SGLang reaches roughly 16,200 tokens per second against vLLM's 12,500 — a 29% advantage that compounds further as concurrency rises and shared-prefix hit rates climb.

The trade is hardware breadth. SGLang is NVIDIA-optimised first; AMD support exists but is experimental. The serve command is similar to vLLM's, but the runtime favours JSON-mode and tool-schema-heavy traffic patterns:

# SGLang — install + launch.
pip install "sglang[all]"

python -m sglang.launch_server \
  --model-path meta-llama/Llama-3-8B-Instruct \
  --port 30000 \
  --tp 1 \
  --mem-fraction-static 0.9 \
  --enable-radix-cache

For a Bengaluru agentic-search startup or a London legal-tech firm running a chat product, SGLang is usually the right second move once the workload mix is clear.

Builder says

"We moved a customer-support agent off vLLM to SGLang in March. Same H100, same model — throughput jumped 27% on our real traffic and our p95 went from 1.6s to 1.2s. The win was almost entirely RadixAttention reusing the 4k-token system prompt across sessions. Two days of work, zero infra changes." — A Verified Builder · Pune

TensorRT-LLM: lowest latency under load, NVIDIA-only

TensorRT-LLM is the only engine in the trio that compiles your model into an optimised plan before serving a single token. That ahead-of-time compilation is why it delivers 15 to 30 percent higher throughput than vLLM on H100s and the lowest p50 and p95 TTFT at every concurrency level reported in the public benchmarks. At ten concurrent requests, p50 TTFT sits at 105ms; at 100 concurrent, p95 is 1,280ms against vLLM's 1,450ms — a difference your users feel.

The bill comes due at build time. You compile a plan file per model, per GPU type, per tensor-parallel size, per quantisation choice. For a UK fintech serving a single Llama-3 70B variant on a fixed H100 cluster in Slough for twelve months, that is acceptable. For a GCC platform team rotating eight fine-tunes a week, it is a non-starter.

# TensorRT-LLM — convert, build, serve.
git clone https://github.com/NVIDIA/TensorRT-LLM
cd TensorRT-LLM && pip install -r requirements.txt

# 1. Convert HF checkpoint to TRT-LLM format
python examples/llama/convert_checkpoint.py \
  --model_dir /models/llama-3-8b-instruct \
  --output_dir /trt/llama3-8b/ckpt \
  --dtype float16

# 2. Build the engine plan
trtllm-build \
  --checkpoint_dir /trt/llama3-8b/ckpt \
  --output_dir /trt/llama3-8b/engine \
  --gemm_plugin float16 \
  --max_batch_size 64 \
  --max_input_len 4096 \
  --max_output_len 1024

# 3. Serve with the Triton TRT-LLM backend
tritonserver --model-repository=/trt/llama3-8b/triton

That pipeline takes thirty to sixty minutes per model. Worth it when you pin a model for a quarter; painful when you change models every Friday.

Benchmark walkthrough

Numbers in context. All figures below are Llama-3 8B on a single H100 SXM with INT8 KV cache, measured against a shared mix of system-prompt-heavy chat prompts (roughly 60% of the workload sharing a 2k-token system prefix).

Concurrency Engine Throughput (tok/s) TTFT p50 (ms) TTFT p95 (ms)
10 vLLM 12,500 120 210
SGLang 16,200 112 195
TensorRT-LLM 14,800 105 180
50 vLLM 21,400 340 720
SGLang 26,100 295 640
TensorRT-LLM 25,300 270 590
100 vLLM 27,800 820 1,450
SGLang 32,400 740 1,360
TensorRT-LLM 33,600 690 1,280

Two patterns jump out: SGLang's throughput lead is largest at low and mid concurrency, where the radix cache hit rate is highest; TensorRT-LLM closes the throughput gap at 100 concurrent and owns the latency column at every level — its compiled plan handles back-pressure more gracefully than either Python-based rival.

The shared-prefix trick — why SGLang dominates agent workloads

RadixAttention deserves its own section because it is the single architectural difference that explains most of SGLang's wins. The idea: store the KV cache for every token sequence the server has ever computed, indexed in a radix tree by prefix. When a new request arrives, walk the tree to find the longest matching prefix and reuse the cached KV; only compute the tokens beyond the match.

For an agent loop with a 4,000-token system prompt and tool catalogue, every turn after the first reuses 4,000 tokens of cached attention. For a RAG pipeline that pins a retrieved passage across follow-ups, the prefix hit can exceed 8,000 tokens. We covered the wider KV-cache-compression frontier in our TurboQuant deep-dive; RadixAttention is the orthogonal trick — instead of compressing the cache, you reuse more of it. The 29% headline understates the agent win: on extraction with no shared prefix, SGLang's lead compresses to single digits; on a multi-turn chatbot with a stable system prompt, 30 to 40 percent gains are routine.

Watch out

RadixAttention's wins assume your workload actually shares prefixes. If your system prompt rotates every request (some A/B test frameworks do this), or you use per-user dynamic instructions at the top of the prompt, the radix cache thrashes and SGLang collapses to vLLM-level throughput while paying the indexing overhead. Audit your prompt layout before promising the throughput numbers above.

Want to compare notes with other infra Builders?

Every article on AI Tech Connect is written by a Verified Builder or our editorial team. Browse profiles and shortlist who you want to hire or collaborate with.

Browse Builders →

Cost-of-ownership: GPU-hour math for each engine

A worked example. Target: 50 million tokens per day on a single H100. Sovereign-cloud H100 rates from Yotta and Nscale sit at roughly $2.40 per GPU-hour on a 12-month commit (Mumbai data centre and London co-location both land in this band in May 2026) — $1,728 per month per H100.

  • vLLM at 12,500 tok/s sustained: 50M ÷ 12,500 ÷ 86,400s = 0.46 H100-days of pure compute. With 50% headroom for peaks, ~1.0 H100 continuously. Monthly cost: $1,728.
  • SGLang at 16,200 tok/s sustained: 0.36 H100-days; ~0.75 H100, but the cheapest commit is still one. Monthly cost: $1,728 — same bill, more headroom and twice the agent-workload capacity.
  • TensorRT-LLM at 25,300 tok/s sustained: 0.23 H100-days; ~75% spare capacity on the same $1,728 H100 — two-to-three times further before adding a second GPU.

For a single-H100 deployment, all three engines cost the same in money. What you buy is headroom: vLLM gives operational simplicity, SGLang gives agent throughput, TensorRT-LLM gives the most spare capacity per dollar — paid for in build-pipeline labour.

Implementation decision tree

Five questions, in order, will put most teams in the right lane.

  • 1. Is your hardware NVIDIA-only for the next 12 months? If no, vLLM. The other two cannot serve your Trainium, Gaudi or TPU capacity.
  • 2. How many distinct models do you serve? More than three actively rotating? vLLM or SGLang. One pinned model? TensorRT-LLM becomes viable.
  • 3. What does your traffic shape look like? Bursty, agentic, multi-turn? SGLang. Steady, single-turn, latency-critical? TensorRT-LLM. Mixed bag? vLLM.
  • 4. What's your ops budget? A two-person platform team should not own a TensorRT-LLM build pipeline alongside everything else. Stay on vLLM or SGLang.
  • 5. Agent or chat or batch? Agent + chat with shared prefixes: SGLang. Batch and extraction: vLLM or TensorRT-LLM, whichever your ops team already runs.

Migration patterns

Two migrations are common enough to map.

vLLM → SGLang for agent workloads. Start with SGLang's OpenAI-compatible endpoint; your client code does not change. Move one workload at a time, watching p95 and prefix hit rate. Audit your prompt structure to make sure the prefix is genuinely shared. Most teams see the throughput lift inside a week.

vLLM → TensorRT-LLM for fixed production. Bigger lift. Add an engine-build CI job, a model-registry step that publishes engine plans alongside checkpoints, and Triton in front. Plan two to four engineer-weeks for the first migration, then a day or two per subsequent model. Skip this entirely if your model roster changes more than once a quarter.

Where each will move next

All three engines are converging on the same feature set, with different priorities. Continuous batching is now table stakes for everyone. Speculative decoding — using a small draft model to propose tokens that a large model verifies — is mature in TensorRT-LLM, shipping in vLLM, and in active development in SGLang; expect another two-to-three times throughput lift on suitable models by end of 2026.

Custom kernels are where SGLang and TensorRT-LLM will keep their NVIDIA edge. vLLM's portability story prevents the same depth of hardware-specific tuning, which is the engineering price of breadth. Look for vLLM to lean harder into model-server features — observability, autoscaling, multi-LoRA serving — that are orthogonal to raw kernel speed.

For Verified Builders in IN and UK shipping production inference in 2026, the practical advice is unchanged from the start of this piece: pick vLLM by default, move to SGLang when the workload shape justifies it, and earn your way into TensorRT-LLM only when a single model and a stable team make the build pipeline worth owning.

Primary sources for this piece: github.com/vllm-project/vllm, github.com/sgl-project/sglang, and github.com/NVIDIA/TensorRT-LLM.