What you need to know
- Self-hosting is a utilisation bet, not a price comparison. A rented GPU costs the same whether it serves one request a minute or two hundred. Self-hosting only beats an API when sustained load keeps that GPU busy — or when data residency leaves you no choice.
- vLLM is the sensible default engine. Its PagedAttention and continuous batching deliver an order-of-magnitude throughput gain over naive serving, it exposes an OpenAI-compatible API, and it goes from install to serving in roughly three commands.
- Size VRAM from first principles. Weights are parameters multiplied by bytes per parameter; the KV cache is the variable that bites under concurrency. Plan for both, and never fill a card past about 85 to 90 percent.
- FP8 is the default quantisation; INT4 is a fitting tool. FP8 on H100-class cards is near-lossless and roughly doubles throughput. INT4 cuts weight VRAM by about three-quarters against FP16 but degrades precision-sensitive tasks such as code generation and mathematics.
- The real cost is people. Beyond GPU rental, budget 10 to 20 engineering hours a month — about 750 to 3,000 US dollars in senior-engineer labour — for maintenance and on-call.
- Data residency is the strongest single reason. DPDP in India and UK GDPR can both make self-hosting the only compliant option for some workloads, regardless of cost.
Before you provision a single GPU, write down your peak and median requests per second, your tolerable p95 latency, and your data-residency constraints. Those three numbers — not the headline GPU-hour rate — decide whether self-hosting is the right call. Most teams that regret self-hosting skipped this step and discovered too late that their GPU sat idle 80 percent of the day.
Why self-hosting is back on the table in 2026
For a long time the calculus was simple: closed frontier models were so far ahead that running anything yourself meant accepting a large quality gap. That gap has narrowed to the point where, for many production tasks, it no longer exists. The leading open-weight families you would realistically self-host as of mid-2026 — DeepSeek V4 and the lighter V4 Flash, Qwen 3.5 under Apache 2.0, Gemma 4 under Apache 2.0, Llama 4, and MiniMax M3 — are genuinely capable, and several carry permissive licences that let you ship commercially without a bespoke agreement.
At the same time, the serving software matured. vLLM turned high-throughput inference from a research project into a one-line install, and GPU rental prices have fallen steadily across both the global clouds and regional providers. The combination means the question is no longer "can we self-host a good model?" but "should we?" — and that is a business and compliance question as much as a technical one.
For Indian and UK teams there is an extra dimension. The Digital Personal Data Protection Act in India and UK GDPR both create scenarios where sending user data to an external API in another jurisdiction is awkward, expensive to indemnify, or simply not permitted under a customer contract. A model you run inside your own AWS Mumbai (ap-south-1) or London (eu-west-2) account, or on regional GPU capacity, sidesteps that entirely. We come back to residency in detail below, because for some teams it is the deciding factor.
Choosing an inference engine
The engine is the software layer that loads the weights, schedules requests, manages the KV cache and serves the API. Pick it before you pick hardware, because the engine determines how efficiently that hardware is used. There is no single winner; there is a right tool per workload shape.
| Engine | Best for | Hardware coverage | Trade-off |
|---|---|---|---|
| vLLM | General serving and batch throughput; the safe default | Broad — NVIDIA, AMD, and more | Tuning helps, but defaults are strong out of the box |
| SGLang | Prefix-heavy RAG and multi-turn chat (RadixAttention prefix cache) | Mainly NVIDIA | Newer; smaller community than vLLM |
| Ollama | Desktop prototyping and local development | CPU, Apple Silicon, consumer GPUs | Not built for production concurrency; throughput lags vLLM heavily |
| TensorRT-LLM | Last-increment throughput when all-in on NVIDIA | NVIDIA only | Heavier build, engine compilation, steeper tuning curve |
The practical rule: start with vLLM. It covers the widest range of hardware, which matters when you are sourcing GPUs across AWS, a regional provider in India such as the IndiaAI compute pool, or a UK cloud region, and you cannot guarantee you will only ever land on NVIDIA. If your workload is dominated by long, shared prefixes — a RAG system answering many questions against the same retrieved context, or a chat product with a large constant system prompt — benchmark SGLang against it, because its prefix cache can pull ahead meaningfully on exactly that pattern. Keep Ollama on developer laptops. Reserve TensorRT-LLM for the case where you have already committed to NVIDIA hardware, have measured a throughput ceiling on vLLM, and have the platform-engineering time to maintain compiled engines.
vLLM in three commands
vLLM earns its default status partly on capability and partly on how little ceremony it takes to stand up. It builds on two ideas. PagedAttention borrows the operating-system notion of paged virtual memory and applies it to the KV cache, allocating it in small fixed-size blocks rather than one contiguous slab. That eliminates the fragmentation that otherwise wastes VRAM and lets you pack far more concurrent sequences onto a card. Continuous batching then lets new requests join an in-flight batch the moment a slot frees, instead of waiting for the whole batch to finish, which keeps the GPU saturated. Together they are why vLLM reports throughput an order of magnitude or more above naive Hugging Face Transformers serving on the same hardware.
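The memory argument behind PagedAttention can be illustrated with a toy calculation. This is a simplified sketch, not vLLM's allocator: the sequence lengths, the 4,096-token reservation and the 16-token block size are all assumptions chosen for illustration.

```python
import math

def contiguous_slots(seq_lens, max_model_len):
    """Naive scheme: reserve max_model_len token slots per sequence up front."""
    return len(seq_lens) * max_model_len

def paged_slots(seq_lens, block_size=16):
    """Paged scheme: hand out fixed-size blocks only as tokens actually arrive."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)

# Hypothetical in-flight sequence lengths, in tokens.
seq_lens = [120, 900, 35, 2048]
used = sum(seq_lens)
naive = contiguous_slots(seq_lens, max_model_len=4096)
paged = paged_slots(seq_lens)

print(f"tokens in use:          {used}")
print(f"contiguous reservation: {naive} slots ({used / naive:.0%} used)")
print(f"paged allocation:       {paged} slots ({used / paged:.0%} used)")
```

With paging, the waste per sequence is bounded by one partially filled block, which is why far more sequences fit in the same VRAM.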
Installing and serving a model looks like this:
```bash
# 1. Install
pip install vllm

# 2. Serve an open-weight model with an OpenAI-compatible API
vllm serve Qwen/Qwen3.5-32B \
  --quantization fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90

# 3. (optional) for a 70B-class model across two cards
vllm serve meta-llama/Llama-4-70B \
  --tensor-parallel-size 2 \
  --quantization fp8
```
The server now speaks the OpenAI Chat Completions protocol, which means your existing client code barely changes — you point the base URL at your own host and supply any non-empty key:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://your-host:8000/v1",
    api_key="not-needed-but-required",  # any non-empty string
)

resp = client.chat.completions.create(
    model="Qwen/Qwen3.5-32B",
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarise the DPDP residency rule in two lines."},
    ],
    max_tokens=256,
)

print(resp.choices[0].message.content)
```
That OpenAI compatibility is strategically important: it means self-hosting is reversible. You can prototype against a managed API, switch the base URL to your own vLLM server when you are ready, and switch back if the economics change — without rewriting your application. Treat the API contract, not the provider, as the stable interface. The full server reference and supported flags live at docs.vllm.ai.
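One way to keep that switch cheap is to read the endpoint from configuration rather than hard-coding it. The sketch below is an assumption about how you might wire it: the variable names `LLM_BASE_URL` and `LLM_API_KEY` are hypothetical, and the function only builds keyword arguments, so it runs without any client library installed.

```python
def client_settings(env):
    """Build keyword arguments for an OpenAI-compatible client.

    LLM_BASE_URL and LLM_API_KEY are hypothetical variable names.
    When LLM_BASE_URL is unset, no base_url is passed, so the client
    library falls back to its own default (the managed provider).
    """
    settings = {"api_key": env.get("LLM_API_KEY", "placeholder-key")}
    base_url = env.get("LLM_BASE_URL")
    if base_url:
        settings["base_url"] = base_url
    return settings

# Self-hosted: point at your own vLLM server.
self_hosted = client_settings({"LLM_BASE_URL": "http://your-host:8000/v1"})
# Managed: only a key is configured.
managed = client_settings({"LLM_API_KEY": "sk-example"})

print(self_hosted)
print(managed)
```

In application code you would pass the result straight through, for example `OpenAI(**client_settings(os.environ))`, so moving between providers becomes a deployment setting rather than a code change.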
Sizing VRAM: the maths that decides your bill
Almost every self-hosting mistake traces back to getting VRAM wrong: either over-provisioning and burning money, or under-provisioning and watching the server run out of memory under load. The calculation has two parts.
Part one: model weights
Weight memory is simply parameters × bytes per parameter. The bytes depend on precision: FP16 is 2 bytes, FP8 is 1 byte, and INT4 is 0.5 bytes. So a 70-billion-parameter model needs roughly 140GB at FP16, about 70GB at FP8, and about 35GB at INT4. That single line explains why FP8 is the pivotal precision for the current generation: 70GB just squeezes onto a single H100 80GB card, where FP16 would force you onto two.
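The weight arithmetic is simple enough to script. This small sketch uses decimal gigabytes and counts weights only, ignoring any per-layer overhead a real checkpoint may carry; the function name is illustrative.

```python
# Bytes per parameter at each precision, as given in the text.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}

def weight_gb(params_billion, precision):
    """Approximate weight memory in GB: parameters x bytes per parameter."""
    return params_billion * BYTES_PER_PARAM[precision]

for size in (7, 32, 70):
    row = ", ".join(f"{p}: ~{weight_gb(size, p):g} GB" for p in BYTES_PER_PARAM)
    print(f"{size}B -> {row}")
```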
Part two: the KV cache
The KV cache is the memory that holds the attention keys and values for every token in every active request. It grows with sequence length and with concurrency, and under heavy load it can consume 20 to 40 percent of total VRAM. This is the part teams forget. You can fit the weights perfectly and still fall over the moment fifty users send long prompts at once, because the KV cache had nowhere to grow. Always size for weights plus a realistic concurrent KV-cache budget, and leave 10 to 15 percent headroom on top. A safe planning ceiling is to assume you can usefully fill about 85 to 90 percent of a card.
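As a rough sketch, the per-token KV footprint is commonly estimated as 2 (keys and values) × layers × KV heads × head dimension × bytes per element. The shape used below (80 layers, 8 KV heads, head dimension 128, a 16-bit cache) is an assumed 70B-class configuration for illustration only; read the real values from your model's config before relying on the result.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element):
    """Keys plus values for one token across every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element

def kv_cache_gb(concurrent_sequences, tokens_per_sequence, per_token_bytes):
    """KV memory in GB for a given number of in-flight sequences."""
    return concurrent_sequences * tokens_per_sequence * per_token_bytes / 1e9

# Assumed 70B-class shape (hypothetical, for illustration).
per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8,
                               head_dim=128, bytes_per_element=2)

# Fifty users, each holding a 4,096-token context.
demand = kv_cache_gb(concurrent_sequences=50, tokens_per_sequence=4096,
                     per_token_bytes=per_token)

print(f"KV cache per token: {per_token / 1e6:.2f} MB")
print(f"KV cache for 50 x 4096 tokens: {demand:.1f} GB")
```

Under these assumptions the cache demand alone is comparable to the weights, which is why sizing for weights only fails under concurrency.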
| Model size | FP16 weights | FP8 weights | INT4 weights | Practical single-card fit (FP8) |
|---|---|---|---|---|
| 7B | ~14 GB | ~7 GB | ~3.5 GB | Comfortable on one H100 80GB, room for large KV cache |
| 32B | ~64 GB | ~32 GB | ~16 GB | Fits one H100 80GB with healthy KV-cache headroom |
| 70B | ~140 GB | ~70 GB | ~35 GB | Just fits one H100 80GB at FP8; KV cache is tight under concurrency |
A 70B model at FP8 occupies about 70GB of an 80GB card before a single request arrives, leaving at most around 10GB for the KV cache, and closer to 2GB if you hold to a 90 percent fill ceiling. That is fine at low concurrency and dangerous at high concurrency. If you expect heavy parallel load on a 70B, plan for two cards with tensor parallelism, or accept INT4 and validate the quality drop. Do not assume a single card will hold steady at scale just because the weights technically fit.
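To make the single-card squeeze concrete, this sketch estimates how many sequences fit in whatever VRAM is left after the weights. The per-token KV figure (327,680 bytes) and the 2,048-token sequence length are assumptions for illustration; the card size and weight size come from the table above.

```python
def max_concurrent_sequences(card_gb, weights_gb, fill_ceiling,
                             kv_bytes_per_token, tokens_per_sequence):
    """Sequences that fit in the VRAM left over once weights are loaded."""
    kv_budget_bytes = max(card_gb * fill_ceiling - weights_gb, 0) * 1e9
    return int(kv_budget_bytes // (kv_bytes_per_token * tokens_per_sequence))

KV_BYTES_PER_TOKEN = 327_680   # assumed 70B-class value, hypothetical
TOKENS_PER_SEQUENCE = 2048     # assumed average context per request

# 70B at FP8 on one 80 GB card, filling the whole card vs. a 90% ceiling.
whole_card = max_concurrent_sequences(80, 70, 1.00,
                                      KV_BYTES_PER_TOKEN, TOKENS_PER_SEQUENCE)
with_ceiling = max_concurrent_sequences(80, 70, 0.90,
                                        KV_BYTES_PER_TOKEN, TOKENS_PER_SEQUENCE)
# Same model split across two cards with tensor parallelism.
two_cards = max_concurrent_sequences(160, 70, 0.90,
                                     KV_BYTES_PER_TOKEN, TOKENS_PER_SEQUENCE)

print(f"one card, 100% fill: {whole_card} sequences")
print(f"one card, 90% fill:  {with_ceiling} sequences")
print(f"two cards, 90% fill: {two_cards} sequences")
```

The absolute numbers depend entirely on the assumed shape and context length; the point is how sharply capacity changes once the weights no longer dominate the card.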
Quantisation: FP8 by default, INT4 to fit
Quantisation reduces the precision of the stored weights to save memory and lift throughput. The two choices that matter in production are FP8 and INT4, and they sit at different points on the quality curve.
FP8 is the default on Hopper-class hardware such as the H100. It is close to lossless for the overwhelming majority of tasks, halves the weight footprint against FP16, and roughly doubles throughput. There is rarely a good reason not to serve FP8 on an H100 if the model supports it.
INT4 is a different proposition: it cuts weight VRAM by about 75 percent and can lift throughput by 2.6 to 3.1 times, which is how you fit a large model onto a smaller or cheaper card. The cost is precision. INT4 degrades work that depends on exactness (code generation, mathematics, structured extraction) more than it degrades casual conversation. The discipline is non-negotiable: run your own evaluation set before and after quantising, on your own prompts, and only ship the quantised model if the quality you measure is acceptable for your use case. Never trust a generic benchmark to predict how INT4 behaves on your workload.
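A before-and-after evaluation can be as small as the sketch below. It assumes exact-match scoring and a 2-point tolerance, both of which are illustrative choices rather than recommendations from this guide; the two "models" are stand-in functions so the example runs on its own, and in practice each would call the relevant endpoint.

```python
def exact_match_accuracy(generate, eval_set):
    """Share of prompts whose output matches the expected answer exactly."""
    hits = sum(1 for prompt, expected in eval_set
               if generate(prompt).strip() == expected)
    return hits / len(eval_set)

def quantisation_gate(baseline, quantised, eval_set, max_drop=0.02):
    """Compare two models on the same prompts and decide whether to ship."""
    base = exact_match_accuracy(baseline, eval_set)
    quant = exact_match_accuracy(quantised, eval_set)
    return {"baseline": base, "quantised": quant,
            "ship": (base - quant) <= max_drop}

# Hypothetical evaluation set: (prompt, expected answer).
eval_set = [("2+2", "4"), ("17*3", "51"), ("capital of France", "Paris"),
            ("10/4", "2.5")]

# Stand-in models: the "quantised" one gets one arithmetic answer wrong.
baseline_answers = dict(eval_set)
quantised_answers = {**baseline_answers, "17*3": "52"}

result = quantisation_gate(baseline_answers.get, quantised_answers.get, eval_set)
print(result)
```

Exact match suits extraction and arithmetic; for free-form output you would swap in a scoring function that fits your task.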
Throughput: what one GPU actually serves
Capacity planning needs concrete numbers. Here is what a single H100 80GB realistically delivers with vLLM at steady state as of mid-2026, which lets you work backwards from your traffic to the number of cards you need.
- A 70B model on one H100 80GB sustains roughly 80 to 120 concurrent requests at around 300 to 500 tokens per second aggregate. That is enough for a substantial internal tool or a moderate-traffic product feature, but it is not infinite — model the queue.
- A 7B model on the same H100 sustains roughly 200 to 400 concurrent requests and over 8,000 tokens per second aggregate. The smaller model is dramatically cheaper per token, which is the whole argument for routing easy work to a small self-hosted model and reserving the large one for hard requests.
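Working backwards from traffic to card count can be sketched with Little's law (requests in flight = arrival rate × average time in the system) plus a token-throughput check. The per-card limits below take the conservative end of the 70B figures above; the traffic numbers are hypothetical.

```python
import math

def cards_needed(peak_rps, mean_request_seconds, mean_output_tokens,
                 per_card_concurrency, per_card_tokens_per_second):
    """Cards required to satisfy both the concurrency and token limits."""
    in_flight = peak_rps * mean_request_seconds       # Little's law
    tokens_per_second = peak_rps * mean_output_tokens
    by_concurrency = math.ceil(in_flight / per_card_concurrency)
    by_tokens = math.ceil(tokens_per_second / per_card_tokens_per_second)
    return max(by_concurrency, by_tokens, 1)

# Hypothetical workload: 2 requests/s at peak, 30 s per request,
# 200 output tokens each, served by a 70B model.
cards = cards_needed(peak_rps=2, mean_request_seconds=30,
                     mean_output_tokens=200,
                     per_card_concurrency=80,
                     per_card_tokens_per_second=300)
print(f"cards needed: {cards}")
```

In this example the token budget, not the concurrency limit, is the binding constraint, so it is worth checking both before ordering hardware.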
The lesson is the same one that governs API cost optimisation: match the model to the task. Many teams self-host a small, fast model for the bulk of their traffic and only reach for a 70B — self-hosted or via API — for the genuinely difficult minority. If you are already thinking in those terms, our companion guides on cutting LLM costs with prompt caching and model routing and on distillation and semantic caching map directly onto a self-hosted fleet.
The self-host versus API breakeven
This is the section teams most often get wrong, because they compare the GPU-hour rate to the per-token API rate and stop there. The honest comparison includes the labour and the fixed engineering cost of running a service.
The visible cost of self-hosting is GPU rental, and it is a fixed cost: a GPU costs the same whether it is busy or idle. The hidden cost is operations. As of mid-2026, a realistic self-hosted deployment consumes roughly 10 to 20 engineering hours a month in maintenance (patching, monitoring, incident response, model swaps, dependency upgrades), which at senior-engineer rates works out to about 750 to 3,000 US dollars a month in labour, on top of the GPU bill. There is also a one-off cost to build observability, autoscaling and a rollback path before you serve production traffic at all.
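The breakeven can be framed as a utilisation threshold. In the sketch below the GPU hourly rate, the API price per million tokens and the 500 tokens-per-second card throughput are placeholder assumptions, not quoted prices; only the labour range comes from this section. Substitute your own figures.

```python
HOURS_PER_MONTH = 730

def self_host_monthly_usd(gpu_hourly_usd, num_gpus, labour_monthly_usd):
    """Fixed monthly cost: GPU rental plus maintenance labour."""
    return gpu_hourly_usd * HOURS_PER_MONTH * num_gpus + labour_monthly_usd

def breakeven_utilisation(monthly_cost_usd, api_usd_per_million_tokens,
                          card_tokens_per_second, num_gpus):
    """Fraction of total GPU capacity you must use to match the API bill."""
    breakeven_tokens_m = monthly_cost_usd / api_usd_per_million_tokens
    capacity_tokens_m = (card_tokens_per_second * num_gpus
                         * HOURS_PER_MONTH * 3600 / 1e6)
    return breakeven_tokens_m / capacity_tokens_m

# Placeholder inputs: one GPU at $2.50/hour, $1,500/month labour,
# an API charging $3.00 per million tokens.
cost = self_host_monthly_usd(gpu_hourly_usd=2.50, num_gpus=1,
                             labour_monthly_usd=1500)
utilisation = breakeven_utilisation(cost, api_usd_per_million_tokens=3.00,
                                    card_tokens_per_second=500, num_gpus=1)

print(f"self-host cost: ${cost:,.0f}/month")
print(f"breakeven utilisation: {utilisation:.0%}")
```

A result above 100 percent means the card cannot produce enough tokens to beat the API at those prices, whatever the traffic.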
| Factor | Self-hosted (vLLM) | Managed API |
|---|---|---|
| Pricing model | Fixed GPU rental (pay whether busy or idle) | Per token (pay only for what you use) |
| Best at | High, steady utilisation | Spiky or low utilisation |
| Ongoing labour | ~10–20 eng hours/month (~$750–$3,000) | Near zero |
| Data residency control | Full — runs in your own region/account | Depends on provider's regions and terms |
| Time to first request | Days to weeks (build the platform) | Minutes (point at an endpoint) |
| Model choice | Any open-weight model, fully controlled | Provider's catalogue and roadmap |
The decision rule that falls out of this is simple. Self-host when at least two of these hold: utilisation is high and steady enough to keep a rented GPU genuinely busy; data residency under DPDP or UK GDPR forces the workload to stay inside a region you control; and you have engineers who can absorb the maintenance load without it derailing the roadmap. Use an API when traffic is spiky or modest, the data is not residency-constrained, and you would rather spend your scarce platform-engineering time on the product. For most early-stage teams in India and the UK, an API is the correct starting point, and self-hosting becomes compelling once a specific high-volume or residency-bound workload emerges. If you want the deeper cost framing, our guide on cutting API costs up to 90 percent with prompt caching is the natural companion to this breakeven.
Data residency: the India and UK angle
For many teams the breakeven maths is moot, because residency makes the decision. Under India's DPDP framework, and under UK GDPR, certain categories of personal data are most safely processed within a controlled jurisdiction — and customer contracts in regulated sectors such as finance, health and government frequently make this explicit. A self-hosted model running in AWS Mumbai (ap-south-1) for an Indian deployment, or AWS London (eu-west-2) for a UK one, keeps both the data and the inference inside a boundary you can attest to in a Data Processing Agreement. An external API in another region introduces a cross-border transfer that you then have to justify, indemnify or engineer around.
GPU availability and pricing differ across these regions, and that feeds back into the breakeven. In India, the IndiaAI compute initiative has materially expanded subsidised GPU access, which can tilt the self-host maths favourably for teams that qualify; for the wider context on India's open-weight and sovereign-AI momentum, see Sarvam open-sourcing its 105B foundational model. In the UK, you are more likely to be sourcing from the major clouds or specialist GPU providers at market rates. The practical move is to price your specific workload in your specific region — Mumbai, London, or a regional specialist — rather than assuming a single global GPU-hour figure, and to weigh that against the residency obligation you are actually under.
Monitoring and observability you cannot skip
Using an API means the provider handles reliability. Self-hosting means you own it, and a self-hosted model with no observability is an incident waiting to happen. Before production traffic, instrument at least four things:
- GPU utilisation and VRAM, to catch KV-cache pressure before it triggers an out-of-memory error.
- Request throughput and queue depth, to know when you need another card.
- Token-level latency, including time-to-first-token and inter-token latency: the numbers users actually feel.
- Error and saturation rates.
vLLM exposes Prometheus-compatible metrics out of the box, so the wiring is mostly configuration rather than code. The same observability discipline that production RAG demands applies here; our guide on hybrid retrieval and agent observability covers the tracing side of the same picture. And before you scale to multiple regions, settle the operational basics first: a rollback path, a model-version registry, and a load-testing harness that reproduces your real concurrency.
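For a quick queue-depth check before a full Prometheus and alerting stack is in place, the text format is easy to read directly. This is a minimal sketch: it parses a sample payload rather than calling a live server, it assumes lines carry no trailing timestamp, and the metric names shown are illustrative, since the exact names vary by vLLM version. Check the output of your own server's metrics endpoint.

```python
def parse_metrics(text):
    """Parse Prometheus text format into {metric_name: value}.

    Simplification: labels are dropped, so if a metric appears with
    several label sets the last one wins.
    """
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        name_and_labels, _, raw_value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        values[name] = float(raw_value)
    return values

def queue_is_backing_up(metrics, waiting_metric, max_waiting):
    """True when more requests are queued than the chosen threshold."""
    return metrics.get(waiting_metric, 0.0) > max_waiting

# Sample payload; metric names are illustrative and version-dependent.
SAMPLE = """
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="demo"} 12.0
vllm:num_requests_waiting{model_name="demo"} 37.0
"""

metrics = parse_metrics(SAMPLE)
alert = queue_is_backing_up(metrics, "vllm:num_requests_waiting", max_waiting=20)
print(metrics)
print("scale out" if alert else "ok")
```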
So — should you self-host?
The honest answer is: probably not yet, and then suddenly yes. Start on an API. Ship the product. Instrument your usage. When a workload emerges that is both high-volume and steady, or that carries a hard residency obligation under DPDP or UK GDPR, run the breakeven with your real numbers and your real labour cost. If two of the three conditions hold, stand up vLLM — it is three commands and an OpenAI-compatible endpoint, so the switch is cheap and reversible. Self-hosting open-weight models in 2026 is a genuine, well-supported option. It is just an option you should reach for deliberately, when the load and the compliance picture earn it, rather than by default.