Deployment & Infra · 7 June 2026 · 15 min read

Self-Host or API? The 2026 LLM Inference Cost-and-Latency Decision

As of June 2026, "should we self-host the model?" is one of the most expensive questions an engineering team can answer on gut feeling. In truth it is mostly arithmetic plus a few hard constraints. This guide covers the real cost components, a breakeven method you run on your own numbers, the latency trade-offs, the vLLM serving stack, the serverless middle path, and the India and UK residency angles that can override the maths entirely.

AI Tech Connect editorial · Published 7 June 2026

The breakeven nobody calculates before they self-host

Most teams that decide to self-host an LLM do it for the wrong reasons. They self-host because the API bill arrived and felt large, or because a board member asked why a strategic capability sits on someone else's infrastructure, or simply because running your own model feels more like engineering than calling an endpoint. These are emotional inputs, not numerical ones, and they tend to produce a decision that looks bold and costs more than the thing it replaced.

The decision is, for the most part, arithmetic. There is a token volume at which owning the hardware becomes cheaper than renting tokens, and below that volume self-hosting loses on cost almost every time. Layered on top of the arithmetic are a small number of hard constraints — data residency, owned fine-tuned weights, a latency floor your product cannot miss — that can override the maths and force a decision regardless of cost. Everything else is noise.

What almost nobody does before committing is run the breakeven calculation honestly. Honestly means counting the costs that do not appear on the GPU invoice: the engineer-hours spent patching, monitoring and recovering the stack; the GPU sitting idle overnight while you still pay for it; the redundancy you need so a single node failure is not an outage. Whether you are an Indian startup weighing IndiaAI's subsidised GPUs or a UK firm bound by EU data-residency rules, the method is the same. This guide walks through that method, gives you a clearly labelled illustrative worked example, and tells you when the constraints should win even when the numbers say API. For the wider economic backdrop, our companion piece on AI inference cost economics and FinOps in 2026 sets out why this question matters more every quarter.

The true cost of an API token — and what's hidden in self-host

An API token has the great virtue of being a single, legible number. You pay a published rate per million input and output tokens, the provider runs the GPUs, handles scaling, patches the kernels, absorbs hardware failure, and keeps the lights on. There is no idle cost, because you only pay for tokens you actually use, and there is no operations team, because the operations are the provider's problem. For a workload with variable or modest volume, that simplicity is not a convenience — it is the whole economic argument.

Self-hosting replaces that one legible number with a stack of mostly invisible ones. The GPU rental or purchase is the part everyone sees, but it is rarely the largest part of the true bill. The first hidden cost is idle time: a GPU you rent by the hour is billed whether it is serving a request or sitting at three percent utilisation overnight, and most real traffic is bursty enough that a self-hosted node spends a large fraction of its life idle. The second is engineering: keeping an inference stack healthy in production takes on the order of ten to twenty engineer-hours a month — patching, upgrading the serving engine, chasing memory leaks, tuning batching, responding to incidents — which at senior rates is roughly 750 to 3,000 dollars a month of labour that never appears on the cloud invoice. The third is redundancy and overhead: monitoring, load balancing, a standby node so a single failure is not downtime.

Add these together and a useful rule of thumb emerges: a realistic self-host all-in cost lands at roughly three to five times the raw GPU price. The table below separates the visible from the hidden so you can see why.

Cost component	API (managed)	Self-host (own / rent GPUs)
Per-token / compute	Published rate, pay only for tokens used	GPU rental or amortised purchase — billed even when idle
Idle / utilisation	None — you never pay for unused capacity	Significant — bursty traffic leaves GPUs idle much of the day
Engineering / maintenance	Included — provider patches and tunes	~10–20 engineer-hours/month (~$750–$3,000/mo labour)
Scaling & redundancy	Provider absorbs spikes and failures	You build it — standby nodes, load balancing, monitoring
All-in cost	The quoted rate is the cost	≈ 3–5× the raw GPU price once everything is counted

Pricing as of 2026-06 — GPU prices, API rates and breakeven points move; re-run the maths with current numbers.
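To make the hidden components concrete, here is a minimal sketch that adds them up. It is one possible decomposition, not a standard formula: the class and field names are hypothetical, the 730-hour month is an assumption, and the numbers in the usage lines are placeholders rather than real prices. Because a rented GPU is billed around the clock, idle time is already inside the GPU line in this model; standby nodes, labour and overhead are added on top.

```python
from dataclasses import dataclass


@dataclass
class SelfHostCosts:
    """Monthly self-host cost inputs. Every value is an illustrative assumption."""
    gpu_hourly_rate: float       # rental price per GPU-hour, in dollars
    gpu_count: int               # GPUs serving traffic
    standby_gpus: int            # extra GPUs kept for redundancy
    engineer_hours: float        # maintenance hours per month
    engineer_hourly_rate: float  # loaded labour cost per hour, in dollars
    overhead_monthly: float      # monitoring, load balancing and similar
    hours_per_month: float = 730.0

    def raw_gpu_cost(self) -> float:
        # The number a naive comparison uses: serving GPUs only.
        return self.gpu_hourly_rate * self.gpu_count * self.hours_per_month

    def all_in_cost(self) -> float:
        # Serving plus standby GPUs, billed whether busy or idle,
        # then labour and overhead on top.
        total_gpus = self.gpu_count + self.standby_gpus
        gpus = self.gpu_hourly_rate * total_gpus * self.hours_per_month
        labour = self.engineer_hours * self.engineer_hourly_rate
        return gpus + labour + self.overhead_monthly

    def multiplier(self) -> float:
        # How far the all-in bill sits above the raw GPU price.
        return self.all_in_cost() / self.raw_gpu_cost()


# Placeholder inputs, chosen only to exercise the arithmetic.
costs = SelfHostCosts(
    gpu_hourly_rate=1.0, gpu_count=2, standby_gpus=1,
    engineer_hours=10, engineer_hourly_rate=75, overhead_monthly=100,
)
print(f"raw GPU: ${costs.raw_gpu_cost():,.0f}  all-in: ${costs.all_in_cost():,.0f}")
```

Swapping in your own rates shows quickly which line dominates; for a small fleet it is often labour rather than hardware.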

Watch out

The single most common error in a self-host business case is comparing the API rate against the raw GPU rental price and stopping there. That comparison ignores idle time, engineer-hours, redundancy and incident response — the costs that make self-host all-in roughly three to five times the GPU sticker price. If your spreadsheet only has two cells and one of them is the GPU hourly rate, you have not modelled the decision; you have modelled the most flattering corner of it.

The breakeven maths, step by step

Breakeven is not a number you can look up, because it genuinely depends on your model, your traffic and how you cost engineering time — and published estimates vary widely. So treat it as a method you run, in four steps.

Step one: tokens per day. Measure your real combined input and output token volume from production logs, not a guess.
Step two: GPU-hours needed. From your model's measured throughput on a target GPU (say, output tokens per second from a vLLM benchmark), work out how many GPUs you need to serve peak load with headroom, and therefore how many GPU-hours per month you are committing to.
Step three: self-host all-in per month. Take the GPU cost for those hours and multiply by roughly three to five to fold in idle time, the ten to twenty engineer-hours of maintenance and the redundancy overhead.
Step four: API per month. Multiply your tokens per day by thirty and by the provider's per-token rate.

Whichever number is smaller wins, and you re-run it whenever volume or prices move.
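The four steps translate almost directly into code. The sketch below is a minimal version under stated assumptions: the function names are hypothetical, the 1.3 headroom factor and the 730-hour month are placeholders, and the all-in multiplier is the rough three-to-five range from this guide rather than a measured value.

```python
import math


def gpus_needed(peak_tokens_per_sec: float,
                tokens_per_sec_per_gpu: float,
                headroom: float = 1.3) -> int:
    """Step two: GPUs required to serve peak load with headroom."""
    return math.ceil(peak_tokens_per_sec * headroom / tokens_per_sec_per_gpu)


def self_host_monthly(gpu_count: int,
                      gpu_hourly_rate: float,
                      all_in_multiplier: float = 3.0,
                      hours_per_month: float = 730.0) -> float:
    """Step three: raw GPU cost scaled by the all-in multiplier."""
    return gpu_count * gpu_hourly_rate * hours_per_month * all_in_multiplier


def api_monthly(tokens_per_day: float,
                price_per_million_tokens: float,
                days: int = 30) -> float:
    """Step four: tokens per day, times thirty, times the per-token rate."""
    return tokens_per_day * days / 1_000_000 * price_per_million_tokens


def cheaper_option(tokens_per_day: float,          # step one, from your logs
                   peak_tokens_per_sec: float,
                   tokens_per_sec_per_gpu: float,
                   gpu_hourly_rate: float,
                   price_per_million_tokens: float,
                   all_in_multiplier: float = 3.0):
    """Run all four steps and return (winner, self_host_cost, api_cost)."""
    gpus = gpus_needed(peak_tokens_per_sec, tokens_per_sec_per_gpu)
    self_host = self_host_monthly(gpus, gpu_hourly_rate, all_in_multiplier)
    api = api_monthly(tokens_per_day, price_per_million_tokens)
    winner = "self-host" if self_host < api else "api"
    return winner, self_host, api


# Placeholder inputs; replace every one with your own measurements.
print(cheaper_option(50_000_000, 1000, 400, 1.0, 1.5))
```

Keeping each step as its own function makes it easy to re-run the comparison when a single input, such as the GPU rate, changes.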

Here is a clearly labelled illustrative worked example to show the shape of the result. Imagine a workload of about fifty million tokens a day that a small open model handles well. Via API on a small model, the monthly cost works out to roughly 2,250 dollars. Self-hosted on four A10G GPUs sized to serve that load with headroom, the all-in monthly cost (GPU rental plus idle plus maintenance plus redundancy) works out to roughly 5,175 dollars, about 2.3 times more. At this moderate volume, self-hosting loses, and it loses precisely because of the hidden costs, not the GPU rate.

Illustrative workload (~50M tokens/day)	Approx. monthly cost	Notes
Small model via API	≈ $2,250	Pay-per-token, fully managed, no idle, no ops
Self-host on 4×A10G (all-in)	≈ $5,175	GPU + idle + ~10–20 eng-hrs/mo + redundancy (≈ 2.3× more)

The figures above are illustrative and chosen to show the method, not to predict your bill.
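The worked example above can be unpacked with a few lines of arithmetic. This sketch back-derives the blended API rate the illustrative figures imply, assuming a 30-day month; that rate is an inference from the example, not a quoted price. The last line asks what daily volume would make the same self-host bill match the API, under the strong assumption that the same four GPUs could absorb the extra traffic.

```python
TOKENS_PER_DAY = 50_000_000
API_MONTHLY = 2_250.0        # illustrative figure from the example
SELF_HOST_MONTHLY = 5_175.0  # illustrative all-in figure from the example
DAYS = 30                    # assumption: 30-day month

# Monthly volume in millions of tokens.
monthly_millions = TOKENS_PER_DAY * DAYS / 1_000_000

# Blended price per million tokens implied by the API figure.
implied_api_rate = API_MONTHLY / monthly_millions

# What each self-hosted million tokens costs at this volume.
self_host_per_million = SELF_HOST_MONTHLY / monthly_millions

# How much more self-hosting costs at this volume.
ratio = SELF_HOST_MONTHLY / API_MONTHLY

# Daily volume at which the API bill would equal the self-host bill,
# assuming the fleet and its all-in cost stay unchanged.
breakeven_tokens_per_day = SELF_HOST_MONTHLY / implied_api_rate * 1_000_000 / DAYS

print(f"implied API rate: ${implied_api_rate:.2f} per million tokens")
print(f"self-host: ${self_host_per_million:.2f} per million tokens ({ratio:.1f}x)")
print(f"breakeven: {breakeven_tokens_per_day / 1e6:.0f}M tokens/day")
```

Under those assumptions the crossover sits a little above one hundred million tokens a day, which is consistent with a crossover in the high-volume range for a small model.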

Where does the crossover actually sit? Because the inputs vary so much, no single figure is trustworthy, and you should be sceptical of anyone who quotes one as fact. In practice the crossover commonly lands in the high-volume range — on the order of tens to hundreds of millions of tokens per day, depending on model size, your quality bar and how you value engineering time. A team running a small model at sustained, predictable, very high volume can cross into self-host territory; a team running a large model intermittently almost never does. For a concrete walk-through of the calculation at the upper end, our analysis of a DeepSeek V4 Pro self-host breakeven on 8×H100 shows the method applied to a frontier-scale open model, and the broader spend trajectory is in our inference cost economics piece.

Pro tip

Run the breakeven on your sustained volume, not your peak. Self-hosting only wins when GPUs stay busy, so a workload that runs at a fifty-million-tokens-a-day pace for two hours and then drops to near zero has a much worse self-host case than its peak suggests, because you pay for idle GPUs the other twenty-two hours. If your traffic is spiky rather than steady, the serverless middle path covered below will usually beat both DIY self-host and a flat frontier API.
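A short sketch shows why sustained volume matters more than peak. It computes utilisation and the effective cost per million tokens for a fixed monthly GPU bill; the function names and all input values are hypothetical placeholders.

```python
def utilisation(busy_hours_per_day: float) -> float:
    """Fraction of the day the GPUs are actually serving traffic."""
    return busy_hours_per_day / 24.0


def effective_cost_per_million(gpu_monthly_cost: float,
                               tokens_per_sec_when_busy: float,
                               busy_hours_per_day: float,
                               days: int = 30) -> float:
    """Fixed monthly bill divided by the tokens actually served."""
    tokens = tokens_per_sec_when_busy * busy_hours_per_day * 3600 * days
    return gpu_monthly_cost / (tokens / 1_000_000)


# Same hardware, same bill, different traffic shapes (placeholder numbers).
steady = effective_cost_per_million(3000.0, 1000.0, busy_hours_per_day=24)
bursty = effective_cost_per_million(3000.0, 1000.0, busy_hours_per_day=2)
print(f"utilisation at 2 busy hours: {utilisation(2):.1%}")
print(f"steady: ${steady:.2f}/M tokens, bursty: ${bursty:.2f}/M tokens")
```

With the bill fixed, the cost per token scales inversely with busy hours, so two busy hours a day makes each token twelve times more expensive than round-the-clock load.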

Latency: TTFT, throughput, and why self-host can win

Cost is only half the decision. The other half is latency, and it is where self-hosting has a genuine, structural advantage that no amount of API spend can fully buy. Two numbers matter. The first is time-to-first-token (TTFT): how long the user waits, after sending a request, before the first token of the response appears. It governs perceived responsiveness — a chat that starts streaming in 80 milliseconds feels instant, one that takes a second feels sluggish. The second is output throughput, measured in tokens per second, which governs how fast the full answer arrives once it starts and how many concurrent users a GPU can serve.
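Both numbers are easy to measure yourself from any streamed response. The helper below is a minimal sketch with a hypothetical name; it treats each streamed chunk as roughly one token, which is an approximation, and it accepts an injectable clock so it can be exercised without a live server.

```python
import time
from typing import Callable, Iterable, Optional


def measure_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Time a streamed response; each chunk is treated as roughly one token."""
    start = clock()                # taken just before we begin reading
    first: Optional[float] = None  # when the first chunk arrived
    last: Optional[float] = None   # when the most recent chunk arrived
    count = 0
    for _ in chunks:
        now = clock()
        if first is None:
            first = now
        last = now
        count += 1

    if first is None or last is None:
        # Nothing was streamed, so there is nothing to report.
        return {"ttft_s": None, "tokens_per_s": None, "chunks": 0}

    decode_time = last - first
    # Throughput counts the chunks after the first over the time they took.
    rate = (count - 1) / decode_time if count > 1 and decode_time > 0 else None
    return {"ttft_s": first - start, "tokens_per_s": rate, "chunks": count}


# Simulated stream with a fake clock: first chunk at 0.1 s, then one every 0.1 s.
ticks = iter([0.0, 0.1, 0.2, 0.3, 0.4])
print(measure_stream(["Hel", "lo", " wor", "ld"], clock=lambda: next(ticks)))
```

For a real measurement, pass a generator that issues the request lazily, so the time spent sending the request and waiting for the first token falls inside the TTFT window.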

The reason self-host can win on latency is control. When you own the hardware, you own the queue: there are no noisy neighbours, no shared-tenant contention, and no provider-side rate limiting that throttles you at the worst moment. Public APIs are usually fast, but their tail latency under peak demand is outside your control — when the provider is saturated, your P99 spikes and there is nothing you can do. A self-hosted node serves a known, bounded set of traffic, so its latency is both lower and more consistent. For a product with a hard P99 latency requirement — a real-time voice assistant, a trading-adjacent tool, an interactive coding agent — that predictability can be worth more than the cost difference.
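To see why the tail matters separately from the median, here is a small sketch that computes P50 and P99 with the nearest-rank method. The function name is hypothetical and the latency samples are made up to illustrate the effect, not measured from any provider.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Made-up latencies in milliseconds: mostly fast, with two slow outliers.
latencies_ms = [100.0] * 98 + [2000.0] * 2

print("P50:", percentile(latencies_ms, 50))
print("P99:", percentile(latencies_ms, 99))
```

The median looks healthy while the P99 exposes the outliers, which is why a product with a hard latency requirement should track the tail rather than the average.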

The illustrative figures below show what a well-tuned self-host stack can achieve. They are indicative, not guarantees — your numbers depend on input length, batch size, quantisation and GPU.

Metric	Illustrative self-host figure	What it means
P50 TTFT: 70B FP8 on 1×H100 (vLLM)	≈ 50–150 ms	First token appears fast; chat feels responsive
Output throughput: quantised 70B on 1×A100 (vLLM)	≈ 1,000–3,000 tokens/sec	Aggregate across batched concurrent requests
Tail latency (P99) under load	Bounded — you own the queue	No noisy-neighbour spikes; APIs can spike at peak

Latency figures are illustrative and depend heavily on input length, batch size and quantisation.

vLLM in production: the default open-source serving stack

If you decide to self-host, the question of what to serve with has a clear default answer in 2026: vLLM. It is an open-source inference and serving engine that has become the standard open-weight serving stack, and for good reason. Its headline innovation is PagedAttention, which manages the attention key-value cache the way an operating system manages virtual memory: in fixed-size, non-contiguous pages rather than one large reserved block. That sharply reduces the memory fragmentation that otherwise wastes scarce GPU memory, which in turn lets vLLM keep more requests in flight and batch them together. The result is high throughput: a single A100 serving a quantised 70B model lands in the rough range of 1,000 to 3,000 output tokens per second across batched requests.
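The paging idea is easier to see in a toy model. The sketch below is not vLLM's implementation; it is a deliberately simplified block allocator with hypothetical names, and the block size and sequence lengths are placeholders. It compares how many requests fit when each one reserves space for the maximum context up front against how many fit when blocks are handed out only as tokens arrive.

```python
import math
from typing import Dict, List


class PagedKVCache:
    """Toy allocator that hands out fixed-size KV blocks on demand."""

    def __init__(self, total_blocks: int, block_size: int = 16) -> None:
        self.block_size = block_size
        self.free_blocks: List[int] = list(range(total_blocks))
        self.block_tables: Dict[str, List[int]] = {}  # request -> block ids
        self.lengths: Dict[str, int] = {}             # request -> tokens stored

    def append_token(self, request_id: str) -> None:
        length = self.lengths.get(request_id, 0)
        table = self.block_tables.setdefault(request_id, [])
        if length % self.block_size == 0:
            # The current block is full (or none exists yet): take any free
            # block. Blocks need not be adjacent in memory.
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[request_id] = length + 1

    def release(self, request_id: str) -> None:
        # A finished request returns its blocks to the pool immediately.
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)


def contiguous_capacity(total_blocks: int, block_size: int, max_model_len: int) -> int:
    """Requests that fit if each reserves max_model_len tokens up front."""
    return (total_blocks * block_size) // max_model_len


def paged_capacity(total_blocks: int, block_size: int, actual_len: int) -> int:
    """Requests that fit if each only holds blocks for its actual length."""
    return total_blocks // math.ceil(actual_len / block_size)


print("contiguous:", contiguous_capacity(1024, 16, 8192), "requests")
print("paged:", paged_capacity(1024, 16, 500), "requests")
```

The gap between the two capacities is the memory that up-front reservation wastes when typical requests are much shorter than the maximum context, and that reclaimed space is what allows larger batches.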

The practical reason vLLM wins, though, is that it is easy to adopt. It ships an OpenAI-compatible server, so the client code you already wrote against a hosted API mostly just works once you change the base URL. Bringing a model up is a single command, and quantisation (FP8, AWQ, GPTQ) is a flag rather than a research project.

# Install and serve an open-weight model with an OpenAI-compatible API.
pip install vllm

# Serve a 70B model in FP8 on the local GPU(s); --tensor-parallel-size
# shards across multiple GPUs if you have them.
vllm serve meta-llama/Llama-3.3-70B-Instruct \
    --quantization fp8 \
    --tensor-parallel-size 1 \
    --max-model-len 8192 \
    --port 8000

Once the server is up, point any OpenAI-compatible client at it. Because the API surface matches, switching between a hosted provider and your own vLLM node is often a one-line change to base_url — which also makes it cheap to A/B the self-host decision against an API before you commit.

from openai import OpenAI

# Point the standard OpenAI client at your local vLLM server.
client = OpenAI(
    base_url="http://localhost:8000/v1",   # your vLLM node
    api_key="not-needed-for-local",        # vLLM ignores it by default
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[
        {"role": "user", "content": "Summarise this support ticket in one line."},
    ],
    max_tokens=128,
    stream=True,        # stream so the first token is shown as soon as it is generated
)

for chunk in resp:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
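One way to keep that switch to a single setting is to hold the connection details in a small table and select a backend by name. This is a sketch under assumptions: the backend names, the `HOSTED_API_KEY` variable and the `https://api.example.com/v1` URL are hypothetical placeholders, not a real provider endpoint.

```python
import os
from typing import Mapping, Optional

# Hypothetical backend table; replace the hosted entry with your provider's details.
BACKENDS = {
    "local": {
        "base_url": "http://localhost:8000/v1",  # the vLLM node from above
        "api_key": "not-needed-for-local",
    },
    "hosted": {
        "base_url": "https://api.example.com/v1",  # placeholder URL
        "api_key_env": "HOSTED_API_KEY",           # placeholder variable name
    },
}


def client_kwargs(backend: str, env: Optional[Mapping[str, str]] = None) -> dict:
    """Return the keyword arguments for an OpenAI-compatible client."""
    env = os.environ if env is None else env
    cfg = BACKENDS[backend]
    # Use the literal key if one is configured, else read it from the environment.
    api_key = cfg.get("api_key") or env.get(cfg["api_key_env"], "")
    return {"base_url": cfg["base_url"], "api_key": api_key}


print(client_kwargs("local"))
```

The result can be passed straight to the client constructor, for example `OpenAI(**client_kwargs("local"))`, so an A/B run between the two paths only changes the backend name.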

None of this removes the operations burden that the cost section warned about — vLLM makes serving fast, not free of engineers — but it does mean the serving software itself is no longer the hard part. The hard part remains keeping the node, the GPU drivers and the autoscaling healthy in production, month after month.

The middle path: serverless GPU and inference clouds

The self-host-or-API framing is a little too binary, because a third option has matured in between. Serverless GPU and inference clouds — platforms such as Modal, Together, Fireworks, Groq, Cerebras and DeepInfra — host open-weight models for you and bill per use. You get the open models and the control over which weights you run, without buying or babysitting GPUs, and crucially without paying for idle time. They sit precisely between do-it-yourself self-hosting and the frontier APIs, and for a large share of teams they are the right answer.

The case for the middle path is strongest exactly where DIY self-host is weakest: bursty or unpredictable traffic. The silent killer of self-host economics is the idle GPU — enterprise GPU utilisation is often shockingly low, with much hardware sitting near-idle most of the time while the meter runs. Serverless removes that waste by scaling to zero between requests, so you pay for compute only when a token is actually being generated. It also gives you open-weight models that the big frontier-API providers may not offer, and a faster path to production than standing up your own cluster. The trade-off is that at very high, very steady volume the per-use premium can eventually exceed a well-utilised owned cluster — which is, again, a breakeven you calculate rather than assume.
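That breakeven can be framed as a utilisation threshold. The sketch below assumes the serverless platform bills per GPU-second, which is only one of several billing models in use, and that the owned rate is already all-in; the function names and prices are hypothetical placeholders.

```python
def owned_monthly(all_in_hourly_rate: float,
                  gpu_count: int,
                  hours_per_month: float = 730.0) -> float:
    """Owned or rented GPUs are billed for every hour, busy or idle."""
    return all_in_hourly_rate * gpu_count * hours_per_month


def serverless_monthly(price_per_gpu_second: float,
                       gpu_count: int,
                       busy_fraction: float,
                       hours_per_month: float = 730.0) -> float:
    """Serverless bills only for the seconds spent doing work."""
    busy_seconds = busy_fraction * hours_per_month * 3600 * gpu_count
    return busy_seconds * price_per_gpu_second


def breakeven_busy_fraction(all_in_hourly_rate: float,
                            price_per_gpu_second: float) -> float:
    """Utilisation above which the owned cluster becomes cheaper."""
    return all_in_hourly_rate / (price_per_gpu_second * 3600)


# Placeholder prices: serverless costs twice as much per busy hour.
threshold = breakeven_busy_fraction(all_in_hourly_rate=1.8, price_per_gpu_second=0.001)
print(f"owned wins above {threshold:.0%} utilisation")
```

If measured utilisation sits well below the threshold, the per-use premium is cheaper than paying for idle hours.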

Pro tip

Treat serverless inference clouds as the default starting point for open-weight workloads, and only graduate to DIY self-host once your traffic is provably high and steady enough to keep owned GPUs busy. You get open models, predictable per-token billing and no idle waste on day one, and you keep the option to migrate to your own cluster later — when the breakeven maths, run on real production volume, actually favours it.

India and UK angle: GPU access and data residency

The arithmetic above assumes a global GPU market, but two regional realities can move the answer materially. The first is subsidised access. Under the IndiaAI Mission, Indian builders can reach GPU capacity at rates well below the open market, which changes the self-host side of the breakeven directly: when the GPU input to the calculation is subsidised, the volume at which owning beats renting tokens shifts lower. An Indian startup that would not clear the breakeven on commercial cloud GPUs may clear it comfortably on subsidised capacity, so the same workload can rationally self-host in Bengaluru while it would rationally use an API in another market. Our builder's guide to IndiaAI's cheap GPU access covers how to actually obtain that capacity.
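The effect of a subsidised GPU rate on the breakeven can be sketched directly. The rates below are placeholders, not actual IndiaAI or commercial prices, and the function name is hypothetical. The model keeps labour and overhead as a fixed monthly cost, because a GPU subsidy does not reduce engineer-hours, and it assumes the fleet can actually serve the breakeven volume.

```python
def breakeven_tokens_per_day(gpu_hourly_rate: float,
                             gpu_count: int,
                             fixed_monthly: float,
                             api_price_per_million: float,
                             hours_per_month: float = 730.0,
                             days: int = 30) -> float:
    """Daily volume at which the API bill equals the self-host bill."""
    self_host = gpu_hourly_rate * gpu_count * hours_per_month + fixed_monthly
    monthly_millions = self_host / api_price_per_million
    return monthly_millions * 1_000_000 / days


# Placeholder rates only: same fleet, same labour, different GPU price.
commercial = breakeven_tokens_per_day(2.0, 4, 1500.0, 1.5)
subsidised = breakeven_tokens_per_day(1.0, 4, 1500.0, 1.5)
print(f"commercial: {commercial / 1e6:.0f}M tokens/day")
print(f"subsidised: {subsidised / 1e6:.0f}M tokens/day")
```

Halving the GPU rate lowers the breakeven volume, but by less than half, because the fixed labour and overhead do not shrink with the subsidy.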

The second reality is regulation, and here the maths can be overridden entirely. UK and EU data-residency requirements, and the obligations arriving under the EU AI Act, push some workloads on-prem or into a specific region regardless of cost. A UK firm handling regulated personal or health data, or one bound by EU data-residency rules, may simply be unable to send that data to a general-purpose API endpoint — in which case the decision is not "which is cheaper" but "which is permitted". The practical pattern is to keep the GPU close to the data: AWS Mumbai or London region for a residency-bound workload, or fully on-prem where the rules demand it. When a constraint like this applies, you run the breakeven anyway — but only across the options that are actually legal for your data.
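In code, that rule is a filter followed by a minimum: discard every option that processes data outside the permitted regions, then pick the cheapest of what remains. The option names, region labels and costs below are hypothetical placeholders, and whether a given region is actually permitted for your data is a legal question this sketch cannot answer.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class Option:
    name: str
    monthly_cost: float
    regions: FrozenSet[str]  # where the data would be processed


def cheapest_permitted(options: Iterable[Option],
                       allowed_regions: FrozenSet[str]) -> Optional[Option]:
    """Drop options that leave the allowed regions, then take the cheapest."""
    permitted = [o for o in options if o.regions <= allowed_regions]
    if not permitted:
        return None
    return min(permitted, key=lambda o: o.monthly_cost)


# Placeholder options and costs.
options = [
    Option("general-purpose API", 2250.0, frozenset({"us"})),
    Option("self-host, London region", 5175.0, frozenset({"uk"})),
]

choice = cheapest_permitted(options, allowed_regions=frozenset({"uk", "eu"}))
print(choice.name if choice else "no permitted option")
```

The cheaper option loses here only because it is not permitted, which is exactly the case where the constraint overrides the arithmetic.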

The decision checklist and next steps

Strip away the detail and the decision comes down to a short go/no-go checklist. Self-host only if you can answer yes to at least one of the first three and yes to the fourth:

Volume: Is your sustained volume genuinely in the high range, on the order of tens to hundreds of millions of tokens a day, on a workload an open model handles well?
Residency: Do hard data-residency or regulatory rules (EU AI Act, UK/EU residency, DPDP-bound data) require the model to run in a specific place or on-prem?
Owned weights: Do you need fine-tuned weights you own and control, rather than a hosted model you cannot inspect or move?
Team: Do you actually have the people to run it — the ten to twenty engineer-hours a month, the on-call, the monitoring — month after month?

If none of the first three is a clear yes, an API is almost certainly cheaper and simpler. If your traffic is bursty, the serverless middle path beats both. If you do clear the bar, serve with vLLM, run the breakeven on real volume, and re-run it whenever prices move. Once you have chosen, the next job is cutting the bill on whichever path you took — our guide to LLM cost optimisation with cache, route and compress picks up exactly there.
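The checklist reduces to a few boolean tests. The function below is a sketch with a hypothetical name; the order of the checks, with the self-host bar tested before burstiness, is an assumption about how to break ties rather than something the checklist specifies, and a residency rule may still exclude some of the non-self-host paths.

```python
def recommend(high_sustained_volume: bool,
              residency_required: bool,
              need_owned_weights: bool,
              have_team: bool,
              bursty_traffic: bool = False) -> str:
    """Return 'self-host', 'serverless' or 'api' for the given answers."""
    # At least one of the first three questions must be a yes...
    has_reason = high_sustained_volume or residency_required or need_owned_weights
    # ...and the fourth must be a yes too.
    if has_reason and have_team:
        return "self-host"
    if bursty_traffic:
        return "serverless"
    return "api"


print(recommend(True, False, False, have_team=True))
print(recommend(False, False, False, have_team=True, bursty_traffic=True))
print(recommend(True, False, False, have_team=False))
```

Encoding the answers this way makes the decision easy to revisit: when volume, rules or team capacity change, you change one argument and re-run it.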

If you are the engineer who modelled this honestly and shipped the inference stack that serves your product reliably and affordably, that is precisely the kind of proof-of-work that the people hiring in AI want to see. A Verified Builder profile on AI Tech Connect is where you put it.


Frequently asked

Is self-hosting an LLM cheaper than an API?

For most teams in 2026, no. Once you count idle GPU time, DevOps overhead and engineer hours, API access is usually cheaper than self-hosting, and the maths only flips at very high volume or when privacy and residency force you on-prem. The headline GPU rental price is the smallest part of the bill: a realistic self-host all-in cost runs roughly three to five times the raw GPU price once you add maintenance, monitoring, redundancy and the ten to twenty engineer-hours a month it takes to keep the stack healthy. The right way to settle it is to run the breakeven calculation on your own token volume rather than trust any single quoted figure.

At what volume does self-hosting break even?

The honest answer is that breakeven is a calculation you run, not a fixed number, and published estimates vary widely. The crossover commonly lands in the high-volume range — on the order of tens to hundreds of millions of tokens per day — depending on the model size, your quality bar and how you value engineering time. At moderate volume self-hosting almost always loses: an illustrative workload of around fifty million tokens a day costs roughly 2,250 dollars a month on a small model via API but roughly 5,175 dollars a month self-hosted on four A10G GPUs, about 2.3 times more. Plug your own tokens-per-day, GPU-hours and all-in monthly costs into the method in this guide and let the arithmetic decide.

What is vLLM and why is it the default?

vLLM is an open-source inference and serving engine that has become the default open-weight serving stack in 2026. Its core innovation, PagedAttention, manages the attention key-value cache the way an operating system manages virtual memory, which sharply reduces memory waste and lets the server batch many requests together for high throughput. In practice a single A100 can serve a quantised 70B model at roughly 1,000 to 3,000 output tokens per second across batched requests, and vLLM exposes an OpenAI-compatible API, so once you run vllm serve <model> you can point existing client code at it with minimal changes. It is open source, well maintained and fast, which is why most teams that self-host reach for it first.

Does self-hosting give lower latency?

It can, and this is often the strongest non-cost reason to self-host. Because you own the hardware, you control the queue and avoid the noisy-neighbour spikes that shared APIs can suffer at peak demand, so your latency is lower and more consistent. As an illustration, a 70B FP8 model on a single H100 served with vLLM can reach a P50 time-to-first-token of roughly 50 to 150 milliseconds for typical input lengths, with steady output throughput. Public APIs are usually fast too, but their tail latency under load is outside your control. If predictable P99 latency is a hard product requirement, self-hosting gives you levers an API cannot.

When should I use a serverless GPU platform instead?

Reach for a serverless GPU or inference cloud — Modal, Together, Fireworks, Groq, Cerebras or DeepInfra — when you want to run open-weight models without owning idle GPUs. These platforms host open models for you and bill per use, so they sit between do-it-yourself self-hosting and frontier APIs. They win when your traffic is bursty or unpredictable, when you want an open model that the big API providers do not offer, or when your own GPUs would sit idle most of the day. Idle GPUs are the silent cost of self-hosting, with enterprise utilisation often shockingly low, and serverless removes that waste while keeping you on open weights.
