The DeepSeek V4-Pro economics that changed the self-host conversation

As of 22 May 2026, DeepSeek made its post-promotional pricing permanent: $0.435 per million input tokens, $0.87 per million output tokens, and $0.003625 per million cache-hit input tokens. Set against the closed frontier APIs — which still sit in the $3 to $15 per million range — V4-Pro is, on paper, between 5× and 30× cheaper at the wire. And because the weights are open, you also have the option not to use DeepSeek's API at all.

That second option is what this piece is about. The managed API price is so low that the self-host conversation has flipped on its head. A year ago, "should we run our own LLM" was usually a sovereignty question with cost as a secondary tailwind. With V4-Pro it is the inverse: cost is now the headwind, because the managed price is genuinely hard to beat. The only situations where running your own 8× H100 cluster wins are (a) very high steady-state volume, or (b) data residency and contractual constraints that simply rule out the managed endpoint.

This guide walks through the numbers honestly — including the cases where the answer is "stay on the API". V4-Pro scores 80.6% on SWE-bench Verified and the higher-tier V4-Pro-Max hits 93.5 on LiveCodeBench, so the question is not "is the model good enough"; it is "is your workload big enough or constrained enough to justify the operational lift". See our earlier coverage: DeepSeek V4-Pro Hits 80.6% on SWE-Bench.

Model Input ($/MTok) Output ($/MTok) Cache-hit input ($/MTok) Source
DeepSeek V4-Pro $0.435 $0.87 $0.003625 Artificial Analysis
Claude Opus 4.7 $5.00 $25.00 $0.50 Anthropic pricing page
GPT-5.5 (typical) ~$3.00 ~$15.00 ~$0.30 OpenAI pricing
V4-Pro via OpenRouter $0.40–$0.60 $0.85–$1.10 Provider-dependent OpenRouter
Pro tip

Before you spec a single GPU, instrument your existing workload for cache-hit rate. V4-Pro's cache-hit input price is roughly two orders of magnitude below fresh input. If you can keep a stable system prompt and rolling context, the effective per-token cost drops by 99% on the cached portion — which often kills the self-host case on its own.

The two break-even drivers — volume and data sovereignty

There are exactly two reasons to self-host V4-Pro. Everything else is rationalisation.

Reason one: volume. Per Spheron's deployment analysis and DeepInfra's hosting cost breakdown, the rule of thumb is that self-hosting starts to make sense once your steady-state managed-API spend crosses roughly $3,000–$5,000 per month. Below that, an 8× H100 node at ap-south-1 or eu-west-2 spot pricing — even with 60–70% utilisation — costs more than the equivalent API bill once you add the DevOps engineer-hours to keep vLLM running. Above that line, your own node starts to amortise.

Translated into tokens: at $0.435 input and $0.87 output, $4,000 per month of API consumption is roughly 50 million tokens per day on a balanced 70/30 input-output mix with moderate prefix caching. That is real production traffic — a mid-size SaaS agent product, a busy internal copilot at a 500-engineer company, or a high-volume document-extraction pipeline at a regulated UK insurer.

Reason two: data sovereignty. Some workloads simply cannot leave a jurisdiction. India's DPDP regime, NHS clinical data, UK Financial Conduct Authority oversight on customer chat logs, and certain EU AI Act high-risk classifications all narrow your hosting options in ways that the managed DeepSeek endpoint cannot satisfy. If the choice is "self-host or do not deploy", the economics are whatever they need to be.

Avoid

Do not chase a 200M-token-per-day self-host workload without spot-instance reservations or a committed-use contract. On-demand H100 pricing in ap-south-1 and eu-west-2 ranges from $3.50 to $4.50 per GPU-hour; an 8× node on demand is $25,000+ per month of compute alone. Spot or three-year reserved capacity is the difference between this paying off and not.

8× H100 vs 8× H200 — what you actually need

V4-Pro is a roughly 1.6-trillion-parameter mixture-of-experts model. In native FP8 it does not fit on one GPU. The standard production configuration is 8× H100 SXM5 (80GB) or 8× H200 SXM5 (141GB) wired with NVLink and served with tensor-parallel plus expert-parallel routing.

Config HBM per GPU NVLink BW Indicative price (IN/UK clouds, on-demand) Throughput band
8× H100 SXM5 80 GB HBM3 900 GB/s $28–$36/hr ~800–1,300 tok/s decode
8× H200 SXM5 141 GB HBM3e 900 GB/s $38–$48/hr ~1,100–1,800 tok/s decode
8× H100 spot (preemptible) 80 GB HBM3 900 GB/s $12–$18/hr Same as on-demand, with interruption risk

Throughput estimates are per the Spheron deployment notes and consistent with community vLLM benchmarks; your prompt mix will shift these meaningfully. Long-context inference (above 200k tokens) cuts decode throughput substantially because the KV cache eats HBM. H200 is materially better there — the extra 61 GB per GPU is the difference between 64k and 200k context at batch-32 without spilling.

Watch out

The price gap between H100 and H200 is narrower than the marketing suggests, but the H200 wins on long-context economics. If your workload is short-prompt agent calls (under 8k tokens), H100 spot is the cheapest serious option. If you are doing 100k+ context document review or repo-wide refactors, the H200 amortises its premium quickly through better batch density.

vLLM expert-parallelism for the 1.6T MoE

vLLM and SGLang are the two production-ready inference servers that understand DeepSeek's MoE routing. The serving command below is the minimal viable starting point for 8× H100; tune --max-num-batched-tokens and the prefix-cache settings against your traffic.

vllm serve deepseek-ai/DeepSeek-V4-Pro \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --max-model-len 131072 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.92 \
  --enable-prefix-caching \
  --dtype fp8 \
  --kv-cache-dtype fp8_e5m2 \
  --port 8000 \
  --served-model-name deepseek-v4-pro

Three flags do most of the work here. --enable-expert-parallel shards the MoE experts across the eight GPUs so each token only activates a subset; without it, V4-Pro is essentially untenable on a single node. --enable-prefix-caching is non-optional in production — your effective cost per output token depends on the cache hit rate more than on any other knob. And --kv-cache-dtype fp8_e5m2 halves your KV memory footprint with minimal quality loss, which is what lets you push to 131k context on H100.

Recommended

Treat --enable-prefix-caching like Anthropic's prompt_caching and OpenAI's cached input pricing — the cost model assumes it. A workload with a 70% cache-hit rate has an effective input cost that is roughly 30% of the headline number. Without prefix caching, your 8× H100 economics fall apart well before the volume break-even.

The single-node escape hatch — V4-Flash with 4-bit quantisation

If 8× H100 sounds like overkill for your use case, V4-Flash is the model you actually want. Per the Spheron deployment guide and the DeepInfra hosting analysis, V4-Flash at AWQ or GPTQ 4-bit quantisation fits on a single H100 80GB or even on a 2× L40S configuration. The quality gap to V4-Pro is real on hard reasoning tasks but unimportant for the majority of agent calls, document classification, and structured-output workloads.

This is the configuration we recommend for most teams considering self-hosting for the first time. A single-node deployment is one-tenth the operational complexity of an 8-GPU tensor-parallel cluster: no NVLink topology debugging, no expert-routing thermals, no multi-GPU OOM surprises. For Indian teams running DPDP-bound workloads, or UK regulated-data pipelines that need an in-region inference layer, V4-Flash on a single H100 is the pragmatic starting point. Promote to V4-Pro on 8× H100 only when you have measured a workload that genuinely needs the larger model.

Doing the math: monthly cost models for three example workloads

The table below works three concrete workloads through the API-vs-self-host comparison. All numbers assume a 70/30 input-output ratio and 50% prefix-cache hit rate on the API side; self-host figures assume 65% utilisation on spot pricing.

Workload Daily tokens API cost/month Self-host cost/month (8× H100 spot) Verdict
Small SaaS agent 10M tok/day ~$170 ~$8,500 API wins by 50×. Stay on the API.
Mid-size internal copilot 50M tok/day ~$850 ~$8,500 API still wins on pure cost; self-host only if sovereignty applies.
High-volume extraction pipeline 200M tok/day ~$3,400 ~$8,500–$11,000 Closer. Self-host pays off with reserved capacity or sovereignty constraints.

The pattern is uncomfortable but important: at DeepSeek's published API rates, even 200 million tokens per day still costs less on the managed endpoint than on your own cluster at retail spot pricing. The volume break-even on cost alone is materially higher than the older Spheron and DeepInfra estimates suggested — those analyses were written before the post-promotional pricing dropped. If you want self-hosting to win on cost in 2026, you need three things stacked together: committed-use or reserved compute (cutting the $8,500 floor roughly in half), high utilisation (above 70%), and a workload that benefits from your own prefix-cache topology rather than the provider's.

Want to discuss this with other verified Builders?

Every article on AI Tech Connect is written by a Verified Builder. Browse profiles, shortlist who you want to hire or collaborate with.

Browse Builders →

When the API still wins (and that is most of the time)

For the majority of teams reading this, the answer is "stay on the API, possibly via OpenRouter for routing flexibility, and revisit in twelve months". The DeepSeek managed endpoint at $0.435/$0.87 is genuinely difficult to beat with a self-hosted setup unless you bring volume, reserved compute, and operational maturity to the table.

The cases where self-hosting clearly wins are narrow but real:

  • Regulated data that cannot leave a jurisdiction — UK NHS clinical workloads, Indian DPDP-bound customer records, EU AI Act high-risk applications.
  • Latency-sensitive pipelines where the round-trip to DeepSeek's Beijing endpoint is the bottleneck, particularly for European users.
  • Steady-state high-volume batch work with a committed three-year compute reservation that drops the $25,000/month on-demand floor to $8,000–$12,000.
  • Customisation beyond the API surface — speculative decoding tweaks, custom routing logic across multiple checkpoints, or research workloads on the raw weights.

What this means for IN data-sovereignty teams and UK regulated workloads

India's DPDP framework and the UK's combination of NHS Digital data rules, FCA model-governance expectations, and EU AI Act spillover all push some workloads off any non-domestic managed API by default. For those teams, the V4-Pro economics in this article matter less than the answer to a different question: can you produce a contractually clean, in-region inference layer for under your current compliance budget?

The honest answer for most Indian and UK organisations is yes, but not at frontier-API price parity. Expect to pay 3–5× the equivalent managed-API cost for an in-region 8× H100 or 8× H200 deployment, in exchange for the sovereignty story. Indian builders should look at compute partners in ap-south-1 with explicit DPDP-aligned data processing agreements; UK builders should target eu-west-2 or London-specific colocation with NHS Digital and FCA-friendly contractual frames. The compute itself is fungible — the legal and audit packaging is what you are actually paying for.

Five mistakes builders make when running these numbers

  1. Comparing on-demand self-host pricing to API pricing. The API is always cheaper than on-demand 8× H100. The only fair comparison is spot or reserved capacity at high utilisation. If you have not modelled both, you have not modelled this.
  2. Ignoring cache-hit economics on the API side. DeepSeek's $0.003625 cache-hit input price means a workload with stable system prompts pays nothing close to the headline rate. Your self-host configuration must beat the cached-API cost, not the cold-API cost.
  3. Underestimating DevOps overhead. Running a multi-node tensor-parallel cluster in production is a half-time engineer's job, minimum, before you start handling weight updates, kernel security patches, and provider outages. Price that in at $80,000–$150,000 per year in IN and UK markets.
  4. Skipping V4-Flash. For most agent and extraction workloads, V4-Flash on a single H100 with 4-bit quantisation is the right starting point. V4-Pro on 8× H100 is the production destination only if you have measured a workload that needs the larger model — see our coverage of DeepSeek V4 as the open-weight coding giant builders can now self-host.
  5. Pricing in a static world. DeepSeek dropped its API price by roughly 60% across 2026. Your self-host build assumed today's prices; in twelve months the managed endpoint may be cheaper still. Build in a re-evaluation point at month nine of any committed-compute contract.

For an honest read on whether the open-weight movement is the right destination at all, see our recent piece on DeepSeek V4 and the frontier-cost question. Primary sources for the numbers in this article: Artificial Analysis, Spheron's deployment guide, and DeepInfra's 2026 pricing analysis.