Something shifted in the week ending 16 May 2026. Three separate teams we spoke with — a fintech startup in Bengaluru, an NHS digital contractor in Manchester, and an independent agent-framework builder in Pune — all independently reached the same conclusion: the cheapest route to frontier-grade coding assistance is no longer a subscription. It is a download.
The model they are all downloading is DeepSeek V4. And after spending time with the benchmarks, the architecture notes, and the deployment logistics, we think they are right.
What makes DeepSeek V4 different from V3
DeepSeek V3, released late 2025, was already a formidable open-weight model. V4 is a different class of system. The headline numbers: 1.6 trillion total parameters arranged in a Mixture-of-Experts (MoE) architecture, a 1-million-token context window, a LiveCodeBench score of 93.5 (ranked first across all evaluated models), and a Codeforces rating of 3206 — a level that corresponds to Grandmaster in competitive programming.
The MoE architecture deserves a moment of attention because it is central to why V4 is practical to run at all. In a dense transformer like GPT-3 or Llama 2, every parameter is active for every token. At 1.6 trillion parameters, a fully dense model would require prohibitive compute to run. MoE sidesteps this by routing each token to a small subset of "expert" sub-networks — typically around 37 billion parameters are active per forward pass in V4's configuration. The rest of the weights sit dormant in memory, not burning FLOPs.
This distinction matters enormously for inference economics. A 1.6T MoE model does not behave like a 1.6T dense model at runtime. It behaves more like a 37B dense model in terms of compute, while retaining the representational capacity of the full 1.6T parameter set accumulated through training. The result is a model that is both smarter than a 37B dense model and considerably cheaper to run than a 1.6T dense model would be.
The 1-million-token context window is the other architectural leap. GPT-4o's context sits at 128,000 tokens, making DeepSeek V4's window roughly eight times larger. For builders working on codebase-wide refactoring, long-document legal analysis, or full-repository code review, this is not a marginal improvement. It fundamentally changes what you can hand to the model in a single call.
On benchmarks, V4's LiveCodeBench score of 93.5 places it first on a leaderboard that includes GPT-4o, Claude Opus 4.7, and Gemini 2.5 Pro. LiveCodeBench matters because it uses competitive programming problems released after each model's training cutoff, which prevents dataset contamination — the benchmark gaming that plagues many LLM evaluation tables. A 93.5 on LiveCodeBench means the model is solving problems that it almost certainly did not memorise during training. The Codeforces rating of 3206 reinforces this: Codeforces ratings are earned through real competition performance, not static test sets.
The model is released on HuggingFace under an Apache 2.0 licence. That means you can download it, modify it, serve it commercially, and redistribute it — without paying DeepSeek a penny.
Self-hosting hardware requirements
Let us be direct: DeepSeek V4 is a large model and running it requires serious hardware. The table below gives realistic estimates for different deployment configurations based on MoE inference behaviour.
| Setup | VRAM required | GPU configuration | Throughput (est.) | Best for |
|---|---|---|---|---|
| Full precision (FP16) | ~640 GB | 8× A100 80 GB | ~20–30 tok/s | Production, maximum quality |
| INT8 quantisation | ~320 GB | 4× A100 80 GB | ~25–40 tok/s | Production, cost-quality balance |
| INT4 quantisation | ~160 GB | 4× A100 40 GB | ~35–55 tok/s | High-throughput batch workloads |
| INT4 + offloading | ~80 GB GPU + RAM | 2× A100 40 GB + 512 GB RAM | ~5–10 tok/s | Experimentation, low-volume use |
Throughput figures assume a single-user workload. Under concurrent load, per-user throughput drops — size your cluster accordingly. Tensor parallelism across multiple GPUs is handled automatically by vLLM and SGLang, the two frameworks most commonly used with this model class.
Deployment walkthrough with vLLM
vLLM is the recommended serving framework for DeepSeek V4. It handles tensor parallelism, continuous batching, and chunked prefill — the last of which is essential for 1M-token context workloads. Here is a minimal production-ready deployment command:
python -m vllm.entrypoints.openai.api_server \
--model deepseek-ai/DeepSeek-V4 \
--tensor-parallel-size 8 \
--max-model-len 131072 \
--quantization int8 \
--enable-chunked-prefill \
--max-num-batched-tokens 8192 \
--gpu-memory-utilization 0.92 \
--host 0.0.0.0 \
--port 8000
A few flags merit explanation:
--tensor-parallel-size 8 distributes the model shards across eight GPUs. Adjust this to match your actual GPU count — four for a four-GPU INT8 setup.
--max-model-len 131072 sets the effective context window to 128K tokens for the initial deployment. You can increase this to 1048576 (1M) once you have validated your VRAM headroom, but 128K is a prudent starting point that avoids OOM errors on first boot.
--enable-chunked-prefill is non-negotiable for long-context workloads. Without it, a single 100K-token prefill stalls all other requests while the KV cache is being populated. Chunked prefill breaks long prompts into interleaved processing slices, keeping latency reasonable for other concurrent users.
--quantization int8 activates AWQ-style INT8 weight quantisation. For most coding tasks — including the benchmark scores cited in this article — INT8 introduces no perceptible quality degradation. We recommend starting here rather than FP16 unless you are running ablation studies.
For memory-efficient attention, vLLM uses PagedAttention by default, which manages the KV cache in non-contiguous memory pages. This alone can reduce peak VRAM usage by 20–40% compared to naive implementations and is one reason vLLM remains the preferred choice over Hugging Face's native generation pipeline for production deployments.
An alternative worth evaluating is SGLang, which offers structured generation primitives and RadixAttention — a mechanism that reuses shared KV cache prefixes across requests. If your application repeatedly sends the same large system prompt, SGLang's prefix caching can meaningfully increase effective throughput.
When to use DeepSeek V4 versus closed APIs
The open-weight versus closed-API decision is not purely technical. It involves cost, data sensitivity, operational overhead, and latency requirements. The table below summarises the honest trade-offs.
| Criteria | DeepSeek V4 (self-hosted) | GPT-4o API | Claude Opus 4.7 |
|---|---|---|---|
| Marginal cost per token | £0 / ₹0 (hardware amortised) | ~$5–15 / MTok input | ~$15 / MTok input |
| Coding benchmark (LiveCodeBench) | 93.5 (ranked #1) | Low 80s | Mid 80s |
| Context window | 1M tokens | 128K tokens | 200K tokens |
| First-token latency | Higher (cold start + prefill) | Low (hosted, optimised) | Low (hosted, optimised) |
| Data sovereignty | Full — data never leaves your infra | Data processed by OpenAI | Data processed by Anthropic |
| Setup complexity | High — GPU cluster, vLLM, monitoring | None — API key only | None — API key only |
| Ongoing maintenance | Your team owns it | Vendor managed | Vendor managed |
| Commercial licence | Apache 2.0 — fully permissive | Vendor ToS applies | Vendor ToS applies |
The break-even point is roughly 100–150 million tokens per month at current A100 spot prices on major cloud providers. Below that volume, a closed API is almost certainly cheaper when you account for engineering time and infrastructure overhead. Above it, self-hosting compounds savings rapidly.
For a sense of scale: 150 million tokens per month is equivalent to roughly 500 full coding sessions a day — each with a 10,000-token context. A team of 20 engineers using an AI coding assistant daily will comfortably hit that volume.
The Indian and UK builder advantage
Open-weight self-hosting is not merely a cost story. For teams in India and the United Kingdom, it has specific legal and strategic dimensions that closed APIs cannot address.
Data sovereignty under GDPR. UK teams processing personal data through a third-party AI API incur Article 28 data-processor obligations — you must sign a Data Processing Agreement, audit the vendor's security posture, and document the legal basis for the transfer. When the model runs inside your own infrastructure (whether on-premises or in a GDPR-compliant cloud region you control), the personal data never leaves your processing environment. Third-party processor obligations vanish. This is not a theoretical benefit: it is the reason multiple NHS digital contractors are actively piloting self-hosted models for anything touching patient records.
DPDP compliance for Indian teams. India's Digital Personal Data Protection Act 2023 creates similar obligations around data localisation and consent for certain categories of personal data. Building on a self-hosted model that processes all data within Indian cloud infrastructure (AWS Mumbai, Azure Central India, or IndiaAI Mission nodes) substantially simplifies your DPDP compliance posture. For fintech, healthtech, and edtech startups — sectors where DPDP enforcement is expected to tighten through 2026 — this is a meaningful risk reduction.
Cost economics at Indian GPU prices. The IndiaAI Mission GPU cluster offers A100 access at approximately ₹115–150 per GPU-hour. A four-GPU INT8 setup therefore costs ₹460–600 per hour. Running it 12 hours a day, 22 working days a month, costs ₹121,440–158,400 per month — roughly $1,450–$1,900 USD. At $5/MTok input pricing for a closed API, that monthly budget buys you 290–380 million input tokens. If your team is generating more than that, self-hosting has already crossed break-even.
Competitive moat. The broader May 2026 trend is unmistakable: open-weight model quality has converged to within 5–15 benchmark points of closed frontier models on most task categories. That gap is narrowing every quarter. Teams that invest in the infrastructure and operational expertise to run these models now are building a capability that is genuinely difficult to replicate quickly. The ability to deploy a frontier-class coding model inside your own infrastructure — with no ongoing per-token cost — is a structural advantage in any business where AI usage scales with product growth.
What it still cannot do well
We have spent considerable time making the case for DeepSeek V4 self-hosting. The case is strong. But a good deployment decision requires an honest accounting of the downsides.
Latency versus closed APIs. OpenAI and Anthropic have invested hundreds of millions of pounds in custom inference infrastructure, speculative decoding pipelines, and network-edge serving. A four or eight GPU cluster run by an engineering team will not match their first-token latency at low concurrency. The gap is typically 300–800ms on short prompts — noticeable in interactive applications, less relevant in batch or background processing.
Cold start times. Loading DeepSeek V4's weights into GPU memory from disk takes 8–20 minutes depending on storage throughput and GPU interconnect speed. This makes serverless-style scaling — spinning up new instances on demand — impractical. You will need to maintain warm instances, which means paying for GPU time even when the model is idle. Plan your scaling strategy accordingly.
Maintenance burden. Running a production LLM cluster is a distinct engineering discipline. You will need monitoring for GPU memory fragmentation, strategies for handling OOM events gracefully, a process for applying security patches to the serving stack, and operational runbooks for common failure modes. If your team does not have that expertise today, factor in the time cost of building it.
Model updates. DeepSeek will release improved versions of V4 — and when they do, you will need to re-download, re-quantise, and re-validate the new weights before deploying. Closed API providers update their models without any action required on your side. Whether that is a feature or a bug depends on how much you value stability versus capability currency.
None of these limitations are disqualifying for the right workload profile. Batch code review, CI/CD pipeline integration, overnight codebase analysis, document processing, and legal tech applications all tolerate higher latency and benefit enormously from zero marginal cost. For those use cases, DeepSeek V4 self-hosting is, on balance, the correct choice for any team processing meaningful volume.
For a detailed comparison of DeepSeek V4's API-based cost versus GPT-4o and MMLU benchmark positioning, see our earlier cost comparison piece. This article focuses specifically on self-hosted deployment, which is a distinct decision with different economics and requirements.
The open-weight era is not coming. It is here. DeepSeek V4 is the clearest evidence yet that builders who choose to own their inference stack are not making a compromise — they are making a bet that looks increasingly well-placed with every passing quarter.