How much VRAM do I need to run DeepSeek V4?

Running DeepSeek V4 at full FP16 precision requires approximately 640 GB of VRAM — typically eight A100 80 GB GPUs. For most production teams, INT8 quantisation halves this requirement to around four A100 80 GB GPUs. INT4 quantisation brings it down further to two to four A100 40 GB GPUs, though with some quality trade-off on the most complex coding tasks. For experimentation, a four-GPU setup with INT8 is the recommended starting point.

Is DeepSeek V4 genuinely better than GPT-4o at coding?

On LiveCodeBench — which tests real competitive programming problems rather than toy exercises — DeepSeek V4 scores 93.5 against GPT-4o's score in the low 80s. On Codeforces, its rating of 3206 puts it at grandmaster level. These are not cherry-picked benchmarks: LiveCodeBench uses problems released after model training cutoffs specifically to prevent contamination. For competitive programming, algorithm design, and complex multi-file refactoring tasks, the evidence strongly favours DeepSeek V4.

Can I use DeepSeek V4 commercially?

Yes. DeepSeek V4 is released under the Apache 2.0 licence, which permits commercial use, modification, and redistribution without royalty payments. You must retain the original licence and copyright notices. There are no restrictions on building commercial products or charging for services that use DeepSeek V4 as the underlying model. Always review the full licence text and your local legal requirements before deploying.

Does self-hosting satisfy GDPR and DPDP requirements?

Self-hosting gives you the infrastructure control needed to satisfy both GDPR (for UK and EU teams) and India's DPDP Act (for Indian teams). When the model runs entirely within your own infrastructure, user data never leaves your network perimeter, eliminating third-party data-processor obligations. However, self-hosting alone does not guarantee compliance — you still need appropriate access controls, audit logging, data-retention policies, and a legitimate legal basis for processing. Treat it as a necessary but not sufficient condition.

DeepSeek V4: The Open-Weight Coding Giant Builders Can Now Self-Host

Something shifted in the week ending 16 May 2026. Three separate teams we spoke with — a fintech startup in Bengaluru, an NHS digital contractor in Manchester, and an independent agent-framework builder in Pune — all independently reached the same conclusion: the cheapest route to frontier-grade coding assistance is no longer a subscription. It is a download.

The model they are all downloading is DeepSeek V4. And after spending time with the benchmarks, the architecture notes, and the deployment logistics, we think they are right.

What makes DeepSeek V4 different from V3

DeepSeek V3, released late 2025, was already a formidable open-weight model. V4 is a different class of system. The headline numbers: 1.6 trillion total parameters arranged in a Mixture-of-Experts (MoE) architecture, a 1-million-token context window, a LiveCodeBench score of 93.5 (ranked first across all evaluated models), and a Codeforces rating of 3206 — a level that corresponds to Grandmaster in competitive programming.

The MoE architecture deserves a moment of attention because it is central to why V4 is practical to run at all. In a dense transformer like GPT-3 or Llama 2, every parameter is active for every token. At 1.6 trillion parameters, a fully dense model would require prohibitive compute to run. MoE sidesteps this by routing each token to a small subset of "expert" sub-networks — typically around 37 billion parameters are active per forward pass in V4's configuration. The rest of the weights sit dormant in memory, not burning FLOPs.

This distinction matters enormously for inference economics. A 1.6T MoE model does not behave like a 1.6T dense model at runtime. It behaves more like a 37B dense model in terms of compute, while retaining the representational capacity of the full 1.6T parameter set accumulated through training. The result is a model that is both smarter than a 37B dense model and considerably cheaper to run than a 1.6T dense model would be.

The 1-million-token context window is the other architectural leap. GPT-4o's context sits at 128,000 tokens, making DeepSeek V4's window roughly eight times larger. For builders working on codebase-wide refactoring, long-document legal analysis, or full-repository code review, this is not a marginal improvement. It fundamentally changes what you can hand to the model in a single call.

On benchmarks, V4's LiveCodeBench score of 93.5 places it first on a leaderboard that includes GPT-4o, Claude Opus 4.7, and Gemini 2.5 Pro. LiveCodeBench matters because it uses competitive programming problems released after each model's training cutoff, which prevents dataset contamination — the benchmark gaming that plagues many LLM evaluation tables. A 93.5 on LiveCodeBench means the model is solving problems that it almost certainly did not memorise during training. The Codeforces rating of 3206 reinforces this: Codeforces ratings are earned through real competition performance, not static test sets.

The model is released on HuggingFace under an Apache 2.0 licence. That means you can download it, modify it, serve it commercially, and redistribute it — without paying DeepSeek a penny.

Self-hosting hardware requirements

Let us be direct: DeepSeek V4 is a large model and running it requires serious hardware. The table below gives realistic estimates for different deployment configurations based on MoE inference behaviour.

Setup	VRAM required	GPU configuration	Throughput (est.)	Best for
Full precision (FP16)	~640 GB	8× A100 80 GB	~20–30 tok/s	Production, maximum quality
INT8 quantisation	~320 GB	4× A100 80 GB	~25–40 tok/s	Production, cost-quality balance
INT4 quantisation	~160 GB	4× A100 40 GB	~35–55 tok/s	High-throughput batch workloads
INT4 + offloading	~80 GB GPU + RAM	2× A100 40 GB + 512 GB RAM	~5–10 tok/s	Experimentation, low-volume use

Throughput figures assume a single-user workload. Under concurrent load, per-user throughput drops — size your cluster accordingly. Tensor parallelism across multiple GPUs is handled automatically by vLLM and SGLang, the two frameworks most commonly used with this model class.

Pro tip: If you are running on the IndiaAI Mission GPU cluster (available at ₹115–150/hour for A100 nodes), INT8 quantisation on a 4× A100 80 GB configuration is the sweet spot. You get near-FP16 quality at half the memory footprint, and the per-hour cost makes overnight batch jobs economical.

Deployment walkthrough with vLLM

vLLM is the recommended serving framework for DeepSeek V4. It handles tensor parallelism, continuous batching, and chunked prefill — the last of which is essential for 1M-token context workloads. Here is a minimal production-ready deployment command:

python -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/DeepSeek-V4 \
  --tensor-parallel-size 8 \
  --max-model-len 131072 \
  --quantization int8 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.92 \
  --host 0.0.0.0 \
  --port 8000

A few flags merit explanation:

--tensor-parallel-size 8 distributes the model shards across eight GPUs. Adjust this to match your actual GPU count — four for a four-GPU INT8 setup.

--max-model-len 131072 sets the effective context window to 128K tokens for the initial deployment. You can increase this to 1048576 (1M) once you have validated your VRAM headroom, but 128K is a prudent starting point that avoids OOM errors on first boot.

--enable-chunked-prefill is non-negotiable for long-context workloads. Without it, a single 100K-token prefill stalls all other requests while the KV cache is being populated. Chunked prefill breaks long prompts into interleaved processing slices, keeping latency reasonable for other concurrent users.

--quantization int8 activates AWQ-style INT8 weight quantisation. For most coding tasks — including the benchmark scores cited in this article — INT8 introduces no perceptible quality degradation. We recommend starting here rather than FP16 unless you are running ablation studies.

For memory-efficient attention, vLLM uses PagedAttention by default, which manages the KV cache in non-contiguous memory pages. This alone can reduce peak VRAM usage by 20–40% compared to naive implementations and is one reason vLLM remains the preferred choice over Hugging Face's native generation pipeline for production deployments.

An alternative worth evaluating is SGLang, which offers structured generation primitives and RadixAttention — a mechanism that reuses shared KV cache prefixes across requests. If your application repeatedly sends the same large system prompt, SGLang's prefix caching can meaningfully increase effective throughput.

When to use DeepSeek V4 versus closed APIs

The open-weight versus closed-API decision is not purely technical. It involves cost, data sensitivity, operational overhead, and latency requirements. The table below summarises the honest trade-offs.

Criteria	DeepSeek V4 (self-hosted)	GPT-4o API	Claude Opus 4.7
Marginal cost per token	£0 / ₹0 (hardware amortised)	~$5–15 / MTok input	~$15 / MTok input
Coding benchmark (LiveCodeBench)	93.5 (ranked #1)	Low 80s	Mid 80s
Context window	1M tokens	128K tokens	200K tokens
First-token latency	Higher (cold start + prefill)	Low (hosted, optimised)	Low (hosted, optimised)
Data sovereignty	Full — data never leaves your infra	Data processed by OpenAI	Data processed by Anthropic
Setup complexity	High — GPU cluster, vLLM, monitoring	None — API key only	None — API key only
Ongoing maintenance	Your team owns it	Vendor managed	Vendor managed
Commercial licence	Apache 2.0 — fully permissive	Vendor ToS applies	Vendor ToS applies

The break-even point is roughly 100–150 million tokens per month at current A100 spot prices on major cloud providers. Below that volume, a closed API is almost certainly cheaper when you account for engineering time and infrastructure overhead. Above it, self-hosting compounds savings rapidly.

For a sense of scale: 150 million tokens per month is equivalent to roughly 500 full coding sessions a day — each with a 10,000-token context. A team of 20 engineers using an AI coding assistant daily will comfortably hit that volume.

The Indian and UK builder advantage

Open-weight self-hosting is not merely a cost story. For teams in India and the United Kingdom, it has specific legal and strategic dimensions that closed APIs cannot address.

Data sovereignty under GDPR. UK teams processing personal data through a third-party AI API incur Article 28 data-processor obligations — you must sign a Data Processing Agreement, audit the vendor's security posture, and document the legal basis for the transfer. When the model runs inside your own infrastructure (whether on-premises or in a GDPR-compliant cloud region you control), the personal data never leaves your processing environment. Third-party processor obligations vanish. This is not a theoretical benefit: it is the reason multiple NHS digital contractors are actively piloting self-hosted models for anything touching patient records.

DPDP compliance for Indian teams. India's Digital Personal Data Protection Act 2023 creates similar obligations around data localisation and consent for certain categories of personal data. Building on a self-hosted model that processes all data within Indian cloud infrastructure (AWS Mumbai, Azure Central India, or IndiaAI Mission nodes) substantially simplifies your DPDP compliance posture. For fintech, healthtech, and edtech startups — sectors where DPDP enforcement is expected to tighten through 2026 — this is a meaningful risk reduction.

Cost economics at Indian GPU prices. The IndiaAI Mission GPU cluster offers A100 access at approximately ₹115–150 per GPU-hour. A four-GPU INT8 setup therefore costs ₹460–600 per hour. Running it 12 hours a day, 22 working days a month, costs ₹121,440–158,400 per month — roughly $1,450–$1,900 USD. At $5/MTok input pricing for a closed API, that monthly budget buys you 290–380 million input tokens. If your team is generating more than that, self-hosting has already crossed break-even.

Competitive moat. The broader May 2026 trend is unmistakable: open-weight model quality has converged to within 5–15 benchmark points of closed frontier models on most task categories. That gap is narrowing every quarter. Teams that invest in the infrastructure and operational expertise to run these models now are building a capability that is genuinely difficult to replicate quickly. The ability to deploy a frontier-class coding model inside your own infrastructure — with no ongoing per-token cost — is a structural advantage in any business where AI usage scales with product growth.

What it still cannot do well

We have spent considerable time making the case for DeepSeek V4 self-hosting. The case is strong. But a good deployment decision requires an honest accounting of the downsides.

Warning: Do not self-host DeepSeek V4 if your primary workload requires sub-500ms first-token latency at scale. Hosting a 1.6T MoE model introduces cold-start times and prefill latency that closed APIs — running on purpose-built, hyperscale inference infrastructure — cannot currently be matched by a team-run cluster. For real-time coding assistants in an IDE (where users expect near-instant responses), the latency profile of a self-hosted cluster may degrade the experience noticeably.

Latency versus closed APIs. OpenAI and Anthropic have invested hundreds of millions of pounds in custom inference infrastructure, speculative decoding pipelines, and network-edge serving. A four or eight GPU cluster run by an engineering team will not match their first-token latency at low concurrency. The gap is typically 300–800ms on short prompts — noticeable in interactive applications, less relevant in batch or background processing.

Cold start times. Loading DeepSeek V4's weights into GPU memory from disk takes 8–20 minutes depending on storage throughput and GPU interconnect speed. This makes serverless-style scaling — spinning up new instances on demand — impractical. You will need to maintain warm instances, which means paying for GPU time even when the model is idle. Plan your scaling strategy accordingly.

Maintenance burden. Running a production LLM cluster is a distinct engineering discipline. You will need monitoring for GPU memory fragmentation, strategies for handling OOM events gracefully, a process for applying security patches to the serving stack, and operational runbooks for common failure modes. If your team does not have that expertise today, factor in the time cost of building it.

Model updates. DeepSeek will release improved versions of V4 — and when they do, you will need to re-download, re-quantise, and re-validate the new weights before deploying. Closed API providers update their models without any action required on your side. Whether that is a feature or a bug depends on how much you value stability versus capability currency.

None of these limitations are disqualifying for the right workload profile. Batch code review, CI/CD pipeline integration, overnight codebase analysis, document processing, and legal tech applications all tolerate higher latency and benefit enormously from zero marginal cost. For those use cases, DeepSeek V4 self-hosting is, on balance, the correct choice for any team processing meaningful volume.

For a detailed comparison of DeepSeek V4's API-based cost versus GPT-4o and MMLU benchmark positioning, see our earlier cost comparison piece. This article focuses specifically on self-hosted deployment, which is a distinct decision with different economics and requirements.

The open-weight era is not coming. It is here. DeepSeek V4 is the clearest evidence yet that builders who choose to own their inference stack are not making a compromise — they are making a bet that looks increasingly well-placed with every passing quarter.

DeepSeek V4: The Open-Weight Coding Giant Builders Can Now Self-Host

What makes DeepSeek V4 different from V3

Self-hosting hardware requirements

Deployment walkthrough with vLLM

When to use DeepSeek V4 versus closed APIs

The Indian and UK builder advantage

What it still cannot do well

Building with open-weight models?

Frequently asked

Self-hosting DeepSeek V4? Share your build.