The headline number and why it matters
On the SWE-Bench Verified leaderboard, DeepSeek V4 Pro now sits at 80.6% — clear of every other open-weight model and ahead of all the closed-source frontier coders shipping in May 2026. Claude Sonnet 4 sits at roughly 77.2%, GPT-5 at about 74.9%, Gemini 2.5 around 71.8%. On GPQA Diamond, V4 Pro scores 90.1, putting it within striking distance of the top closed-source reasoning models. And it runs on a 1M-token context window. You can download the weights today, point them at your own GPU cluster, and never send a single byte of source code to a third-party API.
That last sentence is the whole article, really. The benchmark lead will move within weeks — open-weight models leapfrog each other constantly, and Llama 4, Qwen 3.5, Gemma 4 and Mistral Medium 3.5 are all close behind. What is structurally new is that the best coding model in May 2026, by the only benchmark that still discriminates between frontier coders, is the one you can self-host.
HumanEval is finished as a benchmark. Top models cluster above 97%, which makes it noise. SWE-Bench Verified — patches against real bugs in real open-source repos, scored by whether the generated patch actually passes the upstream test suite — is the meaningful coding benchmark today. And on that benchmark, the leader is open.
- 80.6% on SWE-Bench Verified — the highest open-weight score on record and ahead of Claude Sonnet 4, GPT-5 and Gemini 2.5.
- 90.1 on GPQA Diamond — V4 Pro is not just a coding specialist; the reasoning generalises.
- 1M-token context window — entire monorepos fit in a single call, the same envelope as Claude Opus 4.7.
- Open-weight — weights downloadable under DeepSeek's commercial-use licence; training data and recipe are not.
- Part of the V4 family — V4 Flash is the lightweight variant; V4 Pro is the highest-capability tier in the same release wave.
For context on the V4 release as a whole, see our earlier piece on DeepSeek V4 and the LiveCodeBench numbers. This article is about V4 Pro specifically — the Pro variant is what changes the self-host economics conversation.
Before you spin up an H100 cluster, run V4 Pro against your own bug backlog for a week. Public benchmarks measure public benchmarks. A 3-percentage-point lead on SWE-Bench Verified can disappear or invert on your private codebase if your stack leans heavily on a niche framework or an in-house DSL. The economics only make sense if the accuracy holds on your work.
What's different about V4 Pro versus V4 base and V4 Flash
The DeepSeek V4 family is three models, not one, and confusing them costs money. The base V4 model is the workhorse — strong coder, reasonable reasoning, comparatively modest GPU footprint. V4 Flash is the lightweight variant aimed at high-QPS inference workloads where latency dominates and you accept a measurable accuracy drop in exchange for throughput. V4 Pro is the heavyweight: more parameters, longer effective reasoning chain, and the only tier that hits the 80.6% SWE-Bench number.
The practical decision tree:
- V4 Flash — autocomplete, inline edit prediction, fast unit-test generation, anything where you need responses in under 500ms and the task is well-bounded.
- V4 base — the daily-driver coding model for most teams. Good SWE-Bench performance (notably below Pro, but still in the open-weight top tier), much cheaper to host.
- V4 Pro — the model you reach for on multi-file refactors, hard production bugs, anything that benefits from the full 1M-token context and the longer reasoning trace.
You will almost certainly want to route between them rather than pay V4 Pro's costs for every keystroke. The same logic Cursor's routers apply to Claude Sonnet versus Opus applies here.
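The decision tree above can be sketched as a small routing function. This is a minimal illustration, not a production router: the tier labels, the `Task` fields and the 128k-token cutoff are all assumptions made for the example, and none of them are official DeepSeek identifiers or thresholds.

```python
from dataclasses import dataclass

# Hypothetical tier labels for illustration; not official model identifiers.
FLASH, BASE, PRO = "v4-flash", "v4-base", "v4-pro"

# Task kinds treated as well-bounded and as heavyweight (assumed categories).
BOUNDED_KINDS = {"autocomplete", "inline_edit", "unit_test"}
HEAVY_KINDS = {"refactor", "bugfix"}


@dataclass
class Task:
    kind: str               # e.g. "autocomplete", "refactor", "bugfix", "review"
    files_touched: int      # how many files the change spans
    context_tokens: int     # estimated prompt size in tokens
    latency_budget_ms: int  # how long the caller can wait


def pick_tier(task: Task) -> str:
    """Route a task to the cheapest tier that plausibly handles it."""
    # Sub-500ms, well-bounded work goes to the lightweight tier.
    if task.latency_budget_ms < 500 and task.kind in BOUNDED_KINDS:
        return FLASH
    # Multi-file refactors and hard bugs justify the heavyweight tier.
    if task.files_touched > 1 and task.kind in HEAVY_KINDS:
        return PRO
    # Very large prompts also go to Pro (128k is an assumed cutoff).
    if task.context_tokens > 128_000:
        return PRO
    # Everything else lands on the daily-driver tier.
    return BASE
```

In practice the thresholds would be tuned against your own latency and accuracy measurements rather than fixed up front.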
Self-host GPU footprint — what does Pro actually cost to run?
This is where the open-weight conversation either holds together or falls apart. Open weights are not free; they are differently expensive. You stop paying per-token fees to a US-hosted API and start paying for GPU hours, ops engineering and the salary of someone who knows how to keep an inference cluster up.
For V4 Pro at production throughput with the full 1M-token context, the practical floor is a multi-GPU H100 or B200 cluster. The math changes substantially with NVIDIA's B200 generation — see our deep-dive on B200 inference economics versus H100 for the underlying numbers. Roughly:
| Hardware | Workload profile | Approx. on-demand cost | Suitable for V4 Pro? |
|---|---|---|---|
| 4x H100 80GB | Dev / low-QPS, 128k context | ~$12-18/hr | Yes, with context limits |
| 8x H100 80GB | Production, full 1M context | ~$24-36/hr | Yes, practical floor |
| 4x B200 | Production, full 1M context | ~$28-40/hr | Yes, smaller footprint |
| IndiaAI Mission H100 pool | Indian sovereign GPU | ~150 INR/GPU-hr (subsidised) | Yes — best price/sovereignty mix for IN teams |
| UK AISI / Isambard-AI | UK sovereign GPU for research | Allocation-based | Yes for eligible workloads |
| 2x H100 80GB | Quantised V4 Pro, dev only | ~$6-9/hr | Marginal — accuracy drops |
The numbers above are ballpark on-demand rates from major Indian and UK cloud providers in May 2026. Reserved capacity drops them substantially. A team running a steady production workload typically pays 30-50% less than the on-demand sticker.
Cost-per-million-tokens — the only comparison that matters
Here is the headline that gets emailed up the chain. Approximate list prices for the closed-source frontier coders, set against amortised self-host estimates for V4 Pro:
| Model | Input $/MTok | Output $/MTok | Effective $/MTok (typical mix) |
|---|---|---|---|
| Claude Sonnet 4 API | ~$3 | ~$15 | ~$5-7 |
| GPT-5 API | ~$2.50 | ~$12 | ~$4-6 |
| Claude Opus 4.7 API | ~$5 | ~$25 | ~$9-12 |
| DeepSeek V4 Pro (self-host, 8x H100) | — | — | ~$0.40-1.20 (amortised) |
| DeepSeek V4 Pro (IndiaAI Mission) | — | — | ~$0.10-0.30 (amortised) |
The self-host numbers assume a reasonably utilised cluster — at least 60-70% of GPU-hours actually serving requests. Run a half-empty cluster and the per-token cost balloons because you are paying for idle silicon. This is the trap teams fall into: they price-compare against the headline rate and end up with worse unit economics than the API they were trying to escape.
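The utilisation effect is easy to see with a little arithmetic. The sketch below divides the hourly cluster cost by the tokens actually served in a paid hour. The $30/hr figure sits inside the 8x H100 range quoted earlier; the 20,000 tokens/s aggregate throughput is purely an assumed number for illustration, not a measured V4 Pro figure, so substitute your own benchmark.

```python
def cost_per_million_tokens(cluster_usd_per_hour: float,
                            tokens_per_second: float,
                            utilisation: float) -> float:
    """Amortised serving cost in USD per million tokens.

    tokens_per_second is the cluster's aggregate throughput while busy;
    utilisation is the fraction of paid GPU-hours spent serving requests.
    """
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    # Tokens served per hour that you pay for, idle time included.
    tokens_per_paid_hour = tokens_per_second * 3600 * utilisation
    return cluster_usd_per_hour / tokens_per_paid_hour * 1_000_000


# Assumed inputs: $30/hr cluster, 20,000 tokens/s when busy.
busy = cost_per_million_tokens(30.0, 20_000, 0.65)
half_empty = cost_per_million_tokens(30.0, 20_000, 0.25)
print(f"65% utilised: ${busy:.2f}/MTok")
print(f"25% utilised: ${half_empty:.2f}/MTok")
```

With the same hardware and the same hourly bill, the per-token cost scales inversely with utilisation, which is why the idle-cluster case compares so poorly.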
The honest read: if your workload sustains enough throughput, self-hosted V4 Pro comes out roughly 5-20x cheaper per token than the Claude Sonnet 4 API on the figures above. If your workload is spiky and modest, the API is still cheaper and a lot less work.
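One way to locate the crossover is a break-even calculation: the monthly token volume at which an always-on cluster costs the same as paying per token. The sketch below is a simplification under stated assumptions. It treats the cluster as a fixed monthly cost, ignores capacity limits, and the optional `ops_usd_per_month` parameter is a hypothetical placeholder for engineering overhead.

```python
HOURS_PER_MONTH = 730  # average hours in a calendar month


def breakeven_mtok_per_month(cluster_usd_per_hour: float,
                             api_usd_per_mtok: float,
                             ops_usd_per_month: float = 0.0) -> float:
    """Monthly volume, in millions of tokens, above which a dedicated
    always-on cluster costs less than paying the API per token."""
    if api_usd_per_mtok <= 0:
        raise ValueError("api_usd_per_mtok must be positive")
    fixed_monthly = cluster_usd_per_hour * HOURS_PER_MONTH + ops_usd_per_month
    return fixed_monthly / api_usd_per_mtok


def cheaper_option(monthly_mtok: float,
                   cluster_usd_per_hour: float,
                   api_usd_per_mtok: float,
                   ops_usd_per_month: float = 0.0) -> str:
    """Return "self-host" or "api" for a given monthly token volume."""
    threshold = breakeven_mtok_per_month(
        cluster_usd_per_hour, api_usd_per_mtok, ops_usd_per_month
    )
    return "self-host" if monthly_mtok > threshold else "api"


# Illustrative inputs: $30/hr cluster against a $6/MTok effective API rate.
hardware_only = breakeven_mtok_per_month(30.0, 6.0)
with_ops = breakeven_mtok_per_month(30.0, 6.0, ops_usd_per_month=15_000)
print(f"Break-even, hardware only: {hardware_only:.0f} MTok/month")
print(f"Break-even, with ops cost: {with_ops:.0f} MTok/month")
```

Adding a realistic ops line item moves the break-even point up considerably, which is the spiky-and-modest case in numbers.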
Every 1M-context model has the same issue: agentic edits past roughly 300k tokens drift, regardless of how high the benchmark score sits. V4 Pro is not exempt. If you build a long-horizon agentic loop where the model both reads and writes against a 600k-token working set, expect to hit the same coherence ceiling we documented for Claude Opus 4.7's 1M window. Cap your agentic loops below the threshold, or use RAG to keep the read context manageable.
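Capping the loop can be as simple as a budget object that evicts the oldest context once the working set crosses the threshold. The following is a minimal sketch: the `ContextBudget` class is hypothetical, the 300k default mirrors the rough figure above, and the four-characters-per-token estimate is a crude assumption that should be replaced with the model's real tokenizer.

```python
from collections import deque

MAX_WORKING_TOKENS = 300_000  # rough drift threshold; tune per workload


def estimate_tokens(text: str) -> int:
    """Crude estimate (about 4 characters per token); use a real tokenizer."""
    return max(1, len(text) // 4)


class ContextBudget:
    """Keeps the most recent chunks whose combined size fits under a cap."""

    def __init__(self, max_tokens: int = MAX_WORKING_TOKENS) -> None:
        self.max_tokens = max_tokens
        self.chunks = deque()  # (text, token_estimate) pairs, oldest first
        self.total = 0

    def add(self, text: str) -> None:
        size = estimate_tokens(text)
        self.chunks.append((text, size))
        self.total += size
        # Evict the oldest chunks until we are back under the cap,
        # always keeping at least the newest chunk.
        while self.total > self.max_tokens and len(self.chunks) > 1:
            _, dropped = self.chunks.popleft()
            self.total -= dropped

    def render(self) -> str:
        """Join the retained chunks into the prompt context."""
        return "\n".join(text for text, _ in self.chunks)
```

Oldest-first eviction is the simplest policy; a retrieval step that re-fetches only the relevant files is the RAG variant of the same idea.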
The data-residency lever — why this matters for IN and UK builders
For an Indian fintech under RBI's data-localisation guidance, or a UK NHS trust under the Data Protection Act 2018 (DPA) and the broader public-sector AI procurement rules, the cost-per-token conversation is secondary. The first question is: where does our source code go when an engineer asks the model to refactor it?
If the answer is "to a US-hosted API run by a foreign cloud provider," half the use cases die in compliance review. The remaining ones live behind a slow legal process that kills velocity. The open-weight option is the structural fix. V4 Pro running on the IndiaAI Mission H100 pool, or on a sovereign UK GPU allocation, or even on private on-prem hardware, means the source code never leaves the jurisdiction. That is not a cost optimisation; that is a categorical change in what projects you can actually ship.
The China-origin question deserves a straight answer. DeepSeek is a Chinese AI lab. The weights are downloadable and you run them on hardware you control: no telemetry, no API calls back to Beijing, no live inference dependency on a Chinese provider. That said, some procurement teams will block any model with a Chinese origin regardless of self-hosting. If that is your environment, your options narrow to Llama 4, Gemma 4 and Mistral Medium 3.5; Qwen 3.5 is also Chinese-origin, so the same block usually applies. See our piece on Mistral Medium 3.5's SWE-Bench performance for the European-origin alternative, and the Qwen 3.5 27B coding agent coverage for the closest direct comparator.
The agentic gap — name it honestly
Here is the part the leaderboard does not show. SWE-Bench Verified scores a single patch against a single test suite. It does not measure how a model behaves over a 40-step agentic loop that has to read, reason, edit, run tests, observe failures and iterate. On that dimension — the actual job of a coding agent — Claude Sonnet 4 inside Claude Code, or Claude Opus 4.7 inside an autopilot loop, is still measurably better than V4 Pro inside the best available open-source harness.
The gap is closing. Open-source agent harnesses like SWE-agent and OpenHands are improving fast. But in May 2026, the closed-source agent loops are more polished, more reliable and require less babysitting. Our coding agent leaderboard from earlier this month walks through the practical gap in detail.
The pragmatic split for most teams: self-hosted V4 Pro for batch refactor jobs and bulk patch generation; Claude Code with Sonnet 4 for interactive agentic work. The economics of the first job dominate (lots of tokens, predictable workload); the economics of the second job are dominated by engineer time, not token cost. Optimise each for what it actually consumes.
Open-weight is not no-risk
Three caveats to bake into your evaluation:
- Licence terms shift. DeepSeek's commercial-use licence has been amended more than once. Have counsel review the version that was current when you downloaded the weights, and re-review when you upgrade. Open-weight licences are not as battle-tested as the OSI-approved software licences your legal team is used to.
- Supply-chain risk. You are downloading multi-gigabyte weight files from a third party. Verify the checksums. Pin the version. Treat the weights as a build artefact with the same provenance discipline you apply to a Docker image.
- Fine-tuning expertise. Self-hosting is the easy part. Fine-tuning V4 Pro on your private codebase to recover the performance the stock model loses on niche stacks is where the real engineering investment lives. Budget for an ML engineer on that work, or accept the stock model's accuracy on your stack.
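The checksum-and-pin discipline from the supply-chain caveat can be sketched with the standard library alone. The `verify_weights` helper and the shape of the `pinned` mapping (file name to SHA-256 hex digest) are assumptions for illustration; where the pinned digests come from depends on how the weights are published and is not specified here.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large weight shards fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_weights(weights_dir: Path, pinned: dict) -> list:
    """Return names of files that are missing or do not match their pin.

    `pinned` maps file name -> expected SHA-256 hex digest, and would
    normally live in version control next to your deployment config.
    """
    failures = []
    for name, expected in pinned.items():
        path = weights_dir / name
        if not path.is_file() or sha256_of(path) != expected.lower():
            failures.append(name)
    return failures
```

An empty list means every pinned file is present and matches; a deployment pipeline might refuse to start the inference server otherwise.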
The broader shift — Chinese open-weight models setting the pace on coding benchmarks — is itself worth tracking. See our analysis of the May 2026 Chinese open-weight coding model wave for the geopolitical and competitive backdrop.
When V4 Pro is the right call
- You have a steady, predictable inference workload that justifies a dedicated GPU cluster.
- Data residency or source-code sovereignty is a hard requirement, not a preference.
- Your primary use case is single-shot patch generation, code review or refactor — not long-horizon agentic loops.
- Your team has, or can hire, the ops and ML engineering depth to run an inference cluster and (eventually) fine-tune.
- You are running batch jobs where the cost-per-token math favours amortised GPU hours over per-call API fees.
When it isn't
- Your workload is spiky and modest — the API providers will be cheaper end-to-end once you account for ops time.
- Your primary loop is interactive agentic coding — Claude Code with Sonnet 4 is still the more reliable end-to-end experience.
- Your procurement environment blocks Chinese-origin AI models regardless of self-hosting.
- You do not have the engineering capacity to keep an inference cluster healthy — running V4 Pro badly is worse than not running it at all.
The bottom line
DeepSeek V4 Pro is the first open-weight model where the headline benchmark genuinely beats the closed-source competition on coding. That is a structural shift, not a marginal one. For Indian and UK teams under data-residency pressure, it changes which projects you can ship at all. For everyone else, it puts real downward pressure on the closed-source API pricing — Anthropic and OpenAI now have to justify the price premium with capabilities (agentic tool-use, harness polish) rather than raw model quality.
The benchmark lead will move. Llama 4 is close. Qwen 3.5 is close. By June, V4 Pro may not be the top open-weight scorer any more. The deeper trend — open-weight coding models matching and exceeding closed-source on the benchmarks that matter — is the part to plan around. Build your stack so that the model is a swap, not a rewrite.
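One way to keep the model a swap is to have application code depend on a narrow interface and choose the backend from configuration. The sketch below is illustrative only: `CodeModel`, the two stub classes and the config keys are hypothetical names, and the stubs return placeholder strings where a real client would call an inference server or a vendor API.

```python
from typing import Protocol


class CodeModel(Protocol):
    """Minimal interface the rest of the stack depends on."""

    def complete(self, prompt: str) -> str: ...


class SelfHostedStub:
    """Stand-in for a client that calls your own inference cluster."""

    def __init__(self, model_name: str) -> None:
        self.model_name = model_name

    def complete(self, prompt: str) -> str:
        # A real client would send the prompt to your serving endpoint.
        return f"[{self.model_name}] {prompt}"


class HostedApiStub:
    """Stand-in for a client that calls a vendor-hosted API."""

    def __init__(self, model_name: str) -> None:
        self.model_name = model_name

    def complete(self, prompt: str) -> str:
        # A real client would call the vendor SDK here.
        return f"[api:{self.model_name}] {prompt}"


def build_model(config: dict) -> CodeModel:
    """Pick a backend from config so changing models is a config change."""
    backend = config["backend"]
    if backend == "self_hosted":
        return SelfHostedStub(config["model"])
    if backend == "hosted_api":
        return HostedApiStub(config["model"])
    raise ValueError(f"unknown backend: {backend}")


def propose_patch(model: CodeModel, issue: str) -> str:
    """Application code sees only the CodeModel interface."""
    return model.complete(f"Fix: {issue}")
```

When next month's leader arrives, the change is a new class and a config entry rather than edits scattered through the codebase.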
Source data: codersera's open-source LLM roundup 2026.