What 72.2% actually buys you
SWE-bench Verified is a 500-instance subset of the original SWE-bench, hand-filtered in collaboration with OpenAI to remove broken or under-specified tasks. Each instance is a real GitHub issue paired with the exact unit test that the human fix had to pass. A model's score is the share of issues where its proposed patch makes the hidden test pass without breaking the ones already there. It is the closest the field has to a credible apples-to-apples coding-agent leaderboard.
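To make the scoring rule concrete, here is a minimal sketch of how a resolved rate is computed. The class and function names are illustrative, not the benchmark's own harness; SWE-bench itself labels the two test groups FAIL_TO_PASS and PASS_TO_PASS.

```python
from dataclasses import dataclass


@dataclass
class InstanceResult:
    """Outcome of applying one model patch to one benchmark instance."""
    instance_id: str
    target_tests_pass: bool    # the hidden tests tied to the issue now pass
    existing_tests_pass: bool  # tests that passed before the patch still pass


def is_resolved(result: InstanceResult) -> bool:
    # An instance counts only if the fix works AND nothing regressed.
    return result.target_tests_pass and result.existing_tests_pass


def resolved_rate(results: list[InstanceResult]) -> float:
    """Share of instances resolved, as a percentage."""
    if not results:
        raise ValueError("no results to score")
    resolved = sum(is_resolved(r) for r in results)
    return 100.0 * resolved / len(results)
```

On the 500-instance set, a 72.2% score corresponds to 361 resolved instances. A patch that fixes the issue but breaks an existing test scores the same as no patch at all.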
So what does Mistral's 72.2% on Verified actually mean for a builder picking a model on a Tuesday morning? It means roughly seven of every ten well-scoped issues land a clean patch on the first try when the agent has access to the repo and a test runner. The other three propose something that fails the hidden test, edit the wrong file, or only partially solve the problem and need human cleanup. Compared to last year's open-weight coders, many of which sat in the high 30s and low 40s, that is a meaningful jump.
Important caveat: SWE-bench Verified is Python-heavy, drawn from a known pool of OSS repositories, and skewed towards bug-fix tickets rather than greenfield work. A 72.2% on Verified does not translate directly into a 72.2% pass rate on your TypeScript microservice or your in-house Java monolith. It is a relative ranking signal, not an absolute one.
Build a 30-50 task internal eval before you commit to any model. Pull real closed PRs from the last quarter, strip the human fix, run the candidate model in a sandbox, and check whether the tests that guarded the original fix pass alongside the rest of the existing suite. Any model, open or closed, that beats 60% on your internal eval is shippable.
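A rough sketch of that internal eval loop follows. The `EvalTask` shape and the two callables (`generate_patch`, `tests_pass`) are hypothetical hooks you would wire to your own model client and sandbox; only the 60% bar comes from the advice above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalTask:
    """One closed PR with the human fix stripped out (hypothetical shape)."""
    task_id: str
    repo_snapshot: str  # path or ref to the pre-fix checkout
    issue_text: str


def run_internal_eval(
    tasks: list[EvalTask],
    generate_patch: Callable[[EvalTask], str],    # asks the candidate model for a patch
    tests_pass: Callable[[EvalTask, str], bool],  # applies the patch in a sandbox, runs tests
    bar: float = 0.60,
) -> dict:
    if not tasks:
        raise ValueError("internal eval needs at least one task")
    passed, failed = [], []
    for task in tasks:
        try:
            ok = tests_pass(task, generate_patch(task))
        except Exception:
            ok = False  # a crash or timeout counts as a failed task
        (passed if ok else failed).append(task.task_id)
    rate = len(passed) / len(tasks)
    return {"pass_rate": rate, "meets_bar": rate > bar, "failed": failed}
```

Keeping the failed task IDs matters as much as the headline rate: they show which kinds of bug in your codebase the model struggles with.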
How Mistral got here: Magistral, Pixtral, Devstral, then Mistral Small 4
Mistral's path to 72.2% is worth understanding because it tells you something about where the lab is heading. Through 2025 the company shipped three specialist open-weight models: Magistral for reasoning, Pixtral for multimodal inputs, and Devstral for coding. Each was a focused model trained for one job. The strategy was clear: rather than chase one frontier model that does everything, ship narrow models that punch above their weight in a single domain.
In March 2026 Mistral pulled the three together. Mistral Small 4 is a 128-expert mixture-of-experts that unifies reasoning, multimodal and coding under a single set of weights, with a configurable reasoning-effort knob that lets you trade latency for thinking depth. That release signalled that the specialist era was ending — or at least graduating into a single multi-skilled model.
Devstral 2 is the holdout. It is a 123-billion-parameter fully dense model — explicitly not a mixture-of-experts — positioned as the lab's coding specialist for teams that want a single-purpose model for agentic software engineering. The dense architecture matters: every parameter activates on every token, which is more expensive at inference but tends to produce more consistent behaviour on long agentic loops than sparse MoE designs of similar nominal size.
The leaderboard table
Here is where Devstral 2 actually sits on the public SWE-bench Verified leaderboard, ordered by score. The "Open or closed" column shows whether weights are publicly downloadable or whether you must access the model through a proprietary API.
| Rank | Model | SWE-bench Verified | Open or closed |
|---|---|---|---|
| 1 | Claude Opus 4.7 | 87.6% | Closed |
| 2 | GPT-5.3 Codex | 85.0% | Closed |
| 3 | Claude Opus 4.5 | 80.9% | Closed |
| 4 | Claude Opus 4.6 | 80.8% | Closed |
| 5 | Gemini 3.1 Pro | 80.6% | Closed |
| 6 | MiniMax M2.5 | 80.2% | Open-weight |
| 7 | Qwen 3.6 Plus | 78.8% | Closed |
| 8 | MiMo-V2-Pro (Xiaomi, 1T) | 78.0% | Open-weight |
| 9 | GLM-5 (Zhipu, 744B) | 77.8% | Open-weight |
| 10 | Mistral Devstral 2 | 72.2% | Open-weight |
Two things jump out. First, the open-weight cluster is genuinely competitive now — MiniMax M2.5 at 80.2% is within a single benchmark point of Gemini 3.1 Pro. Second, Devstral 2 at 72.2% is the smallest model in that open-weight cluster by some distance: 123B dense parameters versus the 744B-1T scale of GLM-5 and MiMo-V2-Pro. For inference economics, that gap is the story.
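The gaps are easy to check from the table. A small sketch, with the scores copied from the rows above:

```python
# SWE-bench Verified scores (%) copied from the leaderboard table above.
SCORES = {
    "Claude Opus 4.7": 87.6,
    "Gemini 3.1 Pro": 80.6,
    "MiniMax M2.5": 80.2,
    "MiMo-V2-Pro": 78.0,
    "GLM-5": 77.8,
    "Mistral Devstral 2": 72.2,
}


def gap(leader: str, challenger: str) -> float:
    """Percentage-point gap between two models, rounded to one decimal."""
    return round(SCORES[leader] - SCORES[challenger], 1)


print(gap("Gemini 3.1 Pro", "MiniMax M2.5"))       # 0.4
print(gap("MiniMax M2.5", "Mistral Devstral 2"))   # 8.0
print(gap("Claude Opus 4.7", "Mistral Devstral 2"))  # 15.4
```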
If you are sizing a self-hosted coding agent for moderate volumes (say, a fifty-engineer organisation running internal tooling), Devstral 2's parameter count is the sweet spot. The 8-point gap to MiniMax M2.5 is real, but the hardware bill to serve a 123B dense model is roughly a third of what you pay to host a 744B sparse model such as GLM-5 at comparable throughput.
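One input into that hardware bill is the memory needed just to hold the weights. The sketch below is back-of-envelope arithmetic only. It assumes 16-bit weights (2 bytes per parameter), which is an assumption on our part rather than a published serving configuration, and it ignores KV cache, activations and batching overhead.

```python
def weight_memory_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Memory to hold the weights alone, in GB (1 GB = 1e9 bytes).

    bytes_per_param=2.0 assumes 16-bit weights; use 1.0 for 8-bit quantisation.
    """
    if params_billions <= 0 or bytes_per_param <= 0:
        raise ValueError("sizes must be positive")
    return params_billions * bytes_per_param


dense_123b = weight_memory_gb(123)   # 246.0 GB
sparse_744b = weight_memory_gb(744)  # 1488.0 GB
print(dense_123b, sparse_744b)
```

Weight memory alone does not set the bill. A sparse model only activates part of its parameters per token, so compute per request does not grow in line with total size, while a dense model pays for every parameter on every token. Treat the figures above as a floor for capacity planning, not a cost quote.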
Do not pick a coding model purely on the SWE-bench Verified rank order. The top-line number tells you very little about how the model behaves on your stack, your test runner, your commit conventions, or the specific shape of bugs your codebase produces. Always run a local eval set first.
Self-hosting the gap: what 72.2% lets you build today
If you are deciding between Devstral 2 self-hosted and Opus 4.7 via API, the framing should not be "is the open one good enough?" It should be "for which of my workloads is 72.2% the rational choice?" There are three places where the answer is clearly yes.
Internal developer tooling. Auto-generated boilerplate, dependency upgrades, lint-fix bots, codemod runners. These are repetitive, well-scoped, and tolerant of a 5-10% retry rate. Self-hosting Devstral 2 means you can run as much inference as your hardware can serve for a fixed monthly bill, which undercuts per-token API pricing once your fleet is busy.
Customer-data-adjacent agents. Anywhere your coding agent needs to read or operate on customer data — an analytics workbench, a customer-support code generator, a regulated-sector deployment tool — self-hosting removes the cross-border data flow entirely. The compliance maths shifts in your favour.
High-volume CI assistance. PR review bots, test-generation pipelines, security-scan triage. The unit economics of running these on an external API at scale are punishing; the same workloads fit comfortably on a single inference node when self-hosted.
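The economics behind all three cases come down to a break-even volume: a fixed monthly hardware bill against a per-token API price. A minimal sketch follows; every figure in the usage example is a placeholder for illustration, not a real price from any vendor.

```python
def monthly_api_cost(tokens_millions: float, price_per_million: float) -> float:
    """API spend for a month, given volume in millions of tokens."""
    return tokens_millions * price_per_million


def break_even_tokens_millions(fixed_monthly_cost: float, price_per_million: float) -> float:
    """Monthly volume (millions of tokens) at which self-hosting matches the API bill."""
    if price_per_million <= 0:
        raise ValueError("price_per_million must be positive")
    return fixed_monthly_cost / price_per_million


def cheaper_option(tokens_millions: float, fixed_monthly_cost: float,
                   price_per_million: float) -> str:
    api = monthly_api_cost(tokens_millions, price_per_million)
    return "self-hosted" if fixed_monthly_cost < api else "api"


# Placeholder numbers only: 20,000 per month for hardware, 10 per million tokens via API.
print(break_even_tokens_millions(20_000, 10))  # 2000.0
print(cheaper_option(500, 20_000, 10))         # api
print(cheaper_option(5_000, 20_000, 10))       # self-hosted
```

The sketch leaves out engineering time to run the cluster and any quality difference between models, both of which belong in a real comparison.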
From conversations with engineering teams trialling self-hosted Devstral 2 for PR-review bots since March: typical first-pass quality dips of around 8% versus a closed API are measurable but not catastrophic, while monthly inference bills routinely drop by 70% or more. For workflows where a senior engineer reviews the bot's review anyway, that trade is easy to make.
- AI Tech Connect editorial
Why this matters for EU GPAI and India DPDP compliance
Self-hosting an open-weight coding model is not just an economics decision. It is increasingly a regulatory one, and the calculus differs between UK and Indian builders even though the underlying instinct — keep data local — is the same.
UK and EU builders under GPAI
The EU AI Act's general-purpose AI (GPAI) obligations have applied since August 2025, with the Commission's enforcement powers following from August 2026, and any organisation deploying a GPAI system in the EU market inherits transparency, documentation and risk-management obligations. UK builders selling into the EU, which is most of them, have to comply regardless of where they are physically headquartered. Using a closed-source API means you depend on the upstream provider's GPAI documentation and updates; if Anthropic or OpenAI changes their model card, your compliance posture moves with it. Self-hosting an open-weight model means you control the version, the safety evaluations and the technical documentation. That is administratively heavier but legally cleaner.
It also reduces cross-border data flows. When your coding agent processes EU citizen data, sending that data to a US-hosted API triggers a chain of standard contractual clauses, transfer impact assessments and supplementary measures. A self-hosted model running in an EU or UK data centre (Frankfurt, Paris, Dublin, London) sidesteps most of that workstream.
Indian builders under DPDP
India's Digital Personal Data Protection Act applies a different lens to the same problem. DPDP allows international data transfers but the central government retains the power to restrict transfers to specified countries. The safest posture for any business processing personal data of Indian residents is to keep the processing in-country wherever feasible. A self-hosted Devstral 2 running on AWS Mumbai, Yotta, CtrlS or any other Indian cloud sidesteps the cross-border question entirely. It also gives you a defensible answer to the data-fiduciary obligation around purpose limitation: you can show, with audit logs, that the data your agent processed was not used to train a foreign provider's next-generation system.
For Indian builders selling to UK or EU customers — and there are many — running the same self-hosted Devstral 2 in an EU region as well gives you a clean dual-region posture: one weight set, two deployments, no cross-border processing in either direction.
Self-hosting does not automatically make you compliant. Under both EU GPAI and DPDP you still owe risk assessments, security controls, incident response and — crucially — the ability to show a regulator your model evaluation trail. Do not treat "we self-host" as a substitute for the documentation work; treat it as a way to make that work tractable.
When the closed-source 87.6% is still worth the API bill
It would be dishonest to leave the leaderboard hanging without addressing the elephant: Claude Opus 4.7 sits at 87.6% on SWE-bench Verified. That is a 15.4-point lead over Devstral 2. Where does that gap actually bite?
It bites hardest on three workloads. Architectural changes — refactors that span many files, threading a new parameter through deep call stacks, extracting shared abstractions — reward a model that holds context cleanly across a long agentic loop. The closed leaders, particularly Opus 4.7 with its 1M-token window, do this noticeably better. Ambiguous bugs — the kind where the failing test is correct but the right fix is to change a different file entirely — reward stronger reasoning. And greenfield work from a sparse spec is a context-and-priors game where the closed models still lead.
For a small team where one engineer's time recovered from a stuck PR is worth more than a month of API bills, paying for the closed leader is rational. For a large engineering organisation running thousands of agentic loops a day on routine tasks, the maths flips. Most teams should run both — Devstral 2 for the high-volume, well-scoped lane, and a closed model on call for the ambiguous lane. Our deeper take on routing between coding agents lives in Cursor Composer 2 vs Claude Code, and the broader open-weight landscape is in Open-weight finally caught up.
For verified Builders shipping production AI, UK and Indian alike, the rational stance is plural. Pick the model per workload, not per organisation. List the workload classes your agents handle, set a quality bar for each, and route accordingly.
Sources: SWE-bench Verified leaderboard at llm-stats.com, official benchmark documentation at swebench.com/verified.html, and Mistral release coverage on Medium.