What the May 2026 board actually says
Two months ago the SWE-bench Verified leaderboard was a Claude-vs-OpenAI seesaw with both labs trading places around 80%. As of mid-May 2026, the picture has changed sharply. Anthropic's Claude Mythos Preview sits at the top with a verified pass rate of 93.9% — the first time any model has crossed 90% on the benchmark. OpenAI's GPT-5.5, released on 23 April 2026, is reported at 88.7% per marc0.dev's May snapshot and OpenAI's own materials. Anthropic's previous flagship, the Adaptive variant of Claude Opus 4.7, follows at 87.6%.
The middle of the table is where things get interesting for builders. Google's Gemini 3.1 Pro and DeepSeek's V4 Pro Max are tied at 80.6%: one closed, one open-weight, and separated by orders of magnitude in licence cost. MiniMax M2.5 is the next open-weight model at 80.2%, while Mistral Medium 3.5, a closed 128B dense model, follows at 77.6%. Mistral Devstral 2 rounds out the public open-weight roster at 72.2%, well below the leaders but small enough to run on a single H100 at the edge.
None of this is the headline you might be expecting. The real news is that, for the first time, an open-weight model, DeepSeek V4 Pro Max, has matched the closed flagship of one of the three big frontier labs (Google's Gemini 3.1 Pro) on the same benchmark. Indian self-hosters and UK consultancies running on sovereign infrastructure now have a defensible technical alternative to the API-only giants. Whether they should actually pick it depends on a number none of these leaderboards print: cost per passed task.
The full ranked table
Here is the May 2026 SWE-bench Verified snapshot, sorted by pass rate. Closed-source models are paid API endpoints; open-weight models can be downloaded and self-hosted. Scores are quoted from the lab-reported figures aggregated on swebench.com and marc0.dev's leaderboard tracker.
| Rank | Model | Lab | Licence | SWE-bench Verified |
|---|---|---|---|---|
| 1 | Claude Mythos Preview | Anthropic | Closed API | 93.9% |
| 2 | GPT-5.5 (OpenAI-reported) | OpenAI | Closed API | 88.7% |
| 3 | Claude Opus 4.7 Adaptive | Anthropic | Closed API | 87.6% |
| 4 | GPT-5.3 Codex | OpenAI | Closed API | 85.0% |
| 5 | Gemini 3.1 Pro | Google DeepMind | Closed API | 80.6% |
| 5 | DeepSeek V4 Pro Max | DeepSeek | Open weights | 80.6% |
| 7 | MiniMax M2.5 | MiniMax | Open weights | 80.2% |
| 8 | Mistral Medium 3.5 | Mistral | Closed API | 77.6% |
| 9 | Mistral Devstral 2 | Mistral | Open weights | 72.2% |
A note on GPT-5.5: the 88.7% figure is the publisher-reported number from OpenAI's launch materials, and marc0.dev's tracker repeats it rather than independently confirming it. Independent reproductions on the same harness are still landing. Treat it as an upper bound until a third-party run is on the board.
These are SWE-bench Verified scores — 500 curated Python-only tasks that have been in circulation long enough to leak into model pretraining sets. The harder, multi-language SWE-bench Pro benchmark consistently produces scores 15 to 25 points lower for the same model. If you are signing a procurement decision off the back of a single number, use Pro. Read our companion piece — why Verified says 88% and Pro says 64% — before you do.
Three categories that matter more than one number
Treating SWE-bench Verified as a single ranking is the most common procurement mistake in Indian and UK build teams in 2026. The leaderboard contains at least three sub-races, and the right model for your team depends on which one you are running.
1. The closed-source frontier race
Anthropic, OpenAI and Google are competing for the absolute ceiling. Claude Mythos Preview at 93.9% is currently 5.2 points clear of GPT-5.5's reported 88.7% and 13.3 points clear of Gemini 3.1 Pro at 80.6%. For workloads where every additional point of pass rate translates directly into engineering hours saved — large-scale autonomous refactors, mission-critical regression triage, the brutal long tail of intermittent bugs — paying the API premium for the absolute best model is straightforward economics.
Anthropic also benefits from a coherent agent harness story (Claude Code, Managed Agents, the security beta) and a long-context substrate that pairs well with whole-repo prompts. OpenAI counters with mature tool-calling, deeper third-party integrations and the largest evaluation surface in the industry. Google sits behind on raw SWE-bench but ahead on multimodal — which matters when your bug fix is "the dashboard renders wrong on the third reload".
2. The open-weight race
For Indian builders worried about DPDP outbound-data risk and UK consultancies that have to keep client code inside a SOC 2 Type II perimeter, the open-weight question is not academic. DeepSeek V4 Pro Max at 80.6% is now the strongest open-weight model on Verified and is tied with Gemini 3.1 Pro on the same benchmark. MiniMax M2.5 at 80.2% is right behind, and with four Chinese open-weight coding models shipping within 12 days of each other, the gap is closing on a monthly cadence.
For teams running on a constrained GPU footprint, the smaller open-weight options matter more than the headline rank. Qwen3.6-27B punches well above its parameter weight in our internal harness tests, and Devstral 2 at 72.2% is small enough to host on a single H100 — useful for an offline-development workstation in a regulated environment.
3. The cost-per-passed-task race
No leaderboard prints the number that actually matters: how much it costs to clear one bug from a real backlog. Top-of-board models often achieve their headline score by spending freely on chain-of-thought tokens, multi-attempt sampling and tool calls. A 93.9% model that burns through 200,000 output tokens per task can easily cost ten times more per passed task than an 80% model running a tight harness.
Rough back-of-envelope numbers for a 1,000-task triage workload, using public list prices and typical agent-harness token budgets:
| Model | Verified pass rate | Indicative cost per attempted task | Indicative cost per passed task |
|---|---|---|---|
| Claude Mythos Preview | 93.9% | ~$2.20 | ~$2.34 |
| GPT-5.5 | 88.7% | ~$1.80 | ~$2.03 |
| Claude Opus 4.7 Adaptive | 87.6% | ~$1.30 | ~$1.48 |
| Gemini 3.1 Pro | 80.6% | ~$0.55 | ~$0.68 |
| DeepSeek V4 Pro Max (self-hosted) | 80.6% | ~$0.08 | ~$0.10 |
| MiniMax M2.5 (self-hosted) | 80.2% | ~$0.08 | ~$0.10 |
Those are indicative numbers, not contractually tested figures — your harness, your prompt-cache hit rate and your willingness to retry will move them by 30–50% in either direction. The shape of the curve, however, is robust: an open-weight model at 80% can be 20 to 25 times cheaper per passed task than a closed frontier model at 94%. Whether you should care depends on which end of the bug distribution your team actually fights.
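The last column of the table is simply the per-attempt cost divided by the pass rate. A minimal Python sketch of that arithmetic, using the indicative figures above and assuming one attempt per task (the function and variable names are ours, for illustration only):

```python
# Indicative figures copied from the table above:
# (Verified pass rate, indicative cost per attempted task in USD).
MODELS = {
    "Claude Mythos Preview": (0.939, 2.20),
    "GPT-5.5": (0.887, 1.80),
    "Claude Opus 4.7 Adaptive": (0.876, 1.30),
    "Gemini 3.1 Pro": (0.806, 0.55),
    "DeepSeek V4 Pro Max (self-hosted)": (0.806, 0.08),
    "MiniMax M2.5 (self-hosted)": (0.802, 0.08),
}


def cost_per_passed_task(pass_rate: float, cost_per_attempt: float) -> float:
    """Expected spend to obtain one passing task, assuming one attempt per task."""
    if not 0.0 < pass_rate <= 1.0:
        raise ValueError("pass_rate must be in (0, 1]")
    return cost_per_attempt / pass_rate


def cost_table() -> dict[str, float]:
    """Cost per passed task for every model, rounded to cents."""
    return {
        name: round(cost_per_passed_task(rate, cost), 2)
        for name, (rate, cost) in MODELS.items()
    }


if __name__ == "__main__":
    table = cost_table()
    for name, cost in table.items():
        print(f"{name:36s} ${cost:.2f} per passed task")
    ratio = table["Claude Mythos Preview"] / table["DeepSeek V4 Pro Max (self-hosted)"]
    print(f"Mythos vs self-hosted DeepSeek: {ratio:.1f}x")
```

With these inputs the ratio between Claude Mythos Preview and self-hosted DeepSeek V4 Pro Max comes out at roughly 23x, which is where the 20-to-25-times range comes from. Swap in your own per-attempt costs before relying on it.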
Run a hybrid harness. Route bug-fix attempts to a cheap open-weight model first; escalate to Claude Mythos Preview or GPT-5.5 only if the cheaper attempt fails or comes back with low confidence. In our internal triage workloads this lifts the effective pass rate to within two points of frontier while paying frontier rates on only 15–20% of tasks.
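The routing idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of any particular harness: the `Attempt` record, the `confidence` signal, the 0.7 threshold and the solver callables are all hypothetical stand-ins for whatever your own tooling exposes.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Attempt:
    passed: bool       # did the candidate patch pass the task's tests?
    confidence: float  # harness-defined score in [0, 1] (hypothetical signal)
    cost_usd: float    # spend on this attempt


# A solver takes a task description and returns an Attempt.
# Real solvers would wrap a model endpoint plus a test runner; both are out of scope here.
Solver = Callable[[str], Attempt]


def route(task: str, cheap: Solver, frontier: Solver,
          min_confidence: float = 0.7) -> tuple[Attempt, bool]:
    """Try the cheap solver first; escalate on failure or low confidence.

    Returns the chosen attempt (with cost summed over every call made)
    and a flag saying whether the frontier solver was used.
    """
    first = cheap(task)
    if first.passed and first.confidence >= min_confidence:
        return first, False
    second = frontier(task)
    # Keep the cheap result only if it passed and the frontier retry did not.
    best = first if (first.passed and not second.passed) else second
    total = first.cost_usd + second.cost_usd
    return Attempt(best.passed, best.confidence, total), True


def blended_cost_per_task(cheap_cost: float, frontier_cost: float,
                          escalation_rate: float) -> float:
    """Expected spend per attempted task when every task hits the cheap
    model and a fraction `escalation_rate` also hits the frontier model."""
    return cheap_cost + escalation_rate * frontier_cost


if __name__ == "__main__":
    # Stub solvers standing in for real model calls.
    cheap_fail = lambda task: Attempt(False, 0.2, 0.08)
    frontier_ok = lambda task: Attempt(True, 0.9, 2.20)
    result, escalated = route("fix flaky test", cheap_fail, frontier_ok)
    print(result, escalated)
    print(blended_cost_per_task(0.08, 2.20, 0.20))
```

Plugging in the indicative per-attempt costs from the table ($0.08 and $2.20) with a 15–20% escalation rate gives roughly $0.41 to $0.52 per attempted task, before any retries.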
The contamination caveat — read this before you sign anything
SWE-bench Verified is 500 hand-curated GitHub issues from twelve popular Python repositories. The repositories are public, the issues are public, the patches are public, and the benchmark has been around since 2024. By May 2026, every major lab's pretraining corpus has almost certainly seen the relevant code. That does not mean models are "cheating" in a deliberate sense — but it does mean the benchmark systematically over-states real-world performance.
SWE-bench Pro, launched by Scale AI in late 2025, addresses this with 1,865 multi-language tasks (Python, JavaScript, TypeScript, Go, Rust) curated under a stricter contamination protocol. The same model that scores 88% on Verified will typically score 60–65% on Pro. The gap is not noise — it is a measurement of how much your headline number is inflated by training-data leakage. Our methodology deep-dive walks through the protocol differences in detail.
The practical implication for a build team in Bengaluru, Pune, London or Edinburgh: if you are picking a model for a Python-only codebase that resembles Django, Flask or scikit-learn, the Verified number is reasonably indicative. If your codebase is a Go microservice estate, a TypeScript monorepo or a Rust systems project, the Pro number is the one you should be quoting in the procurement memo.
Concrete procurement scenarios
The right model depends on the workload. Two anonymised but representative builder situations we have seen in the AI Tech Connect community over the past month:
Mumbai dev shop, twelve engineers, tight unit economics
A Mumbai services firm running a fixed-price contract for a UK fintech needs to clear a backlog of 4,000 low-to-medium-complexity Python bugs across three repositories. Margin is razor thin and the data must stay inside India for client compliance. The right call: self-hosted DeepSeek V4 Pro Max at 80.6% on rented H100s in a Mumbai data centre. At roughly $0.10 per passed task, a full pass over the backlog costs under $500 of compute, though at an 80.6% pass rate roughly one bug in five will still need a retry or a human. The 13-point pass-rate gap versus Claude Mythos Preview is real but the bugs in this category are not at the long-tail end where it matters.
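A quick sanity check on the Mumbai numbers, as a Python sketch. It assumes one attempt per bug, the indicative $0.08 per attempt from the cost table, and (a strong assumption) that the Verified pass rate carries over to this backlog; the function name is ours.

```python
def backlog_estimate(tasks: int, pass_rate: float,
                     cost_per_attempt: float) -> tuple[int, int, float]:
    """Single pass over a backlog, one attempt per task.

    Returns (expected tasks cleared, expected tasks left over, total spend).
    """
    cleared = round(tasks * pass_rate)
    return cleared, tasks - cleared, tasks * cost_per_attempt


if __name__ == "__main__":
    cleared, left, spend = backlog_estimate(4000, 0.806, 0.08)
    print(f"cleared={cleared} left={left} spend=${spend:.0f}")
```

Under those assumptions a single pass costs about $320 and clears around 3,200 of the 4,000 bugs, leaving several hundred for a retry, an escalation or a human.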
UK consultancy, twenty-person team, premium pricing
A London consultancy bills £1,800 per engineer per day on a financial-services modernisation programme. The bugs in scope are the genuinely hard ones: race conditions, intermittent test failures, and integration mismatches between legacy Java and a new Python service layer. The right call here is Claude Mythos Preview on the API, run through Claude Code's autopilot harness, billed straight through to the client at cost-plus. The premium per task is irrelevant against the £1,800 engineer day it replaces, and the extra 13 points of pass rate translate directly into bugs the cheaper models simply cannot crack.
The honest version of the leaderboard is therefore not a single ranking but a decision tree. What's your bug distribution? What's your data-residency constraint? What's the cost per failed task in your business — a £20 engineering hour, or a £200,000 production outage? Plug those into the table above and the answer falls out.
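One way to make that decision tree concrete is a small function over the three questions. The sketch below is purely illustrative: the `Workload` fields, the cutoff values and the returned labels are hypothetical choices of ours, not thresholds taken from any benchmark or vendor.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    must_stay_on_prem: bool         # data-residency constraint
    hard_bug_share: float           # fraction of the backlog in the long tail, 0..1
    cost_of_failed_task_gbp: float  # business cost when a task is not cleared


def recommend(w: Workload,
              hard_share_cutoff: float = 0.3,
              failure_cost_cutoff_gbp: float = 1000.0) -> str:
    """Walk the three questions in order; both cutoffs are illustrative defaults."""
    if w.must_stay_on_prem:
        return "self-hosted open-weight"
    if (w.hard_bug_share >= hard_share_cutoff
            or w.cost_of_failed_task_gbp >= failure_cost_cutoff_gbp):
        return "closed frontier API"
    return "hybrid: open-weight first, escalate to frontier"


if __name__ == "__main__":
    # Hypothetical encodings of the two scenarios above.
    mumbai = Workload(must_stay_on_prem=True, hard_bug_share=0.1,
                      cost_of_failed_task_gbp=20.0)
    london = Workload(must_stay_on_prem=False, hard_bug_share=0.6,
                      cost_of_failed_task_gbp=200_000.0)
    print(recommend(mumbai))
    print(recommend(london))
```

The ordering is a design choice: a hard data-residency constraint is checked first because it rules out API-only models regardless of pass rate.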
What to watch over the next sixty days
Three things will move this leaderboard between now and the end of July 2026:
- GPT-5.5 third-party verification. OpenAI's 88.7% figure will get reproduced independently within weeks. If it lands in the 86–87% range, GPT-5.5 slips behind Claude Opus 4.7 Adaptive at 87.6% and the order shuffles.
- The next open-weight drop. The cadence of Chinese open-weight coding models in April and May suggests another credible 80%+ open-weight release is likely before the end of June. The first one to cross 85% on Verified — or 70% on Pro — changes the procurement maths for every self-hosting team.
- SWE-bench Pro adoption. If labs start publishing Pro numbers alongside Verified by default — Anthropic has signalled they will for Mythos — the headline-grabbing 90%+ scores become harder to sustain. That is healthy; the rest of the industry will follow.
For now: Mythos Preview leads, GPT-5.5 is the credible challenger, and the open-weight pack has caught up to Google. The "one true model" framing is over. The teams getting the most value from this generation of coding agents are the ones running two or three in a routed harness, treating the leaderboard as a price list, not a championship.
Primary sources used for this article: swebench.com, marc0.dev May 2026 leaderboard, codesota.com agentic browser, Scale's SWE-bench Pro public leaderboard, and llm-stats.com benchmarks.