What the gap actually tells you
If you have spent any time inside a procurement conversation in the last twelve months, you have seen a slide with a SWE-bench score on it. The number is almost always SWE-bench Verified, almost always close to or above 80%, and almost always presented as if it predicts how the model will behave on your codebase. It does not. The April 2026 leaderboard for SWE-bench Pro — the harder, larger, less gameable variant — has Claude Opus 4.7 at the top with 64.3%. The same model scores 87.6% on Verified. That is a 23.3 percentage-point gap on the same model, the same week, on benchmarks built from the same underlying GitHub corpus.
The gap is not a bug. It is the most informative single data point a builder can carry into a vendor evaluation, because it tells you how much of the headline score is real general code reasoning and how much is scaffold engineering, prompt patterns and overfitting to a 500-instance evaluation set the entire industry has been optimising against for two years.
When a vendor quotes a SWE-bench number, ask three follow-ups. Which variant — Verified, Pro, Multilingual or Lite? What scaffold and tool-use harness produced the score? Is the number from the public leaderboard or an internal evaluation? If they cannot answer all three crisply, treat the number as marketing.
The two benchmarks, side by side
Both benchmarks pull from real GitHub issues paired with the human-written test patches that resolve them. The model has to produce a code change that turns the failing tests green. The differences are in scale, filtering and how easily a clever scaffold can game the format.
| Dimension | SWE-bench Verified | SWE-bench Pro |
|---|---|---|
| Dataset size | 500 instances | Larger, harder set hosted by Scale AI |
| Filtering | Hand-filtered by humans in collaboration with OpenAI to ensure clear problem descriptions, correct test patches and solvable tasks | Less aggressive filtering — closer to the raw distribution of real bug reports |
| Gameability | High — two years of public optimisation has shaped scaffolds and prompts to its patterns | Lower — newer, larger, harder to overfit at scale |
| Top score (April 2026) | 87.6% — Claude Opus 4.7 | 64.3% — Claude Opus 4.7 |
| Best used as | Coarse 'is this model in the right league' signal | More honest indicator of frontier code-reasoning capability |
Why the gap exists — three reasons
The 23-point delta is not random noise. It decomposes into three structural causes that every builder should be able to recite.
- Test-suite filtering. Verified was hand-filtered to remove tickets where the test patch was wrong, the description was unclear, or the task was unsolvable in the snapshot. That filtering removes a category of failure that real-world tickets routinely contain — ambiguous repro steps, flaky tests, environment-dependent assertions. Pro keeps more of those, so a model that depends on tidy problem statements degrades.
- Scaffold overfitting. A chunk of benchmark progress in 2024-2025 was scaffolding and prompt engineering specific to Verified's patterns, not general improvement in code reasoning. That is a direct quote of the framing in the morphllm coverage of SWE-bench Pro, and it is the single most uncomfortable sentence in modern code-LLM marketing. Pro breaks scaffolds that were tuned to the Verified instance distribution.
- Distribution drift. Verified is a fixed slice of Python repositories from a defined snapshot of GitHub. Pro stretches that distribution. Models that learnt the shape of Django bug fixes do less well on the longer tail.
If a model's Verified-to-Pro gap is materially wider than the field average, that is a signal of scaffold overfitting — not a signal of weak underlying reasoning. Conversely, a model with a smaller gap may have less polished tooling rather than better fundamentals. Read the gap, not just the headline.
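To make "wider than the field average" concrete, here is a minimal Python sketch of the comparison. Only the Claude Opus 4.7 pair comes from this article; the other two entries are hypothetical placeholders, included purely so that there is a field to average over, and the function names are illustrative.

```python
from statistics import mean

def verified_to_pro_gaps(scores):
    """Map each model to its Verified-minus-Pro gap in percentage points."""
    return {name: round(verified - pro, 1) for name, (verified, pro) in scores.items()}

def gap_vs_field(scores):
    """Map each model to its gap minus the mean gap of all models supplied."""
    gaps = verified_to_pro_gaps(scores)
    field_average = mean(gaps.values())
    return {name: round(gap - field_average, 1) for name, gap in gaps.items()}

scores = {
    "Claude Opus 4.7": (87.6, 64.3),      # the pair quoted in this article
    "placeholder-model-b": (80.0, 60.0),  # HYPOTHETICAL, illustration only
    "placeholder-model-c": (78.0, 50.0),  # HYPOTHETICAL, illustration only
}

for name, delta in gap_vs_field(scores).items():
    label = "wider than field" if delta > 0 else "at or below field"
    print(f"{name}: {delta:+.1f} points vs field average ({label})")
```

The comparison is only as good as the field you feed it: both scores for every model should come from the same kind of source and, ideally, the same scaffold.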
Verified leaderboard: the April 2026 snapshot
The full Verified leaderboard is the easiest place to see how compressed the top ten has become. Ten frontier models sit inside a ten-point band, which means the marginal cost of a one-point gain has become enormous — and almost certainly does not translate one-for-one into production behaviour.
| Rank | Model | SWE-bench Verified |
|---|---|---|
| 1 | Claude Opus 4.7 (released 16 April 2026) | 87.6% |
| 2 | GPT-5.3 Codex | 85.0% |
| 3 | Claude Opus 4.5 | 80.9% |
| 4 | Claude Opus 4.6 | 80.8% |
| 5 | Gemini 3.1 Pro | 80.6% |
| 6 | MiniMax M2.5 (best open-weight on Verified) | 80.2% |
| 7 | GPT-5.2 | 80.0% |
| 8 | Qwen 3.6 Plus | 78.8% |
| 9 | MiMo-V2-Pro (Xiaomi, 1T parameters) | 78.0% |
| 10 | GLM-5 (Zhipu, 744B parameters) | 77.8% |
For comparison, Mistral Devstral 2 — a developer-focused open-weight release — scores 72.2% on the same Verified benchmark, which is a reminder that the headline gap between top-tier proprietary and the strongest open-weight tier is much smaller than vendor decks usually claim. We covered the broader open-weight picture in the Claude Opus 4.7 long-context piece, and the practical agent-vs-agent comparison sits in the Cursor Composer 2 vs Claude Code piece.
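The compression in the top ten is easy to check from the table itself. The sketch below uses only the ten Verified scores listed above; the variable names are illustrative.

```python
# Verified scores copied from the April 2026 top-ten table above.
verified_top_ten = {
    "Claude Opus 4.7": 87.6,
    "GPT-5.3 Codex": 85.0,
    "Claude Opus 4.5": 80.9,
    "Claude Opus 4.6": 80.8,
    "Gemini 3.1 Pro": 80.6,
    "MiniMax M2.5": 80.2,
    "GPT-5.2": 80.0,
    "Qwen 3.6 Plus": 78.8,
    "MiMo-V2-Pro": 78.0,
    "GLM-5": 77.8,
}

ranked = sorted(verified_top_ten.values(), reverse=True)

# Width of the band between rank 1 and rank 10, in percentage points.
band = round(ranked[0] - ranked[-1], 1)

# Gap between each pair of adjacent ranks.
steps = [round(high - low, 1) for high, low in zip(ranked, ranked[1:])]

# Models packed into the 80.0-80.9 range.
mid_pack = [name for name, score in verified_top_ten.items() if 80.0 <= score < 81.0]

print(f"Top-ten band: {band} points")
print(f"Adjacent-rank gaps: {steps}")
print(f"Models between 80.0% and 80.9%: {len(mid_pack)}")
```

Five of the ten models sit within 0.9 points of each other, which is the range where run-to-run variation and harness differences plausibly matter more than the ranking.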
What 'gameability' looks like in practice
'Gameable' is one of those words that benchmark people throw around without unpacking. Concretely, here is what it means inside a SWE-bench evaluation harness.
- File-localisation shortcuts. Verified instances tend to have a small number of files where the patch lives. A scaffold that aggressively narrows attention to a few candidate files lifts the score without improving the underlying reasoning. Pro distributions punish that shortcut.
- Test-aware prompting. Some scaffolds quietly include the failing test in the prompt, turning a debug task into a fill-in-the-blank task. Verified's clean test patches make that especially effective.
- Retry-and-vote loops. Running the same task 32 times and voting for the dominant patch is cheap on a 500-instance benchmark and unrepresentative of single-attempt use in deployment, but it still shows up as a Verified bump.
- Repository-specific fine-tuning. When a benchmark is fixed, you can over-train on its repo style. Pro is larger and harder to over-train against, by construction.
None of these tactics make a model better at solving your tickets. They make it better at solving the 500 tickets in Verified. The Pro number strips most of that polish away.
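As an illustration of the retry-and-vote tactic, here is a minimal Python sketch. It is not taken from any particular scaffold: `generate_patch` stands in for whatever model call a harness makes, `fake_generate` is a hypothetical stub, and the whitespace normalisation is an assumption about how near-identical patches might be grouped.

```python
from collections import Counter
import itertools

def normalise(patch):
    """Strip trailing whitespace so cosmetically different patches vote together."""
    return "\n".join(line.rstrip() for line in patch.strip().splitlines())

def retry_and_vote(generate_patch, task, attempts=32):
    """Sample `attempts` patches and return (most common patch, its vote count)."""
    votes = Counter(normalise(generate_patch(task)) for _ in range(attempts))
    return votes.most_common(1)[0]

# Hypothetical stub standing in for a model call: cycles through canned outputs.
_canned = itertools.cycle(["fix A", "fix A ", "fix B"])

def fake_generate(task):
    return next(_canned)

patch, count = retry_and_vote(fake_generate, "ticket-1", attempts=32)
print(f"winning patch: {patch!r} with {count} of 32 votes")
```

When a vendor quotes a score, it is worth asking what `attempts` was in the run that produced it, because a single-attempt number and a 32-sample voted number are not the same measurement.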
How to read benchmark numbers in a pitch deck or contract
This is the section to forward to whoever is signing the cheque. Whether the deal is a Mumbai studio quoting an enterprise client in Singapore, or a London consultancy putting a vendor through a procurement gate at a UK insurer, the rules are the same.
- Treat Verified as a floor, not a forecast. A model below 70% Verified is unlikely to do well on your repo. A model above 80% Verified might. The score does not differentiate well at the top.
- Always ask for the Pro number. If the vendor cannot produce one, that is information. If the gap is wider than 25 points, that is also information.
- Specify the harness in the contract. 'The model under SLA is Claude Opus 4.7 at the official Anthropic API endpoint, evaluated with the OpenHands scaffold version X' is a contractable sentence. 'The model scores 87.6% on SWE-bench' is a marketing sentence.
- Build a bridge clause. Tie any productivity claim to a custom evaluation on a slice of your real backlog. Twenty closed PRs from the last quarter is enough to embarrass the worst vendors and reward the honest ones.
- Be honest in your own deck. If you are a UK or Indian builder pitching to international clients, quote the Pro number and explain the gap. Sophisticated buyers reward technical honesty; the unsophisticated ones will be confused either way.
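The rules above can be turned into a rough triage checklist. The Python sketch below encodes only the thresholds stated in this article (70% Verified as a floor, a 25-point gap as a flag); the `VendorClaim` type, its field names, and the flag wording are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class VendorClaim:
    """What a vendor deck actually states. None means 'not provided'."""
    verified: Optional[float] = None   # SWE-bench Verified score, in percent
    pro: Optional[float] = None        # SWE-bench Pro score, in percent
    scaffold: Optional[str] = None     # harness and version that produced the score
    source: Optional[str] = None       # e.g. "public leaderboard" or "internal"

def triage(claim: VendorClaim) -> List[str]:
    """Return follow-up flags; an empty list means no flag under these rules."""
    flags = []
    if claim.verified is None:
        flags.append("no Verified score quoted")
    elif claim.verified < 70.0:
        flags.append("Verified below 70%: unlikely to do well on your repo")
    if claim.pro is None:
        flags.append("no Pro score: ask for one")
    elif claim.verified is not None and claim.verified - claim.pro > 25.0:
        flags.append("Verified-to-Pro gap wider than 25 points")
    if not claim.scaffold:
        flags.append("scaffold and harness not specified")
    if not claim.source:
        flags.append("source not stated: public leaderboard or internal evaluation?")
    return flags

for flag in triage(VendorClaim(verified=88.0)):
    print("FLAG:", flag)
```

An empty list is not an endorsement. It only means the claim is specific enough to be written into a contract and tested against your own backlog.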
A pattern AI Tech Connect has watched repeat across UK and Indian procurement cycles: the bid quoting 88% Verified wins the first round, the pilot underperforms within six to eight weeks, and the rebid goes to the team that quoted the Pro number honestly — often at a higher rate. The honest number is the long-game number, but it costs you the first round.
— AI Tech Connect editorial
What benchmark would actually predict how a model handles your codebase?
The honest answer is none of the public ones. Verified, Pro, Multilingual and Lite are all useful as triage filters and as signals of frontier movement, but none of them sees your authentication middleware, your bespoke build tooling, the half-deprecated ORM you cannot replace this quarter, or the way your senior engineers actually write tickets. A custom evaluation of twenty real closed pull requests from your last six months (diff redacted, issue text as input) is the single highest-leverage hour an engineering lead can spend on model selection.
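A custom evaluation of this kind needs very little machinery. The Python sketch below shows one possible shape, under stated assumptions: `BacklogCase`, `evaluate_backlog`, and both callables are hypothetical names, and the stubs at the bottom stand in for a real model call and a real test runner that checks out the pre-fix commit, applies the patch, and runs the tests from the merged PR.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Tuple

@dataclass(frozen=True)
class BacklogCase:
    pr_id: str        # identifier of the closed pull request
    issue_text: str   # the only input the model sees; the human diff is withheld

def evaluate_backlog(
    cases: Iterable[BacklogCase],
    generate_patch: Callable[[str], str],
    tests_pass: Callable[[str, str], bool],
) -> Tuple[float, Dict[str, bool]]:
    """Return (resolve rate, per-PR pass/fail) for one model over the backlog slice."""
    results: Dict[str, bool] = {}
    for case in cases:
        patch = generate_patch(case.issue_text)
        results[case.pr_id] = bool(tests_pass(case.pr_id, patch))
    rate = sum(results.values()) / len(results) if results else 0.0
    return rate, results

# Hypothetical stubs so the sketch runs end to end.
cases = [
    BacklogCase("PR-101", "Login fails when the session cookie is missing"),
    BacklogCase("PR-102", "CSV export drops the header row"),
]

def stub_model(issue_text: str) -> str:
    return f"patch for: {issue_text}"

def stub_test_runner(pr_id: str, patch: str) -> bool:
    return pr_id == "PR-101"   # pretend only the first patch turns the tests green

rate, per_pr = evaluate_backlog(cases, stub_model, stub_test_runner)
print(f"resolve rate: {rate:.0%}", per_pr)
```

Keeping the model call and the test runner as plain callables means the same twenty cases can be replayed against every vendor on the shortlist with the harness held constant.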
For a related comparison of how scope changes a benchmark conversation entirely, Cursor Composer 2 reports 73.7% on SWE-bench Multilingual, which is a different evaluation set and not directly comparable to a Verified score on a single-language slice. Cross-benchmark comparisons are the second-most common error in vendor decks, after quoting Verified as a productivity prediction.
The takeaway for buyers and Builders
The 23-point gap between Claude Opus 4.7 on Verified and Pro is not a problem to be solved. It is a measurement that finally makes the benchmark conversation honest. If your work involves quoting numbers — in a pitch, in a procurement document, in an SLA — quote the Pro number when you can, explain the gap when you have to, and build a bridge clause to a custom evaluation in every contract that hinges on a productivity claim. Buyers are getting more sophisticated, slowly. Builders who treat them as if they are will win the deals worth winning.
Primary sources for this piece: the official SWE-bench leaderboard at swebench.com and the Verified split; aggregator views at llm-stats.com and benchlm.ai; and the SWE-bench Pro framing and leaderboard at morphllm.com/swe-bench-pro and labs.scale.com/leaderboard/swe_bench_pro_public.