Why don't all labs report SWE-bench Pro?

Pro is harder, larger, and less gameable than Verified, so the headline number is lower. Labs marketing a model upgrade prefer the Verified score because it looks better in a launch post. Pro is also newer — many lab evaluation pipelines were built around Verified and have not been re-tooled. Expect Pro reporting to become standard over the next two release cycles, but for now the absence of a Pro number is mostly a marketing choice rather than a technical one.

Is 87.6% on Verified meaningful for my codebase?

Only loosely. Verified is 500 hand-filtered Python issues from a fixed slice of GitHub history, with clear problem descriptions and correct test patches. Your codebase is almost certainly not Python-only, and it probably has fuzzier tickets, flakier tests and bespoke build tooling. Treat Verified as a coarse 'is this model in the right league' filter, not as a prediction of how the agent handles your repo. The 23-point Pro-vs-Verified gap is your reminder that headline numbers compress under realistic conditions.

Should I trust SWE-bench numbers in vendor pitches?

Trust them as directional, not as a procurement criterion. Ask which benchmark variant the number is from (Verified, Pro, Multilingual, Lite), what scaffold and tool-use setup produced it, and whether the number is from the public leaderboard or an internal report. A vendor that quotes a Verified score as if it predicts your team's productivity is either not technically careful or is deliberately glossing over the gap. Push for a paid pilot on a slice of your real backlog before signing anything.

What's a better benchmark for my team's evaluation?

Your own. Pull twenty closed pull requests from the last six months, redact the diff, and ask the candidate model to recreate each one with the issue description as input. Score on diff overlap, test-suite pass, and whether a senior reviewer would merge the output. This is unglamorous and specific to you, which is exactly why it predicts production behaviour better than any public leaderboard. Budget a day of engineering time for a custom eval: it is the single highest-leverage investment in a model selection.
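The three scoring signals can be sketched in a few lines of Python. This is a minimal illustration, not a finished harness: the `PRResult` record, its field names and the line-level overlap metric (built on the standard library's `difflib`) are all assumptions made for the example, and the test-suite and reviewer results are expected to be collected separately and filled in by hand.

```python
import difflib
from dataclasses import dataclass


@dataclass
class PRResult:
    """Outcome of one replayed pull request (hypothetical record shape)."""
    reference_diff: str   # the human-written diff that was actually merged
    candidate_diff: str   # the diff the model produced from the issue text
    tests_passed: bool    # did the test suite pass with the candidate applied?
    would_merge: bool     # senior reviewer's yes/no judgement


def diff_overlap(reference: str, candidate: str) -> float:
    """Line-level similarity between two diffs, in [0, 1]."""
    matcher = difflib.SequenceMatcher(
        None, reference.splitlines(), candidate.splitlines()
    )
    return matcher.ratio()


def summarise(results: list[PRResult]) -> dict[str, float]:
    """Average each of the three signals across the evaluated PRs."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to summarise")
    return {
        "mean_diff_overlap": sum(
            diff_overlap(r.reference_diff, r.candidate_diff) for r in results
        ) / n,
        "test_pass_rate": sum(r.tests_passed for r in results) / n,
        "merge_rate": sum(r.would_merge for r in results) / n,
    }
```

Diff overlap is the weakest of the three signals, since a correct fix can look nothing like the merged one, so it is probably best read alongside the test and reviewer columns rather than on its own.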

SWE-bench Verified vs Pro: why one says 88% and the other says 64%

What the gap actually tells you

If you have spent any time inside a procurement conversation in the last twelve months, you have seen a slide with a SWE-bench score on it. The number is almost always SWE-bench Verified, almost always close to or above 80%, and almost always presented as if it predicts how the model will behave on your codebase. It does not. The April 2026 leaderboard for SWE-bench Pro — the harder, larger, less gameable variant — has Claude Opus 4.7 at the top with 64.3%. The same model scores 87.6% on Verified. That is a 23.3 percentage-point gap on the same model, the same week, on benchmarks built from the same underlying GitHub corpus.

The gap is not a bug. It is the most informative single data point a builder can carry into a vendor evaluation, because it tells you how much of the headline score is real general code reasoning and how much is scaffold engineering, prompt patterns and overfitting to a 500-instance evaluation set the entire industry has been optimising against for two years.

Pro tip

When a vendor quotes a SWE-bench number, ask three follow-ups. Which variant — Verified, Pro, Multilingual or Lite? What scaffold and tool-use harness produced the score? Is the number from the public leaderboard or an internal evaluation? If they cannot answer all three crisply, treat the number as marketing.

The two benchmarks, side by side

Both benchmarks pull from real GitHub issues paired with the human-written test patches that resolve them. The model has to produce a code change that turns the failing tests green. The differences are in scale, filtering and how easily a clever scaffold can game the format.

Dimension	SWE-bench Verified	SWE-bench Pro
Dataset size	500 instances	Larger, harder set hosted by Scale AI
Filtering	Hand-filtered by humans in collaboration with OpenAI to ensure clear problem descriptions, correct test patches and solvable tasks	Less aggressive filtering — closer to the raw distribution of real bug reports
Gameability	High — two years of public optimisation has shaped scaffolds and prompts to its patterns	Lower — newer, larger, harder to overfit at scale
Top score (April 2026)	87.6% — Claude Opus 4.7	64.3% — Claude Opus 4.7
Best used as	Coarse 'is this model in the right league' signal	More honest indicator of frontier code-reasoning capability

Why the gap exists — three reasons

The 23-point delta is not random noise. It decomposes into three structural causes that every builder should be able to recite.

Test-suite filtering. Verified was hand-filtered to remove tickets where the test patch was wrong, the description was unclear, or the task was unsolvable in the snapshot. That filtering removes a category of failure that real-world tickets routinely contain — ambiguous repro steps, flaky tests, environment-dependent assertions. Pro keeps more of those, so a model that depends on tidy problem statements degrades.
Scaffold overfitting. A chunk of benchmark progress in 2024-2025 was scaffolding and prompt engineering specific to Verified's patterns, not general improvement in code reasoning. That is a direct quote of the framing in the morphllm coverage of SWE-bench Pro, and it is the single most uncomfortable sentence in modern code-LLM marketing. Pro breaks scaffolds that were tuned to the Verified instance distribution.
Distribution drift. Verified is a fixed slice of mostly Python repos from a defined snapshot of GitHub. Pro stretches that distribution. Models that learnt the shape of Django bug fixes do less well on the longer tail.

Watch out

If a model's Verified-to-Pro gap is materially wider than the field average, that is a signal of scaffold overfitting — not a signal of weak underlying reasoning. Conversely, a model with a smaller gap may have less polished tooling rather than better fundamentals. Read the gap, not just the headline.
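Reading the gap against the field is simple arithmetic, sketched here in Python. The Claude Opus 4.7 scores are the ones quoted in this article; the list of other models' gaps is a hypothetical placeholder, included only so the example runs, and should be replaced with real Verified and Pro pairs from the public leaderboards.

```python
from statistics import mean


def verified_pro_gap(verified: float, pro: float) -> float:
    """Gap in percentage points between a Verified score and a Pro score."""
    return round(verified - pro, 1)


def gap_vs_field(model_gap: float, field_gaps: list[float]) -> float:
    """How many points wider (+) or narrower (-) than the field average."""
    return round(model_gap - mean(field_gaps), 1)


# Scores quoted in this article for Claude Opus 4.7.
opus_gap = verified_pro_gap(87.6, 64.3)

# HYPOTHETICAL placeholder gaps for other models, not real data.
# Replace with gaps computed from published Verified/Pro pairs.
placeholder_field_gaps = [20.0, 22.0, 24.0]

print(opus_gap)                                        # 23.3
print(gap_vs_field(opus_gap, placeholder_field_gaps))  # 1.3 with the placeholders
```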

Verified leaderboard: the April 2026 snapshot

The full Verified leaderboard is the easiest place to see how compressed the top ten has become. Ten frontier models sit inside a ten-point band, which means the marginal cost of a one-point gain has become enormous, and that gain almost certainly does not translate one-for-one into production behaviour.

Rank	Model	SWE-bench Verified
1	Claude Opus 4.7 (released 16 April 2026)	87.6%
2	GPT-5.3 Codex	85.0%
3	Claude Opus 4.5	80.9%
4	Claude Opus 4.6	80.8%
5	Gemini 3.1 Pro	80.6%
6	MiniMax M2.5 (best open-weight on Verified)	80.2%
7	GPT-5.2	80.0%
8	Qwen 3.6 Plus	78.8%
9	MiMo-V2-Pro (Xiaomi, 1T parameters)	78.0%
10	GLM-5 (Zhipu, 744B parameters)	77.8%
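To put the compression in the table above into numbers, the sketch below computes the top-ten spread and a rough sampling-noise estimate for a 500-instance benchmark. The noise figure assumes each instance is an independent pass/fail trial, which is a simplification introduced here rather than something the leaderboard states.

```python
import math

# Verified scores from the table above, in rank order (percent).
top_ten = [87.6, 85.0, 80.9, 80.8, 80.6, 80.2, 80.0, 78.8, 78.0, 77.8]

# Distance between rank 1 and rank 10, in percentage points.
spread = round(max(top_ten) - min(top_ten), 1)


def standard_error_points(score_pct: float, n_instances: int = 500) -> float:
    """Binomial standard error of a pass rate, in percentage points.

    ASSUMPTION: instances are independent pass/fail trials. Real instances
    cluster by repository, so treat the result as a rough guide only.
    """
    p = score_pct / 100
    return 100 * math.sqrt(p * (1 - p) / n_instances)


se_at_80 = standard_error_points(80.0)
print(spread, round(se_at_80, 2))  # 9.8 1.79
```

Under that simplification, one standard error near the 80% mark is roughly 1.8 points, which is wider than the 0.9-point spread between ranks 3 and 7 in the table.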

For comparison, Mistral Devstral 2, a developer-focused open-weight release, scores 72.2% on the same Verified benchmark, and MiniMax M2.5 sits at 80.2% in the table above. Both are reminders that the headline gap between top-tier proprietary models and the strongest open-weight tier is much smaller than vendor decks usually claim. We covered the broader open-weight picture in the Claude Opus 4.7 long-context piece, and the practical agent-vs-agent comparison sits in the Cursor Composer 2 vs Claude Code piece.

What 'gameability' looks like in practice

'Gameable' is one of those words that benchmark people throw around without unpacking. Concretely, here is what it means inside a SWE-bench evaluation harness.

File-localisation shortcuts. Verified instances tend to have a small number of files where the patch lives. A scaffold that aggressively narrows attention to a few candidate files lifts the score without improving the underlying reasoning. Pro distributions punish that shortcut.
Test-aware prompting. Some scaffolds quietly include the failing test in the prompt, turning a debug task into a fill-in-the-blank task. Verified's clean test patches make that especially effective.
Retry-and-vote loops. Running the same task 32 times and voting for the dominant patch is cheap on a 500-instance benchmark but does not reflect how the model is run in deployment, yet it shows up as a Verified bump.
Repository-specific finetuning. When a benchmark is fixed, you can over-train on its repo style. Pro is larger and harder to over-train against, by construction.

None of these tactics make a model better at solving your tickets. They make it better at solving the 500 tickets in Verified. The Pro number strips most of that polish away.
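The retry-and-vote effect is easy to see in a toy simulation. Everything in the sketch below is an assumption chosen for illustration: a fixed 40% chance that any single attempt produces the correct patch, ten distinct wrong patches that failed attempts scatter across, and identical behaviour on every task. Real tasks vary, and a task the model gets consistently wrong gains nothing from voting.

```python
import random
from collections import Counter


def attempt(rng: random.Random, p_correct: float, n_wrong_variants: int) -> str:
    """One sampled patch: 'correct' with probability p_correct, otherwise
    one of several distinct wrong patches chosen uniformly."""
    if rng.random() < p_correct:
        return "correct"
    return f"wrong-{rng.randrange(n_wrong_variants)}"


def solve_rates(p_correct=0.4, n_wrong_variants=10, k=32, n_tasks=500, seed=0):
    """Return (single-attempt solve rate, k-way plurality-vote solve rate)."""
    rng = random.Random(seed)
    single = voted = 0
    for _ in range(n_tasks):
        patches = [attempt(rng, p_correct, n_wrong_variants) for _ in range(k)]
        single += patches[0] == "correct"          # what one run would score
        winner, _count = Counter(patches).most_common(1)[0]
        voted += winner == "correct"               # what the vote would score
    return single / n_tasks, voted / n_tasks


single_rate, voted_rate = solve_rates()
print(f"single attempt: {single_rate:.1%}, 32-way vote: {voted_rate:.1%}")
```

In this toy setup the voted rate comes out well above the single-attempt rate even though the underlying per-attempt ability is unchanged, which is the sense in which the tactic lifts a score without improving the model.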

How to read benchmark numbers in a pitch deck or contract

This is the section to forward to whoever is signing the cheque. Whether the deal is a Mumbai studio quoting an enterprise client in Singapore, or a London consultancy putting a vendor through a procurement gate at a UK insurer, the rules are the same.

Treat Verified as a floor, not a forecast. A model below 70% Verified is unlikely to do well on your repo. A model above 80% Verified might. The score does not differentiate well at the top.
Always ask for the Pro number. If the vendor cannot produce one, that is information. If the gap is wider than 25 points, that is also information.
Specify the harness in the contract. 'The model under SLA is Claude Opus 4.7 at the official Anthropic API endpoint, evaluated with the OpenHands scaffold version X' is a contractable sentence. 'The model scores 87.6% on SWE-bench' is a marketing sentence.
Build a bridge clause. Tie any productivity claim to a custom evaluation on a slice of your real backlog. Twenty closed PRs from the last quarter is enough to embarrass the worst vendors and reward the honest ones.
Be honest in your own deck. If you are a UK or Indian builder pitching to international clients, quote the Pro number and explain the gap. Sophisticated buyers reward technical honesty; the unsophisticated ones will be confused either way.
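The rules above, together with the three vendor follow-up questions, can be encoded as a small triage function. This is a sketch under stated assumptions: the `VendorClaim` record and its field names are hypothetical, and the 70% and 25-point thresholds are simply the rules of thumb from the list, not calibrated cut-offs.

```python
from dataclasses import dataclass
from typing import Optional

KNOWN_VARIANTS = {"Verified", "Pro", "Multilingual", "Lite"}


@dataclass
class VendorClaim:
    """A benchmark claim as it appears in a deck (hypothetical record shape)."""
    verified: Optional[float] = None   # Verified score in percent, if quoted
    pro: Optional[float] = None        # Pro score in percent, if quoted
    variant: Optional[str] = None      # variant the headline number comes from
    harness: Optional[str] = None      # scaffold and tool-use setup, with version
    source: Optional[str] = None       # "public leaderboard" or "internal"


def triage(claim: VendorClaim) -> list[str]:
    """Return the follow-up flags the checklist above would raise."""
    flags = []
    if claim.variant not in KNOWN_VARIANTS:
        flags.append("variant not stated")
    if not claim.harness:
        flags.append("harness not specified")
    if not claim.source:
        flags.append("source not stated")
    if claim.verified is not None and claim.verified < 70:
        flags.append("Verified below 70%: unlikely to do well")
    if claim.pro is None:
        flags.append("no Pro number supplied")
    elif claim.verified is not None and claim.verified - claim.pro > 25:
        flags.append("Verified-to-Pro gap wider than 25 points")
    return flags
```

A bare `VendorClaim(verified=87.6)` comes back with four flags, while a claim that names the variant, harness and source and supplies both scores quoted in this article (87.6 and 64.3) comes back clean.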

Editorial observation

A pattern AI Tech Connect has watched repeat across UK and Indian procurement cycles: the bid quoting 88% Verified wins the first round, the pilot underperforms within six to eight weeks, and the rebid goes to the team that quoted the Pro number honestly — often at a higher rate. The honest number is the long-game number, but it costs you the first round.

— AI Tech Connect editorial

What benchmark would actually predict how a model handles your codebase?

The honest answer is none of the public ones. Verified, Pro, Multilingual and Lite are all useful as triage filters and as signals of frontier movement, but none of them sees your authentication middleware, your bespoke build tooling, the half-deprecated ORM you cannot replace this quarter, or the way your senior engineers actually write tickets. A custom evaluation of twenty real closed pull requests from your last six months (diff redacted, issue text as input) is the single highest-leverage use of an engineering lead's time in a model selection.

For a related example of how scope changes a benchmark conversation entirely: Cursor Composer 2 reports 73.7% on SWE-bench Multilingual, which is a different evaluation set and is not directly comparable to a Verified score on a single-language slice. Cross-benchmark comparisons are the second-most common error in vendor decks, after quoting Verified as a productivity prediction.

The takeaway for buyers and Builders

The 23-point gap between Claude Opus 4.7 on Verified and Pro is not a problem to be solved. It is a measurement that finally makes the benchmark conversation honest. If your work involves quoting numbers in a pitch, a procurement document or an SLA, quote the Pro number when you can, explain the gap when you have to, and build a bridge clause to a custom evaluation into every contract that hinges on a productivity claim. Buyers are getting more sophisticated, slowly. Builders who treat them as if they already are will win the deals worth winning.

Primary sources for this piece: the official SWE-bench leaderboard at swebench.com and the Verified split; aggregator views at llm-stats.com and benchlm.ai; and the SWE-bench Pro framing and leaderboard at morphllm.com/swe-bench-pro and labs.scale.com/leaderboard/swe_bench_pro_public.