What just happened on the coding-agent leaderboard

In April 2026, an open-source scaffold called Live-SWE-agent posted 79.2% on SWE-bench Live — the dynamic, contamination-free coding-agent benchmark that has quietly become the reference point for honest comparison this year. That single number reframed an argument the industry has been having since SWE-bench Verified scores started cresting 90%: how much of the gap between open and closed coding agents is actually the model, and how much is the scaffold around it?

The answer, on this benchmark, is that the scaffold accounts for an enormous slice. A well-designed open harness — the agent loop, the tool stack, the planning prompts, the verifier — drives an open-weight model to within striking distance of closed systems built by labs with billions of dollars and proprietary post-training. That is the story for an Indian dev-tools start-up choosing what to layer on top of Qwen3 or DeepSeek, and it is also the story for a boutique UK consultancy advising a FTSE 100 engineering team on which coding agent to deploy internally.

  • Result: Live-SWE-agent, an open-source scaffold, scored 79.2% on SWE-bench Live, placing it at or near the top of the SWE-bench Live leaderboard for openly auditable systems.
  • Why Live matters: SWE-bench Live is dynamic and contamination-free, so a high score is not explained by the underlying model having memorised the tasks in pre-training.
  • Closed-model comparison: Claude Mythos Preview leads SWE-bench Verified at 93.9% but drops to 45.9% on the harder, contamination-resistant SWE-bench Pro. The two figures come from separate benchmarks run on the same model.
  • Industry context: 2026 has been called "the year benchmark trust collapsed". The conversation has moved from Verified to Pro and Live almost overnight.
  • Frame: the agent harness is competing with the model weights for share of the result. Builders who underinvested in the harness this year are paying for it in real money.

Pro tip

When you read a coding-agent score in 2026, read the benchmark name first and the percentage second. A 90%+ figure on Verified, a 70%+ figure on Live and a 40%+ figure on Pro can all describe the same agent in the same week. Anchor your team on Pro and Live; treat Verified as a historical artefact.

The three SWE-benches, in plain English

The naming is doing nobody any favours, so it is worth laying the three SWE-benches side by side before we look at any numbers.

SWE-bench Verified is the curated 500-task subset of the original SWE-bench. It was launched as the trusted measure of coding-agent quality, and for two years it did the job. The problem is that the underlying GitHub issues, repositories and patches are now well inside the training corpora of every major frontier model. Top systems are reporting 90%+ on Verified, and the consensus among honest evaluators is that a meaningful chunk of that score is memorisation rather than reasoning. Verified is still useful as a regression test, but it has stopped discriminating between agents at the top of the leaderboard.

SWE-bench Pro is the response. It is smaller, harder and contamination-resistant: a held-out set of fresh, hand-curated tasks that the maintainers keep deliberately out of public crawls. Pro scores read like a punch in the face for anyone used to Verified: the same Claude Mythos Preview that posts 93.9% on Verified scores only 45.9% on Pro. The 48-point gap (93.9 minus 45.9) is best read as an upper bound on how much of the Verified score was memorised rather than reasoned, because Pro is also the harder set and some of the drop reflects difficulty rather than contamination. We covered this gap in detail in our SWE-bench Verified-vs-Pro explainer; it is the single most useful frame for sanity-checking any 2026 coding-agent claim.

SWE-bench Live is the third leg. Where Pro is a fixed contamination-free set, Live is a continuously evolving stream of newly-filed real-world GitHub issues. Each round of evaluation uses tasks that postdate the model's training cut-off, so contamination is structurally impossible. The leaderboard moves every cycle, and a score is a snapshot rather than a final mark. Live is the closest the field has to a true out-of-distribution measure of how a coding agent will behave on a fresh enterprise bug ticket on Monday morning.
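The cut-off rule is simple enough to sketch in Python. This is an illustration of the idea only, not SWE-bench Live's actual pipeline: the `Task` record, the `fresh_tasks` helper, the issue ids and every date below are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Task:
    """A candidate evaluation task (hypothetical record, not a benchmark schema)."""
    issue_id: str
    filed_on: date


def fresh_tasks(tasks: list[Task], training_cutoff: date) -> list[Task]:
    """Keep only tasks filed strictly after the model's training cut-off."""
    return [t for t in tasks if t.filed_on > training_cutoff]


# Illustrative data only: the ids and dates are made up.
candidates = [
    Task("repo-a#101", date(2025, 6, 1)),
    Task("repo-a#257", date(2026, 2, 14)),
    Task("repo-b#42", date(2026, 3, 30)),
]
cutoff = date(2025, 12, 31)  # assumed cut-off; use the one your vendor states

selected = fresh_tasks(candidates, cutoff)
print([t.issue_id for t in selected])
```

The same filter works for an in-house pilot: point it at your own ticket tracker and keep only issues filed after the cut-off the vendor discloses.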

The comparison table — read the columns carefully

The crucial discipline here is that the three benchmarks are not interchangeable. The same agent in the same week posts very different numbers depending on which one you run. The table organises what we know without conflating them.

| Coding agent | Benchmark | Score | Open / closed | What to take away |
| --- | --- | --- | --- | --- |
| Live-SWE-agent | SWE-bench Live | 79.2% | Open-source scaffold | Top of the openly auditable Live leaderboard; contamination-free, so the number reflects real reasoning rather than memorisation. |
| Claude Mythos Preview | SWE-bench Verified | 93.9% | Closed model | Headline 90%+ figure, but Verified is partially contaminated. Use as a directional signal only. |
| Claude Mythos Preview | SWE-bench Pro | 45.9% | Closed model | Same model, contamination-free harder set, 48-point honesty correction. The Verified-vs-Pro gap is the central 2026 lesson. |
| SWE-agent (original) | SWE-bench Verified | ~33% (indicative) | Open | 2024-era open scaffold, baseline against which Live-SWE-agent's progress should be measured. |
| Best closed agent | SWE-bench Live | ~80%+ (range) | Closed | Top closed systems sit broadly in the same band as Live-SWE-agent on Live, much closer than Verified suggested. |

The rows that do the work are the two SWE-bench Live rows. On the same contamination-free benchmark, the best open scaffold and the best closed systems sit in broadly the same band, a far narrower gap than the Verified narrative suggested. The Pro row supplies the other half of the picture: it shows how far a closed model's headline figure falls once contamination is stripped away. That is a different industry from the one most procurement decks were written for in late 2025.

Watch out

Vendors will continue to quote Verified scores well into 2026 because the numbers are flattering. If a vendor refuses to publish a SWE-bench Pro or SWE-bench Live result for the exact same configuration, treat that as a red flag — not as a neutral data point. The Verified inflation problem is the single largest source of buyer surprise in coding-agent deployments this year.

Why the harness now matters as much as the weights

A coding agent is not just a model — it is a model wrapped in a loop. The loop reads code, plans an edit, calls tools, runs tests, observes results and decides what to do next. Everything outside the forward pass is the harness. For most of 2024 and 2025, the harness was treated as plumbing and the model weights got the credit. Live-SWE-agent's 79.2% on Live makes that framing untenable.
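To make that division of labour concrete, here is a toy version of the loop in Python. It is a sketch under stated assumptions, not Live-SWE-agent's implementation: the action format, the tool names (`read_file`, `apply_patch`, `run_tests`) and the scripted stand-in for the model are all invented for illustration.

```python
from typing import Callable

# A "model" here is any function from the transcript so far to the next action.
# The action format and tool names are assumptions made for this sketch.
Action = dict[str, str]
Model = Callable[[list[str]], Action]
Tool = Callable[[str], str]


def run_agent(model: Model, tools: dict[str, Tool], task: str,
              max_steps: int = 10) -> list[str]:
    """Drive the model in a loop until it says 'done' or the step budget runs out."""
    transcript = [f"task: {task}"]
    for _ in range(max_steps):
        action = model(transcript)         # the forward pass: the only model-side step
        name = action["tool"]
        if name == "done":
            transcript.append("done")
            break
        tool = tools.get(name)
        if tool is None:                   # error recovery is harness work
            transcript.append(f"error: unknown tool {name}")
            continue
        observation = tool(action["arg"])  # tool call, then observe the result
        transcript.append(f"{name}: {observation}")
    return transcript


def scripted_model(transcript: list[str]) -> Action:
    """Stand-in for a real model: read, edit, test, then stop."""
    plan = [
        {"tool": "read_file", "arg": "app.py"},
        {"tool": "apply_patch", "arg": "fix off-by-one"},
        {"tool": "run_tests", "arg": ""},
        {"tool": "done", "arg": ""},
    ]
    return plan[len(transcript) - 1]


stub_tools: dict[str, Tool] = {
    "read_file": lambda path: f"read {path}",
    "apply_patch": lambda desc: f"patched ({desc})",
    "run_tests": lambda _: "all tests passed",
}

log = run_agent(scripted_model, stub_tools, "fix failing test")
print("\n".join(log))
```

Everything in `run_agent` except the single call to `model` is harness: the step budget, the tool lookup, the handling of an unknown tool and the shape of the transcript. Those are the knobs the rest of this piece is about.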

Two industry numbers, drawn from a Q1 2026 analysis of enterprise agentic AI deployments, sharpen the picture. There is a roughly 37% gap between lab benchmark scores and real-world deployment performance — almost all of which is harness-side: tool selection, error recovery, retry policy, verifier design, sandbox cost. And there is a 50× cost variation for similar accuracy across competing agent systems — again, almost all harness. Two systems can sit within a couple of points of each other on Live and bill at completely different orders of magnitude depending on how many tool calls they make per task, how often they retry and how aggressively they cache.

For a Bengaluru dev-tools start-up shipping a coding-agent product on open weights, that is the entire commercial story. The differentiator is not which open model you pick — Qwen, DeepSeek or one of the other recent open-weight coding models all sit in a similar band — it is the scaffold you wrap around it. For a boutique UK consultancy advising a FTSE 100 engineering team on internal adoption, the same logic applies: the meaningful question is not "Claude or open" but "which harness, on which benchmark, at what cost per resolved issue".

From the desk

"We were ready to write off open-weight coding agents at the end of 2025 — the Verified gap to Claude was too large to sell to a CTO. The Live-SWE-agent result changed the conversation in one meeting. We are now piloting an open scaffold for the bulk of routine refactors and routing only the hardest tickets to a closed model. The cost line is unrecognisable."

— Builder Desk · AI Tech Connect

How to evaluate a coding agent in mid-2026

If you are picking a coding agent today — whether to ship inside your product or to deploy on your engineers' machines — the evaluation discipline has changed. The Verified-only era is over. A short pragmatic checklist looks like this.

  1. Ask for a SWE-bench Pro score and a SWE-bench Live score for the exact configuration you will be running — same model, same scaffold, same tool stack. If only Verified is on offer, mark the vendor down a notch.
  2. Pin the harness, not just the weights. Two procurement bake-offs that swap the model but keep the harness will tell you about model quality; two that swap the harness but keep the model will tell you about scaffold quality. You need both, and most teams only run one.
  3. Measure cost per resolved issue, not cost per token. The 50× variation is real and visible only at the issue level. A cheap model with a profligate scaffold can cost more per resolved issue than an expensive model with a parsimonious one.
  4. Run a contamination-free pilot on your own repository. Pick ten recently filed tickets that the model could not have seen during pre-training, let the agent attempt them, and grade the patches yourself. This is the cheapest insurance policy in the entire procurement process.
  5. Budget for the 37% lab-vs-prod gap in your roll-out plan. Read as a relative discount, a leaderboard score of 79% becomes 79% × (1 − 0.37) ≈ 50%, so plan for a deployment success rate around that level and a steady glide upward as you tune the harness to your repository conventions.
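Items 3 and 5 reduce to two small calculations, sketched here in Python. The `PilotRun` record, both helper functions and all the pilot figures are hypothetical. Treating the 37% gap as a relative discount is also an assumption, chosen because it reproduces the 79% to roughly 50% example in item 5.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PilotRun:
    """Totals from one pilot of an agent configuration (hypothetical record)."""
    name: str
    issues_attempted: int
    issues_resolved: int
    total_cost: float  # everything billed for the pilot: tokens, retries, sandbox


def cost_per_resolved_issue(run: PilotRun) -> float:
    """Total spend divided by issues actually resolved, not issues attempted."""
    if run.issues_resolved == 0:
        return float("inf")  # nothing resolved: no finite cost per resolution
    return run.total_cost / run.issues_resolved


def planned_success_rate(leaderboard_score: float,
                         lab_to_prod_gap: float = 0.37) -> float:
    """Discount a leaderboard score by a relative lab-to-production gap."""
    return leaderboard_score * (1.0 - lab_to_prod_gap)


# Illustrative numbers only; they are not measurements of any real system.
run_a = PilotRun("cheap model, chatty scaffold", 10, 5, total_cost=40.0)
run_b = PilotRun("pricey model, lean scaffold", 10, 6, total_cost=30.0)

for run in (run_a, run_b):
    print(run.name, round(cost_per_resolved_issue(run), 2))

print(round(planned_success_rate(0.792), 3))
```

In this made-up pilot the configuration with the more expensive model still wins on cost per resolved issue, which is exactly the situation a per-token comparison would hide.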

Most of that work is harness work, not model work. Picking the right agent framework is therefore as important as picking the right model. Our recent survey of agent frameworks — LangGraph, CrewAI, PydanticAI and Microsoft's stack — is the right companion read here, because the framework you choose constrains the scaffold patterns you can express cleanly.

What this means for the open / closed split

Claude Mythos Preview is, by any honest reading of the data, the strongest closed coding model in the market this quarter. Its 93.9% on Verified is real; its 45.9% on Pro is also real, and the gap is not a knock on the model — it is the price of evaluating any model on a contamination-resistant set. The same correction applies to every closed system; Claude is simply the one with the most visible disclosed numbers. We covered the launch itself in our Claude Mythos Preview piece, which sets out exactly what changed in the model behaviour.

The open-source story is now a serious commercial story rather than a hobbyist one. An open scaffold posting 79.2% on Live is the kind of result that gets approved by a UK procurement committee that previously would have ruled out anything non-Anthropic on capability grounds. It is also the result that gives an Indian start-up the cover to ship a coding-agent product on open weights and price it 3-10× below a closed-model competitor — because the underlying inference is cheap and the harness, not the weights, is the moat.

Where this story still has open questions

  1. SWE-bench Live is a moving target. Today's 79.2% is a snapshot. The leaderboard will refresh, harder tickets will land, and any specific number should be treated as a position on a curve, not a final mark.
  2. Open scaffolds need maintenance budget. The harness is the moat, but only if someone is paid to keep it tuned to your repository. Open-source is free as in code; it is not free as in engineering time.
  3. The Verified-vs-Pro gap will narrow as Pro ages. Pro is contamination-free today, but every public benchmark eventually leaks into training corpora. Expect a SWE-bench Pro 2, then a Pro 3. Build your evaluation pipeline to absorb that churn.
  4. Cost discipline is still scarce. Few teams report cost per resolved issue publicly. Until that becomes standard, the 50× variation will continue to surprise buyers in production rather than at procurement.
  5. The closed-model lead on the hardest tickets is real. Live and Pro show open scaffolds within striking distance on average; they do not show parity on the hardest tail. Plan a routing layer that escalates the worst tickets to a closed model.
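The routing layer in item 5 can start out very small. The sketch below is one possible shape in Python, not a recommended policy: the `Ticket` fields, the two escalation signals and both thresholds are assumptions that a team would need to tune against its own ticket history.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ticket:
    """Triage summary for one ticket (hypothetical fields)."""
    ticket_id: str
    files_touched: int    # rough size estimate from triage
    failed_attempts: int  # prior failed attempts by the open scaffold


def route(ticket: Ticket, max_files: int = 5, max_failures: int = 2) -> str:
    """Return 'open' for routine tickets and 'closed' for the hard tail."""
    if ticket.failed_attempts >= max_failures:
        return "closed"   # the open scaffold already tried and failed
    if ticket.files_touched > max_files:
        return "closed"   # broad changes are treated as hard, by assumption
    return "open"


# Illustrative tickets only.
tickets = [
    Ticket("T-1", files_touched=2, failed_attempts=0),
    Ticket("T-2", files_touched=12, failed_attempts=0),
    Ticket("T-3", files_touched=1, failed_attempts=2),
]

for t in tickets:
    print(t.ticket_id, route(t))
```

Escalating on failed attempts means the open scaffold gets the first try at most tickets, so the closed model is billed only for the tail it is actually needed for.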

The Live-SWE-agent write-up is at agentmarketcap.ai, with cross-reads at codeant.ai and the public awesomeagents.ai leaderboard. The broader benchmark-trust collapse is covered well by Kili Technology's 2026 evaluations guide and by programming-helper.com's SWE-bench guide. Pair this piece with our Verified-vs-Pro explainer for the full benchmark context.