What this means for coding-agent buyers

On 18 May 2026, Cursor shipped Composer 2.5 — the second iteration of its in-house coding model — and the colour of the conversation around coding agents changed overnight. The headline is not just another benchmark win. It is the price tag.

  • Frontier-class quality at a tenth of the price — Composer 2.5 sits inside a percentage point of Opus 4.7 on SWE-Bench Multilingual and edges out GPT-5.5 on CursorBench v3.1 default-effort.
  • Standard-tier input at $0.50 per million tokens — that is 10× cheaper per token than Opus 4.7 list pricing. For high-volume batch and background-agent work, the unit economics no longer look like a frontier model.
  • Independent indices placed Cursor near the top of the coding-agent table at launch — with a cost-per-task curve well below the closed-source incumbents above it, per Artificial Analysis's Coding Agent Index.
  • Built on Moonshot AI's open-weight Kimi K2.5 base — but with 85% of post-launch compute spent inside Cursor's own post-training pipeline. The Cursor signature is in the reinforcement-learning loop, not the parameters themselves.
  • The catch is the fast tier — interactive IDE use defaults to $3 in / $15 out per million tokens, six times the headline rate. The savings only materialise on background and batch workloads.
Pro tip

Audit the split between interactive and background coding-agent calls in your team's Cursor usage logs before you celebrate the price drop. Most shops we have looked at run 70% interactive (fast tier) and 30% background. On input list prices that mix costs about $2.25 per million tokens against $5.00 for Opus 4.7, so the blended saving is roughly 2.2×, not 10×, but still material.
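The blended figure is simple arithmetic on input list prices. Here is a minimal sketch, assuming the per-million-token input prices quoted in this article and a usage split measured by input tokens. The function and constant names are illustrative and not part of any Cursor tooling.

```python
# Input list prices in USD per million tokens, as quoted in this article.
OPUS_INPUT = 5.00
COMPOSER_FAST_INPUT = 3.00      # interactive IDE default
COMPOSER_STANDARD_INPUT = 0.50  # background and batch


def blended_saving(interactive_share: float) -> float:
    """Return how many times cheaper Composer 2.5 input is than Opus 4.7
    for a given share of input tokens on the fast (interactive) tier."""
    if not 0.0 <= interactive_share <= 1.0:
        raise ValueError("interactive_share must be between 0 and 1")
    blended = (interactive_share * COMPOSER_FAST_INPUT
               + (1.0 - interactive_share) * COMPOSER_STANDARD_INPUT)
    return OPUS_INPUT / blended


# Compare an all-background, a 70/30 and an all-interactive usage mix.
for share in (0.0, 0.7, 1.0):
    print(f"{share:.0%} interactive: {blended_saving(share):.1f}x cheaper")
```

Feed in the interactive share from your own logs. Output tokens are left out here, so treat the result as an input-side estimate only.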

The benchmark numbers, plainly

Three benchmarks matter for a coding agent that is going to live inside a working developer's editor: SWE-Bench Multilingual (end-to-end task completion across languages), CursorBench v3.1 (Cursor's own internal harness, which we treat with a pinch of salt but cannot ignore), and Terminal-Bench 2.0 (shell-driven agent work). Composer 2.5's scorecard against Opus 4.7 and GPT-5.5 looks like this.

| Benchmark | Composer 2.5 | Opus 4.7 | GPT-5.5 | Verdict |
| --- | --- | --- | --- | --- |
| SWE-Bench Multilingual | 79.8% | 80.5% | n/a | Effectively tied with Opus 4.7 |
| CursorBench v3.1 (default effort) | 63.2% | 61.6% (xhigh) | 59.2% (medium) | Composer 2.5 leads on Cursor's harness |
| Terminal-Bench 2.0 | 69.3% | 69.4% | n/a | Statistical tie with Opus 4.7 |

The honest reading: on its own home benchmark, Composer 2.5 wins. On the two more neutral harnesses, it is within sampling noise of Opus 4.7. There is no version of these numbers where Composer 2.5 is the frontier — there is also no version where Opus 4.7 is meaningfully better for the kind of well-scoped coding task that gets fed to an agent in production.

Anyone framing this as "open-source has overtaken closed" is reading the headline and not the substrate. The base model is open weights (Kimi K2.5), but the agent your team actually uses is gated behind Cursor's post-training and infrastructure. That said — the fact that the substrate can come from an open-weight model and reach this tier of capability is the more interesting story for the medium term. The same dynamic that made the Qwen3-6 27B coding agent feasible on a single H100 is now reaching frontier-class agent performance with the right post-training.

Why $0.50 input matters

Set aside the benchmarks for a moment and run the pricing comparison directly. This is the table that ought to be on every engineering leader's desk this week.

| Model | Input $/Mtok | Output $/Mtok | Per-token price vs Opus 4.7 |
| --- | --- | --- | --- |
| Composer 2.5 (standard) | $0.50 | $2.50 | ~10× cheaper |
| Composer 2.5 (fast) | $3.00 | $15.00 | ~1.7× cheaper |
| Claude Opus 4.7 (list) | $5.00 | $25.00 | Baseline |

The standard-tier saving is the headline, and it is real. For a background coding agent that chews through ~5 million input tokens a day across PR reviews, lint-fix bots and codemods, the daily input cost falls from $25 to $2.50. For a 40-engineer team running that volume, that is a saving of $22.50 a day, or about $675 over a 30-day month (roughly £550), and closer to £900 once you factor in the output-token saving too.
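To rerun that arithmetic with your own volumes, here is a minimal sketch using the list prices from the table above. The dictionary keys are illustrative labels, not real API model identifiers, and the output-token volume is a hypothetical placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    input_per_mtok: float   # USD per million input tokens
    output_per_mtok: float  # USD per million output tokens


# List prices as quoted in the pricing table; keys are illustrative labels.
PRICES = {
    "composer-2.5-standard": Price(0.50, 2.50),
    "composer-2.5-fast": Price(3.00, 15.00),
    "opus-4.7": Price(5.00, 25.00),
}


def daily_cost(model: str, input_mtok: float, output_mtok: float) -> float:
    """USD cost of one day's traffic, with volumes in millions of tokens."""
    price = PRICES[model]
    return (input_mtok * price.input_per_mtok
            + output_mtok * price.output_per_mtok)


# 5M input tokens a day is the article's figure. The 0.5M output tokens
# is a hypothetical placeholder; substitute the number from your own logs.
for model in PRICES:
    print(f"{model}: ${daily_cost(model, 5.0, 0.5):.2f}/day")
```

Multiply the daily difference by your billing days to get a monthly figure. Any currency conversion needs the exchange rate of the day.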

The fast-tier price is the inconvenient detail. Interactive use inside the Cursor IDE — the tab-complete, the "fix this function", the live chat — defaults to fast. That is still ~1.7× cheaper than Opus 4.7 list, but it is not a 10× story. UK shops comparing total cost of ownership with Claude Code should run their own usage breakdown before committing to a switch.

Watch out

For the first week after launch (through roughly 25 May 2026), Cursor doubled the included usage limit on Pro plans. If you ran a price-comparison spike during that window, your unit economics are flattered. Re-run the analysis on a normal-quota day before you commit to a vendor change.

The Kimi K2.5 base plus Cursor post-training story

Cursor was unusually transparent about the recipe. Composer 2.5 starts from Moonshot AI's open-weight Kimi K2.5 base model — the same family that drove the Kimi K2.6 vs GPT-5.5 comparison a few weeks back. From there, Cursor spent 85% of the post-launch compute budget inside its own post-training pipeline: supervised fine-tuning on Cursor's internal task data, then reinforcement learning on 25× more synthetic coding tasks than the previous Composer version.

What that means in practice: the model is heavily specialised for the agent loop Cursor's IDE actually runs. Tool-use, file-edit syntax, multi-turn refinement, terminal interaction — all of it has been drilled in. It is the opposite of a generalist frontier model. It is a coding-specific instrument that happens to share the floor with the generalists on coding benchmarks.

This matters for the buyer because it tells you where the model will and will not transfer. Inside Cursor, it will behave like a frontier agent. Wired into a bespoke agent framework with a non-standard tool-call format, the post-training advantage evaporates and you are back to roughly Kimi K2.5 base-model performance — which is good, but not frontier.

From a verified Builder

"We swapped half our background PR-review queue from Claude Code to Composer 2.5 on day two. On the well-scoped reviews — lint, typing, dead-code — it is genuinely indistinguishable. On the architectural feedback prompts where we ask the model to flag larger design concerns, Opus 4.7 still gives the more thoughtful read. We are running both, routed by prompt category. The blended cost is down 38%."

— A verified Builder · London, UK

This pattern — split routing by task category — is the practical lesson from the launch. It mirrors the strategy we saw teams adopt when Opus 4.7 first dropped: do not migrate everything, route by workload shape.
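Routing by category can be as simple as a lookup table. The sketch below is an assumption about how such a router might look, not the quoted team's implementation. The category labels and model names are hypothetical, and it only picks a label, since how each job is dispatched depends on your own tooling.

```python
# Hypothetical category labels and model names, for illustration only.
ROUTES = {
    "lint": "composer-2.5",
    "typing": "composer-2.5",
    "dead-code": "composer-2.5",
    "architecture": "opus-4.7",
}

# Assumption: anything uncategorised goes to the model the article
# rates as stronger on open-ended, design-level feedback.
DEFAULT_MODEL = "opus-4.7"


def route(category: str) -> str:
    """Return the model label for a review category (case-insensitive)."""
    return ROUTES.get(category.strip().lower(), DEFAULT_MODEL)


for category in ("lint", "Architecture", "security"):
    print(f"{category} -> {route(category)}")
```

Sending unknown categories to the more expensive model is a conservative default. Flip `DEFAULT_MODEL` if cost matters more than depth for your uncategorised work.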

Where Composer 2.5 still loses

Three honest gaps. None of them are fatal, but they are worth pricing in before you commit to a wholesale shift.

  1. Fast-tier pricing is the silent killer of the headline number. If your team's coding-agent usage is 80% interactive, the realised input saving versus Opus 4.7 is about 2×, not 10×, and at 100% interactive it bottoms out at ~1.7×. Be honest about your usage mix.
  2. Long-horizon agent work still favours Opus 4.7. On the kind of multi-hour, multi-file refactor where the agent has to keep track of intent across dozens of steps, Claude Code still feels more reliable. Composer 2.5 occasionally loses the plot on tasks longer than 20-30 tool calls. This is improving fast, but it is real today.
  3. Cursor-only distribution. Composer 2.5 is not available as a general-purpose API outside the Cursor IDE. If your team's agents run in CI, in a custom IDE plug-in, or inside a non-Cursor product, the model is not accessible to you. Compare with Opus 4.7 and GPT-5.5, both of which ship via standard chat-completion APIs.

The first two will narrow with the next version. The third is a strategic call by Cursor: they want the model to be a reason to use Cursor, not a reason to use any IDE. That is a legitimate position, but it is also a form of lock-in. It is worth factoring into any procurement decision, especially given Cursor's $60B SpaceX acquisition-option deal and the broader question of what happens to pricing power after that closes.

Buyer's call: when to switch, when to wait

The decision is workload-shape sensitive, not vendor-loyalty sensitive. Run the analysis honestly and the answer drops out.

Switch immediately if: your team is already inside Cursor, the bulk of your coding-agent spend is on background or batch jobs, and you are not running a separate agent framework that would need rewiring. The standard-tier saving will pay for itself in week one. Indian shops on Cursor Pro have the cleanest version of this case — the rupee cost of Opus 4.7 was already a stretch for many teams, and a 10× saving on the background portion of the workload is genuinely material.

Wait and split-route if: you run mixed workloads with a meaningful long-horizon component, or your team is split across Cursor and the wider coding-agent landscape like Aider, Cline, Continue and Claude Code. Use Composer 2.5 for well-scoped, high-volume tasks; keep Opus 4.7 in rotation for the architectural and multi-step work. The blended saving is still significant and you keep optionality. Most UK teams we have spoken to are landing here.

Stay on Claude Code if: your agent framework is bespoke and lives outside the Cursor IDE, your usage is dominated by long-horizon multi-file tasks, or you need a model accessible via standard chat-completion APIs across multiple front-ends. Composer 2.5 is not the right tool for that shape of work today.
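The three cases above can be written down as a small decision function. This is a sketch under stated assumptions. The article gives no numeric cut-offs, so the thresholds for "bulk" and "meaningful" below are hypothetical, as are the field names.

```python
from dataclasses import dataclass

# Illustrative thresholds; the article gives no numeric cut-offs.
DOMINANT = 0.5    # assumption: "bulk" or "dominated by" means over half
MEANINGFUL = 0.2  # assumption: a "meaningful" long-horizon component


@dataclass(frozen=True)
class TeamProfile:
    all_in_cursor: bool        # whole team works inside the Cursor IDE
    background_share: float    # fraction of agent spend on background/batch
    long_horizon_share: float  # fraction of multi-hour, multi-file work
    needs_external_api: bool   # bespoke framework or non-Cursor front-ends


def recommend(team: TeamProfile) -> str:
    """Map a team profile onto the three buyer cases described above."""
    if team.needs_external_api or team.long_horizon_share > DOMINANT:
        return "stay on Claude Code"
    if not team.all_in_cursor or team.long_horizon_share >= MEANINGFUL:
        return "split-route"
    if team.background_share > DOMINANT:
        return "switch"
    # Assumption: teams the article does not cover get the cautious middle.
    return "split-route"


print(recommend(TeamProfile(True, 0.8, 0.0, False)))
```

The order of checks matters: hard blockers such as API access are tested first, so a cheap background workload never overrides them.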

One bigger-picture observation: this launch is exactly the dynamic the record Q1 2026 VC funding round made inevitable. Capital pouring into AI tooling means the model floor keeps rising and the price floor keeps falling. We are now firmly in the era where a coding agent at frontier-coding parity costs $0.50 per million input tokens. Twelve months ago that price was $15. The buying playbook has to keep up.

The GPT-5.5 launch earlier this quarter set the pace; Composer 2.5 just pulled the price floor down hard. Expect Opus 4.7 to respond on either price or capability inside 60 days. For now, the buyer's playbook is: split-route, audit your usage mix monthly, and do not assume the leaderboard at the top of this article will look the same in August.