Does Composer 2.5 actually beat Claude Opus 4.7 and GPT-5.5?

Cursor's own benchmarks claim Composer 2.5 matches Opus 4.7 and GPT-5.5 across three coding suites at significantly lower cost. These are first-party numbers. Treat them as a hypothesis to test against your own task distribution before you change your default model — not a settled result.

What is Composer 2.5 built on?

Cursor states Composer 2.5 is built on Kimi K2.5 and trained on roughly 25 times more synthetic coding tasks than its predecessor, Composer 2. It is tuned for file edits, terminal commands and MCP-style tool calls inside Cursor's agent mode, with emphasis on intent understanding and tool selection.

What does the @cursor/sdk let me build?

The @cursor/sdk is a TypeScript package (public beta since 29 April 2026) that exposes the same agent runtime powering the Cursor desktop app, CLI and web app. Agents can run locally or in Cursor's cloud, where each agent gets its own VM with your repository already cloned — useful for CI checks, batch refactors and embedding a coding agent in your own product.

How does Cursor compare to Antigravity, Kiro and Cline?

Cursor competes with Google's Antigravity and Amazon's Kiro in the agent-first IDE space, and with Cline on the embeddable-runtime side. Each makes a different trade on model lock-in, deployment and openness. Pick on your constraints — cloud provider, data residency and whether you need to embed the runtime in your own product.

Cursor Composer 2.5 and the Cursor SDK: What Changed

What you need to know

Composer 2.5 shipped around 18 May 2026, built on Kimi K2.5 and — per Cursor — trained on roughly 25 times more synthetic coding tasks than Composer 2.
Cursor published its own benchmarks comparing Composer 2.5 against Composer 2, Claude Opus 4.7 and GPT-5.5 across three suites, claiming it matches the frontier models on coding at significantly lower cost.
Those numbers are first-party. "Matches at lower cost" is a claim to verify, not a fact to repeat. Run it against your own task distribution before you switch your default model.
The @cursor/sdk entered public beta on 29 April 2026 — a TypeScript package that exposes the same agent runtime powering the Cursor desktop app, CLI and web app, runnable locally or in Cursor's cloud.

Pro tip

Before you read a single vendor benchmark chart, write down the five tasks your team actually ships most weeks — a typed API endpoint, a migration, a flaky-test fix, a small refactor, a bug triage. Score any new model against those, not the suite the vendor chose. A model that wins on a public benchmark and loses on your real work is a worse buy than the reverse.

What Composer 2.5 is, in plain terms

Composer is Cursor's in-house coding model, designed to run inside Cursor's agent mode rather than to be a general-purpose chat model. Version 2.5, released around 18 May 2026, is built on Kimi K2.5 — the open-weight base from Moonshot AI — and, according to Cursor, was trained on approximately 25 times more synthetic coding tasks than Composer 2. The headline is not raw model size; it is task specialisation. Composer 2.5 is tuned specifically for the three things an agent does all day inside an editor: applying file edits, running terminal commands, and calling MCP-style tools. Cursor's framing puts the emphasis on intent understanding and tool selection — knowing which tool to reach for and when, which is where most agent loops actually fail.

That specialisation matters for the cost argument. A model that has been drilled on tool-use patterns can complete a task in fewer, better-chosen steps. Fewer steps mean fewer tokens, and fewer tokens mean lower cost, independent of the per-token price. So when Cursor says Composer 2.5 "matches at lower cost", part of that is a cheaper model and part of it is a tighter loop. A Builder evaluating it needs to separate the two effects, because only one of them transfers automatically: the per-token price applies to every request you send, while the step savings appear only if your tasks resemble the ones the model was drilled on.
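
To see the two effects separately, it helps to write the cost of one task as steps times tokens per step times price per token. The sketch below does that in TypeScript. Every number in it is a made-up placeholder, and the names ModelRun and compareCost are illustrative only, not anything Cursor publishes.

```typescript
// All numbers are hypothetical placeholders; substitute measurements from your own runs.
interface ModelRun {
  pricePerMillionTokens: number; // blended token price, in any one currency
  tokensPerStep: number;         // average tokens consumed by one agent step
  stepsPerTask: number;          // average steps the agent needs to finish a task
}

// Cost of one task = steps x tokens per step x price per token.
function costPerTask(run: ModelRun): number {
  return (run.stepsPerTask * run.tokensPerStep * run.pricePerMillionTokens) / 1_000_000;
}

interface CostBreakdown {
  priceRatio: number; // below 1 means the candidate is cheaper per token
  loopRatio: number;  // below 1 means the candidate uses fewer tokens per task
  totalRatio: number; // product of the two: candidate cost divided by incumbent cost
}

function compareCost(candidate: ModelRun, incumbent: ModelRun): CostBreakdown {
  const priceRatio = candidate.pricePerMillionTokens / incumbent.pricePerMillionTokens;
  const loopRatio =
    (candidate.stepsPerTask * candidate.tokensPerStep) /
    (incumbent.stepsPerTask * incumbent.tokensPerStep);
  return { priceRatio, loopRatio, totalRatio: priceRatio * loopRatio };
}

const incumbent: ModelRun = { pricePerMillionTokens: 10, tokensPerStep: 4000, stepsPerTask: 12 };
const candidate: ModelRun = { pricePerMillionTokens: 4, tokensPerStep: 4000, stepsPerTask: 9 };

// priceRatio is 0.4, loopRatio is 0.75, totalRatio is about 0.3
const breakdown = compareCost(candidate, incumbent);
```

The price ratio holds for any workload on the same price list. The loop ratio is the figure to re-measure on your own tasks, because it depends on how many steps the agent actually needs in your codebase.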

Reading Cursor's benchmark claims like a Builder, not a buyer

Cursor published benchmark comparisons of Composer 2.5 against Composer 2, Claude Opus 4.7 and GPT-5.5 across three suites, and claims it matches Opus 4.7 and GPT-5.5 on coding benchmarks at significantly lower cost. Read that sentence carefully: every clause in it is Cursor's. These are Cursor's own benchmarks, run on Cursor's chosen suites, scored by Cursor. That does not make them dishonest — first-party benchmarks are normal and often the only timely data available — but it does change what you should do with them.

Three questions cut through most vendor benchmark noise:

What is the task distribution? A model can ace algorithmic puzzles and stumble on a sprawling enterprise monorepo with bespoke conventions. The relevant question is whether the benchmark tasks resemble your codebase — a Spring Boot service in Pune, a Rails monolith in Manchester, a React Native app maintained across both.
What does "matches" mean numerically? Cursor has not published a result that lets you say Composer 2.5 beats Opus 4.7 or GPT-5.5 as fact, and you should not state it as such. "Matches within the margin Cursor measured" is the honest reading.
What is the real cost delta on your traffic? "Significantly lower cost" is a relative claim. Translate it into your own currency and your own monthly volume before it means anything to a finance team in Bengaluru or London.
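
The third question is plain arithmetic once you have a measured cost per task. A minimal sketch follows, with a hypothetical monthly volume, hypothetical per-task costs and a hypothetical exchange rate; none of these figures come from Cursor or from any price list.

```typescript
// Hypothetical inputs: replace with your measured cost per task, your volume and your FX rate.
interface CostInputs {
  tasksPerMonth: number;
  incumbentCostPerTaskUsd: number;
  candidateCostPerTaskUsd: number;
  usdToLocal: number; // USD to INR, USD to GBP, or whatever your finance team reports in
}

interface MonthlyDelta {
  incumbentLocal: number;
  candidateLocal: number;
  savingLocal: number;   // positive means the candidate is cheaper
  savingPercent: number; // saving as a percentage of the incumbent bill
}

function monthlyDelta(input: CostInputs): MonthlyDelta {
  const incumbentLocal = input.tasksPerMonth * input.incumbentCostPerTaskUsd * input.usdToLocal;
  const candidateLocal = input.tasksPerMonth * input.candidateCostPerTaskUsd * input.usdToLocal;
  const savingLocal = incumbentLocal - candidateLocal;
  // Guard against dividing by zero when the incumbent bill is empty
  const savingPercent = incumbentLocal === 0 ? 0 : (savingLocal / incumbentLocal) * 100;
  return { incumbentLocal, candidateLocal, savingLocal, savingPercent };
}

// Example: 2,000 agent tasks a month at an assumed rate of 80 local units per USD
const delta = monthlyDelta({
  tasksPerMonth: 2000,
  incumbentCostPerTaskUsd: 0.5,
  candidateCostPerTaskUsd: 0.125,
  usdToLocal: 80,
});
// delta.incumbentLocal is 80000, delta.candidateLocal is 20000,
// delta.savingLocal is 60000, delta.savingPercent is 75
```

A percentage alone rarely moves a budget conversation; the absolute monthly figure in local currency usually does.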

Watch out

Do not let a vendor benchmark become your migration decision. The number that should move you is the one your own eval harness produces on a frozen set of 30–50 real tasks from your backlog, scored blind. If you do not have that harness yet, building it is a higher-leverage investment than any single model switch — it makes every future model decision a measured one rather than a vibe.

How to actually evaluate "matches at lower cost"

Here is a pragmatic, vendor-neutral protocol any team can run in an afternoon. Freeze a set of representative tasks — pulled from closed pull requests, not invented — and capture, for each candidate model, three things: pass rate (did the change satisfy the spec and tests), human-edit distance (how much did a reviewer have to fix afterwards), and total cost per task. The edit-distance metric is the one teams skip and the one that matters most: a model that "passes" but leaves a mess your senior engineer has to untangle is not cheaper, it has just moved the cost off the invoice and onto a salary.
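
A minimal sketch of the scoring step, assuming you have already recorded, for each task, whether the tests passed, what the model produced, what was finally merged and what the run cost. Edit distance here is a plain line-level Levenshtein distance, which is one reasonable proxy for reviewer effort rather than a standard metric. The TaskResult shape is hypothetical.

```typescript
// Hypothetical record of one model attempt at one frozen task.
interface TaskResult {
  taskId: string;
  passed: boolean;      // did the change satisfy the spec and tests
  modelOutput: string;  // the file content the model produced
  mergedOutput: string; // the content that was merged after human review
  costUsd: number;      // total spend for this attempt
}

// Line-level Levenshtein distance: the number of line insertions,
// deletions and substitutions needed to turn `a` into `b`.
function lineEditDistance(a: string, b: string): number {
  const x = a.split("\n");
  const y = b.split("\n");
  let prev: number[] = Array.from({ length: y.length + 1 }, (_, j) => j);
  for (let i = 1; i <= x.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= y.length; j++) {
      const substitution = prev[j - 1] + (x[i - 1] === y[j - 1] ? 0 : 1);
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, substitution));
    }
    prev = curr;
  }
  return prev[y.length];
}

interface ModelScore {
  passRate: number;         // fraction of tasks that passed
  meanEditDistance: number; // average lines a reviewer had to change
  meanCostUsd: number;      // average spend per task
}

function scoreModel(results: TaskResult[]): ModelScore {
  if (results.length === 0) throw new Error("no results to score");
  const n = results.length;
  const passes = results.filter((r) => r.passed).length;
  const edits = results.reduce((sum, r) => sum + lineEditDistance(r.modelOutput, r.mergedOutput), 0);
  const cost = results.reduce((sum, r) => sum + r.costUsd, 0);
  return { passRate: passes / n, meanEditDistance: edits / n, meanCostUsd: cost / n };
}
```

Run scoreModel once per candidate over the same frozen task set and compare the three numbers side by side. Keeping edit distance as its own column, rather than folding it into a single score, makes the cost that moved onto a salary visible.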

Run each model on the identical task set, score blind where you can, and only then compare. A model that wins on cost-per-task and loses on edit distance is a false economy for high-stakes code and a bargain for throwaway scaffolding. Most teams will land on a routing answer rather than a single winner: cheap-and-specialised for boilerplate and well-specified edits, frontier-and-expensive for the gnarly, ambiguous work. Composer 2.5's pitch is precisely that it widens the band of work where the cheap option is good enough.
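
The routing answer can start as a few explicit rules. The sketch below is one possible shape, not a recommendation: the TaskProfile fields and the default file-count threshold are assumptions for illustration, and the right cut-offs should come out of your own eval results.

```typescript
// Which class of model a task is sent to.
type Tier = "specialised" | "frontier";

// Hypothetical description of a task, filled in by whoever files it.
interface TaskProfile {
  wellSpecified: boolean; // is there a clear spec and a test to satisfy
  highStakes: boolean;    // payments, auth, migrations on live data and similar
  touchesFiles: number;   // rough estimate of files the change will touch
}

// Route cheap-and-specialised only when the task is low-risk,
// clearly specified and small. Everything else goes to the frontier tier.
function routeTask(task: TaskProfile, maxFilesForSpecialised = 5): Tier {
  if (task.highStakes) return "frontier";
  if (!task.wellSpecified) return "frontier";
  if (task.touchesFiles > maxFilesForSpecialised) return "frontier";
  return "specialised";
}

// A small, well-specified edit goes to the cheaper tier.
const boilerplate = routeTask({ wellSpecified: true, highStakes: false, touchesFiles: 2 });
// An ambiguous change goes to the frontier tier even when it is small.
const ambiguous = routeTask({ wellSpecified: false, highStakes: false, touchesFiles: 1 });
```

Widening the band where the cheap option is good enough then amounts to raising the threshold, or relaxing a rule, once your measurements support it.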

The @cursor/sdk: the runtime leaves the editor

The more structurally interesting release is the SDK. The @cursor/sdk entered public beta on 29 April 2026 as a TypeScript package that exposes the same agent runtime that powers the Cursor desktop app, the Cursor CLI and the Cursor web app. In other words, the loop you have been using interactively inside the editor is now something you can call from your own code.

The deployment model is the part Builders should study. Agents created through the SDK can run locally or in Cursor's cloud. In the cloud path, each agent gets its own virtual machine with the repository already cloned — so a fan-out of agents can work in parallel on isolated checkouts without trampling each other, and without you provisioning the compute. That is the difference between "I can script my editor" and "I can run fifty coding agents against fifty branches in CI and collect the results".
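
The fan-out itself does not depend on any vendor API. The sketch below assumes you supply a function that dispatches one agent for one branch and resolves with its result; that function, called AgentRunner here, is a hypothetical stand-in and not part of @cursor/sdk. The helper caps how many agents run at once and records failures per branch instead of aborting the batch.

```typescript
// Hypothetical: whatever call you use to dispatch one agent against one branch.
type AgentRunner<R> = (branch: string) => Promise<R>;

interface BranchOutcome<R> {
  branch: string;
  ok: boolean;
  result?: R;     // set when the agent finished
  error?: string; // set when the agent threw
}

// Run one agent per branch, at most `maxConcurrent` at a time.
// Outcomes come back in the same order as the input branches.
async function fanOut<R>(
  branches: string[],
  run: AgentRunner<R>,
  maxConcurrent: number,
): Promise<BranchOutcome<R>[]> {
  if (maxConcurrent < 1) throw new Error("maxConcurrent must be at least 1");
  const outcomes = new Array<BranchOutcome<R>>(branches.length);
  let next = 0; // index of the next branch to claim

  // Each worker keeps claiming branches until none are left.
  async function worker(): Promise<void> {
    while (next < branches.length) {
      const index = next++;
      const branch = branches[index];
      try {
        outcomes[index] = { branch, ok: true, result: await run(branch) };
      } catch (err) {
        const message = err instanceof Error ? err.message : String(err);
        outcomes[index] = { branch, ok: false, error: message };
      }
    }
  }

  const workerCount = Math.min(maxConcurrent, branches.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return outcomes;
}
```

In CI you would pass the list of branches, a runner that wraps your agent dispatch, and a concurrency cap that matches your quota, then fail the job if any outcome has ok set to false.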

Local versus cloud: the trade-off

The choice between the two execution paths is the practical decision the SDK forces, and it maps cleanly onto constraints Indian and UK teams already think about.

Local keeps the checkout and command execution on your own machines, which matters for teams with strict data-residency rules, or with sensitive repositories that fall under India's DPDP Act or UK GDPR. You own the compute and the latency is predictable. Whether anything leaves your network still depends on where model inference runs, so confirm that with Cursor before you treat the local path as fully contained. The cost is that you provision and babysit the machines.
Cloud gives you elastic parallelism and a per-agent VM with the repo pre-cloned, which is ideal for batch refactors, large-scale codemods and CI-time checks that would saturate a laptop. The trade is that your code transits and runs on Cursor's infrastructure — a procurement and compliance conversation you should have before the pilot, not after.

// Illustrative shape only: the class, method and option names below are not a verified API.
import { Cursor } from "@cursor/sdk";

// Read the API key from the environment rather than hard-coding it
const cursor = new Cursor({ apiKey: process.env.CURSOR_API_KEY });

// Cloud execution: an isolated VM with the repo already cloned
const agent = await cursor.agents.create({
  model: "composer-2.5",
  repo: "github.com/acme/payments-service",
  target: "cloud",            // or "local"
  task: "Add a /healthz endpoint that checks the Postgres connection, plus a test.",
});

// Stream events while the agent works, for rendering in a UI or logging in CI
for await (const event of agent.stream()) {
  if (event.type === "tool_call")   console.log("tool:", event.name);
  if (event.type === "file_change") console.log("changed:", event.path);
}

// Wait for the final outcome once the stream has ended
const result = await agent.result();
console.log("status:", result.status);
The snippet above is illustrative of the shape, not a verbatim API contract — check Cursor's current documentation before you ship. The pattern is what matters: declare a model, point at a repository, choose local or cloud, dispatch a task, and consume a stream of events you can render in your own product or pipe into CI.
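
Consuming the stream is the part you can design before the SDK's API settles. The sketch below assumes the two event shapes used in the illustrative snippet, tool_call with a name and file_change with a path; the AgentEvent type is therefore hypothetical. It folds any async stream of such events into a summary and flags files changed outside an allow-list, which is the kind of check a CI job might gate on.

```typescript
// Hypothetical event shapes, mirroring the illustrative snippet rather than a documented contract.
type AgentEvent =
  | { type: "tool_call"; name: string }
  | { type: "file_change"; path: string };

interface RunSummary {
  toolCalls: Record<string, number>; // how many times each tool was called
  changedFiles: string[];            // unique paths, in first-seen order
}

function emptySummary(): RunSummary {
  return { toolCalls: {}, changedFiles: [] };
}

// Pure reducer: returns a new summary with one event folded in.
function applyEvent(summary: RunSummary, event: AgentEvent): RunSummary {
  switch (event.type) {
    case "tool_call": {
      const count = (summary.toolCalls[event.name] ?? 0) + 1;
      return { ...summary, toolCalls: { ...summary.toolCalls, [event.name]: count } };
    }
    case "file_change":
      return summary.changedFiles.includes(event.path)
        ? summary
        : { ...summary, changedFiles: [...summary.changedFiles, event.path] };
  }
}

// Drain any async event source into a summary.
async function summarise(events: AsyncIterable<AgentEvent>): Promise<RunSummary> {
  let summary = emptySummary();
  for await (const event of events) {
    summary = applyEvent(summary, event);
  }
  return summary;
}

// Paths the agent changed that do not start with any allowed prefix.
function outOfScope(summary: RunSummary, allowedPrefixes: string[]): string[] {
  return summary.changedFiles.filter(
    (path) => !allowedPrefixes.some((prefix) => path.startsWith(prefix)),
  );
}
```

Keeping the reducer pure means the same function can drive a live progress view in a product and a pass or fail gate in a pipeline.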

Where this sits in the agentic-IDE landscape

Cursor is not alone, and the comparison sharpens the decision. Google's Antigravity and Amazon's Kiro are the two big-cloud entrants in the agent-first IDE category, while Cline attacks the same ground from the open-source, embeddable-runtime side. The table below uses only the facts established in this piece — read it as a decision frame, not a leaderboard.

Tool	Model basis	SDK?	Deployment
Cursor (Composer 2.5)	In-house Composer 2.5, built on Kimi K2.5	Yes — @cursor/sdk (TypeScript, public beta)	Local or Cursor cloud (per-agent VM, repo pre-cloned)
Antigravity (Google)	Google models	Agent-first IDE platform	Google's platform
Kiro (Amazon)	Amazon's agentic IDE stack	Agent-first IDE	Amazon's platform
Cline	Provider-neutral (bring your own key)	Yes — open-source @cline/sdk runtime	Self-host / embed

The honest reading: if you are already deep in a hyperscaler, Antigravity or Kiro may be the lower-friction default because they sit inside a cloud relationship you have already signed. If you want a provider-neutral, fully open runtime you can embed and self-host, Cline is the cleaner fit. Cursor's position is the middle path — a polished, opinionated editor with a strong in-house model and an SDK that lets you take the runtime out of the editor and into your own pipelines and products. None of these is "best" in the abstract; the right one falls out of your cloud, your data-residency posture, and whether you need to embed the runtime at all.

The wider context: tools, protocols and frameworks are converging

Composer 2.5's emphasis on MCP-style tool calls is not incidental. The Model Context Protocol has now crossed roughly 200 server implementations, which means an agent that calls tools well has a genuinely large surface to act on — databases, issue trackers, internal services — without bespoke glue for each one. On the orchestration side, LangGraph v1.2 shipped in May 2026, continuing the trend of treating multi-step agent control flow as something you compose explicitly rather than pray over. The picture across all of this is the same: the model is becoming one component in a stack where the protocol layer, the runtime layer and the orchestration layer are all maturing at once. A Builder who treats the model choice as the whole decision is optimising one variable in a four-variable system.

What an Indian or UK team should actually do this month

Concretely: if you already pay for Cursor, the Composer 2.5 evaluation is nearly free — flip your default to it on a branch, run your frozen task set, and measure pass rate, edit distance and cost-per-task against your incumbent. If the numbers hold up on your work, route the well-specified majority of tasks to Composer 2.5 and reserve a frontier model for the ambiguous, high-stakes minority. If you are building a product that needs a coding agent inside it — an internal review bot in Gurugram, a migration service for a London consultancy — spike the @cursor/sdk against a real repository, decide local versus cloud on your data-residency constraints first, and only then judge the developer experience.

The meta-point survives any single release: the teams that win this cycle are the ones with a measurement habit, not the ones chasing the latest chart. Composer 2.5 may well be a genuinely good deal for a large band of everyday coding work. The way you find out is your own eval harness — and once you have built it, every future model announcement becomes a quick experiment instead of a leap of faith.

Cursor's release notes and benchmark methodology are on the Cursor blog; independent coverage of the release, pricing and benchmarks is at beyondtmrw.org.