The 2026 computer-use moment
Computer-use agents have been the hardest demo in AI to keep honest since the first screen-grab prototypes shipped in late 2024. For most of 2025 the consensus was that the category was interesting but brittle — the models could click, type and screenshot, but the failure modes were stark enough that nobody outside a research lab let one anywhere near a real workflow. The first five months of 2026 changed that. By mid-May we have three production-grade computer-use offerings from the three frontier labs, each one shipped with a deliberately different architecture and a deliberately different target customer.
Anthropic's bet, which it has refined steadily since the original computer-use beta, is the most architecturally pure. Claude Computer Use is a portable tool: the model receives a screenshot, returns mouse and keyboard actions, and the harness around it can sit on top of anything — a Linux VM, a Docker container, a remote Windows desktop, a Chromium-based kiosk. The model carries the intelligence; the substrate is interchangeable. Paired with Claude Opus 4.7's 1M-token window, the design specifically optimises for long-horizon tasks where the agent needs to remember every step it has already taken on an unfamiliar machine.
OpenAI's bet, shipped on 16 April 2026 as part of the "Codex for almost everything" announcement, is the opposite. Codex Background Computer Use does not abstract the desktop — it is the desktop, specifically a real macOS environment running in the background of an engineer's laptop. It opens its own apps, edits its own files, drives its own browsers, and crucially does this in parallel with whatever the human in front of the screen is doing. The bet is that engineers want a colleague who can borrow a virtual seat at the same machine, not a robot that lives in a sandbox.
Google's bet, which evolved out of the Project Mariner research programme, is browser-shaped. Gemini Computer Use does not see pixels first; it sees the DOM, the form fields, the link targets, the structured content of any web page. It will still operate on a screenshot when it has to, but its strongest behaviours are the ones where it can reason about the page as a structured artefact. For workloads that live on the web — and most enterprise workloads do — Mariner's lineage is a meaningful advantage.
The three bets in a sentence each:
- Anthropic: make the tool portable, let the model carry the intelligence.
- OpenAI: make a real macOS user that runs in the background.
- Google: make the browser the universal interface.
The three at a glance
One table to bring to the architecture review. Each column captures the choice you are actually making when you pick a vendor; the OSWorld snapshot is the headline number, not the whole story.
| Vendor | Architecture | OS scope | Trust model | Best for | OSWorld snapshot |
|---|---|---|---|---|---|
| Anthropic Claude Computer Use | Portable {screenshot + mouse + keyboard} tool | Anything that runs a VM or container | VM-isolated, snapshot-and-replay | Long-horizon, cross-app, regulated | Strong on file ops and research |
| OpenAI Codex Background | Native background session on real OS | macOS today, Windows + Linux on roadmap | Runs as your user — same blast radius | Parallel developer-loop tasks | Leads on developer-flavour benchmarks |
| Google Gemini Computer Use | DOM-aware browser agent | Anywhere Chromium runs | Browser-profile sandbox, real-time human intervention | Web automation, form filling, research | Leads on browser-automation categories |
Resist the urge to choose on OSWorld headline scores. The benchmark is genuinely useful, but the variance between task categories within it is larger than the variance between vendors on any single category. Score the vendor on the categories that match your actual workload — file ops, browser automation, form filling, research — not the global average.
Anthropic Claude Computer Use — portable tool, deep reasoning
Anthropic's design treats the computer like any other tool. You declare a tool, give it a name, and the model can call it with structured arguments. The novel piece is the set of primitives Claude knows how to emit, chiefly screenshot, left_click at a coordinate, and type for text (with key for key combinations). The harness, the code you write, is responsible for taking those actions on whatever desktop you have wired up.
The implication is significant. Because the harness is yours, the substrate is yours. A bank in Mumbai can run the harness inside a hardened Linux VM that never touches the public internet. A Manchester healthcare provider can pin the harness to a Citrix-published Windows session and inherit existing audit. A Bengaluru BPO can spin up fifty disposable Docker desktops for parallel form-filling work without touching anyone's laptop. The substrate is whatever the compliance team already trusts.
The invocation pattern is small enough to fit on a slide:
```python
from anthropic import Anthropic

client = Anthropic()

# Computer use has shipped as a beta tool; check the current docs for the
# tool version and beta flag that match the model you are calling.
resp = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=4096,
    tools=[{
        "type": "computer_20250124",
        "name": "computer",
        "display_width_px": 1280,
        "display_height_px": 800,
    }],
    messages=[{
        "role": "user",
        "content": "Open the invoices folder, find unpaid items "
                   "from April 2026 and export a CSV to ~/Desktop.",
    }],
)

# resp.content includes tool_use blocks; your harness
# executes them on the target VM and feeds screenshots back.
# `harness` is an object you write yourself, not part of the SDK.
for block in resp.content:
    if block.type == "tool_use":
        action = block.input  # {"action": "screenshot"} etc.
        result = harness.execute(action)
        # Loop: send result back as tool_result
```
What you get for the abstraction tax is twofold. First, portability: the same agent code runs against a Linux VM today and a Windows VM tomorrow with no model change. Second, the 1M-token context window — the agent can carry long multi-step sessions of screenshots, decisions and outcomes within its working context, which is a meaningful difference on long-horizon tasks where the lesson learned in step three is what unblocks step forty.
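The snippet above stops at the comment about sending the result back. A minimal sketch of that return leg follows, assuming the harness hands you the screenshot as PNG bytes; the helper name `tool_result_message` is hypothetical, while the message shape follows the Messages API convention of answering a tool_use block with a tool_result that carries a base64 image.

```python
import base64


def tool_result_message(tool_use_id: str, screenshot_png: bytes) -> dict:
    """Wrap a harness screenshot as the user turn that answers a tool_use block."""
    return {
        "role": "user",
        "content": [{
            "type": "tool_result",
            "tool_use_id": tool_use_id,
            "content": [{
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": base64.b64encode(screenshot_png).decode("ascii"),
                },
            }],
        }],
    }


# Inside the loop you would append the assistant turn, then this result,
# and call the model again:
# messages.append({"role": "assistant", "content": resp.content})
# messages.append(tool_result_message(block.id, png_bytes))
```

Every iteration adds one assistant turn and one image-bearing user turn, which is why the context window matters on long sessions.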
OpenAI Codex Background — macOS-native parallel sessions
OpenAI's 16 April 2026 announcement made one architectural commitment very clear: the computer-using agent should feel like a colleague who has been given their own seat at your machine. Codex Background does not run in a VM at someone else's data centre; it runs on macOS, in the background, as a process that opens its own apps, drives its own browsers and edits its own files.
The headline pattern is parallel dispatch. An engineer kicks off three Codex sessions before lunch — one resolving a long-running TypeScript refactor, one preparing a release-note draft, one chasing a flaky test through CI logs — and goes back to whatever they were doing in the foreground. The sessions run, surface progress, and only pull the engineer's attention when something needs a decision. The pitch is not "the agent solves it for you"; it is "the agent runs in parallel and you get more shots on goal per hour".
The dispatch pattern looks roughly like this:
```python
from openai import OpenAI

client = OpenAI()

tasks = [
    "Refactor src/billing/ to remove the old InvoiceV1 path. "
    "Run the full test suite. Open a PR.",
    "Read RELEASES.md, draft v2.4.0 notes from git log v2.3.0..HEAD. "
    "Save as RELEASES_DRAFT.md.",
    "Re-run the flaky e2e suite three times. Capture stack traces "
    "and screenshots. Open an issue if it fails consistently.",
]

sessions = []
for t in tasks:
    s = client.codex.background.sessions.create(
        model="codex-1",
        instructions=t,
        computer_use=True,          # drive the GUI when needed
        workspace="~/code/ledger",  # native macOS path
    )
    sessions.append(s)

# Sessions stream events; the dashboard lets you peek without
# stealing focus from your primary workspace.
```
The cost of the design is also stark. Codex Background is macOS-first as of May 2026, with Windows and Linux on the roadmap but no published date. For teams standardised on Ubuntu developer fleets or Windows endpoints — which is most of the regulated UK estate and a large slice of the Indian GCC market — this is a blocker, not a preference. The other cost is the trust model: the agent runs as your user, with your permissions, on your machine. The blast radius of a mistake is your laptop, not a disposable VM.
Codex Background's parallel-session pattern is genuinely productive, but the trust model is the part to debate first. Running multiple agents as your macOS user means file system, keychain, browser cookies and SSH keys are all in scope. Use a dedicated developer account for agent work, scope keychain access, and never run a background session against an account that has push rights to production.
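One way to make that advice enforceable is a preflight check that refuses to launch a session when obvious credentials are readable from the account's home directory. The sketch below is an assumption throughout: the deny-list, function names and policy are illustrative, and it only looks at files, so keychain entries and browser cookies still need separate handling.

```python
from pathlib import Path

# Hypothetical deny-list; extend it to match your own environment.
SENSITIVE_PATHS = (".ssh/id_rsa", ".ssh/id_ed25519", ".aws/credentials", ".netrc")


def credentials_in_scope(home: Path) -> list[str]:
    """Return the listed credential files an agent running as this user could read."""
    return [rel for rel in SENSITIVE_PATHS if (home / rel).is_file()]


def safe_to_launch(home: Path) -> bool:
    """Allow a background session only when no listed credential is present."""
    return not credentials_in_scope(home)
```

Run it against the dedicated agent account's home directory, for example `safe_to_launch(Path.home())`, before dispatching any sessions.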
Google Gemini Computer Use — browser as the universal interface
Project Mariner started as a Google DeepMind research effort focused specifically on what a model could do if it was given structured access to the browser. The lineage matters because it shaped the design choice that distinguishes Gemini Computer Use today: the agent reads the DOM, not just the pixels. When a form has fifty hidden fields, Mariner's descendant knows it; when a link has an aria-label that contradicts its visible text, the model sees the contradiction; when a single-page app re-renders, the agent doesn't have to re-screenshot to know what changed.
The argument for the browser-as-universal-interface bet is empirical. Most enterprise workflows that look like "desktop automation" are actually browser automation in disguise — the CRM is a web app, the procurement portal is a web app, the GeM tender system is a web app, the NHS Trust supplier portal is a web app. If you can collapse the substrate to "Chromium", you inherit a much smaller and more stable surface than "any desktop, anywhere".
A browser session is the unit of work:
```python
from google import genai

client = genai.Client()

session = client.computer_use.create(
    model="gemini-2.5-computer-use",
    surface="browser",
    start_url="https://supplier-portal.example.nhs.uk",
    human_in_the_loop=True,  # pause for confirmation on writes
)

session.send(
    "Log in with the credentials in secret://nhs-supplier, "
    "find purchase orders dated April 2026, export the list as CSV "
    "and email it to procurement@trust.example.uk."
)

# DOM-aware events stream back: clicks, form fills, network
# requests, validation errors. Intervene at any point.
for evt in session.events():
    if evt.requires_confirmation:
        evt.approve()
```
The real-time human control is a Mariner inheritance and it matters. Gemini Computer Use is designed so that a human can intervene, redirect or stop at any point during an in-flight session. For workloads where a wrong click costs money — invoice payments, regulatory submissions, customer-data writes — this is the difference between "we let it run" and "we let it run with a pilot in the loop".
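A pilot in the loop only helps if approval is a decision rather than a reflex. The sketch below shows one possible gating policy, independent of any vendor SDK: the action vocabulary, the `PendingAction` type and the `operator_approves` callback are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Assumed action vocabulary; a real event stream will use its own names.
WRITE_ACTIONS = {"submit_form", "send_email", "make_payment", "delete_record"}


@dataclass(frozen=True)
class PendingAction:
    kind: str    # e.g. "click", "submit_form"
    target: str  # human-readable description of the element or URL


def needs_human(action: PendingAction) -> bool:
    """Reads and navigation pass through; anything that writes is held."""
    return action.kind in WRITE_ACTIONS


def decide(action: PendingAction, operator_approves) -> str:
    """Return "run", or "stop" when the operator declines a held action."""
    if not needs_human(action):
        return "run"
    return "run" if operator_approves(action) else "stop"
```

In practice `operator_approves` would surface the pending action to a person and block until they answer; the point of the design is that write actions never proceed by default.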
Benchmarking — OSWorld is the shared scoreboard, but task-shape matters
All three vendors publish OSWorld-Verified scores. The benchmark is a useful common reference — it covers a wide spread of desktop and browser tasks, and the verification harness reduces the "we got lucky on the demo" problem. But the global average is not the number you should be optimising against. The category breakdown is.
| OSWorld task category | Typically strongest | Why |
|---|---|---|
| File operations (rename, move, search) | Claude Computer Use | Long context lets the agent track the full directory state across many steps |
| Browser automation (navigate, click, submit) | Gemini Computer Use | DOM-aware grounding beats screenshot-only on dynamic single-page apps |
| Multi-step form filling | Gemini Computer Use | Mariner lineage was explicitly trained on this shape |
| Research (read, synthesise, write artefact) | Claude Computer Use | 1M-token window keeps every source open while writing the output |
| Developer-loop GUI tasks (IDE, terminal, simulator) | Codex Background | Native macOS access plus parallel sessions outpaces sandboxed alternatives |
| Cross-application orchestration | Codex Background (macOS), Claude (other OS) | Native OS hooks beat tool-emulated equivalents |
The honest reading is that the leaderboard scrambles when you slice by workload. A vendor that wins your category by ten points matters more than one that wins the global average by two. For a London regulated-industries client running ledger-reconciliation work across Excel and a legacy thick-client, Claude in a hardened Windows VM is the right answer. For the same client's marketing team running campaign-performance pulls across six SaaS dashboards, Gemini Computer Use wins. These are not the same procurement.
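The arithmetic behind that reading is a weighted average over your own workload mix. The sketch below uses placeholder vendors and placeholder scores, none of which are published results, to show how a vendor can lose the global average and still win the workload.

```python
# Placeholder numbers for illustration only; they are not published results.
category_scores = {
    "vendor_a": {"file_ops": 80, "browser": 60, "forms": 58, "research": 78},
    "vendor_b": {"file_ops": 62, "browser": 74, "forms": 71, "research": 63},
}

# Share of your own workload that falls in each category (sums to 1.0).
workload_mix = {"file_ops": 0.1, "browser": 0.5, "forms": 0.3, "research": 0.1}


def weighted_score(scores: dict[str, float], mix: dict[str, float]) -> float:
    """Average the per-category scores, weighted by workload share."""
    return sum(scores[cat] * share for cat, share in mix.items())


def global_average(scores: dict[str, float]) -> float:
    """Unweighted mean across categories, i.e. the headline number."""
    return sum(scores.values()) / len(scores)


ranked = sorted(
    category_scores,
    key=lambda v: weighted_score(category_scores[v], workload_mix),
    reverse=True,
)
```

With these inputs vendor_a has the higher global average, yet vendor_b ranks first once the browser-heavy mix is applied.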
Trust and control — who you let run on what
The architecture choice maps almost directly onto a security posture. Pixel-perfect demos aside, what your security review actually cares about is blast radius, observability and reversibility.
Claude Computer Use in a disposable VM is the strongest pattern when the agent has to touch unfamiliar systems. The VM is the trust boundary; the agent cannot escape it, and the snapshot-and-replay primitive means you can rewind the entire session if something looks wrong. This is the pattern Bengaluru-based BPO automation teams are converging on for client-facing work, because the client's compliance team can audit the VM image once and then trust every session built on top of it.
Gemini Computer Use scoped to a browser profile is the lowest-risk pattern for web work. The profile is a sandbox in itself — no extension access, no file system, no parallel-tab leakage — and the real-time human-in-the-loop intervention model means the worst-case outcome is "the operator catches it and stops the session". For NHS Trust supplier-portal automation and similar regulated-industry web workflows, this is the design that gets through compliance fastest.
Codex Background running against a developer's own macOS account is the configuration that demands the most trust of the three, because the agent acts with that user's full permissions and the blast radius is the laptop in front of you. That is fine for engineering productivity work and it is exactly wrong for anything customer-facing. The pattern that works for production environments is to run Codex against a dedicated macOS user account with no production credentials and no SSH keys, and to review its output the same way you would review an intern's pull request.
"We run Claude in a disposable Linux VM for our regulated BPO clients in Bengaluru, Gemini in a locked-down Chromium for the form-filling work and Codex on the developer laptops only. Same model family for each company, three completely different harnesses. The decision is the harness, not the model." — A Verified Builder · Bengaluru
MCP — the shared rail under all three
The single most underweighted story in this category is the protocol layer underneath. In December 2025 Anthropic donated the Model Context Protocol to the Agentic AI Foundation, a directed fund under the Linux Foundation that now counts 146 member organisations, including AWS, Google, Microsoft and OpenAI. By March 2026 MCP was clocking 97 million monthly SDK downloads. All three computer-use agents in this comparison speak it natively.
What MCP changes for computer-use agents specifically is that the auxiliary tools the agent depends on — a secrets store, an image-recognition helper, a domain-specific data lookup — no longer have to be bundled with any one vendor's harness. You run an MCP server, register it once, and Claude, Codex and Gemini can all call it. For a UK public-sector buyer who wants to be able to substitute model vendors without rewriting the tool layer, this is the procurement-friendly story. For an Indian GCC standardising agent infrastructure across multiple client engagements, it is the cost-control story. We covered the broader interop arc in our AGNTCY interoperability piece; the bit that matters here is that the agentic surface is now genuinely portable across vendors at the tool layer.
That portability does not extend to the screen primitives — Claude's tool, Codex's session API and Gemini's browser surface are still vendor-specific shapes. But it does mean that the capabilities your agent reaches for after it has clicked into an application are no longer locked into any one stack. That changes the cost of switching vendors from "rewrite everything" to "rewrite the click loop, keep the tool layer".
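What keeps the tool layer portable is that an MCP tool invocation is a plain JSON-RPC 2.0 message with the method `tools/call`, whichever agent sends it. The sketch below builds such a request with the standard library only; the tool name `lookup_supplier` and its arguments are hypothetical.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC request ids must be unique per connection


def tools_call_request(name: str, arguments: dict) -> str:
    """Serialise an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    })


# Hypothetical tool name and arguments, for illustration only.
request = tools_call_request("lookup_supplier", {"supplier_id": "S-1042"})
```

In a real deployment an MCP client library handles this framing along with transport and capability negotiation; the sketch only shows why the server does not care which vendor's agent is on the other end.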
Implementation decision tree — pick one in five questions
Compress the matrix into five questions. Answer them in order and the right vendor falls out.
- What is the target operating system in production? If it is anything other than macOS today, Codex Background is off the table until Windows and Linux land. If the production target is a mix, pick a vendor that already supports the mix — Claude (any OS via VM) or Gemini (any OS that runs Chromium).
- Browser-only or full desktop? If 90 percent of the workflow is web, Gemini Computer Use will win on browser tasks and the trust model is cleaner. If the workflow crosses into native apps — Excel macros, a desktop ERP client, a CAD tool — Claude or Codex.
- Do you need parallel sessions? Codex Background was designed around this. Claude supports it cleanly given one VM per session. Gemini supports it cleanly given one browser profile per session.
- Where does the data live, and what is the sovereignty constraint? If the answer is "Mumbai region, RBI-supervised entity" or "UK South, NHS Trust", you need a deployment that fits your existing cloud agreement. Claude via Bedrock and Gemini via Vertex have the broadest regional footprints; Codex Background runs on the laptop and dodges the question for engineering work but not for production.
- What is the trust model the security review will accept? Disposable VM (Claude), browser-profile sandbox with human-in-the-loop (Gemini), or trusted developer machine (Codex). Pick the one your compliance team has already signed off on for a comparable system.
If the questions push you toward more than one vendor, run a two-week parallel pilot. The cost of running two harnesses against the same workload for ten working days is far less than the cost of standardising on the wrong one for a year.
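The elimination logic in the questions above can be written down as a small filter. This is a deliberately simplified sketch: it encodes only the operating-system, browser-versus-desktop and trust-model questions, treats a browser-only workload as leaving every vendor in play, and uses field names and vendor labels that are assumptions rather than anything a vendor publishes.

```python
from dataclasses import dataclass

# Trust model each vendor relies on, as described in this article.
TRUST_MODEL = {
    "claude": "vm",
    "gemini": "browser_profile",
    "codex": "developer_machine",
}


@dataclass
class Workload:
    target_os: str             # "macos", "linux", "windows" or "mixed"
    browser_only: bool         # True when the workflow never leaves the browser
    accepted_trust: set[str]   # trust models the security review will sign off


def shortlist(w: Workload) -> list[str]:
    """Apply the eliminating questions in order and return what survives."""
    candidates = {"claude", "codex", "gemini"}
    if w.target_os != "macos":
        candidates.discard("codex")    # macOS-only today
    if not w.browser_only:
        candidates.discard("gemini")   # workflow crosses into native apps
    candidates = {c for c in candidates if TRUST_MODEL[c] in w.accepted_trust}
    return sorted(candidates)
```

A result with more than one vendor is the signal to run the parallel pilot; an empty list means a constraint has to give before any vendor fits.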
Where each one fails today
Honesty about failure modes is the most useful thing a comparison article can offer. Each vendor has them.
Claude Computer Use still pays a latency tax for the screenshot loop: every action involves taking a fresh image, sending it to the model and waiting for the next instruction. On dense GUI workflows with many small clicks, the round-trip adds up. The portability story is also only as good as your harness; if you write a bad harness, the model cannot rescue you. Anthropic's reference harness is good but not production-grade for every desktop.
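A back-of-envelope model makes the latency tax concrete. The figures below are assumptions chosen for illustration, not measurements of any vendor.

```python
def session_seconds(steps: int, round_trip_s: float, execute_s: float = 0.0) -> float:
    """Total wall-clock time when every step waits on one model round trip."""
    return steps * (round_trip_s + execute_s)


# Assumed figures: 200 small clicks at 3 seconds per model round trip.
total = session_seconds(steps=200, round_trip_s=3.0)
minutes = total / 60
```

Under those assumptions the session spends ten minutes waiting on round trips alone, before any time the desktop itself needs to execute each action.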
OpenAI Codex Background is gated by the macOS-first decision and that is the single biggest practical limitation. Beyond that, the trust model is the real failure mode in the wild: teams give it broader system access than they intended, and the first incident is usually "the agent committed and pushed something it shouldn't have". The fix is process discipline, not a product change. The parallel-session pattern also has an attention cost — six agents running in the background still produce six streams of notifications, and the dashboard does not automatically know which to escalate.
Google Gemini Computer Use is the strongest on browser workflows and meaningfully weaker the moment the workflow leaves the browser. Sites with aggressive anti-automation defences — high-stakes auth flows, banking portals that fingerprint Chromium — still trip it. The DOM-aware grounding also assumes the page has a sensible DOM; a heavy canvas-based app (think Figma) is a hard target. And while real-time human intervention is a feature, in practice it means an operator has to be present, which limits how much of the day you can leave the agent unattended.
What to ship in the next six months on top of these
The vendors will keep moving. The patterns that the builder community is converging on are more interesting than any individual SDK release.
First, harness libraries that abstract the click-loop primitives across vendors: a small open-source layer that exposes a single computer.click(x, y) interface and dispatches to Claude, Codex or Gemini underneath. The substrate moves around enough that the abstraction is worth the cost, particularly for consultancies running multi-client work.
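A minimal sketch of such a layer, assuming nothing about any vendor SDK: the `ClickBackend` protocol, the `Computer` facade and the `RecordingBackend` stand-in are all hypothetical names. A real backend would translate each call into the vendor's own action format.

```python
from typing import Protocol


class ClickBackend(Protocol):
    """What every vendor adapter has to implement."""

    def click(self, x: int, y: int) -> None: ...
    def type_text(self, text: str) -> None: ...


class RecordingBackend:
    """Stand-in backend that records calls; a real one would call a vendor SDK."""

    def __init__(self) -> None:
        self.calls: list[tuple] = []

    def click(self, x: int, y: int) -> None:
        self.calls.append(("click", x, y))

    def type_text(self, text: str) -> None:
        self.calls.append(("type", text))


class Computer:
    """Vendor-neutral facade; agent code depends on this, never on a vendor SDK."""

    def __init__(self, backend: ClickBackend) -> None:
        self._backend = backend

    def click(self, x: int, y: int) -> None:
        self._backend.click(x, y)

    def type_text(self, text: str) -> None:
        self._backend.type_text(text)


backend = RecordingBackend()
computer = Computer(backend)
computer.click(640, 400)
computer.type_text("invoice-2026-04")
```

Swapping vendors then means writing one new backend class; the recording backend doubles as a test double for the agent code above it.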
Second, per-task vendor routing. If you accept that no single vendor wins every OSWorld category, the obvious move is to route browser tasks to Gemini, file-ops tasks to Claude and developer-loop tasks to Codex from inside a higher-level orchestrator. The cost is a small router; the gain is the per-category lift. We covered the orchestration depth required to do this well in Bayesian agentic orchestration.
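The router itself can start as a lookup table. In the sketch below the routes mirror the category breakdown earlier in this article, while the category labels, the fallback choice and the function names are assumptions.

```python
from collections import defaultdict

# Routes follow the category table in this article; labels are assumed names.
ROUTES = {
    "browser": "gemini",
    "form_filling": "gemini",
    "file_ops": "claude",
    "research": "claude",
    "developer_loop": "codex",
}
DEFAULT_VENDOR = "claude"  # assumption: fall back to the most portable option


def route(task_category: str) -> str:
    """Pick the vendor for one task category."""
    return ROUTES.get(task_category, DEFAULT_VENDOR)


def plan(tasks: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Group (category, instruction) pairs by the vendor that should run them."""
    batches: dict[str, list[str]] = defaultdict(list)
    for category, instruction in tasks:
        batches[route(category)].append(instruction)
    return dict(batches)
```

The harder part, which this sketch leaves out, is classifying a free-text task into a category in the first place.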
Third, recordable, replayable sessions. A session that an operator can rewind, branch and re-run is the missing primitive for production trust. The 1M-token context window, which we covered in long-context engineering, makes a full-session replay viable for the first time; the harness is where the work is.
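At the harness level, a recordable session can begin as an append-only log of actions and outcomes. The sketch below is one possible shape, with hypothetical class and method names; it stores steps as JSON Lines and supports the three operations named above (rewind via reload, branch, re-run).

```python
import json
from dataclasses import dataclass, field


@dataclass
class SessionLog:
    steps: list[dict] = field(default_factory=list)

    def record(self, action: dict, outcome: str) -> None:
        """Append one executed action and what happened."""
        self.steps.append({"action": action, "outcome": outcome})

    def dump(self) -> str:
        """Serialise as JSON Lines so the log can be stored and audited."""
        return "\n".join(json.dumps(s, sort_keys=True) for s in self.steps)

    @classmethod
    def load(cls, text: str) -> "SessionLog":
        """Rebuild a log from its JSON Lines form."""
        return cls([json.loads(line) for line in text.splitlines() if line])

    def branch(self, at_step: int) -> "SessionLog":
        """Copy the first `at_step` steps so a new run can diverge from there."""
        return SessionLog([dict(s) for s in self.steps[:at_step]])


def replay(log: SessionLog, execute) -> list[str]:
    """Re-run each recorded action through `execute` and collect the outcomes."""
    return [execute(step["action"]) for step in log.steps]
```

Replay is only faithful if the desktop starts from the same state, which is where a VM snapshot or a fresh browser profile comes in.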
Fourth, skill bundles that travel. A "reconcile two CSVs" skill or a "submit a GeM tender" skill that the agent can load on demand and that works against any of the three vendors. The MCP layer is the right place for this and the early ecosystem is starting to show it. We covered the autopilot pattern in Claude Code autopilot and the agent-SDK landscape in our SDK wars piece; the same logic applies one layer up.
The short read for builders shipping this quarter: stop arguing about which vendor is best in the abstract. The architecture choices each lab made tell you exactly which workloads they will win. Match your workload to the architecture, write your harness against an abstraction you own, and treat the vendor underneath as a swappable runtime. The category is moving too fast for any other strategy to age well.
Primary sources for this piece: docs.anthropic.com/en/docs/build-with-claude/computer-use, openai.com (Codex Background launch, 16 April 2026), deepmind.google (Mariner / Gemini Computer Use) and modelcontextprotocol.io.