What you need to know

  • One checkpoint, three capabilities. M3 is the first open-weight model to put frontier coding, a 1M-token context and native multimodality together in a single release — not three separate models.
  • Coding parity with a closed flagship. MiniMax reports 59.0% on SWE-Bench Pro (on par with GPT-5.5) and 66.0% on Terminal-Bench 2.1. By MiniMax's own leaderboard, that puts M3 at the top of the open-weight field on SWE-Bench Pro.
  • The 1M context is affordable, not a stunt. MiniMax Sparse Attention (MSA) cuts per-token compute to roughly 1/20 of the previous generation at 1M context, with about 9x faster prefill and 15x faster decode.
  • Pricing is the headline. $0.60 input / $2.40 output per million tokens — roughly 5 to 10 percent of closed-source models in its class. A launch promotion halved that to $0.30 / $1.20.
  • It is open weights. You can run it in your own region, on your own GPUs, under your own data-residency rules. That is the part that changes architecture decisions.

For the past two years, the trade-off was uncomfortable. You could have a frontier coding model, but it was closed, metered and lived in someone else's region. Or you could self-host an open-weight model and accept that it lagged the closed flagships on the hard agentic-coding benchmarks that actually predict whether an autonomous PR lands clean. MiniMax M3, released on 1 June 2026 as the company's open-weight flagship, is the clearest sign yet that the trade-off is collapsing.

It is not alone. M3 arrives into a 2026 field where DeepSeek V4, Qwen 3.5 (Apache 2.0), Gemma 4 and Llama 4 already rival closed flagships on individual axes. What sets M3 apart is breadth in one checkpoint: frontier coding scores, a genuinely usable 1M-token window, and multimodality that was baked in from training step zero rather than bolted on afterwards.

The coding numbers

The benchmark that matters most for build-versus-buy is SWE-Bench Pro — the harder, contamination-resistant cut that tracks whether a model can resolve real GitHub issues end to end. MiniMax reports 59.0% on SWE-Bench Pro for M3, which it puts on par with GPT-5.5, and 66.0% on Terminal-Bench 2.1, the agentic shell-and-tool benchmark. By MiniMax's own leaderboard, M3 sits at the top of the open-weight field on SWE-Bench Pro.

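Model                 | Weights | SWE-Bench Pro | Terminal-Bench 2.1 | Input / output ($/M tok)
----------------------|---------|---------------|--------------------|-------------------------
MiniMax M3            | Open    | 59.0%         | 66.0%              | $0.60 / $2.40
GPT-5.5 (closed peer) | Closed  | ~59% (on par) | not listed         | roughly 10–20x M3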

Read those two columns together. Coding parity with a closed flagship is no longer the surprise — Mistral, DeepSeek and Qwen had already narrowed the gap through 2026. The surprise is parity plus open weights plus a price that is roughly 5 to 10 percent of the closed-source bracket. When the cheaper option is also the one you can run inside your own perimeter, the procurement conversation changes shape.

Pro tip

Do not anchor on a single benchmark number. Before you migrate a coding agent, replay 30 to 50 of your own historical issues through M3 and measure the clean-PR rate — the proportion that merge with zero human rework. A model that ties GPT-5.5 on SWE-Bench Pro can still behave differently on your codebase's conventions, and your own replay set is the only benchmark that pays your bills.
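
A minimal sketch of that measurement, assuming you record one row per replayed issue. The `ReplayResult` fields are hypothetical names for illustration, and the Wilson interval is added here only to show how much uncertainty a 30 to 50 issue sample carries.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class ReplayResult:
    issue_id: str               # hypothetical identifier for a historical issue
    merged: bool                # did the model's PR merge at all?
    human_rework_commits: int   # commits a human had to add before merge


def clean_pr_rate(results: list[ReplayResult]) -> float:
    """Share of replayed issues whose PR merged with zero human rework."""
    if not results:
        raise ValueError("replay set is empty")
    clean = sum(1 for r in results if r.merged and r.human_rework_commits == 0)
    return clean / len(results)


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)
```

With 20 clean PRs out of 40, the interval runs from roughly 35% to 65%, so a gap of a few points between two models on a replay set this size is not yet evidence of a real difference.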

Why the 1M context is the real architecture story

Plenty of models claim a million-token window. The honest question has always been whether you can afford to use it. With dense attention, every token attends to every earlier token, so compute grows roughly quadratically with sequence length and a 1M-token call can be eye-wateringly expensive. That is why so many long-context claims quietly assume you will never fill the window.

M3's answer is MiniMax Sparse Attention (MSA), a new sparse-attention architecture that cuts per-token compute to roughly one-twentieth of the previous generation at 1M context. MiniMax reports about 9x faster prefill and 15x faster decode at long context as a result. Those two speed-ups are what move a 1M window from a marketing line to a production primitive: prefill speed governs how quickly you can load a large repository or document set, and decode speed governs the cost and latency of generating against it.
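
To make those two speed-ups concrete, here is a back-of-envelope latency sketch. The baseline throughput figures are placeholders chosen for illustration, not measurements of any MiniMax model. Only the 9x and 15x multipliers come from MiniMax's reported numbers, and real gains will depend on hardware and batching.

```python
PREFILL_SPEEDUP = 9.0    # MiniMax's reported prefill speed-up at long context
DECODE_SPEEDUP = 15.0    # MiniMax's reported decode speed-up at long context

# Hypothetical previous-generation throughput, illustrative only.
BASE_PREFILL_TOK_PER_S = 5_000.0
BASE_DECODE_TOK_PER_S = 20.0


def request_latency_s(prompt_tokens: int, output_tokens: int,
                      prefill_tok_per_s: float, decode_tok_per_s: float) -> float:
    """Seconds to ingest the prompt (prefill) plus seconds to generate (decode)."""
    if prefill_tok_per_s <= 0 or decode_tok_per_s <= 0:
        raise ValueError("throughput must be positive")
    return prompt_tokens / prefill_tok_per_s + output_tokens / decode_tok_per_s


# A full 1M-token prompt with a 2,000-token answer.
baseline = request_latency_s(1_000_000, 2_000,
                             BASE_PREFILL_TOK_PER_S, BASE_DECODE_TOK_PER_S)
accelerated = request_latency_s(1_000_000, 2_000,
                                BASE_PREFILL_TOK_PER_S * PREFILL_SPEEDUP,
                                BASE_DECODE_TOK_PER_S * DECODE_SPEEDUP)
```

Under these placeholder numbers a full-window request drops from about 300 seconds to about 29, with prefill accounting for most of the remaining time.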

For builders, that reframes a familiar set of patterns. A 400k-line monorepo with generated types can sit in context without RAG plumbing, provided its token count (which depends on line length, not just line count) stays inside the window. A full regulatory corpus, say the EU AI Act alongside the UK's frontier-AI guidance, can be read in a single pass for cross-reference work. Multi-document customer-research synthesis stops being a chunking exercise. The difference with M3 is that you can do all of this on weights you host, rather than renting the only model that could keep up.
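
Before dropping the retrieval layer, it is worth checking whether a given repository actually fits. The sketch below uses a rough four-characters-per-token heuristic, which is an assumption and not M3's tokenizer. The output reserve and the file suffixes are likewise illustrative defaults.

```python
from math import ceil
from pathlib import Path

CONTEXT_WINDOW = 1_000_000   # M3's stated window, in tokens
CHARS_PER_TOKEN = 4.0        # assumption: crude heuristic, not M3's tokenizer


def estimate_tokens(text: str, chars_per_token: float = CHARS_PER_TOKEN) -> int:
    """Rough token count derived from character length."""
    return ceil(len(text) / chars_per_token)


def repo_token_estimate(root: Path, suffixes: tuple[str, ...] = (".py", ".ts")) -> int:
    """Sum the estimated tokens of every file under root with a matching suffix."""
    total = 0
    for path in root.rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            total += estimate_tokens(path.read_text(encoding="utf-8", errors="ignore"))
    return total


def fits_in_window(total_tokens: int, reserve_for_output: int = 50_000,
                   window: int = CONTEXT_WINDOW) -> bool:
    """True if the payload plus an output reserve stays inside the window."""
    return total_tokens + reserve_for_output <= window
```

For a real decision, swap the heuristic for a count from the tokenizer your serving stack actually uses. Average line length decides whether a repository of a given line count fits, so the measured number is the one to trust.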

Watch out

Sparse attention is an efficiency mechanism, not a recall guarantee. As with every long-context model, validate retrieval on your payloads before you trust a 1M-token call in production — run needle-in-a-haystack probes at the depths you actually use, and tool-call out to code for anything numerically exact. Cheap long context tempts teams to stuff the window instead of curating it; a tight, well-ordered 200k context still beats a sloppy 900k one.
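
A needle-in-a-haystack probe can be as small as the sketch below. The `ask` callable is a hypothetical wrapper around whichever endpoint you use, and the needle, question and filler are placeholders. In practice you would build the haystack from your own documents rather than repeated filler.

```python
from typing import Callable

NEEDLE = "The vault code is 7421. "
QUESTION = "\n\nWhat is the vault code? Answer with the number only."


def build_probe(filler: str, needle: str, depth: float, total_chars: int) -> str:
    """Place needle at a relative depth (0.0 = start, 1.0 = end) inside filler text."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    if not filler:
        raise ValueError("filler must not be empty")
    body_len = max(total_chars - len(needle), 0)
    body = (filler * (body_len // len(filler) + 1))[:body_len]
    cut = int(body_len * depth)
    return body[:cut] + needle + body[cut:]


def run_probes(ask: Callable[[str], str], depths: list[float],
               filler: str, total_chars: int) -> dict[float, bool]:
    """Return, per depth, whether the model's answer contained the needle value."""
    results = {}
    for depth in depths:
        prompt = build_probe(filler, NEEDLE, depth, total_chars) + QUESTION
        results[depth] = "7421" in ask(prompt)
    return results
```

Run it at the context sizes and depths your production calls actually reach, and repeat each probe several times, since a single pass at one depth says little.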

Native multimodality, not a bolt-on

M3 was multimodal from training step zero: image and video input, computer use, and a toggleable thinking mode all sit inside the same checkpoint. That is a meaningful distinction from pipelines that wire a separate vision model in front of a text model. For builders, native multimodality plus computer use means a single open-weight model can read a screenshot, reason about a UI, and drive an agentic workflow — the kind of stack that, until recently, meant a closed computer-use API and the data-handling questions that come with it.

The toggleable thinking mode matters for cost control. You can run the model in fast, non-thinking mode for routine turns and switch on extended reasoning only for the hard ones — the same routing discipline that keeps closed-model bills sane, except now you own the switch.
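
One way to sketch that routing discipline is a small pre-flight check that decides whether a turn earns extended reasoning. Everything here is an assumption for illustration: the keyword list, the thresholds, the `minimax-m3` model id and the `thinking` request field are hypothetical, so check your provider's API for the real toggle.

```python
# Hypothetical signals that a turn is hard enough to justify thinking mode.
HARD_HINTS = ("refactor", "race condition", "migrate", "design", "prove", "debug")


def needs_thinking(prompt: str, files_touched: int = 0,
                   max_easy_chars: int = 2_000) -> bool:
    """Crude heuristic: hard keywords, many files, or a long prompt."""
    text = prompt.lower()
    if any(hint in text for hint in HARD_HINTS):
        return True
    if files_touched > 3:
        return True
    return len(prompt) > max_easy_chars


def build_request(prompt: str, files_touched: int = 0) -> dict:
    """Assemble a request payload; field names are placeholders, not a real API."""
    return {
        "model": "minimax-m3",   # hypothetical model id
        "prompt": prompt,
        "thinking": needs_thinking(prompt, files_touched),  # hypothetical flag
    }
```

A keyword heuristic is only a starting point. Logging which turns were routed to thinking mode, and whether the result was accepted, gives you the data to tune it.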

Build versus buy — for India and the UK

This is where an open-weight frontier model genuinely changes the stack, and the calculus is regional. The decision is rarely about raw cost per token alone; it is about residency, predictability and control.

For teams in India, M3 lands squarely in the sovereign-compute conversation. With the IndiaAI Mission pushing GPU-hours down to a fraction of cloud-list rates and a domestic open-weight ecosystem maturing — see our look at India's AI ecosystem and Sarvam's open-sourced 105B model — self-hosting an open-weight frontier coder inside an AWS Mumbai region, or on subsidised national compute, becomes a credible default rather than an aspiration. Code and customer data stay onshore, and the DPDP residency conversation gets considerably simpler when the model never leaves your perimeter.

For teams in the UK, the driver is more often data-handling assurance and procurement. Running M3 in an AWS London region, or on a sovereign-cloud arrangement, lets a regulated firm keep source code and client material inside a controlled boundary while still getting GPT-5.5-class coding. For consultancies and fintechs that have spent two years explaining why their AI coding assistant ships code to a third-party API, an open-weight model they host themselves is an easier story to tell a risk committee.

The honest cost picture cuts both ways. Hosted M3 at $0.60 / $2.40 per million tokens is already a fraction of the closed bracket, so self-hosting only wins on pure economics at sustained, high volume — you are trading a per-token fee for GPU capital, utilisation risk and an operations burden. The non-economic reasons — residency, control, no third-party data exposure — are frequently the ones that actually decide it.
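
For a sense of scale, the per-request arithmetic at the hosted rates looks like this. The prices are the ones quoted above; the token counts in the usage note are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    input_per_m: float    # USD per million input tokens
    output_per_m: float   # USD per million output tokens


LIST = Price(0.60, 2.40)    # quoted list price
PROMO = Price(0.30, 1.20)   # quoted launch promotion


def request_cost(input_tokens: int, output_tokens: int, price: Price) -> float:
    """USD cost of one request at the given per-million-token rates."""
    return (input_tokens / 1_000_000 * price.input_per_m
            + output_tokens / 1_000_000 * price.output_per_m)
```

A 200k-token prompt with a 4k-token answer comes to about $0.13 at list price, and a full 1M-token prompt with the same answer comes to about $0.61.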

Pro tip

Start hosted, then graduate workloads in-house. Run M3 through MiniMax's own endpoint or an inference aggregator first to validate quality on your tasks at near-zero capital. Once a workload's traffic is steady and predictable, model the break-even against GPU rental in your region — that is the point at which moving it onto your own infrastructure starts to pay, and not before.
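
The break-even itself is simple arithmetic once you have your own numbers. In this sketch the GPU rate, GPU count, operations overhead and input share are all hypothetical inputs. It also assumes the rented hardware can actually serve the break-even volume, which you would need to confirm with a throughput test.

```python
HOURS_PER_MONTH = 730


def blended_price_per_m(input_share: float, input_per_m: float = 0.60,
                        output_per_m: float = 2.40) -> float:
    """Hosted USD per million tokens for a given input/output traffic mix."""
    if not 0.0 <= input_share <= 1.0:
        raise ValueError("input_share must be between 0.0 and 1.0")
    return input_share * input_per_m + (1 - input_share) * output_per_m


def self_host_monthly_usd(gpu_hourly_usd: float, gpu_count: int,
                          ops_monthly_usd: float = 0.0) -> float:
    """Fixed monthly cost of an always-on rented GPU cluster plus operations."""
    return gpu_hourly_usd * gpu_count * HOURS_PER_MONTH + ops_monthly_usd


def break_even_tokens_m(self_host_usd: float, input_share: float) -> float:
    """Monthly volume (millions of tokens) at which hosted spend equals self-hosting."""
    return self_host_usd / blended_price_per_m(input_share)
```

As an illustration, eight GPUs at a hypothetical $2.50 per hour plus $5,000 a month in operations cost $19,600 a month. At a 90% input mix the hosted blend is $0.78 per million tokens, so break-even sits at roughly 25 billion tokens a month.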

Where M3 sits in the open-weight field

M3 does not arrive in a vacuum. The 2026 open-weight field is crowded with strong releases, and the right comparison is not closed-versus-open but which open model fits which job. DeepSeek V4 leads on raw coding throughput for many teams; Qwen 3.5 ships under a permissive Apache 2.0 licence that simplifies commercial deployment; Gemma 4 and Llama 4 anchor the broadly-supported, easy-to-deploy end of the spectrum. We covered the broader shift in the enterprise coding-stack war and the practical assistant pairing in the 2026 AI coding stack.

Model             | Standout axis                                         | Licence posture | Context
------------------|-------------------------------------------------------|-----------------|---------
MiniMax M3        | Coding + long context + multimodal in one checkpoint  | Open weights    | 1M (MSA)
DeepSeek V4       | Coding throughput                                     | Open weights    | Long
Qwen 3.5          | Permissive commercial deployment                      | Apache 2.0      | Long
Gemma 4 / Llama 4 | Ecosystem and deployment breadth                      | Open weights    | Varies

M3's claim is breadth. If your workload needs frontier coding and a million-token context and multimodal input — an agent that reads a repo, a design mock and a long spec in one pass — M3 lets you do it with a single self-hostable model rather than orchestrating three. If you only need one of those axes, a more specialised open-weight model may serve you better and cheaper.

So — what should a builder actually do?

Treat M3 as a trigger to re-run a decision you may have settled prematurely. A self-hostable open-weight frontier model changes your stack in three concrete situations: when residency or sovereignty is a hard requirement; when sustained high-volume traffic makes per-token economics matter; and when the data you process is too sensitive to send to a third party at all. Outside those three, a hosted call — to M3 or a closed model — is usually still the pragmatic choice.
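
Those three situations can be written down as a per-workload check. This is a sketch of the reasoning, not a policy engine: the `Workload` fields are hypothetical, and the break-even figure is something you would compute from your own GPU pricing.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    residency_required: bool                  # hard residency or sovereignty rule
    data_too_sensitive_for_third_party: bool  # cannot leave your perimeter at all
    monthly_tokens_m: float                   # sustained volume, millions of tokens
    break_even_tokens_m: float                # your own computed break-even volume


def self_host_reasons(w: Workload) -> list[str]:
    """List which of the three situations apply to this workload."""
    reasons = []
    if w.residency_required:
        reasons.append("residency")
    if w.data_too_sensitive_for_third_party:
        reasons.append("data sensitivity")
    if w.monthly_tokens_m >= w.break_even_tokens_m:
        reasons.append("volume")
    return reasons


def recommend(w: Workload) -> str:
    """Self-host if any situation applies; otherwise stay on a hosted call."""
    return "self-host" if self_host_reasons(w) else "hosted"
```

Running the check per workload rather than once for the whole organisation keeps a single sensitive pipeline from dragging every low-risk call onto your own GPUs.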

The pragmatic sequence: validate M3 on your own replay set first, decide per-workload rather than per-organisation, and let residency and control — not just the headline price — drive the self-hosting call. The fact that you now have the option, at coding parity with a closed flagship and open weights you can run in Mumbai or London, is the actual news.

Primary sources: MiniMax's launch write-up at minimax.io/blog/minimax-m3, coverage at venturebeat.com, and hosting and inference detail at together.ai.