Where the gap stands now

For most of the past two years the honest answer to "should we use an open model?" was "not for anything that matters". The quality gap between downloadable weights and the closed frontier was wide enough that, for a serious product, a frontier API was the only defensible choice. That answer is now genuinely contested, and the change has happened fast.

According to Artificial Analysis' Intelligence Index, Moonshot AI's Kimi K2.6 and Xiaomi's MiMo V2.5 Pro are the leading open-weight models, each scoring 54. GPT-5.5, which the same index puts at 60, sits 6 points ahead of them, and the leading proprietary models as a group are roughly 3-6 points clear. Across the wider field, Artificial Analysis' figures suggest open-weight quality has closed to within about 5-15 points of closed frontier models, while open-weight pricing has fallen roughly 30-60% across the board.

Two caveats before anyone reorganises a roadmap around those numbers. First, the Intelligence Index is a single-source composite benchmark — useful as a directional signal, not as ground truth for your specific workload. Treat the 54 and the 60 as a rough map, not a contract. Second, the headline figures conflate two distinct things: how good a model is, and how cheap it is to run. The pricing collapse is, for a cost-sensitive team, arguably the bigger story than the quality convergence. This piece is about what a builder should actually do with both.

A wave of releases, not a single model

The convergence is not the work of one breakout model. Since February 2026 there have been six major open-weight releases, and the cadence itself is the point — the field is now iterating fast enough that any single benchmark snapshot is stale within weeks.

  • DeepSeek V4-Pro and V4-Flash — a paired release, one tuned for capability and one for throughput.
  • Qwen3-Coder-Next — Alibaba's coding-specialised line, noted in roundups for efficiency per active parameter.
  • Qwen3.6-27B — a mid-size general model in a parameter range that fits comfortably on modest GPU hardware.
  • GLM-5 — Zhipu's flagship, with GLM-5.1 described in roundups as strong for long-horizon agentic engineering.
  • Kimi K2.6 — Moonshot AI's release, one of the two joint leaders on the Intelligence Index per Artificial Analysis.
  • Mistral Medium 3.5 — the European entry, and a useful option for teams with data-residency reasons to favour a European maker.

Separately, Zyphra released ZAYA1-8B, an Apache-2.0 mixture-of-experts reasoning model — 8B total parameters with only around 760M active per token — trained on AMD Instinct hardware. It is a smaller model and not a frontier contender, but it matters for two reasons: the Apache-2.0 licence makes it genuinely open rather than merely open-weight, and the AMD training run is a quiet signal that the hardware monoculture is loosening.

Watch out

Open-weight is not the same as open-source. Most of these releases ship downloadable weights under permissive-ish licences, but training data, training code and full reproduction recipes are usually withheld — and some licences carry usage restrictions, field-of-use limits or revocation clauses. ZAYA1-8B under Apache-2.0 is close to genuinely open; several of the others are not. Read the actual licence text before you build a commercial product on a model, because "you can download it" and "you can ship it" are different statements.

The models, side by side

The table below summarises the leading open-weight options and their reported strengths. The coding claims come from published roundups rather than a single benchmark authority, so treat them as directional. Where an Intelligence Index figure appears, it is attributed to Artificial Analysis and should be read as one benchmark's view, not a settled fact.

| Model | Maker | Reported strength | Note |
| --- | --- | --- | --- |
| Kimi K2.6 | Moonshot AI | Joint top open-weight model at 54, per Artificial Analysis; reported to lead on SWE-Bench Pro | Strong all-rounder; the closest open-weight option to GPT-5.5 on the Index |
| MiMo V2.5 Pro | Xiaomi | Joint top open-weight model at 54, per Artificial Analysis | Newer entrant; verify ecosystem and tooling maturity before committing |
| GLM-5 / GLM-5.1 | Zhipu AI | GLM-5.1 described in roundups as strongest for long-horizon agentic engineering | Favour for multi-step agent workflows where the model holds a long task |
| DeepSeek V4-Pro | DeepSeek | Reported in roundups to lead on LiveCodeBench and 1M-context tasks | Paired with V4-Flash, a throughput-tuned sibling for cheaper bulk inference |
| Qwen3-Coder-Next | Alibaba | Noted in roundups for best efficiency per active parameter | Attractive when you are GPU-constrained and serving cost dominates |
| Mistral Medium 3.5 | Mistral AI | Competitive mid-tier general model | European maker; relevant if data residency or supplier geography matters |

The pattern worth noticing is that there is no single winner. Kimi K2.6 and MiMo V2.5 Pro top the general Index per Artificial Analysis, but on coding the picture fragments — GLM-5.1 for agentic depth, DeepSeek V4-Pro for long-context code, Qwen3-Coder-Next for cost efficiency. For a builder, that fragmentation is good news: you can match a model to a workload rather than accept one model's compromises everywhere. We went deeper on the coding-specific picture in our coding agent leaderboard for May 2026.

The build-vs-buy sum: when does self-hosting win?

This is the question that actually matters for a startup. A frontier API — GPT-5.5, Claude, Gemini — is a managed product: you pay per token, someone else runs the GPUs, patches the serving stack and absorbs the 3am incident. A self-hosted open-weight model is infrastructure you own: cheaper per token at volume, but only after you have priced in everything the API quietly handled for you.

The naive comparison looks only at per-token price, sees open-weight inference at a fraction of frontier API cost, and concludes self-hosting is obviously cheaper. That comparison is wrong often enough to be dangerous. The honest sum has four terms, and three of them are invisible on a pricing page.

  • Inference cost. The visible term. At high, steady volume, a self-hosted open-weight model on rented or owned GPUs can undercut a frontier API substantially — this is the real saving, and the one the headlines describe.
  • Operations burden. Someone has to stand up the serving stack, tune batching and quantisation, monitor latency and throughput, and be on call when an endpoint degrades. For a small team this is a meaningful slice of an engineer's week, indefinitely.
  • Eval drift. Each time you quantise more aggressively, patch a runtime, or swap a model version, output quality shifts. Without a standing evaluation harness you discover the regression in production. The harness itself is ongoing work.
  • Latency and reliability. A well-run frontier API gives you elastic capacity and a credible uptime record. A self-hosted endpoint gives you exactly the capacity you provisioned — handling a traffic spike means over-provisioning GPUs that sit idle the rest of the time.
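
The four terms can be written down as a single sum. The sketch below is illustrative only: the class name, the field names and every figure are assumptions invented for this example, not measurements or quoted prices. Substitute your own numbers, in one currency throughout.

```python
from dataclasses import dataclass


@dataclass
class SelfHostCosts:
    """Monthly cost of self-hosting, one field per term in the sum.

    All names and units here are hypothetical placeholders.
    """
    gpu_spend: float        # inference: GPUs needed for average load
    ops_fraction: float     # operations: share of one engineer's time, 0 to 1
    engineer_cost: float    # fully loaded monthly cost of that engineer
    eval_upkeep: float      # eval drift: harness maintenance and review time
    spike_headroom: float   # latency and reliability: spare GPUs kept for peaks

    def total(self) -> float:
        """Sum all four terms, not just the visible GPU bill."""
        return (
            self.gpu_spend
            + self.ops_fraction * self.engineer_cost
            + self.eval_upkeep
            + self.spike_headroom
        )


def api_cost(tokens_per_month: float, price_per_million: float) -> float:
    """Monthly frontier API bill at a flat blended per-token price."""
    return tokens_per_month / 1_000_000 * price_per_million


if __name__ == "__main__":
    # Placeholder figures, chosen only to show the shape of the comparison.
    self_host = SelfHostCosts(
        gpu_spend=6_000, ops_fraction=0.25, engineer_cost=12_000,
        eval_upkeep=1_000, spike_headroom=1_500,
    )
    api = api_cost(tokens_per_month=3_000_000_000, price_per_million=5.0)
    print(f"GPU bill alone: {self_host.gpu_spend:,.0f}")
    print(f"Full self-host sum: {self_host.total():,.0f}")
    print(f"Frontier API: {api:,.0f}")
```

With these placeholder figures the GPU bill alone is 6,000 against an API bill of 15,000, while the full self-hosting sum is 11,500. Self-hosting still comes out ahead in this made-up case, but by much less than the GPU bill alone suggests.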

Set against those costs is the 5-15 point quality gap. For some workloads that gap is irrelevant — classification, extraction, summarisation, routing, internal tooling. For others — a customer-facing assistant where the difference between a good and a great answer is the product — it is decisive. The first task is to be honest about which kind of workload you have.

Pro tip

Before you cost anything, run your own evaluation. Take 200-300 real requests from your production logs, run them through the open-weight candidate and your current frontier API, and have the team score the outputs blind. Intelligence Index rank does not predict task fit — a model two points lower on a composite benchmark may be indistinguishable on your workload, or noticeably worse. The eval is a day of work and it replaces a guess with a measurement.
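
One way to run that blind comparison is sketched below. It is a minimal example under stated assumptions: the function names are hypothetical, reviewers are assumed to record "left", "right" or "tie" per request, and the exact sign test used to judge whether a preference is real is our choice of statistic, not something the benchmarks above prescribe.

```python
import math
import random


def blind_pairs(items, seed=0):
    """Build a review sheet with the two outputs in random left/right order.

    items: iterable of (request, open_weight_output, frontier_output).
    Returns (sheet, key); key[i] is the side holding the open-weight output.
    Keep the key away from reviewers until scoring is finished.
    """
    rng = random.Random(seed)
    sheet, key = [], []
    for request, open_out, frontier_out in items:
        open_on_left = rng.random() < 0.5
        if open_on_left:
            left, right = open_out, frontier_out
        else:
            left, right = frontier_out, open_out
        sheet.append({"request": request, "left": left, "right": right})
        key.append("left" if open_on_left else "right")
    return sheet, key


def tally(choices, key):
    """Count wins. choices[i] is 'left', 'right' or 'tie' from a reviewer."""
    open_wins = frontier_wins = ties = 0
    for choice, open_side in zip(choices, key):
        if choice == "tie":
            ties += 1
        elif choice == open_side:
            open_wins += 1
        else:
            frontier_wins += 1
    return open_wins, frontier_wins, ties


def sign_test_p(wins_a, wins_b):
    """Two-sided exact sign test on the non-tied pairs.

    A large p-value means the preference split is consistent with a coin flip.
    """
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = min(wins_a, wins_b)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

As a sanity check on the statistic: an even 5-5 split gives a p-value of 1.0, while a 0-10 split gives about 0.002, which is the kind of result that says reviewers really can tell the two models apart.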

A concrete decision framework

Here is the framework we would apply, equally relevant to a Bengaluru SaaS startup and a London fintech. It is deliberately a sequence of gates — fail any one and self-hosting is probably not yet for you.

Gate one: volume. Self-hosting only pays back when inference volume is high enough to keep GPUs usefully busy. If you are processing tens of millions of tokens a day, steadily, the maths can swing hard towards self-hosting. If your volume is low, spiky, or still finding product-market fit, a frontier API is almost certainly cheaper once operations time is priced in — and far less distracting. Spiky traffic is the quiet killer: provisioning for the peak means paying for idle GPUs at the trough.
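
The cost of spiky traffic can be made concrete with a little arithmetic. The sketch below is a simplification under loud assumptions: the function and parameter names are hypothetical, the fleet is sized purely for peak load, and per-GPU throughput is a single number you would have to measure yourself.

```python
import math


def self_host_price_per_million(peak_tps, mean_tps, gpu_tps, gpu_hourly_cost):
    """Effective price per million tokens when the fleet is sized for peak load.

    peak_tps, mean_tps: peak and average tokens per second you must serve.
    gpu_tps: sustained tokens per second one GPU delivers (measure this yourself).
    gpu_hourly_cost: rental or amortised cost of one GPU per hour.
    """
    if not 0 < mean_tps <= peak_tps:
        raise ValueError("need 0 < mean_tps <= peak_tps")
    gpus = math.ceil(peak_tps / gpu_tps)     # capacity follows the peak
    hourly_cost = gpus * gpu_hourly_cost     # paid whether busy or idle
    tokens_per_hour = mean_tps * 3600        # useful work follows the mean
    return hourly_cost / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Same peak, same hardware, made-up numbers; only the average load differs.
    steady = self_host_price_per_million(2000, 2000, 1000, 2.0)
    spiky = self_host_price_per_million(2000, 200, 1000, 2.0)
    print(f"steady: {steady:.2f} per million, spiky: {spiky:.2f} per million")
```

In this example the spiky workload averages a tenth of its peak, so each token costs ten times as much as in the steady case on identical hardware. That ratio, not the headline per-token price, is what to compare against the API.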

Gate two: task tolerance. Can your workload absorb the 5-15 point quality gap? Run the blind eval from the pro tip above. If the open-weight model is within noise on your tasks, the gap is not a real cost. If your reviewers can consistently pick the frontier output, you are trading product quality for an infrastructure saving — a trade that is sometimes right, but should be made with eyes open, not by accident.

Gate three: operations capacity. Do you have, or can you hire, someone who can own a GPU serving stack? This is a standing responsibility, not something to fit around other work. If your team is four engineers shipping a product, adding inference infrastructure ownership is a real opportunity cost: every hour on serving is an hour not on the product. A managed inference platform is a useful middle path here: you get open-weight models without owning the metal. We compared the main ones in our piece on inference platforms, covering DeepInfra, Together, Fireworks, Groq and Cerebras.

Gate four: data and compliance. Sometimes the decision is not about cost at all. If you handle data that, for regulatory or contractual reasons, cannot leave your own environment — health records, certain financial data, anything caught by India's DPDP Act or UK data-protection obligations — then self-hosting an open-weight model may be the only compliant option, and the cost comparison becomes secondary. In that case the question shifts from "is it cheaper?" to "which model clears our quality bar inside our perimeter?".
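
The four gates can be sketched as a short function. Everything here is an assumption made for illustration: the field names, the return labels, and above all the `volume_floor` default of 10 million tokens a day, which is just one reading of "tens of millions". Treat the output as a starting point for discussion, not a verdict.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    tokens_per_day: int
    steady_traffic: bool
    within_noise_on_blind_eval: bool   # outcome of your own eval, not a leaderboard
    has_serving_owner: bool
    data_must_stay_in_perimeter: bool


def recommend(w: Workload, volume_floor: int = 10_000_000) -> str:
    """Walk the gates and return a starting recommendation."""
    # Gate four is checked first because a perimeter requirement overrides cost.
    if w.data_must_stay_in_perimeter:
        return "self-host"
    # Gate one: low or spiky volume rarely keeps GPUs busy enough.
    if w.tokens_per_day < volume_floor or not w.steady_traffic:
        return "frontier-api"
    # Gate two: reviewers can tell the difference, so the gap is a real cost.
    if not w.within_noise_on_blind_eval:
        return "frontier-api"
    # Gate three: nobody to own the serving stack, so rent one.
    if not w.has_serving_owner:
        return "managed-platform"
    return "self-host"
```

Checking compliance first is a deliberate design choice in this sketch: it reflects the point that a perimeter requirement settles the question before any cost comparison starts.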

A pragmatic reading: most early-stage startups should start on a frontier API, instrument their token spend and latency carefully, and revisit self-hosting once volume is genuinely high and predictable. The exception is the compliance-driven case, where the perimeter requirement decides it regardless of volume. The hybrid pattern — frontier API for the hardest customer-facing calls, self-hosted open-weight for high-volume background tasks like classification and extraction — is where many teams will sensibly land. A further option is to fine-tune a smaller open-weight model on your domain, which can narrow the quality gap on a specific task; we covered the budget approach in our guide to fine-tuning an LLM on a budget with LoRA and QLoRA.
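
In code, the hybrid pattern reduces to a small routing decision made before each call. The sketch below is a minimal illustration: the function name, the backend labels and the exact task list are assumptions, and a real router would also need fallbacks and spend logging.

```python
# Task types the article treats as tolerant of the quality gap.
BULK_TASKS = frozenset({"classification", "extraction", "summarisation", "routing"})


def pick_backend(task_type: str, customer_facing: bool) -> str:
    """Send customer-facing calls to the frontier API and bulk work to the self-hosted model."""
    if customer_facing:
        return "frontier"
    if task_type in BULK_TASKS:
        return "self_hosted"
    return "frontier"  # unknown task types default to the stronger model
```

Defaulting unknown task types to the frontier model is the conservative choice: a new workload earns its place on the cheaper backend only after it has passed your own eval.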

The benchmark trap, and the licence trap

Two failure modes deserve a direct warning, because both are common and both are expensive.

The first is treating a benchmark score as a production guarantee. The Intelligence Index, SWE-Bench Pro, LiveCodeBench — these are valuable, but they measure performance on curated test sets, not on your data, your prompts, your latency budget or your edge cases. A model that scores 54 against another at 60 is not "10% worse at your job"; it might be identical on your workload, or it might fail in a way the benchmark never probes. Benchmark numbers should narrow your shortlist. They should never make your decision. The decision is made by an eval on your own traffic.

The second is the licence trap. "Open-weight" covers a wide spread of actual legal terms. Some licences are genuinely permissive; others restrict commercial use above a revenue threshold, forbid certain fields of use, require attribution in specific ways, or reserve the right to revoke. Building a product on a model and discovering the licence forbids your use case is a costly mistake to make late. The fix is cheap: have someone read the licence text — the actual text, not a blog summary — before the model goes anywhere near production, and record which licence each model in your stack ships under.
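
Recording the licence per model can be as simple as a small registry checked before deployment. The sketch below is hypothetical throughout: the record fields, the function name and the reviewer address are invented for illustration. The only licence fact taken from this article is that ZAYA1-8B ships under Apache-2.0; every other model stays blocked until someone has read its terms.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LicenceRecord:
    model: str
    licence: Optional[str] = None       # None until someone has read the actual text
    reviewed_by: Optional[str] = None   # who read it
    use_case_permitted: bool = False    # the reviewer's conclusion for your product


def blocked_models(registry):
    """Return models that must not ship: unread licence or use case not cleared."""
    return [
        r.model for r in registry
        if r.licence is None or r.reviewed_by is None or not r.use_case_permitted
    ]


if __name__ == "__main__":
    registry = [
        LicenceRecord("ZAYA1-8B", "Apache-2.0",
                      reviewed_by="reviewer@example.com", use_case_permitted=True),
        LicenceRecord("Kimi K2.6"),  # licence not yet read, so it stays blocked
    ]
    print(blocked_models(registry))
```

Running a check like this in a deployment pipeline turns "someone should read the licence" from a good intention into a gate.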

What this means for builders

The strategic shift is real and worth internalising. A year ago, choosing an open-weight model for a serious product meant accepting a clear quality penalty. Today, per Artificial Analysis' figures, that penalty has narrowed to something many workloads can absorb — and the 30-60% price collapse means the saving on the other side of the trade is substantial. The frontier labs no longer have a monopoly on "good enough".

For an Indian or UK startup, the practical takeaways are these. Keep a frontier API as your default while volume is low — the operational simplicity is worth the token price, and it lets the team stay focused on the product. Instrument your inference spend from day one, so that when volume grows you can run the build-vs-buy sum on real numbers rather than vibes. When you do evaluate self-hosting, evaluate on your own traffic, not on a leaderboard. Watch the licence terms as closely as the benchmark scores. And treat the hybrid model — frontier for the hard calls, open-weight for the bulk — as the likely destination rather than an all-or-nothing switch.

The gap has closed enough that "we use an open-weight model" is no longer a confession. It is now a defensible engineering choice — provided you make it with the full sum in front of you, not just the per-token price.

Artificial Analysis publishes its Intelligence Index methodology and current figures at artificialanalysis.ai — worth reading directly, with its single-source limitations in mind.