What changed in April 2026

Eighteen months ago, anyone shipping a serious AI product picked GPT-4 or Claude and called it a day. April 2026 is the month that defaulting to proprietary stopped being the obvious choice. Three open-weight drops — Llama 4, Mistral Small 4, and GLM-5.1 — now match or beat closed-weight alternatives on the benchmarks that map to real production work, while running at a fraction of the inference cost when self-hosted.

  • GLM-5.1 from Zhipu AI now leads open-source on SWE-bench Pro and Terminal-Bench — the two benchmarks that correlate most cleanly with coding-agent quality.
  • Llama 4 Scout ships natively multimodal with text, images, and video, plus a Mixture-of-Experts architecture that lets you run a 109B-parameter model at 17B-active inference cost.
  • Mistral Small 4 dropped on roughly 16 March 2026 — a 119B-parameter MoE under permissive Apache 2.0 licensing, the only major-lab release with no licence asterisks.

And that is before counting the supporting cast: Gemma 4 from Google, where a 27B-parameter model now beats Llama-405B, DeepSeek-V3 and o1-mini on LMArena; Alibaba's Qwen 3.6; and DeepSeek V4, which shipped on 24 April — see our separate deep dive at DeepSeek V4 lands: Pro flagship beats GPT-4 cost at 88.5% MMLU.

Pro tip

Both Together AI and Replicate now expose prompt caching with the same semantics as proprietary APIs. Set a stable system-prompt prefix, batch your requests by tenant, and you can pull cached-read pricing as low as $0.03/MTok on Together for GLM-5.1. The 5-minute TTL is the only number that actually matters for production economics — design your queueing around it.

GLM-5.1 — the SWE-bench champion

GLM-5.1 from Zhipu AI is the model that will quietly steal coding-agent workloads from Claude and GPT-4 over the next two quarters. It leads the open-source pack on SWE-bench Pro and Terminal-Bench, the two benchmarks the Builder community has gradually accepted as the most honest predictor of real coding performance.

The numbers we have seen on internal repos line up with the public scoreboard. Where Claude Sonnet might solve 18 of 20 patches first try on a mid-sized TypeScript repo, GLM-5.1 lands 16 — and at roughly a fifth of the inference cost when called via Together's hosted endpoint. The remaining gap is mostly in tool-orchestration depth on agentic tasks longer than three or four turns, which Anthropic still leads.

Licensing sits under the GLM Open Use Agreement, which is permissive for commercial use but does require attribution and prohibits training competing foundation models on its outputs. For 99% of product-builder use cases, those clauses are ignorable.

Llama 4 Scout — the long-context efficient one

Meta's Llama 4 ships in two MoE variants: Scout (109B total / 17B active, 16 experts) and Maverick (~400B total). Scout is the one most teams should be looking at — it is natively multimodal across text, images, and video, and the 17B-active architecture means inference cost lands closer to a 20B dense model than its full parameter count would suggest.

The catch is the licence — covered in detail below. But assuming you are not within an order of magnitude of the 700M-MAU threshold, Scout is the model to reach for when you need long-context multimodal grounding and want to self-host on commodity GPUs. A single H100 80GB node runs Scout at production latency, which puts it in reach of any team that can swing a Hetzner or Lambda Labs reservation.

Mistral Small 4 — the European Apache 2.0 winner

Mistral Small 4 dropped around 16 March 2026: 119B parameters in MoE configuration, full Apache 2.0 licence, no commercial caveats. That last sentence is genuinely rare.

If your workload involves European data, regulated industries, or any context where licence-attorney sign-off is on the critical path, Mistral Small 4 is the path of least resistance. It is also the natural choice for Builders shipping into the UK who want a Europe-domiciled model provider — Mistral runs inference out of Paris and Frankfurt, which simplifies certain GDPR analyses that get awkward with US-domiciled hosted inference.

On raw capability, Mistral Small 4 sits a hair below GLM-5.1 on coding and a hair above Llama 4 Scout on multilingual reasoning. On the European languages — French, German, Italian, Spanish, Portuguese — it is class-leading.

The comparison table

One-page reference for the four open-weight models that matter this quarter. Inference cost is hosted-endpoint blended (Together for the first three, Google AI Studio for Gemma 4) — self-hosting on AWS Mumbai (ap-south-1) or AWS London (eu-west-2) typically halves these numbers above ~50M tokens per day.

Model Total / active params Context Headline benchmark Licence Hosted cost (in/out)
Llama 4 Scout 109B / 17B (16 experts) 10M tokens Multimodal LMArena top-tier Llama Community $0.20 / $0.60 MTok
Mistral Small 4 119B MoE 256k tokens Class-leading multilingual Apache 2.0 $0.30 / $0.90 MTok
GLM-5.1 Mixed config 200k tokens SWE-bench Pro, Terminal-Bench leader GLM Open Use $0.20 / $0.80 MTok
Gemma 4 27B dense 128k tokens Beats Llama-405B / DeepSeek-V3 / o1-mini on LMArena Gemma terms $0.10 / $0.30 MTok

Where each one wins

Three concrete workload patterns we have watched teams successfully migrate this past month:

Long-context document analysis — Llama 4 Scout

Scout's 10M-token context is the headline number, but the more useful figure is the recall curve. Up to roughly 2M tokens, Scout pulls cross-document references at parity with Claude Opus 4.7 — and at perhaps a tenth of the cost when self-hosted on H100s. UK fintechs running compliance review across thousands of pages of FCA correspondence have started routing this category of work to Scout exclusively.

Coding agents — GLM-5.1

If you are building a coding agent or paying for one, GLM-5.1 is the open-weight default. The pattern most teams settle on is: route short-turn completions to GLM-5.1 via Together; reserve Claude Sonnet for the multi-step planning phase where tool orchestration depth still matters. The cost delta on simple-edit traffic is large enough to fund the routing logic three times over.

Multilingual production — Mistral Small 4

Indian Builders shipping bilingual chat and content tools (English plus Hindi, Tamil, Bengali, Marathi) report Mistral Small 4 outperforming Llama 4 Scout on Indic languages by a noticeable margin, despite Meta's larger Indic training data claims. UK Builders shipping into European markets get the same benefit on French and German. The Apache 2.0 licence is the cherry on top.

Licence compatibility — read this before you deploy

This is the section nobody enjoys but every product team needs.

Watch out — Llama Community Licence

Llama 4's licence is not Apache 2.0. The Llama Community Licence forbids use by any product or service whose monthly active users exceed 700 million, requires attribution ("Built with Llama"), and contains an acceptable-use policy that can be amended unilaterally by Meta. The 700M-MAU threshold sounds enormous, and for most teams it is — but it is the kind of clause that makes acquisition diligence painful, and it constrains your future strategic options. If you want zero licence friction, use Mistral Small 4 instead.

The short version of the three licences:

  • Apache 2.0 (Mistral Small 4) — the gold standard. Permissive, attribution-only, no MAU caps, no acceptable-use policy controlled by the issuer. Just ship it.
  • Llama Community Licence (Llama 4) — workable for almost everyone, but the MAU clause and the unilaterally amendable acceptable-use policy mean it is technically not OSI-approved open source. Treat it as licensed software, not free software.
  • GLM Open Use Agreement (GLM-5.1) — permissive for commercial use, requires attribution, prohibits training competing foundation models on the outputs. Effectively the same shape as Apache 2.0 for product-builder use cases.

Deployment patterns — India and UK regions

The four deployment options Builders are using in production right now, ranked roughly by ease of getting started:

  1. Hosted endpoints (Together, Replicate, Modal) — zero ops work, $0.20 to $0.90 per million tokens depending on model. Together has the best coverage of all three open-weight contenders. Latency from Mumbai is acceptable but routes via Singapore; from London latency is good direct.
  2. Hugging Face Inference Endpoints — slightly higher cost than Together but with easier private VPC peering on AWS. Useful if your data sovereignty story requires the inference workload to never leave your account.
  3. Self-host on AWS — for India, AWS Mumbai (ap-south-1) on a g5.12xlarge or p5.48xlarge running vLLM. For the UK, AWS London (eu-west-2) on the same instance shapes. Crossover point versus hosted endpoints sits around $3,000-$4,000 of monthly inference spend.
  4. Self-host on GCP — GCP asia-south1 (Mumbai) or europe-west2 (London) with TPUs is competitive on Mistral Small 4 and Gemma 4 in particular. Llama 4's MoE is currently better-optimised for NVIDIA, so AWS wins for that one.

A minimal self-hosted vLLM config that works for GLM-5.1 on a single H100 80GB:

# GLM-5.1 on a single H100 80GB via vLLM
# Tested on AWS p5.48xlarge in ap-south-1 (Mumbai) and eu-west-2 (London)

docker run --gpus all --ipc=host \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model THUDM/glm-5.1-chat \
  --tensor-parallel-size 1 \
  --max-model-len 200000 \
  --gpu-memory-utilization 0.92 \
  --enable-prefix-caching \
  --served-model-name glm-5.1

# Then call it from Python with the OpenAI SDK pointed at localhost:
#   client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
#   client.chat.completions.create(model="glm-5.1", messages=[...])

The same image runs unchanged on GCP A3 instances and on bare-metal H100s at any colo. The --enable-prefix-caching flag is the cost-economics one — with it, you get the same prompt-caching benefit you would get from Together's hosted endpoint. Without it, every request pays full input-tokens cost and your unit economics tilt.

Want to discuss this with other verified Builders?

Every article on AI Tech Connect is written by a Verified Builder. Browse profiles, shortlist who you want to hire or collaborate with.

Browse Builders →

When NOT to use open-weight

Open-weight is not the right answer for every workload. Three failure modes worth flagging:

  1. Safety-evaluation gaps — proprietary labs invest more heavily in red-teaming and refusal training. For regulated medical advice, child-facing products, or any context where a single bad output carries asymmetric legal risk, the closed-weight models are still the safer default.
  2. Refusal patterns drift on fine-tunes — once you fine-tune an open-weight model, the safety behaviour you inherit from the base model will degrade, sometimes silently. Budget for a dedicated safety-eval pass after every fine-tune; if you cannot, do not fine-tune.
  3. Multi-tool agentic tasks at long horizons — past about 100k tokens of multi-tool context, frontier proprietary models still pull ahead noticeably. The gap is closing, but for the most demanding agent workloads, Claude or GPT remain the right default for now.
From a verified Builder

"We migrated our automated PR-review bot from Claude Sonnet to GLM-5.1 over four weekends in March. Inference cost dropped from £4,200 to £680 per month. The rate of false-positive review comments went up by maybe 8%, but the rate of true-positive catches stayed identical — and our reviewers actually appreciated the slightly more aggressive tone. We are running it self-hosted on a reserved H100 in AWS London. Would do it again."

— Vikram, Verified Builder · Pune, IN

Migration plan — moving one workload off Claude Sonnet to GLM-5.1 in four weeks

If you are convinced enough by the benchmarks to want to test this on a real workload, here is the four-week plan most teams are running. We have watched roughly a dozen Builders execute some version of this in March and April.

  1. Week 1 — pick one workload, instrument it. Choose a single endpoint or job — ideally a stateless one with clear inputs and outputs. Add structured logging that captures prompt, completion, latency, and a quality signal (user thumbs, downstream success metric, or human-rated sample). You need ground truth before you can compare.
  2. Week 2 — shadow-traffic GLM-5.1 alongside Claude. Fork the request stream — Claude continues serving production, GLM-5.1 runs in shadow with its responses logged but not returned to users. After a week, compare the two output streams on your quality signal. Most teams find the gap is within their own measurement noise.
  3. Week 3 — flip a percentage of traffic. Start at 10% routed to GLM-5.1, watch the dashboards for two days, then 50%, then 100% if the quality signal holds. Keep Claude on standby behind a feature flag — if regressions appear in production that the shadow week missed, you can instantly roll back.
  4. Week 4 — decide on the long-term home. If hosted Together is fine, leave it there. If you are above ~50M tokens/day and the unit economics justify the work, spin up self-hosted vLLM on AWS Mumbai or AWS London using the config above. Either way, document the licence terms in your data-protection register so the next legal review does not panic.

The teams who have done this say the actual surprise is not the cost saving — it is how few quality regressions show up in production. Open-weight has caught up to "good enough" for the long tail of workloads, and the engineering effort to confirm that for any specific workload is now measured in weeks rather than quarters.

Primary sources worth bookmarking: Mistral's Small 4 release post, Meta's Llama 4 announcement, and the GLM-5.1 GitHub release.