In the 12 days between 3 May and 14 May 2026, four frontier-class open-weight coding models dropped from Chinese AI labs: GLM-5.1 from Zhipu AI, MiniMax M2.7 from MiniMax, Kimi K2.6 from Moonshot AI, and DeepSeek V4 from DeepSeek. Each carries competitive coding benchmarks. Each is available under weights that can be self-hosted. And per the Air Street Press analysis circulated this week, inference on these models — whether via API or self-hosted — runs at under one-third the cost of Claude Opus 4.7 per token.
That is a structural shift, not a product cycle blip. To understand why four major models appeared in 12 days, you need to understand what is happening to Chinese AI capital and competitive pressure simultaneously. To act on it, you need to understand the trade-offs each model brings — and the genuine gaps they have not yet closed.
This piece covers both. It is written for builders who are responsible for model selection decisions and who need to make a defensible choice — not a fashionable one.
Why Four Models in 12 Days: The Competitive Dynamics
Chinese AI labs are operating under a specific combination of pressures that makes coordinated release windows almost inevitable. The first is capital: Moonshot AI (the company behind Kimi K2.6) raised approximately $2 billion at a valuation of approximately $20 billion in May 2026. MiniMax, Zhipu AI, and DeepSeek have each raised comparable rounds over the preceding 18 months. That capital has to be justified, and the primary vehicle for justification is model releases that demonstrate technical capability.
The second pressure is international credibility. Since DeepSeek R1's release in January 2025, Chinese AI labs have been under scrutiny from both technical and geopolitical angles. Releasing models with open weights under permissive licences is the most direct rebuttal to the narrative that Chinese AI cannot be trusted or verified. An Apache 2.0 model is auditable. You can read the weights. That transparency argument is commercially valuable for labs seeking partnerships with non-Chinese enterprises.
The third pressure is the release calendar of Western labs. When Anthropic releases Claude Opus 4.7 at a premium price point, or when OpenAI maintains GPT-5 at frontier pricing, there is a specific window of market opportunity for labs who can offer comparable coding performance at dramatically lower cost. That window does not stay open indefinitely — within 6 to 12 months, pricing pressure from the Western labs will likely narrow the gap. Releasing now, during the window, makes strategic sense.
The result is a predictable clustering effect: when one lab releases, the others accelerate their timelines to avoid being perceived as lagging. The 12-day window is not a conspiracy — it is a Nash equilibrium.
For broader context on the open-weight model landscape leading into this wave, see our April 2026 open-weight roundup covering Mistral, Llama, and the earlier GLM generation.
The Four Models: Side-by-Side
Here is what is known about each model as of the release date. Parameter counts for some models are not officially confirmed; estimates are based on inference VRAM requirements and architectural clues in the technical reports.
| Model | Lab | Licence | Parameters (est.) | Specialisation | Benchmark highlight | API cost est. (vs Claude Opus 4.7) |
|---|---|---|---|---|---|---|
| GLM-5.1 | Zhipu AI (Tsinghua spinout) | Apache 2.0 | ~70B (dense) | General coding, Python/Java strong | Leading HumanEval pass@1; strong on LiveCodeBench | ~28% of Opus 4.7 |
| MiniMax M2.7 | MiniMax | Open weights (research + commercial, check terms) | ~200B+ MoE (est.) | Long-context coding, large codebase review | Top-tier on extended context (128K+) code tasks; strong on SWE-Bench multi-file | ~25% of Opus 4.7 |
| Kimi K2.6 | Moonshot AI | Open weights (commercial licence, check terms) | ~72B (dense) | Multi-file coding, agentic code tasks | Competitive SWE-Bench Verified; strong agentic scaffold performance | ~30% of Opus 4.7 |
| DeepSeek V4 | DeepSeek | DeepSeek Open Model Licence (commercial with restrictions) | ~236B MoE (est.) | Frontier-cost coding and reasoning | Near-frontier on HumanEval, MBPP, LiveCodeBench; strongest of the four on reasoning-heavy code | ~22% of Opus 4.7 |
Benchmark numbers from the labs' own technical reports should be treated as upper bounds, not field performance. Labs optimise their evaluation pipelines. Run your own eval on the tasks that matter for your product before making a decision. The ranking above may not hold for your specific language, framework, or task complexity profile.
For context from the broader market: IBM released Granite 4.1 in the same window — a 3B/8B/30B family under Apache 2.0, optimised for enterprise coding and instruction following. It is a different market segment (smaller, deployable on-device or on modest hardware), but it signals that the open-weight coding model moment is broad, not narrow. Similarly, Grok 4.3 became available on Oracle Cloud Infrastructure Enterprise AI in May 2026, expanding the options for teams whose cloud procurement is locked to OCI.
DeepSeek V4: Context for Existing Readers
If you have been following this site, DeepSeek V4 is not a stranger. We covered the DeepSeek V4 Flash and Pro variants in detail in our DeepSeek V4 frontier-cost analysis, which examined the MoE architecture's cost advantages and the specific deployment configurations that make it viable at production scale. The V4 release in this 12-day window is a continuation of that trajectory — further benchmark improvements and expanded weight availability — rather than an architectural departure. If you are already evaluating DeepSeek V4, that earlier analysis remains the relevant operational guide. This article focuses on the comparative picture across all four models and the decision framework for choosing between them.
When to Actually Switch From a Closed Frontier Model
The cost argument is compelling. At 22–30% of Claude Opus 4.7's per-token cost, the economics are genuinely different — not marginally different. For a product spending $10,000 per month on inference, switching to one of these models (with equivalent quality) saves $7,000 to $7,800 per month. That is real money, particularly for bootstrapped teams and early-stage startups.
But "equivalent quality" is doing a lot of work in that sentence. Here is the framework for when a switch is defensible:
Switch when you have run your own eval and the quality delta is below your threshold. "Below your threshold" is product-specific. For an internal developer tool generating boilerplate, a 5% quality degradation on edge cases may be acceptable. For a product where a code error causes a financial or safety event, it is not. Do not rely on published benchmarks for this determination — benchmark on your data, your tasks, your users' query distributions.
Switch for internal tooling first. CI/CD code review automation, internal documentation generation, test scaffolding, PR description drafting — these are the ideal first deployment targets for open-weight models. The blast radius of a failure is contained. The cost saving is real. And you build operational familiarity with the model's behaviour before it touches customers.
Do not switch for regulated-sector customer-facing products without a safety layer. Healthcare, legal, financial services: the safety evaluation gap is real and is discussed in detail below. These sectors require not just quality benchmarks but safety evaluations, model cards with detailed capability and limitation documentation, and often certifiable audit trails. None of the four models meets that bar without additional engineering.
Consider a routing architecture rather than a full switch. A model router — simple classifier or embedding-based — can direct straightforward queries to the cheaper open-weight model and complex or sensitive queries to the frontier closed model. This is explored in our inference cost optimisation deep dive. A well-tuned router can capture 60–80% of the cost saving while preserving quality on the queries that most need it. See also the principles behind DoRA fine-tuning if you need to specialise an open-weight model further for your domain.
Practical Deployment: Self-Hosted vs API
There are three realistic deployment paths for these models: API via a managed provider, self-hosted on cloud GPU instances, or self-hosted on on-premises hardware. Each has a different cost structure, operational overhead, and data residency profile.
API via Managed Providers
Together.ai, Hugging Face Inference Endpoints, and Groq each host subsets of these models with per-token billing. This is the lowest-friction path: no infrastructure management, instant start, pay-as-you-go.
- Together.ai: Hosts DeepSeek V4 and GLM-5.1 with competitive pricing. Good option for prototyping and for workloads where you do not need to guarantee data residency.
- Hugging Face Inference Endpoints: Supports all four models with dedicated endpoint provisioning. You can specify the cloud region, which matters for GDPR and DPDP compliance. Dedicated endpoints mean no cold starts and predictable latency.
- Groq: The fastest inference available for models it supports (currently DeepSeek variants). Latency is dramatically lower than GPU-based inference — relevant for interactive coding assistants where response time affects user experience.
For UK teams with GDPR obligations: Hugging Face Inference Endpoints supports EU region deployment (eu-west-1, eu-west-3). Specify this explicitly when creating your endpoint. Together.ai does not currently offer EU data residency — check their data processing agreement before using user data in production queries.
Self-Hosted on Cloud GPU Instances
For higher-volume workloads, self-hosting on spot or on-demand GPU instances via RunPod, JarvisLabs, or AWS/GCP/Azure becomes cost-competitive with managed APIs — and gives you full control over data flow.
VRAM requirements by model (at 4-bit quantisation via AWQ or GPTQ):
# Approximate VRAM requirements (4-bit quantised)
GLM-5.1 (~70B): ~40 GB → 1× H100 80GB, or 2× A100 40GB
Kimi K2.6 (~72B): ~42 GB → 1× H100 80GB, or 2× A100 40GB
DeepSeek V4 (~236B MoE, active ~22B): ~48 GB → 1× H100 80GB (MoE active params)
MiniMax M2.7 (200B+ MoE, est.): ~56 GB → 1× H100 80GB (tight), safer on 2× A100
# Serving stack recommendation
pip install vllm # v0.5+ supports DeepSeek MoE and GLM architectures
# Example: Launch GLM-5.1 on H100 with FP8 quantisation
vllm serve THUDM/GLM-5.1-72B \
--quantization fp8 \
--max-model-len 32768 \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.92 \
--port 8000
MoE models like DeepSeek V4 and MiniMax M2.7 have a counterintuitive VRAM profile: the total parameter count is large (200B+), but only a fraction of parameters are active per forward pass (the "active parameters" are typically 20–25% of total). This means their actual inference VRAM footprint is closer to a 40–60B dense model. Benchmark throughput before assuming you need more hardware than the numbers suggest.
A working deployment with Ollama for local development or single-node testing:
# Ollama (local dev / staging) — smaller quantised versions
ollama pull deepseek-v4:32b-q4_K_M
ollama run deepseek-v4:32b-q4_K_M
# For GLM-5.1 via Ollama (when available in registry)
ollama pull glm5.1:72b-q4_K_M
# Test a coding prompt
ollama run deepseek-v4:32b-q4_K_M \
"Write a Python function that merges two sorted linked lists iteratively."
Self-Hosted On-Premises
For Indian enterprise teams with DPDP Act compliance requirements, and for UK enterprise teams with strict data residency mandates, on-premises deployment is the only route that gives a clean compliance story. The hardware requirement for a 70B dense model at FP8 is a single H100 80GB or two A100 40GB nodes — both now available through enterprise hardware distributors in India and the UK at declining prices. For MoE models like DeepSeek V4, a single H100 80GB is viable at the active-parameter footprint.
Quality vs Cost Trade-offs: Where These Models Still Lag
The cost case is compelling. The quality case is more nuanced. Here are the genuine gaps, as of May 2026:
Instruction following on complex, multi-constraint prompts. All four models exhibit occasional instruction drift — following two of three constraints in a complex system prompt reliably, but missing the third. This is particularly pronounced on prompts that combine format requirements, domain constraints, and safety guidelines simultaneously. Claude Opus 4.7 and GPT-4o still lead on this dimension. For products where your system prompt carries critical constraints (safety rules, output format requirements, brand voice guidelines), test this explicitly.
Safety and refusals. The safety training on all four Chinese models is lighter than on frontier Western models. They are more likely to comply with edge-case requests that Anthropic or OpenAI models would decline. This is a double-edged quality: useful for technical tasks where aggressive refusals are a nuisance, and a liability for consumer-facing products. The emergent misalignment research published earlier this year is relevant here — fine-tuning an open-weight model on domain-specific data can inadvertently degrade its safety properties in ways that are not immediately obvious.
English vs Chinese text quality. All four models are primarily trained on Chinese-heavy corpora. English coding performance is strong (code is largely language-agnostic), but English prose generation — documentation, commit messages, code comments, technical explanations — shows subtle quality differences versus models trained on English-dominant corpora. For products where the quality of generated English text is customer-visible, benchmark this specifically.
Reasoning on novel problem types. On established coding benchmark categories, these models perform at or near frontier. On genuinely novel problems — unusual algorithm combinations, domain-specific constraints with no training examples — the gap to frontier models widens. This matters less for most product use cases (most product code is not novel) but is worth knowing if your use case is unusual. For a comparison with how open-source reasoning models approach this, see our coverage of Gemma 4's thinking modes.
Ecosystem and tooling maturity. Claude and GPT-4o have mature structured output support, function calling reliability, and JSON mode implementations that have been hardened by millions of production deployments. These open-weight models are newer — function calling and structured output work, but less edge cases have been found and fixed. Expect to do more defensive output parsing.
Indian and UK Builder Perspective
The calculus for Indian and UK teams differs in interesting ways that are worth making explicit.
Indian startups: cost constraint is structural. For bootstrapped or seed-stage Indian startups, frontier API costs are not a minor line item — they are often the difference between a viable unit economics model and a fundraising dependency. At 22–30% of Claude Opus 4.7's cost, these open-weight models unlock product economics that simply were not available 12 months ago. The combination of Together.ai API access (no infrastructure overhead) with GLM-5.1 or DeepSeek V4 is a viable path for a pre-Series A team that cannot justify a $15,000/month inference bill.
For Indian enterprise teams — the larger banks, IT services firms, government-adjacent projects — the on-prem data residency story is equally important. DPDP Act compliance is increasingly a procurement requirement, and a self-hosted GLM-5.1 on a rack in a Mumbai data centre provides a cleaner story than any cloud API. The Apache 2.0 licence on GLM-5.1 is particularly important here: it removes the legal friction around running a Chinese model in an Indian enterprise context, because the weights are auditable, the licence is unambiguous, and there is no dependency on a foreign vendor's API terms.
UK teams: procurement friction is the barrier. For UK enterprise teams, the cost argument is secondary to procurement. Getting a Chinese-origin AI model through an enterprise procurement process — security review, vendor due diligence, legal review of data handling — is a non-trivial exercise, even for a model with open weights. The Apache 2.0 licence helps, but procurement teams will still ask questions about supply chain, about whether the weights might be updated to include backdoors (they cannot be, if you downloaded them and serve them locally), and about reputational risk.
The practical path for UK enterprise is the same as for Indian enterprise: self-host the weights, run on UK-region infrastructure, and present it as "our model" in procurement conversations rather than "a Chinese model via API". That framing is accurate, and it sidesteps the vendor risk conversation entirely. For UK SMBs and scale-ups, the managed API route via Hugging Face Inference Endpoints in eu-west-1 is cleaner — the data processing agreement is with Hugging Face (a US-French company with EU infrastructure), not directly with the Chinese lab.
For UK dev agencies and consultancies: using these models to power internal code review tooling, automated test generation, and documentation drafting creates an immediate competitive advantage on delivery speed and cost. The internal tooling context removes the procurement barrier (no customer data, no DPA needed) and gives you operational experience to inform client conversations about open-weight deployment.
Red Flags to Check Before Production Deployment
Before you put any of these models into a production pipeline, work through this checklist:
1. Read the actual licence file on the Hugging Face repository. Do not rely on marketing summaries. The licence for MiniMax M2.7 and Kimi K2.6 includes commercial use permissions but also specific restrictions that may affect your use case — particularly around redistribution and API productisation. GLM-5.1's Apache 2.0 is the cleanest; DeepSeek's licence has specific clauses about competing products. Read the full text before signing off a production deployment.
2. Review the model card for known failure modes. All four models have model cards, but they vary significantly in completeness. DeepSeek's documentation is the most thorough. GLM-5.1's is adequate. Kimi K2.6 and MiniMax M2.7 are thinner. If the model card does not describe known failure modes on adversarial inputs, treat that as a gap you need to fill with your own testing.
3. Run a red-teaming suite on your specific use case. Generic safety evals from the lab do not tell you whether the model is safe for your product. If your product involves code execution, test for prompt injection via code comments. If it involves user-provided input being included in code prompts, test for exfiltration attempts. The emergent misalignment research is a useful reference for what can go wrong even with models that pass standard safety evals.
4. Check the training data provenance. All four labs have stated that their training data does not include specific prohibited categories, but independent verification is not possible without weight analysis. If your deployment context requires certifiable training data provenance (certain government contracts, financial sector requirements), open weights do not currently provide sufficient documentation for that bar. This is a gap across the entire open-weight ecosystem, not specific to Chinese labs.
5. Add an output filter layer. For any customer-facing deployment, run outputs through a lightweight classifier that flags potential harms or policy violations before returning to the user. The cost of an output classifier is small relative to inference cost; the risk of unfiltered outputs reaching customers is not. Build this layer once and reuse it across models — it also gives you the flexibility to swap underlying models without re-engineering your safety architecture.
Do not deploy any of these models in a customer-facing context based solely on the labs' published benchmark scores and model cards. The safety evaluation gap is real. The instruction-following reliability on complex multi-constraint prompts is real. These are engineering-tractable problems — output filters, eval suites, routing architectures — but they require deliberate work. Shortcuts here create production incidents that are expensive to fix and reputationally costly.
The Broader Picture: What This Wave Means
Four frontier-class open-weight coding models in 12 days is not the end of the trend — it is an acceleration of a pattern that has been visible since DeepSeek R1 arrived in early 2025. The trajectory is clear: frontier coding capability is becoming commoditised at the open-weight layer, and the cost of inference at that capability level is falling faster than most product roadmaps anticipated.
The implication for builders is not "switch to Chinese models immediately". It is "re-examine every closed-API dependency in your stack and build a version that works at open-weight cost". That exercise will reveal that some dependencies are worth keeping (safety, reliability, ecosystem maturity), and others are not (cost per token for tasks where quality is indistinguishable).
The western open-source ecosystem is responding. IBM Granite 4.1 (Apache 2.0, 3B to 30B) is the most recent example — enterprise-grade, permissive licence, purpose-built for instruction following and code. The competition between Chinese and Western open-weight labs is, ultimately, good for builders in India and the UK: it drives down costs, improves quality, and creates genuine alternatives to the small number of frontier closed-model providers.
Act on that competition now. Run the evals. Instrument your costs. Build the routing layer. The window where this cost advantage is large is open. It will not stay open indefinitely as the frontier labs respond with their own pricing adjustments.
For the practical deployment economics behind making this switch financially, the numbers are laid out in our inference cost profitability guide. For the fine-tuning layer that can specialise an open-weight model for your domain while preserving its general capabilities, the DoRA fine-tuning deep dive is the relevant starting point.
Switching your inference stack? Connect with Builders who have done it.
AI Tech Connect's verified Builder network includes ML engineers who have migrated production systems from closed to open-weight models. Browse profiles to find expertise in your stack.
Browse Builders →