What changed

On 2 April 2026, Google released Gemma 4, the latest in its open-weight family and a pointed answer to a year of pressure from Mistral, Meta and the Chinese open-weight labs. The headline numbers: a 26-billion-parameter mixture-of-experts model that activates only 4 billion parameters per token, configurable chain-of-thought depth at inference time, native function calling, and a clean Apache 2.0 licence. It is, by some distance, the most builder-friendly open release Google has shipped.

For teams in Bengaluru, Pune, London or Manchester running self-hosted inference on consumer GPUs, or trying to avoid the per-token bills of frontier closed models, Gemma 4 is the first open-weight checkpoint that takes reasoning seriously without demanding data-centre hardware. The combination of sparsity and tunable thinking is what makes it interesting. Neither feature is unique on its own; together, in a permissively licensed package, they are.

Pro tip

Treat the 26B/4B-active number as two separate budgets. Total parameters dictate VRAM or system-memory footprint. Active parameters dictate per-token compute and therefore latency. A 24GB consumer card can host the full 26B at 4-bit quantisation while only paying the compute price of a 4B model on each token.
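
The two budgets can be put into numbers with a few lines of arithmetic. This is a minimal sketch: the 26B and 4B figures come from the release as described above, while the helper names and the rule of thumb of about 2 FLOPs per active parameter per token are assumptions for illustration. It counts weights only, not the KV cache or runtime overhead.

```python
def weight_memory_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9


def flops_per_token(active_params: float) -> float:
    """Rule-of-thumb forward-pass cost: about 2 FLOPs per active parameter."""
    return 2 * active_params


TOTAL_PARAMS = 26e9   # memory budget: every expert has to be resident
ACTIVE_PARAMS = 4e9   # compute budget: only the routed experts run per token

print(f"4-bit weights: {weight_memory_gb(TOTAL_PARAMS, 4):.1f} GB")
print(f"Compute per token: {flops_per_token(ACTIVE_PARAMS) / 1e9:.0f} GFLOPs")
```

At 4 bits the weights come to about 13 GB, which is why a 24GB card has room left over for context.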

What's actually in the box

Gemma 4's architecture is a mixture-of-experts transformer. The weights total roughly 26 billion parameters, split across multiple expert modules. A learned router decides, for each input token, which subset of experts to activate — and Google has tuned that routing so the active count is 4 billion parameters per token. That is the standard sparse-MoE recipe popularised by Mixtral and refined by DeepSeek and the Llama 4 family.
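
To make the routing step concrete, here is a minimal top-k router sketch in plain Python. The expert count (8), the value of k (2) and the choice to softmax-normalise over the selected experts only are assumptions for illustration; the article does not state Gemma 4's actual expert layout or gating function.

```python
import math


def route_top_k(router_logits: list[float], k: int) -> list[tuple[int, float]]:
    """Pick the k highest-scoring experts and normalise their weights to sum to 1."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    peak = max(router_logits[i] for i in top)          # subtract max for stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Hypothetical router scores for one token across 8 experts.
logits = [0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.2]
chosen = route_top_k(logits, k=2)
for expert_id, weight in chosen:
    print(f"expert {expert_id}: weight {weight:.3f}")
```

Only the chosen experts run for that token; their outputs are combined using the returned weights, and the remaining experts cost memory but no compute.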

On top of the MoE backbone, Gemma 4 ships three features that matter for production builders.

Configurable thinking modes. The model can be instructed at inference time to perform shallower or deeper chain-of-thought reasoning. This is exposed as a parameter rather than a separate checkpoint — one set of weights, two operating regimes. For a customer-support classifier you turn thinking down; for a multi-step research agent you turn it up. We get into what this means in practice further down.

Multimodal input with variable aspect ratio and resolution. Gemma 4 accepts images at native shape — no forced square crops, no fixed 224×224 or 448×448 buckets. For document understanding, OCR-in-the-loop pipelines, and screen-grab agents, this is a quietly significant change.
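
A quick calculation shows why forced square crops hurt document and screenshot work. The sketch below is pure geometry and says nothing about Gemma 4's own preprocessing; the function name and the example image sizes are illustrative assumptions.

```python
def square_crop_retained(width: int, height: int) -> float:
    """Fraction of pixels kept by a centre crop to the largest inscribed square."""
    side = min(width, height)
    return (side * side) / (width * height)


# Example inputs: an A4 page scanned at 150 dpi and a full-HD screenshot.
for label, (w, h) in {"A4 scan": (1240, 1754), "screenshot": (1920, 1080)}.items():
    print(f"{label}: {square_crop_retained(w, h):.0%} of pixels survive a square crop")
```

A model that accepts the native shape sees the whole page or screen instead of the cropped centre.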

Built-in function calling. Tool use is a first-class capability rather than something bolted on with a separate fine-tune. You can wire Gemma 4 into agent frameworks without an additional adapter step.
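
On the application side, function calling usually comes down to parsing a structured tool request from the model and dispatching it to your own code. The sketch below assumes the model emits a JSON object with `name` and `arguments` fields; that shape, the `get_order_status` tool and the `dispatch_tool_call` helper are all hypothetical, so check the model card for the real format your runtime produces.

```python
import json
from typing import Any, Callable


def get_order_status(order_id: str) -> dict[str, str]:
    """Hypothetical tool: a stub standing in for a real order-system lookup."""
    return {"order_id": order_id, "status": "shipped"}


# Registry of tools the model is allowed to call.
TOOLS: dict[str, Callable[..., Any]] = {"get_order_status": get_order_status}


def dispatch_tool_call(raw_call: str) -> Any:
    """Parse an assumed JSON tool call and run the matching registered function."""
    call = json.loads(raw_call)
    name = call["name"]
    if name not in TOOLS:
        raise ValueError(f"Model requested unknown tool: {name}")
    return TOOLS[name](**call.get("arguments", {}))


model_output = '{"name": "get_order_status", "arguments": {"order_id": "A123"}}'
result = dispatch_tool_call(model_output)
print(result)
```

Rejecting unknown tool names explicitly keeps a malformed or hallucinated call from reaching arbitrary code.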

Wrap all of that in Apache 2.0 — the same licence that ships with Mistral's open releases — and you have a model that an enterprise legal team will sign off on without a long memo. That alone closes the gap with Llama 4 and Mistral Small 4 on commercial deployability.

Why a 26B/4B-active MoE matters for inference economics

The economics of self-hosting are dominated by two costs: hardware (capital) and electricity (operating). A dense 26B model pays the full 26B of compute on every token, which puts usable throughput beyond what cheap consumer GPUs and most small-team budgets can sustain. Activate only 4B per token, however, and the per-query compute cost falls sharply while you keep the capacity benefits of the larger total parameter count.

Concretely: per-token compute scales roughly with the active parameter count, so a 4B-active forward pass needs about 6.5 times less compute than a dense 26B forward pass (26 / 4 = 6.5). On a single 24GB consumer card, whether an RTX 4090 in London or a refurbished 3090 in Hyderabad, that translates to substantially higher tokens-per-second throughput. Coverage of the release positions Gemma 4 explicitly as targeting "high-throughput reasoning on consumer GPUs", and the architectural maths supports that framing.
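
One way to sanity-check the throughput argument is a bandwidth ceiling: at batch size 1, each generated token has to read the weights it uses from GPU memory at least once. The sketch below is a back-of-envelope upper bound, not a benchmark. The 1,000 GB/s bandwidth figure is a round assumption in the range of 24GB consumer cards, and the estimate ignores KV-cache reads, router overhead and kernel inefficiency.

```python
def decode_ceiling_tokens_per_s(params_read_per_token: float,
                                bits_per_weight: float,
                                bandwidth_gb_s: float) -> float:
    """Upper bound on single-stream decode speed if memory bandwidth is the limit."""
    bytes_per_token = params_read_per_token * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


BANDWIDTH_GB_S = 1000.0  # assumption: round figure, check your card's spec sheet

sparse = decode_ceiling_tokens_per_s(4e9, 4, BANDWIDTH_GB_S)    # 4B active
dense = decode_ceiling_tokens_per_s(26e9, 4, BANDWIDTH_GB_S)    # dense 26B

print(f"4B-active ceiling: {sparse:.0f} tokens/s")
print(f"Dense 26B ceiling: {dense:.0f} tokens/s")
print(f"Ratio: {sparse / dense:.1f}x")
```

Real throughput will sit well below either ceiling, but the ratio between the two is the part that carries over.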

The configurations below are illustrative, based on architectural reasoning rather than benchmarks; what actually fits and how fast it runs depend on your runtime, quantisation level and sequence length. Benchmark on your own workload before committing.

| Deployment shape | Hardware | What you get | Best fit |
| --- | --- | --- | --- |
| Single consumer GPU | RTX 4090 / 3090 (24GB) | Full 26B at 4-bit, room for context | Solo developer in India, small UK studio |
| Mid-tier consumer GPU | RTX 4080 / 4070 Ti (16GB) | 3-bit quant or partial offload | Dev workstation, prototype agent |
| Workstation card | RTX 6000 Ada / A6000 (48GB) | Higher precision, longer context | Boutique consultancy, in-house research |
| Data-centre node | Single H100 / A100 (80GB) | FP16, batching, multi-tenant | Mid-sized SaaS, internal platform |
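
The hardware tiers above follow from simple fit arithmetic. The sketch below picks the widest weight precision that fits a given card, assuming a flat 4 GB of headroom for KV cache, activations and runtime overhead. That headroom figure and the candidate bit widths are assumptions for illustration; real requirements grow with context length and batch size.

```python
from typing import Optional


def widest_precision_that_fits(vram_gb: float,
                               total_params: float = 26e9,
                               headroom_gb: float = 4.0,
                               candidates: tuple[int, ...] = (16, 8, 4, 3)) -> Optional[int]:
    """Return the largest bits-per-weight whose weights plus headroom fit in VRAM."""
    for bits in candidates:
        weights_gb = total_params * bits / 8 / 1e9
        if weights_gb + headroom_gb <= vram_gb:
            return bits
    return None  # nothing fits: offload to system memory or pick a smaller model


for vram in (16, 24, 48, 80):
    print(f"{vram} GB card -> {widest_precision_that_fits(vram)}-bit weights")
```

Under these assumptions the result matches the table: 3-bit on 16GB, 4-bit on 24GB, 8-bit on 48GB and 16-bit on 80GB.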

For the cost of a single high-end consumer GPU — roughly the price of three months of frontier closed-model usage at moderate scale — you can host Gemma 4 indefinitely. That maths is what makes the release interesting for cost-sensitive markets like India and for UK startups that have to defend every pound of cloud spend to a board.
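
The payback claim is easy to check against your own numbers. This is a minimal sketch with made-up inputs; the function name and all three figures are hypothetical, and the currency is whatever you pay in.

```python
from typing import Optional


def break_even_months(hardware_cost: float,
                      monthly_api_spend: float,
                      monthly_running_cost: float) -> Optional[float]:
    """Months until a GPU purchase pays for itself, or None if it never does."""
    monthly_saving = monthly_api_spend - monthly_running_cost
    if monthly_saving <= 0:
        return None  # self-hosting costs at least as much as the API each month
    return hardware_cost / monthly_saving


# Hypothetical figures: card price, current API bill, electricity plus upkeep.
months = break_even_months(hardware_cost=1800,
                           monthly_api_spend=650,
                           monthly_running_cost=50)
print(f"Break-even after {months:.1f} months")
```

Engineering time is deliberately left out here; add it to the running cost if you want a stricter estimate.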

Configurable thinking — what 'tunable depth' means in practice

Reasoning models have, until now, mostly come in two flavours: always-on chain-of-thought (DeepSeek-R1, OpenAI o1) or always-off (everything else). The always-on models pay a steep latency and token-cost penalty even on trivial queries. The always-off models leave reasoning capacity on the table when you actually need it.

Gemma 4's configurable thinking is the obvious middle path: one model, two operating modes, switched per request. In practice this means an inference-time parameter that scales how much hidden chain-of-thought the model produces before its visible answer. Lower depth means cheaper, faster output. Higher depth means more deliberate multi-step reasoning at higher token cost.

The builder mental model is simple. Wire your application's task router so that simple, deterministic tasks (classification, formatting, retrieval-grounded Q&A) hit Gemma 4 at low thinking depth, and reasoning-heavy tasks (multi-hop research, code review, planning) hit it at high depth. You stop paying the reasoning tax on the 80% of queries that do not need it.
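
That routing rule can be sketched in a few lines. The task labels, the `ThinkingDepth` enum and the `pick_thinking_depth` helper are hypothetical names; how the chosen depth is passed to the model depends on your runtime and is not shown here.

```python
from enum import Enum


class ThinkingDepth(Enum):
    LOW = "low"
    HIGH = "high"


# Hypothetical task labels produced by your application's own classifier.
HIGH_DEPTH_TASKS = {"multi_hop_research", "code_review", "planning"}


def pick_thinking_depth(task_type: str) -> ThinkingDepth:
    """Send reasoning-heavy tasks to high depth; everything else stays cheap."""
    if task_type in HIGH_DEPTH_TASKS:
        return ThinkingDepth.HIGH
    return ThinkingDepth.LOW


for task in ("classification", "code_review", "formatting"):
    print(f"{task}: {pick_thinking_depth(task).value}")
```

Defaulting unknown task types to low depth is a design choice: it keeps the cheap path as the fallback, so a new task type has to be opted in to the expensive one.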

Watch out

Tunable thinking is not free reasoning. Higher depth means more output tokens generated internally before the visible answer — which means proportionally higher latency and higher compute cost. If you turn thinking up for every request by default, you have effectively bought yourself a slower, more expensive model. Routing logic matters as much as the model choice.
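
The latency cost is straightforward to estimate, because hidden reasoning tokens are generated at the same speed as visible ones. The sketch below uses invented token counts and an invented decode speed, and it ignores prompt-processing time.

```python
def estimated_latency_s(thinking_tokens: int,
                        answer_tokens: int,
                        tokens_per_second: float) -> float:
    """Generation time when hidden and visible tokens decode at the same rate."""
    return (thinking_tokens + answer_tokens) / tokens_per_second


# Hypothetical: a 100-token answer at 50 tokens/s, with and without deep thinking.
shallow = estimated_latency_s(thinking_tokens=0, answer_tokens=100, tokens_per_second=50)
deep = estimated_latency_s(thinking_tokens=900, answer_tokens=100, tokens_per_second=50)
print(f"Low depth: {shallow:.1f} s, high depth: {deep:.1f} s")
```

With these made-up numbers the same visible answer takes ten times longer at high depth, which is the cost a default of "always think hard" imposes on every request.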

Google's release notes on Hugging Face are the canonical source for the exact parameter name and acceptable values — the runtime ecosystem (Transformers, vLLM, llama.cpp via GGUF) will surface this differently in each tool, so check the model card for your stack of choice.

Where Gemma 4 sits in the 2026 open-weight landscape

The open-weight landscape changed dramatically in the first quarter of 2026. We covered the broader picture in our April round-up of Llama 4, Mistral Small 4 and GLM-5.1; Gemma 4 slots into that picture as the configurable-reasoning entry. The architectural and licence comparison, without making leaderboard claims that have not been independently verified, looks like this.

| Model | Architecture | Licence | Distinctive bet |
| --- | --- | --- | --- |
| Gemma 4 (Google) | 26B MoE, 4B active per token | Apache 2.0 | Configurable thinking, consumer-GPU reasoning |
| Llama 4 Scout / Maverick (Meta) | MoE at multiple scales | Llama 4 Community Licence | Ecosystem maturity, fine-tune scaffolding |
| Mistral Small 4 | Dense | Apache 2.0 | Predictable latency, French sovereignty narrative |
| DeepSeek (open releases) | MoE, large total / small active | Permissive (varies by release) | Aggressive cost-per-token, strong reasoning lineage |
| GLM-5.1 | Dense / MoE variants | Permissive (varies) | Long-context, multilingual coverage |

The two areas where Gemma 4 stands out are licence clarity (Apache 2.0 with no additional riders, which only Mistral Small 4 matches in the comparison above) and architectural ergonomics (the sparsity ratio plus the thinking switch). It does not necessarily win on raw quality; that is a benchmark question we are intentionally not answering until the community has reproducible numbers. For builders making a build-versus-buy call, though, those two areas are usually the deciding ones.

Editorial observation

Apache 2.0 plus configurable reasoning is the licence-and-feature combination self-hosted teams have been waiting for. UK public-sector projects can finally ship a reasoning agent without a procurement team flagging the model licence as a risk — and the fact that it runs on GPUs many studios already own is the bit that actually closes the deal.

— AI Tech Connect editorial observation

Builder playbook: when to pick Gemma 4 over a closed model

Reach for Gemma 4 first when one or more of these conditions applies.

  1. You need data residency. India's DPDP Act and the UK's procurement instincts both push hard towards on-premises or sovereign-cloud deployments for sensitive workloads. Self-hosted Gemma 4 is a clean fit; closed APIs are not.
  2. Your unit economics demand a fixed cost. Per-token billing is unpredictable; a single GPU's electricity bill is a flat line. For high-throughput, low-margin product surfaces — content moderation, log triage, in-app chat — Gemma 4 wins on TCO once volume crosses a threshold.
  3. You want function calling without an adapter step. The built-in tool-use support means you skip the fine-tune layer that earlier open models often needed before they would behave inside an agent framework.
  4. Your task mix is bimodal. If your traffic splits roughly between simple classification and harder reasoning, the thinking switch lets one deployment serve both without a model fleet.
  5. You need to ship to constrained environments. Edge deployments on workstations, air-gapped customer sites, or inside a UK NHS trust's network — these are places where the only realistic path is an open-weight model on hardware you control.
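
The volume threshold in condition 2 can be estimated directly. This sketch assumes a single blended API price per million tokens and a fixed monthly self-hosting cost (amortised hardware, electricity, upkeep); both inputs and the helper name are hypothetical.

```python
def break_even_tokens_per_month(fixed_monthly_cost: float,
                                api_price_per_million_tokens: float) -> float:
    """Monthly token volume above which a fixed-cost deployment beats per-token billing."""
    if api_price_per_million_tokens <= 0:
        raise ValueError("API price must be positive")
    return fixed_monthly_cost / api_price_per_million_tokens * 1_000_000


# Hypothetical: 300 per month to run the box, API priced at 2 per million tokens.
threshold = break_even_tokens_per_month(fixed_monthly_cost=300,
                                        api_price_per_million_tokens=2.0)
print(f"Self-hosting wins above {threshold / 1e6:.0f} million tokens per month")
```

Below the threshold the API is cheaper; above it, every extra token on your own hardware is close to free until the GPU saturates.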

And when to stay on a closed flagship

Gemma 4 does not eliminate the case for closed models. If you need the absolute frontier of capability (the longest context windows, the most sophisticated tool use, the freshest training data), closed flagships from Anthropic, OpenAI and Google's own Gemini line still lead. If you cannot afford the operational cost of running and updating your own inference stack, an API call is cheaper than a platform team. And if your traffic is bursty and unpredictable, paying per token avoids the capital cost of idle GPUs.

Where it falls short

It is worth being honest about what Gemma 4 does not solve.

  • Benchmarks not yet independently reproduced. Whatever numbers Google's release notes claim, the community has not yet had time to verify them across diverse prompts. We have intentionally avoided citing specific MMLU or HumanEval figures here for that reason. Treat any figure you see in the first fortnight as provisional.
  • MoE quirks. Mixture-of-experts models can show uneven quality across domains depending on which experts the router favours. If your workload is in a niche language or domain, hand-test before committing.
  • Deployment complexity. Running an MoE efficiently on consumer hardware is harder than running a dense model. Quantisation choices, runtime selection (vLLM versus llama.cpp), and routing-table memory all matter. Plan for an engineering investment.
  • Configurable thinking is a knob, not magic. Latency and token cost rise roughly in proportion to the extra reasoning tokens generated. The win is in routing: wire the knob to your task type rather than turning it up everywhere.
  • Multimodal still trails specialist vision models. Variable aspect ratio is welcome, but for OCR-heavy or fine-grained vision tasks, a dedicated VLM is still likely to outperform a general-purpose reasoning model with vision bolted on.

None of these are deal-breakers, but they are the places where a careful eval will save you from a regrettable migration. Run a small, representative benchmark of your actual workload before swapping out an existing closed-model deployment.
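
A representative benchmark does not need much machinery. The sketch below scores any callable that maps a prompt to an answer against expected outputs and records wall-clock latency. `stub_model` is a hypothetical stand-in for a call to your own inference endpoint, and exact-match scoring is an assumption that suits classification-style tasks only.

```python
import time
from typing import Callable


def run_eval(model_fn: Callable[[str], str],
             cases: list[tuple[str, str]]) -> dict[str, float]:
    """Exact-match accuracy and mean latency over (prompt, expected) pairs."""
    if not cases:
        raise ValueError("Need at least one test case")
    correct = 0
    latencies = []
    for prompt, expected in cases:
        start = time.perf_counter()
        answer = model_fn(prompt)
        latencies.append(time.perf_counter() - start)
        if answer.strip().lower() == expected.strip().lower():
            correct += 1
    return {"accuracy": correct / len(cases),
            "mean_latency_s": sum(latencies) / len(latencies)}


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in; replace with a call to your own deployment."""
    return "refund" if "money back" in prompt else "other"


cases = [("I want my money back", "refund"), ("Where is my parcel?", "delivery")]
report = run_eval(stub_model, cases)
print(report)
```

Running the same cases against your current closed-model deployment and against Gemma 4 gives a like-for-like comparison on the workload that actually matters to you.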

For the broader open-weight context, see our April 2026 round-up. Our Verified Builder directory is where the practitioners running these models in production live; if you are one of them, you can add your profile in a few minutes. External references for this piece include the 2026 open-weight cheat sheet, the LLM-stats update tracker, and Hugging Face's earlier overview of open-source LLMs.