What you need to know about Llama 4's architecture

Llama 4 is Meta's most significant open-weight release since the original Llama sparked an entire ecosystem. Both Maverick and Scout are built on a mixture-of-experts (MoE) architecture — a design choice that changes the economics of inference in a way that matters enormously for builders working outside hyperscaler budgets.

In a standard dense transformer, every parameter participates in every forward pass, so per-token compute scales linearly with total parameter count. MoE breaks this relationship. The model is split into many specialised sub-networks called "experts", and for each token a learned router selects a small subset of them to activate. In Llama 4's case, roughly 17 billion parameters are active per token out of a much larger total. The rest of the network sits idle during that token's computation.
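
The routing step described above can be sketched with a toy example. This is a generic top-k router written for illustration, not Llama 4's actual implementation: the expert functions, the router scores, and the choice of k are all made up, and in a real model the scores come from a learned layer applied to the token.

```python
import math

def softmax(logits):
    """Convert raw router scores into probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, k=1):
    """Return [(expert_index, weight)] for the k highest-scoring experts."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    # Renormalise so the selected experts' weights sum to 1.
    return [(i, probs[i] / norm) for i in top]

def moe_layer(token, experts, router_logits, k=1):
    """Weighted sum of the selected experts' outputs; unselected experts never run."""
    return sum(w * experts[i](token) for i, w in route_token(router_logits, k))

# Four toy "experts" that simply scale their input (hypothetical stand-ins).
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.5, 0.3, 1.9]  # made-up router scores for one token

selected = route_token(logits, k=2)
output = moe_layer(10.0, experts, logits, k=2)
print(selected, output)
```

Only the two selected experts are called for this token, which is the source of the compute saving: the cost depends on k, not on how many experts exist.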

The practical consequence is striking. Maverick has approximately 400 billion total parameters, but its compute cost per token is determined by the 17B it activates, not the 400B it stores. Scout has approximately 109 billion total parameters with the same 17B active budget. Both models therefore have per-token compute costs broadly comparable to a 17B dense model, while drawing on the knowledge stored across a much larger set of weights.

This is why MoE has become the dominant architecture for frontier open-weight releases in 2026: it offers the capacity of a large model at the inference cost of a small one, provided you can afford to store and load the full weights. That storage cost is where Maverick and Scout diverge, and where your hardware planning begins.

Pro tip

Think of MoE models as having two separate hardware budgets: a storage budget (proportional to total parameters; it determines how many GPUs you need) and a compute budget (proportional to active parameters; it determines throughput and latency). Maverick and Scout have the same compute budget; their storage budgets differ by a factor of roughly 3.7 (~400B versus ~109B). Plan GPU count against storage, plan latency against active params.
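
As a rough illustration of the two budgets, the sketch below computes a weights-only storage figure and a per-token compute figure from the parameter counts quoted in this article. The "2 FLOPs per active parameter per token" figure is a common rule of thumb used here as an assumption, and the function names are hypothetical.

```python
def weight_storage_gb(total_params_b, bits_per_param):
    """Weights-only footprint in GB (1 GB = 1e9 bytes); ignores KV cache and activations."""
    return total_params_b * 1e9 * bits_per_param / 8 / 1e9

def flops_per_token(active_params_b):
    """Rule-of-thumb forward-pass cost: about 2 FLOPs per active parameter."""
    return 2 * active_params_b * 1e9

# (total parameters in billions, active parameters in billions), as quoted in the article
models = {"Maverick": (400, 17), "Scout": (109, 17)}

for name, (total_b, active_b) in models.items():
    storage = weight_storage_gb(total_b, bits_per_param=16)
    compute = flops_per_token(active_b)
    print(f"{name}: {storage:.0f} GB of 16-bit weights, {compute:.1e} FLOPs per token")

storage_ratio = weight_storage_gb(400, 16) / weight_storage_gb(109, 16)
print(f"Storage ratio Maverick/Scout: {storage_ratio:.2f}")
```

The compute figure is identical for both models, while the storage figure is not, which is the whole point of planning the two budgets separately.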

Maverick vs Scout: which one is right for your project?

The choice between Maverick and Scout is fundamentally a hardware-versus-capability trade-off. Both are available in base and instruction-tuned variants on Hugging Face Hub under the meta-llama organisation. The table below covers the key decision factors.

| Dimension | Llama 4 Maverick | Llama 4 Scout |
|---|---|---|
| Total parameters | ~400B | ~109B |
| Active parameters per token | 17B | 17B |
| Architecture | MoE transformer | MoE transformer |
| Minimum GPU setup (4-bit quant) | 3–4× 80GB H100 / A100 | 1–2× 80GB H100 / A100 |
| Best use case | Production quality-critical tasks, complex reasoning, long-context summarisation | Cost-sensitive deployments, prototyping, moderate-complexity tasks |
| Cloud cost profile | Higher: requires multi-GPU node | Lower: single-GPU H100 viable |
| Benchmark scores | See model card on Hugging Face Hub | See model card on Hugging Face Hub |
| Licence | Llama 4 Community Licence | Llama 4 Community Licence |

The architectural parity in active parameters means that if your bottleneck is inference speed, you will not see a dramatic difference between the two. The difference shows in output quality on harder tasks — particularly multi-step reasoning, nuanced instruction following, and domain-specific generation — where Maverick's larger total weight pool gives it more to draw on.

For most teams starting out, Scout is the sensible default. You can prototype your entire pipeline, optimise your prompts and application logic, and then scale to Maverick when you have evidence that the quality ceiling matters for your specific workload. Running both in A/B tests before committing to infrastructure spend is a reasonable strategy.

Getting Maverick and Scout from Hugging Face

Meta has published both base and instruction-tuned weights for Maverick and Scout on Hugging Face Hub under the meta-llama organisation. Access is gated — you must request and receive approval before downloading. The process is straightforward but not instant.

Step 1: Accept the licence. Navigate to the model repository on Hugging Face Hub (for example, meta-llama/Llama-4-Scout-17B-16E-Instruct). Click the "Access repository" button, then read and agree to the Llama 4 Community Licence terms. Meta typically processes requests within a few hours for individual developers; enterprise requests may take longer if the use-case description triggers manual review.

Step 2: Authenticate your CLI. Install the Hugging Face Hub CLI and log in with your user token:

pip install huggingface_hub
huggingface-cli login
# paste your token from huggingface.co/settings/tokens

Step 3: Download the weights. Use huggingface-cli download or the Python SDK. For Scout Instruct:

huggingface-cli download meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --local-dir ./llama4-scout-instruct \
  --local-dir-use-symlinks False
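
For the Python SDK route, a minimal sketch using `snapshot_download` from `huggingface_hub` might look like the following. The helper names and the script name are hypothetical, and the script assumes you have already been granted access to the gated repository and logged in as in Step 2.

```python
import sys

def build_download_kwargs(repo_id, local_dir, token=None):
    """Collect the arguments passed to huggingface_hub.snapshot_download."""
    kwargs = {"repo_id": repo_id, "local_dir": local_dir}
    if token is not None:
        kwargs["token"] = token  # otherwise the cached CLI login is used
    return kwargs

def download_weights(repo_id, local_dir, token=None):
    """Download a full model repository and return the local path."""
    # Imported lazily so the helper above can be used without the package installed.
    from huggingface_hub import snapshot_download
    return snapshot_download(**build_download_kwargs(repo_id, local_dir, token))

if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python download_weights.py <repo-id> <local-dir>
    print(download_weights(sys.argv[1], sys.argv[2]))
```

Pass the same repository ID and target directory you would give to the CLI command. Interrupted downloads can generally be restarted by running the script again.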

Step 4: Run inference with vLLM. vLLM is the recommended serving runtime for MoE models at production scale, offering efficient expert-routing and tensor parallelism. Install it and start a server:

pip install vllm

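# BF16 Scout weights are ~218GB, so this assumes a node with 4x 80GB GPUs.
# --served-model-name sets the model name that clients pass in requests.
python -m vllm.entrypoints.openai.api_server \
  --model ./llama4-scout-instruct \
  --served-model-name llama4-scout-instruct \
  --tensor-parallel-size 4 \
  --max-model-len 32768 \
  --dtype bfloat16 \
  --port 8000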

Once the server is running, query it with any OpenAI-compatible client:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="llama4-scout-instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain mixture-of-experts in three sentences."}
    ],
    max_tokens=512,
    temperature=0.7
)
print(response.choices[0].message.content)

For Maverick, increase --tensor-parallel-size to match your GPU count. The Hugging Face model card for each variant includes recommended runtime flags and known quantisation compatibility — check it before finalising your configuration.

Watch out

Llama 4's MoE architecture uses a non-standard expert-routing implementation that is not yet supported in all versions of Transformers and vLLM. Before downloading, check the "Updates" section of the relevant Hugging Face model card for the minimum supported version of each library. Attempting to load the weights with an incompatible version will produce an error or, worse, silently incorrect output. Pin your dependencies and test on a small prompt before deploying to production.
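
One way to make the version check routine is a small pre-flight script. The sketch below uses only the standard library; the minimum versions are deliberately left as placeholders, since the real values must come from the model card, and the simple version parser is an assumption that ignores pre-release ordering.

```python
from importlib import metadata

# Placeholder minimums: replace with the versions listed on the model card.
MINIMUM_VERSIONS = {"transformers": "0.0.0", "vllm": "0.0.0"}

def parse_version(text):
    """Turn '4.51.0.dev0' into (4, 51, 0): keep the leading digits of each
    dot-separated component and stop at the first component with none."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)

def meets_minimum(installed, minimum):
    """Compare two version strings after padding them to the same length."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b

def check_environment(minimums=MINIMUM_VERSIONS):
    """Return {package: problem} for every package that is missing or too old."""
    problems = {}
    for package, minimum in minimums.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            problems[package] = "not installed"
            continue
        if not meets_minimum(installed, minimum):
            problems[package] = f"{installed} < {minimum}"
    return problems

if __name__ == "__main__":
    print(check_environment() or "environment looks compatible")
```

Running this before loading any weights turns an obscure loading failure into an explicit message about which dependency needs upgrading.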

Licence terms summary. The Llama 4 Community Licence permits commercial use, fine-tuning, and redistribution of derivative works for most teams. Key restrictions: you cannot use Llama 4 outputs to train a competing foundation model, you must preserve Meta's attribution notices, and organisations with more than 700 million monthly active users require a separate commercial licence from Meta. For the vast majority of builders in India and the UK, the Community Licence is sufficient. Read the full text on the model repository and loop in your legal team if you are deploying at scale.

Running inference efficiently

Deploying a MoE model efficiently is meaningfully different from deploying a dense model of equivalent active parameter count. The routing overhead, expert memory locality, and tensor-parallelism strategy all interact in ways that can leave significant performance on the table if you do not account for them.

Quantisation options. For most teams, 4-bit quantisation (GPTQ or AWQ) is the practical starting point. It cuts weight storage to roughly a quarter of the 16-bit footprint without catastrophic quality loss on most tasks. For Maverick at 4-bit, expect a footprint of roughly 200GB; for Scout at 4-bit, roughly 55GB. FP16 and BF16 are the gold standard for quality and are viable if your hardware budget allows. Maverick in BF16 requires around 800GB of VRAM for the weights alone, which means a multi-node A100/H100 deployment. BF16 Scout at ~218GB fits within the 240GB of three 80GB H100s on paper, but that leaves little room for the KV cache and activations, so four GPUs is the more practical target.
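
The footprint figures above translate into GPU counts with a short calculation. This is a sketch under stated assumptions: it counts weights only, and the `usable_fraction` of 0.85 (the share of each GPU's memory left for weights after KV cache and activations) is an illustrative guess rather than a measured value.

```python
import math

def gpus_needed(total_params_b, bits_per_param, gpu_gb=80, usable_fraction=0.85):
    """Estimate how many GPUs are needed to hold the weights.

    total_params_b is in billions, so billions x bytes-per-parameter gives GB.
    """
    weights_gb = total_params_b * bits_per_param / 8
    return math.ceil(weights_gb / (gpu_gb * usable_fraction))

for name, total_b in (("Maverick", 400), ("Scout", 109)):
    for label, bits in (("4-bit", 4), ("BF16", 16)):
        print(f"{name} {label}: {gpus_needed(total_b, bits)} x 80GB GPUs")
```

With these assumptions the estimate comes out at 3 GPUs for 4-bit Maverick, 12 for BF16 Maverick, 1 for 4-bit Scout and 4 for BF16 Scout. Lowering `usable_fraction` to reflect long contexts or large batches pushes the counts up.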

Batching strategy. MoE models benefit particularly from request batching because larger batches make better use of the expert-routing parallelism. A batch size of 1 (single request at a time) leaves much of the architectural advantage of MoE unrealised. If your use case allows it, aim for a batch size of 8 or higher at serving time to see the throughput gains that MoE is designed to deliver.
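
A serving runtime can only batch requests that are in flight at the same time, so the client has to keep several open at once. The sketch below shows one way to do that with a thread pool. `send` is a hypothetical callable standing in for whatever function issues a single request to your server, and `fake_send` exists only so the example runs without one.

```python
from concurrent.futures import ThreadPoolExecutor

def run_concurrently(send, prompts, max_in_flight=8):
    """Call send(prompt) for every prompt with up to max_in_flight requests open at once.

    Results are returned in the same order as the prompts.
    """
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        return list(pool.map(send, prompts))

def fake_send(prompt):
    """Stand-in for a real request function; replace with a call to your server."""
    return prompt.upper()

prompts = [f"question {i}" for i in range(20)]
answers = run_concurrently(fake_send, prompts, max_in_flight=8)
print(answers[:3])
```

In a real deployment `send` would wrap a chat-completion call against the running server. The right value for `max_in_flight` depends on your latency targets and is worth measuring rather than assuming.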

Tensor parallelism. For Maverick, you will need to shard the model across multiple GPUs. vLLM's --tensor-parallel-size flag handles this. Set it to match the number of GPUs in your node. Expert parallelism (a related technique that distributes individual experts across GPUs) is increasingly supported in vLLM — check the release notes for the version you are using.

Cloud cost comparison. For teams in India, E2E Networks and Yotta Data Centres offer H100 and A100 instances priced competitively against AWS and Azure for sustained workloads. A single H100 80GB on E2E Networks is broadly comparable in monthly cost to the AWS ap-south-1 (Mumbai) equivalent, with the advantage of lower egress fees for India-destined traffic. For UK deployments, AWS eu-west-2 (London) is the default, but dedicated GPU cloud providers (CoreWeave, Lambda Labs) often offer better per-hour pricing for committed usage. Run a cost comparison at your expected tokens-per-day volume before committing to a provider — the spread between best and worst options at MoE scale is significant.
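
The comparison suggested above reduces to simple arithmetic once you have a measured throughput. In the sketch below every number is a placeholder: the provider names, hourly prices, and tokens-per-second figures are invented for illustration and should be replaced with real quotes and your own benchmark results.

```python
def cost_per_million_tokens(hourly_price_per_gpu, gpu_count, tokens_per_second):
    """Serving cost per one million generated tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_price_per_gpu * gpu_count / tokens_per_hour * 1_000_000

def monthly_cost(hourly_price_per_gpu, gpu_count, hours=730):
    """Cost of keeping the node running for a month (about 730 hours)."""
    return hourly_price_per_gpu * gpu_count * hours

# Placeholder quotes: (hourly price per GPU, GPU count, measured tokens/second)
quotes = {
    "provider_a": (3.00, 2, 1500),
    "provider_b": (2.40, 2, 1500),
}

for name, (price, gpus, tps) in quotes.items():
    per_million = cost_per_million_tokens(price, gpus, tps)
    print(f"{name}: {per_million:.2f} per 1M tokens, {monthly_cost(price, gpus):.0f} per month")
```

Keeping the throughput figure per provider matters because the same model can run at different speeds on different interconnects and GPU generations.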

The shifting open-source landscape

Llama 4's release lands in a dramatically different open-weight ecosystem from the one Llama 2 entered. Meta no longer dominates the community in the way it once did — and Hugging Face's Spring 2026 State of Open Source report makes this explicit.

According to Hugging Face's own analysis, the most-liked models on the Hub have shifted from a US-dominant Llama-heavy top 10 to an international mix. The current number-one most-liked model on Hugging Face Hub is DeepSeek-R1 — a Chinese open-weight model that triggered a remarkable ripple effect when it went viral in January 2025. That single release from DeepSeek demonstrated that a non-US lab could produce a frontier-calibre open model at a fraction of the presumed compute cost, and it permanently altered the competitive dynamic.

The Spring 2026 report confirms the trend has continued. Chinese labs, European institutions, and a growing cohort of researchers from India and South-East Asia are contributing top-ranked models across multiple categories. The open-source AI ecosystem is now genuinely global in a way that it was not even eighteen months ago.

For builders in India and the UK, this matters in two ways. First, the diversity of strong open-weight options has increased: Llama 4 is not the only serious choice, and the April 2026 round-up of Llama 4, Mistral Small 4, and GLM-5.1 covers the competitive field. Second, the quality bar for what "open-source" means has risen. Models that were considered research previews two years ago are now production-ready. The Gemma 4 release with configurable thinking modes is a good illustration: an Apache 2.0 model from Google that targets reasoning on consumer GPUs — a capability that would have been unimaginable in open-weight form in early 2024.

Llama 4 enters this landscape as a familiar first choice: the model many developers reach for first, with the widest tooling support and the most active fine-tuning community. But it now has to win on merit against a genuinely strong international field rather than by default. For builders, that competition is good news: the open-weight options at every price point and capability level are better than they have ever been.

What builders should consider before deploying

Before you commit to a Llama 4 production deployment, work through these six considerations.

Licence review. The Llama 4 Community Licence is commercially permissive for most teams, but it is not open-source in the FSF sense. Read the full licence text on the Hugging Face repository. If your product is a model-as-a-service API, a foundation model training pipeline, or you expect to exceed 700 million monthly active users, get legal advice before proceeding.

Data privacy and residency. Running Llama 4 self-hosted means your inference traffic never leaves your infrastructure — a significant advantage for applications handling personal data under India's DPDP Act or the UK GDPR. If you are on a cloud provider, confirm the data stays in your chosen region. AWS eu-west-2 and ap-south-1, E2E Networks, and Yotta all offer region-locked deployments.

Instruction-tuned versus base. The instruction-tuned variants (Llama-4-Maverick-17B-128E-Instruct and Llama-4-Scout-17B-16E-Instruct) are designed for conversational and task-following use cases and are the right starting point for the vast majority of applications. The base weights are intended for researchers and fine-tuners who want to apply their own instruction tuning. Do not use base weights in production without your own safety and instruction-following fine-tune.

Fine-tuning strategy. If you intend to fine-tune on proprietary data, plan for LoRA or QLoRA adapters rather than full fine-tunes — the latter are computationally prohibitive at Maverick scale for most teams. The Hugging Face PEFT library and Axolotl are the community's standard tools. Scout is a more practical fine-tuning target than Maverick for teams without dedicated ML infrastructure. Fine-tuned adapters trained on Scout's weights can be tested quickly and the lessons applied to Maverick if quality justifies the step up.
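
The reason adapters are so much cheaper than full fine-tunes follows from the arithmetic of low-rank updates: a rank-r adapter on a weight matrix of shape d_out × d_in trains two small matrices with r × (d_in + d_out) parameters in total. The sketch below works through one example; the 4096 × 4096 layer size and the rank of 16 are illustrative values, not Llama 4's actual dimensions.

```python
def full_params(d_in, d_out):
    """Trainable parameters when the whole weight matrix is fine-tuned."""
    return d_in * d_out

def lora_params(d_in, d_out, rank):
    """Trainable parameters for a rank-r adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)

d_in, d_out, rank = 4096, 4096, 16  # illustrative sizes only
full = full_params(d_in, d_out)
lora = lora_params(d_in, d_out, rank)
print(f"full: {full:,} params, LoRA: {lora:,} params ({100 * lora / full:.2f}% of full)")
```

For this example the adapter trains under 1% of the parameters that a full update of the same matrix would touch.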

Evaluation before deployment. Do not rely on the benchmark scores on the model cards as a proxy for performance on your task. Construct a small but representative evaluation set from your actual workload — 100 to 500 examples with human-verified answers — and run both Maverick and Scout through it. The results will often surprise you, and the investment is trivial compared to the cost of a failed production migration.
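
A minimal harness for this kind of comparison can be sketched as follows. It assumes exact-match scoring after light normalisation, which suits short factual answers but not free-form generation. The `generate` callables are hypothetical wrappers around each model's endpoint, stubbed out here so the example runs on its own.

```python
def normalise(text):
    """Lower-case and collapse whitespace so trivial differences do not count as errors."""
    return " ".join(text.lower().split())

def exact_match_accuracy(generate, eval_set):
    """Fraction of examples where the model's answer matches the verified answer."""
    correct = sum(
        normalise(generate(example["prompt"])) == normalise(example["answer"])
        for example in eval_set
    )
    return correct / len(eval_set)

def compare(models, eval_set):
    """Score every model on the same eval set; models maps a name to a generate callable."""
    return {name: exact_match_accuracy(fn, eval_set) for name, fn in models.items()}

# Tiny stand-in eval set; a real one would hold 100 to 500 verified examples.
eval_set = [
    {"prompt": "2 + 2 = ?", "answer": "4"},
    {"prompt": "Capital of France?", "answer": "Paris"},
]

# Stub models for illustration; replace with functions that query each deployment.
lookup = {"2 + 2 = ?": "4", "Capital of France?": " paris "}
models = {
    "stub_a": lambda prompt: "4",
    "stub_b": lambda prompt: lookup[prompt],
}

scores = compare(models, eval_set)
print(scores)
```

For tasks where exact match is too strict, the scoring function is the piece to swap out; the surrounding comparison loop stays the same.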

Monitoring and updates. Meta releases new versions, patches, and fine-tuned variants regularly. Establish a process for tracking Hugging Face model card updates and testing new releases against your eval set before promoting them to production. Staying on an old version without reviewing the changelog is an operational risk, because the community continues to discover and report quality regressions and safety issues.