How much does LLM distillation actually shrink model size?

In practice, a well-executed distillation project typically produces a student that is 4–10× smaller than the teacher while retaining 85–95% of domain-task accuracy. For example, distilling a 70B teacher for a narrow classification or extraction task often yields a 7B or 14B student that matches the teacher on that specific workload. The ceiling depends on task complexity: narrow, well-scoped tasks (single-domain Q&A, structured extraction, code completion in one language) see the highest retention rates. Tasks requiring broad reasoning, extensive world knowledge or multi-step planning typically lose more quality at aggressive compression ratios.

What is the difference between SFT-on-outputs distillation and feature-level distillation?

SFT-on-outputs (also called black-box or data distillation) uses the teacher purely as a data generator: you run inference on your task dataset, collect the teacher's text completions, filter for quality, then fine-tune the student on those (prompt, completion) pairs using standard supervised fine-tuning. It works with any teacher, even a closed-weights API model like GPT-4o or Claude 3.5 Sonnet, and is straightforward to implement with Axolotl or TRL. Feature-level distillation (also called white-box or on-policy distillation) requires access to the teacher's probability distributions or intermediate activations. The student is trained to match the teacher's token-level log-probabilities — not just its final text — which provides a richer training signal. TRL's GKD trainer implements this. It delivers a higher quality ceiling but requires that teacher and student run simultaneously, which means you need the teacher weights in memory or a remote teacher server, and both teacher and student must share a compatible vocabulary.

Which student model should I start with in 2026?

For most tasks in 2026, Qwen 3 8B (dense) or Gemma 4 12B are the safest starting points. Both have strong base capabilities, wide community support in Axolotl and TRL, and deliver excellent throughput on a single A10G or L40S GPU. Qwen 3 8B is the best choice if you need multilingual coverage or if your deployment will be on hardware with tight VRAM constraints. Gemma 4 12B is the better pick for tasks where reasoning depth matters at the 8B–12B scale. Llama 4 Scout (17B active, 109B total) requires multi-GPU serving and is better suited to benchmarks than narrow production tasks. For extremely cost-sensitive deployments, Qwen 3 1.7B and Gemma 4 4B are viable targets if your task distribution is narrow and you are willing to invest in a larger, higher-quality training set.

How do I know when distillation has failed and the student has collapsed?

Student collapse shows up in four ways. First, domain eval metrics drop suddenly while training loss or perplexity keeps improving: the student is memorising the training set rather than generalising. Second, the student starts producing the teacher's stylistic quirks rather than correct answers (this is called teacher-style overfitting). Third, performance on your held-out edge-case set degrades significantly even when average accuracy is stable, indicating the student has learned the easy majority but is brittle on tail inputs. Fourth, MMLU or HumanEval scores regress sharply relative to the base model, which suggests you have overfit to your narrow task distribution and the model has lost general capability. Always run a full eval suite including domain tasks, general benchmarks and a hand-curated edge-case set before declaring a distillation successful.

When does distillation ROI beat routing and caching?

Distillation beats routing and caching when your traffic is narrow and high-volume. Routing works best on mixed traffic: if 40% of queries need a frontier model, routing handles that gracefully, but you are still paying frontier prices for that 40%. Distillation removes that residual frontier spend: once the student is trained, all traffic for the task goes to the cheap model. The break-even point depends on your monthly token volume and the teacher API cost. At 500M output tokens per month against a $3/M-token teacher, a self-hosted 8B student at parity saves only a few hundred pounds a month, so a week of engineering time takes roughly 11 months to pay back. At 2 billion tokens per month the same investment pays back in under three months. Below 100M tokens per month, caching and routing almost always beat distillation on pure ROI because the upfront training cost dominates.

LLM Distillation in Production: Shrink Your Model, Keep Quality

Why distillation exists: the three-way cost trade-off

By mid-2026, every team running LLMs at meaningful scale has at least three tools for managing inference costs: caching, routing, and distillation. They are not interchangeable. Each targets a different root cause of overspend, and choosing the wrong one for your workload is how projects deliver disappointing ROI despite months of engineering effort.

Caching targets repetition. If the same prompt prefix, the same retrieved context, or the same answer appears repeatedly, caching stops you paying full inference cost on the repeated portion. It is the safest lever — done correctly, it reduces cost without altering the model or the answer. The limit is that it only helps workloads with structural repetition. A high-variety task — where every input is genuinely distinct — gets almost nothing from caching. For the full caching and routing playbook, see our companion guide on routing and semantic caching.

Routing targets variety. It classifies each incoming query by complexity and sends cheap traffic to a budget model and hard traffic to a frontier model. It is excellent when your workload has a wide spread of difficulty — perhaps 60% simple queries and 40% hard ones — because it means you are not spending frontier money on simple work. The limit is that you still need the frontier model for the hard 40%, and if that fraction is large or unpredictable, routing does not eliminate your frontier bill; it merely reduces it.

Distillation targets the narrow-workload case. If your application does one thing repeatedly — classifies support tickets, extracts structured data from a fixed schema, generates SQL from natural language — and the distribution of that task is stable, distillation can train a small specialist model to replace the frontier model entirely for that workload. The savings are complete: once the student is deployed, the frontier API is no longer called for that use case at all.

The decision tree is therefore: start with caching if you have repetition; add routing if you have mixed-difficulty traffic; invest in distillation only if you have a narrow, high-volume, stable-distribution workload where a 7B–14B model can realistically match frontier quality on the specific task. If you are still deciding whether fine-tuning is right at all, the fine-tuning decision ladder is a useful prerequisite read.

Distillation vs routing vs caching: when each wins
Method	Best condition	Worst condition	Infrastructure change
Caching	High prompt/response repetition	Every input is unique	Minimal — add cache layer
Routing	Wide difficulty spread in traffic	Uniform difficulty — all hard or all easy	Moderate — add classifier + model registry
Distillation	Narrow task, high volume, stable distribution	Broad multi-task workload, mixed distribution	High upfront — train + serve new model

Two distillation approaches: outputs vs features

Once you have decided that distillation is the right lever, there is a second decision: what exactly does the student learn from? There are two fundamentally different approaches, and they have different tooling requirements, quality ceilings and practical complexity.

Approach 1: SFT on teacher outputs (black-box distillation)

The most widely deployed approach in production in 2026 is also the simplest conceptually. You use the teacher model as a data generator: run inference on your task dataset, collect the teacher's completions, filter for quality, and then train the student on those (prompt, completion) pairs using standard supervised fine-tuning. The teacher is a black box — you only need its text outputs, not its weights or internal states.

This approach works with any teacher, including closed-weights API models. The clearest public example is DeepSeek, which distilled reasoning capabilities from DeepSeek-R1 into smaller Qwen and Llama-based students by fine-tuning them on R1-generated samples. Meta and Google have also described teacher-student distillation for Llama 4 (Scout and Maverick, from the much larger Behemoth model) and for the Gemma family, although their published recipes lean on soft targets from the teacher rather than text outputs alone. The pattern is industry-standard.

The practical tooling is straightforward: generate outputs with the teacher API or local inference, write them to a JSONL file in the correct chat format, and train the student with Axolotl or TRL. The quality ceiling is moderate — the student learns to reproduce teacher outputs, but it does not learn the teacher's uncertainty or the probability mass the teacher placed on alternative tokens.

Approach 2: Feature-level / on-policy distillation (white-box)

Feature-level distillation — sometimes called on-policy distillation (OPD) or white-box distillation — trains the student to match the teacher's token-level probability distributions, not just its final text. Instead of treating the teacher's output as ground truth, the student minimises a divergence (typically forward or reverse KL divergence) between its predicted distribution and the teacher's at each token position.

This is a richer training signal. The teacher's distribution encodes uncertainty: when the teacher assigns 0.4 probability to token A and 0.3 to token B, the student learns that both are plausible, rather than treating the sampled token A as the only correct answer. This leads to a higher quality ceiling, particularly for reasoning tasks. TRL's GKDTrainer implements Generalised Knowledge Distillation, which adds on-policy sampling so the student is trained on sequences it actually generates — correcting the distribution mismatch that plagued earlier KL-distillation approaches.
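To make the divergence concrete, here is a minimal sketch in plain Python for a single token position. Only the teacher's 0.4 and 0.3 probabilities come from the text; the three-token vocabulary, the third teacher probability and the student distribution are assumptions chosen for illustration, and a real trainer would compute this over full vocabularies on GPU tensors.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Teacher puts 0.4 on token A and 0.3 on token B (from the text); the
# remaining 0.3 on a third token C is an assumption for illustration.
teacher = [0.4, 0.3, 0.3]
# Hypothetical student that is overconfident in token A.
student = [0.8, 0.1, 0.1]

forward_kl = kl_divergence(teacher, student)   # KL(teacher || student)
reverse_kl = kl_divergence(student, teacher)   # KL(student || teacher)
# Hard-label SFT loss if token A was the sampled target: -log q(A).
hard_label_loss = -math.log(student[0])

print(f"forward KL: {forward_kl:.4f}")
print(f"reverse KL: {reverse_kl:.4f}")
print(f"hard-label cross-entropy on A: {hard_label_loss:.4f}")
```

The hard-label loss keeps falling as the student piles more mass onto token A, whereas both divergences penalise that overconfidence because the teacher considered B and C plausible too.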

The catch is access: you need the teacher's logits, which means you need the teacher's weights running simultaneously during training, or a compatible remote teacher server. You also need the teacher and student to share the same vocabulary (or a compatible mapping), which rules out most pairings across model families with different tokenizers. TRL's 2026 DistillationTrainer addresses the throughput problem with a generation buffer that decouples training batch size from generation batch size, delivering up to 40× speedup over naive implementations.

Distillation approach decision guide
Factor	SFT on outputs	Feature-level (GKD/OPD)
Teacher access required	Text outputs only (API is fine)	Full weights + matching vocabulary
Quality ceiling	Moderate — reproduces teacher behaviour	Higher — matches teacher uncertainty
Training complexity	Low — standard SFT pipeline	High — concurrent teacher inference needed
Recommended for	Most production cases; closed teacher APIs	Reasoning tasks; open-weight teacher families
Primary tooling	Axolotl, TRL SFTTrainer, LLM Foundry	TRL GKDTrainer / DistillationTrainer

Practical guidance: For your first distillation project, default to SFT on teacher outputs. It is easier to debug, works with any teacher (including proprietary APIs), and delivers production-ready results for the majority of narrow-task workloads. Only invest in feature-level distillation if you have already shipped SFT-on-outputs and found it cannot close the quality gap for your task.

Choosing your student model: Llama 4, Gemma 4, Qwen 3

Selecting the right base model for your student is as important as the training recipe. In 2026 the open-weight landscape has consolidated around three dominant families: Meta's Llama 4, Google's Gemma 4, and Alibaba's Qwen 3. Each has distinct characteristics that affect both training cost and inference economics.

The general principle is to choose the smallest model that can realistically be trained to your target quality. A model that is twice as large requires roughly twice the VRAM, roughly twice the GPU hours to train, and at least twice the inference cost per token to serve. The right starting point for most production distillation projects in 2026 is the 7B–14B range.

Student model comparison for production distillation (as of June 2026; verify model availability)
Model	Active params	Min VRAM (FP16)	Single-GPU serving	Best for
Qwen 3 1.7B	1.7B dense	~4 GB	Yes (any consumer GPU)	Highest-volume, narrowest tasks; latency-critical edge deployments
Qwen 3 8B	8B dense	~16 GB	Yes (A10G, L4)	Recommended default; multilingual; strong reasoning at the 8B scale
Gemma 4 12B	12B dense	~24 GB	Yes (L40S, A100 40GB)	Reasoning-heavy tasks; best quality ceiling in the sub-14B range
Qwen 3 14B	14B dense	~28 GB	Yes (A100 40GB) with quantisation	Step up from 8B if domain eval gap requires it
Llama 4 Scout	17B active / 109B total MoE	~218 GB	No (multi-GPU required)	Not recommended as a student; use as teacher only

Llama 4 Scout's MoE architecture makes it attractive on benchmarks but a poor choice of student: all 109B parameters must be resident in memory even though only 17B are active per forward pass, so it needs multi-GPU serving, which eliminates the inference cost savings that motivated distillation in the first place. Use Scout or Maverick as a teacher, not a student.
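A rough way to sanity-check FP16 memory figures is to multiply total parameters by two bytes. The sketch below does only that; it covers weights alone and ignores KV cache, activations and framework overhead, so treat the results as lower bounds. The function name is hypothetical.

```python
def fp16_weight_gb(total_params_billion, bytes_per_param=2):
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return total_params_billion * bytes_per_param

models = {
    "Qwen 3 8B": 8,
    "Gemma 4 12B": 12,
    "Qwen 3 14B": 14,
    # MoE: every expert must be resident, so use total, not active, parameters.
    "Llama 4 Scout": 109,
}
for name, params_b in models.items():
    print(f"{name}: ~{fp16_weight_gb(params_b):.0f} GB of FP16 weights")
```

Passing `bytes_per_param=0.5` gives the corresponding estimate for 4-bit quantised weights.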

For multilingual workloads — relevant for both Indian-language applications and UK teams processing non-English content — Qwen 3 8B is the recommended default. Its training corpus has significantly better coverage of Hindi, Bengali, Tamil and other regional languages compared to the Llama and Gemma families at the same scale. For the self-hosting infrastructure decision that underpins your serving choice, see the self-host vs API inference guide.

Generating and filtering teacher outputs: quality over quantity

The quality of your teacher-generated dataset is the single largest determinant of distillation success. A student trained on 5,000 high-quality examples consistently outperforms one trained on 50,000 noisy ones. The generation and filtering stage deserves as much engineering attention as the training itself.

Step 1: Assemble your task dataset

Start with your real production inputs. Pull a representative sample from your request logs: at minimum 10,000 examples, ideally 50,000 or more for non-trivial tasks. Deduplicate aggressively: near-duplicate inputs in the training set are waste at best and a source of overfitting at worst. Cluster the inputs and verify that your sample covers the full input distribution, including the tail cases your model handles least well in production. Those edge cases are disproportionately valuable in training data.
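As a starting point for deduplication, the sketch below removes inputs that differ only in case, punctuation or whitespace. This is an assumption about what counts as a duplicate: it will not catch paraphrases, for which MinHash or embedding-based clustering would be the usual next step. The function names are hypothetical.

```python
import hashlib
import re

def normalise(text):
    """Lowercase, strip punctuation and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def deduplicate(prompts):
    """Keep the first occurrence of each normalised prompt, preserving order."""
    seen = set()
    unique = []
    for prompt in prompts:
        key = hashlib.sha256(normalise(prompt).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(prompt)
    return unique

logs = [
    "Where is my order?",
    "where is my order",
    "  Where is   my ORDER?! ",
    "Cancel my subscription",
]
print(deduplicate(logs))
```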

If you do not yet have production logs (you are building a new capability), you can generate synthetic inputs with a frontier model. Prompt the teacher to produce varied examples across the full expected distribution: different lengths, different complexity levels, different domain sub-topics. Synthetic input generation is slower and more expensive than mining logs, but it gives you control over distribution that log mining cannot.

Step 2: Generate teacher completions

Run each input through the teacher to produce a completion. For SFT-on-outputs distillation, use temperature 0 for deterministic, high-confidence outputs on tasks with a single correct answer (extraction, classification, structured generation). Use temperature 0.7–1.0 if you want the student to learn the teacher's natural variation (summarisation, open-ended generation). For reasoning tasks, chain-of-thought completions significantly outperform direct-answer completions as training targets — the student learns the reasoning path, not just the conclusion.

Budget the generation cost carefully. If your teacher is a frontier API model at $3 per million output tokens and you are generating 50,000 completions at an average of 400 tokens each, the generation cost is $60. That is negligible relative to training compute. For 500,000 completions at 1,000 tokens, the cost is $1,500 — still modest, but worth modelling before you start.
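The arithmetic above is easy to script so you can vary the assumptions. This sketch counts output tokens only, as the text does; input-token charges would add to the total. The function name is hypothetical.

```python
def generation_cost_usd(num_completions, avg_output_tokens, price_per_million_usd):
    """Teacher output-token cost; input tokens are ignored, as in the text."""
    total_tokens = num_completions * avg_output_tokens
    return total_tokens / 1_000_000 * price_per_million_usd

print(generation_cost_usd(50_000, 400, 3.00))
print(generation_cost_usd(500_000, 1_000, 3.00))
```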

Step 3: Quality filter

Not all teacher outputs are equally good. Apply at minimum three filters before writing the dataset to disk.

Length filter: discard completions that are suspiciously short (likely refusals or degenerate outputs) or abnormally long (likely hallucination spirals). Set thresholds at the 2nd and 98th percentile of your expected output length distribution.

Format filter: for structured-output tasks (JSON extraction, code generation), programmatically validate that the output is parseable. A teacher that occasionally produces malformed JSON will train the student to do the same. Discard any completion that fails validation.

LLM-as-judge filter: for tasks where correctness is harder to verify programmatically (reasoning, summarisation, QA), run a second frontier model as a judge. Prompt it to score each (input, completion) pair on a 1–5 scale and discard pairs scoring 3 or below. This adds cost — roughly the same as the generation cost — but is the most reliable quality gate for subjective tasks.

Do not skip the quality filter. Training on teacher refusals, malformed outputs or low-quality completions is a leading cause of student collapse. A 15% reduction in training set size after filtering consistently produces a better student than training on the unfiltered set.
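The three filters can be chained in a few lines. The sketch below is one possible shape, under several assumptions: length is measured in characters rather than tokens, the task expects JSON output, and judge scores have already been collected into a hypothetical `judge_score` field by a separate step.

```python
import json

def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted list."""
    index = round(pct / 100 * (len(sorted_values) - 1))
    return sorted_values[index]

def filter_examples(examples, low_pct=2, high_pct=98, min_judge_score=4):
    """Apply length, format and judge filters to teacher completions.

    Each example is a dict with 'completion' (str) and 'judge_score' (int, 1-5).
    """
    lengths = sorted(len(ex["completion"]) for ex in examples)
    low, high = percentile(lengths, low_pct), percentile(lengths, high_pct)
    kept = []
    for ex in examples:
        if not low <= len(ex["completion"]) <= high:
            continue                      # length filter
        try:
            json.loads(ex["completion"])  # format filter
        except json.JSONDecodeError:
            continue
        if ex["judge_score"] < min_judge_score:
            continue                      # judge filter: drop scores of 3 or below
        kept.append(ex)
    return kept

examples = [
    {"completion": '{"a": 1}', "judge_score": 5},
    {"completion": '{"a": 2}', "judge_score": 3},   # fails the judge filter
    {"completion": "not json", "judge_score": 5},   # fails the format filter
    {"completion": '{"b": 2}', "judge_score": 4},
]
print(len(filter_examples(examples)), "of", len(examples), "kept")
```

Running the cheap programmatic filters first means the judge model only needs to score examples that could be kept anyway.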

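Write the final dataset in JSONL format with one example per line (the example below is pretty-printed for readability; in the file, each record sits on a single line). Axolotl natively supports the ShareGPT and Alpaca formats; the ShareGPT format is recommended for chat-tuned student models.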

{
  "conversations": [
    {
      "from": "human",
      "value": "Extract the invoice number, date and total from the following text:\n\n'Invoice #INV-20481 dated 8 June 2026, total due £1,240.00'"
    },
    {
      "from": "gpt",
      "value": "{\"invoice_number\": \"INV-20481\", \"date\": \"2026-06-08\", \"total\": 1240.00, \"currency\": \"GBP\"}"
    }
  ]
}
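A minimal sketch of writing records like the one above, one per line, using only the standard library. The function names and the output path are hypothetical; the field names follow the ShareGPT layout shown above.

```python
import json

def to_sharegpt_line(prompt, completion):
    """Serialise one (prompt, completion) pair as a single JSONL line."""
    record = {
        "conversations": [
            {"from": "human", "value": prompt},
            {"from": "gpt", "value": completion},
        ]
    }
    # ensure_ascii=False keeps characters such as the pound sign readable.
    return json.dumps(record, ensure_ascii=False)

def write_jsonl(pairs, path):
    """Write an iterable of (prompt, completion) pairs to a JSONL file."""
    with open(path, "w", encoding="utf-8") as handle:
        for prompt, completion in pairs:
            handle.write(to_sharegpt_line(prompt, completion) + "\n")

line = to_sharegpt_line("Extract the total:\n\n'total due £1,240.00'",
                        '{"total": 1240.00, "currency": "GBP"}')
print(line)
```

Because `json.dumps` escapes newlines inside strings, multi-line prompts still occupy exactly one line in the file.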

Training recipe: QLoRA on Axolotl

Axolotl is the recommended training framework for production distillation in 2026. Its YAML-driven configuration system, broad model support (Qwen, Gemma, Llama, Mistral, and more), and active maintenance (versions v0.28 and v0.29 shipped in early 2026 with FSDP2 integration and stable multimodal support) make it the lowest-friction path from dataset to trained adapter.

For a 7B–14B student on a task-specific dataset, QLoRA is the right training approach. It quantises the frozen base model weights to 4-bit during training (cutting weight memory by roughly 4× relative to FP16) while keeping the trainable LoRA adapter weights in higher precision (typically BF16). This makes it possible to train a 14B student on a single A100 80GB GPU, or a 7B student on an A10G 24GB GPU, hardware that is available on demand from any major cloud provider for £1.50–£2.50 per GPU-hour.

Axolotl configuration

Below is a production-ready Axolotl YAML configuration for distilling a Qwen 3 8B student. Adjust base_model, dataset paths and sequence length to your specific task.

base_model: Qwen/Qwen3-8B-Instruct
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer

load_in_4bit: true
strict: false

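# Dataset: ShareGPT format, JSONL
# Note: recent Axolotl releases favour `type: chat_template` for conversation
# data; check the dataset-format docs for your installed version.
datasets:
  - path: /data/teacher_outputs_filtered.jsonl
    type: sharegpt
    conversation: chatml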

dataset_prepared_path: /data/last_run_prepared
val_set_size: 0.02
output_dir: /output/qwen3-8b-distilled

# Sequence length — set to the 95th percentile of your data
sequence_len: 2048
sample_packing: true

# QLoRA adapter config
adapter: qlora
lora_r: 64
lora_alpha: 128
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj
  - k_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj

# Training hyperparameters
num_epochs: 3
micro_batch_size: 4
gradient_accumulation_steps: 4    # effective batch size = 16
learning_rate: 2e-4
lr_scheduler: cosine
warmup_ratio: 0.05
optimizer: paged_adamw_8bit
weight_decay: 0.01
max_grad_norm: 1.0

# Logging and evaluation
logging_steps: 25
eval_steps: 200
save_steps: 200
save_total_limit: 3
load_best_model_at_end: true

# Flash Attention 2 for memory efficiency
flash_attention: true

# Weights and Biases experiment tracking (optional but recommended)
wandb_project: distillation-qwen3-8b
wandb_run_id: run-001

Key hyperparameter choices

LoRA rank (lora_r): 64 is the recommended default for distillation tasks. Lower ranks (16–32) are sufficient for simple classification or extraction tasks. Higher ranks (128) are worth trying if your domain eval accuracy plateaus at 64 and you have enough training data. For more on weight-decomposed LoRA variants, see the DoRA fine-tuning explainer.

Learning rate: 2e-4 with a cosine schedule works well for QLoRA on most tasks. If you see training loss spikes or the model starts producing incoherent outputs mid-training, drop the learning rate to 1e-4 and increase the warmup ratio to 0.1.

Epochs: 3 epochs is a reasonable default. Monitor validation loss carefully: if it starts rising before 3 epochs complete, stop early. Distillation tasks with small datasets (under 10,000 examples) are particularly prone to overfitting at epoch 3.

Sample packing: enable this unless your task has very short inputs. It concatenates multiple training examples into a single sequence up to sequence_len, eliminating padding waste and significantly improving GPU utilisation.

Training a Qwen 3 8B model for 3 epochs on 50,000 examples takes approximately 4–6 hours on a single A100 80GB GPU at the configuration above. At spot instance pricing (approximately $1.80/hour on major clouds), the compute cost is roughly $7–$11 (about £6–£9), a rounding error in the total distillation budget.

Evaluation: domain accuracy, benchmarks, latency

A distilled model is only ready for production when it has passed a structured evaluation suite. The evaluation has three distinct layers, and all three must pass before you replace the teacher in production.

Layer 1: Domain task accuracy

The primary evaluation is accuracy on your actual task. Build a held-out test set of at least 500 examples — ideally 2,000 — that was not used in training or validation. For structured-output tasks, evaluate with exact-match or programmatic correctness checks. For generation tasks, use a combination of ROUGE scores and an LLM-as-judge evaluation. The critical metric is accuracy delta versus teacher: what percentage of the student's answers match the teacher's answer or are judged equal-or-better quality.

For production sign-off, a common threshold is 95% of teacher accuracy on domain tasks. If the student achieves 90%, it may still be acceptable if the cost saving is dramatic — but that is a business decision that should be made explicitly, not as a side-effect of not measuring carefully enough.
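The sign-off rule can be expressed as a small gate. This sketch assumes an exact-match task and uses toy labels; the function names and the data are hypothetical, and the 0.95 default mirrors the threshold mentioned above.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions that equal the reference exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must be the same length")
    matches = sum(p == r for p, r in zip(predictions, references))
    return matches / len(references)

def passes_signoff(student_acc, teacher_acc, min_ratio=0.95):
    """True if the student reaches min_ratio of the teacher's accuracy."""
    return student_acc >= min_ratio * teacher_acc

gold = ["A", "B", "C", "D"]
teacher_acc = exact_match_accuracy(["A", "B", "C", "D"], gold)
student_acc = exact_match_accuracy(["A", "B", "C", "X"], gold)
print(student_acc / teacher_acc, passes_signoff(student_acc, teacher_acc))
```

Making `min_ratio` an explicit parameter keeps the 90%-versus-95% trade-off a recorded decision rather than an accident of measurement.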

Layer 2: General capability regression check

Fine-tuning a model for a narrow task can cause catastrophic forgetting — degradation of general capabilities that were present in the base model. Always run MMLU and HumanEval (or a subset of each) before and after training to verify that general language understanding and code generation have not regressed significantly. A 2–3 point MMLU regression is typical and acceptable; a 10+ point regression indicates the training process has damaged general capability and you should investigate before deploying.

Run the evaluation on both the base model checkpoint and the final distilled checkpoint. Track the delta, not the absolute score — what matters is how much the training changed the model.
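Tracking the delta is a short comparison once both sets of scores exist. In this sketch the benchmark scores are invented for illustration and the 3-point default tolerance is an assumption based on the "typical and acceptable" range above; the function name is hypothetical.

```python
def capability_regressions(base_scores, distilled_scores, max_drop=3.0):
    """Return benchmarks whose score fell by more than max_drop points."""
    flagged = {}
    for name, base in base_scores.items():
        drop = base - distilled_scores[name]
        if drop > max_drop:
            flagged[name] = round(drop, 2)
    return flagged

# Hypothetical scores, for illustration only.
base = {"mmlu": 70.0, "humaneval": 60.0}
distilled = {"mmlu": 68.0, "humaneval": 48.5}
print(capability_regressions(base, distilled))
```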

Layer 3: Latency benchmarks

A cheaper model that is too slow creates a different class of production problem. Measure latency at the serving configuration you plan to use in production, under a realistic concurrency load. The metrics to track are:

Latency evaluation metrics for production distillation
Metric	What it measures	Typical target
Time to first token (TTFT) p50/p95	Responsiveness for interactive use cases	<200 ms p50 for single-GPU 8B model
Tokens per second (throughput)	Sustained generation speed	80–150 tok/s for 8B on A10G with vLLM
End-to-end latency p50/p95	Total round-trip for a typical request	Task-dependent — compare to teacher p50/p95
Cost per 1K output tokens	Economics at scale	Self-hosted cost per 1K tokens below the API price per 1K tokens

vLLM with PagedAttention is the recommended serving framework. On an A10G GPU (24 GB VRAM) serving a quantised Qwen 3 8B model, you can expect 80–120 tokens per second at typical prompt lengths, with TTFT under 200 ms at moderate concurrency. The throughput scales with GPU count if you tensor-parallelise, though for 7B–8B models a single GPU is usually the economically optimal configuration.
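Once you have per-request timings from a load test, the percentile metrics in the table can be computed with the standard library. This is a sketch: the function names are hypothetical, and it assumes you have already collected latencies in milliseconds from your own load generator.

```python
import statistics

def latency_summary(latencies_ms):
    """p50 and p95 of a list of per-request latencies in milliseconds."""
    # quantiles(n=100) returns the 99 cut points between percentiles 1..99.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94]}

def tokens_per_second(output_tokens, generation_seconds):
    """Sustained generation speed for one request or one test window."""
    return output_tokens / generation_seconds

# Synthetic latencies of 1..101 ms, purely to exercise the function.
print(latency_summary(list(range(1, 102))))
print(tokens_per_second(500, 5))
```

Report p95 alongside p50: a student that looks fast on median latency can still miss its target under concurrency.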

Run your eval suite as a CI step. Every time you retrain or update the adapter, the eval suite should run automatically and block deployment if domain accuracy drops below your threshold. Manual eval gates are consistently missed under deadline pressure. Instrument your training pipeline with Weights and Biases and wire the eval metrics to a deployment gate from the start.

Production deployment: vLLM, adapter merging, rollout

Getting the distilled model into production involves three decisions: how to serve it, whether to merge the adapter, and how to roll it out safely.

Serving with vLLM

vLLM is the production standard for self-hosted LLM serving in 2026. Its PagedAttention algorithm, continuous batching, and multi-GPU tensor parallelism deliver the throughput that production workloads require. For a single 8B distilled model, the serving command is straightforward:

# Serve a merged distilled model with vLLM
vllm serve /output/qwen3-8b-distilled-merged \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 4096 \
  --served-model-name qwen3-8b-distilled

# Or serve with an unmerged LoRA adapter (for multi-adapter setups)
# --max-lora-rank must cover the adapter's rank (lora_r: 64 in the config above)
vllm serve Qwen/Qwen3-8B-Instruct \
  --enable-lora \
  --lora-modules distilled=/output/qwen3-8b-distilled-adapter \
  --max-loras 4 \
  --max-lora-rank 64 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.85

vLLM's Multi-LoRA feature allows you to serve a single base model with many adapter variants (the number active in a batch is capped by --max-loras), each adding only megabytes of memory overhead rather than the gigabytes a separate model deployment would require. This is particularly useful if you are distilling several task-specific students from the same base: serve the base once, attach adapters on demand.

Adapter merging vs keeping adapters separate

After QLoRA training, you have two choices: merge the adapter weights permanently into the base model, or keep the adapter separate and load it dynamically at inference time.

Merge the adapter when: you have a single, stable distilled model that will not change frequently; you want the fastest possible inference (no adapter overhead); or you plan to re-quantise the merged model to INT4 or INT8 for further inference cost reduction. Merging is irreversible, so always keep the unmerged checkpoint.

Keep adapters separate when: you are running A/B tests between adapter versions; you are serving multiple task-specific variants from one base model; or you expect to retrain and swap the adapter frequently as your task distribution evolves.

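# Merge adapter into base model using PEFT
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model in 16-bit, not 4-bit: the adapter is merged into
# unquantised weights, and the result can be re-quantised afterwards.
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Attach the trained LoRA adapter, then fold its weights into the base model.
model = PeftModel.from_pretrained(base_model, "/output/qwen3-8b-distilled-adapter")
merged = model.merge_and_unload()
merged.save_pretrained("/output/qwen3-8b-distilled-merged")

# Save the tokenizer alongside so the merged directory is self-contained.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B-Instruct")
tokenizer.save_pretrained("/output/qwen3-8b-distilled-merged")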

Rollout strategy

Never replace the teacher entirely on day one. A phased rollout with a shadow evaluation period is the standard production practice, and the teams that skip it are the ones that get paged at 2 am.

The recommended rollout sequence is: shadow mode first (student runs in parallel with teacher, results logged but not served to users); then 5% of live traffic for 48 hours; then 25%, then 50%, then 100%, with at least 24 hours at each stage. At each stage, compare the student's answers to the teacher's on the real production traffic using your LLM-as-judge evaluator. Maintain a kill switch — a feature flag or routing rule — that instantly reverts all traffic to the teacher if quality metrics drop.
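One common way to implement the percentage stages is deterministic hash bucketing, so a given request id always lands in the same bucket and raising the percentage only ever adds traffic to the student. The sketch below assumes string request ids; the names are hypothetical, and stage 0 stands in for shadow mode, where the teacher still serves every response.

```python
import hashlib

ROLLOUT_STAGES = [0, 5, 25, 50, 100]  # percent of live traffic on the student

def bucket(request_id):
    """Map a request id to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100

def choose_model(request_id, student_percent, kill_switch=False):
    """Route to 'student' or 'teacher'; the kill switch overrides everything."""
    if kill_switch:
        return "teacher"
    return "student" if bucket(request_id) < student_percent else "teacher"

ids = [f"req-{i}" for i in range(1000)]
for stage in ROLLOUT_STAGES:
    share = sum(choose_model(i, stage) == "student" for i in ids) / len(ids)
    print(f"stage {stage:>3}%: observed student share {share:.1%}")
```

In production the stage value and the kill switch would come from a feature-flag service rather than constants.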

For agent observability during rollout, integrating OpenTelemetry traces on the student endpoint is strongly recommended — see the observability guide for the full tracing setup.

The break-even maths: when distillation ROI justifies the investment

Distillation has a real upfront cost: engineer time to build the data pipeline, compute for training, and ongoing maintenance when the task distribution drifts. The maths only works if the ongoing inference saving outpaces that upfront investment within a reasonable payback period. Here is a worked example.

Inputs to the calculation

Break-even model inputs (worked example — pricing as of June 2026; verify current rates before committing)
Variable	Example value	Notes
Monthly output tokens (teacher)	500,000,000 (500M)	From billing dashboard
Teacher API cost	$3.00 per 1M output tokens	Mid-tier frontier model pricing
Monthly teacher cost	$1,500 / month (approx £1,200)	500M × $3 ÷ 1M
Self-hosted GPU cost (A10G, on-demand)	$0.75/hour ≈ £0.60/hour	Single A10G; illustrative on-demand rate
Student throughput (vLLM, A10G)	100 tokens/second sustained	8B model, typical concurrency
Student output tokens per month	500M (same workload)	Assuming full replacement
GPU-hours to serve 500M tokens	~1,388 hours	500M ÷ (100 × 3,600); about two A10Gs running continuously
Monthly self-hosted inference cost	$1,041 / month (approx £830)	1,388 × $0.75
Monthly saving	$459 / month (approx £370)	$1,500 − $1,041

At 500M output tokens per month on this teacher, the monthly saving is modest — roughly £370. The upfront distillation cost — a week of engineer time (£4,000 at senior rates) plus £15 in training compute — means payback takes approximately 11 months. That is a borderline case: reasonable for a long-lived feature, questionable for something experimental.

Now scale the volume. At 2 billion output tokens per month, the monthly teacher cost is £4,800. At the same 100 tokens/second per GPU, self-hosted inference scales roughly linearly: about 5,556 GPU-hours, or seven to eight A10Gs running around the clock, bringing the self-hosted monthly cost to approximately £3,300. The monthly saving is now about £1,500, and the same £4,000 upfront investment pays back in under three months.
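The worked example can be reproduced with a short function, which makes it easy to substitute your own volumes and prices. This is a sketch under stated assumptions: the 0.80 USD-to-GBP rate is inferred from the table ($1,500 ≈ £1,200), the same per-GPU throughput is applied at every volume, and retraining costs are left out. The function name is hypothetical.

```python
def break_even(monthly_tokens, teacher_usd_per_million, gpu_usd_per_hour,
               tokens_per_second, upfront_gbp, usd_to_gbp=0.80):
    """Monthly saving and payback period for replacing a teacher API."""
    teacher_usd = monthly_tokens / 1_000_000 * teacher_usd_per_million
    gpu_hours = monthly_tokens / (tokens_per_second * 3_600)
    self_hosted_usd = gpu_hours * gpu_usd_per_hour
    saving_gbp = (teacher_usd - self_hosted_usd) * usd_to_gbp
    return {
        "gpu_hours": gpu_hours,
        "saving_gbp": saving_gbp,
        # Only meaningful when saving_gbp is positive.
        "payback_months": upfront_gbp / saving_gbp,
    }

for volume in (500_000_000, 2_000_000_000):
    result = break_even(volume, 3.00, 0.75, 100, upfront_gbp=4_015)
    print(volume, {k: round(v, 1) for k, v in result.items()})
```

The result is most sensitive to the teacher price and the sustained tokens per second, so those are the two inputs worth measuring rather than estimating.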

The volume thresholds that make distillation worthwhile

The numbers above translate into a practical rule of thumb:

Under 100M tokens/month: distillation rarely beats caching + routing on ROI. The engineering effort is the same, but the bill is small enough that caching and routing close most of the gap. Stick with the routing and caching playbook until volume grows.
100M–500M tokens/month: distillation is viable but only if the task is narrow and stable enough that the student will not need frequent retraining. Payback is typically 6–18 months.
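500M+ tokens/month: distillation has a positive ROI for a stable narrow task, and payback shortens quickly with volume: roughly 11 months at 500M tokens against a $3/M teacher in the worked example above, under three months at 2 billion tokens, and faster still against a pricier teacher. At this volume, it is worth investing in a proper distillation pipeline with automated retraining when distribution drift is detected.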

Do not forget retraining costs in your model. If your task distribution shifts quarterly — new product lines, seasonal behaviour, regulatory changes — the teacher needs to re-generate outputs and the student needs to be retrained. Build that into your TCO model as 15–25% of the upfront cost per quarter. A distribution that changes monthly makes distillation significantly less attractive than routing, which adapts without retraining.

Cost per 1K output tokens: before and after

Cost per 1,000 output tokens: teacher API vs distilled student (as of June 2026)
Configuration	Cost per 1K output tokens	At 500M tokens/month
GPT-4o (frontier teacher)	$0.010 per 1K	$5,000 / month
Mid-tier frontier (e.g. Claude 3 Haiku, Gemini Flash)	$0.003 per 1K	$1,500 / month
Distilled 8B student, A10G on-demand	$0.0021 per 1K	$1,041 / month
Distilled 8B student, A10G reserved 1-year	$0.0012 per 1K	$580 / month

The economics improve significantly if you can commit to reserved GPU capacity. A one-year reserved A10G instance reduces the per-hour cost by roughly 40% compared to on-demand, bringing the cost per 1K tokens below $0.0013. At the prices in the table, that is about a 60% reduction versus a mid-tier frontier API, and close to 90% versus GPT-4o, for a task where the student achieves parity.
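The monthly figures and percentage reductions follow directly from the per-1K prices. The sketch below recomputes them from the table's rounded rates; the dictionary keys and function names are hypothetical labels, and because the rates are rounded, the student rows come out slightly above the table's $1,041 and $580.

```python
# Per-1K output-token prices in USD, copied from the table (rounded).
PRICE_PER_1K_USD = {
    "gpt-4o": 0.010,
    "mid-tier": 0.003,
    "student-on-demand": 0.0021,
    "student-reserved": 0.0012,
}


def monthly_cost(price_per_1k: float, tokens_per_month: int) -> float:
    """Monthly bill in USD for a given output-token volume."""
    return price_per_1k * tokens_per_month / 1_000


def reduction_pct(baseline: float, candidate: float) -> float:
    """Percentage saved by moving from the baseline price to the candidate."""
    return 100 * (1 - candidate / baseline)


TOKENS = 500_000_000
for name, price in PRICE_PER_1K_USD.items():
    print(f"{name:<18} ${monthly_cost(price, TOKENS):>8,.0f} / month")

reserved = PRICE_PER_1K_USD["student-reserved"]
vs_mid_tier = reduction_pct(PRICE_PER_1K_USD["mid-tier"], reserved)
vs_gpt4o = reduction_pct(PRICE_PER_1K_USD["gpt-4o"], reserved)
print(f"reserved student vs mid-tier: {vs_mid_tier:.0f}% cheaper")
print(f"reserved student vs GPT-4o:   {vs_gpt4o:.0f}% cheaper")
```

On these rounded prices the reserved student is about 60% cheaper than the mid-tier API and about 88% cheaper than GPT-4o.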

Failure modes and how to avoid them

The distillation projects that fail in production almost always fail for one of four reasons. Knowing them in advance is the difference between a smooth rollout and an expensive rollback.

1. Distribution mismatch

The student is trained on one distribution of teacher outputs but encounters a different distribution in production. This typically happens when the training data was sourced from a period that is no longer representative — the product changed, the user base shifted, or a new category of input appeared. The failure mode is a sudden drop in student accuracy that the training metrics gave no warning of, because the training set looked fine.

Mitigate this by: monitoring the input distribution in production continuously (embedding distance drift is a reliable signal); scheduling periodic retraining on recent production logs; and keeping the teacher in the serving stack as a fallback that activates when the student's confidence scores drop below a threshold.
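A minimal sketch of two of these mitigations, drift monitoring and the teacher fallback, might look like the following. Everything here is an assumption for illustration: embeddings are taken as precomputed by whatever embedding model you already run, drift is measured as the cosine distance between the centroids of a reference window and a recent window (one of several possible drift measures), and the 0.8 confidence threshold and all function names are hypothetical.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def centroid(vectors: Sequence[Vector]) -> list[float]:
    """Component-wise mean of a batch of embeddings."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity; 0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cannot compare a zero-length embedding")
    return 1.0 - dot / norm


def drift_score(reference: Sequence[Vector], recent: Sequence[Vector]) -> float:
    """Distance between the training-time and recent input centroids."""
    return cosine_distance(centroid(reference), centroid(recent))


def choose_model(student_confidence: float, min_confidence: float = 0.8) -> str:
    """Fall back to the teacher when the student is not confident enough."""
    return "student" if student_confidence >= min_confidence else "teacher"


# Toy 2-D embeddings: the recent window has moved to a different region.
reference_window = [[1.0, 0.0], [0.9, 0.1]]
recent_window = [[0.0, 1.0], [0.1, 0.9]]

print(f"drift score: {drift_score(reference_window, recent_window):.3f}")
print(choose_model(0.92), choose_model(0.41))
```

In practice the alert threshold on the drift score has to be calibrated against your own historical windows, and the confidence signal (for example mean token probability) depends on what your serving stack exposes.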

2. Student collapse on edge cases

The student learns to perform well on the common case but fails catastrophically on the tail of the input distribution — unusual phrasings, long inputs near the sequence length limit, inputs that contain adversarial patterns or sensitive content. Average accuracy looks fine; edge cases are silently broken.

The mitigation is building a hand-curated edge-case eval set at training time — before you have seen any student failure — and treating it as a hard gate in your evaluation pipeline. Include: the longest inputs in your distribution, the rarest domain sub-topics, adversarially rephrased versions of common inputs, and any known failure modes from the teacher's production history.
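One way to make the edge-case set a hard gate is to require a minimum pass rate in every category, so that a strong average cannot hide a broken tail. The sketch below is an assumption-laden illustration: `EdgeCase`, `edge_case_gate` and `toy_student` are hypothetical names, correctness is judged by exact string match (a real pipeline may need a task-specific scorer), and the toy student stands in for a call to your model.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class EdgeCase:
    category: str  # e.g. "long_input", "rare_subtopic", "adversarial_rephrase"
    prompt: str
    expected: str


def edge_case_gate(
    cases: Iterable[EdgeCase],
    predict: Callable[[str], str],
    min_pass_rate: float = 1.0,
) -> tuple[bool, dict[str, float]]:
    """Return (gate_passed, per-category pass rates)."""
    passed: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for case in cases:
        total[case.category] += 1
        # Assumption: exact match after trimming whitespace counts as a pass.
        if predict(case.prompt).strip() == case.expected.strip():
            passed[case.category] += 1
    rates = {category: passed[category] / total[category] for category in total}
    gate_passed = bool(rates) and all(r >= min_pass_rate for r in rates.values())
    return gate_passed, rates


def toy_student(prompt: str) -> str:
    """Stand-in for the distilled model: a naive keyword classifier."""
    return "refund" if "money back" in prompt.lower() else "other"


cases = [
    EdgeCase("adversarial_rephrase", "I want my MONEY BACK now", "refund"),
    EdgeCase("adversarial_rephrase", "Return the funds, please", "refund"),
    EdgeCase("rare_subtopic", "Where is your office?", "other"),
]

gate_ok, pass_rates = edge_case_gate(cases, toy_student)
print(gate_ok, pass_rates)
```

Here the toy student misses the rephrased refund request, so the gate fails on the adversarial category even though two of the three cases pass overall.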

3. Teacher-style overfitting

With a very small training set (under 2,000 examples), the student can overfit to the teacher's idiosyncratic stylistic patterns rather than learning the underlying task structure. The outputs look like the teacher's outputs even when the task would be better served by a different style. Perplexity on the validation set looks acceptable, but human raters or your LLM judge rate the outputs as sounding robotic or over-formatted.

The fix is more diverse training data. Vary prompt phrasing across the training set, include examples from multiple time periods and contexts, and if possible include some negative examples (inputs where a correct response differs from the teacher's default style). If the dataset is genuinely constrained in size, reduce the number of epochs and increase LoRA dropout to 0.1.

4. Serving infrastructure debt

A distilled model that works in evaluation but is never properly operationalised — no monitoring, no rollback procedure, no version control on the adapter — becomes a maintenance liability within a few months. The team that shipped it moves on, the model quietly degrades as the task distribution drifts, and nobody notices until a business metric drops.

Treat your distilled model with the same rigour as any production service: version control all artifacts (base model hash, adapter checkpoint, training config, dataset hash), instrument the serving endpoint with latency and accuracy metrics, and schedule a quarterly distribution drift check as a calendar event from day one.
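Versioning the artifacts can start as a small manifest written next to each deployment. The following is a sketch using only the Python standard library; the manifest field names, the `build_manifest` helper and the file names in the demo are hypothetical, and the base model is identified by a caller-supplied revision string rather than by hashing its weights.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large checkpoints do not need to fit in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(
    base_model_revision: str, adapter: Path, config: Path, dataset: Path
) -> dict[str, str]:
    """Collect the identifiers needed to reproduce or roll back a release."""
    return {
        "base_model_revision": base_model_revision,
        "adapter_sha256": sha256_of_file(adapter),
        "training_config_sha256": sha256_of_file(config),
        "dataset_sha256": sha256_of_file(dataset),
    }


# Demo with throwaway files; real paths would point at your artifacts.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "adapter.bin").write_bytes(b"adapter weights")
    (root / "config.yaml").write_bytes(b"lora_r: 64\n")
    (root / "train.jsonl").write_bytes(b'{"prompt": "hi", "completion": "hello"}\n')
    manifest = build_manifest(
        "example-revision",
        root / "adapter.bin",
        root / "config.yaml",
        root / "train.jsonl",
    )

print(json.dumps(manifest, indent=2))
```

Committing this manifest alongside the deployment config gives a rollback target that does not depend on anyone remembering which checkpoint was shipped.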

Quick-start checklist

The full distillation pipeline, compressed into a checklist for teams starting their first project:

Verify the business case: is monthly token volume above 500M on this specific task?
Define the task boundary precisely — narrow scope is a prerequisite for success.
Pull a representative sample of real production inputs (min 10,000 examples).
Deduplicate and cluster the inputs to verify distribution coverage.
Generate teacher completions (temperature 0 for deterministic tasks, CoT for reasoning).
Apply quality filters: length, format validation, and LLM-as-judge scoring.
Choose a student base model — default to Qwen 3 8B unless there is a specific reason otherwise.
Configure Axolotl YAML and run QLoRA training (3 epochs, lora_r 64, lr 2e-4).
Evaluate: domain accuracy delta vs teacher, MMLU/HumanEval regression check, latency benchmarks.
Gate: domain accuracy must be within 5% of teacher, no more than 3-point MMLU regression.
Decide: merge adapter or keep separate for Multi-LoRA serving.
Deploy via vLLM with shadow mode first, then phased traffic rollout.
Instrument: OpenTelemetry traces, accuracy monitoring, input distribution drift detection.
Schedule quarterly retraining review.
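The gate step in the checklist above can be encoded as a single function so that it runs automatically in the evaluation pipeline. This is a sketch under two labelled assumptions: "within 5% of teacher" is read as a relative gap (the student must reach at least 95% of the teacher's domain accuracy), and the MMLU regression is measured against the student's base model before fine-tuning. The function name `passes_release_gate` and the example scores are hypothetical.

```python
def passes_release_gate(
    student_domain_acc: float,
    teacher_domain_acc: float,
    mmlu_before: float,
    mmlu_after: float,
    max_relative_gap: float = 0.05,
    max_mmlu_drop_points: float = 3.0,
) -> bool:
    """Apply the checklist gate: domain accuracy and general-ability regression."""
    # Assumption: the 5% tolerance is relative to the teacher's accuracy.
    domain_ok = student_domain_acc >= teacher_domain_acc * (1 - max_relative_gap)
    # Assumption: regression is base-model MMLU minus fine-tuned MMLU, in points.
    mmlu_ok = (mmlu_before - mmlu_after) <= max_mmlu_drop_points
    return domain_ok and mmlu_ok


# Illustrative numbers only, not measured results.
print(passes_release_gate(0.90, 0.93, mmlu_before=70.0, mmlu_after=68.5))
print(passes_release_gate(0.85, 0.93, mmlu_before=70.0, mmlu_after=68.5))
print(passes_release_gate(0.90, 0.93, mmlu_before=70.0, mmlu_after=66.0))
```

If your team reads the 5% tolerance as absolute percentage points instead, replace the domain check with a simple difference; the two readings diverge more as teacher accuracy falls.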

LLM Distillation in Production: Shrink Your Model, Keep the Quality