Why distillation exists: the three-way cost trade-off
By mid-2026, every team running LLMs at meaningful scale has at least three tools for managing inference costs: caching, routing, and distillation. They are not interchangeable. Each targets a different root cause of overspend, and choosing the wrong one for your workload is how projects deliver disappointing ROI despite months of engineering effort.
Caching targets repetition. If the same prompt prefix, the same retrieved context, or the same answer appears repeatedly, caching stops you paying full inference cost on the repeated portion. It is the safest lever — done correctly, it reduces cost without altering the model or the answer. The limit is that it only helps workloads with structural repetition. A high-variety task — where every input is genuinely distinct — gets almost nothing from caching. For the full caching and routing playbook, see our companion guide on routing and semantic caching.
Routing targets variety. It classifies each incoming query by complexity and sends cheap traffic to a budget model and hard traffic to a frontier model. It is excellent when your workload has a wide spread of difficulty — perhaps 60% simple queries and 40% hard ones — because it means you are not spending frontier money on simple work. The limit is that you still need the frontier model for the hard 40%, and if that fraction is large or unpredictable, routing does not eliminate your frontier bill; it merely reduces it.
Distillation targets the narrow-workload case. If your application does one thing repeatedly — classifies support tickets, extracts structured data from a fixed schema, generates SQL from natural language — and the distribution of that task is stable, distillation can train a small specialist model to replace the frontier model entirely for that workload. The savings are complete: once the student is deployed, the frontier API is no longer called for that use case at all.
The decision tree is therefore: start with caching if you have repetition; add routing if you have mixed-difficulty traffic; invest in distillation only if you have a narrow, high-volume, stable-distribution workload where a 7B–14B model can realistically match frontier quality on the specific task. If you are still deciding whether fine-tuning is right at all, the fine-tuning decision ladder is a useful prerequisite read.
| Method | Best condition | Worst condition | Infrastructure change |
|---|---|---|---|
| Caching | High prompt/response repetition | Every input is unique | Minimal — add cache layer |
| Routing | Wide difficulty spread in traffic | Uniform difficulty — all hard or all easy | Moderate — add classifier + model registry |
| Distillation | Narrow task, high volume, stable distribution | Broad multi-task workload, mixed distribution | High upfront — train + serve new model |
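The decision tree above can be written down as a few lines of Python. This is only an illustrative sketch: the function name and the three boolean inputs are assumptions made here, and in practice each flag would come from measuring your own traffic.

```python
def recommend_levers(has_repetition: bool,
                     mixed_difficulty: bool,
                     narrow_stable_high_volume: bool) -> list[str]:
    """Return the cost levers suggested by the decision tree, in order."""
    levers = []
    if has_repetition:
        levers.append("caching")        # safest lever, add first
    if mixed_difficulty:
        levers.append("routing")        # cheap model for easy traffic
    if narrow_stable_high_volume:
        levers.append("distillation")   # only for narrow, stable, high-volume tasks
    return levers


# Example: a ticket classifier with repeated prompt prefixes and a stable task.
print(recommend_levers(True, False, True))
```

The levers are additive rather than exclusive, which is why the function returns a list instead of a single choice.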
Two distillation approaches: outputs vs features
Once you have decided that distillation is the right lever, there is a second decision: what exactly does the student learn from? There are two fundamentally different approaches, and they have different tooling requirements, quality ceilings and practical complexity.
Approach 1: SFT on teacher outputs (black-box distillation)
The most widely deployed approach in production in 2026 is also the simplest conceptually. You use the teacher model as a data generator: run inference on your task dataset, collect the teacher's completions, filter for quality, and then train the student on those (prompt, completion) pairs using standard supervised fine-tuning. The teacher is a black box — you only need its text outputs, not its weights or internal states.
This approach works with any teacher, including closed-weights API models. DeepSeek distilled reasoning capabilities from DeepSeek-R1 into smaller Qwen and Llama-based students using this technique. Meta distilled Llama 4 Scout and Maverick from the much larger Behemoth model, and Google distilled from larger teachers during Gemma 2 and Gemma 3 development, although the published recipes for those models also draw on the teacher's token probabilities rather than text outputs alone. Distilling from a larger teacher is industry-standard.
The practical tooling is straightforward: generate outputs with the teacher API or local inference, write them to a JSONL file in the correct chat format, and train the student with Axolotl or TRL. The quality ceiling is moderate — the student learns to reproduce teacher outputs, but it does not learn the teacher's uncertainty or the probability mass the teacher placed on alternative tokens.
Approach 2: Feature-level / on-policy distillation (white-box)
Feature-level distillation — sometimes called on-policy distillation (OPD) or white-box distillation — trains the student to match the teacher's token-level probability distributions, not just its final text. Instead of treating the teacher's output as ground truth, the student minimises a divergence (typically forward or reverse KL divergence) between its predicted distribution and the teacher's at each token position.
This is a richer training signal. The teacher's distribution encodes uncertainty: when the teacher assigns 0.4 probability to token A and 0.3 to token B, the student learns that both are plausible, rather than treating the sampled token A as the only correct answer. This leads to a higher quality ceiling, particularly for reasoning tasks. TRL's GKDTrainer implements Generalised Knowledge Distillation, which adds on-policy sampling so the student is trained on sequences it actually generates — correcting the distribution mismatch that plagued earlier KL-distillation approaches.
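A small worked example makes the difference between the two signals concrete. The sketch below uses the 0.4 / 0.3 teacher probabilities from the paragraph above; the third token, the student distribution and the function name are assumptions added for illustration, and a real trainer computes these quantities over the full vocabulary at every token position.

```python
import math


def kl_divergence(p: dict[str, float], q: dict[str, float]) -> float:
    """KL(p || q) in nats; assumes q[t] > 0 wherever p[t] > 0."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p if p[t] > 0)


teacher = {"A": 0.4, "B": 0.3, "C": 0.3}   # C is an assumed remainder
student = {"A": 0.8, "B": 0.1, "C": 0.1}   # an over-confident student

forward_kl = kl_divergence(teacher, student)   # KL(teacher || student)
reverse_kl = kl_divergence(student, teacher)   # KL(student || teacher)

# Hard-label training only sees the sampled token ("A" here), so it
# rewards the student for piling even more probability onto A.
hard_label_loss = -math.log(student["A"])

print(f"forward KL = {forward_kl:.3f}")
print(f"reverse KL = {reverse_kl:.3f}")
print(f"hard-label loss = {hard_label_loss:.3f}")
```

Both divergences are zero only when the student reproduces the teacher's whole distribution, whereas the hard-label loss keeps falling as the student becomes more confident in the sampled token.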
The catch is access: you need the teacher's logits, which means you need the teacher's weights running simultaneously during training, or a compatible remote teacher server. You also need the teacher and student to share the same vocabulary (or a compatible mapping), which rules out pairings across model families with different tokenizers. TRL's 2026 DistillationTrainer addresses the throughput problem with a generation buffer that decouples training batch size from generation batch size, delivering up to 40× speedup over naive implementations.
| Factor | SFT on outputs | Feature-level (GKD/OPD) |
|---|---|---|
| Teacher access required | Text outputs only (API is fine) | Full weights + matching vocabulary |
| Quality ceiling | Moderate — reproduces teacher behaviour | Higher — matches teacher uncertainty |
| Training complexity | Low — standard SFT pipeline | High — concurrent teacher inference needed |
| Recommended for | Most production cases; closed teacher APIs | Reasoning tasks; open-weight teacher families |
| Primary tooling | Axolotl, TRL SFTTrainer, LLM Foundry | TRL GKDTrainer / DistillationTrainer |
Choosing your student model: Llama 4, Gemma 4, Qwen 3
Selecting the right base model for your student is as important as the training recipe. In 2026 the open-weight landscape has consolidated around three dominant families: Meta's Llama 4, Google's Gemma 4, and Alibaba's Qwen 3. Each has distinct characteristics that affect both training cost and inference economics.
The general principle is to choose the smallest model that can realistically be trained to your target quality. A model that is twice as large requires roughly twice the VRAM, roughly twice the GPU hours to train, and at least twice the inference cost per token to serve. The right starting point for most production distillation projects in 2026 is the 7B–14B range.
| Model | Active params | Min VRAM (FP16) | Single-GPU serving | Best for |
|---|---|---|---|---|
| Qwen 3 1.7B | 1.7B dense | ~4 GB | Yes (any consumer GPU) | Highest-volume, narrowest tasks; latency-critical edge deployments |
| Qwen 3 8B | 8B dense | ~16 GB | Yes (A10G, L4) | Recommended default; multilingual; strong reasoning at the 8B scale |
| Gemma 4 12B | 12B dense | ~24 GB | Yes (L40S, A100 40GB) | Reasoning-heavy tasks; best quality ceiling in the sub-14B range |
| Qwen 3 14B | 14B dense | ~28 GB | Yes (A100 40GB) with quantisation | Step up from 8B if domain eval gap requires it |
| Llama 4 Scout | 17B active / 109B total MoE | ~218 GB | No, multi-GPU required | Not recommended as a student; use as teacher only |
Llama 4 Scout's MoE architecture produces attractive benchmark numbers but makes it a poor student model: all 109B parameters must be resident in memory even though only 17B are active per forward pass, so serving requires multiple GPUs, which eliminates the inference cost savings that motivated distillation in the first place. Use Scout or Maverick as a teacher, not a student.
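The memory figures for model weights follow from simple arithmetic: total parameters multiplied by bytes per parameter. The helper below is a back-of-envelope sketch (the function name is an assumption); it covers weights only and ignores KV cache, activations and framework overhead, so real requirements are higher.

```python
def weight_memory_gb(total_params_billion: float, bytes_per_param: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return total_params_billion * bytes_per_param


for name, params_b in [("8B dense", 8), ("14B dense", 14), ("109B total (MoE)", 109)]:
    fp16 = weight_memory_gb(params_b, 2.0)   # 16-bit weights: 2 bytes each
    int4 = weight_memory_gb(params_b, 0.5)   # 4-bit weights: half a byte each
    print(f"{name}: ~{fp16:.0f} GB at 16-bit, ~{int4:.1f} GB at 4-bit")
```

For a mixture-of-experts model the total parameter count, not the active count, is what has to fit in memory.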
For multilingual workloads — relevant for both Indian-language applications and UK teams processing non-English content — Qwen 3 8B is the recommended default. Its training corpus has significantly better coverage of Hindi, Bengali, Tamil and other regional languages compared to the Llama and Gemma families at the same scale. For the self-hosting infrastructure decision that underpins your serving choice, see the self-host vs API inference guide.
Generating and filtering teacher outputs: quality over quantity
The quality of your teacher-generated dataset is the single largest determinant of distillation success. A student trained on 5,000 high-quality examples will often outperform one trained on 50,000 noisy ones. The generation and filtering stage deserves as much engineering attention as the training itself.
Step 1: Assemble your task dataset
Start with your real production inputs. Pull a representative sample from your request logs: at minimum 10,000 examples, ideally 50,000 or more for non-trivial tasks. Deduplicate aggressively: near-duplicate inputs in the training set are waste at best and a source of overfitting at worst. Cluster the inputs and verify that your sample covers the full input distribution, including the tail cases your model handles least well in production. Those edge cases are disproportionately valuable in training data.
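A first deduplication pass can be as simple as hashing a normalised form of each input. The sketch below is a minimal version under that assumption; the function names are illustrative, and it only catches inputs that become identical after lowercasing and whitespace collapsing. True near-duplicate detection would need something like MinHash or embedding similarity on top.

```python
import hashlib
import re


def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different inputs collide."""
    return re.sub(r"\s+", " ", text.strip().lower())


def deduplicate(inputs: list[str]) -> list[str]:
    """Keep the first occurrence of each normalised input, preserving order."""
    seen: set[str] = set()
    unique = []
    for text in inputs:
        key = hashlib.sha256(normalise(text).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


logs = ["Reset my password", "reset  my password ", "Cancel my order"]
print(deduplicate(logs))
```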
If you do not yet have production logs (you are building a new capability), you can generate synthetic inputs with a frontier model. Prompt the teacher to produce varied examples across the full expected distribution: different lengths, different complexity levels, different domain sub-topics. Synthetic input generation is slower and more expensive than mining logs, but it gives you control over distribution that log mining cannot.
Step 2: Generate teacher completions
Run each input through the teacher to produce a completion. For SFT-on-outputs distillation, use temperature 0 for deterministic, high-confidence outputs on tasks with a single correct answer (extraction, classification, structured generation). Use temperature 0.7–1.0 if you want the student to learn the teacher's natural variation (summarisation, open-ended generation). For reasoning tasks, chain-of-thought completions significantly outperform direct-answer completions as training targets — the student learns the reasoning path, not just the conclusion.
Budget the generation cost carefully. If your teacher is a frontier API model at $3 per million output tokens and you are generating 50,000 completions at an average of 400 tokens each, the generation cost is $60. That is negligible relative to the engineering time the project requires. For 500,000 completions at 1,000 tokens, the cost is $1,500: still modest, but worth modelling before you start.
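The generation budget is easy to model before any API call is made. The helper below reproduces the two figures above; the function name is an assumption, and it counts output tokens only, so input-token charges would come on top.

```python
def generation_cost_usd(num_completions: int,
                        avg_output_tokens: int,
                        usd_per_million_output_tokens: float) -> float:
    """Output-token cost of generating a teacher dataset."""
    total_tokens = num_completions * avg_output_tokens
    return total_tokens / 1_000_000 * usd_per_million_output_tokens


print(generation_cost_usd(50_000, 400, 3.0))      # the 50,000-completion case
print(generation_cost_usd(500_000, 1_000, 3.0))   # the 500,000-completion case
```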
Step 3: Quality filter
Not all teacher outputs are equally good. Apply at minimum three filters before writing the dataset to disk.
Length filter: discard completions that are suspiciously short (likely refusals or degenerate outputs) or abnormally long (likely hallucination spirals). Set thresholds at the 2nd and 98th percentile of your expected output length distribution.
Format filter: for structured-output tasks (JSON extraction, code generation), programmatically validate that the output is parseable. A teacher that occasionally produces malformed JSON will train the student to do the same. Discard any completion that fails validation.
LLM-as-judge filter: for tasks where correctness is harder to verify programmatically (reasoning, summarisation, QA), run a second frontier model as a judge. Prompt it to score each (input, completion) pair on a 1–5 scale and discard pairs scoring 3 or below. This adds cost — roughly the same as the generation cost — but is the most reliable quality gate for subjective tasks.
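The three filters can be chained into a single keep-or-discard decision. The sketch below makes several assumptions: length is measured in characters rather than tokens, the judge score is supplied by the caller (the judge call itself is not shown), and all function names are illustrative.

```python
import json


def percentile(sorted_values: list[int], pct: float) -> int:
    """Nearest-rank percentile on an already sorted list."""
    index = round(pct / 100 * (len(sorted_values) - 1))
    return sorted_values[index]


def length_bounds(completions: list[str],
                  low_pct: float = 2, high_pct: float = 98) -> tuple[int, int]:
    """Character-length thresholds at the given percentiles."""
    lengths = sorted(len(c) for c in completions)
    return percentile(lengths, low_pct), percentile(lengths, high_pct)


def is_valid_json(completion: str) -> bool:
    """Format filter for structured-output tasks."""
    try:
        json.loads(completion)
        return True
    except json.JSONDecodeError:
        return False


def keep_example(completion: str, bounds: tuple[int, int],
                 judge_score: int, require_json: bool = True) -> bool:
    """Apply the length, format and judge filters in order."""
    low, high = bounds
    if not low <= len(completion) <= high:
        return False
    if require_json and not is_valid_json(completion):
        return False
    return judge_score >= 4   # discard pairs scoring 3 or below


bounds = (5, 200)   # in practice, derive this with length_bounds(...)
print(keep_example('{"invoice_number": "INV-20481"}', bounds, judge_score=5))
```

Running the cheap programmatic filters first means the judge model is only paid for on examples that have already passed them.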
Write the final dataset in JSONL format with one example per line. Axolotl natively supports the ShareGPT and Alpaca formats; the ShareGPT format is recommended for chat-tuned student models.
{
  "conversations": [
    {
      "from": "human",
      "value": "Extract the invoice number, date and total from the following text:\n\n'Invoice #INV-20481 dated 8 June 2026, total due £1,240.00'"
    },
    {
      "from": "gpt",
      "value": "{\"invoice_number\": \"INV-20481\", \"date\": \"2026-06-08\", \"total\": 1240.00, \"currency\": \"GBP\"}"
    }
  ]
}
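The record above is pretty-printed for readability; in the JSONL file each record occupies exactly one line. A minimal writer might look like the sketch below, where the function names are assumptions and the field names follow the ShareGPT-style example shown above.

```python
import json


def to_sharegpt_line(prompt: str, completion: str) -> str:
    """Serialise one (prompt, completion) pair as a single JSONL line."""
    record = {"conversations": [
        {"from": "human", "value": prompt},
        {"from": "gpt", "value": completion},
    ]}
    # ensure_ascii=False keeps characters such as "£" readable in the file.
    return json.dumps(record, ensure_ascii=False)


def write_jsonl(pairs: list[tuple[str, str]], path: str) -> None:
    """Write one record per line; newlines inside values are escaped by json.dumps."""
    with open(path, "w", encoding="utf-8") as handle:
        for prompt, completion in pairs:
            handle.write(to_sharegpt_line(prompt, completion) + "\n")


line = to_sharegpt_line("Total due £1,240.00\nExtract the total.", '{"total": 1240.00}')
print(line)
```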
Training recipe: QLoRA on Axolotl
Axolotl is the recommended training framework for production distillation in 2026. Its YAML-driven configuration system, broad model support (Qwen, Gemma, Llama, Mistral, and more), and active maintenance (versions v0.28 and v0.29 shipped in early 2026 with FSDP2 integration and stable multimodal support) make it the lowest-friction path from dataset to trained adapter.
For a 7B–14B student on a task-specific dataset, QLoRA is the right training approach. It quantises the base model weights to 4-bit during training (reducing VRAM requirements by roughly 4×) while keeping the LoRA adapter weights in full precision. This makes it possible to train a 14B student on a single A100 80GB GPU, or a 7B student on an A10G 24GB GPU — hardware that is available on demand from any major cloud provider for £1.50–£2.50 per GPU-hour.
Axolotl configuration
Below is a production-ready Axolotl YAML configuration for distilling a Qwen 3 8B student. Adjust base_model, dataset paths and sequence length to your specific task.
base_model: Qwen/Qwen3-8B-Instruct
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer
load_in_4bit: true
strict: false
# Dataset — ShareGPT format, JSONL
datasets:
  - path: /data/teacher_outputs_filtered.jsonl
    type: sharegpt
    conversation: chatml
dataset_prepared_path: /data/last_run_prepared
val_set_size: 0.02
output_dir: /output/qwen3-8b-distilled
# Sequence length — set to the 95th percentile of your data
sequence_len: 2048
sample_packing: true
# QLoRA adapter config
adapter: qlora
lora_r: 64
lora_alpha: 128
lora_dropout: 0.05
lora_target_modules:
- q_proj
- v_proj
- k_proj
- o_proj
- gate_proj
- up_proj
- down_proj
# Training hyperparameters
num_epochs: 3
micro_batch_size: 4
gradient_accumulation_steps: 4 # effective batch size = 16
learning_rate: 2e-4
lr_scheduler: cosine
warmup_ratio: 0.05
optimizer: paged_adamw_8bit
weight_decay: 0.01
max_grad_norm: 1.0
# Logging and evaluation
logging_steps: 25
eval_steps: 200
save_steps: 200
save_total_limit: 3
load_best_model_at_end: true
# Flash Attention 2 for memory efficiency
flash_attention: true
# Weights and Biases experiment tracking (optional but recommended)
wandb_project: distillation-qwen3-8b
wandb_run_id: run-001
Key hyperparameter choices
LoRA rank (lora_r): 64 is the recommended default for distillation tasks. Lower ranks (16–32) are sufficient for simple classification or extraction tasks. Higher ranks (128) are worth trying if your domain eval accuracy plateaus at 64 and you have enough training data. For more on weight-decomposed LoRA variants, see the DoRA fine-tuning explainer.
Learning rate: 2e-4 with a cosine schedule works well for QLoRA on most tasks. If you see training loss spikes or the model starts producing incoherent outputs mid-training, drop the learning rate to 1e-4 and increase the warmup ratio to 0.1.
Epochs: 3 epochs is a reasonable default. Monitor validation loss carefully: if it starts rising before 3 epochs complete, stop early. Distillation tasks with small datasets (under 10,000 examples) are particularly prone to overfitting at epoch 3.
Sample packing: enable this unless your task has very short inputs. It concatenates multiple training examples into a single sequence up to sequence_len, eliminating padding waste and significantly improving GPU utilisation.
Training a Qwen 3 8B model for 3 epochs on 50,000 examples takes approximately 4–6 hours on a single A100 80GB GPU at the configuration above. At spot instance pricing (approximately $1.80/hour on major clouds), the compute cost is roughly $7–$11 (about £6–£9), a rounding error in the total distillation budget.
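The step count and compute cost for a run can be estimated from the configuration values. The sketch below is an approximation with assumed function names: it treats every example as one sequence, whereas sample packing merges short examples and so reduces the real number of steps.

```python
import math


def optimiser_steps(num_examples: int, micro_batch_size: int,
                    grad_accum_steps: int, num_epochs: int, num_gpus: int = 1) -> int:
    """Upper bound on optimiser steps, ignoring sample packing."""
    effective_batch = micro_batch_size * grad_accum_steps * num_gpus
    return math.ceil(num_examples / effective_batch) * num_epochs


def compute_cost_usd(hours: float, usd_per_gpu_hour: float, num_gpus: int = 1) -> float:
    """GPU rental cost for a training run."""
    return hours * usd_per_gpu_hour * num_gpus


# 50,000 examples with a 2% validation split leaves 49,000 for training.
print(optimiser_steps(49_000, micro_batch_size=4, grad_accum_steps=4, num_epochs=3))
print(compute_cost_usd(4, 1.80), compute_cost_usd(6, 1.80))
```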
Evaluation: domain accuracy, benchmarks, latency
A distilled model is only ready for production when it has passed a structured evaluation suite. The evaluation has three distinct layers, and all three must pass before you replace the teacher in production.
Layer 1: Domain task accuracy
The primary evaluation is accuracy on your actual task. Build a held-out test set of at least 500 examples — ideally 2,000 — that was not used in training or validation. For structured-output tasks, evaluate with exact-match or programmatic correctness checks. For generation tasks, use a combination of ROUGE scores and an LLM-as-judge evaluation. The critical metric is accuracy delta versus teacher: what percentage of the student's answers match the teacher's answer or are judged equal-or-better quality.
For production sign-off, a common threshold is 95% of teacher accuracy on domain tasks. If the student achieves 90%, it may still be acceptable if the cost saving is dramatic — but that is a business decision that should be made explicitly, not as a side-effect of not measuring carefully enough.
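The sign-off threshold can be encoded as an explicit gate so the decision is never made implicitly. The sketch below reads "95% of teacher accuracy" as a ratio of the two accuracies on the same held-out set; that interpretation, the function names and the example counts are assumptions.

```python
def relative_accuracy(student_correct: int, teacher_correct: int, total: int) -> float:
    """Student accuracy as a fraction of teacher accuracy on the same test set."""
    student_acc = student_correct / total
    teacher_acc = teacher_correct / total
    return student_acc / teacher_acc


def passes_domain_gate(student_correct: int, teacher_correct: int,
                       total: int, threshold: float = 0.95) -> bool:
    """True when the student reaches the required share of teacher accuracy."""
    return relative_accuracy(student_correct, teacher_correct, total) >= threshold


# Hypothetical counts on a 2,000-example held-out set.
print(round(relative_accuracy(1800, 1880, 2000), 3))
print(passes_domain_gate(1800, 1880, 2000))
```

Lowering the threshold to 0.90 for a business reason then becomes a visible, reviewable parameter change.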
Layer 2: General capability regression check
Fine-tuning a model for a narrow task can cause catastrophic forgetting — degradation of general capabilities that were present in the base model. Always run MMLU and HumanEval (or a subset of each) before and after training to verify that general language understanding and code generation have not regressed significantly. A 2–3 point MMLU regression is typical and acceptable; a 10+ point regression indicates the training process has damaged general capability and you should investigate before deploying.
Run the evaluation on both the base model checkpoint and the final distilled checkpoint. Track the delta, not the absolute score — what matters is how much the training changed the model.
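Tracking the delta rather than the absolute score is straightforward to automate. In the sketch below the scores are hypothetical, the function name is an assumption, and the 3-point tolerance is applied to every benchmark even though the text above only gives a figure for MMLU.

```python
def regression_report(base_scores: dict[str, float],
                      tuned_scores: dict[str, float],
                      max_drop: float = 3.0) -> dict[str, tuple[float, bool]]:
    """Map each benchmark to (points lost, within tolerance)."""
    report = {}
    for benchmark, base in base_scores.items():
        drop = base - tuned_scores[benchmark]
        report[benchmark] = (round(drop, 2), drop <= max_drop)
    return report


# Hypothetical before/after scores for the base and distilled checkpoints.
base = {"mmlu": 70.0, "humaneval": 60.0}
tuned = {"mmlu": 68.0, "humaneval": 48.0}
print(regression_report(base, tuned))
```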
Layer 3: Latency benchmarks
A cheaper model that is too slow creates a different class of production problem. Measure latency at the serving configuration you plan to use in production, under a realistic concurrency load. The metrics to track are:
| Metric | What it measures | Typical target |
|---|---|---|
| Time to first token (TTFT) p50/p95 | Responsiveness for interactive use cases | <200 ms p50 for single-GPU 8B model |
| Tokens per second (throughput) | Sustained generation speed | 80–150 tok/s for 8B on A10G with vLLM |
| End-to-end latency p50/p95 | Total round-trip for a typical request | Task-dependent — compare to teacher p50/p95 |
| Cost per 1K output tokens | Economics at scale | Self-hosted cost per 1K tokens (hourly GPU cost ÷ thousands of tokens served per hour) compared with the API price per 1K |
vLLM with PagedAttention is the recommended serving framework. On an A10G GPU (24 GB VRAM) serving a quantised Qwen 3 8B model, you can expect 80–120 tokens per second at typical prompt lengths, with TTFT under 200 ms at moderate concurrency. The throughput scales with GPU count if you tensor-parallelise, though for 7B–8B models a single GPU is usually the economically optimal configuration.
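Once per-request timings have been collected from a load test, the p50 and p95 figures can be computed with the standard library. This is a sketch with assumed function names; the measurements themselves would come from your own benchmark harness.

```python
import statistics


def latency_summary(latencies_ms: list[float]) -> dict[str, float]:
    """p50 and p95 of a list of latency measurements."""
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94]}


def tokens_per_second(total_output_tokens: int, elapsed_seconds: float) -> float:
    """Sustained generation throughput over a measurement window."""
    return total_output_tokens / elapsed_seconds


# Synthetic measurements: 1 ms, 2 ms, ..., 101 ms.
samples = [float(x) for x in range(1, 102)]
print(latency_summary(samples))
print(tokens_per_second(36_000, 360.0))
```

Reporting p95 alongside p50 matters because a healthy median can hide a slow tail under concurrency.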
Production deployment: vLLM, adapter merging, rollout
Getting the distilled model into production involves three decisions: how to serve it, whether to merge the adapter, and how to roll it out safely.
Serving with vLLM
vLLM is the production standard for self-hosted LLM serving in 2026. Its PagedAttention algorithm, continuous batching, and multi-GPU tensor parallelism deliver the throughput that production workloads require. For a single 8B distilled model, the serving command is straightforward:
# Serve a merged distilled model with vLLM
vllm serve /output/qwen3-8b-distilled-merged \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.90 \
--max-model-len 4096 \
--served-model-name qwen3-8b-distilled
# Or serve with an unmerged LoRA adapter (for multi-adapter setups)
vllm serve Qwen/Qwen3-8B-Instruct \
--enable-lora \
--lora-modules distilled=/output/qwen3-8b-distilled-adapter \
--max-loras 4 \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.85
vLLM's Multi-LoRA feature allows you to serve a single base model with up to 200 concurrent adapter variants, each adding only megabytes of memory overhead rather than the gigabytes a separate model deployment would require. This is particularly useful if you are distilling several task-specific students from the same base — serve the base once, attach adapters on demand.
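Clients reach the server through its OpenAI-compatible HTTP API, and an adapter registered with --lora-modules is selected by passing its name as the model. The sketch below builds such a request using only the standard library; the helper names, the localhost URL and the prompt are assumptions, and `send` is defined but not called because it needs a running server.

```python
import json
import urllib.request


def build_chat_request(model_name: str, user_text: str, max_tokens: int = 256) -> dict:
    """Body for POST /v1/chat/completions on an OpenAI-compatible server."""
    return {
        "model": model_name,   # e.g. "distilled", the adapter name registered above
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": max_tokens,
        "temperature": 0,
    }


def send(payload: dict, base_url: str = "http://localhost:8000") -> dict:
    """Send the request; only works while a server is listening at base_url."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


payload = build_chat_request("distilled", "Extract the invoice number from: 'Invoice #INV-20481'")
print(json.dumps(payload, indent=2))
```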
Adapter merging vs keeping adapters separate
After QLoRA training, you have two choices: merge the adapter weights permanently into the base model, or keep the adapter separate and load it dynamically at inference time.
Merge the adapter when: you have a single, stable distilled model that will not change frequently; you want the fastest possible inference (no adapter overhead); or you plan to re-quantise the merged model to INT4 or INT8 for further inference cost reduction. Merging is irreversible, so always keep the unmerged checkpoint.
Keep adapters separate when: you are running A/B tests between adapter versions; you are serving multiple task-specific variants from one base model; or you expect to retrain and swap the adapter frequently as your task distribution evolves.
# Merge a trained LoRA adapter into the base model using PEFT
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model in half precision (not the 4-bit weights used during training)
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto",
)

# Attach the adapter, then fold its weights into the base layers
model = PeftModel.from_pretrained(base_model, "/output/qwen3-8b-distilled-adapter")
merged = model.merge_and_unload()
merged.save_pretrained("/output/qwen3-8b-distilled-merged")

# Save the tokenizer alongside the merged weights so the directory is self-contained
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B-Instruct")
tokenizer.save_pretrained("/output/qwen3-8b-distilled-merged")
Rollout strategy
Never replace the teacher entirely on day one. A phased rollout with a shadow evaluation period is the standard production practice, and the teams that skip it are the ones that get paged at 2 am.
The recommended rollout sequence is: shadow mode first (student runs in parallel with teacher, results logged but not served to users); then 5% of live traffic for 48 hours; then 25%, then 50%, then 100%, with at least 24 hours at each stage. At each stage, compare the student's answers to the teacher's on the real production traffic using your LLM-as-judge evaluator. Maintain a kill switch — a feature flag or routing rule — that instantly reverts all traffic to the teacher if quality metrics drop.
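One common way to implement the traffic split is deterministic hashing of a request or user identifier, so the same caller always lands on the same side during a stage. The sketch below assumes that approach; the function name, the stage list and the identifier format are illustrative, and shadow mode corresponds to serving 0% from the student while still logging its answers.

```python
import hashlib

ROLLOUT_STAGES = [0, 5, 25, 50, 100]   # percent of live traffic; 0 = shadow mode


def serve_with_student(request_id: str, student_percent: int,
                       kill_switch: bool = False) -> bool:
    """Decide whether this request is answered by the student."""
    if kill_switch:
        return False   # revert all traffic to the teacher immediately
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100   # stable bucket in [0, 99]
    return bucket < student_percent


served = sum(serve_with_student(f"req-{i}", 25) for i in range(10_000))
print(f"{served} of 10000 requests routed to the student at the 25% stage")
```

Because the bucket depends only on the identifier, raising the percentage from 5 to 25 keeps every caller from the 5% stage on the student and adds new ones.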
For agent observability during rollout, integrating OpenTelemetry traces on the student endpoint is strongly recommended — see the observability guide for the full tracing setup.
The break-even maths: when distillation ROI justifies the investment
Distillation has a real upfront cost: engineer time to build the data pipeline, compute for training, and ongoing maintenance when the task distribution drifts. The maths only works if the ongoing inference saving outpaces that upfront investment within a reasonable payback period. Here is a worked example.
Inputs to the calculation
| Variable | Example value | Notes |
|---|---|---|
| Monthly output tokens (teacher) | 500,000,000 (500M) | From billing dashboard |
| Teacher API cost | $3.00 per 1M output tokens | Mid-tier frontier model pricing |
| Monthly teacher cost | $1,500 / month (approx £1,200) | 500M × $3 ÷ 1M |
| Self-hosted GPU cost (A10G, on-demand) | $0.75/hour ≈ £0.60/hour | Single A10G at on-demand pricing |
| Student throughput (vLLM, A10G) | 100 tokens/second sustained | 8B model, typical concurrency |
| Student output tokens per month | 500M (same workload) | Assuming full replacement |
| GPU-hours to serve 500M tokens | ~1,388 hours | 500M ÷ (100 × 3,600) |
| Monthly self-hosted inference cost | $1,041 / month (approx £830) | 1,388 × $0.75 |
| Monthly saving | $459 / month (approx £370) | $1,500 − $1,041 |
At 500M output tokens per month on this teacher, the monthly saving is modest — roughly £370. The upfront distillation cost — a week of engineer time (£4,000 at senior rates) plus £15 in training compute — means payback takes approximately 11 months. That is a borderline case: reasonable for a long-lived feature, questionable for something experimental.
Now scale the volume. At 2 billion output tokens per month, the monthly teacher cost is £4,800. At the same 100 tokens per second per GPU, self-hosted inference cost scales linearly with volume: roughly 5,556 GPU-hours (about eight A10Gs running continuously), or approximately £3,300 per month. The monthly saving is now £1,500, and the same £4,000 upfront investment pays back in under three months.
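The worked example can be turned into a small calculator so you can substitute your own volumes and prices. The sketch below works in US dollars and converts the upfront cost at the roughly £0.80 per $1 rate implied by the table; that rate, the function names and the 730-hour month are assumptions.

```python
HOURS_PER_MONTH = 730   # average hours in a calendar month


def gpu_hours(tokens_per_month: float, tokens_per_second: float) -> float:
    """GPU-hours needed to generate the monthly token volume."""
    return tokens_per_month / (tokens_per_second * 3600)


def monthly_saving_usd(tokens_per_month: float, teacher_usd_per_million: float,
                       tokens_per_second: float, usd_per_gpu_hour: float) -> float:
    """Teacher API bill minus self-hosted GPU bill."""
    teacher = tokens_per_month / 1_000_000 * teacher_usd_per_million
    student = gpu_hours(tokens_per_month, tokens_per_second) * usd_per_gpu_hour
    return teacher - student


def payback_months(upfront_usd: float, saving_usd: float) -> float:
    """Months until cumulative savings cover the upfront cost."""
    return float("inf") if saving_usd <= 0 else upfront_usd / saving_usd


UPFRONT_USD = 4_015 / 0.80   # £4,000 engineering + £15 compute, converted to USD

for volume in (500e6, 2e9):
    hours = gpu_hours(volume, 100)
    saving = monthly_saving_usd(volume, 3.0, 100, 0.75)
    print(f"{volume:,.0f} tokens: {hours:,.0f} GPU-hours "
          f"(~{hours / HOURS_PER_MONTH:.1f} GPUs), saving ${saving:,.0f}/month, "
          f"payback {payback_months(UPFRONT_USD, saving):.1f} months")
```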
The volume thresholds that make distillation worthwhile
The numbers above translate into a practical rule of thumb:
- Under 100M tokens/month: distillation rarely beats caching + routing on ROI. The engineering effort is the same, but the bill is small enough that caching and routing close most of the gap. Stick with the routing and caching playbook until volume grows.
- 100M–500M tokens/month: distillation is viable but only if the task is narrow and stable enough that the student will not need frequent retraining. Payback is typically 6–18 months.
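- 500M+ tokens/month: distillation has a clear positive ROI for any stable narrow task. Payback is typically under 6 months once volume is well past the threshold or the teacher is priced above the mid tier; at exactly 500M tokens against a $3-per-million teacher, the worked example above puts it closer to 11 months. At this volume, it is worth investing in a proper distillation pipeline with automated retraining when distribution drift is detected.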
Cost per 1K outputs: before and after
| Configuration | Cost per 1K output tokens | At 500M tokens/month |
|---|---|---|
| GPT-4o (frontier teacher) | $0.010 per 1K | $5,000 / month |
| Mid-tier frontier (e.g. Claude 3 Haiku, Gemini Flash) | $0.003 per 1K | $1,500 / month |
| Distilled 8B student, A10G on-demand | $0.0021 per 1K | $1,041 / month |
| Distilled 8B student, A10G reserved 1-year | $0.00125 per 1K | $625 / month |
The economics improve significantly if you can commit to reserved GPU capacity. A one-year reserved A10G instance reduces the per-hour cost by roughly 40% compared to on-demand, bringing the cost per 1K tokens below $0.0013: roughly a 60% reduction versus a mid-tier frontier API, and about 87% versus GPT-4o pricing, for a task where the student achieves parity.
Failure modes and how to avoid them
The distillation projects that fail in production almost always fail for one of four reasons. Knowing them in advance is the difference between a smooth rollout and an expensive rollback.
1. Distribution mismatch
The student is trained on one distribution of teacher outputs but encounters a different distribution in production. This typically happens when the training data was sourced from a period that is no longer representative — the product changed, the user base shifted, or a new category of input appeared. The failure mode is a sudden drop in student accuracy that the training metrics gave no warning of, because the training set looked fine.
Mitigate this by: monitoring the input distribution in production continuously (embedding distance drift is a reliable signal); scheduling periodic retraining on recent production logs; and keeping the teacher in the serving stack as a fallback that activates when the student's confidence scores drop below a threshold.
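Embedding-distance drift can be approximated by comparing the centroid of training-time input embeddings with the centroid of recent production embeddings. The sketch below uses plain lists of floats as stand-ins for real embeddings; the function names and the 0.1 threshold are assumptions that would need calibrating on your own data, and a centroid comparison will miss drift that changes the spread but not the mean.

```python
import math


def centroid(vectors: list[list[float]]) -> list[float]:
    """Element-wise mean of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity; 0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def drift_detected(train_embeddings: list[list[float]],
                   prod_embeddings: list[list[float]],
                   threshold: float = 0.1) -> bool:
    """Flag drift when the two centroids point in noticeably different directions."""
    distance = cosine_distance(centroid(train_embeddings), centroid(prod_embeddings))
    return distance > threshold


train = [[1.0, 0.0], [1.0, 0.0]]
recent = [[1.0, 0.0], [0.9, 0.1]]
print(drift_detected(train, recent))
```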
2. Student collapse on edge cases
The student learns to perform well on the common case but fails catastrophically on the tail of the input distribution — unusual phrasings, long inputs near the sequence length limit, inputs that contain adversarial patterns or sensitive content. Average accuracy looks fine; edge cases are silently broken.
The mitigation is building a hand-curated edge-case eval set at training time — before you have seen any student failure — and treating it as a hard gate in your evaluation pipeline. Include: the longest inputs in your distribution, the rarest domain sub-topics, adversarially rephrased versions of common inputs, and any known failure modes from the teacher's production history.
3. Teacher-style overfitting
With a very small training set (under 2,000 examples), the student can overfit to the teacher's idiosyncratic stylistic patterns rather than learning the underlying task structure. The outputs look like the teacher's outputs even when the task would be better served by a different style. Perplexity on the validation set looks acceptable, but human raters or your LLM judge rate the outputs as sounding robotic or over-formatted.
The fix is more diverse training data. Vary prompt phrasing across the training set, include examples from multiple time periods and contexts, and if possible include some negative examples (inputs where a correct response differs from the teacher's default style). If the dataset is genuinely constrained in size, reduce the number of epochs and increase LoRA dropout to 0.1.
4. Serving infrastructure debt
A distilled model that works in evaluation but is never properly operationalised — no monitoring, no rollback procedure, no version control on the adapter — becomes a maintenance liability within a few months. The team that shipped it moves on, the model quietly degrades as the task distribution drifts, and nobody notices until a business metric drops.
Treat your distilled model with the same rigour as any production service: version control all artifacts (base model hash, adapter checkpoint, training config, dataset hash), instrument the serving endpoint with latency and accuracy metrics, and schedule a quarterly distribution drift check as a calendar event from day one.
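Versioning the artifacts can start with a manifest of content hashes written next to each deployed adapter. The sketch below is one possible shape, with assumed function names and manifest keys; it records the base model by identifier, and pinning an exact upstream revision is left to the caller.

```python
import hashlib
import json


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large checkpoints do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(base_model_id: str, file_paths: dict[str, str]) -> dict:
    """Record the base model plus a path and hash for each named artifact."""
    manifest = {"base_model": base_model_id}
    for label, path in file_paths.items():
        manifest[label] = {"path": path, "sha256": sha256_of_file(path)}
    return manifest


def manifest_json(manifest: dict) -> str:
    """Stable serialisation suitable for committing to version control."""
    return json.dumps(manifest, indent=2, sort_keys=True)
```

A typical call would pass labels such as "adapter", "training_config" and "dataset" mapped to the corresponding file paths from the training run.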
Quick-start checklist
The full distillation pipeline, compressed into a checklist for teams starting their first project:
- Verify the business case: is monthly token volume above 500M on this specific task?
- Define the task boundary precisely — narrow scope is a prerequisite for success.
- Pull a representative sample of real production inputs (min 10,000 examples).
- Deduplicate and cluster the inputs to verify distribution coverage.
- Generate teacher completions (temperature 0 for deterministic tasks, CoT for reasoning).
- Apply quality filters: length, format validation, and LLM-as-judge scoring.
- Choose a student base model — default to Qwen 3 8B unless there is a specific reason otherwise.
- Configure Axolotl YAML and run QLoRA training (3 epochs, lora_r 64, lr 2e-4).
- Evaluate: domain accuracy delta vs teacher, MMLU/HumanEval regression check, latency benchmarks.
- Gate: domain accuracy must be within 5% of teacher, no more than 3-point MMLU regression.
- Decide: merge adapter or keep separate for Multi-LoRA serving.
- Deploy via vLLM with shadow mode first, then phased traffic rollout.
- Instrument: OpenTelemetry traces, accuracy monitoring, input distribution drift detection.
- Schedule quarterly retraining review.