What LoRA does and why it works
Low-Rank Adaptation (LoRA) is the workhorse of parameter-efficient fine-tuning. The core idea is simple. A pre-trained weight matrix W in a transformer has dimensions d × k. During fine-tuning you want to update W, but doing so for every parameter in a 7B or 70B model requires enormous GPU memory and compute. LoRA instead freezes the original W and learns a low-rank decomposition of the update: it adds two small matrices B (d × r) and A (r × k), where r is the chosen rank, typically 8, 16, or 32. The effective weight during a forward pass is W + BA (with BA scaled by alpha / r in practice), and because r is much smaller than d or k, the number of trainable parameters, r × (d + k) per layer, is a tiny fraction of the original d × k.
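The shapes are easier to see in code. The following is a minimal sketch in plain Python (lists instead of tensors, toy dimensions), following the convention of the LoRA paper in which B is d × r and A is r × k, so that the product BA has the same d × k shape as W. The helper names `matmul` and `lora_param_count` are illustrative and not part of any library.

```python
import random


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def lora_param_count(d, k, r):
    """Trainable parameters added by one LoRA pair: B is d x r, A is r x k."""
    return r * (d + k)


random.seed(0)
d, k, r = 6, 4, 2  # toy sizes; real projection layers are in the thousands

W = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]     # frozen, d x k
B = [[0.0] * r for _ in range(d)]                                  # trainable, zero-initialised
A = [[random.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]  # trainable, small random init

delta = matmul(B, A)  # low-rank update, d x k
W_eff = [[w + dw for w, dw in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]

# With B at zero the adapted layer starts out identical to the pre-trained one.
print(W_eff == W)
# One 4096 x 4096 projection at rank 16: adapter parameters vs. frozen parameters.
print(lora_param_count(4096, 4096, 16), 4096 * 4096)
```

Zero-initialising B is what lets training start from the unmodified pre-trained behaviour; only B and A receive gradients.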
For a 7B Llama-class model, applying LoRA at rank 16 reduces the trainable parameter count from 7 billion to roughly 17 million when only the attention projections are adapted, or roughly 40 million when every linear layer is adapted. Either way that is less than 0.6 percent of the model. That footprint fits comfortably in the spare VRAM of a GPU that is already holding the quantised base weights. In that quantised (QLoRA-style) setup the frozen W lives in 4-bit or 8-bit precision, while the small A and B matrices train in bfloat16. The memory maths work out, and the quality is surprisingly close to full fine-tuning on many tasks.
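The arithmetic behind those figures can be checked in a few lines. This is a rough sketch that assumes the published Llama-7B layer shapes (hidden size 4096, feed-forward size 11008, 32 layers) and the usual projection names; none of these values come from the text above, and other 7B-class models will differ.

```python
# Assumed Llama-7B-style shapes, written as (out_features, in_features).
HIDDEN, INTERMEDIATE, LAYERS = 4096, 11008, 32
BASE_PARAMS = 6.74e9  # approximate parameter count of a 7B Llama-class model

ATTENTION = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
}
FEED_FORWARD = {
    "gate_proj": (INTERMEDIATE, HIDDEN),
    "up_proj": (INTERMEDIATE, HIDDEN),
    "down_proj": (HIDDEN, INTERMEDIATE),
}


def lora_params(shapes, rank, layers=LAYERS):
    """Each adapted module adds rank * (out + in) trainable parameters."""
    return layers * sum(rank * (out + inp) for out, inp in shapes.values())


attn_only = lora_params(ATTENTION, rank=16)
all_linear = lora_params({**ATTENTION, **FEED_FORWARD}, rank=16)

print(f"attention only: {attn_only / 1e6:.1f}M trainable parameters")
print(f"all linear:     {all_linear / 1e6:.1f}M trainable parameters")
print(f"all linear as a share of the base model: {100 * all_linear / BASE_PARAMS:.2f}%")
```

Under these assumptions the attention-only count lands near 17 million and the all-linear count near 40 million, which is the order of magnitude quoted above.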
The practical consequence is that a single 24GB consumer GPU — an RTX 3090, RTX 4090, or the L4 available on Google Colab — can fine-tune a 7B model on a custom dataset in a few hours. That capability, which would have required a multi-node A100 cluster two years ago, is now accessible to a solo developer in Bengaluru or Bristol. LoRA is why parameter-efficient fine-tuning became the default approach for the entire community.
Set target_modules="all-linear" rather than listing individual attention projections by name. This applies LoRA to every linear layer in the model — attention projections plus feed-forward layers — and consistently outperforms attention-only LoRA on instruction-following and reasoning tasks. The parameter count increase is modest; the quality gain is not.
The LoRA limitation: coupled magnitude and direction updates
LoRA's design has a subtle weakness that the DoRA authors identify with precision. When you add the low-rank product BA to the frozen weight W, the resulting update entangles two distinct properties of the weight matrix: its magnitude (how large the weight values are) and its direction (the orientation of the weight vectors in parameter space). Both properties are updated simultaneously via the same BA product, and they have fundamentally different learning dynamics.
Magnitude updates govern how strongly a given weight channel responds to its input: they act as a scaling operation. Direction updates govern which patterns in the input the weight channel is sensitive to: they act as a rotation. In full fine-tuning, the gradient descent optimiser handles these separately because each parameter receives its own gradient. In LoRA, the low-rank constraint forces a compromise: the rank-r subspace that BA spans must simultaneously serve the magnitude and direction update budgets. Neither gets the full expressivity that unconstrained optimisation would give it.
The empirical consequence, documented in the DoRA paper's weight-decomposition analysis, is a distinctive pattern: under LoRA, changes in magnitude and changes in direction tend to move together, so a large magnitude update comes with a large direction update. Full fine-tuning behaves differently and can pair a large change in one with a small change in the other. LoRA therefore struggles to make a slight directional adjustment alongside a substantial magnitude change, or vice versa. This coupling is not catastrophic; LoRA still works well. But it leaves measurable performance on the table, particularly on tasks that require nuanced adaptation of both the scale and the orientation of weight representations.
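One way to inspect this yourself is to measure, for each output channel, how much its norm changed and how much its orientation changed between two versions of a weight matrix. The sketch below is in the spirit of the DoRA paper's analysis; the exact metric definitions (mean absolute norm change, mean of one minus cosine similarity) and the one-row-per-channel layout are assumptions made for illustration.

```python
import math


def row_norms(W):
    """L2 norm of each row (one row per output channel in this sketch)."""
    return [math.sqrt(sum(x * x for x in row)) for row in W]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def magnitude_change(W0, W1):
    """Mean absolute change in per-channel norm."""
    m0, m1 = row_norms(W0), row_norms(W1)
    return sum(abs(a - b) for a, b in zip(m0, m1)) / len(m0)


def direction_change(W0, W1):
    """Mean of (1 - cosine similarity) between matching channels."""
    return sum(1.0 - cosine(u, v) for u, v in zip(W0, W1)) / len(W0)


W0 = [[3.0, 4.0], [1.0, 0.0]]
rescaled = [[6.0, 8.0], [2.0, 0.0]]  # every channel doubled, orientation untouched
rotated = [[4.0, 3.0], [0.0, 1.0]]   # norms untouched, orientation changed

print(magnitude_change(W0, rescaled), direction_change(W0, rescaled))
print(magnitude_change(W0, rotated), direction_change(W0, rotated))
```

Plotting these two numbers per layer for a LoRA checkpoint and a fully fine-tuned checkpoint is one way to see whether the two kinds of change move together.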
What DoRA adds: separating magnitude from direction
DoRA (Weight-Decomposed Low-Rank Adaptation) resolves this by decomposing the pre-trained weight matrix W into two components before training begins. Specifically, it writes W as the product of a magnitude vector m (one scalar per output channel) and a normalised direction matrix V / ||V||. The decomposition is W = m × V / ||V||, where ||V|| is the vector-wise norm of V (written as a column-wise norm in the DoRA paper), so that each weight vector of V / ||V|| has unit norm. At initialisation V is set to W and m to the vector-wise norm of W, which reproduces the pre-trained weights exactly.
During fine-tuning, DoRA updates V using a standard LoRA low-rank product — the direction component gets the same efficient parameterisation as vanilla LoRA. But magnitude m is updated as a separate, lightweight set of learned scalars, one per output channel. Because the two components are now independent parameters with independent gradients, the optimiser can evolve them at different rates according to the actual curvature of the loss landscape, without either one constraining the other.
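A small numerical sketch makes the separation concrete. It computes the adapted weight as m × (W + BA) / ||W + BA|| on a 2 × 2 toy matrix in plain Python. Treating each row as one output channel is an assumption of this sketch (the paper writes the norm column-wise), and `dora_weight` is an illustrative helper, not a library function.

```python
import math


def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def row_norms(M):
    return [math.sqrt(sum(x * x for x in row)) for row in M]


def dora_weight(W0, B, A, m):
    """Return m * (W0 + BA) / ||W0 + BA||, normalising each row separately."""
    delta = matmul(B, A)
    V = [[w + dw for w, dw in zip(w_row, d_row)] for w_row, d_row in zip(W0, delta)]
    return [[mag * x / norm for x in row] for mag, norm, row in zip(m, row_norms(V), V)]


W0 = [[3.0, 4.0], [0.0, 2.0]]  # frozen pre-trained weight
m = row_norms(W0)              # magnitude vector, initialised from W0
B = [[0.0], [0.0]]             # LoRA B (2 x 1), zero-initialised
A = [[0.5, -0.5]]              # LoRA A (1 x 2)

at_init = dora_weight(W0, B, A, m)  # reproduces W0 before any training

B[0][0] = 1.0                       # direction-only update to channel 0
turned = dora_weight(W0, B, A, m)   # channel 0 changes orientation, keeps its norm

m[0] = 10.0                         # magnitude-only update to channel 0
scaled = dora_weight(W0, B, A, m)   # same orientation as `turned`, larger norm

print(at_init, turned, scaled, sep="\n")
```

Because the low-rank product only ever passes through the normalisation, it cannot change a channel's norm; that job belongs to m alone.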
The parameter overhead of this decomposition is small. For a linear layer with d output channels, DoRA adds d extra parameters for the magnitude vector. At model scale that is on the order of 0.01 percent of the base model's parameters, or a few percent on top of the LoRA matrices themselves. The compute overhead is more noticeable than the parameter overhead: each adapted layer has to compute the vector-wise norm of the updated direction matrix and apply the magnitude scaling, so DoRA training steps run somewhat slower than plain LoRA steps. Merging the adapter into the base weights before serving removes that cost at inference time.
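To put a number on the parameter overhead, the sketch below counts the magnitude scalars next to the LoRA matrices. It assumes Llama-7B layer shapes (hidden size 4096, feed-forward size 11008, 32 layers), rank 16, and adapters on all seven linear modules per layer; those values are assumptions for illustration, so treat the result as an order of magnitude.

```python
HIDDEN, INTERMEDIATE, LAYERS, RANK = 4096, 11008, 32, 16
BASE_PARAMS = 6.74e9  # approximate size of a 7B Llama-class model

# (out_features, in_features) for q, k, v, o, gate, up, down in one layer.
SHAPES = (
    [(HIDDEN, HIDDEN)] * 4
    + [(INTERMEDIATE, HIDDEN)] * 2
    + [(HIDDEN, INTERMEDIATE)]
)

lora = LAYERS * sum(RANK * (out + inp) for out, inp in SHAPES)
magnitude = LAYERS * sum(out for out, _ in SHAPES)  # one scalar per output channel

print(f"LoRA matrices:     {lora:,}")
print(f"magnitude vectors: {magnitude:,}")
print(f"extra vs LoRA:       {100 * magnitude / lora:.1f}%")
print(f"extra vs base model: {100 * magnitude / BASE_PARAMS:.3f}%")
```

The magnitude vectors are a few percent of the adapter and a few hundredths of a percent of the base model under these assumptions.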
Critically, DoRA is a drop-in addition. Any existing LoRA configuration can be upgraded to DoRA by setting a single flag (use_dora=True in PEFT or Unsloth). The rank, alpha, target modules, and every other hyperparameter remain identical. There is no new training loop, no changed data format, no additional infrastructure requirement.
Empirical results: where DoRA wins
The DoRA paper evaluates the method across three task families, and the gains are consistent enough to be practically meaningful rather than merely statistically significant.
Commonsense reasoning. Tested on a suite of eight commonsense reasoning benchmarks, including BoolQ, PIQA, HellaSwag, WinoGrande, ARC-Easy, and ARC-Challenge, DoRA outperforms LoRA by 1–3 percentage points when both are applied to LLaMA-7B with identical rank and training budget. On a subset of the harder benchmarks, DoRA's performance is within 1 percentage point of full fine-tuning, despite training only roughly 0.6 percent of the model's parameters.
Visual instruction tuning. Applied to LLaVA-1.5, a multimodal model that requires adapting both the language model backbone and the vision-language projection layer, DoRA consistently outperforms LoRA on science and mathematical visual question-answering benchmarks. The gains here are particularly notable because visual instruction tuning requires adapting weight directions significantly — the model must learn new cross-modal associations — which is precisely the regime where DoRA's uncoupled direction updates provide the most benefit.
Instruction following. On instruction-following evaluations, DoRA improves MT-Bench scores and produces qualitatively better completions on nuanced multi-turn prompts. The improvement is most visible in tasks that require the model to maintain a specific persona, follow complex formatting instructions, or reason through multi-step problems consistently across a conversation.
The consistent pattern across all three domains is the same: DoRA narrows, and in some cases closes, the gap to full fine-tuning that LoRA leaves, while maintaining the memory efficiency that makes PEFT methods practical. For builders who were already getting good results with LoRA, DoRA is a low-effort upgrade. For builders where LoRA was falling short of full fine-tuning quality, DoRA may close that gap without requiring additional hardware.
Hyperparameter defaults that work
DoRA introduces no new hyperparameters of its own — you configure it exactly as you would a LoRA run. The following defaults are a solid starting point for the 7B–13B model range on most fine-tuning tasks:
- Rank (r): 16. This is the most broadly applicable value. Increase to 32 if you have a large, diverse dataset and want more expressive capacity; reduce to 8 for very small datasets (under 500 examples) to limit overfitting.
- Alpha: 16. Setting alpha equal to r keeps the LoRA scaling factor at 1.0, which simplifies learning rate tuning. Some practitioners prefer alpha = 2 × r for a slight quality boost, at the cost of needing a lower learning rate to compensate.
- Target modules: all-linear. Apply DoRA to every linear layer including feed-forward layers, not just attention projections.
- Learning rate: 2e-4 with a cosine decay schedule and a 3–5 percent warm-up phase. This is the standard LoRA learning rate and works equally well for DoRA.
- Epochs: 1–3. One epoch is sufficient for large, high-quality datasets; three epochs is the upper bound before diminishing returns and overfitting risk become significant. If your dataset is under 1,000 examples, monitor validation loss carefully after epoch 1.
- Dropout: 0.05–0.1 on the LoRA matrices. Small amounts of dropout reduce overfitting on small datasets without meaningfully hurting convergence.
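Two of these defaults are simple formulas, sketched below in plain Python: the LoRA scaling factor alpha / r, and a learning-rate schedule with linear warm-up followed by cosine decay. The function names are illustrative, and the schedule is a simplified version of what training libraries implement, so exact per-step values may differ from your trainer's.

```python
import math


def lora_scaling(alpha, r):
    """Factor applied to the low-rank update BA."""
    return alpha / r


def lr_at(step, total_steps, peak_lr=2e-4, warmup_ratio=0.05):
    """Linear warm-up to peak_lr, then cosine decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


print(lora_scaling(alpha=16, r=16))  # alpha equal to r
print(lora_scaling(alpha=32, r=16))  # alpha = 2 x r doubles the effective update

for step in (0, 25, 50, 525, 1000):
    print(step, f"{lr_at(step, total_steps=1000):.2e}")
```

Doubling alpha doubles the size of the update the adapter contributes, which is why the alpha = 2 × r setting usually needs a lower learning rate.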
For 70B-class models, pair these settings with 4-bit NF4 quantisation to produce a QDoRA run. The training dynamics are essentially unchanged; you are simply reducing the memory footprint of the frozen base weights to make the configuration viable on a single 80GB H100.
Toolchain: Unsloth, Axolotl, and TRL
Three tools cover the vast majority of DoRA fine-tuning workflows in 2026. Your choice between them depends on whether you prioritise speed, configurability, or objective diversity.
Unsloth is the right choice for speed on consumer hardware. It provides hand-written Triton GPU kernels for the LoRA and DoRA forward and backward passes, which the project reports as roughly 2x faster than the equivalent Hugging Face PEFT implementation on the same GPU. The gap is widest on single-GPU setups (RTX 3090, RTX 4090, L4, A10), which is exactly the hardware most individual builders and small teams are working with. Unsloth supports DoRA natively via a single flag, handles 4-bit QDoRA automatically, and includes a built-in Alpaca/ChatML dataset formatter that reduces boilerplate for common fine-tuning tasks. It is also the recommended starting point for Indian AI labs working on domain-specific models where iteration speed on limited hardware is the binding constraint.
Axolotl is the right choice when you want full control via a YAML configuration file and need to run repeatable, auditable experiments. Axolotl wraps Hugging Face PEFT and Transformers behind a declarative config format, making it easy to version-control your entire training run, reproduce experiments across machines, and share configurations with teammates. It supports DoRA, QLoRA, and full fine-tuning with the same config structure, which makes comparison experiments straightforward. UK enterprise teams operating under change-management processes that require documented, reproducible training pipelines tend to gravitate towards Axolotl.
TRL (Transformer Reinforcement Learning) is the right choice when your objective goes beyond supervised fine-tuning. If you need Direct Preference Optimisation (DPO), RLHF, ORPO, or SimPO, TRL handles these natively while still supporting LoRA and DoRA adapters via PEFT. The SFTTrainer in TRL is a thin wrapper around Hugging Face Trainer that adds dataset formatting helpers and token-length filtering. For teams building instruction-tuned or preference-aligned models, TRL is the standard tool — and DoRA slots in via the same use_dora=True flag in the PEFT config.
Step-by-step DoRA setup with Unsloth
The following code configures a DoRA run on a 7B instruct model using Unsloth. This runs on a single 24GB GPU.
```shell
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps trl peft accelerate bitsandbytes
```

```python
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Load model with 4-bit quantisation (QDoRA)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.1-8B-Instruct",
    max_seq_length=2048,
    dtype=None,  # auto-detect bfloat16 / float16
    load_in_4bit=True,
)

# Apply DoRA adapter
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules="all-linear",
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    use_dora=True,  # the only flag that differs from vanilla LoRA
    random_state=42,
)

# Load and format dataset (Alpaca-style)
dataset = load_dataset("json", data_files="train.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        output_dir="./dora-output",
        num_train_epochs=2,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        warmup_ratio=0.05,
        fp16=False,
        bf16=True,
        logging_steps=25,
        save_steps=500,
        seed=42,
    ),
)
trainer.train()

# Save merged model (optional; for deployment without PEFT dependency)
model.save_pretrained_merged("./dora-merged", tokenizer, save_method="merged_16bit")
```
When saving a merged DoRA model, use save_method="merged_16bit" or "merged_4bit_forced" — never "lora". The DoRA magnitude scalars are baked into the merged weights during export; saving only the LoRA matrices without the magnitude component produces a broken checkpoint that silently ignores the magnitude update and reverts to vanilla LoRA behaviour.
Dataset quality beats dataset quantity
The single most impactful decision in a DoRA fine-tuning run is not the rank, the learning rate, or the toolchain. It is the quality of the training data. A consistent finding across the community — supported by formal ablation studies and years of practitioner experience — is that 500 clean, representative examples consistently outperform 5,000 noisy ones on task-specific evaluation metrics.
The mechanism is straightforward. Fine-tuning pushes the model's weights towards the distribution represented by the training data. If that data contains incorrect answers, inconsistent formatting, contradictory instructions, or domain mismatch, the model learns those patterns. LoRA and DoRA do not correct for data quality; they are efficient learners, which means they efficiently learn whatever you give them — including the noise.
Filtering heuristics that work in practice:
- Length filter: Remove examples where the input is under 30 tokens or over 90 percent of your target sequence length. Very short examples often lack context; very long ones tend to be noisy or misformatted.
- Deduplication: Use MinHash or exact hash deduplication. Duplicate examples cause the model to memorise rather than generalise, and they inflate the apparent dataset size without adding information.
- Output quality filter: If your dataset has model-generated responses, filter out examples where a reference model scores the response below a quality threshold. Even using the base model you are fine-tuning as the reference provides a useful signal.
- Domain relevance: For domain-specific fine-tuning (medical, legal, financial, regional language), sample from primary sources rather than general web crawls. The corpus that trained the base model is already general-domain; you need to add domain signal, not more general signal.
- Human review of a random 5 percent sample: No automated filter catches everything. Review 25–50 randomly sampled examples before training. Patterns of systematic noise are almost always visible at this scale.
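The length and deduplication heuristics above can be sketched as a single pass over the data. This is a minimal example under stated assumptions: records are dictionaries with hypothetical `input` and `output` keys, token counts come from a whitespace split rather than the model's tokenizer, and deduplication is an exact hash after lowercasing and whitespace normalisation (MinHash would be needed to catch near-duplicates).

```python
import hashlib


def token_count(text):
    # Whitespace split is a rough stand-in for the model's tokenizer.
    return len(text.split())


def filter_examples(examples, min_input_tokens=30, max_seq_length=2048, max_fraction=0.9):
    """Drop inputs that are too short or too long, then drop exact duplicates."""
    seen = set()
    kept = []
    for ex in examples:
        n_input = token_count(ex["input"])
        if n_input < min_input_tokens or n_input > max_fraction * max_seq_length:
            continue
        normalised = " ".join((ex["input"] + "\n" + ex["output"]).lower().split())
        digest = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(ex)
    return kept


examples = [
    {"input": "word " * 40, "output": "ok"},
    {"input": "WORD " * 40, "output": "OK"},  # duplicate after normalisation
    {"input": "too short", "output": "x"},    # under 30 input tokens
    {"input": "w " * 2000, "output": "y"},    # over 90 percent of 2048
]
kept = filter_examples(examples)
print(f"kept {len(kept)} of {len(examples)} examples")
```

Swapping `token_count` for a call to the actual tokenizer is the first change to make before using this on real data, since whitespace counts can be far off for non-English text.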
For Indian AI labs building domain-specific models — whether in Hindi, Tamil, Telugu, or other regional languages — this principle is especially important. The base model's pre-training coverage of these languages is uneven, and the signal-to-noise ratio of publicly available instruction datasets in those languages is lower than for English. Starting with 500 carefully curated examples in the target language will outperform rushing to 10,000 scraped ones.
Evaluation pitfalls that cost builders time
Low training loss is not a success criterion. This bears repeating because it is the most common mistake in first-time fine-tuning projects, and it costs weeks of iteration time when teams discover the problem in production rather than in evaluation.
A model can reach near-zero training loss by memorising the training set. It can simultaneously perform no better than the base model — or worse — on the actual task you care about. The training loss measures how well the model has learned to reproduce the training examples; it says nothing about generalisation, task accuracy, or robustness to prompt variations it has never seen.
Always measure these three things after every fine-tuning run:
Task-specific accuracy on a held-out test set. Hold out at least 10–15 percent of your dataset before training begins. Never let the model see these examples during training. Evaluate task accuracy — the metric that actually matters for your application, not cross-entropy loss — on this test set after training. If accuracy does not improve materially over the base model, the fine-tuning run did not work, regardless of what the training curve looks like.
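A seeded split made once, before any training, is the simplest way to guarantee the test set stays unseen. The sketch below uses only the standard library; `train_test_split` and `accuracy` are illustrative helpers written for this example, and exact-match accuracy is an assumption that will not suit every task.

```python
import random


def train_test_split(examples, test_fraction=0.15, seed=42):
    """Shuffle indices with a fixed seed and hold out a fraction for testing."""
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(len(examples) * test_fraction))
    test_idx = set(indices[:n_test])
    train = [ex for i, ex in enumerate(examples) if i not in test_idx]
    test = [ex for i, ex in enumerate(examples) if i in test_idx]
    return train, test


def accuracy(predictions, references):
    """Exact-match accuracy; replace with the metric your task actually needs."""
    return sum(p == r for p, r in zip(predictions, references)) / len(references)


examples = [{"id": i} for i in range(100)]
train, test = train_test_split(examples)
print(len(train), len(test))
```

Run `accuracy` on the held-out set for both the base model and the fine-tuned model, so the comparison is between two numbers measured the same way.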
MMLU delta. Run the full MMLU benchmark on both the base model and your fine-tuned model. MMLU covers 57 academic subjects and is a widely used proxy for general knowledge and reasoning capability. A fine-tuning run that drops MMLU by more than 2–3 percentage points has caused capability regression: you have traded general capability for task specialisation, which is often not what was intended. If your MMLU delta is negative and large, increase regularisation (higher dropout, a lower learning rate, fewer epochs) or improve dataset quality.
Qualitative review of completions. Inspect at least 20–30 model completions on representative prompts from your task, including edge cases and adversarial inputs. Quantitative metrics miss failure modes that are immediately obvious to a human reviewer — hallucinated facts, broken formatting, inconsistent persona, or refusals on legitimate requests. This review should happen before you promote any fine-tuned model to production.
For UK enterprise teams subject to formal model governance requirements, keep a structured evaluation log for every run: dataset version hash, training configuration, test set accuracy, MMLU delta, and the names of team members who performed the qualitative review. This documentation is increasingly expected by enterprise clients and, in regulated sectors, by compliance functions.
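A structured log of this kind can be as simple as one JSON line per run. The sketch below shows one possible shape; the field names, the 3-point regression threshold, and every value in the example record are placeholders chosen for illustration, not a required schema or measured results.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass
class EvalRecord:
    dataset_sha256: str
    training_config: dict
    test_accuracy: float
    base_mmlu: float
    tuned_mmlu: float
    reviewers: list

    def to_json(self, max_mmlu_drop=3.0):
        """Serialise the record, adding the MMLU delta and a regression flag."""
        row = asdict(self)
        delta = self.tuned_mmlu - self.base_mmlu
        row["mmlu_delta"] = round(delta, 2)
        row["capability_regression"] = delta < -max_mmlu_drop
        return json.dumps(row, sort_keys=True)


# Placeholder values for illustration only; they are not measured results.
record = EvalRecord(
    dataset_sha256=hashlib.sha256(b"contents of train.jsonl").hexdigest(),
    training_config={"r": 16, "lora_alpha": 16, "use_dora": True, "learning_rate": 2e-4},
    test_accuracy=0.0,  # fill in from the held-out test set
    base_mmlu=60.0,
    tuned_mmlu=58.5,
    reviewers=["reviewer-a", "reviewer-b"],
)
line = record.to_json()
print(line)
```

Appending each line to a version-controlled file gives a reviewable history of what was trained, on which data, and who signed off.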
LoRA vs QLoRA vs DoRA: when to use which
The comparison table below covers the practical decision points. The short version: DoRA is a free upgrade over vanilla LoRA and should be your default for new fine-tuning work. QLoRA / QDoRA is the path for fitting 13B+ models on a single GPU. Full fine-tuning is reserved for cases where you have the compute, the data, and a compelling reason why adapters will not suffice.
| Method | VRAM (7B model) | Trainable params | Convergence quality | Best use case |
|---|---|---|---|---|
| Full fine-tuning | ~112GB (BF16) | 100% | Highest | Large proprietary datasets, maximising task performance regardless of compute cost |
| LoRA (r=16) | ~18GB (BF16 base + adapters) | ~0.5% | Good — some gap vs full FT | Standard fine-tuning on single GPU; fastest iteration cycle |
| QLoRA (4-bit + r=16) | ~8GB | ~0.5% | Good — slight quality cost from quantisation | Fine-tuning 7B–13B on consumer GPU; 70B on single H100 |
| DoRA (r=16) | ~18GB (BF16 base + adapters) | ~0.51% (tiny magnitude overhead) | Better — matches full FT more closely | Drop-in LoRA upgrade; commonsense reasoning, instruction following |
| QDoRA (4-bit + r=16) | ~8GB | ~0.51% | Best PEFT option for constrained hardware | Best quality per VRAM; recommended default for most builders |
The practical recommendation for most builders in 2026 is QDoRA as the default. You get the memory efficiency of 4-bit quantisation, the quality improvement of DoRA over vanilla LoRA, and a workflow that runs on widely available hardware — from an RTX 4090 at a desk in Chennai to an A100 node on AWS eu-west-2 in London. Full fine-tuning remains relevant for organisations that have invested in dedicated training infrastructure and need to extract every fraction of a percentage point of task performance — but that is a much narrower set of use cases than the community sometimes assumes.
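The VRAM column in the table above follows from bytes-per-parameter arithmetic. The sketch below is a rough estimate under stated assumptions: full fine-tuning with Adam in mixed precision is taken as 16 bytes per parameter (2 for weights, 2 for gradients, 4 for an FP32 master copy, 8 for the two optimiser moments), and the adapter rows count only the frozen base weights. Activations, adapter state, and quantisation metadata are ignored, which is why the table's adapter figures are higher than the weight-only numbers here.

```python
def gigabytes(n_params, bytes_per_param):
    return n_params * bytes_per_param / 1e9


# Assumed per-parameter cost of full fine-tuning with Adam in mixed precision.
FULL_FT_BYTES = 2 + 2 + 4 + 4 + 4  # weights + gradients + FP32 master + two moments

full_ft_7b = gigabytes(7e9, FULL_FT_BYTES)
bf16_weights_7b = gigabytes(7e9, 2)      # frozen base for LoRA / DoRA
nf4_weights_7b = gigabytes(7e9, 0.5)     # frozen base for QLoRA / QDoRA
nf4_weights_70b = gigabytes(70e9, 0.5)   # 70B-class base in 4-bit

print(f"full fine-tuning, 7B:   {full_ft_7b:.0f} GB")
print(f"BF16 base weights, 7B:  {bf16_weights_7b:.1f} GB")
print(f"4-bit base weights, 7B: {nf4_weights_7b:.1f} GB")
print(f"4-bit base weights, 70B: {nf4_weights_70b:.0f} GB")
```

The gap between the weight-only figure and the table entry is the budget left for activations, optimiser state for the adapter, and framework overhead, all of which grow with batch size and sequence length.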
The open-weight model landscape also matters here. As models like those listed in the April 2026 open-weight round-up continue to improve their base capabilities, the marginal value of full fine-tuning over DoRA adapters diminishes further. A strong base model with a clean DoRA adapter is increasingly competitive with a moderately capable model that has been fully fine-tuned. Invest in dataset quality first; reach for full fine-tuning only when you have exhausted the adapter approach.
Indian AI labs working on multilingual and domain-specific models will find DoRA particularly valuable in the context of the IndiaAI Mission's 12 sovereign LLM partner programme. Teams like Sarvam AI, building multilingual stacks across Indian languages, are constrained by data quality and hardware availability far more than by training method expressivity. DoRA's ability to improve convergence without adding infrastructure requirements is a practical advantage in that context.
For UK enterprise teams working on instruction-tuned models for regulated sectors — financial services, healthcare, public sector — the governance-friendliness of the adapter approach is an additional argument. A DoRA adapter is a small, versioned, auditable artefact (typically 50–200MB) that can be attached to and detached from the frozen base model. This makes it straightforward to roll back a problematic fine-tune, maintain separate adapters for different regulatory contexts, and document exactly what changed between model versions. Full fine-tuning produces an entirely new model checkpoint that is much harder to audit for incremental changes.
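The adapter size range quoted above is consistent with simple arithmetic. As a rough check, assuming about 40 million trainable parameters (an approximate figure for rank 16 on every linear layer of a 7B-class model) and ignoring file-format overhead:

```python
def adapter_size_mb(trainable_params, bytes_per_param=2):
    """Approximate on-disk size of an adapter checkpoint in megabytes."""
    return trainable_params * bytes_per_param / 1e6


ASSUMED_TRAINABLE = 40e6  # approximate; depends on rank and target modules

size_16bit = adapter_size_mb(ASSUMED_TRAINABLE)                     # saved in 16-bit
size_32bit = adapter_size_mb(ASSUMED_TRAINABLE, bytes_per_param=4)  # saved in 32-bit

print(f"16-bit adapter: about {size_16bit:.0f} MB")
print(f"32-bit adapter: about {size_32bit:.0f} MB")
```

Higher ranks or larger base models scale the figure proportionally, so a rank-32 adapter on the same model would be roughly twice as large.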
The Llama 4 deployment guide covers the serving infrastructure side in detail — once you have a DoRA-fine-tuned adapter, the serving pipeline is identical to any other LoRA deployment, which is a further reason to adopt DoRA without hesitation.