What GPU do I need to fine-tune a 7B or 8B model?

With QLoRA, an RTX 4070 Ti with 12 GB of VRAM handles a 7B-8B model at modest sequence lengths. An RTX 4090 with 24 GB is the cost-and-speed sweet spot for anything under 34B in 2026. Cloud rental on RunPod, Lambda or Vast.ai removes the need to buy hardware at all.

How much does a real fine-tune cost in 2026?

Per 2026 community benchmarks, QLoRA on a single A100 80GB fine-tunes Llama 3 8B on roughly 50,000 examples in about six hours for around USD 12. On a single H100 the same job runs eight to twelve hours and costs USD 10 to 16. A smaller 1,000-example run finishes in well under an hour for a few cents.

When should I not fine-tune at all?

If your problem is about giving the model fresh facts or private documents, use retrieval-augmented generation. If you can solve it by improving the prompt or adding a few in-context examples, do that first. Fine-tune only when you need to change the model's behaviour, format or tone in a way that prompting cannot reach reliably.

How to Fine-Tune an LLM on a Budget with LoRA and QLoRA

What rank should I use for LoRA?

Start with r=16. It is the recommended default for instruction fine-tuning, style adaptation and most domain specialisation. The 2026 consensus, per Unsloth's documentation, is to set alpha equal to r, so alpha = r = 16. Only raise the rank if validation loss plateaus high and you have confirmed your data is clean.

What you need to know before you start

QLoRA put fine-tuning on consumer hardware. Quantise the base model to 4-bit, train small adapters in higher precision, and a 7B-8B model fits inside a 12 GB GPU you can buy second-hand.
It is genuinely cheap. Per 2026 community benchmarks, a full Llama 3 8B fine-tune on ~50,000 examples costs roughly USD 12 of rented compute. A focused 1,000-example run costs a few cents.
Data quality beats data volume. Around 1,000 hand-checked examples often outperform 100,000 noisy ones. Curation is the real work.
Sometimes you should not fine-tune. If you need fresh facts, reach for retrieval-augmented generation first. Fine-tuning changes behaviour, not knowledge.

For most of the past few years, "fine-tune your own model" meant a cluster of GPUs, a six-figure cloud bill and an MLOps team. That is no longer true. A solo builder in Pune or a two-person studio in Bristol can now take an open-weight model, adapt it to a specific task, and ship the result the same week — using one GPU and a dataset they assembled by hand. This guide is the practical version: what the techniques are, what hardware you need, what it costs, and what to actually do this weekend.

What LoRA and QLoRA are, and why they made fine-tuning accessible

Full fine-tuning updates every weight in the model. For an 8B-parameter model that means holding the weights, their gradients and the optimiser state in memory at once — typically four to six times the raw model size in VRAM. That is the bill that priced solo builders out.

LoRA, short for Low-Rank Adaptation, changes the deal. Instead of updating the original weight matrices, it freezes them and trains a pair of small, low-rank matrices alongside each one. In the forward pass the adapter output is added to the frozen layer's output. You are training a small fraction of the parameter count, typically from a fraction of a per cent up to one or two per cent depending on the rank and which layers you target, so gradients and optimiser state shrink dramatically. The base model still sits in memory, but it never needs gradients. The artefact you produce is a small adapter file, often tens of megabytes, that you can ship, version and swap independently of the base model.
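To make the low-rank idea concrete, here is a minimal pure-Python sketch. The 4096 x 4096 layer size is an assumed illustration rather than a figure for any particular model, the helper names are hypothetical, and the scaling factor that real implementations apply to the adapter path is left out for brevity.

```python
def lora_param_counts(d_in: int, d_out: int, r: int) -> tuple[int, int]:
    """Return (frozen weights, trainable adapter weights) for one linear layer."""
    full = d_in * d_out              # the frozen matrix W, shape d_out x d_in
    adapter = r * d_in + d_out * r   # A is r x d_in, B is d_out x r
    return full, adapter


def matvec(m: list[list[float]], v: list[float]) -> list[float]:
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(W, A, B, x):
    """Frozen path W @ x plus the low-rank path B @ (A @ x)."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + d for b, d in zip(base, delta)]


full, adapter = lora_param_counts(d_in=4096, d_out=4096, r=16)
print(f"frozen: {full:,}  trainable: {adapter:,}  ratio: {adapter / full:.2%}")

# Tiny 2x2 example with rank 1. B starts at zero in LoRA, so an untrained
# adapter leaves the base layer's output unchanged.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
x = [2.0, 3.0]
untrained = lora_forward(W, A, [[0.0], [0.0]], x)
trained = lora_forward(W, A, [[1.0], [2.0]], x)
print(untrained, trained)
```

For this single assumed layer the adapter holds under one per cent of the weights. The share across a whole model depends on the rank and on how many layers carry adapters.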

QLoRA goes one step further. It loads the frozen base model in 4-bit quantised form, cutting the largest single memory cost roughly fourfold, while the trainable adapters stay in a higher precision such as bf16. The clever part is that quantisation error in the frozen base is tolerable precisely because the adapters are free to compensate during training. The result is a workflow where the base model is cheap to hold and the trainable part is small — which is exactly why a 12 GB consumer card can now do work that once needed a data-centre GPU.

Pro tip

Keep the base model and the adapter as separate files for as long as you can. One frozen base plus several small adapters lets you serve many fine-tuned behaviours from a single GPU, and lets you roll back a bad fine-tune by swapping a file rather than redeploying gigabytes.

The VRAM maths: which GPU fits which model

This is the number that decides whether you can do the job on hardware you own. Full fine-tuning of a 7B-8B model realistically wants 80 GB or more of VRAM. LoRA brings that down to roughly 16-24 GB, because you no longer carry gradients and optimiser state for the full weight set. QLoRA pushes it further still — roughly 8-12 GB for the same 7B-8B model — by quantising the base model to 4-bit and training the adapters in higher precision.

That 8-12 GB figure is the one that matters. It makes an RTX 4070 Ti, a 12 GB card that sells second-hand around the USD 300 mark, a viable fine-tuning machine for 7B-8B models at modest sequence lengths. Step up to an RTX 4090 with 24 GB and you get comfortable headroom, longer context windows and faster throughput. For anything under 34B in 2026, QLoRA on an RTX 4090 is the cost-and-speed sweet spot — enough memory to avoid constant out-of-memory errors, fast enough that iteration feels interactive.

Method	VRAM (7B-8B)	Relative cost	Relative speed	Best for
Full fine-tune	~80 GB+	Highest	Slowest per epoch	Teams with data-centre GPUs and a deep behaviour change
LoRA	~16-24 GB	Moderate	Fast	A single 24 GB card, full-precision base, longer context
QLoRA	~8-12 GB	Lowest	Fast (slightly below LoRA)	Consumer 12 GB cards, solo builders, weekend iteration
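The figures in the table can be sanity-checked with rough arithmetic. The sketch below is a back-of-envelope estimate, not a profiler: it assumes a round 8 billion parameters, counts weights only, and reuses the four-to-six-times rule of thumb quoted earlier for full fine-tuning. The function name is hypothetical.

```python
def weight_gb(n_params: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 8e9  # assumption: a round 8 billion parameters

bf16_weights = weight_gb(N_PARAMS, 16)  # 16-bit base model
nf4_weights = weight_gb(N_PARAMS, 4)    # 4-bit quantised base model

# Full fine-tuning: rule of thumb of 4-6x the raw model size.
full_low, full_high = 4 * bf16_weights, 6 * bf16_weights

print(f"bf16 weights:   {bf16_weights:.0f} GB")
print(f"4-bit weights:  {nf4_weights:.0f} GB")
print(f"full fine-tune: {full_low:.0f}-{full_high:.0f} GB")
```

The gap between the 4-bit weights and the 8-12 GB quoted for QLoRA is taken up by activations, adapter gradients and optimiser state, quantisation metadata and framework overhead, none of which this sketch models.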

If you do not own a 12 GB card, you do not need to buy one. Cloud GPU rental on RunPod, Lambda and Vast.ai puts an A100 or H100 in front of you for the length of one training run and nothing more. A builder in Manchester and a builder in Hyderabad pay the same hourly rate, and neither has to find shelf space for a graphics card. Rent for the run, save the adapter, shut the instance down.

Real costs: what a fine-tune actually bills you

Concrete numbers help more than abstractions. Per 2026 community benchmarks, QLoRA on a single A100 80GB fine-tunes Llama 3 8B on around 50,000 examples in roughly six hours, for roughly USD 12 of rented compute. The same job on a single H100 runs about eight to twelve hours and costs in the region of USD 10 to 16, depending on the provider and the spot-versus-on-demand price you catch.

Those figures are for a substantial 50,000-example dataset. The more common starting point — a focused 1,000-example run — finishes in well under an hour and costs a few cents. The economics are no longer a barrier; the dataset is. For builders watching their inference bill rather than their training bill, our comparison of inference platforms across DeepInfra, Together, Fireworks and Groq covers where a fine-tuned open-weight model is cheapest to serve once it is trained.

Watch out

Spot and interruptible instances are cheap, but they can be reclaimed mid-run. If your training job has no checkpointing, an eviction at hour five of a six-hour run costs you the whole job. Always write checkpoints to persistent storage and confirm you can resume from one before you trust a spot instance with a long fine-tune.
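As a hedged sketch of what checkpointing can look like with the Hugging Face trainer stack, the keyword arguments below are ones that trl's SFTConfig inherits from transformers.TrainingArguments. The output path and the step counts are placeholder assumptions; pick values that suit your run and your provider's persistent volume.

```python
# Keyword arguments intended for trl's SFTConfig. The path and numbers are
# placeholders, not recommendations from the benchmarks quoted above.
checkpoint_args = {
    "output_dir": "/workspace/checkpoints",  # hypothetical persistent volume
    "save_strategy": "steps",
    "save_steps": 100,       # write a checkpoint every 100 optimiser steps
    "save_total_limit": 2,   # keep only the two most recent checkpoints
}

# Usage, assuming a trainer built with SFTConfig(**checkpoint_args):
#   trainer.train()                             # first launch
#   trainer.train(resume_from_checkpoint=True)  # relaunch after an eviction
print(checkpoint_args)
```

Passing resume_from_checkpoint=True asks the trainer to pick up the latest checkpoint in output_dir, so it is worth interrupting a short test run on purpose to confirm the resume path works before committing to a long job.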

Key hyperparameters: start simple

Builders new to fine-tuning often lose days tuning knobs that barely move the result. Resist that. The rank, written r, controls the size of the LoRA adapters. A value of r=16 is the recommended default for instruction fine-tuning, style adaptation and most domain specialisation — it is large enough to capture a meaningful behaviour change and small enough to train quickly and resist overfitting on a modest dataset.

The companion parameter is alpha, which scales the adapter's contribution. The 2026 consensus, per Unsloth's documentation, is to start with alpha equal to r — so alpha = r = 16. That keeps the scaling neutral and predictable. Only depart from these defaults once you have a clean dataset and a validation curve that tells you the model is under-fitting. Raising the rank before you have ruled out data problems just trains a bigger adapter on the same noise.
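The reason alpha = r is "neutral" is easiest to see in the arithmetic. In standard LoRA the adapter output is multiplied by alpha / r before it is added to the frozen layer's output. The helper name below is hypothetical, and the sketch ignores variants such as rank-stabilised LoRA that use a different scaling rule.

```python
def lora_scale(alpha: float, r: int) -> float:
    """Multiplier applied to the adapter output in standard LoRA."""
    return alpha / r


for alpha, r in [(16, 16), (32, 16), (16, 64)]:
    print(f"alpha={alpha:>2}  r={r:>2}  scale={lora_scale(alpha, r):.2f}")
```

With alpha fixed, raising the rank shrinks the multiplier, which is one reason to move alpha in step with r if you do change the rank.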

The 2026 toolchain

The software side has settled into a stable, well-documented stack. You want Python 3.11 or newer, PyTorch 2.5 or newer, and a CUDA 12.x driver. On top of that sits the Hugging Face ecosystem: transformers for the models, datasets for loading and shaping data, peft for the LoRA and QLoRA adapter logic, and trl for the training loop itself.

Three higher-level tools wrap that core for different working styles. Unsloth is the choice for speed on consumer hardware — it patches the training kernels to run faster and use less memory, which matters most on a 12 GB card. Axolotl is the choice if you prefer a declarative, YAML-driven pipeline you can version and hand to a teammate. TRL itself is what you reach for when you need advanced training objectives beyond plain supervised fine-tuning, such as preference optimisation. A minimal QLoRA configuration with peft and trl looks like this:

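from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

# Training data. "train.jsonl" is a placeholder path; point it at your own
# curated examples in a format SFTTrainer accepts.
dataset = load_dataset("json", data_files="train.jsonl", split="train")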

# 4-bit quantised base model — the QLoRA part
quant = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="bfloat16",
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    quantization_config=quant,
    device_map="auto",
)

# LoRA adapters — start with r = alpha = 16
peft_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,          # ~1,000 curated examples
    peft_config=peft_config,
    args=SFTConfig(
        output_dir="./adapter",
        num_train_epochs=3,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        bf16=True,
    ),
)
trainer.train()
trainer.save_model("./adapter")     # tens of MB, not gigabytes

That is the whole shape of it. The model definition, the adapter config, the trainer, train, save. Everything else is data preparation and evaluation.
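It helps to know how many optimiser steps the settings above imply before you launch. The sketch below is plain arithmetic on the batch size, accumulation steps and epoch count from the configuration; the example count and the single GPU are assumptions.

```python
import math

n_examples = 1_000    # assumption: the full ~1,000-example set is used for training
per_device_batch = 2  # per_device_train_batch_size
grad_accum = 4        # gradient_accumulation_steps
epochs = 3            # num_train_epochs
n_gpus = 1            # assumption: a single GPU

effective_batch = per_device_batch * grad_accum * n_gpus
steps_per_epoch = math.ceil(n_examples / effective_batch)
total_steps = steps_per_epoch * epochs

print(f"effective batch size: {effective_batch}")
print(f"optimiser steps: {steps_per_epoch} per epoch, {total_steps} in total")
```

A run of a few hundred steps is short enough that checkpoint and evaluation intervals should be set in tens of steps rather than thousands. The trainer's own rounding may differ by a step when the dataset does not divide evenly.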

Data quality over quantity

This is the part that decides whether your fine-tune is worth shipping. Around 1,000 hand-curated examples often beat 100,000 noisy ones. A fine-tune learns the patterns in your data with uncomfortable fidelity — including the inconsistent formatting, the contradictory answers and the one in fifty examples where someone pasted the wrong response. Scale does not wash that out; it bakes it in.

Spend your effort on the dataset. Decide on one exact output format and enforce it on every example. Remove duplicates and near-duplicates. Read a random sample end to end and fix anything you would not be happy for the model to imitate. A regional-language support team in Bengaluru and a legal-summarisation studio in Edinburgh will both get more from 800 examples they have personally checked than from a scraped set ten times the size. Hold back a slice — perhaps 10 per cent — as a validation set you never train on, so your loss curve means something.
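The mechanical parts of that curation pass can be sketched with the standard library alone. The "prompt" and "response" field names, the helper names and the toy data are all assumptions for illustration. The duplicate check only catches copies that differ in case or whitespace, so paraphrased near-duplicates still need fuzzy matching or a human read.

```python
import random


def normalise(text: str) -> str:
    """Collapse whitespace and case so trivially different copies compare equal."""
    return " ".join(text.lower().split())


def dedupe(examples: list[dict]) -> list[dict]:
    """Keep the first occurrence of each (prompt, response) pair."""
    seen, unique = set(), []
    for ex in examples:
        key = (normalise(ex["prompt"]), normalise(ex["response"]))
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique


def split(examples: list[dict], val_fraction: float = 0.1, seed: int = 0):
    """Shuffle with a fixed seed, then hold back a validation slice."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


# Toy data: 20 distinct examples plus one copy that differs only in case and spacing.
raw = [{"prompt": f"Question {i}", "response": f"Answer {i}"} for i in range(20)]
raw.append({"prompt": "question  3", "response": "ANSWER 3"})

clean = dedupe(raw)
train, val = split(clean)
print(len(raw), len(clean), len(train), len(val))
```

Fixing the seed keeps the validation slice identical between runs, so a change in validation loss reflects the training change rather than a reshuffle.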

When not to fine-tune

The most cost-effective fine-tune is often the one you decide not to run. Fine-tuning changes how a model behaves — its tone, its format, its handling of a task type. It is a poor and expensive way to give a model new facts, because facts move and a fine-tuned model freezes them in place.

Work through this order before you train anything:

Better prompting first. If a clearer instruction or a stricter system prompt fixes the behaviour, you are done. No GPU, no dataset, no training run.
Few-shot examples next. Two or three worked examples in the prompt often teach a format more cheaply than a fine-tune, and you can change them in seconds.
RAG when the problem is knowledge. If the model needs your private documents or up-to-date information, retrieval-augmented generation is the right tool. It keeps facts editable and citable. Once retrieval works, how you serve the model matters — our comparison of vLLM, SGLang and TensorRT-LLM serving engines covers that decision.
Fine-tune when behaviour is the gap. Reach for LoRA or QLoRA when prompting and retrieval genuinely cannot reach the consistency, tone or format you need — for example a model that must always answer in one strict structured shape, or in a specific regional register.

Your first fine-tune this weekend: a checklist

If you want to go from reading this to a trained adapter in one afternoon, follow this sequence.

Confirm fine-tuning is the right tool. Run the prompting and RAG check above. If behaviour, not knowledge, is the gap, continue.
Pick a base model. A well-supported 7B-8B open-weight instruct model is the safe starting point.
Curate ~1,000 examples. One consistent output format, no duplicates, every example read by a human. Hold back 10 per cent for validation.
Choose your compute. A 12 GB consumer card if you own one, or rent an A100 or H100 on RunPod, Lambda or Vast.ai for the run.
Set up the stack. Python 3.11+, PyTorch 2.5+, CUDA 12.x, plus transformers, datasets, peft, trl, and Unsloth or Axolotl for the wrapper.
Train with QLoRA at r = alpha = 16. Three epochs, checkpoint to persistent storage, watch the validation loss.
Evaluate, then iterate. Test against held-back examples and real prompts. If the result is weak, fix the data before you touch the hyperparameters.
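When the behaviour you are fine-tuning for is a strict output format, the evaluation step can start with something as simple as a compliance rate. The sketch below assumes the target format is a JSON object with two keys; the schema, the function names and the sample strings are hypothetical stand-ins for outputs you would collect by running the fine-tuned model on the held-back prompts.

```python
import json

REQUIRED_KEYS = {"summary", "priority"}  # hypothetical output schema


def follows_format(output: str) -> bool:
    """True if the output is a JSON object with exactly the required keys."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and set(parsed) == REQUIRED_KEYS


def compliance_rate(outputs: list[str]) -> float:
    """Share of outputs that follow the format."""
    return sum(follows_format(o) for o in outputs) / len(outputs)


# Stand-in strings, not real model outputs.
outputs = [
    '{"summary": "Refund issued", "priority": "low"}',
    '{"summary": "Login fails"}',
    'Sure! Here is the JSON you asked for...',
    '{"summary": "Outage", "priority": "high"}',
]
rate = compliance_rate(outputs)
print(f"format compliance: {rate:.0%}")
```

Running the same check on the base model with your best prompt gives a baseline, which shows whether the fine-tune moved the number at all.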

If you train something useful, publish what you learned. The builders who document a clean fine-tuning run get found — add your profile and the people hiring for exactly this skill can reach you.