LLMs Deep Dive · Lesson 22 of 24
Interview: How Would You Fine-Tune an LLM?
Q: What is the difference between pretraining and fine-tuning?
Pretraining: self-supervised training on massive text corpora (2T+ tokens) with next-token prediction. No human labels. The model learns language, factual knowledge, reasoning patterns, and code. Takes thousands of GPU-hours for small models, millions for large ones. Produces the base model.
Fine-tuning: supervised or preference-based training on a smaller, high-quality dataset. May be:
- SFT (supervised fine-tuning): learn to follow instructions
- DPO (direct preference optimisation): learn from preference pairs
- Domain adaptation: continue pretraining on medical/legal/code data
Fine-tuning starts from the pretrained checkpoint and runs for fewer steps (hours to days vs months).
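To make "learn from preference pairs" concrete, here is a minimal sketch of the DPO loss for a single pair. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions. The inputs are the summed log-probabilities of each full response under the model being trained and under a frozen reference copy of the starting checkpoint.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given summed response log-probabilities."""
    # How much more (or less) likely each response became relative to the reference.
    chosen_gain = policy_chosen - ref_chosen
    rejected_gain = policy_rejected - ref_rejected
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)), rewritten as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))

# Policy identical to the reference: no preference has been learned yet.
start = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has raised the chosen response and lowered the rejected one.
better = dpo_loss(-9.0, -13.0, -10.0, -12.0)
```

At the start of training the policy equals the reference, so the margin is zero and the loss is log 2. The loss falls as the model shifts probability toward the chosen response relative to the rejected one.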
Q: What is LoRA and when would you use it?
LoRA (Low-Rank Adaptation) freezes the pretrained model weights and adds small trainable matrices to attention layers:
Original: W ∈ ℝ^(d × k) (frozen)
LoRA: W' = W + BA where B ∈ ℝ^(d × r), A ∈ ℝ^(r × k), r << min(d, k)
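Parameter ratio: r(d + k)/(dk)
For d=k=4096, r=16: 16 × (4096 + 4096) / (4096 × 4096) ≈ 0.78% of the original matrix's parameters
LoRA is used when full fine-tuning is too expensive but you need task-specific adaptation. For a 7B model, fp16 weights, gradients, and Adam optimizer states come to roughly 56GB at 8 bytes per parameter, before activations (and more if the optimizer states are kept in fp32), which typically means multiple A100s. QLoRA (quantised LoRA) quantises the frozen base model to 4-bit, enabling 7B fine-tuning on a single consumer GPU (24GB).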
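The low-rank update can be sketched in a few lines of plain Python. This is a toy illustration, not a library API: `matmul`, `lora_param_ratio`, and `lora_forward` are hypothetical helper names, and the matrices are tiny lists of rows. The adapter is counted as r(d + k) trainable parameters against d·k frozen ones, and B starts at zero (as in the LoRA paper) so the adapted layer initially matches the base layer.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_param_ratio(d, k, r):
    """Trainable adapter parameters divided by frozen matrix parameters."""
    return r * (d + k) / (d * k)

def lora_forward(W, B, A, x):
    """Compute (W + BA) x without ever forming the d x k product BA."""
    base = matmul(W, x)                 # frozen path: W x
    update = matmul(B, matmul(A, x))    # adapter path: B (A x)
    return [[b + u for b, u in zip(b_row, u_row)] for b_row, u_row in zip(base, update)]

d, k, r = 4, 3, 2
rng = random.Random(0)
W = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(d)]     # frozen weights
A = [[rng.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]  # trainable, random init
B = [[0.0] * r for _ in range(d)]                               # trainable, zero init
x = [[1.0], [2.0], [3.0]]                                       # one input column

ratio = lora_param_ratio(4096, 4096, 16)   # 0.0078125, i.e. about 0.78%
y = lora_forward(W, B, A, x)               # equals W x while B is still zero
```

Only A and B would receive gradients during training; W is never updated.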
Q: What is instruction tuning?
Instruction tuning trains the model on (instruction, response) pairs so that it follows human directives. FLAN (2021) was one of the first large-scale instruction-tuned models. A typical chat-style training example looks like:
System: "You are a helpful medical assistant."
User: "What is the mechanism of Warfarin?"
Assistant: "Warfarin inhibits Vitamin K epoxide reductase..."
Instruction tuning teaches the model to respond to instructions rather than continuing text. LLaMA 2-Chat, Mistral-Instruct, and Zephyr are all instruction-tuned variants of base models.
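As a rough sketch of how such an example might be prepared for training, the snippet below joins the three roles into one string and builds a mask so that only the assistant's response contributes to the loss. Masking the prompt is a common convention rather than something this lesson prescribes. The helper names are hypothetical, the plain "System:/User:/Assistant:" template is a stand-in for a model's real chat template, and whitespace splitting is a toy substitute for a real tokenizer.

```python
def build_sft_example(system, user, assistant):
    """Return (prompt, full_text) for one chat-style training example."""
    prompt = f"System: {system}\nUser: {user}\nAssistant: "
    return prompt, prompt + assistant

def response_loss_mask(prompt, full_text):
    """1 for response tokens (trained on), 0 for prompt tokens (ignored)."""
    n_prompt = len(prompt.split())      # toy tokenizer: split on whitespace
    n_total = len(full_text.split())
    return [0] * n_prompt + [1] * (n_total - n_prompt)

prompt, full_text = build_sft_example(
    "You are a helpful medical assistant.",
    "What is the mechanism of Warfarin?",
    "Warfarin inhibits Vitamin K epoxide reductase...",
)
mask = response_loss_mask(prompt, full_text)
```

With this mask, the model is trained to produce the response given the instruction, not to reproduce the instruction itself.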
Q: How does the training loss change during LLM training?
Typical training dynamics:
First 1-5% of training (warmup):
LR ramps from 0 to peak
Loss decreases rapidly (model learns basic patterns)
Main training:
Loss decreases smoothly following a power law
Occasional loss spikes (mitigated, though not always eliminated, by gradient clipping)
Cooldown (last 5-10% of steps):
LR decays from peak toward zero (e.g., cosine or linear decay)
Loss decreases more steeply than at peak LR, often comparable to 10× more steps at peak LR
A disproportionate share of the quality gain arrives in this final stage
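The warmup, main, and cooldown phases described above can be sketched as a single learning-rate function. This is an illustrative schedule under stated assumptions: the name `lr_at`, the 2% warmup and 10% cooldown defaults, the constant LR during main training, and the cosine-shaped cooldown are all choices made for the example, not values fixed by the lesson.

```python
import math

def lr_at(step, total_steps, peak_lr, warmup_frac=0.02, cooldown_frac=0.10):
    """Learning rate at a given step: linear warmup, flat middle, cosine cooldown."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    cooldown_start = total_steps - int(total_steps * cooldown_frac)
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 to the peak LR.
        return peak_lr * step / warmup_steps
    if step < cooldown_start:
        # Main training: hold the peak LR (an assumption of this sketch).
        return peak_lr
    # Cooldown: cosine decay from the peak LR down to 0.
    progress = (step - cooldown_start) / max(1, total_steps - cooldown_start)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

schedule = [lr_at(s, total_steps=1000, peak_lr=3e-4) for s in range(1001)]
```

Plotting `schedule` shows the short ramp, the long plateau, and the final decay during which the loss typically drops fastest.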
Model merging:
Take a few checkpoints from the end of training
Average their weights (a cheap approximation to ensembling)
Often improves quality ~1-3% with no added inference cost
Q: What is catastrophic forgetting and how is it handled?
When a pretrained model is fine-tuned on new data, it may lose previously learned knowledge, a phenomenon called catastrophic forgetting. Gradient updates that adapt the FFN and attention weights to the new task can overwrite the weights that encoded general knowledge.
Mitigations:
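- Low learning rate: fine-tune at 1/10th of the pretraining LR or lower, so weights stay close to the pretrained checkpoint
- LoRA: the base weights stay frozen and only the low-rank adapters change, which limits (but does not fully prevent) forgetting; removing the adapter restores the original model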
- Replay: include some original pretraining data in the fine-tuning mix
- Elastic weight consolidation (EWC): penalise changes to parameters important for previous tasks
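Of these mitigations, EWC is the easiest to show as a formula: it adds a quadratic penalty, weighted per parameter by an importance estimate (usually the diagonal of the Fisher information), to the fine-tuning loss. The sketch below uses flat Python lists in place of real model tensors. The function names, the `lam` strength, and the example numbers are illustrative assumptions.

```python
def ewc_penalty(params, old_params, fisher, lam):
    """(lam / 2) * sum_i F_i * (theta_i - theta_old_i)^2."""
    return 0.5 * lam * sum(
        f * (p - p0) ** 2 for p, p0, f in zip(params, old_params, fisher)
    )

def total_loss(task_loss, params, old_params, fisher, lam=1.0):
    """Fine-tuning loss plus the EWC penalty that anchors important weights."""
    return task_loss + ewc_penalty(params, old_params, fisher, lam)

old_params = [1.0, -2.0]   # weights after pretraining
fisher = [10.0, 0.1]       # first weight matters far more for the old task

# Moving the important weight by 0.5 costs much more than moving the other one.
moved_important = ewc_penalty([1.5, -2.0], old_params, fisher, lam=1.0)
moved_unimportant = ewc_penalty([1.0, -1.5], old_params, fisher, lam=1.0)
```

The same shift of 0.5 is penalised 100 times more heavily on the weight with the larger Fisher value, which steers fine-tuning toward changing parameters the original task depends on least.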
Q: What are the key hyperparameters for LLM fine-tuning?
Learning rate:
Typically 1e-5 to 1e-4 for SFT (10-100× lower than pretraining)
LoRA: slightly higher LR acceptable (1e-4 to 3e-4)
Batch size:
Effective batch = micro_batch × gradient_accumulation_steps × gpus
Target: 16-64 for SFT (smaller than pretraining)
Epochs:
SFT: 1-3 epochs on instruction data (more → overfitting)
DPO: 1-3 epochs on preference pairs
Max sequence length:
Set to match your deployment context (e.g., 4096)
Longer = more memory; pack sequences to avoid padding waste
LoRA rank r:
r=8: good for most tasks
r=64: better for significant domain shifts
r=256+: the adapter becomes large (about 12.5% of each adapted matrix at d=k=4096), so the savings over full fine-tuning shrink
Q: How do you prevent overfitting during fine-tuning?
1. Early stopping: monitor validation loss; stop when it increases
2. Low learning rate: limits how far parameters move from the pretrained weights
3. LoRA: structural regularisation (only the low-rank adapters are trained, so far fewer parameters can overfit)
4. Dropout on LoRA adapters (p=0.1 is common)
5. Small number of epochs (1-3)
6. Mix fine-tuning data with pretraining data (prevents forgetting)
7. High-quality data: fewer examples of better quality > more noisy examples
Interview Answer Template
"LLM training has three phases: pretraining (next-token prediction on 1-15T tokens — learns language, knowledge, reasoning), supervised fine-tuning (instruction pairs — teaches the model to respond helpfully), and alignment (DPO or RLHF — teaches the model to be safe and preferred). LoRA is the standard efficient fine-tuning technique: freeze the base model, add small trainable low-rank adapter matrices. QLoRA enables fine-tuning a 7B model on a single 24GB GPU by quantising the frozen base to 4-bit. Key hyperparameters are learning rate (10-100× lower than pretraining), epochs (1-3 to avoid overfitting), and LoRA rank (8-64 depending on task complexity)."