Learnixo

LLMs Deep Dive · Lesson 22 of 24

Interview: How Would You Fine-Tune an LLM?

Q: What is the difference between pretraining and fine-tuning?

Pretraining: self-supervised training on massive text corpora (2T+ tokens) with next-token prediction. No human labels. The model learns language, factual knowledge, reasoning patterns, and code. Takes thousands of GPU-hours for small models, millions for large ones. Produces the base model.

Fine-tuning: supervised or preference-based training on a smaller, high-quality dataset. May be:

  • SFT (supervised fine-tuning): learn to follow instructions
  • DPO: learn from preference pairs
  • Domain adaptation: continue pretraining on medical/legal/code data

Fine-tuning starts from the pretrained checkpoint and runs for fewer steps (hours to days vs months).


Q: What is LoRA and when would you use it?

LoRA (Low-Rank Adaptation) freezes the pretrained model weights and adds small trainable matrices to attention layers:

Original: W ∈ ℝ^(d × k)  (frozen)
LoRA:     W' = W + BA  where B ∈ ℝ^(d × r), A ∈ ℝ^(r × k), r << min(d, k)

Parameter ratio: 2r/(d+k)
  For d=k=4096, r=16: 2×16/(4096+4096) = 0.39% of original parameters

LoRA is used when full fine-tuning is too expensive (7B model × fp16 × optimizer states ≈ 56GB, requiring multiple A100s) but you need task-specific adaptation. QLoRA (quantised LoRA) quantises the frozen base model to 4-bit, enabling 7B fine-tuning on a single consumer GPU (24GB).


Q: What is instruction tuning?

Instruction tuning trains the model on (instruction, response) pairs to follow human directives. FLAN was the first large-scale instruction-tuned model. The format is:

System:    "You are a helpful medical assistant."
User:      "What is the mechanism of Warfarin?"
Assistant: "Warfarin inhibits Vitamin K epoxide reductase..."

Instruction tuning teaches the model to respond to instructions rather than continuing text. LLaMA 2-Chat, Mistral-Instruct, and Zephyr are all instruction-tuned variants of base models.


Q: How does the training loss change during LLM training?

Typical training dynamics:

First 1-5% of training (warmup):
  LR ramps from 0 to peak
  Loss decreases rapidly (model learns basic patterns)

Main training:
  Loss decreases smoothly following a power law
  Occasional loss spikes (fixed by gradient clipping)

Cooldown (last 5-10% of steps):
  LR decays via cosine schedule
  Loss decreases more steeply — often comparable to 10× more steps at peak LR
  Most quality gain in the final training stages

Model merging:
  Take a few checkpoints from the end of training
  Average their weights (simple ensembling)
  Often improves quality ~1-3% with zero inference cost

Q: What is catastrophic forgetting and how is it handled?

When a pretrained model is fine-tuned on new data, it may forget previously learned knowledge — a phenomenon called catastrophic forgetting. The FFN and attention weights adapted to the new task may overwrite general knowledge.

Mitigations:

  • Low learning rate: fine-tune at 1/10th the pretraining LR
  • LoRA: frozen base weights can't forget; only low-rank adapters change
  • Replay: include some original pretraining data in the fine-tuning mix
  • Elastic weight consolidation (EWC): penalise changes to parameters important for previous tasks

Q: What are the key hyperparameters for LLM fine-tuning?

Learning rate:
  Typically 1e-5 to 1e-4 for SFT (10-100× lower than pretraining)
  LoRA: slightly higher LR acceptable (1e-4 to 3e-4)

Batch size:
  Effective batch = micro_batch × gradient_accumulation_steps × gpus
  Target: 16-64 for SFT (smaller than pretraining)

Epochs:
  SFT: 1-3 epochs on instruction data (more → overfitting)
  DPO: 1-3 epochs on preference pairs

Max sequence length:
  Set to match your deployment context (e.g., 4096)
  Longer = more memory; pack sequences to avoid padding waste

LoRA rank r:
  r=8: good for most tasks
  r=64: better for significant domain shifts
  r=256+: approaching full fine-tuning parameter count

Q: How do you prevent overfitting during fine-tuning?

1. Early stopping: monitor validation loss; stop when it increases
2. Low learning rate: reduce parameter magnitude changes
3. LoRA: structural regularisation (can't overfit all weights)
4. Dropout on LoRA adapters (r=0.1 is common)
5. Small number of epochs (1-3)
6. Mix fine-tuning data with pretraining data (prevents forgetting)
7. High-quality data: fewer examples of better quality > more noisy examples

Interview Answer Template

"LLM training has three phases: pretraining (next-token prediction on 1-15T tokens — learns language, knowledge, reasoning), supervised fine-tuning (instruction pairs — teaches the model to respond helpfully), and alignment (DPO or RLHF — teaches the model to be safe and preferred). LoRA is the standard efficient fine-tuning technique: freeze the base model, add small trainable low-rank adapter matrices. QLoRA enables fine-tuning a 7B model on a single 24GB GPU by quantising the frozen base to 4-bit. Key hyperparameters are learning rate (10-100× lower than pretraining), epochs (1-3 to avoid overfitting), and LoRA rank (8-64 depending on task complexity)."