Interview: Fine-Tuning LLMs
12 Q&A pairs on fine-tuning: LoRA vs full fine-tuning, rank selection, data requirements, DPO, catastrophic forgetting, evaluation, and production deployment.
Q1: What is the difference between full fine-tuning and LoRA?
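A: Full fine-tuning updates all model parameters. For a 7B model, every step computes gradients and optimizer state for all 7 billion weights, which typically requires 80–160 GB of GPU memory (mixed-precision Adam alone needs about 16 bytes per parameter, before activations).
LoRA (Low-Rank Adaptation) freezes the original weights and adds small low-rank decomposition matrices alongside selected layers. Only these adapter matrices are trained, roughly 0.1–1% of the parameters. Each adapted weight of shape d_out × d_in gains r × (d_in + d_out) trainable parameters, so a 7B model fine-tuned with LoRA r=16 trains on the order of 20M parameters instead of 7B (the exact count depends on which modules are targeted) and can fit on a single 24 GB GPU with a modest batch size.
At inference, LoRA adapters can be merged into the base weights, so the merged model adds no latency over the base model. This is a major reason LoRA is the standard for production fine-tuning.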
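Why merging adds no overhead can be checked numerically. The sketch below assumes the usual LoRA formulation, in which the adapted layer computes W·x + (alpha/r)·B·(A·x); the toy matrices and the helper names (`matmul`, `matvec`, `merge_lora`) are illustrative and do not come from any library.

```python
# Minimal sketch: merging a LoRA adapter into the base weight.
# Shapes and values are toy assumptions chosen for readability.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def matvec(m, v):
    """Multiply a matrix by a vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]

def merge_lora(W, A, B, alpha, r):
    """Return the merged weight W + (alpha / r) * B @ A."""
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, BA)]

# Base weight: 3 outputs x 2 inputs. Adapter rank r = 1.
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
A = [[0.5, -1.0]]            # r x d_in
B = [[2.0], [0.0], [1.0]]    # d_out x r
alpha, r = 2, 1
x = [1.0, 2.0]

# Unmerged path: base output plus the scaled adapter output.
base_out = matvec(W, x)
adapter_out = matvec(B, matvec(A, x))
unmerged = [b + (alpha / r) * a for b, a in zip(base_out, adapter_out)]

# Merged path: a single matrix-vector product, same cost as the base layer.
merged = matvec(merge_lora(W, A, B, alpha, r), x)
print(unmerged, merged)
```

Both paths give the same output, which is why a merged model can be served like an ordinary dense checkpoint.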
Q2: What is QLoRA and how does it differ from LoRA?
A: QLoRA (Dettmers et al., 2023) combines two techniques:
- 4-bit quantization: Load the base model in 4-bit precision (NF4 quantization) instead of 16-bit — 4x memory reduction
- LoRA adapters: Train LoRA adapters in full 16-bit precision on top of the frozen 4-bit base
Result: a 70B model can be fine-tuned on 2× 48 GB GPUs instead of 8+ 80 GB GPUs.
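The memory claim can be sanity-checked with back-of-the-envelope arithmetic. This sketch counts the base weights only; it deliberately ignores quantization constants, LoRA adapters, optimizer state, and activations, so real usage will be higher. The function name is made up for illustration.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

n_params = 70e9  # a 70B-parameter model
fp16_gb = weight_memory_gb(n_params, 16)
nf4_gb = weight_memory_gb(n_params, 4)

print(f"16-bit weights: {fp16_gb:.0f} GB")
print(f"4-bit weights:  {nf4_gb:.0f} GB")
```

At 4 bits the weights take about a quarter of the 16-bit footprint, which leaves headroom on 2× 48 GB GPUs for the adapters and activations.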
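Key config:
    import torch
    from transformers import BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,                      # store base weights in 4-bit
        bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
        bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bf16 for matmuls
        bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    )
Performance cost of quantization: typically small, roughly a 1–3% accuracy drop vs 16-bit LoRA depending on the task. For most practical tasks, this is acceptable given the memory savings.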
Q3: How do you choose the LoRA rank r?
A: Rank controls the parameter count and expressiveness of the LoRA adapter. Higher r = more parameters = more capacity to change the model, but also more risk of overfitting.
Practical starting points:
- r=4 or 8: format/style adaptation with small datasets
- r=16: domain adaptation (most common default)
- r=32 or 64: complex new tasks, large datasets
The alpha parameter (scaling factor) is typically set to alpha = 2r. The effective scaling is alpha/r, so alpha=32, r=16 gives scaling of 2.0.
Start at r=16. Plot validation loss vs r on a held-out set. If you see overfitting at r=16, reduce to r=8. If the model is still underfitting complex tasks, try r=32.
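The scaling rule and the rank sweep can be sketched in a few lines. The selection rule here (take the smallest rank whose validation loss is within a tolerance of the best) is one reasonable heuristic and an assumption of this example, not a standard API; the sweep numbers are invented purely for illustration.

```python
from typing import Dict

def lora_scaling(alpha: float, r: int) -> float:
    """Effective multiplier applied to the adapter output."""
    return alpha / r

def pick_rank(val_loss_by_rank: Dict[int, float], tolerance: float = 0.01) -> int:
    """Smallest rank whose validation loss is within `tolerance` of the best."""
    best = min(val_loss_by_rank.values())
    return min(r for r, loss in val_loss_by_rank.items() if loss <= best + tolerance)

# Hypothetical sweep results (made-up numbers, for illustration only).
sweep = {8: 1.42, 16: 1.31, 32: 1.305, 64: 1.33}

scale = lora_scaling(alpha=32, r=16)
chosen = pick_rank(sweep)
print(scale, chosen)
```

Preferring the smallest rank that is "good enough" keeps the adapter small and reduces the overfitting risk that comes with extra capacity.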
Q4: When should you NOT fine-tune, and use prompting or RAG instead?
A: Don't fine-tune if:
- You want to inject facts: Fine-tuning is an unreliable way to teach a model new knowledge; RAG is far more reliable for factual retrieval
- Your task is solvable with prompting: Few-shot examples in the prompt can often get you most of the quality of fine-tuning (roughly 80%, as a rule of thumb) at zero training cost
- Your dataset is small: Under 200 high-quality examples, prompting usually outperforms fine-tuning
- You need rapid iteration: Fine-tuning takes hours; prompt changes take minutes
Fine-tune when: the behavior change requires more than a prompt can express, the task is well-defined with 1k+ examples, and you need consistent format or domain-specific tone that prompting alone can't achieve reliably.
Q5: What is catastrophic forgetting and how do you mitigate it?
A: Catastrophic forgetting: fine-tuning on a narrow dataset degrades the model's general capabilities — it "forgets" what it knew before.
Signs: MMLU benchmark drops significantly after fine-tuning, model fails to follow general instructions, model loses commonsense reasoning.
Mitigations:
- LoRA: The base weights stay frozen and only the low-rank adapters change, which limits how far the model can drift. LoRA significantly reduces catastrophic forgetting vs full fine-tuning, though it does not eliminate it
- Small learning rate: 5e-5 to 1e-4 for LoRA; 1e-5 for full fine-tuning
- Fewer epochs: Train for 1–3 epochs. More epochs = more forgetting
- General data mixing: Include 5–10% general instruction data in your training mix to anchor general capabilities
- Benchmark monitoring: Run MMLU and HellaSwag before and after. If drop exceeds 3%, reduce training intensity
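The data-mixing mitigation can be sketched as a small helper. This example assumes both datasets are plain Python lists of examples; `mix_with_general` is a hypothetical name, and the 10% default simply follows the 5–10% range above.

```python
import random

def mix_with_general(domain, general, general_fraction=0.1, seed=0):
    """Return domain examples plus enough general examples to make up
    `general_fraction` of the final mix, shuffled."""
    if not 0 <= general_fraction < 1:
        raise ValueError("general_fraction must be in [0, 1)")
    # Solve n_general / (n_domain + n_general) = general_fraction for n_general.
    n_general = round(general_fraction * len(domain) / (1 - general_fraction))
    n_general = min(n_general, len(general))
    rng = random.Random(seed)
    mixed = list(domain) + rng.sample(list(general), n_general)
    rng.shuffle(mixed)
    return mixed

# Toy stand-ins for real datasets.
domain = [{"source": "domain", "id": i} for i in range(900)]
general = [{"source": "general", "id": i} for i in range(5000)]

mixed = mix_with_general(domain, general, general_fraction=0.1)
general_share = sum(ex["source"] == "general" for ex in mixed) / len(mixed)
print(len(mixed), general_share)
```

Sampling with a fixed seed keeps the mix reproducible across training runs, which makes before/after benchmark comparisons easier to interpret.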
Q6: What is the DPO training pipeline and how does it differ from SFT?
A: The full alignment pipeline:
- SFT (Supervised Fine-Tuning): Train on (prompt, response) pairs. Teaches the model what to say
- Preference data collection: Collect (prompt, chosen, rejected) triplets where "chosen" is the better response
- DPO: Train to increase probability of chosen responses relative to rejected, using the SFT model as reference
The DPO loss directly maximizes the log-probability margin of chosen over rejected responses, measured relative to the SFT reference model. There is no separate KL penalty term: the reference log-ratios inside the loss act as an implicit KL constraint that keeps the model close to the SFT reference, with beta controlling its strength.
Key difference from SFT: DPO uses contrastive pairs. It knows which response is better, not just what the response is. This teaches qualitative preferences, not just content.
PPO-based RLHF achieves similar results but requires training a separate reward model and running PPO. DPO skips both, which has made it a common practical choice.
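For a single preference pair, the DPO objective is -log sigmoid(beta × [(log π(chosen) − log π_ref(chosen)) − (log π(rejected) − log π_ref(rejected))]). The sketch below computes it in plain Python from summed response log-probabilities; the input numbers are arbitrary illustrations, and a real trainer would compute this over batches of tensors.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, from summed response log-probabilities."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written as a numerically stable softplus(-margin).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Policy identical to the reference: the margin is 0 and the loss is log(2).
loss_at_reference = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Policy raised the chosen response and lowered the rejected one: loss drops.
loss_improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)
print(loss_at_reference, loss_improved)
```

A larger beta makes the loss react more sharply to the same log-ratio gap; in the DPO derivation, beta is the coefficient on the KL constraint, so larger values correspond to staying closer to the reference.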
Q7: How do you prepare a high-quality fine-tuning dataset?
A: Data quality dominates fine-tuning results. Key steps:
- Define the target behavior explicitly: write 20+ examples of what "good" looks like before collecting any data
- Format correctly: use the model's native chat template (apply_chat_template) and apply response masking so that loss is computed only on response tokens
- Quality filter: enforce a minimum response length and reject refusals and vague answers
- Expert review: have domain experts review a random 10% sample and record the error rate
- Deduplicate: semantic deduplication (for example, embedding similarity) removes near-identical prompts that waste training budget
For domain-specific tasks, 200–500 expert-curated examples often outperform 50,000 auto-generated ones.
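The filtering and deduplication steps can be sketched with the standard library alone. Note the substitution: this example deduplicates on normalized prompt text, which is a much cruder stand-in for the embedding-based semantic deduplication described above. The refusal markers, the word threshold, and the `prompt`/`response` field names are all assumptions for illustration.

```python
import re

# Hypothetical refusal markers; a real list would be tuned to the data.
REFUSAL_MARKERS = ("i can't help", "i cannot help", "as an ai")

def normalize(text):
    """Lowercase and collapse punctuation/whitespace for duplicate detection."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def clean_dataset(examples, min_response_words=5):
    """Drop short responses, refusals, and prompts already seen."""
    seen_prompts = set()
    kept = []
    for ex in examples:
        response = ex["response"]
        if len(response.split()) < min_response_words:
            continue  # too short to be a useful answer
        if any(marker in response.lower() for marker in REFUSAL_MARKERS):
            continue  # refusal
        key = normalize(ex["prompt"])
        if key in seen_prompts:
            continue  # duplicate prompt after normalization
        seen_prompts.add(key)
        kept.append(ex)
    return kept

raw = [
    {"prompt": "What is LoRA?", "response": "LoRA trains small low-rank adapter matrices on frozen weights."},
    {"prompt": "what is lora", "response": "It is a parameter-efficient fine-tuning method using low-rank updates."},
    {"prompt": "Explain DPO.", "response": "It depends."},
    {"prompt": "Define QLoRA.", "response": "As an AI, I cannot help with that request right now."},
]
cleaned = clean_dataset(raw)
print(len(cleaned))
```

Only the first example survives: the second is a duplicate prompt, the third is too short, and the fourth is a refusal.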
Q8: How do you evaluate a fine-tuned model?
A: Multi-level evaluation:
- Training loss curve: Should decrease smoothly. Divergence = learning rate too high. Plateau early = data too easy or too small
- Validation loss: Tracks generalization. Rising while training loss falls = overfitting
- Task-specific metric: For classification, accuracy. For generation, use LLM-as-judge scoring on relevant dimensions (accuracy, completeness, tone)
- A/B vs base model: Compare outputs side-by-side. Fine-tuned model should win significantly on domain tasks
- Benchmark regression: Run MMLU, HellaSwag, TruthfulQA. Acceptable regression: under 3%
- Regression tests: Explicitly test for safety/refusal behaviors the base model had
Never ship based on training loss alone. A model that looks good on training metrics can fail spectacularly on real user queries.
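The benchmark-regression gate is easy to automate. This sketch interprets "under 3%" as an absolute drop of 3 percentage points, which is an assumption; the scores below are hypothetical and exist only to exercise the function.

```python
def benchmark_regressions(before, after, max_drop=3.0):
    """Return {benchmark: drop} for each benchmark that fell by more than max_drop points."""
    return {
        name: round(before[name] - after[name], 2)
        for name in before
        if name in after and before[name] - after[name] > max_drop
    }

# Hypothetical scores (accuracy in percent), for illustration only.
base_scores = {"mmlu": 66.0, "hellaswag": 79.5, "truthfulqa": 51.0}
tuned_scores = {"mmlu": 64.5, "hellaswag": 75.0, "truthfulqa": 50.5}

failed = benchmark_regressions(base_scores, tuned_scores)
print(failed)
```

An empty result means every benchmark stayed within budget; a non-empty one is a signal to lower the learning rate, train for fewer epochs, or mix in more general data.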
Q9: What target modules should you apply LoRA to?
A: Most impactful with fewest parameters: the attention projection matrices — q_proj and v_proj.
A broader target that captures more signal:
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"]
Including feed-forward layers (gate_proj, up_proj, down_proj) roughly triples trainable parameters and can help when the task requires learning new factual associations.
Find module names by printing the model:
    # model: an already-loaded transformers model
    for name, _ in model.named_modules():
        if "proj" in name:
            print(name)
For most domain adaptation tasks: start with q_proj + v_proj. Add more modules if validation metrics don't improve.
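To see how the choice of target modules affects adapter size, the count can be worked out directly: each adapted matrix adds r × (in_features + out_features) parameters. The shapes below are assumptions for a Llama-3.1-8B-style decoder block (hidden size 4096, MLP width 14336, 8 KV heads of dimension 128, 32 layers); check `model.config` for the model you actually use.

```python
# (in_features, out_features) per projection, assumed for illustration.
SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
NUM_LAYERS = 32

def lora_param_count(target_modules, r):
    """Each adapted matrix adds A (r x in) and B (out x r): r * (in + out) weights."""
    per_layer = sum(r * sum(SHAPES[name]) for name in target_modules)
    return per_layer * NUM_LAYERS

qv = lora_param_count(["q_proj", "v_proj"], r=16)
attn = lora_param_count(["q_proj", "k_proj", "v_proj", "o_proj"], r=16)
all_linear = lora_param_count(list(SHAPES), r=16)

print(f"q+v:        {qv:,}")
print(f"attention:  {attn:,}")
print(f"all linear: {all_linear:,}")
```

Under these assumed shapes, adding the feed-forward projections takes the adapter from about 13.6M to about 41.9M parameters, in line with the "roughly triples" estimate.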
Q10: System design — design a fine-tuning pipeline for a clinical drug information assistant.
A:
Phase 1: Data collection
- Source: FDA drug labels, clinical guidelines, pharmacology textbooks
- Generate 5,000 (question, expert answer) pairs using GPT-4o as generator
- Expert pharmacist reviews random 200-example sample; target error rate under 2%
- Apply quality filters: minimum 80 words, no refusals, clinical terminology present
Phase 2: SFT
- Base model: Llama-3.1-8B-Instruct (instruction-tuned base)
- Config: QLoRA (4-bit), r=16, alpha=32, target q/v/k/o projections
- Training: 3 epochs, learning rate 2e-4 with cosine schedule, batch size 8
- Monitor: train loss, validation loss on 500 held-out examples
Phase 3: DPO alignment
- Collect 1,000 preference pairs: GPT-4o generates high-quality and deliberately weak responses
- Pharmacist reviews 100 pairs to verify chosen/rejected labeling is correct
- DPO: beta=0.1, 2 epochs, learning rate 5e-5
Phase 4: Evaluation
- Task benchmark: 200 clinical pharmacology MCQ → target 85% accuracy
- Safety regression: refusal tests, sycophancy tests
- MMLU: verify regression is under 3%
- Human eval: 50 real queries rated by pharmacist
Deployment: Serve merged model (LoRA merged into weights) via vLLM for throughput.
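The Phase 2 training settings can be turned into concrete numbers. This sketch assumes 4,500 training examples (the 5,000 generated pairs minus the 500 held out), no gradient accumulation, and a plain cosine decay without warmup; none of these details are fixed by the design above, and the function names are illustrative.

```python
import math

def total_steps(num_examples, batch_size, epochs):
    """Optimizer steps, counting a partial final batch as a step."""
    return math.ceil(num_examples / batch_size) * epochs

def cosine_lr(step, total, peak_lr):
    """Cosine decay from peak_lr at step 0 to 0 at the final step (no warmup)."""
    return 0.5 * peak_lr * (1 + math.cos(math.pi * step / total))

# Assumption: 4,500 training examples (5,000 generated minus 500 held out).
steps = total_steps(num_examples=4500, batch_size=8, epochs=3)
lr_start = cosine_lr(0, steps, 2e-4)
lr_mid = cosine_lr(steps / 2, steps, 2e-4)
lr_end = cosine_lr(steps, steps, 2e-4)

print(steps, lr_start, lr_mid, lr_end)
```

Knowing the total step count up front helps when choosing how often to evaluate on the held-out set and save checkpoints.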
Q11: What is the optimal number of training epochs for LoRA fine-tuning?
A: Typically 1–3 epochs. More epochs = more overfitting to the training distribution and more catastrophic forgetting.
Rule of thumb:
- 1 epoch: for very large datasets (50k+)
- 2–3 epochs: for medium datasets (1k–50k)
- Up to 5 epochs: for very small datasets with high-quality data (under 500 examples)
Always use a validation set to detect overfitting. If validation loss starts rising while training loss continues falling, stop training (early stopping).
The biggest mistake: training for 10+ epochs because "more training = better." With fine-tuning, this is almost never true.
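Early stopping on validation loss can be sketched as follows. The `patience` parameter (how many non-improving epochs to tolerate) and the loss values are assumptions for illustration; most training frameworks provide an equivalent callback.

```python
def early_stop_epoch(val_losses, patience=1):
    """Return the 1-based epoch with the best validation loss, stopping the scan
    once the loss has failed to improve for `patience` consecutive epochs."""
    best_loss = float("inf")
    best_epoch = 0
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_epoch

# Hypothetical per-epoch validation losses: improvement stops after epoch 2.
val_losses = [1.20, 1.05, 1.09, 1.15, 1.30]
keep_checkpoint = early_stop_epoch(val_losses)
print(keep_checkpoint)
```

Here training would stop during epoch 3 and the checkpoint from epoch 2 would be kept.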
Q12: How does response masking work and why is it important?
A: Response masking (completion-only training): compute the training loss only on the assistant's response tokens, not on the system prompt and user message.
Without masking: the loss also covers the prompt tokens, so the model spends capacity learning to reproduce system prompts and user messages. It can overfit to one specific prompt format and generalize poorly to variations.
With masking: only the response tokens contribute to the gradient update — the model learns what to say, not how the prompt looks.
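Implementation in TRL (for releases that include DataCollatorForCompletionOnlyLM; newer TRL versions may expose completion-only loss through SFTConfig options instead, so check the docs for your installed version):
    from trl import DataCollatorForCompletionOnlyLM

    # tokenizer: the model's tokenizer, loaded beforehand
    collator = DataCollatorForCompletionOnlyLM(
        response_template="<|start_header_id|>assistant<|end_header_id|>",  # Llama 3 assistant header
        tokenizer=tokenizer,
    )
The collator replaces prompt token labels with -100 (ignored by cross-entropy loss), so only response tokens drive learning. This is essential for instruction-following fine-tuning: without it, you're training a next-token predictor on prompt formatting, not a response generator.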
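What the masking does to the labels can be shown without any library. This is a simplified sketch, not TRL's implementation: it works on toy token ids, handles only the first assistant turn, and the helper names are made up for illustration.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def find_subsequence_end(sequence, pattern):
    """Index just past the first occurrence of pattern in sequence, or -1."""
    n = len(pattern)
    for start in range(len(sequence) - n + 1):
        if sequence[start:start + n] == pattern:
            return start + n
    return -1

def mask_prompt(input_ids, response_template_ids):
    """Copy input_ids into labels, masking everything up to and including the template."""
    response_start = find_subsequence_end(input_ids, response_template_ids)
    if response_start == -1:
        # No response marker found: mask the whole example so it adds no loss.
        return [IGNORE_INDEX] * len(input_ids)
    return [IGNORE_INDEX] * response_start + input_ids[response_start:]

# Toy token ids: 7, 8 stand in for the tokenized assistant header.
input_ids = [1, 42, 43, 44, 7, 8, 90, 91, 2]
labels = mask_prompt(input_ids, [7, 8])
print(labels)
```

Only the last three positions keep real labels, so only the response tokens (and the end-of-sequence token) contribute to the gradient.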