Learnixo

Fine-Tuning LLMs · Lesson 16 of 16

Interview: Fine-Tuning Scenario Questions

Q1: What is the difference between full fine-tuning and LoRA?

A: Full fine-tuning updates all model parameters. For a 7B model, every step computes a gradient and an optimizer update for all 7 billion weights, which typically requires 80–160 GB of GPU memory for training.

LoRA (Low-Rank Adaptation) freezes the original weights and adds small low-rank decomposition matrices alongside them. Only these adapter matrices are trained, roughly 0.1–1% of parameters. A 7B model fine-tuned with LoRA r=16 trains roughly 20M parameters instead of 7B (the exact count depends on which modules are targeted), which fits on a single 24 GB GPU.

At inference, LoRA adapters can be merged into the base weights — zero overhead. This makes LoRA the standard for production fine-tuning.
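To make the merge claim concrete, here is a minimal pure-Python sketch with made-up 2x3 weights and a rank-1 adapter. The helper names (matmul, matvec, merge_lora) and all numbers are illustrative assumptions, not part of any library. It checks that running the base layer plus the scaled adapter gives the same output as a single multiply with the merged matrix W + (alpha/r)·B·A.

```python
# Minimal pure-Python sketch of LoRA merging; shapes and values are made up.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def matvec(m, v):
    """Multiply a matrix by a vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]

def merge_lora(w, lora_a, lora_b, alpha, r):
    """Return W + (alpha / r) * B @ A, the merged weight matrix."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)  # (d_out x r) @ (r x d_in) -> (d_out x d_in)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

# Frozen base weight (2 x 3) and a rank-1 adapter.
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, -1.0]]
A = [[0.5, -0.5, 1.0]]   # r x d_in  = 1 x 3
B = [[2.0], [1.0]]       # d_out x r = 2 x 1
alpha, r = 2, 1          # alpha = 2r, so the scaling factor is 2.0
x = [1.0, 2.0, 3.0]

# Unmerged path: base output plus the scaled adapter output.
base_out = matvec(W, x)
adapter_out = matvec(B, matvec(A, x))
unmerged = [b + (alpha / r) * a for b, a in zip(base_out, adapter_out)]

# Merged path: a single matrix multiply with the merged weights.
merged = matvec(merge_lora(W, A, B, alpha, r), x)
```

Because the merged path is one ordinary matrix multiply, a merged model has the same inference cost as the base model.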


Q2: What is QLoRA and how does it differ from LoRA?

A: QLoRA (Dettmers et al., 2023) combines two techniques:

  1. 4-bit quantization: Load the base model in 4-bit precision (NF4 quantization) instead of 16-bit, roughly a 4x reduction in weight memory
  2. LoRA adapters: Train LoRA adapters in 16-bit precision (for example bfloat16) on top of the frozen 4-bit base

Result: a 70B model can be fine-tuned on 2× 48 GB GPUs instead of 8+ 80 GB GPUs.
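The arithmetic behind that saving can be sketched as follows. This counts weight storage only, which is an assumption made for simplicity: activations, adapter gradients, optimizer state, and quantization constants all add to the real footprint. The function name weight_memory_gb is hypothetical.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

n_params = 70e9  # 70B parameters

fp16_gb = weight_memory_gb(n_params, 16)  # 16-bit base weights
nf4_gb = weight_memory_gb(n_params, 4)    # 4-bit quantized base weights

print(f"16-bit weights: {fp16_gb:.0f} GB, 4-bit weights: {nf4_gb:.0f} GB")
```

At 4 bits the weights alone leave headroom on 2× 48 GB for adapters and activations, while the 16-bit weights already exceed that budget before any training state is added.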

Key config:

Python
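import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store base weights in 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bf16 for matmuls
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)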

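Performance cost of quantization: usually small. The QLoRA paper reports NF4 with double quantization matching 16-bit fine-tuning on its benchmarks; as a conservative budget, expect a drop of roughly 1–3% versus 16-bit LoRA and measure it on your own task. For most practical tasks, this is acceptable given the memory savings.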


Q3: How do you choose the LoRA rank r?

A: Rank controls the parameter count and expressiveness of the LoRA adapter. Higher r = more parameters = more capacity to change the model, but also more risk of overfitting.

Practical starting points:

  • r=4 or 8: format/style adaptation with small datasets
  • r=16: domain adaptation (most common default)
  • r=32 or 64: complex new tasks, large datasets

The alpha parameter (scaling factor) is typically set to alpha = 2r. The effective scaling is alpha/r, so alpha=32, r=16 gives scaling of 2.0.

Start at r=16. Plot validation loss vs r on a held-out set. If you see overfitting at r=16, reduce to r=8. If the model is still underfitting complex tasks, try r=32.


Q4: When should you NOT fine-tune, and use prompting or RAG instead?

A: Don't fine-tune if:

  • You want to inject facts: Fine-tuning is unreliable at teaching a model new knowledge. RAG is far more reliable for factual retrieval
  • Your task is solvable with prompting: Few-shot examples in the prompt can often get you roughly 80% of fine-tuning quality with no training cost
  • Your dataset is small: Under 200 high-quality examples, prompting usually outperforms fine-tuning
  • You need rapid iteration: Fine-tuning takes hours; prompt changes take minutes

Fine-tune when: the behavior change requires more than a prompt can express, the task is well-defined with 1k+ examples, and you need consistent format or domain-specific tone that prompting alone can't achieve reliably.


Q5: What is catastrophic forgetting and how do you mitigate it?

A: Catastrophic forgetting: fine-tuning on a narrow dataset degrades the model's general capabilities — it "forgets" what it knew before.

Signs: MMLU benchmark drops significantly after fine-tuning, model fails to follow general instructions, model loses commonsense reasoning.

Mitigations:

  • LoRA: The base weights stay frozen and only the low-rank adapters change, which limits how far the model can drift. LoRA significantly reduces catastrophic forgetting vs full fine-tuning, but does not eliminate it
  • Small learning rate: 5e-5 to 1e-4 for LoRA; around 1e-5 for full fine-tuning
  • Fewer epochs: Train for 1–3 epochs. More epochs = more forgetting
  • General data mixing: Include 5–10% general instruction data in your training mix to anchor general capabilities
  • Benchmark monitoring: Run MMLU and HellaSwag before and after. If the drop exceeds 3%, reduce training intensity (lower the learning rate or train for fewer epochs)
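The benchmark check in the last bullet can be automated with a small gate like the sketch below. The function name, the scores, and the reading of "3%" as three percentage points are all assumptions for illustration; adjust the threshold definition to whatever your team agrees on.

```python
def benchmark_regressions(before, after, max_drop=3.0):
    """Return benchmarks whose score fell by more than max_drop points."""
    flagged = {}
    for name, base_score in before.items():
        drop = base_score - after[name]
        if drop > max_drop:
            flagged[name] = round(drop, 2)
    return flagged

# Hypothetical scores (percent accuracy), for illustration only.
before = {"mmlu": 66.0, "hellaswag": 79.0}
after = {"mmlu": 61.5, "hellaswag": 78.0}

flagged = benchmark_regressions(before, after)
print(flagged)
```

An empty result means the run stayed within budget; any flagged benchmark is a signal to retrain with a gentler configuration before shipping.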

Q6: What is the DPO training pipeline and how does it differ from SFT?

A: The full alignment pipeline:

  1. SFT (Supervised Fine-Tuning): Train on (prompt, response) pairs. Teaches the model what to say
  2. Preference data collection: Collect (prompt, chosen, rejected) triplets where "chosen" is the better response
  3. DPO: Train to increase probability of chosen responses relative to rejected, using the SFT model as reference

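The DPO loss increases the log-probability margin of chosen over rejected responses, with both measured relative to the SFT reference model. There is no separate KL penalty term in the loss: beta scales the margin and controls how strongly the policy is held close to the reference (an implicit KL constraint).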
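For a single preference pair, the loss can be sketched in plain Python as below. It assumes you already have the summed log-probability of each full response under the policy and under the frozen reference model; the function name and the example numbers are hypothetical, and real trainers compute this in batches on tensors.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair from sequence log-probabilities."""
    # How much more the policy favors each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Policy identical to the reference: margin is 0, loss is log(2).
loss_at_start = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy has shifted probability toward the chosen response: loss is lower.
loss_after_shift = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

The loss only falls when the policy prefers the chosen response more strongly than the reference does, which is what ties training to the SFT starting point.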

Key difference from SFT: DPO uses contrastive pairs — it knows which response is better, not just what the response is. This teaches qualitative preferences, not just content.

PPO-based RLHF can achieve similar results but requires training a separate reward model and then running PPO against it. DPO skips both, which has made it a common practical default.


Q7: How do you prepare a high-quality fine-tuning dataset?

A: Data quality dominates fine-tuning results. Key steps:

  1. Define the target behavior explicitly: write 20+ examples of what "good" looks like before collecting any data
  2. Format correctly: use the model's native chat template (apply_chat_template) and apply response masking so loss is computed only on responses
  3. Quality filter: enforce a minimum response length and reject refusals and vague answers
  4. Expert review: have domain experts review a random 10% sample and record the error rate
  5. Deduplicate: semantic deduplication (for example, by embedding similarity) removes near-identical prompts that waste training budget
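The filtering and deduplication steps above can be sketched as a single pass over the data. Note the swap: this version deduplicates by exact match on a normalized prompt, a simpler stand-in for the embedding-based semantic deduplication described above. The function name, the thresholds, and the refusal markers are illustrative assumptions.

```python
def clean_dataset(examples, min_words=5,
                  refusal_markers=("i can't", "i cannot", "as an ai")):
    """Drop short answers, refusals, and duplicate prompts (after normalization)."""
    seen_prompts = set()
    kept = []
    for ex in examples:
        response = ex["response"].strip()
        if len(response.split()) < min_words:
            continue  # too short to be a useful answer
        if any(marker in response.lower() for marker in refusal_markers):
            continue  # looks like a refusal
        key = " ".join(ex["prompt"].lower().split())  # normalize case and whitespace
        if key in seen_prompts:
            continue  # duplicate prompt
        seen_prompts.add(key)
        kept.append(ex)
    return kept

# Made-up examples, for illustration only.
examples = [
    {"prompt": "What is LoRA?",
     "response": "LoRA trains small low-rank adapter matrices on frozen weights."},
    {"prompt": "what is  lora?",
     "response": "LoRA adds trainable low-rank matrices to a frozen model."},
    {"prompt": "Explain QLoRA.",
     "response": "I cannot help with that request, sorry about that."},
    {"prompt": "Define DPO.", "response": "Preference tuning."},
]
kept = clean_dataset(examples)
```

Only the first example survives: the second is a duplicate prompt, the third is a refusal, and the fourth is too short.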

For domain-specific tasks, 200–500 expert-curated examples often outperform 50,000 auto-generated examples.


Q8: How do you evaluate a fine-tuned model?

A: Multi-level evaluation:

  • Training loss curve: Should decrease smoothly. Divergence = learning rate too high. Plateau early = data too easy or too small
  • Validation loss: Tracks generalization. Rising while training loss falls = overfitting
  • Task-specific metric: For classification, accuracy. For generation, use LLM-as-judge scoring on relevant dimensions (accuracy, completeness, tone)
  • A/B vs base model: Compare outputs side-by-side. Fine-tuned model should win significantly on domain tasks
  • Benchmark regression: Run MMLU, HellaSwag, TruthfulQA. Acceptable regression: under 3%
  • Regression tests: Explicitly test for safety/refusal behaviors the base model had

Never ship based on training loss alone. A model that looks good on training metrics can fail spectacularly on real user queries.


Q9: What target modules should you apply LoRA to?

A: Most impactful with fewest parameters: the attention projection matrices — q_proj and v_proj.

A broader target that captures more signal:

Python
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"]

Including feed-forward layers (gate_proj, up_proj, down_proj) roughly triples trainable parameters and can help when the task requires learning new factual associations.
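The effect of the module choice on parameter count can be checked with a little arithmetic: each adapted linear layer of shape (d_in, d_out) adds r × (d_in + d_out) parameters. The sketch below assumes Llama-3.1-8B-style layer shapes (hidden size 4096, MLP size 14336, grouped-query attention with 1024-wide key/value projections, 32 layers); these shapes and the function name are assumptions, so substitute your model's real dimensions.

```python
def lora_params(shapes, target_modules, r, n_layers):
    """Total LoRA parameters: each adapted (d_in, d_out) layer adds r * (d_in + d_out)."""
    per_layer = sum(r * (shapes[m][0] + shapes[m][1]) for m in target_modules)
    return per_layer * n_layers

# Assumed (d_in, d_out) shapes for a Llama-3.1-8B-style decoder layer.
shapes = {
    "q_proj": (4096, 4096), "k_proj": (4096, 1024),
    "v_proj": (4096, 1024), "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336), "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
attention = ["q_proj", "k_proj", "v_proj", "o_proj"]
mlp = ["gate_proj", "up_proj", "down_proj"]

attn_only = lora_params(shapes, attention, r=16, n_layers=32)
attn_plus_mlp = lora_params(shapes, attention + mlp, r=16, n_layers=32)
ratio = attn_plus_mlp / attn_only

print(f"attention only: {attn_only:,}  attention + MLP: {attn_plus_mlp:,}  ratio: {ratio:.2f}")
```

Under these assumed shapes, adding the feed-forward projections takes the adapter from about 13.6M to about 41.9M parameters, in line with the "roughly triples" estimate. The ratio is architecture-dependent and will differ for models with other MLP or attention widths.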

Find module names by printing the model:

Python
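# Assumes `model` is an already-loaded Hugging Face model.
# Print each projection's leaf name once instead of once per layer.
leaf_names = {name.split(".")[-1] for name, _ in model.named_modules()}
print(sorted(n for n in leaf_names if "proj" in n))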

For most domain adaptation tasks: start with q_proj + v_proj. Add more modules if validation metrics don't improve.


Q10: System design — design a fine-tuning pipeline for a clinical drug information assistant.

A:

Phase 1: Data collection

  • Source: FDA drug labels, clinical guidelines, pharmacology textbooks
  • Generate 5,000 (question, expert answer) pairs using GPT-4o as generator
  • Expert pharmacist reviews random 200-example sample; target error rate under 2%
  • Apply quality filters: minimum 80 words, no refusals, clinical terminology present

Phase 2: SFT

  • Base model: Llama-3.1-8B-Instruct (instruction-tuned base)
  • Config: QLoRA (4-bit), r=16, alpha=32, target q/v/k/o projections
  • Training: 3 epochs, learning rate 2e-4 with cosine schedule, batch size 8
  • Monitor: train loss, validation loss on 500 held-out examples
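As a rough sketch of what the Phase 2 schedule implies, the snippet below computes the number of optimizer steps and the learning rate under cosine decay. It assumes the 500 validation examples are carved out of the 5,000 generated pairs (the plan does not say), no gradient accumulation, no warmup, and decay to zero; the function name is hypothetical.

```python
import math

def cosine_lr(step, total_steps, peak_lr=2e-4, min_lr=0.0):
    """Learning rate at `step` under cosine decay from peak_lr to min_lr (no warmup)."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Assumption: 5,000 pairs minus 500 held out leaves 4,500 training examples.
train_examples = 4500
batch_size = 8
epochs = 3

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

for step in (0, total_steps // 2, total_steps):
    print(step, f"{cosine_lr(step, total_steps):.2e}")
```

With these assumptions the run is under two thousand steps, so evaluating on the held-out set every hundred steps or so gives a usable validation curve.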

Phase 3: DPO alignment

  • Collect 1,000 preference pairs: GPT-4o generates high-quality and deliberately weak responses
  • Pharmacist reviews 100 pairs to verify chosen/rejected labeling is correct
  • DPO: beta=0.1, 2 epochs, learning rate 5e-5

Phase 4: Evaluation

  • Task benchmark: 200 clinical pharmacology MCQ → target 85% accuracy
  • Safety regression: refusal tests, sycophancy tests
  • MMLU: verify regression is under 3%
  • Human eval: 50 real queries rated by pharmacist

Deployment: Serve merged model (LoRA merged into weights) via vLLM for throughput.


Q11: What is the optimal number of training epochs for LoRA fine-tuning?

A: Typically 1–3 epochs. More epochs = more overfitting to the training distribution and more catastrophic forgetting.

Rule of thumb:

  • 1 epoch: for very large datasets (50k+)
  • 2–3 epochs: for medium datasets (1k–50k)
  • Up to 5 epochs: for very small datasets with high-quality data (under 500 examples)

Always use a validation set to detect overfitting. If validation loss starts rising while training loss continues falling, stop training (early stopping).
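A minimal sketch of that early-stopping rule, assuming validation loss is recorded once per evaluation. The function name, the patience value, and the loss values are illustrative assumptions; training frameworks ship their own callbacks for this.

```python
def early_stopping_step(val_losses, patience=2, min_delta=0.0):
    """Return the index of the best evaluation, stopping once validation loss
    fails to improve for `patience` consecutive evaluations."""
    best_idx, best_loss, bad_evals = 0, float("inf"), 0
    for i, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_idx, best_loss, bad_evals = i, loss, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                break  # stop training; later evaluations are never run
    return best_idx

# Hypothetical validation losses, one per evaluation.
val_losses = [1.20, 0.95, 0.88, 0.91, 0.97, 0.85]
best = early_stopping_step(val_losses)
```

Here training stops after the fifth evaluation and the checkpoint from index 2 is kept. The trade-off of a small patience is visible in the data: the later 0.85 is never reached, so a noisy validation curve may justify a larger patience.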

The biggest mistake: training for 10+ epochs because "more training = better." With fine-tuning, this is almost never true.


Q12: How does response masking work and why is it important?

A: Response masking (completion-only training): compute the training loss only on the assistant's response tokens, not on the system prompt and user message.

Without masking: the loss also covers the prompt tokens, so part of every gradient step goes to reproducing system prompts and user messages. The model can overfit to the exact prompt format and generalize poorly to variations.

With masking: only the response tokens contribute to the gradient update — the model learns what to say, not how the prompt looks.

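Implementation in TRL (for releases that include DataCollatorForCompletionOnlyLM; newer releases expose completion-only loss through SFTConfig options instead, so check your installed version):

Python
from trl import DataCollatorForCompletionOnlyLM

# Assumes `tokenizer` is the Llama 3 tokenizer, already loaded.
# Every token up to and including this template is excluded from the loss.
collator = DataCollatorForCompletionOnlyLM(
    response_template="<|start_header_id|>assistant<|end_header_id|>",
    tokenizer=tokenizer,
)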

The collator replaces prompt token labels with -100 (ignored by cross-entropy loss), so only response tokens drive learning. This is essential for instruction-following fine-tuning — without it, you're training a next-token predictor on prompt formatting, not a response generator.
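What the masking amounts to can be shown without any library. The sketch below uses made-up token ids and assumes the index where the response starts is already known; a real collator finds that boundary by locating the response template in the token sequence. The function name is hypothetical.

```python
IGNORE_INDEX = -100  # label value that PyTorch cross-entropy skips by default

def mask_prompt_labels(input_ids, response_start):
    """Copy input_ids to labels, replacing every prompt position with IGNORE_INDEX."""
    return [IGNORE_INDEX if i < response_start else tok
            for i, tok in enumerate(input_ids)]

# Made-up token ids: the first four are the prompt, the last three the response.
input_ids = [101, 7, 8, 9, 42, 43, 2]
labels = mask_prompt_labels(input_ids, response_start=4)

print(labels)
```

The inputs are left untouched, so the model still conditions on the full prompt; only the positions that contribute to the loss change.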