Interview: LLMs Deep Dive (Part 2)
10 more senior-level interview questions: emergent capabilities, fine-tuning decisions, context extension, alignment tradeoffs, and production LLM system design.
Q1: What is emergence in LLMs and how does it affect production AI planning?
Answer: Emergence refers to capabilities that appear abruptly as model scale increases: absent in smaller models, present in larger ones, and never explicitly trained for. Commonly cited examples include chain-of-thought reasoning (around 100B parameters), 3-digit arithmetic (around 8B), and in-context few-shot learning (around 13B); the exact thresholds vary by model family and benchmark.
Why it can look sudden rather than gradual: On binary metrics (correct/incorrect), a multi-step task only counts as solved when every sub-step is right. If per-step accuracy is p and the task needs k independent steps, full-task accuracy is roughly p^k. A smooth rise in per-step accuracy from 90% to 99% on a 20-step task moves exact-match accuracy from about 12% to about 82%, which reads as a sudden jump even though the underlying skill improved gradually.
Production implications:
- Don't extrapolate current model limitations to next-generation models — a task your current model fails may work well after a model upgrade
- Run full eval suites on model upgrades, not just spot checks
- Safety evaluations are model-version-specific — a safe 7B system may behave differently as a 70B system
- Budget for capability surprises: some tasks you paid humans to do may be automatable after a model upgrade
Counter-consideration (Schaeffer et al., 2023): Many emergence results are metric artifacts. Measuring with smooth metrics (log-prob) shows gradual improvement rather than sudden jumps. The discontinuity is in the measurement, not the model. This is relevant when choosing evaluation metrics — binary pass/fail can hide gradual progress.
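The metric effect is easy to reproduce numerically. The sketch below assumes that sub-steps succeed independently and uses made-up per-step accuracies for successively larger models; `exact_match_accuracy` is a hypothetical helper, not something from the Schaeffer et al. paper.

```python
def exact_match_accuracy(per_step_accuracy: float, num_steps: int) -> float:
    """Probability that every step is correct, assuming independent steps."""
    return per_step_accuracy ** num_steps


# Hypothetical per-step accuracies for successively larger models.
per_step_accuracies = [0.80, 0.90, 0.95, 0.99]
num_steps = 20  # assumed number of sub-steps in the full task

for p in per_step_accuracies:
    full_task = exact_match_accuracy(p, num_steps)
    # The smooth metric (p) rises steadily; the binary metric jumps late.
    print(f"per-step accuracy={p:.2f}  full-task exact match={full_task:.3f}")
```

The per-step column climbs smoothly while the exact-match column stays near zero and then rises sharply, which is the pattern a pass/fail benchmark would report as emergence.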
Q2: When do you choose fine-tuning over prompt engineering, and vice versa?
Answer:
Choose prompt engineering when:
- The task is within the model's pretraining distribution (the model already has the knowledge)
- You need to iterate quickly — prompt changes deploy in minutes
- Format and behavioral constraints are the main need
- The task type changes frequently (general-purpose system)
- You're at an early stage without a clear target task
Choose fine-tuning when:
- Prompt engineering has plateaued — you've optimized the prompt and still can't hit target quality
- Consistent format is critical and prompts can't achieve it reliably (structured output, specific voice)
- The task requires knowledge not in pretraining (proprietary terminology, internal process knowledge)
- Cost at scale — a smaller fine-tuned model can outperform prompting a much larger model at lower cost
- Latency — a fine-tuned 7B model may be fast enough where GPT-4 is too slow
Process: Start with prompt engineering. Build an eval suite first. When prompt optimization hits diminishing returns, use your eval suite to measure fine-tuning gains. If fine-tuning adds more than 5-10% on your eval suite and the dataset cost is justified, switch.
Common mistake: Fine-tuning without an eval suite. Then you don't know if fine-tuning helped.
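As a rough illustration of the eval-first process, the comparison can be reduced to a gain on a shared eval set. This is a sketch under assumptions: the function names are hypothetical, scores are per-example values in [0, 1], and the 5-point threshold is read as an absolute difference in mean score.

```python
from statistics import mean


def fine_tuning_gain(baseline_scores: list[float], finetuned_scores: list[float]) -> float:
    """Absolute difference in mean eval score on the same eval examples."""
    if len(baseline_scores) != len(finetuned_scores):
        raise ValueError("both runs must be scored on the same eval set")
    return mean(finetuned_scores) - mean(baseline_scores)


def should_switch(
    baseline_scores: list[float],
    finetuned_scores: list[float],
    min_gain: float = 0.05,  # assumed threshold: 5 points absolute
) -> bool:
    """True when the fine-tuned model clears the minimum gain over prompting."""
    return fine_tuning_gain(baseline_scores, finetuned_scores) >= min_gain


prompt_only = [1, 0, 1, 0]  # pass/fail per eval example with the best prompt
fine_tuned = [1, 1, 1, 0]   # same examples, fine-tuned model
print(fine_tuning_gain(prompt_only, fine_tuned), should_switch(prompt_only, fine_tuned))
```

A real decision would also weigh dataset cost and use an eval set large enough that the gain is not noise.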
Q3: Explain how Grouped Query Attention (GQA) works and why it matters for inference.
Answer: Standard Multi-Head Attention (MHA) has one key and value projection per attention head. In a 32-head model, you have 32 K and 32 V matrices — all stored in the KV cache.
GQA groups query heads to share K/V pairs:
- 32 query heads
- 8 KV heads (groups of 4 query heads share the same K, V)
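KV cache reduction: The cache shrinks by the ratio of query heads to KV heads, so 4× in the 32/8 example above. For a 70B-class model (80 layers, 64 query heads, head_dim 128, fp16) at 4K context, per sequence:
- MHA: 80 layers × 64 heads × 128 head_dim × 2 (K, V) × 2 bytes ≈ 2.6 MB per token ≈ 10.7 GB per 4K context
- GQA (8 KV heads): ≈ 0.33 MB per token ≈ 1.3 GB per 4K context (8× smaller)
This reduction in KV cache memory allows: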
- Larger batch sizes (more requests per GPU)
- Longer contexts within the same memory budget
- More memory available for model weights (enabling larger models on the same hardware)
Quality tradeoff: Empirically minimal — LLaMA-3 8B uses 8 KV heads vs 32 query heads with minimal quality degradation. The query heads can still differentiate different attention patterns even when sharing K/V.
Multi-Query Attention (MQA) goes further — all query heads share a single K, V. More memory efficient but larger quality drop. GQA is the sweet spot.
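KV cache size follows directly from the model shape, so it is worth computing rather than remembering. The sketch below uses a hypothetical helper, `kv_cache_bytes`, and an assumed 70B-class configuration (80 layers, 64 query heads, head_dim 128, fp16, 4,096 tokens, one sequence).

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    bytes_per_value: int = 2,  # fp16 / bf16
    batch_size: int = 1,
) -> int:
    """Bytes needed to cache K and V for every layer, KV head, and token."""
    k_and_v = 2  # one K tensor and one V tensor per layer
    return k_and_v * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value * batch_size


# Assumed 70B-class shape; only the number of KV heads differs per variant.
mha = kv_cache_bytes(num_layers=80, num_kv_heads=64, head_dim=128, seq_len=4096)
gqa = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=4096)
mqa = kv_cache_bytes(num_layers=80, num_kv_heads=1, head_dim=128, seq_len=4096)

for name, size in [("MHA", mha), ("GQA-8", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 1e9:.2f} GB per 4K-token sequence")
```

Multiplying by batch size shows why the cache, not the weights, often becomes the limit on concurrent requests.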
Q4: What are the main failure modes of RLHF and how does DPO address them?
Answer:
RLHF/PPO failure modes:
- Reward hacking: The policy finds inputs that score high on the RM without actually being better. Common: verbosity (if RM associates length with quality), certain phrases, sycophantic responses.
- KL instability: If KL coefficient is too low, the policy diverges and generates nonsensical text. Too high, and alignment barely happens.
- Reward model overfitting: If RM training data is small, the RM is overfit to specific patterns and generalizes poorly to novel inputs.
- Online generation bottleneck: PPO requires generating samples during training, which is slow and memory-intensive. Hard to parallelize efficiently.
- 4-model memory requirement: Policy, reference, reward, and value head all in GPU memory simultaneously.
How DPO addresses these:
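- No explicit reward model: Removes the separately trained RM, and with it RM overfitting, as a failure point. DPO still optimizes an implicit reward, so over-optimization (for example, drifting toward longer responses) can still occur, but there is no learned RM for the policy to exploit during online sampling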
- Offline training: Trains on a fixed preference dataset — no slow online generation step
- 2-model memory: Policy and reference model only (no RM, no value head)
- Simpler hyperparameters: Beta (KL coefficient) is the main knob; no clip ratio, value loss coefficient, etc.
DPO's own failure modes:
- Sensitive to preference data quality — noisy preferences hurt more than in PPO
- Offline learning: can't improve beyond what's in the preference dataset
- Requires good SFT initialization — DPO on top of a weak SFT doesn't work well
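To make the "policy and reference only" point concrete, here is the per-pair DPO loss written out in plain Python. It is a minimal sketch: the inputs are assumed to be summed token log-probabilities of the chosen and rejected responses under each model, and the function name and example numbers are illustrative.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """-log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # Numerically stable softplus(-margin), equal to -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: margin is 0, loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy has shifted probability toward the chosen response: loss drops.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

Only log-probabilities from the policy and the frozen reference appear in the loss, which is why no reward model or value head is needed in memory.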
Q5: Describe the NTK-aware RoPE scaling technique and why it works.
Answer: RoPE (Rotary Position Embedding) encodes position by rotating query and key vectors. The rotation frequency for dimension pair d is: θ_d = 1 / (base^(2d/D)). At training max length L, positions 0 to L-1 are seen. Beyond L, the model receives position indices outside its training distribution.
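Linear scaling (position interpolation) problem: Dividing every position by a scale factor s = L_target/L_train maps position L_target back to L_train, so all positions stay inside the trained range. The cost is resolution: adjacent tokens are now only 1/s apart instead of 1, and the high-frequency dimensions that encode local order are compressed by the same factor as everything else. Without fine-tuning, the model struggles to tell nearby tokens apart.
NTK-aware scaling intuition: Instead of scaling positions, scale the RoPE base. A larger base lowers the rotation frequencies, but not uniformly: the highest-frequency dimension is left unchanged and the lowest-frequency dimension is stretched by the full factor s, with a smooth transition in between:
base_scaled = base × (L_target/L_train)^(D/(D-2))
Why this works: High-frequency RoPE dimensions handle local position structure (nearby token relationships); low-frequency dimensions handle global position (long-range relationships). With the scaled base, θ_d is multiplied by s^(-2d/(D-2)), so that:
- Local structure is preserved: the high-frequency dimensions rotate almost exactly as they did in training, so neighboring tokens remain distinguishable
- Global structure is interpolated: the low-frequency dimensions are slowed by up to s, so their rotation angles at position L_target stay within the range seen during training
The key insight: the low-frequency dimensions are the ones that go out of distribution at long range (some of them never complete a full rotation within L_train), so interpolating mainly those avoids unseen angles while leaving the high-frequency dimensions, which have already seen every angle many times, nearly untouched.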
Practical result: NTK-aware scaling is reported to extend a model trained at 8K tokens to 32K tokens with a perplexity increase of under 0.3 and no fine-tuning, though the exact figure depends on the model and the evaluation set. With a few hundred steps of fine-tuning on long documents, the extension typically holds at 4-8× the original context length.
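The effect of the base change on each frequency can be checked directly. The sketch below assumes head_dim D = 128, base 10000, and a 4× extension; the helper names are hypothetical, and the index i runs over dimension pairs 0 to D/2 - 1 as in the θ_d formula above.

```python
def rope_frequencies(head_dim: int, base: float = 10000.0) -> list[float]:
    """theta_i = base^(-2i / head_dim) for each dimension pair i."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def ntk_scaled_base(base: float, scale: float, head_dim: int) -> float:
    """base_scaled = base * scale^(D / (D - 2))."""
    return base * scale ** (head_dim / (head_dim - 2))


head_dim = 128
scale = 4.0  # assumed extension factor, e.g. 8K -> 32K

original = rope_frequencies(head_dim)
scaled = rope_frequencies(head_dim, ntk_scaled_base(10000.0, scale, head_dim))

# Ratio of new to old frequency per dimension pair.
ratios = [new / old for new, old in zip(scaled, original)]
print(f"highest-frequency pair ratio: {ratios[0]:.4f}")
print(f"lowest-frequency pair ratio:  {ratios[-1]:.4f}")
```

The highest-frequency pair keeps its frequency, the lowest-frequency pair is slowed by exactly 1/scale, and the pairs in between fall smoothly between the two.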
Q6: How do you evaluate an LLM for clinical use, and what metrics matter?
Answer: Generic benchmarks (MMLU, HumanEval) are necessary but not sufficient for clinical deployment. A rigorous clinical evaluation requires domain-specific assessment:
Evaluation layers:
1. Retrieval evaluation (for RAG systems):
- Recall@5: Are the relevant documents retrieved in the top 5?
- MRR (Mean Reciprocal Rank): Mean over queries of 1/rank of the first relevant document (1.0 means a relevant document is always ranked first)
2. Factual accuracy by category:
- Drug interaction severity: compare to Lexicomp/Micromedex ground truth
- Renal dosing: compare to KDIGO guidelines or FDA prescribing information
- Pharmacokinetics: compare to established reference values
- Per-category pass rate with human pharmacist review on failed cases
3. Safety-critical behaviors:
- Refusal rate on clearly harmful requests (target: 100%)
- False refusal rate on appropriate clinical questions (target: under 5%)
- Uncertainty expression: does the model say "I'm not certain" when it should?
4. Adversarial robustness:
- Fabrication resistance: does asking about a fictitious drug result in fabricated information?
- Authority claim resistance: does claiming to be a doctor unlock restricted content?
- Sycophancy: does the model maintain correct answers when users push back?
5. Clinical bias:
- Do response quality or completeness differ across patient demographics?
- Gender, race, age-based variation in pharmacotherapy recommendations
Measurement approach: LLM-as-judge with a fixed rubric for scalable evaluation; pharmacist review for calibration of the judge; category-specific thresholds (safety must be 100%, accuracy target varies by risk level).
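The two retrieval metrics in layer 1 are simple enough to implement directly. This is a minimal sketch with hypothetical function names and toy document IDs; it assumes each query comes with a ranked list of retrieved IDs and a set of relevant IDs.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        raise ValueError("each query needs at least one relevant document")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_reciprocal_rank(
    all_retrieved: list[list[str]], all_relevant: list[set[str]]
) -> float:
    """Mean over queries of 1 / rank of the first relevant document."""
    total = 0.0
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts; no hit adds 0
    return total / len(all_retrieved)


# Toy data: two queries with ranked retrieval results.
runs = [["d3", "d1", "d7"], ["d2", "d4"]]
gold = [{"d1", "d9"}, {"d2"}]
print(recall_at_k(runs[0], gold[0], k=5))
print(mean_reciprocal_rank(runs, gold))
```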
Q7: Explain the Chinchilla scaling laws and their practical implications.
Answer: The Chinchilla paper (Hoffmann et al., DeepMind, 2022) established the optimal compute allocation between model size and training data:
The finding: For a given compute budget C ≈ 6ND (in FLOPs), the optimal model has approximately N* ≈ (C/120)^0.5 parameters, trained on D* ≈ 20N* tokens. The 120 comes from substituting D = 20N into C ≈ 6ND, which gives C ≈ 120N².
Rule of thumb: 20 tokens per parameter for optimal training.
This overturned a prior assumption from the Kaplan et al. (2020) scaling laws, which concluded that model size should scale faster than data. Kaplan's models were undertrained; Chinchilla shows equal scaling of both.
Practical implications:
- GPT-3 (175B, 300B tokens) was undertrained. Chinchilla-optimal for 175B parameters is 3.5 trillion tokens. LLaMA and subsequent models applied this insight and then went well past it: LLaMA-3-8B was trained on 15 trillion tokens (1875 tokens/parameter).
- Smaller, well-trained models beat larger undertrained models. Chinchilla 70B outperformed GPT-3 175B on most benchmarks despite having fewer parameters, because it was trained on far more tokens per parameter.
- Inference cost matters. A 7B model trained on 140B tokens and a 70B model trained on 14B tokens cost the same training compute (C ≈ 6ND), but the 7B model sits at the compute-optimal ratio while the 70B model is badly undertrained, and at inference the 7B model is roughly 10× cheaper to run. Training is a one-time cost; inference is ongoing.
- Practical divergence: In practice, researchers often train smaller models on far more data than Chinchilla-optimal for the benefit of cheaper inference. LLaMA-3-8B is "over-trained" by Chinchilla standards but produces a small, high-quality model suitable for deployment.
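The rule of thumb turns into a two-line calculation. The sketch below assumes the common approximation C ≈ 6ND for training FLOPs and a fixed ratio of 20 tokens per parameter, so N = sqrt(C / 120); the function names are hypothetical.

```python
import math


def training_flops(params: float, tokens: float) -> float:
    """Approximate training compute: C = 6 * N * D."""
    return 6.0 * params * tokens


def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0) -> tuple[float, float]:
    """Return (params, tokens) that spend the budget at the given ratio."""
    # C = 6 * N * D with D = r * N  =>  N = sqrt(C / (6 * r))
    params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    return params, tokens_per_param * params


# Chinchilla itself: 70B parameters on 1.4T tokens.
budget = training_flops(70e9, 1.4e12)
params, tokens = chinchilla_optimal(budget)
print(f"budget={budget:.2e} FLOPs -> {params / 1e9:.0f}B params, {tokens / 1e12:.1f}T tokens")

# Tokens per parameter for the models mentioned above.
print(f"GPT-3: {300e9 / 175e9:.1f} tokens/param, LLaMA-3-8B: {15e12 / 8e9:.0f} tokens/param")
```

Running the same budget through a lower or higher `tokens_per_param` shows how far a deliberately over-trained small model departs from the compute-optimal point.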
Q8: What is catastrophic forgetting and how do methods like LoRA mitigate it?
Answer: Catastrophic forgetting is the tendency for neural networks to forget previously learned tasks when fine-tuned on new ones. When you fine-tune a pretrained LLM on a narrow domain, it can "overwrite" general language capabilities with domain-specific patterns.
Why it happens: Gradient updates during fine-tuning move all parameters toward the new objective. Weights that encoded general capabilities are modified to better fit the new domain data, destroying the prior encoding.
LoRA mitigation: LoRA (Low-Rank Adaptation) freezes all original weights and adds small adapter matrices (B of shape d_out × r and A of shape r × d_in, with rank r much smaller than either dimension) to selected weight matrices. Only A and B are trained.
W_adapted = W_pretrained + (α/r)·BA
Since W_pretrained is frozen, the original weights are never overwritten. The adapters learn the domain-specific adjustments on top of the frozen base, and the low rank limits how far the adapted model can drift from it.
Why this works for catastrophic forgetting:
- The pretrained weights encode general knowledge and remain unchanged
- Adapters encode only the delta for the new domain
- If you remove the adapters, you get the original model back perfectly
- Multiple adapter sets can exist (swap adapters to switch domains without retraining)
Additional techniques:
- Elastic Weight Consolidation (EWC): Add a regularization term that penalizes large changes to weights that were important for previous tasks
- Continual Learning with replay: Mix some pretraining data into fine-tuning to maintain general capabilities
- Parameter-efficient methods generally: Adapters, prefix tuning, and prompt tuning all preserve base model weights
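The LoRA update is small enough to write out without a framework. This is an illustrative sketch in plain Python with toy dimensions, following the usual LoRA convention of a zero-initialized B and an α/r scale; all names are hypothetical and nothing here mirrors a specific library's API.

```python
import random

Matrix = list[list[float]]


def matmul(x: Matrix, y: Matrix) -> Matrix:
    """Plain-Python matrix product (rows of x times columns of y)."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*y)] for row in x]


def lora_weight(w: Matrix, b: Matrix, a: Matrix, alpha: float, rank: int) -> Matrix:
    """Return W + (alpha / rank) * B @ A without modifying W."""
    delta = matmul(b, a)
    scale = alpha / rank
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters in one adapter pair (B is d_out x r, A is r x d_in)."""
    return rank * (d_out + d_in)


random.seed(0)
d_out, d_in, rank, alpha = 4, 6, 2, 4.0
w = [[random.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(d_out)]  # frozen
a = [[random.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]  # trainable
b = [[0.0] * rank for _ in range(d_out)]  # trainable, zero-initialized

# With B = 0 the adapted weight equals the pretrained weight exactly.
w_before_training = lora_weight(w, b, a, alpha, rank)
print(w_before_training == w)

# Adapter size for an assumed 4096 x 4096 projection at rank 8 vs. full fine-tuning.
print(lora_param_count(4096, 4096, 8), 4096 * 4096)
```

Because `w` is never mutated, dropping or swapping `a` and `b` is all it takes to return to the base model or switch domains.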
Q9: How does Constitutional AI (CAI) differ from RLHF for alignment, and what are its advantages?
Answer: RLHF relies on human preference labelers to score responses and generate training signal. CAI (Anthropic, 2022) replaces most human labeling with AI-generated preference labels guided by a written constitution.
CAI stages:
1. SL-CAI (Supervised Learning): Generate responses, critique them against the constitution (via self-critique), revise them. Fine-tune on the revised responses. This teaches the model to be helpful and harmless from principles, not just demonstrations.
2. RLAIF (RL from AI Feedback): Generate preference pairs by asking a separate model to evaluate responses against the constitution. Use these AI-generated preferences to train a reward model, then apply PPO.
Advantages over RLHF:
- Scalability: AI labeling is limited by compute rather than annotator headcount, so it scales to far larger preference datasets than human labeling
- Consistency: AI applies the same principles consistently; human labelers often disagree with each other on roughly 25-30% of comparisons
- Transparency: The constitution is a human-readable document — you can inspect and modify what the model is optimizing for
- Reduced human exposure: Human labelers in RLHF are exposed to harmful content to label it; AI labelers aren't harmed
Tradeoffs:
- The AI labeler inherits biases from its own training
- Constitutional principles must be carefully written to avoid unintended interpretations
- May miss harm types not covered by the constitution
Practical outcome: Anthropic found CAI-trained models were significantly less harmful with comparable helpfulness to RLHF — and at lower human labeling cost.
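The SL-CAI stage is essentially a critique-and-revise loop, which can be sketched with any text-in, text-out model. Everything below is an assumption for illustration: the `llm` callable, the prompt wording, and the choice to apply every principle in order are not taken from Anthropic's implementation.

```python
from typing import Callable


def critique_and_revise(
    prompt: str,
    principles: list[str],
    llm: Callable[[str], str],  # hypothetical text-in, text-out model call
) -> str:
    """Generate a response, then critique and revise it once per principle."""
    response = llm(prompt)
    for principle in principles:
        critique = llm(
            "Critique the response against this principle.\n"
            f"Principle: {principle}\nPrompt: {prompt}\nResponse: {response}"
        )
        response = llm(
            "Rewrite the response to address the critique.\n"
            f"Critique: {critique}\nPrompt: {prompt}\nResponse: {response}"
        )
    # The (prompt, final response) pair becomes one supervised fine-tuning example.
    return response


def echo_model(text: str) -> str:
    """Stand-in model so the sketch runs without an API; returns a fixed reply."""
    return f"[model output for {len(text)} chars of input]"


print(critique_and_revise("How do I dispose of old medication?", ["Be harmless."], echo_model))
```

In a real pipeline the revised responses would be collected into a fine-tuning dataset, and the RLAIF stage would reuse the same principles to label preference pairs.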
Q10: System design — how would you build a multi-modal clinical documentation assistant?
Scenario: Clinicians dictate notes, attach lab images and ECGs. Build an AI system that generates structured clinical documentation.
Answer:
Input pipeline:
Audio recording → Whisper (STT) → Clinical transcript
Lab images → GPT-4V → Structured lab result extraction
ECG images → GPT-4V + clinical prompt → Rhythm classification + interval measurements
Processing pipeline:
def process_clinical_encounter(
    audio_path: str,
    image_paths: list[str],
    patient_context: dict,
) -> dict:
    """Multi-modal clinical documentation generation."""
    # Helper functions (transcribe_clinical_audio, classify_image_type, ...)
    # are assumed to be defined elsewhere.
    # 1. Transcribe audio
    transcript = transcribe_clinical_audio(audio_path)
    # 2. Extract structured data from images
    structured_data = []
    for img_path in image_paths:
        img_type = classify_image_type(img_path)  # e.g. "lab_report", "ecg", "x-ray"
        if img_type == "lab_report":
            structured_data.append(extract_lab_results(img_path))
        elif img_type == "ecg":
            structured_data.append(analyze_ecg(img_path))
        # Other image types (e.g. x-ray) are not handled in this sketch
    # 3. Generate structured documentation
    documentation = generate_clinical_note(
        transcript=transcript,
        structured_data=structured_data,
        patient_context=patient_context,
    )
    return documentation
Safety and compliance:
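- PHI stays on-premises wherever possible: use self-hosted models (LLaMA-3, Whisper). Any hosted model that receives PHI must be covered by a Business Associate Agreement (BAA)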
- Audit log every extraction and generation with model version
- Human review required before EHR submission (AI as decision support, not final documentation)
- Output validation: structured fields (ICD codes, drug names) checked against validated dictionaries
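The output validation step can start as a plain dictionary lookup before anything reaches human review. The sketch below is hypothetical throughout: the note layout (`icd_codes`, `medications`), the function name, and the two tiny dictionaries stand in for whatever validated code sets a deployment actually uses.

```python
def validate_structured_fields(
    note: dict,
    icd10_codes: set[str],
    drug_names: set[str],
) -> list[str]:
    """Return a list of problems; an empty list means every field was recognized."""
    errors = []
    for code in note.get("icd_codes", []):
        if code.upper() not in icd10_codes:
            errors.append(f"unknown ICD code: {code}")
    for drug in note.get("medications", []):
        if drug.lower() not in drug_names:
            errors.append(f"unknown drug name: {drug}")
    return errors


# Toy dictionaries for illustration only.
known_codes = {"I10", "E11.9"}
known_drugs = {"metformin", "lisinopril"}

generated_note = {
    "icd_codes": ["I10", "ZZZ"],
    "medications": ["Metformin", "notadrug"],
}
for problem in validate_structured_fields(generated_note, known_codes, known_drugs):
    print(problem)
```

Anything the check flags would be surfaced to the reviewing clinician rather than silently corrected.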
Quality evaluation:
- Character error rate (CER) on transcript vs gold standard
- Structured field extraction accuracy vs human extraction on 200 test cases
- Documentation completeness scoring by clinician reviewers
- Downstream: does the AI-assisted documentation reduce documentation time? (hours per clinician per week)
Model choices:
- STT: Whisper Large V3 (self-hosted on-premises, so audio never leaves the covered environment)
- Vision: GPT-4o (with BAA) or Claude Sonnet 4.6 (with BAA)
- Note generation: Claude Sonnet (strong at structured clinical text)
- Validation: rule-based checks + LLM-as-judge on 1% sample