Interview: LLMs Deep Dive (Part 2)

Q1: What is emergence in LLMs and how does it affect production AI planning?

Answer: Emergence refers to capabilities that appear suddenly as model scale increases: absent in smaller models and present in larger ones, without being explicitly trained for. Commonly cited examples, with approximate thresholds that vary by model family and prompting setup: chain-of-thought reasoning around 100B parameters; 3-digit arithmetic around 8B; in-context few-shot learning around 13B.

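Why it can look like a sudden jump: On binary metrics (correct/incorrect), a task that requires k sub-steps to all be right has a full-task accuracy of roughly p^k, where p is the per-step accuracy. With k = 10, raising p from 0.80 to 0.95 moves the full task from about 11% to about 60%. A smooth per-step gain therefore shows up as a sharp jump on the end-to-end metric, which creates the appearance of sudden capability emergence.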

Production implications:

Don't extrapolate current model limitations to next-generation models — a task your current model fails may work well after a model upgrade
Run full eval suites on model upgrades, not just spot checks
Safety evaluations are model-version-specific: a system that is safe with a 7B model may behave differently with a 70B model
Budget for capability surprises: some tasks you paid humans to do may be automatable after a model upgrade

Counter-consideration (Schaeffer et al., 2023): Many emergence results are metric artifacts. Measuring with smooth metrics (log-prob) shows gradual improvement rather than sudden jumps. The discontinuity is in the measurement, not the model. This is relevant when choosing evaluation metrics — binary pass/fail can hide gradual progress.
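The metric effect can be sketched numerically. This is a toy model, not a result from the paper: it assumes a task made of `num_steps` independent sub-steps that must all be correct, and the function names are illustrative.

```python
def exact_match_accuracy(per_step_accuracy: float, num_steps: int) -> float:
    """Probability that every step is correct, assuming independent steps."""
    return per_step_accuracy ** num_steps


def sweep(num_steps: int = 10) -> list[tuple[float, float]]:
    """Pair a smooth metric (per-step accuracy) with the binary full-task metric."""
    per_step_values = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]
    return [(p, exact_match_accuracy(p, num_steps)) for p in per_step_values]


if __name__ == "__main__":
    for p, exact_match in sweep():
        print(f"per-step accuracy {p:.2f} -> exact match {exact_match:.3f}")
```

The per-step column rises steadily while the exact-match column stays near zero and then climbs steeply, which is the pattern a pass/fail benchmark would report as a jump.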

Q2: When do you choose fine-tuning over prompt engineering, and vice versa?

Answer:

Choose prompt engineering when:

The task is within the model's pretraining distribution (the model already has the knowledge)
You need to iterate quickly — prompt changes deploy in minutes
Format and behavioral constraints are the main need
The task type changes frequently (general-purpose system)
You're at an early stage without a clear target task

Choose fine-tuning when:

Prompt engineering has plateaued — you've optimized the prompt and still can't hit target quality
Consistent format is critical and prompts can't achieve it reliably (structured output, specific voice)
The task requires knowledge not in pretraining (proprietary terminology, internal process knowledge)
Cost at scale — a smaller fine-tuned model can outperform prompting a much larger model at lower cost
Latency — a fine-tuned 7B model may be fast enough where GPT-4 is too slow

Process: Start with prompt engineering. Build an eval suite first. When prompt optimization hits diminishing returns, use your eval suite to measure fine-tuning gains. If fine-tuning adds more than 5-10% on your eval suite and the dataset cost is justified, switch.

Common mistake: Fine-tuning without an eval suite. Then you don't know if fine-tuning helped.

Q3: Explain how Grouped Query Attention (GQA) works and why it matters for inference.

Answer: Standard Multi-Head Attention (MHA) has one key and one value projection per attention head. In a 32-head model, every token produces 32 key vectors and 32 value vectors per layer, and all of them are stored in the KV cache.

GQA groups query heads to share K/V pairs:

32 query heads
8 KV heads (groups of 4 query heads share the same K, V)
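A minimal NumPy sketch of the sharing pattern, assuming causal self-attention over a single sequence and that consecutive query heads form a group. The function name and tensor layout are illustrative choices, not taken from any particular library.

```python
import numpy as np


def grouped_query_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Causal GQA. q: (n_q_heads, seq, d); k and v: (n_kv_heads, seq, d)."""
    n_q_heads, seq_len, head_dim = q.shape
    n_kv_heads = k.shape[0]
    assert n_q_heads % n_kv_heads == 0, "query heads must divide evenly into groups"
    group_size = n_q_heads // n_kv_heads

    # Each KV head is reused by group_size consecutive query heads.
    # Only the n_kv_heads originals would need to live in the KV cache.
    k_shared = np.repeat(k, group_size, axis=0)
    v_shared = np.repeat(v, group_size, axis=0)

    scores = q @ k_shared.transpose(0, 2, 1) / np.sqrt(head_dim)
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)  # hide future positions

    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v_shared


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.normal(size=(32, 6, 16))
    k = rng.normal(size=(8, 6, 16))
    v = rng.normal(size=(8, 6, 16))
    print(grouped_query_attention(q, k, v).shape)
```

Setting the number of KV heads equal to the number of query heads recovers MHA, and a single KV head gives MQA.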

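KV cache reduction: with 32 query heads and 8 KV heads the cache is 4× smaller for the same model size; in general the reduction factor is (query heads) / (KV heads). For a 70B-class model (80 layers, 64 query heads, head_dim 128) in fp16 at 4K context length:

MHA (64 KV heads): 80 layers × 64 heads × 128 head_dim × 2 (K, V) × 2 bytes ≈ 2.6 MB per token, or about 10.7 GB per 4K-token sequence
GQA (8 KV heads): about 0.33 MB per token, or about 1.34 GB per 4K-token sequence (8× smaller)

This reduction in KV cache memory allows: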

Larger batch sizes (more requests per GPU)
Longer contexts within the same memory budget
More memory available for model weights (enabling larger models on the same hardware)
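The cache arithmetic is easy to check with a small helper. This sketch assumes fp16 values (2 bytes each) and a 70B-style shape of 80 layers, 64 query heads and head_dim 128; the function name is illustrative.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    context_len: int,
    bytes_per_value: int = 2,  # fp16 / bf16
    batch_size: int = 1,
) -> int:
    """Bytes needed to cache keys and values for every layer and token."""
    keys_and_values = 2
    return (
        keys_and_values * num_layers * num_kv_heads * head_dim
        * context_len * bytes_per_value * batch_size
    )


if __name__ == "__main__":
    mha = kv_cache_bytes(num_layers=80, num_kv_heads=64, head_dim=128, context_len=4096)
    gqa = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, context_len=4096)
    print(f"MHA: {mha / 1e9:.2f} GB")   # MHA: 10.74 GB
    print(f"GQA: {gqa / 1e9:.2f} GB")   # GQA: 1.34 GB
    print(f"reduction: {mha // gqa}x")  # reduction: 8x
```

The cost is per sequence, so it multiplies with batch size. That is why the saving translates directly into more concurrent requests per GPU.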

Quality tradeoff: Empirically minimal — LLaMA-3 8B uses 8 KV heads vs 32 query heads with minimal quality degradation. The query heads can still differentiate different attention patterns even when sharing K/V.

Multi-Query Attention (MQA) goes further — all query heads share a single K, V. More memory efficient but larger quality drop. GQA is the sweet spot.

Q4: What are the main failure modes of RLHF and how does DPO address them?

Answer:

RLHF/PPO failure modes:

Reward hacking: The policy finds inputs that score high on the RM without actually being better. Common: verbosity (if RM associates length with quality), certain phrases, sycophantic responses.
KL instability: If KL coefficient is too low, the policy diverges and generates nonsensical text. Too high, and alignment barely happens.
Reward model overfitting: If RM training data is small, the RM is overfit to specific patterns and generalizes poorly to novel inputs.
Online generation bottleneck: PPO requires generating samples during training, which is slow and memory-intensive. Hard to parallelize efficiently.
4-model memory requirement: Policy, reference, reward, and value head all in GPU memory simultaneously.

How DPO addresses these:

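No explicit reward model: Removes the separately trained RM and its overfitting. The implicit reward can still be over-optimized (for example, toward longer responses), so reward hacking is reduced rather than eliminated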
Offline training: Trains on a fixed preference dataset — no slow online generation step
2-model memory: Policy and reference model only (no RM, no value head)
Simpler hyperparameters: Beta (KL coefficient) is the main knob; no clip ratio, value loss coefficient, etc.
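For reference, the DPO objective for a single preference pair can be written in a few lines. This sketch assumes the sequence-level log-probabilities have already been computed by the policy and the frozen reference model; the function and argument names are illustrative.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio)) for one pair."""
    chosen_log_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_log_ratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_log_ratio - rejected_log_ratio)
    # -log(sigmoid(x)) == log(1 + exp(-x)), evaluated in a numerically stable way.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


if __name__ == "__main__":
    # Policy identical to the reference: the loss is log(2).
    print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
    # Policy has shifted probability toward the chosen response: lower loss.
    print(dpo_loss(-10.0, -18.0, -12.0, -15.0))
```

Only two sets of log-probabilities are needed per pair, which is where the two-model memory footprint comes from. Beta controls how strongly the policy is held near the reference.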

DPO's own failure modes:

Sensitive to preference data quality — noisy preferences hurt more than in PPO
Offline learning: can't improve beyond what's in the preference dataset
Requires good SFT initialization — DPO on top of a weak SFT doesn't work well

Q5: Describe the NTK-aware RoPE scaling technique and why it works.

Answer: RoPE (Rotary Position Embedding) encodes position by rotating query and key vectors. The rotation frequency for dimension pair d is: θ_d = 1 / (base^(2d/D)). At training max length L, positions 0 to L-1 are seen. Beyond L, the model receives position indices outside its training distribution.

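Linear scaling (position interpolation) problem: Dividing positions by a scale factor s = L_target/L_train squeezes every position back into the trained range, so position L_target maps to L_train. The cost is resolution: adjacent tokens are now only 1/s of a position apart, which compresses the high-frequency dimensions that encode local order. Without fine-tuning, the model finds nearby tokens harder to tell apart.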

NTK-aware scaling intuition: Instead of scaling positions, scale the RoPE base. A larger base leaves the highest-frequency dimension untouched and lowers the frequency of the others progressively more, so the longest wavelengths are stretched the most:

base_scaled = base × (L_target/L_train)^(D/(D-2))

Why this works: High-frequency RoPE dimensions handle local position structure (nearby token relationships); low-frequency dimensions handle global position (long-range relationships). With the scaled base, dimension pair d has its frequency multiplied by s^(-2d/(D-2)), where s = L_target/L_train, so that:

The highest-frequency pair (d = 0) is unchanged, which preserves local resolution
The lowest-frequency pair (d = D/2 - 1) is slowed by exactly 1/s, the same as linear interpolation, so its rotation angles stay within the range seen in training

The key mathematical insight: the correction is spread unevenly across dimensions through the frequency change. Long-wavelength dimensions are interpolated, while short-wavelength dimensions are left nearly untouched instead of being compressed as they are when positions are scaled directly.
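The effect of the base change on each dimension pair can be verified directly. This sketch assumes the common defaults of base 10000 and head dimension 128, and uses illustrative function names.

```python
def rope_frequencies(head_dim: int, base: float = 10000.0) -> list[float]:
    """theta_d = base^(-2d/D) for each dimension pair d = 0 .. D/2 - 1."""
    return [base ** (-2 * d / head_dim) for d in range(head_dim // 2)]


def ntk_scaled_base(base: float, scale: float, head_dim: int) -> float:
    """base_scaled = base * scale^(D / (D - 2))."""
    return base * scale ** (head_dim / (head_dim - 2))


if __name__ == "__main__":
    head_dim, base, scale = 128, 10000.0, 4.0  # e.g. 8K -> 32K
    original = rope_frequencies(head_dim, base)
    scaled = rope_frequencies(head_dim, ntk_scaled_base(base, scale, head_dim))
    ratios = [new / old for new, old in zip(scaled, original)]
    print(f"highest-frequency pair ratio: {ratios[0]:.4f}")   # 1.0000
    print(f"lowest-frequency pair ratio:  {ratios[-1]:.4f}")  # 0.2500
```

Linear position scaling would instead multiply every pair's effective frequency by 1/s = 0.25, including the high-frequency pairs that carry local order.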

Practical result: NTK-aware scaling is typically reported to extend context several-fold (for example, from 8K to 32K tokens) with only a small perplexity increase and no fine-tuning, though the exact degradation depends on the model and the evaluation data. With a few hundred steps of fine-tuning on long documents, the extension can hold at 4-8× the original context length.

Q6: How do you evaluate an LLM for clinical use, and what metrics matter?

Answer: Generic benchmarks (MMLU, HumanEval) are necessary but not sufficient for clinical deployment. A rigorous clinical evaluation requires domain-specific assessment:

Evaluation layers:

1. Retrieval evaluation (for RAG systems):

Recall@5: Are the relevant documents retrieved in the top 5?
MRR (Mean Reciprocal Rank): Average of 1/rank of the first relevant document across queries (1.0 means the first hit is always relevant)

2. Factual accuracy by category:

Drug interaction severity: compare to Lexicomp/Micromedex ground truth
Renal dosing: compare to KDIGO guidelines or FDA prescribing information
Pharmacokinetics: compare to established reference values
Per-category pass rate with human pharmacist review on failed cases

3. Safety-critical behaviors:

Refusal rate on clearly harmful requests (target: 100%)
False refusal rate on appropriate clinical questions (target: under 5%)
Uncertainty expression: does the model say "I'm not certain" when it should?

4. Adversarial robustness:

Fabrication resistance: does asking about a fictitious drug result in fabricated information?
Authority claim resistance: does claiming to be a doctor unlock restricted content?
Sycophancy: does the model maintain correct answers when users push back?

5. Clinical bias:

Do response quality or completeness differ across patient demographics?
Gender, race, age-based variation in pharmacotherapy recommendations

Measurement approach: LLM-as-judge with a fixed rubric for scalable evaluation; pharmacist review for calibration of the judge; category-specific thresholds (safety must be 100%, accuracy target varies by risk level).
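The retrieval metrics from layer 1 are simple to compute once each query has a ranked result list and a set of relevant document IDs. The sketch below assumes string IDs and binary relevance; the function names and the example IDs are illustrative.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)


if __name__ == "__main__":
    queries = [
        (["doc3", "doc7", "doc1"], {"doc7"}),          # first hit at rank 2
        (["doc2", "doc9", "doc4"], {"doc2", "doc4"}),  # first hit at rank 1
        (["doc5", "doc6", "doc8"], {"doc0"}),          # no hit
    ]
    print(mean_reciprocal_rank(queries))               # 0.5
    print(recall_at_k(*queries[1], k=2))               # 0.5
```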

Q7: Explain the Chinchilla scaling laws and their practical implications.

Answer: The Chinchilla paper (Hoffmann et al., DeepMind, 2022) established the optimal compute allocation between model size and training data:

The finding: For a given compute budget C (in FLOPs, using the approximation C ≈ 6ND), the optimal model has approximately N* ≈ (C/120)^0.5 parameters, trained on D* ≈ 20N* tokens.

Rule of thumb: 20 tokens per parameter for optimal training.

This overturned a prior assumption from the Kaplan et al. (2020) scaling laws, which concluded that model size should scale faster than data. Chinchilla found that parameters and tokens should grow in roughly equal proportion as compute increases; by that standard, the large models of the Kaplan era were undertrained.

Practical implications:

GPT-3 (175B, 300B tokens) was undertrained. Chinchilla-optimal for 175B parameters is 3.5 trillion tokens. LLaMA and subsequent models applied this insight — LLaMA-3-8B was trained on 15 trillion tokens (1875 tokens/parameter).
Smaller, well-trained models beat larger undertrained models. Chinchilla 70B outperformed GPT-3 175B on most benchmarks despite having fewer parameters, because it was trained optimally.
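Inference cost matters. A 7B model trained on 140B tokens and a 70B model trained on 14B tokens cost about the same training compute (C ≈ 6ND), but the 7B run sits at the Chinchilla-optimal 20 tokens per parameter while the 70B run is badly undertrained, and at inference the 7B model is roughly 10× cheaper to run. Training is a one-time cost; inference is ongoing.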
Practical divergence: In practice, researchers often train smaller models on far more data than Chinchilla-optimal for the benefit of cheaper inference. LLaMA-3-8B is "over-trained" by Chinchilla standards but produces a small, high-quality model suitable for deployment.
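The numbers in this answer follow from two relations: training compute C ≈ 6ND and the 20 tokens-per-parameter rule. Substituting D = 20N gives C ≈ 120N², so N* ≈ sqrt(C/120). The sketch below applies that under those two assumptions; the function names are illustrative.

```python
import math


def training_flops(params: float, tokens: float) -> float:
    """Standard approximation: about 6 FLOPs per parameter per training token."""
    return 6.0 * params * tokens


def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0) -> tuple[float, float]:
    """Return (N*, D*) for a compute budget, assuming C = 6ND and D = tokens_per_param * N."""
    params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    return params, tokens_per_param * params


if __name__ == "__main__":
    # Chinchilla itself: 70B parameters, 1.4T tokens.
    budget = training_flops(70e9, 1.4e12)
    n_opt, d_opt = chinchilla_optimal(budget)
    print(f"N* = {n_opt / 1e9:.0f}B params, D* = {d_opt / 1e12:.1f}T tokens")

    # Tokens per parameter for the models mentioned in the text.
    print(f"GPT-3:      {300e9 / 175e9:.1f} tokens/param")
    print(f"LLaMA-3-8B: {15e12 / 8e9:.0f} tokens/param")
```

Running it recovers the 70B / 1.4T configuration from Chinchilla's own budget and shows GPT-3 at under 2 tokens per parameter, against 1875 for LLaMA-3-8B.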

Q8: What is catastrophic forgetting and how do methods like LoRA mitigate it?

Answer: Catastrophic forgetting is the tendency for neural networks to forget previously learned tasks when fine-tuned on new ones. When you fine-tune a pretrained LLM on a narrow domain, it can "overwrite" general language capabilities with domain-specific patterns.

Why it happens: Gradient updates during fine-tuning move all parameters toward the new objective. Weights that encoded general capabilities are modified to better fit the new domain data, destroying the prior encoding.

LoRA mitigation: LoRA (Low-Rank Adaptation) freezes all original weights and adds small adapter matrices (A and B of rank r) to selected weight matrices. Only A and B are trained.

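W_adapted = W_pretrained + (α/r)·BA

Here B is d_out × r and A is r × d_in, so the update BA has rank at most r. Since W_pretrained is frozen, the base model's capabilities remain recoverable. The adapters learn the domain-specific adjustments on top of the frozen base. The adapted model can still drift on general tasks, but a low-rank update typically changes behavior less than full fine-tuning.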

Why this works for catastrophic forgetting:

The pretrained weights encode general knowledge and remain unchanged
Adapters encode only the delta for the new domain
If you remove the adapters, you get the original model back perfectly
Multiple adapter sets can exist (swap adapters to switch domains without retraining)
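A minimal NumPy sketch of a LoRA-wrapped linear layer illustrates these properties. It assumes the usual initialization (B at zero, A small random) and the alpha/rank scaling; the class is illustrative and omits training.

```python
import numpy as np


class LoRALinear:
    """Frozen weight plus a low-rank update: W + (alpha / rank) * B @ A."""

    def __init__(self, weight: np.ndarray, rank: int, alpha: float, seed: int = 0):
        rng = np.random.default_rng(seed)
        out_features, in_features = weight.shape
        self.weight = weight                                    # frozen, never updated
        self.A = rng.normal(0.0, 0.01, size=(rank, in_features))
        self.B = np.zeros((out_features, rank))                 # zero init: no change at start
        self.scaling = alpha / rank

    def delta(self) -> np.ndarray:
        """The adapter's contribution to the effective weight matrix."""
        return self.scaling * (self.B @ self.A)

    def forward(self, x: np.ndarray, use_adapter: bool = True) -> np.ndarray:
        if use_adapter:
            return x @ (self.weight + self.delta()).T
        return x @ self.weight.T


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    layer = LoRALinear(rng.normal(size=(4, 6)), rank=2, alpha=4.0)
    x = rng.normal(size=(3, 6))
    base_out = layer.forward(x, use_adapter=False)
    print(np.allclose(layer.forward(x), base_out))   # True: B starts at zero

    layer.B = rng.normal(size=(4, 2))                # stand-in for trained adapter weights
    print(np.allclose(layer.forward(x), base_out))   # False: adapter now changes the output
    print(np.allclose(layer.forward(x, use_adapter=False), base_out))  # True: base intact
```

Swapping domains amounts to swapping the `A` and `B` pair while `weight` stays shared.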

Additional techniques:

Elastic Weight Consolidation (EWC): Add a regularization term that penalizes large changes to weights that were important for previous tasks
Continual Learning with replay: Mix some pretraining data into fine-tuning to maintain general capabilities
Parameter-efficient methods generally: Adapters, prefix tuning, and prompt tuning all preserve base model weights

Q9: How does Constitutional AI (CAI) differ from RLHF for alignment, and what are its advantages?

Answer: RLHF relies on human preference labelers to score responses and generate training signal. CAI (Anthropic, 2022) replaces most human labeling with AI-generated preference labels guided by a written constitution.

CAI stages:

SL-CAI (Supervised Learning): Generate responses, critique them against the constitution (via self-critique), revise them. Fine-tune on the revised responses. This teaches the model to be helpful and harmless from principles, not just demonstrations.
RLAIF (RL from AI Feedback): Generate preference pairs by asking a separate model to evaluate responses against the constitution. Use these AI-generated preferences to train a reward model, then apply PPO.
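The SL-CAI stage can be sketched as a critique-and-revise loop. This is a simplified illustration, not Anthropic's implementation: `generate` is a hypothetical callable standing in for any text-generation model, and the prompt wording is invented for the example.

```python
from typing import Callable


def critique_and_revise(
    prompt: str,
    principles: list[str],
    generate: Callable[[str], str],  # hypothetical: maps an input string to model text
) -> dict[str, str]:
    """Produce one SL-CAI training example by revising against each principle in turn."""
    initial = generate(prompt)
    response = initial
    for principle in principles:
        critique = generate(
            f"Critique the response against this principle: {principle}\n\n"
            f"Prompt: {prompt}\nResponse: {response}"
        )
        response = generate(
            f"Rewrite the response to address the critique.\n\n"
            f"Critique: {critique}\nPrompt: {prompt}\nResponse: {response}"
        )
    # The (prompt, revised) pair is what supervised fine-tuning would train on.
    return {"prompt": prompt, "initial": initial, "revised": response}


if __name__ == "__main__":
    def fake_model(text: str) -> str:
        """Stand-in model so the sketch runs without any API."""
        return f"[model output for {len(text)} chars of input]"

    example = critique_and_revise(
        "How should I store this medication?",
        ["Avoid giving unsafe medical advice.", "Be honest about uncertainty."],
        fake_model,
    )
    print(example["revised"])
```

Each principle costs two extra model calls (one critique, one revision), which is the price paid for replacing human-written demonstrations.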

Advantages over RLHF:

Scalability: AI labeling scales to billions of pairs; human labeling scales to hundreds of thousands
Consistency: AI applies the same principles consistently; human labelers often disagree with one another on a substantial share of comparisons (figures around 25-30% are commonly cited)
Transparency: The constitution is a human-readable document — you can inspect and modify what the model is optimizing for
Reduced human exposure: Human labelers in RLHF are exposed to harmful content to label it; AI labelers aren't harmed

Tradeoffs:

The AI labeler inherits biases from its own training
Constitutional principles must be carefully written to avoid unintended interpretations
May miss harm types not covered by the constitution

Practical outcome: Anthropic found CAI-trained models were significantly less harmful with comparable helpfulness to RLHF — and at lower human labeling cost.

Q10: System design — how would you build a multi-modal clinical documentation assistant?

Scenario: Clinicians dictate notes, attach lab images and ECGs. Build an AI system that generates structured clinical documentation.

Answer:

Input pipeline:

Audio recording → Whisper (STT) → Clinical transcript
Lab images → GPT-4V → Structured lab result extraction
ECG images → GPT-4V + clinical prompt → Rhythm classification + interval measurements

Processing pipeline:

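```python
def process_clinical_encounter(
    audio_path: str,
    image_paths: list[str],
    patient_context: dict,
) -> dict:
    """Multi-modal clinical documentation generation.

    The helpers called here (transcription, image classification, extraction,
    note generation) are assumed to be defined elsewhere in the system.
    """
    # 1. Transcribe audio
    transcript = transcribe_clinical_audio(audio_path)

    # 2. Extract structured data from images
    structured_data = []
    unprocessed_images = []
    for img_path in image_paths:
        img_type = classify_image_type(img_path)  # "lab_report", "ecg", or other
        if img_type == "lab_report":
            structured_data.append(extract_lab_results(img_path))
        elif img_type == "ecg":
            structured_data.append(analyze_ecg(img_path))
        else:
            # Keep unsupported types (e.g. x-ray) visible for human review
            # instead of dropping them silently.
            unprocessed_images.append(img_path)

    # 3. Generate structured documentation
    documentation = generate_clinical_note(
        transcript=transcript,
        structured_data=structured_data,
        patient_context=patient_context,
    )
    documentation["unprocessed_images"] = unprocessed_images

    return documentation
```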

Safety and compliance:

Keep PHI on-premises wherever possible by using self-hosted models (LLaMA-3, Whisper); any hosted API that sees PHI must be covered by a BAA
Audit log every extraction and generation with model version
Human review required before EHR submission (AI as decision support, not final documentation)
Output validation: structured fields (ICD codes, drug names) checked against validated dictionaries
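The output-validation step can be as simple as set membership against approved vocabularies. The sketch below assumes the generated note exposes `icd_codes` and `medications` lists, which are hypothetical field names. The two tiny dictionaries are placeholders for a full ICD-10 table and a formulary.

```python
def validate_structured_fields(
    note: dict,
    valid_icd_codes: set[str],
    valid_drug_names: set[str],
) -> list[str]:
    """Return a list of human-readable problems; an empty list means the note passed."""
    problems = []
    for code in note.get("icd_codes", []):
        if code.strip().upper() not in valid_icd_codes:
            problems.append(f"unknown ICD code: {code!r}")
    for drug in note.get("medications", []):
        if drug.strip().lower() not in valid_drug_names:
            problems.append(f"unknown drug name: {drug!r}")
    return problems


if __name__ == "__main__":
    # Placeholder vocabularies; a real deployment would load validated dictionaries.
    icd_codes = {"I10", "E11.9"}
    drug_names = {"metformin", "lisinopril"}

    note = {
        "icd_codes": ["I10", "E11.99"],
        "medications": ["Metformin", "lisinoprill"],
    }
    for problem in validate_structured_fields(note, icd_codes, drug_names):
        print(problem)
```

Anything flagged here would be routed to the human reviewer rather than corrected automatically, since a near-miss drug name is exactly the kind of error that should not be guessed at.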

Quality evaluation:

Character error rate (CER) on transcript vs gold standard
Structured field extraction accuracy vs human extraction on 200 test cases
Documentation completeness scoring by clinician reviewers
Downstream: does the AI-assisted documentation reduce documentation time? (hours per clinician per week)
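Character error rate is the character-level edit distance between the system transcript and the gold transcript, divided by the gold transcript's length. A minimal pure-Python sketch follows; the function names are illustrative, and it applies no text normalization (casing, punctuation), which a real evaluation would need to define.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,                      # delete from reference
                current[j - 1] + 1,                   # insert into reference
                previous[j - 1] + substitution_cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the reference length."""
    if not reference:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    gold = "metoprolol 25 mg twice daily"
    transcript = "metoprolal 25 mg twice daily"
    print(f"CER = {character_error_rate(gold, transcript):.3f}")
```

A single wrong character in a drug name barely moves the aggregate CER, so it is worth reporting error rates on medication names and doses separately from the overall figure.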

Model choices:

STT: Whisper Large V3 (self-hosted on-premises, so audio never leaves the HIPAA-covered environment)
Vision: GPT-4o (with BAA) or Claude Sonnet 4.6 (with BAA)
Note generation: Claude Sonnet (strong at structured clinical text)
Validation: rule-based checks + LLM-as-judge on 1% sample
