Learnixo

GenAI & LLM Interviews · Lesson 5 of 30

Interview: LLMs Deep Dive (Part 2)

Q1: What is emergence in LLMs and how does it affect production AI planning?

Answer: Emergence refers to capabilities that appear abruptly as model scale increases: absent in smaller models, present in larger ones, and never explicitly trained for. Commonly cited examples, with thresholds that are approximate and vary by model family and benchmark: chain-of-thought reasoning emerges around 100B parameters; 3-digit arithmetic around 8B; in-context few-shot learning around 13B.

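Why it's not just gradual improvement: Under all-or-nothing metrics (exact match, pass/fail), a task that requires k correct steps succeeds with probability of roughly p^k, where p is the per-step accuracy. Smooth gains in p then look like a sudden jump: on a 10-step task, moving p from 0.80 to 0.95 lifts the full-task score from about 11% to about 60%. This creates the appearance of sudden capability emergence.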
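A minimal sketch of this metric effect, assuming a task made of independent steps that must all be correct. The function name and the example accuracies are illustrative, not taken from a specific benchmark:

```python
def exact_match_rate(per_step_accuracy: float, num_steps: int) -> float:
    """Probability that every step is correct, assuming independent steps."""
    return per_step_accuracy ** num_steps


# Per-step accuracy improves smoothly; the all-or-nothing score does not.
for p in (0.50, 0.70, 0.80, 0.90, 0.95, 0.99):
    score = exact_match_rate(p, num_steps=10)
    print(f"per-step accuracy {p:.2f} -> 10-step exact match {score:.3f}")
```

Plotting the per-step column gives a smooth curve, while the exact-match column stays near zero and then rises sharply, which is the pattern often reported as emergence.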

Production implications:

  1. Don't extrapolate current model limitations to next-generation models — a task your current model fails may work well after a model upgrade
  2. Run full eval suites on model upgrades, not just spot checks
  3. Safety evaluations are model-version-specific — a safe 7B system may behave differently as a 70B system
  4. Budget for capability surprises: some tasks you paid humans to do may be automatable after a model upgrade

Counter-consideration (Schaeffer et al., 2023): Many emergence results are metric artifacts. Measuring with smooth metrics (log-prob) shows gradual improvement rather than sudden jumps. The discontinuity is in the measurement, not the model. This is relevant when choosing evaluation metrics — binary pass/fail can hide gradual progress.


Q2: When do you choose fine-tuning over prompt engineering, and vice versa?

Answer:

Choose prompt engineering when:

  • The task is within the model's pretraining distribution (the model already has the knowledge)
  • You need to iterate quickly — prompt changes deploy in minutes
  • Format and behavioral constraints are the main need
  • The task type changes frequently (general-purpose system)
  • You're at an early stage without a clear target task

Choose fine-tuning when:

  • Prompt engineering has plateaued — you've optimized the prompt and still can't hit target quality
  • Consistent format is critical and prompts can't achieve it reliably (structured output, specific voice)
  • The task requires knowledge not in pretraining (proprietary terminology, internal process knowledge)
  • Cost at scale — a smaller fine-tuned model can outperform prompting a much larger model at lower cost
  • Latency — a fine-tuned 7B model may be fast enough where GPT-4 is too slow

Process: Start with prompt engineering. Build an eval suite first. When prompt optimization hits diminishing returns, use your eval suite to measure fine-tuning gains. If fine-tuning adds more than 5-10% on your eval suite and the dataset cost is justified, switch.

Common mistake: Fine-tuning without an eval suite. Then you don't know if fine-tuning helped.


Q3: Explain how Grouped Query Attention (GQA) works and why it matters for inference.

Answer: Standard Multi-Head Attention (MHA) has one key and value projection per attention head. In a 32-head model, you have 32 K and 32 V matrices — all stored in the KV cache.

GQA groups query heads to share K/V pairs:

  • 32 query heads
  • 8 KV heads (groups of 4 query heads share the same K, V)
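The sharing scheme can be sketched for a single decoding step in plain Python. This is an illustrative toy (list-based vectors, no batching, hypothetical function name), not how a production kernel is written; the point is only that several query heads read the same cached K/V:

```python
import math


def gqa_decode_step(q_heads, k_cache, v_cache):
    """One decoding step of grouped-query attention.

    q_heads: one query vector per query head for the new token.
    k_cache, v_cache: one entry per KV head, each a list of per-position vectors.
    Returns one output vector per query head.
    """
    n_q, n_kv = len(q_heads), len(k_cache)
    if n_q % n_kv != 0:
        raise ValueError("query heads must divide evenly into KV groups")
    group_size = n_q // n_kv
    scale = 1.0 / math.sqrt(len(q_heads[0]))

    outputs = []
    for head, q in enumerate(q_heads):
        kv = head // group_size  # consecutive query heads share one KV head
        keys, values = k_cache[kv], v_cache[kv]
        scores = [scale * sum(qi * ki for qi, ki in zip(q, key)) for key in keys]
        top = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        dim = len(values[0])
        outputs.append(
            [sum(w * val[j] for w, val in zip(weights, values)) for j in range(dim)]
        )
    return outputs


# 4 query heads sharing 2 KV heads, 2 cached positions, head_dim = 2.
q = [[1.0, 0.0]] * 4
k_cache = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
v_cache = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
out = gqa_decode_step(q, k_cache, v_cache)
print(len(out))  # one output per query head
```

Only `n_kv` key/value lists are stored, regardless of how many query heads read from them, which is where the cache savings come from.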

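KV cache reduction: The KV cache shrinks by the ratio of query heads to KV heads (4× in the 32/8 example above). For a 70B-class model with 64 query heads, at 4K context length in FP16:

  • MHA: 80 layers × 64 heads × 128 head_dim × 2 (K, V) × 2 bytes ≈ 2.6 MB per token ≈ 10.7 GB per 4K context
  • GQA (8 KV heads): ≈ 0.33 MB per token ≈ 1.34 GB per 4K context (8× smaller)

This reduction in KV cache memory allows: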

  • Larger batch sizes (more requests per GPU)
  • Longer contexts within the same memory budget
  • More memory available for model weights (enabling larger models on the same hardware)
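A rough calculator for KV cache size, assuming FP16 values (2 bytes) and the 80-layer, 128-dimension-head configuration used in the example. The function name and parameters are illustrative:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Bytes needed to cache K and V for `batch_size` sequences of `seq_len` tokens."""
    per_token = n_layers * n_kv_heads * head_dim * 2 * bytes_per_value  # 2 = K and V
    return per_token * seq_len * batch_size


mha = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128, seq_len=4096)
gqa = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=4096)
print(f"MHA: {mha / 1e9:.2f} GB, GQA: {gqa / 1e9:.2f} GB, ratio: {mha / gqa:.0f}x")
```

The cache grows linearly in `n_kv_heads`, `seq_len`, and `batch_size`, so the same function can be used to estimate how many concurrent requests fit in a given memory budget.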

Quality tradeoff: Empirically minimal — LLaMA-3 8B uses 8 KV heads vs 32 query heads with minimal quality degradation. The query heads can still differentiate different attention patterns even when sharing K/V.

Multi-Query Attention (MQA) goes further — all query heads share a single K, V. More memory efficient but larger quality drop. GQA is the sweet spot.


Q4: What are the main failure modes of RLHF and how does DPO address them?

Answer:

RLHF/PPO failure modes:

  1. Reward hacking: The policy finds outputs that score high on the reward model (RM) without actually being better. Common examples: verbosity (if the RM associates length with quality), certain stock phrases, sycophantic responses.

  2. KL instability: If KL coefficient is too low, the policy diverges and generates nonsensical text. Too high, and alignment barely happens.

  3. Reward model overfitting: If RM training data is small, the RM is overfit to specific patterns and generalizes poorly to novel inputs.

  4. Online generation bottleneck: PPO requires generating samples during training, which is slow and memory-intensive. Hard to parallelize efficiently.

  5. 4-model memory requirement: Policy, reference, reward model, and value (critic) model all in GPU memory simultaneously.

How DPO addresses these:

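  • No explicit reward model: Removes the separately trained RM, and with it RM overfitting and reward hacking against a learned scorer (the policy can still over-optimize the implicit reward, e.g. toward longer responses)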
  • Offline training: Trains on a fixed preference dataset — no slow online generation step
  • 2-model memory: Policy and reference model only (no RM, no value head)
  • Simpler hyperparameters: Beta (KL coefficient) is the main knob; no clip ratio, value loss coefficient, etc.
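For reference, the DPO objective for a single preference pair can be sketched as below. It assumes you already have summed sequence log-probabilities of the chosen and rejected responses under the policy and the frozen reference model; the function and argument names are illustrative:

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair: -log(sigmoid(beta * margin difference))."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) == log(1 + exp(-x)); branch to stay stable for large |x|.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# Policy identical to the reference: loss is log(2).
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# Policy has shifted probability toward the chosen response: loss drops.
print(dpo_loss(-8.0, -12.0, -10.0, -10.0))
```

Only two sets of log-probabilities are needed (policy and reference), which matches the 2-model memory footprint, and `beta` is the single knob controlling how strongly the policy is tied to the reference.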

DPO's own failure modes:

  • Sensitive to preference data quality — noisy preferences hurt more than in PPO
  • Offline learning: can't improve beyond what's in the preference dataset
  • Requires good SFT initialization — DPO on top of a weak SFT doesn't work well

Q5: Describe the NTK-aware RoPE scaling technique and why it works.

Answer: RoPE (Rotary Position Embedding) encodes position by rotating query and key vectors. The rotation frequency for dimension pair d is: θ_d = 1 / (base^(2d/D)). At training max length L, positions 0 to L-1 are seen. Beyond L, the model receives position indices outside its training distribution.

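Linear scaling problem: Dividing positions by a scale factor s = L_target/L_train maps position L_target down to L_train, so every position stays inside the trained range. But every dimension is compressed by the same factor, including the high-frequency ones: adjacent tokens end up only 1/s of a position apart, so the model loses the fine-grained local resolution it learned and typically needs fine-tuning to recover.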

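NTK-aware scaling intuition: Instead of scaling positions, scale the RoPE base. A larger base lowers the rotation frequency of each dimension pair by a different amount: the highest-frequency pair (d = 0) is unchanged, and the stretch grows smoothly to about L_target/L_train at the lowest-frequency pair: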

base_scaled = base × (L_target/L_train)^(D/(D-2))

Why this works: High-frequency RoPE dimensions handle local position structure (nearby token relationships); low-frequency dimensions handle global position (long-range relationships). NTK scaling treats them differently:

  • Local structure is preserved: high-frequency dimensions are left almost unchanged, and because they complete many full rotations within the training length, longer positions produce no unseen angles
  • Global structure is interpolated: low-frequency dimensions are slowed by close to L_target/L_train, so their rotation angles at L_target stay within the range seen up to L_train

The key mathematical insight: this approach keeps rotation angles in-distribution by changing the frequency of each dimension, rather than compressing all positions uniformly.
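The effect of the base change on individual dimensions can be checked numerically. This sketch uses the formulas given above with an assumed head dimension of 128, base 10000, and a 4× extension; the function names are illustrative:

```python
def rope_frequencies(head_dim: int, base: float = 10000.0) -> list[float]:
    """theta_d = base^(-2d/D) for d = 0 .. D/2 - 1."""
    return [base ** (-2 * d / head_dim) for d in range(head_dim // 2)]


def ntk_scaled_base(base: float, scale: float, head_dim: int) -> float:
    """base_scaled = base * scale^(D / (D - 2)), with scale = L_target / L_train."""
    return base * scale ** (head_dim / (head_dim - 2))


D, scale = 128, 4.0  # e.g. extending 8K -> 32K
original = rope_frequencies(D)
scaled = rope_frequencies(D, ntk_scaled_base(10000.0, scale, D))

print(f"highest-frequency ratio: {scaled[0] / original[0]:.3f}")
print(f"lowest-frequency ratio:  {scaled[-1] / original[-1]:.3f}")
```

The first ratio is 1 (the fastest-rotating pair is untouched) and the last is 1/scale (the slowest pair is fully interpolated), with the dimensions in between falling on a smooth curve between those two ends.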

Practical result: NTK scaling can extend context several-fold without any fine-tuning at a modest perplexity cost; figures such as 8K to 32K tokens with under 0.3 perplexity increase depend on the model and evaluation set, so verify them on your own data. With a few hundred steps of fine-tuning on long documents, the extension typically holds at 4-8× the original context length.


Q6: How do you evaluate an LLM for clinical use, and what metrics matter?

Answer: Generic benchmarks (MMLU, HumanEval) are necessary but not sufficient for clinical deployment. A rigorous clinical evaluation requires domain-specific assessment:

Evaluation layers:

1. Retrieval evaluation (for RAG systems):

  • Recall@5: Are the relevant documents retrieved in the top 5?
  • MRR (Mean Reciprocal Rank): Mean of 1/rank of the first relevant document across queries (1.0 means it is always ranked first)
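Both retrieval metrics are short to implement. A minimal sketch, assuming each query comes with a ranked list of retrieved document IDs and a set of known-relevant IDs (the data layout is an assumption for illustration):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Mean of 1/rank of the first relevant document (0 if none is retrieved)."""
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(runs) if runs else 0.0


runs = [
    (["a", "b", "c"], {"b"}),  # first relevant hit at rank 2
    (["x", "y"], {"x"}),       # first relevant hit at rank 1
    (["p"], {"q"}),            # nothing relevant retrieved
]
print(mean_reciprocal_rank(runs))
print(recall_at_k(["a", "b", "c", "d", "e", "f"], {"b", "f"}, k=5))
```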

2. Factual accuracy by category:

  • Drug interaction severity: compare to Lexicomp/Micromedex ground truth
  • Renal dosing: compare to KDIGO guidelines or FDA prescribing information
  • Pharmacokinetics: compare to established reference values
  • Per-category pass rate with human pharmacist review on failed cases

3. Safety-critical behaviors:

  • Refusal rate on clearly harmful requests (target: 100%)
  • False refusal rate on appropriate clinical questions (target: under 5%)
  • Uncertainty expression: does the model say "I'm not certain" when it should?

4. Adversarial robustness:

  • Fabrication resistance: does asking about a fictitious drug result in fabricated information?
  • Authority claim resistance: does claiming to be a doctor unlock restricted content?
  • Sycophancy: does the model maintain correct answers when users push back?

5. Clinical bias:

  • Do response quality or completeness differ across patient demographics?
  • Gender, race, age-based variation in pharmacotherapy recommendations

Measurement approach: LLM-as-judge with a fixed rubric for scalable evaluation; pharmacist review for calibration of the judge; category-specific thresholds (safety must be 100%, accuracy target varies by risk level).
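Category-specific thresholds can be enforced with a simple release gate. This is a sketch under assumptions: the category names and the 0.95 accuracy limit are hypothetical placeholders, while the 100% refusal and 5% false-refusal targets come from the list above:

```python
def release_gate(results: dict[str, float],
                 thresholds: dict[str, tuple[str, float]]) -> list[str]:
    """Return a description of every category that misses its threshold.

    results: category -> measured rate in [0, 1].
    thresholds: category -> ("min" or "max", limit).
    """
    failures = []
    for category, (kind, limit) in thresholds.items():
        value = results.get(category)
        if value is None:
            failures.append(f"{category}: not measured")
        elif kind == "min" and value < limit:
            failures.append(f"{category}: {value:.3f} is below {limit:.3f}")
        elif kind == "max" and value > limit:
            failures.append(f"{category}: {value:.3f} is above {limit:.3f}")
    return failures


thresholds = {
    "harmful_refusal_rate": ("min", 1.00),   # safety target: 100%
    "false_refusal_rate": ("max", 0.05),     # target: under 5%
    "renal_dosing_accuracy": ("min", 0.95),  # hypothetical risk-based limit
}
results = {"harmful_refusal_rate": 1.0, "false_refusal_rate": 0.08}
for failure in release_gate(results, thresholds):
    print(failure)
```

Treating an unmeasured category as a failure keeps a missing evaluation from silently passing the gate.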


Q7: Explain the Chinchilla scaling laws and their practical implications.

Answer: The Chinchilla paper (Hoffmann et al., DeepMind, 2022) established the optimal compute allocation between model size and training data:

The finding: For a given compute budget C ≈ 6ND FLOPs, loss is minimized by scaling parameters and tokens together, at roughly D* ≈ 20N* tokens. Substituting gives C ≈ 120N*², so N* ≈ (C/120)^0.5.

Rule of thumb: 20 tokens per parameter for optimal training.
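A quick way to apply the rule of thumb, assuming the standard approximation C ≈ 6ND for training FLOPs together with D ≈ 20N, which gives N ≈ sqrt(C / 120). The function name is illustrative:

```python
import math


def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Return (params, tokens) assuming C ~= 6*N*D and D ~= tokens_per_param * N."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    return n_params, tokens_per_param * n_params


budget = 6 * 70e9 * 1.4e12  # FLOPs for a 70B model trained on 1.4T tokens
n, d = chinchilla_optimal(budget)
print(f"params ~ {n / 1e9:.0f}B, tokens ~ {d / 1e12:.1f}T")
```

Passing a larger `tokens_per_param` shows how far a deliberately over-trained deployment model sits from the compute-optimal point for the same budget.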

This overturned a prior assumption from the Kaplan et al. (2020) scaling laws, which concluded that model size should scale faster than data. Kaplan's models were undertrained; Chinchilla shows equal scaling of both.

Practical implications:

  1. GPT-3 (175B, 300B tokens) was undertrained. Chinchilla-optimal for 175B parameters is 3.5 trillion tokens. LLaMA and subsequent models applied this insight — LLaMA-3-8B was trained on 15 trillion tokens (1875 tokens/parameter).

  2. Smaller, well-trained models beat larger undertrained models. Chinchilla 70B outperformed GPT-3 175B on most benchmarks despite having fewer parameters, because it was trained optimally.

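  3. Inference cost matters. A 7B model trained on 140B tokens costs the same training compute (C ≈ 6ND) as a 70B model trained on 14B tokens, but the 7B model sits at the Chinchilla ratio while the 70B model is badly undertrained, and at inference the 7B model is roughly 10× cheaper to run. Training is a one-time cost; inference is ongoing.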

  4. Practical divergence: In practice, researchers often train smaller models on far more data than Chinchilla-optimal for the benefit of cheaper inference. LLaMA-3-8B is "over-trained" by Chinchilla standards but produces a small, high-quality model suitable for deployment.


Q8: What is catastrophic forgetting and how do methods like LoRA mitigate it?

Answer: Catastrophic forgetting is the tendency for neural networks to forget previously learned tasks when fine-tuned on new ones. When you fine-tune a pretrained LLM on a narrow domain, it can "overwrite" general language capabilities with domain-specific patterns.

Why it happens: Gradient updates during fine-tuning move all parameters toward the new objective. Weights that encoded general capabilities are modified to better fit the new domain data, destroying the prior encoding.

LoRA mitigation: LoRA (Low-Rank Adaptation) freezes all original weights and adds small adapter matrices (A and B of rank r) to selected weight matrices. Only A and B are trained.

W_adapted = W_pretrained + (α/r)·BA, where B is d×r, A is r×k, and r ≪ min(d, k)

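Since W_pretrained is frozen, the original capabilities stay recoverable, and the low-rank update limits how far the effective weights can drift. The adapters learn the domain-specific adjustments on top of the frozen base; with the adapter active, forgetting is reduced rather than eliminated.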

Why this works for catastrophic forgetting:

  • The pretrained weights encode general knowledge and remain unchanged
  • Adapters encode only the delta for the new domain
  • If you remove the adapters, you get the original model back perfectly
  • Multiple adapter sets can exist (swap adapters to switch domains without retraining)
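The points above can be illustrated with a toy example in plain Python. It uses the common (α/r)·BA scaling convention with B of shape d×r and A of shape r×k; the tiny matrices and the "trained" values of B are made up for illustration:

```python
def matmul(x, y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def lora_effective_weight(w, a, b, alpha, rank):
    """Return W + (alpha / rank) * B @ A without modifying the frozen W."""
    delta = matmul(b, a)
    s = alpha / rank
    return [[w_ij + s * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


w = [[1.0, 2.0], [3.0, 4.0]]   # frozen pretrained weight (d=2, k=2)
a = [[0.5, -0.5]]              # trainable, shape (r=1, k=2)
b_init = [[0.0], [0.0]]        # B starts at zero, so the adapter is a no-op
b_trained = [[1.0], [2.0]]     # hypothetical values after fine-tuning

untouched = lora_effective_weight(w, a, b_init, alpha=2, rank=1)
adapted = lora_effective_weight(w, a, b_trained, alpha=2, rank=1)
print(untouched == w)  # adapter with zero B reproduces the base model
print(adapted)         # base weights plus the low-rank delta
print(w)               # the frozen weights themselves never change
```

Because the delta is computed separately from `w`, dropping or swapping the (A, B) pair is all it takes to return to the base model or switch domains.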

Additional techniques:

  • Elastic Weight Consolidation (EWC): Add a regularization term that penalizes large changes to weights that were important for previous tasks
  • Continual Learning with replay: Mix some pretraining data into fine-tuning to maintain general capabilities
  • Parameter-efficient methods generally: Adapters, prefix tuning, and prompt tuning all preserve base model weights

Q9: How does Constitutional AI (CAI) differ from RLHF for alignment, and what are its advantages?

Answer: RLHF relies on human preference labelers to score responses and generate training signal. CAI (Anthropic, 2022) replaces most human labeling with AI-generated preference labels guided by a written constitution.

CAI stages:

  1. SL-CAI (Supervised Learning): Generate responses, critique them against the constitution (via self-critique), revise them. Fine-tune on the revised responses. This teaches the model to be helpful and harmless from principles, not just demonstrations.

  2. RLAIF (RL from AI Feedback): Generate preference pairs by asking a separate model to evaluate responses against the constitution. Use these AI-generated preferences to train a reward model, then apply PPO.
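The SL-CAI critique-and-revise loop from stage 1 can be sketched as follows. The `generate` callable stands in for any text-generation call, and the prompt wording is a hypothetical placeholder rather than the wording used in the original work:

```python
from typing import Callable


def critique_and_revise(prompt: str, principles: list[str],
                        generate: Callable[[str], str]) -> str:
    """Draft a response, then critique and revise it once per principle."""
    response = generate(prompt)
    for principle in principles:
        critique = generate(
            f"Critique the response against this principle: {principle}\n\n"
            f"Prompt: {prompt}\n\nResponse: {response}"
        )
        response = generate(
            f"Revise the response to address the critique.\n\n"
            f"Critique: {critique}\n\nResponse: {response}"
        )
    return response  # revised responses become supervised fine-tuning targets


# Stub generator so the sketch runs without a model.
calls = []

def fake_generate(text: str) -> str:
    calls.append(text)
    return f"draft-{len(calls)}"

final = critique_and_revise("Q", ["be harmless", "be honest"], fake_generate)
print(final, len(calls))
```

Each principle costs two extra generations (one critique, one revision), so the number of model calls grows linearly with the size of the constitution subset applied per prompt.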

Advantages over RLHF:

  • Scalability: AI labeling scales to billions of pairs; human labeling scales to hundreds of thousands
  • Consistency: AI applies the same principles consistently; human labelers have 25-30% inter-rater disagreement
  • Transparency: The constitution is a human-readable document — you can inspect and modify what the model is optimizing for
  • Reduced human exposure: Human labelers in RLHF are exposed to harmful content to label it; AI labelers aren't harmed

Tradeoffs:

  • The AI labeler inherits biases from its own training
  • Constitutional principles must be carefully written to avoid unintended interpretations
  • May miss harm types not covered by the constitution

Practical outcome: Anthropic found CAI-trained models were significantly less harmful with comparable helpfulness to RLHF — and at lower human labeling cost.


Q10: System design — how would you build a multi-modal clinical documentation assistant?

Scenario: Clinicians dictate notes, attach lab images and ECGs. Build an AI system that generates structured clinical documentation.

Answer:

Input pipeline:

Audio recording → Whisper (STT) → Clinical transcript
Lab images → GPT-4V → Structured lab result extraction
ECG images → GPT-4V + clinical prompt → Rhythm classification + interval measurements

Processing pipeline:

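```python
def process_clinical_encounter(
    audio_path: str,
    image_paths: list[str],
    patient_context: dict,
) -> dict:
    """Multi-modal clinical documentation generation.

    The helpers called below (transcribe_clinical_audio, classify_image_type,
    extract_lab_results, analyze_ecg, generate_clinical_note) are assumed to
    be defined elsewhere in the system.
    """
    # 1. Transcribe audio
    transcript = transcribe_clinical_audio(audio_path)

    # 2. Extract structured data from images
    structured_data = []
    unhandled_images = []
    for img_path in image_paths:
        img_type = classify_image_type(img_path)  # e.g. "lab_report", "ecg", "x-ray"
        if img_type == "lab_report":
            structured_data.append(extract_lab_results(img_path))
        elif img_type == "ecg":
            structured_data.append(analyze_ecg(img_path))
        else:
            # No extractor for this type yet: keep it for human review
            # instead of dropping it silently.
            unhandled_images.append(img_path)

    # 3. Generate structured documentation
    documentation = generate_clinical_note(
        transcript=transcript,
        structured_data=structured_data,
        patient_context=patient_context,
    )
    documentation["unhandled_images"] = unhandled_images
    return documentation
```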

Safety and compliance:

  • Keep PHI on-premises where possible using self-hosted models (LLaMA-3, Whisper); send PHI to a hosted API only under a signed BAA
  • Audit log every extraction and generation with model version
  • Human review required before EHR submission (AI as decision support, not final documentation)
  • Output validation: structured fields (ICD codes, drug names) checked against validated dictionaries

Quality evaluation:

  • Character error rate (CER) on transcript vs gold standard
  • Structured field extraction accuracy vs human extraction on 200 test cases
  • Documentation completeness scoring by clinician reviewers
  • Downstream: does the AI-assisted documentation reduce documentation time? (hours per clinician per week)
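Character error rate is the character-level edit distance between the system transcript and the gold transcript, divided by the gold transcript's length. A minimal sketch, assuming both transcripts have already been normalized (casing, punctuation) the same way:

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,              # deletion
                curr[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),   # substitution (or match, cost 0)
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / number of characters in the reference."""
    if not reference:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


print(character_error_rate("metoprolol 25 mg", "metoprolol 25 mg"))
print(character_error_rate("abcd", "abxd"))
```

For clinical dictation it can be worth reporting CER separately on drug names and doses, since a single wrong character there matters far more than one in free narrative text.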

Model choices:

  • STT: Whisper Large V3 (self-hosted on-premises, so audio stays inside the HIPAA-covered environment)
  • Vision: GPT-4o (with BAA) or Claude Sonnet 4.6 (with BAA)
  • Note generation: Claude Sonnet (strong at structured clinical text)
  • Validation: rule-based checks + LLM-as-judge on 1% sample