Interview: LLMs Deep Dive (Part 1)

Q1: Explain the pretraining objective of GPT-style models and why it scales so well.

Answer: GPT models are trained with the causal language modeling objective: predict the next token given all previous tokens. The loss is cross-entropy over the vocabulary at every token position:

ℒ = -(1/T) Σᵢ log P(tᵢ | t₁, ..., tᵢ₋₁)
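The loss above can be sketched in a few lines of plain Python. This is an illustrative sketch, not a training implementation: the function names and the toy logits are assumptions, and the first token is skipped as a target because it has no preceding context.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def causal_lm_loss(logits_per_position, token_ids):
    """Mean next-token cross-entropy.

    logits_per_position[i] are the logits produced after reading
    token_ids[:i + 1]; they are scored against token_ids[i + 1].
    """
    targets = token_ids[1:]
    total = 0.0
    for logits, target in zip(logits_per_position, targets):
        total -= log_softmax(logits)[target]
    return total / len(targets)

# Toy example: vocabulary of 3 tokens, sequence of 3 tokens.
tokens = [0, 2, 1]
uniform_logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
confident_logits = [[0.0, 0.0, 10.0], [0.0, 10.0, 0.0]]
print(causal_lm_loss(uniform_logits, tokens))    # equals log(3) for a uniform model
print(causal_lm_loss(confident_logits, tokens))  # close to zero
```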

Why it scales well:

Dense supervision: Every token position contributes a loss term, so a 2048-token document yields 2048 training signals. Compare this with image classification, where a whole image yields a single label.
Abundant unlabeled data: Almost any text is usable as training data, with no annotation required. Public web text is often estimated at roughly 10-100 trillion tokens.
Task compression: To minimize next-token prediction loss, the model has to pick up much of what is encoded in language: grammar, facts, reasoning patterns, common sense. These capabilities all emerge from one simple objective.
Smooth scaling: Loss decreases predictably (power law) with both compute and data, enabling confident extrapolation.

The scaling-law study by Kaplan et al. (2020) found that loss follows L ∝ N^-0.076 (model size) and L ∝ D^-0.095 (dataset size) when the other factor is not the bottleneck. The compute-optimal split between the two was later revised by the Chinchilla study (Hoffmann et al., 2022) to roughly 20 training tokens per parameter.
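To get a feel for what these exponents mean, the arithmetic can be sketched as below. The exponents and the 20-tokens-per-parameter ratio are the figures quoted above; the function names and the 7B example are illustrative assumptions, and each power law is applied in isolation (the other resource is assumed not to be the bottleneck).

```python
ALPHA_N = 0.076        # model-size exponent quoted above
ALPHA_D = 0.095        # dataset-size exponent quoted above
TOKENS_PER_PARAM = 20  # rough Chinchilla rule of thumb quoted above

def loss_ratio(scale_factor, alpha):
    """Multiplicative change in loss when one resource grows by scale_factor."""
    return scale_factor ** (-alpha)

def chinchilla_tokens(n_params):
    """Rough compute-optimal training-token budget for a given parameter count."""
    return TOKENS_PER_PARAM * n_params

print(f"10x parameters -> loss multiplied by {loss_ratio(10, ALPHA_N):.3f}")
print(f"10x data       -> loss multiplied by {loss_ratio(10, ALPHA_D):.3f}")
print(f"7B parameters  -> about {chinchilla_tokens(7e9) / 1e9:.0f}B tokens")
```

The small exponents are the point: a 10x increase in either resource buys only a modest multiplicative drop in loss, which is why scaling is measured in orders of magnitude.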

Q2: What is the purpose of the softmax temperature in generation, and what happens at the extremes?

Answer: Temperature T scales logits before softmax:

P(token_i) = exp(logit_i / T) / Σⱼ exp(logit_j / T)

At T → 0 (greedy): The distribution collapses to argmax. The highest-probability token is always selected. Deterministic, but no exploration — can loop and lacks creativity.

At T = 1.0: Sample from the model's trained distribution directly.

At T → ∞: Distribution becomes uniform over all tokens. Output is random noise.
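The three regimes can be checked numerically with a small sketch. The function name and the toy logits are assumptions chosen for illustration; T = 0 is handled as a separate argmax case because the formula divides by T.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature, computed stably."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for T = 0")
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.0]
for t in (0.1, 1.0, 100.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At T = 0.1 nearly all mass sits on the top token, and at T = 100 the three probabilities are almost equal.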

Practical values:

T = 0: Factual Q&A, drug interactions, clinical decisions — one right answer, no variation wanted
T = 0.2-0.4: Code generation — mostly deterministic but allows style variation
T = 0.7: General chat — natural, non-repetitive variation
T = 0.9-1.2: Creative writing — diversity and surprise are desirable

Common misconception: Temperature is not about the model's "confidence." High temperature doesn't mean the model is uncertain — it means the sampling is more random. The model's logits are unchanged; only the sampling distribution changes.

Q3: Describe the RLHF pipeline end-to-end, including why the KL penalty is necessary.

Answer: RLHF has three stages:

Stage 1 (SFT): Fine-tune a pretrained model on demonstrations of desired behavior. This gives a competent base with basic instruction-following.

Stage 2 (Reward Model): Human labelers rank pairs of responses. A reward model (same architecture, different output head: scalar instead of logits) is trained to predict human preferences using the Bradley-Terry loss:

ℒ = -log σ(r(x, y_preferred) - r(x, y_rejected))
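As a minimal sketch, the per-pair loss can be written as a softplus of the negative reward margin, which avoids overflow for large margins. The function name is an assumption; in practice the two scalar rewards would come from the reward model's output head.

```python
import math

def bradley_terry_loss(r_preferred, r_rejected):
    """-log sigmoid(r_preferred - r_rejected), computed as softplus(-margin)."""
    margin = r_preferred - r_rejected
    if margin > 0:
        return math.log1p(math.exp(-margin))
    # For margin <= 0 use the identity log(1 + e^-m) = -m + log(1 + e^m).
    return -margin + math.log1p(math.exp(margin))

print(bradley_terry_loss(2.0, 0.0))  # small: the ranking is already correct
print(bradley_terry_loss(0.0, 0.0))  # log(2): the model is indifferent
print(bradley_terry_loss(0.0, 2.0))  # large: the ranking is reversed
```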

Stage 3 (PPO): Optimize the SFT model to maximize reward model scores, using PPO. The objective:

max_π E[r(x, y)] - β·KL(π || π_SFT)

Why the KL penalty is critical: The reward model is an imperfect proxy for human judgment. Without the KL term, the policy will "reward hack": it finds outputs that score high on the reward model but represent degenerate behavior. Classic examples: extremely long responses (if the RM learned that length correlates with quality), repetitive text, or overuse of phrases that labelers happened to rate highly. The KL penalty keeps the policy from drifting far from the SFT initialization, which preserves language quality while the reward model provides direction.
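One common way to apply the penalty in practice is to subtract a single-sample KL estimate from the reward model score for each sampled response. The sketch below assumes that formulation; the function name, the β value, and the toy log-probabilities are illustrative, not taken from any particular library.

```python
def kl_penalized_reward(rm_score, logp_policy, logp_sft, beta=0.1):
    """Reward model score minus beta times a single-sample KL estimate.

    logp_policy and logp_sft hold per-token log-probabilities of the same
    sampled response under the current policy and the SFT model.
    """
    kl_estimate = sum(lp - ls for lp, ls in zip(logp_policy, logp_sft))
    return rm_score - beta * kl_estimate

# Same response-level RM score, different amounts of drift from the SFT model.
print(kl_penalized_reward(5.0, [-1.0, -1.0], [-1.0, -1.0]))  # no drift, no penalty
print(kl_penalized_reward(5.0, [-0.1, -0.1], [-2.0, -2.0]))  # drifted, penalized
```

A response that the policy finds far more likely than the SFT model does pays for that drift, so a high reward model score alone is not enough.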

Q4: What's the difference between GPTQ and AWQ quantization, and when would you choose each?

Answer:

GPTQ (Post-Training Quantization with Calibration): Uses a small calibration dataset to reduce quantization error layer by layer. Builds on Optimal Brain Quantization (OBQ), using approximate second-order (Hessian) information to compensate for the rounding error of each quantized weight by adjusting the weights that have not been quantized yet. Calibration can take hours for large models but produces good-quality int4 models.

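AWQ (Activation-Aware Weight Quantization): Identifies which weight channels are most important by examining activation magnitudes on calibration data. Rather than keeping those salient weights in higher precision, it scales them up per channel before quantization (with the inverse scale folded into the activations), which shrinks their relative rounding error while every weight stays at the same low bit width. Faster to quantize than GPTQ, and the uniform bit width keeps the GEMM kernels efficient.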
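Both methods improve on plain round-to-nearest (RTN) quantization, so it helps to see that baseline. The sketch below is RTN only, not GPTQ or AWQ: it quantizes one group of weights to symmetric int4 with a shared scale. The function names and the sample weights are assumptions for illustration.

```python
def quantize_int4_group(weights):
    """Symmetric round-to-nearest int4 for one group: returns (codes, scale)."""
    absmax = max(abs(w) for w in weights)
    scale = absmax / 7 if absmax > 0 else 1.0
    # int4 codes live in [-8, 7]; clamp after rounding.
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Map int4 codes back to approximate float weights."""
    return [c * scale for c in codes]

group = [0.62, -0.31, 0.07, 0.0, -0.55, 0.12, 0.29, -0.03]
codes, scale = quantize_int4_group(group)
restored = dequantize(codes, scale)
worst = max(abs(w - r) for w, r in zip(group, restored))
print(codes, round(scale, 4), round(worst, 4))
```

The worst-case error per weight is half a quantization step. GPTQ reduces the effect of that error on the layer output by adjusting other weights, and AWQ reduces it for the channels that matter most by rescaling them first.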

Key differences:

| | GPTQ | AWQ |
|---|---|---|
| Calibration speed | Slower (hours) | Faster |
| Inference speed | Slower (CUDA cores) | Faster (custom kernels) |
| Accuracy | Slightly better | Comparable |
| Small batch inference | Similar | AWQ GEMV kernel better |
| Framework support | AutoGPTQ, vLLM | AutoAWQ, vLLM |

Choose GPTQ when: Accuracy is paramount and you have time to run calibration. Good for offline inference where throughput matters.

Choose AWQ when: Latency-sensitive inference with small batch sizes. The GEMV kernel is optimized for single-request decoding. Also better if you need to re-quantize frequently (faster calibration).

Q5: Explain PagedAttention and why it dramatically improves LLM serving throughput.

Answer: Standard KV cache allocation reserves a contiguous memory block of size max_seq_len × kv_size for each request at the start. This causes two problems:

Internal fragmentation: A request that generates 100 tokens wastes the remaining 3996 reserved slots (for max 4096 context)
External fragmentation: Many small gaps in memory, but no contiguous block large enough for a new request

PagedAttention (Kwon et al., vLLM 2023) applies OS virtual memory ideas to KV cache management:

Divide KV cache into fixed-size blocks (e.g., 16 tokens each)
Maintain a block table mapping logical positions to physical blocks
Allocate blocks on-demand as tokens are generated — no pre-allocation
Non-contiguous physical blocks: blocks for one request can be anywhere in memory
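The block-table idea can be sketched with a toy allocator. This is a simplified illustration of the bookkeeping only, not vLLM's implementation: the class name, method names, and the tiny block size in the usage lines are assumptions, and no actual tensors are stored.

```python
class PagedKVAllocator:
    """Toy block-table bookkeeping for a paged KV cache."""

    def __init__(self, num_physical_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables = {}  # request_id -> list of physical block ids
        self.lengths = {}       # request_id -> number of tokens stored

    def append_token(self, request_id):
        """Reserve a slot for one more token; returns (physical_block, offset)."""
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length % self.block_size == 0:  # no block yet, or the last one is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[request_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def free(self, request_id):
        """Return all of a finished request's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)

alloc = PagedKVAllocator(num_physical_blocks=4, block_size=2)
slots = [alloc.append_token("a") for _ in range(3)]
print(slots)                    # three tokens occupy two blocks
print(alloc.block_tables["a"])  # physical blocks need not be contiguous or ordered
```

A request only ever wastes the unused tail of its last block, which is what removes the large up-front reservation.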

Why this helps throughput:

KV cache memory utilization goes from roughly 20-40% (the rest lost to reservations and fragmentation) to nearly 100%
More requests can fit simultaneously on the GPU
Enables prompt sharing: multiple requests with identical prefixes (e.g., same system prompt) share physical blocks — copy-on-write diverges them only at the generation point
The vLLM paper reported 2-4× higher throughput than FasterTransformer and Orca at comparable latency

Q6: What causes the "lost in the middle" problem and how do you mitigate it?

Answer: LLMs attend better to content at the beginning and end of the context window than to content in the middle. This has been empirically documented by Liu et al. (2023): in experiments placing a relevant document at different positions in a 20-document context, accuracy peaks when the relevant document is first or last, and drops significantly when it's in the middle.

Mechanism: Transformer attention is not position-invariant. Positional embeddings and the causal attention pattern create position-dependent biases. The model's training data also likely has an implicit bias toward earlier content (documents typically have key information at the beginning and end).

Mitigations:

Strategic ordering: Most relevant document at position 0; second most relevant just before the question.
Fewer, more relevant documents: Reducing from 20 to 5 documents shrinks the "middle" dramatically. Better retrieval precision reduces the need for many documents.
Reranking: After initial retrieval, use a cross-encoder to rerank documents. Place the top-ranked document first.
Chunk compression: Summarize each document before insertion (cheap model for summarization, expensive model for answering). Shorter documents reduce the "middle zone."
Iterative retrieval: Instead of loading all documents at once, ask the model which document is most relevant from titles/abstracts, then retrieve only that document.
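The strategic-ordering mitigation is simple enough to sketch directly. The function name is an assumption; the input is expected to be already sorted by relevance, for example by a reranker.

```python
def order_for_context(docs_by_relevance):
    """Place the best document first and the second-best last.

    docs_by_relevance must be sorted most relevant first. The remaining
    documents keep their order and fill the middle of the context.
    """
    if len(docs_by_relevance) < 3:
        return list(docs_by_relevance)
    best, second, *rest = docs_by_relevance
    return [best, *rest, second]

ranked = ["doc_A", "doc_B", "doc_C", "doc_D"]
print(order_for_context(ranked))  # ['doc_A', 'doc_C', 'doc_D', 'doc_B']
```

The question would then follow the last document, so the two strongest documents sit at the positions the model attends to best.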

Q7: How does speculative decoding work and what determines the speedup?

Answer: Speculative decoding uses a small, fast draft model to propose multiple tokens at once, then verifies them in parallel with the large target model.

Algorithm:

Draft model generates K tokens autoregressively (fast, small model)
Target model scores all K+1 positions in a single forward pass (parallel)
For each position i, accept the draft token if uniform(0,1) < min(1, p_target(t_i) / p_draft(t_i))
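At the first rejection, discard the remaining draft tokens and sample a replacement from the corrected distribution, normalize(max(0, p_target - p_draft))
If all K draft tokens are accepted, sample one extra token from the target's distribution at position K+1
Continue from the last emitted token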

Key property: The acceptance criterion preserves the exact distribution of the target model — speculative decoding produces identically distributed outputs as the target model alone.
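One verification round can be sketched as follows, assuming the draft and target distributions at each position are already available as plain lists over a shared vocabulary. The function name and argument layout are assumptions; a real system would get these distributions from the two models' forward passes.

```python
import random

def verify_draft(draft_tokens, p_draft, p_target, rng=random):
    """Run one accept/reject round and return the tokens emitted.

    draft_tokens: K token ids proposed by the draft model.
    p_draft: K distributions (lists over the vocabulary) the draft sampled from.
    p_target: K + 1 target distributions; the last one is for the bonus token.
    """
    emitted = []
    for i, tok in enumerate(draft_tokens):
        q, p = p_draft[i], p_target[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):
            emitted.append(tok)
            continue
        # Rejected: resample from the normalized residual max(0, p - q), then stop.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        emitted.append(rng.choices(range(len(p)), weights=residual)[0])
        return emitted
    # Every draft token was accepted: take one bonus token from the target.
    bonus = p_target[len(draft_tokens)]
    emitted.append(rng.choices(range(len(bonus)), weights=bonus)[0])
    return emitted

dist = [0.5, 0.3, 0.2]
# Identical draft and target distributions: every draft token is accepted.
print(verify_draft([0, 1], [dist, dist], [dist, dist, dist], random.Random(0)))
```

Each round emits between 1 and K + 1 tokens, and the residual resampling step is what keeps the output distribution equal to the target's.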

Speedup formula:

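Expected tokens per target forward pass = E[accepted tokens + 1]
         = (1 - α^(K+1)) / (1 - α)  where α = average per-token acceptance rate (assumed independent across positions)

For α = 0.7, K = 4: about 2.8 tokens per target pass. This is an upper bound on the speedup; wall-clock speedup is lower once the K draft passes are paid for. If one draft pass costs a fraction c of a target pass, speedup ≈ (1 - α^(K+1)) / ((1 - α)(K·c + 1)), which gives roughly 2.3× for c = 0.05.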
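The arithmetic can be checked with a short sketch. The closed form gives the expected number of tokens emitted per target forward pass; dividing by the relative cost of the draft passes gives a wall-clock estimate. The draft cost ratio of 0.05 is an assumed value for illustration, and the function names are hypothetical.

```python
def expected_tokens_per_target_pass(alpha, k):
    """(1 - alpha^(k+1)) / (1 - alpha), valid for 0 <= alpha < 1."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def wall_clock_speedup(alpha, k, draft_cost_ratio):
    """Expected tokens per round divided by the round's cost in target passes.

    draft_cost_ratio is the cost of one draft pass relative to one target pass.
    """
    return expected_tokens_per_target_pass(alpha, k) / (k * draft_cost_ratio + 1)

print(round(expected_tokens_per_target_pass(0.7, 4), 2))  # about 2.77 tokens per target pass
print(round(wall_clock_speedup(0.7, 4, 0.05), 2))         # about 2.31x with a cheap draft
```

With a free draft model the two numbers coincide; the more expensive the draft, the further the realized speedup falls below the token count.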

What determines speedup:

Draft model quality: A good draft model produces tokens the target would also produce → high acceptance rate → more free tokens
Distribution similarity: Draft and target model should be in the same model family (smaller version of the same model)
Sequence type: Highly predictable text (code, prose) has higher acceptance rates than diverse creative text

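Production use: Speculative decoding is available in open-source serving stacks such as vLLM and TensorRT-LLM. Claims about specific commercial APIs (GPT-4, for example) are unconfirmed. In any case the draft model has to share the target's tokenizer, so it would be a smaller model from the same family rather than a model from another vendor.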

Q8: Explain DPO and how it achieves alignment without reinforcement learning.

Answer: DPO (Direct Preference Optimization) is based on the insight that the optimal RLHF policy has a closed-form expression in terms of the reference policy:

π*(y|x) = π_ref(y|x) · exp(r(x,y)/β) / Z(x)

Rearranging: r(x,y) = β · log(π*(y|x) / π_ref(y|x)) + β · log Z(x)

When this is substituted into the Bradley-Terry preference model, the normalizer Z(x) cancels:

P(y_w ≻ y_l | x) = σ(β·log(π_θ(y_w|x)/π_ref(y_w|x)) - β·log(π_θ(y_l|x)/π_ref(y_l|x)))

This gives us a loss we can directly minimize:

ℒ_DPO = -E[log σ(β·[log(π_θ(y_w|x)/π_ref(y_w|x)) - log(π_θ(y_l|x)/π_ref(y_l|x))])]
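For a single preference pair, the loss reduces to a few lines once the four sequence log-probabilities are known. This is a sketch under assumptions: the function name and β = 0.1 are illustrative, and each log-probability is taken to be the sum of per-token log-probabilities of the full response.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-pair DPO loss from sequence log-probabilities.

    logp_w, logp_l: log pi_theta of the preferred and rejected responses.
    ref_logp_w, ref_logp_l: the same quantities under the frozen reference.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log sigmoid(margin), written as softplus(-margin) for stability.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

print(dpo_loss(-10.0, -12.0, -10.0, -12.0))  # log(2): policy equals the reference
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))   # lower: policy shifted toward y_w
```

Note that the loss depends only on how the policy moved relative to the reference, not on the absolute log-probabilities.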

DPO vs PPO advantages:

No reward model to train or maintain
No online generation needed (offline training on fixed dataset)
No value function / GAE computation
Simpler implementation, more stable training
Reported results are similar to or better than PPO on several alignment benchmarks, though this varies with the setup

Tradeoff: DPO is offline — it optimizes on the fixed preference dataset and cannot explore new responses. PPO can improve beyond the preference dataset through online generation.

Q9: What is KV cache and how do you calculate its memory usage?

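Answer: The KV cache stores the key and value tensors for every token processed so far. The prompt's K/V are computed in the prefill phase, and each decoded token appends its own K/V, so later decoding steps attend over the cached tensors instead of recomputing them.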

Memory formula:

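KV cache per token = 2 × n_layers × n_kv_heads × head_dim × bytes_per_element

(n_kv_heads equals n_heads under standard multi-head attention.)

For LLaMA-3-8B dimensions in bfloat16, first treating it as full MHA with 32 K/V heads as a baseline:

2 (K and V) × 32 layers × 32 heads × 128 head_dim × 2 bytes = 524,288 bytes ≈ 0.5 MB per token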

For a 4096-token sequence: 0.5 MB × 4096 = 2 GB per request

For a concurrent batch of 32 requests each with 4096 tokens: 64 GB — this is why context length limits batch size.

GQA (Grouped Query Attention) reduces KV cache:

Standard MHA: n_heads K/V pairs
GQA: n_kv_heads K/V pairs (shared by groups of query heads)
LLaMA-3-8B uses 8 KV heads (vs 32 query heads) — 4× KV cache reduction

With GQA: 0.5 MB × (8/32) = 0.125 MB per token, making 16 GB of KV cache support 128K tokens.
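The figures above can be reproduced with a small calculator. The function name is an assumption; the dimensions are the ones quoted in this answer, and MB/GB are treated as binary units (MiB/GiB), which is what makes the numbers come out round.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_element=2):
    """Bytes of KV cache one token occupies across all layers (K and V)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_element

MIB = 1024 ** 2
GIB = 1024 ** 3

mha = kv_bytes_per_token(n_layers=32, n_kv_heads=32, head_dim=128)
gqa = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)

print(mha / MIB, gqa / MIB)     # 0.5 0.125
print(mha * 4096 / GIB)         # 2.0 (one 4096-token request, MHA baseline)
print(32 * mha * 4096 / GIB)    # 64.0 (batch of 32 such requests)
print(16 * GIB // gqa)          # 131072 tokens fit in 16 GiB with GQA
```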

Q10: Design a production system for a clinical AI assistant handling 10,000 requests per day.

Answer:

Requirements analysis:

10,000 requests/day ≈ 7 req/min (not heavy volume, but healthcare requires 99.9% uptime)
Clinical context: low hallucination tolerance, PHI handling, audit requirements

Architecture:

API Gateway (auth, rate limiting, request logging)
    ↓
RAG Pipeline:
  - Query embedding → Pinecone vector search → Top-5 relevant documents
  - Hybrid: BM25 + dense embeddings for drug name recall
    ↓
LLM Inference:
  - Primary: Claude Sonnet (quality + moderate cost)
  - Fallback: GPT-4o (if Anthropic unavailable)
  - System prompt with clinical grounding instructions
    ↓
Output Pipeline:
  - Response validation (drug name checker, citation verifier)
  - Audit log (every query + response stored with timestamp, user, model version)
  - Disclaimer injection (UI-level, not prompt-level)

Cost estimate: 10,000 requests × 1,200 tokens (avg input + context) × $3/M + 10,000 × 300 tokens output × $15/M = $36 + $45 = ~$81/day ≈ $2,400/month for Claude Sonnet.
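The estimate is easy to rerun when volumes or prices change. The sketch below uses the per-million-token prices quoted above as inputs; the function name and the 30-day month are assumptions.

```python
def daily_cost(requests, in_tokens, out_tokens, in_price_per_m, out_price_per_m):
    """Return (input_cost, output_cost) in dollars for one day of traffic."""
    input_cost = requests * in_tokens * in_price_per_m / 1e6
    output_cost = requests * out_tokens * out_price_per_m / 1e6
    return input_cost, output_cost

input_cost, output_cost = daily_cost(
    requests=10_000, in_tokens=1_200, out_tokens=300,
    in_price_per_m=3.0, out_price_per_m=15.0,
)
total = input_cost + output_cost
print(input_cost, output_cost, total)  # 36.0 45.0 81.0
print(total * 30)                      # 2430.0 per 30-day month
```

Output tokens are a quarter of the input volume here but more than half of the cost, so response length limits are the first lever to pull if the budget tightens.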

Safety measures:

Input validation: detect and block prompt injection attempts
Output validation: flag responses mentioning unknown drug names or specific doses without citation
Human review queue: flag cases where model expressed uncertainty or response validation failed
Prompt version control: CI/CD eval gate before any system prompt change

Monitoring:

Input and output token counts per request (detect context overflow and truncation); perplexity only if the serving API exposes token log-probabilities
LLM-as-judge on 1% random sample (automated quality scoring)
Latency p50/p95/p99 with alerts on p95 exceeding 5 seconds