LLMs Deep Dive · Lesson 21 of 24
Interview: Explain the Attention Mechanism
Q: Walk me through the forward pass of a modern LLM.
A token sequence is embedded and positional information is added. The sequence passes through N decoder blocks. Each block:
- Pre-RMSNorm: normalise the input
- Masked self-attention: each token attends only to past tokens via causal mask; Q/K rotated with RoPE; GQA shares K/V across groups of Q heads
- Residual: add the attention output back to the pre-attention input
- Pre-RMSNorm: normalise the residual output
- SwiGLU FFN: expand to 2.67× d_model, gate, contract back to d_model
- Residual: add FFN output back
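After all N blocks, a final RMSNorm is applied and the last hidden state is projected to vocabulary-sized logits by the output (LM head) matrix. Some models tie this matrix to the transpose of the input embedding; others, including LLaMA 2, keep it as a separate matrix. Softmax over the logits gives next-token probabilities.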
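The masked self-attention step in the block list above can be sketched in plain Python for a single head. This is a toy illustration only: it assumes tiny hand-written Q/K/V lists and leaves out RoPE, GQA, batching, and the learned projections. The function name `causal_attention` is hypothetical.

```python
import math

def causal_attention(q, k, v):
    """Single-head causal attention over lists of vectors (toy sketch)."""
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        # Causal mask: token i may only look at positions 0..i.
        scores = [sum(a * b for a, b in zip(qi, k[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        # Numerically stable softmax over the visible positions.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the visible value vectors.
        out.append([sum(w * v[j][c] for j, w in enumerate(weights))
                    for c in range(len(v[0]))])
    return out

# Hypothetical toy inputs: 3 tokens, head dimension 2.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
y = causal_attention(q, k, v)
print(y[0])  # [1.0, 0.0] -- the first token can only attend to itself
```

Changing the last row of `v` leaves `y[0]` and `y[1]` untouched, which is the property the causal mask is meant to guarantee.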
Q: Why did the field move from Post-LN to Pre-LN?
Post-LN (original Transformer): normalize AFTER adding the residual, i.e. x ← LN(x + Sublayer(x)). Every LayerNorm sits on the residual path, so gradients have no unnormalized route back to the early layers, and the gradient scale at initialization is poorly behaved. For deep models (24+ layers) this causes instability and requires careful learning-rate warmup.
Pre-LN: normalize BEFORE the sublayer, i.e. x ← x + Sublayer(LN(x)), so the sublayer input always has a controlled scale. The residual path carries x unmodified, so gradients flow directly through it. This allows training without elaborate warmup and scales to 96+ layers reliably. There can be a slight quality penalty at convergence compared with a well-tuned Post-LN model, but the stability gain far outweighs it for large models.
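The two orderings differ only in where the norm sits relative to the residual add. A minimal sketch, assuming a parameter-free RMS norm as a stand-in for both LayerNorm and RMSNorm and a hypothetical `zero_branch` sublayer that contributes nothing:

```python
import math

def unit_rms(x, eps=1e-6):
    """Stand-in norm: rescale x so its root mean square is about 1."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def post_ln(x, sublayer, norm=unit_rms):
    # Post-LN ordering: norm(x + sublayer(x)).
    # The norm sits on the residual path itself.
    return norm([a + b for a, b in zip(x, sublayer(x))])

def pre_ln(x, sublayer, norm=unit_rms):
    # Pre-LN ordering: x + sublayer(norm(x)).
    # x is carried forward untouched; only the branch input is normalised.
    return [a + b for a, b in zip(x, sublayer(norm(x)))]

def zero_branch(h):
    """Hypothetical sublayer that outputs zeros, to isolate the residual path."""
    return [0.0 for _ in h]

x = [3.0, 4.0]
print(pre_ln(x, zero_branch))   # [3.0, 4.0] -- identity on the residual path
print(post_ln(x, zero_branch))  # x rescaled by the norm, no longer [3.0, 4.0]
```

With the branch silenced, Pre-LN reduces to the identity while Post-LN still rescales the signal at every layer, which is the structural reason gradients reach early layers more directly under Pre-LN.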
Q: Why use SwiGLU instead of GELU?
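SwiGLU(x) = (SiLU(x·W_gate) ⊙ (x·W₁))·W₂, where ⊙ is element-wise multiplication: a gated FFN. Because it has three weight matrices instead of two, the hidden width is shrunk to about two thirds of the usual 4× d_model (≈2.67×) to keep the parameter count equal. Empirically, SwiGLU outperforms GELU by roughly 1-2% on downstream benchmarks at the same parameter count (Shazeer, 2020). The gate modulates which features from the first linear layer are passed to the output, giving a learned, input-dependent selection of features. LLaMA, PaLM, Mistral, and most modern open-source LLMs use SwiGLU.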
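The gated FFN can be written out in a few lines of plain Python. This is a sketch under toy assumptions (d_model = 2, hidden width = 3, hand-picked weights); the helper names `matvec` and `swiglu_ffn` are hypothetical.

```python
import math

def silu(z):
    """SiLU (swish): z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def matvec(x, w):
    """Row vector x (length n) times matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x)))
            for j in range(len(w[0]))]

def swiglu_ffn(x, w_gate, w_up, w_down):
    gate = [silu(g) for g in matvec(x, w_gate)]  # SiLU(x W_gate)
    up = matvec(x, w_up)                         # x W_1
    hidden = [g * u for g, u in zip(gate, up)]   # element-wise gating
    return matvec(hidden, w_down)                # (gated hidden) W_2

# Toy weights: d_model = 2, hidden width = 3.
w_gate = [[0.0, 1.0, -1.0], [0.0, 0.0, 0.0]]
w_up = [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]]
w_down = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]

x = [1.0, 0.0]
out = swiglu_ffn(x, w_gate, w_up, w_down)
print(out)  # first entry equals silu(0) + silu(1) + silu(-1)
```

Note that the gate and the "up" projection both read the same input x; only their product is passed through the final projection back to d_model.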
Q: What is GQA and why does it exist?
Grouped-Query Attention (GQA) has h query heads but only g < h key/value heads. Each group of h/g query heads shares one K/V head. The KV cache size is proportional to the number of KV heads, not query heads, so GQA shrinks the KV cache by a factor of h/g. LLaMA 2 70B uses g=8, h=64, achieving an 8× KV cache reduction vs standard MHA, with quality close to that of MHA.
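The head sharing and the resulting cache saving can be made concrete with two small helpers. This is an illustrative sketch: the function names are hypothetical, and it assumes the common convention that consecutive query heads form a group.

```python
def kv_head_for_query_head(q_head, n_q_heads, n_kv_heads):
    """Index of the K/V head that a given query head reads from."""
    assert n_q_heads % n_kv_heads == 0, "query heads must split evenly into groups"
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size

def kv_cache_reduction(n_q_heads, n_kv_heads):
    """How many times smaller the KV cache is compared with MHA."""
    return n_q_heads // n_kv_heads

h, g = 64, 8  # the LLaMA 2 70B figures quoted above
mapping = [kv_head_for_query_head(i, h, g) for i in range(h)]
print(mapping[:10])              # [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
print(kv_cache_reduction(h, g))  # 8
```

Setting g = h recovers standard multi-head attention (every query head has its own K/V head), and g = 1 gives multi-query attention.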
Q: Explain tied embeddings.
The input token embedding matrix E_in ∈ ℝ^(vocab×d_model) maps token IDs to vectors. The output LM head maps hidden states back to logits: hidden ∈ ℝ^d_model → logit ∈ ℝ^vocab. Tied embeddings set the output weight matrix = E_inᵀ, so both mappings share the same parameters. Benefits: saves vocab×d_model parameters (about 131M at LLaMA 2 7B's sizes, 32,000 × 4,096) and forces a consistent semantic space between input and output representations. Tying is most attractive for smaller models, where the embedding matrix is a large share of the parameters and tied weights often match or beat untied ones. Note that LLaMA 2 itself keeps separate input and output matrices.
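A minimal sketch of weight tying, assuming a toy 4-token vocabulary and d_model = 3: the same matrix `E` serves as the lookup table on the way in and, transposed, as the LM head on the way out. The names `embed` and `lm_head` are hypothetical.

```python
# Toy embedding matrix: one row per token ID (vocab = 4, d_model = 3).
E = [[0.1, 0.2, 0.3],
     [0.0, 1.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.5, 0.5, 0.5]]

def embed(token_id):
    """Input side: look up a row of E."""
    return E[token_id]

def lm_head(hidden):
    """Output side with tying: logits = hidden @ E^T, reusing the same E."""
    return [sum(h * e for h, e in zip(hidden, row)) for row in E]

logits = lm_head(embed(2))
print(logits)  # [0.1, 0.0, 1.0, 0.5] -- token 2 scores highest against itself

# Parameters saved by tying at vocab = 32,000 and d_model = 4,096.
saved = 32000 * 4096
print(saved)   # 131072000
```

An untied model would simply hold a second, independently trained matrix of the same shape for `lm_head`.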
Q: How does RMSNorm differ from LayerNorm?
LayerNorm computes the mean and variance over the feature dimension, subtracts the mean, divides by the standard deviation, then applies a learned scale γ and shift β. RMSNorm skips the mean-centering step: x / RMS(x) × γ, where RMS(x) = √(mean(x²)). There is no β parameter. This saves the mean computation (minor) and removes a redundant parameter: the LayerNorm shift β can be absorbed by the bias of the following linear layer, and modern LLMs drop those biases as well. Empirically, RMSNorm performs on par with LayerNorm on large models while being cheaper to compute; speedups of around 10-15% are commonly quoted, though the gain depends on the model and hardware.
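The two norms side by side, as a plain-Python sketch over a single feature vector. The `eps` value and the unit γ / zero β used in the demo are assumptions chosen for illustration.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Subtract the mean, divide by the std, then scale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]

def rms_norm(x, gamma, eps=1e-5):
    """Divide by the root mean square and scale; no centering, no shift."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gamma)]

x = [1.0, 2.0, 3.0, 4.0]
ones = [1.0] * len(x)
zeros = [0.0] * len(x)

ln = layer_norm(x, ones, zeros)
rn = rms_norm(x, ones)
print(sum(ln) / len(ln))  # approximately 0: LayerNorm re-centres the vector
print(sum(rn) / len(rn))  # clearly positive: RMSNorm keeps the mean offset
```

RMSNorm needs one reduction (the mean of squares) where LayerNorm needs two (mean, then variance), which is where the compute saving comes from.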
Q: What's the difference between the original Transformer and LLaMA 2?
| Component | Original Transformer | LLaMA 2 |
|-----------|---------------------|---------|
| Architecture | Encoder-decoder | Decoder-only |
| Normalisation | Post-LayerNorm | Pre-RMSNorm |
| Positional | Sinusoidal | RoPE |
| Attention | MHA | GQA (70B) |
| FFN activation | ReLU | SwiGLU |
| FFN biases | Yes | No |
| Token embedding | Shared in/out (tied) | Separate in/out (untied) |
| Max context | 512 | 4096 |
Q: What is the memory cost of an LLM in production?
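For LLaMA 2 7B in fp16:
Model weights: 7B × 2 bytes = 14 GB
KV cache per token: 2 (K and V) × 32 layers × 4096 dims × 2 bytes ≈ 0.5 MB
KV cache (batch=32, seq=4K): 32 × 4096 × 0.5 MB/token = 64 GB
Activations (during generation): small, O(batch × d_model)
Total for batch=32: ~78 GB, which leaves almost no headroom on an A100 80GB; in practice, quantise the weights or the KV cache, or reduce the batch size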
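The batch=32 arithmetic above can be reproduced with a small calculator. This is a rough sketch, assuming LLaMA 2 7B's shape (32 layers, 32 KV heads of dimension 128, i.e. standard MHA) and fp16 storage; the function names are hypothetical. The 0.5 MB/token figure is a binary quantity (0.5 MiB), so the sketch reports everything in GiB, which makes the weights come out slightly below the decimal 14 GB.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache added per token; the factor 2 covers K and V."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

def serving_memory_gib(n_params, batch, seq_len, per_token_bytes,
                       bytes_per_param=2):
    """Return (weights, kv_cache) in GiB; activations are ignored."""
    gib = 1024 ** 3
    weights = n_params * bytes_per_param
    kv_cache = batch * seq_len * per_token_bytes
    return weights / gib, kv_cache / gib

# Assumed LLaMA 2 7B shape: 32 layers, 32 KV heads, head_dim 128.
per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=32, head_dim=128)
print(per_token / 2 ** 20)  # 0.5 (MiB per token)

weights_gib, kv_gib = serving_memory_gib(
    n_params=7e9, batch=32, seq_len=4096, per_token_bytes=per_token)
print(kv_gib)               # 64.0
print(round(weights_gib, 1))
```

Calling `serving_memory_gib` with `batch=1` gives a 2 GiB cache for one 4K-token sequence, and passing a smaller `n_kv_heads` to `kv_bytes_per_token` shows how GQA would shrink the per-token cost.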
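For a single user (batch=1, seq=4K):
Weights: 14 GB
KV cache: 4096 × 0.5 MB/token = 2 GB
Total: ~16 GB in fp16, which already fits on a single A100 40GB; INT8 weight quantisation brings it down to about 9 GB (7 GB + 2 GB)
Interview Answer Template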
"Modern decoder-only LLMs like LLaMA 2 extend the original Transformer with: Pre-RMSNorm for training stability without warmup sensitivity; RoPE for positional encoding with relative distance properties; SwiGLU FFN for roughly 1-2% better downstream performance; GQA (in the 70B model) to shrink the KV cache in proportion to the KV head count; and no biases in Linear layers. Some models also tie input and output embeddings to save parameters, although LLaMA 2 keeps them separate. Relative to earlier GPT-style models, these architectural refinements, combined with training on far more tokens per parameter, allow models in the 7B-13B range to be competitive with earlier 175B models on many benchmarks."