Transformer Architecture Q&A · Lesson 20 of 23
Interview: Explain Attention from Scratch
Q: Explain attention in one sentence.
Attention computes a weighted sum of values, where the weights are learned by comparing each query to all keys using a dot product.
Q: Why do we divide by √dₖ?
Without scaling, dot products between d-dimensional vectors grow proportionally to √d. For large dₖ (e.g., 64), raw scores become large, pushing the softmax into its saturation region where gradients are near zero. Dividing by √dₖ normalises the variance of the dot product to ~1, keeping the softmax in a gradient-friendly regime.
Intuition: if Q and K have unit-variance components,
q · k = Σ qᵢkᵢ has variance = dₖ
√(variance) = √dₖ → divide by √dₖ to get variance = 1Q: What is the computational complexity of self-attention?
O(n²·d) in time and O(n²) in memory, where n is sequence length and d is head dimension. The bottleneck is the n×n attention matrix. For n=100K tokens, this matrix has 10B entries — infeasible naively. Solutions:
- FlashAttention: reorders computation to avoid materialising the full n×n matrix; O(n²) time but O(n) memory via tiling
- Sparse attention: only attend to local windows or selected global tokens; O(n·w) where w is window size
- Linear attention: approximate the softmax with kernel methods; O(n·d)
Q: What is the KV cache and why does it matter?
During autoregressive generation, each new token only needs to compute its own Q. The K and V vectors for all previous tokens are unchanged — caching them avoids recomputation:
Without KV cache: O(n²) work per new token (recompute all past K/V)
With KV cache: O(n) work per new token (compute only new K/V, reuse past)
Memory cost: 2 × seq_len × num_heads × head_dim × bytes_per_element
For LLaMA 7B serving 4K context batch of 32: ~2GB of KV cacheKV cache trades memory for speed. At large batch sizes or long contexts, it becomes the memory bottleneck.
Q: What is masked attention and where is it used?
Causal masking prevents position i from attending to positions i+1, i+2, ... by setting those attention scores to -∞ before softmax, so their weights become 0.
Used in:
- Decoder self-attention (GPT, LLaMA) — ensures the model can only use past tokens during generation
- Decoder blocks of encoder-decoder models (original Transformer, T5) — same reason
Not used in:
- Encoder self-attention (BERT) — bidirectional attention is the whole point
Q: What's the difference between self-attention and cross-attention?
| | Self-Attention | Cross-Attention | |--|--|--| | Q source | Same sequence as K/V | Different sequence from K/V | | Use | Encoder, Decoder self | Encoder-decoder bridge | | Example | BERT's encoder | T5 decoder attending to encoder |
In self-attention, Q=K=V=x (same input). In cross-attention, Q comes from the decoder and K/V come from the encoder's output — letting the decoder "read" the source.
Q: What do attention heads learn?
Empirically (Voita et al., Clark et al.):
- Syntactic heads: attend to grammatical dependencies (subject-verb, modifier-noun)
- Positional heads: attend to fixed offsets (next/previous token)
- Coreference heads: track pronoun references across long distances
- Rare token heads: disproportionately attend to rare or unusual tokens
Not all heads are equally useful. Studies show that up to 70% of heads can be pruned with minimal performance loss, suggesting significant redundancy.
Q: How would you reduce attention's memory footprint in production?
Several complementary approaches:
- FlashAttention: fused CUDA kernel, O(n) memory instead of O(n²)
- Grouped-query attention (GQA): multiple Q heads share one K/V head — reduces KV cache by h×
- Multi-query attention (MQA): all Q heads share a single K/V — extreme KV cache reduction
- 8-bit quantisation of KV cache: store K/V in int8 instead of float16
- Paged attention (vLLM): dynamically allocate KV cache pages, avoid fragmentation
- Context window sliding/streaming: evict oldest tokens from KV cache for long sequences
Interview Answer Template
"Self-attention computes score(i,j) = qᵢ·kⱼ/√dₖ, converts to weights via softmax, then returns the weighted sum of value vectors. Dividing by √dₖ prevents softmax saturation. Multi-head attention runs h parallel attention operations in different subspaces and concatenates the results. The O(n²) complexity and n×n memory are the main scaling challenges — FlashAttention addresses memory, GQA reduces the KV cache. Causal masking limits each position to past tokens only, enabling autoregressive generation."