Interview: Explain Attention from Scratch — Transformer Architecture Q&A | Learnixo

Q: Explain attention in one sentence.

Attention computes a weighted sum of values, where the weights are learned by comparing each query to all keys using a dot product.

Q: Why do we divide by √dₖ?

Without scaling, dot products between d-dimensional vectors grow proportionally to √d. For large dₖ (e.g., 64), raw scores become large, pushing the softmax into its saturation region where gradients are near zero. Dividing by √dₖ normalises the variance of the dot product to ~1, keeping the softmax in a gradient-friendly regime.

Intuition: if Q and K have unit-variance components,
q · k = Σ qᵢkᵢ has variance = dₖ
√(variance) = √dₖ  →  divide by √dₖ to get variance = 1

Q: What is the computational complexity of self-attention?

O(n²·d) in time and O(n²) in memory, where n is sequence length and d is head dimension. The bottleneck is the n×n attention matrix. For n=100K tokens, this matrix has 10B entries — infeasible naively. Solutions:

FlashAttention: reorders computation to avoid materialising the full n×n matrix; O(n²) time but O(n) memory via tiling
Sparse attention: only attend to local windows or selected global tokens; O(n·w) where w is window size
Linear attention: approximate the softmax with kernel methods; O(n·d)

Q: What is the KV cache and why does it matter?

During autoregressive generation, each new token only needs to compute its own Q. The K and V vectors for all previous tokens are unchanged — caching them avoids recomputation:

Without KV cache: O(n²) work per new token (recompute all past K/V)
With KV cache:    O(n)  work per new token (compute only new K/V, reuse past)

Memory cost: 2 × seq_len × num_heads × head_dim × bytes_per_element
For LLaMA 7B serving 4K context batch of 32: ~2GB of KV cache

KV cache trades memory for speed. At large batch sizes or long contexts, it becomes the memory bottleneck.

Q: What is masked attention and where is it used?

Causal masking prevents position i from attending to positions i+1, i+2, ... by setting those attention scores to -∞ before softmax, so their weights become 0.

Used in:

Decoder self-attention (GPT, LLaMA) — ensures the model can only use past tokens during generation
Decoder blocks of encoder-decoder models (original Transformer, T5) — same reason

Not used in:

Encoder self-attention (BERT) — bidirectional attention is the whole point

Q: What's the difference between self-attention and cross-attention?

In self-attention, Q=K=V=x (same input). In cross-attention, Q comes from the decoder and K/V come from the encoder's output — letting the decoder "read" the source.

Q: What do attention heads learn?

Empirically (Voita et al., Clark et al.):

Syntactic heads: attend to grammatical dependencies (subject-verb, modifier-noun)
Positional heads: attend to fixed offsets (next/previous token)
Coreference heads: track pronoun references across long distances
Rare token heads: disproportionately attend to rare or unusual tokens

Not all heads are equally useful. Studies show that up to 70% of heads can be pruned with minimal performance loss, suggesting significant redundancy.

Q: How would you reduce attention's memory footprint in production?

Several complementary approaches:

FlashAttention: fused CUDA kernel, O(n) memory instead of O(n²)
Grouped-query attention (GQA): multiple Q heads share one K/V head — reduces KV cache by h×
Multi-query attention (MQA): all Q heads share a single K/V — extreme KV cache reduction
8-bit quantisation of KV cache: store K/V in int8 instead of float16
Paged attention (vLLM): dynamically allocate KV cache pages, avoid fragmentation
Context window sliding/streaming: evict oldest tokens from KV cache for long sequences

Interview Answer Template

"Self-attention computes score(i,j) = qᵢ·kⱼ/√dₖ, converts to weights via softmax, then returns the weighted sum of value vectors. Dividing by √dₖ prevents softmax saturation. Multi-head attention runs h parallel attention operations in different subspaces and concatenates the results. The O(n²) complexity and n×n memory are the main scaling challenges — FlashAttention addresses memory, GQA reduces the KV cache. Causal masking limits each position to past tokens only, enabling autoregressive generation."