Learnixo
Back to blog
AI Systemsintermediate

Interview Q&A: Attention Mechanism

Common interview questions and model answers about attention — the mechanism, scaling, multi-head, KV cache, and complexity — framed for senior ML and systems engineering roles.

Asma Hafeez KhanMay 16, 20264 min read
TransformersAttentionInterviewQ&ANLP
Share:𝕏

Q: Explain attention in one sentence.

Attention computes a weighted sum of values, where the weights are learned by comparing each query to all keys using a dot product.


Q: Why do we divide by √dₖ?

Without scaling, dot products between d-dimensional vectors grow proportionally to √d. For large dₖ (e.g., 64), raw scores become large, pushing the softmax into its saturation region where gradients are near zero. Dividing by √dₖ normalises the variance of the dot product to ~1, keeping the softmax in a gradient-friendly regime.

Intuition: if Q and K have unit-variance components,
q · k = Σ qᵢkᵢ has variance = dₖ
√(variance) = √dₖ  →  divide by √dₖ to get variance = 1

Q: What is the computational complexity of self-attention?

O(n²·d) in time and O(n²) in memory, where n is sequence length and d is head dimension. The bottleneck is the n×n attention matrix. For n=100K tokens, this matrix has 10B entries — infeasible naively. Solutions:

  • FlashAttention: reorders computation to avoid materialising the full n×n matrix; O(n²) time but O(n) memory via tiling
  • Sparse attention: only attend to local windows or selected global tokens; O(n·w) where w is window size
  • Linear attention: approximate the softmax with kernel methods; O(n·d)

Q: What is the KV cache and why does it matter?

During autoregressive generation, each new token only needs to compute its own Q. The K and V vectors for all previous tokens are unchanged — caching them avoids recomputation:

Without KV cache: O(n²) work per new token (recompute all past K/V)
With KV cache:    O(n)  work per new token (compute only new K/V, reuse past)

Memory cost: 2 × seq_len × num_heads × head_dim × bytes_per_element
For LLaMA 7B serving 4K context batch of 32: ~2GB of KV cache

KV cache trades memory for speed. At large batch sizes or long contexts, it becomes the memory bottleneck.


Q: What is masked attention and where is it used?

Causal masking prevents position i from attending to positions i+1, i+2, ... by setting those attention scores to -∞ before softmax, so their weights become 0.

Used in:

  • Decoder self-attention (GPT, LLaMA) — ensures the model can only use past tokens during generation
  • Decoder blocks of encoder-decoder models (original Transformer, T5) — same reason

Not used in:

  • Encoder self-attention (BERT) — bidirectional attention is the whole point

Q: What's the difference between self-attention and cross-attention?

| | Self-Attention | Cross-Attention | |--|--|--| | Q source | Same sequence as K/V | Different sequence from K/V | | Use | Encoder, Decoder self | Encoder-decoder bridge | | Example | BERT's encoder | T5 decoder attending to encoder |

In self-attention, Q=K=V=x (same input). In cross-attention, Q comes from the decoder and K/V come from the encoder's output — letting the decoder "read" the source.


Q: What do attention heads learn?

Empirically (Voita et al., Clark et al.):

  • Syntactic heads: attend to grammatical dependencies (subject-verb, modifier-noun)
  • Positional heads: attend to fixed offsets (next/previous token)
  • Coreference heads: track pronoun references across long distances
  • Rare token heads: disproportionately attend to rare or unusual tokens

Not all heads are equally useful. Studies show that up to 70% of heads can be pruned with minimal performance loss, suggesting significant redundancy.


Q: How would you reduce attention's memory footprint in production?

Several complementary approaches:

  1. FlashAttention: fused CUDA kernel, O(n) memory instead of O(n²)
  2. Grouped-query attention (GQA): multiple Q heads share one K/V head — reduces KV cache by h×
  3. Multi-query attention (MQA): all Q heads share a single K/V — extreme KV cache reduction
  4. 8-bit quantisation of KV cache: store K/V in int8 instead of float16
  5. Paged attention (vLLM): dynamically allocate KV cache pages, avoid fragmentation
  6. Context window sliding/streaming: evict oldest tokens from KV cache for long sequences

Interview Answer Template

"Self-attention computes score(i,j) = qᵢ·kⱼ/√dₖ, converts to weights via softmax, then returns the weighted sum of value vectors. Dividing by √dₖ prevents softmax saturation. Multi-head attention runs h parallel attention operations in different subspaces and concatenates the results. The O(n²) complexity and n×n memory are the main scaling challenges — FlashAttention addresses memory, GQA reduces the KV cache. Causal masking limits each position to past tokens only, enabling autoregressive generation."

Enjoyed this article?

Explore the AI Systems learning path for more.

Found this helpful?

Share:𝕏

Leave a comment

Have a question, correction, or just found this helpful? Leave a note below.