What is Attention? The Core Intuition

The Problem Attention Solves

Before attention, sequence models were recurrent networks (RNNs), including LSTMs. These processed tokens one at a time, left to right, compressing the entire context into a fixed-size hidden state. The problem: information from early tokens is progressively overwritten as the model processes later tokens. For long sequences, the model effectively forgets the beginning.

RNN information flow:
token₁ → h₁ → token₂ → h₂ → token₃ → h₃ → ... → hₙ → output
         ↑ entire history compressed here ↑

Problem: h₁ has almost no influence on the output for long sequences.
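
To make the forgetting concrete, here is a minimal sketch using a toy linear recurrence, h_t = w · h_{t-1} + x_t. The recurrence, the weight `w = 0.9`, and the function name are assumptions chosen for illustration. Real RNNs and LSTMs add nonlinearities and gates, so this shows only the general trend, not their exact behaviour.

```python
# Toy linear recurrence: h_t = w * h_{t-1} + x_t  (w is a hypothetical recurrent weight).
# The contribution of the first input x_1 to the final state h_n is w ** (n - 1).

def first_token_influence(w: float, n: int) -> float:
    """Return d h_n / d x_1 for the toy recurrence above."""
    influence = 1.0          # d h_1 / d x_1
    for _ in range(n - 1):   # each later step multiplies the influence by w
        influence *= w
    return influence

for n in (5, 50, 500):
    print(n, first_token_influence(0.9, n))
```

With |w| < 1 the influence of the first token shrinks geometrically with sequence length, which is the effect the diagram above describes.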

Attention was introduced to give the model direct access to every token's representation at every step, instead of forcing the whole history through a single compressed state.


The Core Intuition

Attention asks: "For this output position, which input positions are most relevant?"

Translating: "The bank near the river is steep"
              ↓
When translating "bank", the model should attend to "river" and "steep",
not "the" or "near", to resolve the sense (financial vs riverbank)

Attention scores for word "bank":
  the:   0.05
  bank:  0.15  (itself)
  near:  0.06
  the:   0.05
  river: 0.52  ← most relevant
  is:    0.07
  steep: 0.10  ← also relevant

This context-dependent weighting allows the model to select relevant information dynamically, based on the current query position.
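
As a rough sketch of what these weights do, the snippet below blends one value vector per token using the scores listed above. The two-dimensional value vectors (a "finance" axis and a "nature" axis) and the `blend` helper are invented for illustration. A trained model learns its own high-dimensional values.

```python
# Attention weights for "bank", copied from the example above.
tokens = ["the", "bank", "near", "the", "river", "is", "steep"]
weights = [0.05, 0.15, 0.06, 0.05, 0.52, 0.07, 0.10]

# Hypothetical 2-d value vectors: [finance-ness, nature-ness]. Made up for illustration.
values = [
    [0.0, 0.0],  # the
    [0.5, 0.5],  # bank (ambiguous on its own)
    [0.0, 0.1],  # near
    [0.0, 0.0],  # the
    [0.0, 1.0],  # river
    [0.0, 0.0],  # is
    [0.0, 0.6],  # steep
]

def blend(weights, values):
    """Weighted sum of value vectors: sum_i weights[i] * values[i]."""
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

context = blend(weights, values)
most_attended = tokens[max(range(len(weights)), key=weights.__getitem__)]
print(most_attended, context)
```

With these made-up values the blended vector is roughly [0.075, 0.661]. The "nature" component dominates because "river" carries most of the weight.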


What Attention Computes

Attention takes three inputs — Query (Q), Key (K), Value (V) — and produces a weighted sum:

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V

Steps:
1. Compute similarity scores: S = QKᵀ           (dot product of query with all keys)
2. Scale:                      S = S / √dₖ       (keeps softmax from saturating when dₖ is large)
3. Normalize to probabilities: A = softmax(S)    (attention weights, sum to 1)
4. Weighted sum of values:     Output = A · V    (blend values by relevance)
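
The four steps above can be sketched in plain Python as follows. This is a minimal, unbatched illustration using only the standard library. The function names and the tiny Q, K, V inputs are assumptions for the example, and a practical implementation would use a tensor library and learned projection matrices.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on lists of vectors.

    Q: n_q query vectors of length d_k
    K: n_k key vectors of length d_k
    V: n_k value vectors of length d_v
    Returns (outputs, weights): n_q output vectors of length d_v,
    and the n_q x n_k matrix of attention weights.
    """
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # Steps 1 and 2: dot product with every key, then scale by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        # Step 3: normalize to weights that sum to 1.
        weights = softmax(scores)
        # Step 4: weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Tiny made-up example: 2 queries, 3 keys/values, d_k = d_v = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
out, w = attention(Q, K, V)
print(w[0])    # weights of the first query over the three keys
print(out[0])  # blended value vector for the first query
```

The first query matches the first and third keys equally well, so those two keys receive equal weight and the second key receives less.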

The output at each position is a weighted combination of all value vectors, where the weights depend on how similar each key is to the query.
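
The scaling in step 2 can be checked numerically. If the components of q and k are independent with zero mean and unit variance, the dot product q · k has variance dₖ, so its typical size grows like √dₖ. The sketch below estimates that spread for a few dimensions. The helper name, sample count, and seed are arbitrary choices for illustration.

```python
import math
import random

def dot_product_std(d_k, samples=1000, seed=0):
    """Empirical standard deviation of q . k for random unit-variance q and k."""
    rng = random.Random(seed)
    dots = []
    for _ in range(samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(dots) / samples
    return math.sqrt(sum((x - mean) ** 2 for x in dots) / samples)

for d_k in (4, 64, 256):
    raw = dot_product_std(d_k)
    # Raw spread grows with d_k; dividing by sqrt(d_k) brings it back to about 1.
    print(d_k, round(raw, 2), round(raw / math.sqrt(d_k), 2))
```

Without the division, large dₖ produces large score differences, softmax puts nearly all weight on one key, and the gradients through it become very small.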


Attention vs RNNs

| Property | RNN/LSTM | Attention |
|----------|----------|-----------|
| Path length (token i → token j) | O(n) steps | O(1) (direct) |
| Parallelism | Sequential (can't parallelize) | Fully parallel |
| Long-range dependencies | Poor (gradient vanishing) | Excellent (direct connection) |
| Computational cost | O(n) | O(n²) |

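The quadratic cost O(n²) in sequence length is the main weakness of attention: for very long sequences, it becomes expensive. Sparse and linear attention variants reduce this cost by restricting or approximating full attention. FlashAttention takes a different route: it computes exact attention but avoids materializing the full n × n score matrix, which cuts memory use and memory traffic rather than the O(n²) arithmetic.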
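
A back-of-the-envelope sketch of the quadratic growth, counting entries in the n × n score matrix. The function names are hypothetical, and the 4-bytes-per-entry figure assumes the full matrix is stored in float32 for a single head in a single layer.

```python
def attention_score_entries(n: int) -> int:
    """Number of query-key scores in full attention over n tokens (the n x n matrix)."""
    return n * n

def score_matrix_bytes(n: int, bytes_per_entry: int = 4) -> int:
    """Memory for one materialized score matrix, assuming float32 entries by default."""
    return attention_score_entries(n) * bytes_per_entry

for n in (1_000, 4_000, 16_000):
    print(n, attention_score_entries(n), score_matrix_bytes(n))
```

Multiplying the sequence length by 4 multiplies the score count by 16. At 16,000 tokens a single float32 score matrix already takes about 1 GB under these assumptions.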


Self-Attention vs Cross-Attention

Self-attention: Q, K, V all come from the same sequence.
→ Used in: encoder layers, decoder layers (masked)
→ Purpose: model relationships within the same sequence

Cross-attention: Q comes from one sequence, K and V from another.
→ Used in: encoder-decoder models (e.g., T5, original Transformer)
→ Purpose: model relationships between sequences (e.g., source and target in translation)
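
The difference comes down to where the queries, keys, and values are taken from. The sketch below is a simplified illustration: the `attend` helper, the `source` and `target` embeddings, and the `causal` flag are all hypothetical. It also uses the raw vectors directly as Q, K, and V, whereas real Transformer layers first apply learned linear projections.

```python
import math

def attend(queries, keys, values, causal=False):
    """Scaled dot-product attention; with causal=True, query i sees only keys 0..i.

    The causal option is meant for self-attention, where queries and keys
    come from the same sequence and therefore have the same length.
    """
    d_k = len(keys[0])
    outputs = []
    for i, q in enumerate(queries):
        visible = range(i + 1) if causal else range(len(keys))
        scores = [sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d_k) for j in visible]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([
            sum(w * values[j][c] for w, j in zip(weights, visible))
            for c in range(len(values[0]))
        ])
    return outputs

# Hypothetical embeddings: a 3-token source and a 2-token target, dimension 2.
source = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
target = [[0.5, 0.5], [1.0, 0.0]]

# Self-attention: Q, K, V all come from the same sequence.
self_out = attend(source, source, source)
# Masked self-attention (decoder style): position i ignores later positions.
masked_out = attend(target, target, target, causal=True)
# Cross-attention: Q comes from the target, K and V from the source.
cross_out = attend(target, source, source)

print(len(self_out), len(masked_out), len(cross_out))
```

In every case the output has one vector per query, so cross-attention returns as many vectors as the target has tokens, each one a blend of source values. Under the causal mask, the first target position can attend only to itself.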


Interview Answer

"Attention is a mechanism that computes, for each output position, a weighted sum of all input representations. The weights are computed by comparing a Query vector (what we're looking for) against Key vectors (what's available), normalised with softmax. The corresponding Value vectors are then blended by these weights. This gives the model direct O(1) access to any input position regardless of distance, solving the long-range dependency problem that plagued RNNs."