Transformer Architecture Q&A · Lesson 7 of 23

Decoder Stack: Masked Attention + Cross-Attention

Decoder Block Structure

The decoder has three sub-layers (vs two in the encoder):

Input y (previous outputs)
│
├─→ [Masked Multi-Head Self-Attention] → + y → LayerNorm → a
│         ↑ causal mask (can't see future)
│
├─→ [Cross-Attention]                  → + a → LayerNorm → b
│         ↑ Q from decoder, K/V from encoder output
│
└─→ [Feed-Forward Network]             → + b → LayerNorm → output

The additional sub-layer (cross-attention) is what allows the decoder to condition on the encoder's output — the source sentence in translation, or the input document in summarisation.
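The diagram above can be sketched in NumPy to show how the three sub-layers chain together through residual connections and LayerNorm. This is a minimal illustration under simplifying assumptions, not a reference implementation: attention is single-head, there are no biases, dropout, or learned LayerNorm parameters, and the parameter names in the dictionary `p` (`sa_q`, `ca_k`, `ff_1`, and so on) are hypothetical.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalise each position over its feature dimension (no learned gain/bias here).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def attention(q, k, v, mask=None):
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)  # blocked pairs get -inf before softmax
    return softmax(scores) @ v

def decoder_block(y, enc_out, p):
    t = y.shape[0]
    causal = np.tril(np.ones((t, t), dtype=bool))  # True where key index <= query index

    # Sub-layer 1: masked self-attention, then residual + LayerNorm -> a
    sa = attention(y @ p["sa_q"], y @ p["sa_k"], y @ p["sa_v"], mask=causal)
    a = layer_norm(y + sa)

    # Sub-layer 2: cross-attention (Q from decoder, K/V from encoder) -> b
    ca = attention(a @ p["ca_q"], enc_out @ p["ca_k"], enc_out @ p["ca_v"])
    b = layer_norm(a + ca)

    # Sub-layer 3: position-wise feed-forward network -> output
    ff = np.maximum(0.0, b @ p["ff_1"]) @ p["ff_2"]
    return layer_norm(b + ff)

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
names = ["sa_q", "sa_k", "sa_v", "ca_q", "ca_k", "ca_v"]
params = {name: rng.normal(scale=0.1, size=(d_model, d_model)) for name in names}
params["ff_1"] = rng.normal(scale=0.1, size=(d_model, d_ff))
params["ff_2"] = rng.normal(scale=0.1, size=(d_ff, d_model))

y = rng.normal(size=(4, d_model))        # 4 target positions
enc_out = rng.normal(size=(6, d_model))  # 6 source positions
out = decoder_block(y, enc_out, params)
print(out.shape)  # (4, 8)
```

The output keeps the decoder's sequence length (4), even though the encoder sequence has a different length (6): the encoder only supplies keys and values.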


Sub-Layer 1: Masked Self-Attention

The decoder's self-attention is causally masked — position i can only attend to positions 0..i, not future positions:

Causal mask (4-token sequence):
     pos0  pos1  pos2  pos3
pos0 [  ✓    ✗    ✗    ✗  ]
pos1 [  ✓    ✓    ✗    ✗  ]
pos2 [  ✓    ✓    ✓    ✗  ]
pos3 [  ✓    ✓    ✓    ✓  ]

✗ positions have scores set to -∞ before softmax → attention weight = 0
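The mask can be built and applied in a few lines of NumPy. The sketch below is illustrative only; the helper names `causal_mask` and `masked_softmax` are made up for this example, and the raw scores are set to zero so that the effect of the mask is easy to read off.

```python
import numpy as np

def causal_mask(n):
    """Boolean (n, n) matrix: True where query row i may attend to key column j (j <= i)."""
    return np.tril(np.ones((n, n), dtype=bool))

def masked_softmax(scores, mask):
    scores = np.where(mask, scores, -np.inf)             # ✗ positions become -inf
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)                              # exp(-inf) == 0
    return weights / weights.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))  # identical raw scores for every query/key pair
weights = masked_softmax(scores, causal_mask(4))
print(np.round(weights, 2))
```

With identical raw scores, row i spreads its weight evenly over positions 0..i (1, then 0.5 each, and so on), and every entry above the diagonal is exactly 0.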

Why masking: During training, the full target sequence is available (teacher forcing) and all positions are predicted in parallel. Without masking, the model could "cheat" by looking at the very tokens it is being asked to predict. At inference, future tokens don't exist yet, so the mask makes training match the conditions the model will face when generating.
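To make the teacher-forcing setup concrete, here is a small sketch of how decoder inputs and labels are commonly lined up: the input is the target shifted right by one position. The token strings `<bos>` and `<eos>` and the example sentence are assumptions chosen for illustration.

```python
BOS, EOS = "<bos>", "<eos>"
target = ["The", "cat", "sat"]

decoder_input = [BOS] + target  # what the decoder is fed during training
labels = target + [EOS]         # what it must predict at each position

for i, label in enumerate(labels):
    visible = decoder_input[: i + 1]  # the causal mask limits position i to this prefix
    print(visible, "->", label)

# ['<bos>'] -> The
# ['<bos>', 'The'] -> cat
# ['<bos>', 'The', 'cat'] -> sat
# ['<bos>', 'The', 'cat', 'sat'] -> <eos>
```

Without the mask, position 0 would be able to see "The" in `decoder_input[1]`, which is exactly the label it is supposed to predict.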


Sub-Layer 2: Cross-Attention (Encoder-Decoder Attention)

CrossAttention:
  Q = decoder's current representation (from masked self-attention)
  K = encoder's final output
  V = encoder's final output

CrossAttention(Q, K_enc, V_enc) = Attention(Q · W_Q, K_enc · W_K, V_enc · W_V)

This is how the decoder "reads" the encoder. Each decoder position can attend to any encoder position — no causal mask here. The decoder learns which source positions are relevant for generating each target token.
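A single-head version of cross-attention can be sketched as follows. The function name `cross_attention`, the random weights, and the sequence lengths (3 target positions, 5 source positions) are assumptions for illustration; a real model would use multiple heads and trained weights.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(dec_states, enc_out, w_q, w_k, w_v):
    q = dec_states @ w_q  # queries come from the decoder
    k = enc_out @ w_k     # keys come from the encoder output
    v = enc_out @ w_v     # values come from the encoder output
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores)  # no causal mask: every source position is visible
    return weights @ v, weights

rng = np.random.default_rng(0)
d_model = 8
dec_states = rng.normal(size=(3, d_model))  # 3 target positions
enc_out = rng.normal(size=(5, d_model))     # 5 source positions
w_q, w_k, w_v = (rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(3))

out, weights = cross_attention(dec_states, enc_out, w_q, w_k, w_v)
print(out.shape, weights.shape)  # (3, 8) (3, 5)
```

The weight matrix has one row per target position and one column per source position, so row i can be read as "which source tokens matter for target token i".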


Autoregressive Generation

At inference, decoder-only or encoder-decoder models generate one token at a time:

Step 1: input = [BOS]         → predict "The"
Step 2: input = [BOS, The]    → predict "cat"
Step 3: input = [BOS, The, cat] → predict "sat"
...
until EOS token is predicted

Each step:
  - Run full decoder on all tokens generated so far
  - Take only the last position's output → logits over vocabulary
  - Sample or argmax → next token
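The loop above can be sketched with greedy (argmax) decoding. The `toy_decoder` below is a hypothetical stand-in for a trained model: it simply favours the next entry in a five-token vocabulary, which is enough to show the control flow. The names `VOCAB`, `greedy_generate`, and `max_len` are likewise assumptions for this example.

```python
import numpy as np

VOCAB = ["<bos>", "The", "cat", "sat", "<eos>"]
BOS_ID, EOS_ID = 0, 4

def toy_decoder(token_ids):
    """Hypothetical stand-in for a real decoder.

    Returns logits of shape (len(token_ids), vocab_size); each position
    favours the vocabulary entry that follows its own token.
    """
    logits = np.zeros((len(token_ids), len(VOCAB)))
    for pos, tok in enumerate(token_ids):
        logits[pos, min(tok + 1, EOS_ID)] = 5.0
    return logits

def greedy_generate(decoder, max_len=10):
    tokens = [BOS_ID]
    while len(tokens) < max_len:
        logits = decoder(tokens)              # run the decoder on all tokens so far
        next_id = int(np.argmax(logits[-1]))  # use only the last position's logits
        tokens.append(next_id)
        if next_id == EOS_ID:                 # stop once EOS is predicted
            break
    return tokens

ids = greedy_generate(toy_decoder)
print([VOCAB[i] for i in ids])  # ['<bos>', 'The', 'cat', 'sat', '<eos>']
```

Replacing `np.argmax` with a draw from the softmax of `logits[-1]` would turn this into sampling; the `max_len` guard is there because a real model is not guaranteed to emit EOS.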

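KV Cache: During inference, the K and V vectors for already-generated tokens don't change, because the causal mask stops earlier positions from seeing later ones. Caching them means each step only computes Q, K, and V for the newest token and attends over the cached entries, reducing the attention cost from O(n²) to O(n) per new token. In an encoder-decoder model, the cross-attention K and V depend only on the encoder output, so they can be computed once and reused for the whole generation.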
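The following sketch checks that caching gives the same result as recomputing everything, for one single-head self-attention layer. It is a simplified illustration: the random weights, the dictionary-based `cache`, and the function names `full_last_output` and `cached_step` are assumptions, and real implementations store the cache as preallocated tensors per layer and head.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d = 8
w_q, w_k, w_v = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))
x = rng.normal(size=(5, d))  # embeddings of the 5 tokens generated so far

def full_last_output(x):
    """No cache: recompute Q, K, V for every token, mask, and keep the last row."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    n = len(x)
    scores = q @ k.T / np.sqrt(d)  # (n, n) scores: O(n^2) work per step
    scores = np.where(np.tril(np.ones((n, n), dtype=bool)), scores, -np.inf)
    return (softmax(scores) @ v)[-1]

def cached_step(x_new, cache):
    """With cache: project only the new token and attend over the stored K and V."""
    cache["k"].append(x_new @ w_k)
    cache["v"].append(x_new @ w_v)
    k, v = np.stack(cache["k"]), np.stack(cache["v"])
    scores = (x_new @ w_q) @ k.T / np.sqrt(d)  # one score per cached key: O(n) work
    return softmax(scores) @ v

cache = {"k": [], "v": []}
for t in range(len(x)):
    cached_out = cached_step(x[t], cache)
    # Both routes give the same output for the newest position.
    assert np.allclose(cached_out, full_last_output(x[: t + 1]))
```

No mask is needed in `cached_step`: the cache only ever contains keys and values for tokens up to the current one, so causality holds by construction.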


Sub-Layer 3: Feed-Forward Network

Same as the encoder's FFN — position-wise, same architecture, no cross-position mixing.
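"Position-wise" can be checked directly: applying the FFN to the whole sequence gives the same result as applying it to each position on its own. The sketch assumes a two-layer network with a ReLU activation and a hidden size of four times the model size, as in the original Transformer; the dimensions and random weights are chosen only for illustration.

```python
import numpy as np

def ffn(x, w1, b1, w2, b2):
    # Two linear layers with a ReLU in between, applied along the feature axis.
    return np.maximum(0.0, x @ w1 + b1) @ w2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
w1, b1 = rng.normal(scale=0.1, size=(d_model, d_ff)), np.zeros(d_ff)
w2, b2 = rng.normal(scale=0.1, size=(d_ff, d_model)), np.zeros(d_model)

x = rng.normal(size=(4, d_model))  # 4 positions

whole = ffn(x, w1, b1, w2, b2)                                    # all positions at once
per_position = np.stack([ffn(row, w1, b1, w2, b2) for row in x])  # one position at a time

assert np.allclose(whole, per_position)  # no information moves between positions
```

All mixing between positions therefore happens in the two attention sub-layers; the FFN only transforms each position's vector.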


Decoder vs Encoder: Key Differences

| Property | Encoder | Decoder |
|----------|---------|---------|
| Self-attention masking | None (bidirectional) | Causal (left-only) |
| Cross-attention layer | No | Yes in encoder-decoder models (attends to encoder output); omitted in decoder-only models such as GPT |
| Context access | Full sequence | Past tokens only |
| Typical use | Understanding tasks | Generation tasks |
| Examples | BERT, RoBERTa | GPT (decoder-only), original Transformer decoder |


Interview Answer

"The decoder block has three sub-layers: (1) masked self-attention, where the causal mask prevents positions from attending to future tokens, enabling autoregressive generation; (2) cross-attention, where Q comes from the decoder and K and V come from the encoder output, allowing the decoder to condition on the input; (3) a feed-forward network with the same structure as the encoder's. At inference, the decoder generates one token at a time, using a KV cache to avoid recomputing the keys and values of earlier tokens."