The Transformer Decoder Block
What makes the decoder different from the encoder: masked self-attention, cross-attention, causal masking, and the autoregressive generation process.
Decoder Block Structure
The decoder has three sub-layers (vs two in the encoder):
Input y (previous outputs)
│
├── [Masked Multi-Head Self-Attention] → + y → LayerNorm → a
│     ↑ causal mask (can't see future)
│
├── [Cross-Attention] → + a → LayerNorm → b
│     ↑ Q from decoder, K/V from encoder output
│
└── [Feed-Forward Network] → + b → LayerNorm → output

The additional sub-layer (cross-attention) is what allows the decoder to condition on the encoder's output: the source sentence in translation, or the input document in summarisation.
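To make the data flow above concrete, here is a minimal NumPy sketch of one decoder block. It is an illustration under stated assumptions, not a reference implementation: it uses a single attention head, a LayerNorm without learned gain or bias, a ReLU feed-forward layer, and randomly initialised weights. The names `decoder_block`, `attention`, and `params` are made up for this example.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalise each position's feature vector (assumption: no learned gain/bias).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def attention(q_in, kv_in, w_q, w_k, w_v, causal=False):
    # Single-head scaled dot-product attention (assumption: one head only).
    q, k, v = q_in @ w_q, kv_in @ w_k, kv_in @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if causal:
        # Hide positions j > i by setting their scores to -inf.
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v

def decoder_block(y, enc_out, p):
    # Sub-layer 1: masked self-attention, residual add, LayerNorm -> a
    a = layer_norm(y + attention(y, y, *p["self"], causal=True))
    # Sub-layer 2: cross-attention (Q from a, K/V from the encoder) -> b
    b = layer_norm(a + attention(a, enc_out, *p["cross"]))
    # Sub-layer 3: position-wise feed-forward network -> output
    w1, w2 = p["ffn"]
    return layer_norm(b + np.maximum(0.0, b @ w1) @ w2)

rng = np.random.default_rng(0)
d = 8  # hypothetical model width
params = {
    "self": [rng.normal(size=(d, d)) * 0.1 for _ in range(3)],
    "cross": [rng.normal(size=(d, d)) * 0.1 for _ in range(3)],
    "ffn": [rng.normal(size=(d, 4 * d)) * 0.1, rng.normal(size=(4 * d, d)) * 0.1],
}
y = rng.normal(size=(4, d))        # 4 target positions
enc_out = rng.normal(size=(6, d))  # 6 source positions
out = decoder_block(y, enc_out, params)
print(out.shape)  # (4, 8)
```

The output keeps one vector per target position, and because the only cross-position mixing on the target side is causally masked, changing a later target position leaves the outputs at earlier positions unchanged.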
Sub-Layer 1: Masked Self-Attention
The decoder's self-attention is causally masked, so position i can attend only to positions 0..i and never to later positions:
Causal mask (4-token sequence):
       pos0 pos1 pos2 pos3
pos0 [  ✓    ✗    ✗    ✗  ]
pos1 [  ✓    ✓    ✗    ✗  ]
pos2 [  ✓    ✓    ✓    ✗  ]
pos3 [  ✓    ✓    ✓    ✓  ]
✗ positions have scores set to -∞ before softmax → attention weight = 0

Why masking: During training, the full target sequence is available (teacher forcing). Without masking, the model could "cheat" by looking at future tokens. At inference, future tokens don't exist yet, and the mask reproduces that condition during training.
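The masking step can be sketched in a few lines of NumPy. This is an illustrative example only: the `scores` matrix is random and stands in for the scaled query-key products, which a real model would compute from its own projections.

```python
import numpy as np

n = 4
rng = np.random.default_rng(0)
scores = rng.normal(size=(n, n))  # stand-in for Q K^T / sqrt(d_k)

# True above the diagonal marks future positions (j > i) that must be hidden.
future = np.triu(np.ones((n, n), dtype=bool), k=1)
masked = np.where(future, -np.inf, scores)

# Row-wise softmax; exp(-inf) = 0, so hidden positions get zero weight.
shifted = masked - masked.max(axis=-1, keepdims=True)
weights = np.exp(shifted)
weights /= weights.sum(axis=-1, keepdims=True)

print(np.round(weights, 2))
```

Each row of `weights` still sums to 1, the upper triangle is exactly zero, and row 0 puts all of its weight on position 0 because that is the only position it is allowed to see.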
Sub-Layer 2: Cross-Attention (Encoder-Decoder Attention)
CrossAttention:
Q = decoder's current representation (from masked self-attention)
K = encoder's final output
V = encoder's final output
CrossAttention(Q, K_enc, V_enc) = Attention(Q · W^Q, K_enc · W^K, V_enc · W^V)

This is how the decoder "reads" the encoder. Each decoder position can attend to any encoder position, so there is no causal mask here. The decoder learns which source positions are relevant for generating each target token.
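A minimal single-head sketch of cross-attention in NumPy follows. The shapes, the random weights, and the function name `cross_attention` are assumptions made for illustration; real models use multiple heads and learned parameters.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(dec, enc, w_q, w_k, w_v):
    q = dec @ w_q  # queries come from the decoder: (T_dec, d_k)
    k = enc @ w_k  # keys come from the encoder:    (T_enc, d_k)
    v = enc @ w_v  # values come from the encoder:  (T_enc, d_k)
    # No mask: every decoder position may look at every encoder position.
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))  # (T_dec, T_enc)
    return weights @ v, weights

rng = np.random.default_rng(0)
d_model, d_k = 8, 4                   # hypothetical sizes
dec = rng.normal(size=(3, d_model))   # 3 target positions
enc = rng.normal(size=(5, d_model))   # 5 source positions
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = cross_attention(dec, enc, w_q, w_k, w_v)
print(out.shape, weights.shape)  # (3, 4) (3, 5)
```

The attention matrix is rectangular, with one row per decoder position and one column per encoder position, which is why the source and target sequences can have different lengths.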
Autoregressive Generation
At inference, decoder-only or encoder-decoder models generate one token at a time:
Step 1: input = [BOS] → predict "The"
Step 2: input = [BOS, The] → predict "cat"
Step 3: input = [BOS, The, cat] → predict "sat"
...
until EOS token is predicted
Each step:
- Run full decoder on all tokens generated so far
- Take only the last position's output → logits over vocabulary
- Sample or argmax → next token

KV Cache: During inference, the K and V vectors for already-generated tokens don't change. Caching them avoids recomputing them at each step, reducing the attention cost per new token from O(n²) to O(n).
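The loop and the KV cache can be sketched with a toy model. Everything here is an assumption made for illustration: a single attention layer with one head, random weights, logits computed against the embedding table, and greedy (argmax) decoding. The token ids `BOS` and `EOS` and the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d = 10, 8
BOS, EOS = 0, 1  # hypothetical special-token ids
emb = rng.normal(size=(vocab, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

def attend(q, keys, values):
    # One query against n stored keys: O(n) work for the new token.
    s = keys @ q / np.sqrt(d)
    w = np.exp(s - s.max())
    return (w / w.sum()) @ values

def next_token_full(tokens):
    # No cache: recompute Q, K, V and the full n x n causal attention.
    x = emb[tokens]
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    s = q @ k.T / np.sqrt(d)
    s = np.where(np.triu(np.ones(s.shape, dtype=bool), k=1), -np.inf, s)
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    h = (w / w.sum(axis=-1, keepdims=True)) @ v
    return int(np.argmax(emb @ h[-1]))  # only the last position is used

def generate(max_new=6, use_cache=True):
    tokens, keys, values = [BOS], [], []
    for _ in range(max_new):
        if use_cache:
            x = emb[tokens[-1]]  # project only the newest token
            keys.append(x @ w_k)
            values.append(x @ w_v)
            h = attend(x @ w_q, np.array(keys), np.array(values))
            nxt = int(np.argmax(emb @ h))
        else:
            nxt = next_token_full(tokens)
        tokens.append(nxt)
        if nxt == EOS:
            break
    return tokens

cached = generate(use_cache=True)
uncached = generate(use_cache=False)
print(cached == uncached)
```

Both paths produce the same tokens because the keys and values of earlier positions never depend on later ones. The cached path simply avoids projecting and attending over the whole prefix again at every step.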
Sub-Layer 3: Feed-Forward Network
Same as the encoder's FFN: position-wise, same architecture, no cross-position mixing.
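"Position-wise" can be checked directly: applying the network to the whole sequence gives the same result as applying it to each position on its own. The sketch below assumes a two-layer ReLU network with a hidden width of four times the model width, and uses random weights purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32  # assumption: d_ff = 4 * d_model
w1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
w2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

def ffn(x):
    # Two linear maps with a ReLU in between, applied along the feature axis.
    return np.maximum(0.0, x @ w1 + b1) @ w2 + b2

x = rng.normal(size=(5, d_model))                # 5 positions
whole = ffn(x)                                   # all positions at once
one_by_one = np.stack([ffn(row) for row in x])   # each position separately
print(np.allclose(whole, one_by_one))  # True
```

Since no position influences another inside this sub-layer, all mixing across positions in the decoder comes from the two attention sub-layers.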
Decoder vs Encoder: Key Differences
| Property | Encoder | Decoder |
|----------|---------|---------|
| Self-attention masking | None (bidirectional) | Causal (left-only) |
| Cross-attention layer | No | Yes (attends to encoder output); omitted in decoder-only models |
| Context access | Full sequence | Past tokens only |
| Typical use | Understanding tasks | Generation tasks |
| Examples | BERT, RoBERTa | GPT (decoder-only), original Transformer decoder |
Interview Answer
"The decoder block has three sub-layers: (1) masked self-attention, where the causal mask prevents positions from attending to future tokens, enabling autoregressive generation; (2) cross-attention, where Q comes from the decoder and K and V come from the encoder output, allowing the decoder to condition on the input; (3) a feed-forward network, identical in structure to the encoder's. At inference, the decoder generates one token at a time, using a KV cache to avoid recomputing past representations."