The Transformer Decoder Block

What makes the decoder different from the encoder: masked self-attention, cross-attention, causal masking, and the autoregressive generation process.

Asma Hafeez Khan · May 16, 2026 · 3 min read
Tags: Transformers, Decoder, Architecture, Autoregressive, Interview
Decoder Block Structure

The decoder has three sub-layers (vs two in the encoder):

Input y (previous outputs)
│
├─→ [Masked Multi-Head Self-Attention] → + y → LayerNorm → a
│         ↑ causal mask (can't see future)
│
├─→ [Cross-Attention]                  → + a → LayerNorm → b
│         ↑ Q from decoder, K/V from encoder output
│
└─→ [Feed-Forward Network]             → + b → LayerNorm → output

The additional sub-layer (cross-attention) is what allows the decoder to condition on the encoder's output: the source sentence in translation, or the input document in summarisation.
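The wiring in the diagram can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: it assumes the post-LayerNorm ordering drawn above, uses a LayerNorm without learned scale and shift, and takes the three sub-layers as function arguments. The names `decoder_block`, `add_and_norm`, and the `zeros_like` stand-in are hypothetical.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalise one vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(residual, sublayer_out):
    """Residual connection followed by LayerNorm, applied to each position."""
    return [layer_norm([r + s for r, s in zip(r_row, s_row)])
            for r_row, s_row in zip(residual, sublayer_out)]

def decoder_block(y, enc_out, masked_self_attn, cross_attn, ffn):
    """Chain the three sub-layers in the order shown in the diagram."""
    a = add_and_norm(y, masked_self_attn(y))      # sub-layer 1: masked self-attention
    b = add_and_norm(a, cross_attn(a, enc_out))   # sub-layer 2: cross-attention
    return add_and_norm(b, ffn(b))                # sub-layer 3: feed-forward network

# Hypothetical stand-in sub-layer that returns zeros, used only to show the data flow.
def zeros_like(x):
    return [[0.0] * len(row) for row in x]

y = [[1.0, 2.0, 3.0, 4.0], [0.5, -0.5, 1.5, 2.5]]   # 2 target positions, d_model = 4
enc_out = [[0.1, 0.2, 0.3, 0.4]] * 3                # 3 source positions
out = decoder_block(y, enc_out,
                    masked_self_attn=zeros_like,
                    cross_attn=lambda a, enc: zeros_like(a),
                    ffn=zeros_like)
```

The output keeps the shape of `y` (one vector per target position); only the cross-attention sub-layer ever receives `enc_out`.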


Sub-Layer 1: Masked Self-Attention

The decoder's self-attention is causally masked: position i can only attend to positions 0..i, not future positions:

Causal mask (4-token sequence):
     pos0  pos1  pos2  pos3
pos0 [  ✓    ✗    ✗    ✗  ]
pos1 [  ✓    ✓    ✗    ✗  ]
pos2 [  ✓    ✓    ✓    ✗  ]
pos3 [  ✓    ✓    ✓    ✓  ]

✗ positions have scores set to -∞ before softmax → attention weight = 0

Why masking: During training, the full target sequence is available (teacher forcing). Without masking, the model could "cheat" by looking at the very tokens it is supposed to predict. At inference, future tokens don't exist yet; the mask reproduces that constraint during training.
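The masking step can be reproduced with the standard library alone. The sketch below is illustrative and uses uniform raw scores so that the effect of the mask is easy to see; `causal_mask` and `masked_softmax` are hypothetical helper names.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax with disallowed entries set to -inf beforehand."""
    weights = []
    for score_row, mask_row in zip(scores, mask):
        masked = [s if ok else -math.inf for s, ok in zip(score_row, mask_row)]
        peak = max(masked)                            # subtract the max for stability
        exps = [math.exp(s - peak) for s in masked]   # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [[1.0] * 4 for _ in range(4)]   # uniform raw scores, 4-token sequence
weights = masked_softmax(scores, causal_mask(4))
for row in weights:
    print([round(w, 2) for w in row])
```

With uniform scores, row i spreads its weight evenly over positions 0..i and assigns exactly zero to every later position, matching the ✓/✗ pattern above.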


Sub-Layer 2: Cross-Attention (Encoder-Decoder Attention)

CrossAttention:
  Q = decoder's current representation (from masked self-attention)
  K = encoder's final output
  V = encoder's final output

CrossAttention(Q, K_enc, V_enc) = Attention(Q · W^Q, K_enc · W^K, V_enc · W^V)

This is how the decoder "reads" the encoder. Each decoder position can attend to any encoder position; there is no causal mask here. The decoder learns which source positions are relevant for generating each target token.
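A single-head version of this computation might look as follows. It is a sketch under stated assumptions: `Attention` is taken to be scaled dot-product attention, the projection matrices are set to the identity purely for readability, and the helper names (`matmul`, `softmax`, `cross_attention`) are hypothetical.

```python
import math

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(dec, enc, w_q, w_k, w_v):
    """Queries come from the decoder; keys and values come from the encoder output."""
    q = matmul(dec, w_q)   # (m target positions, d)
    k = matmul(enc, w_k)   # (n source positions, d)
    v = matmul(enc, w_v)   # (n source positions, d)
    scale = math.sqrt(len(q[0]))
    # Every target position scores every source position: no mask is applied.
    scores = [[sum(qi * ki for qi, ki in zip(q_row, k_row)) / scale for k_row in k]
              for q_row in q]
    weights = [softmax(r) for r in scores]   # (m, n)
    return matmul(weights, v), weights

identity = [[1.0, 0.0], [0.0, 1.0]]           # illustrative projections only
dec = [[1.0, 0.0], [0.0, 1.0]]                # 2 target positions
enc = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]    # 3 source positions
out, weights = cross_attention(dec, enc, identity, identity, identity)
```

The weight matrix has one row per target position and one column per source position, so the target and source lengths are free to differ, and every entry is strictly positive because nothing is masked out.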


Autoregressive Generation

At inference, decoder-only or encoder-decoder models generate one token at a time:

Step 1: input = [BOS]           → predict "The"
Step 2: input = [BOS, The]      → predict "cat"
Step 3: input = [BOS, The, cat] → predict "sat"
...
until EOS token is predicted

Each step:
  - Run full decoder on all tokens generated so far
  - Take only the last position's output → logits over vocabulary
  - Sample or argmax → next token
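The loop above can be sketched with a toy stand-in for the model. Everything model-related here is hypothetical: `toy_decoder` is a lookup table rather than a trained network, and the sketch uses greedy argmax instead of sampling.

```python
BOS, EOS = "<bos>", "<eos>"
VOCAB = ["The", "cat", "sat", EOS]

# Hypothetical stand-in for a trained decoder: a fixed next-token table.
NEXT_TOKEN = {BOS: "The", "The": "cat", "cat": "sat", "sat": EOS}

def toy_decoder(tokens):
    """Return one row of logits per input position (a toy lookup, not a real model)."""
    return [[1.0 if tok == NEXT_TOKEN.get(t, EOS) else 0.0 for tok in VOCAB]
            for t in tokens]

def generate(decoder, max_new_tokens=10):
    tokens = [BOS]
    for _ in range(max_new_tokens):
        logits = decoder(tokens)[-1]                               # last position only
        next_id = max(range(len(logits)), key=logits.__getitem__)  # greedy argmax
        next_token = VOCAB[next_id]
        if next_token == EOS:                                      # stop at end-of-sequence
            break
        tokens.append(next_token)
    return tokens[1:]                                              # drop BOS

print(generate(toy_decoder))   # ['The', 'cat', 'sat']
```

The `max_new_tokens` cap is a common safeguard in case the model never emits EOS.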

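KV Cache: Because of the causal mask, the K and V vectors of already-generated tokens do not change when new tokens are appended. Caching them avoids recomputing them at each step, so only the newest token's query has to attend over the cached keys and values. This reduces the attention cost per new token from O(n²) to O(n) in the sequence length n.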
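To make the caching argument concrete, the sketch below compares two ways of computing the newest position's attention output: recomputing K and V for the whole prefix, and appending one key/value pair per step to a cache. It assumes a single attention head with small fixed weights; `KVCache`, `project`, and `attend` are hypothetical names.

```python
import math

def project(x, w):
    """Multiply one row vector by a weight matrix."""
    return [sum(xi * wij for xi, wij in zip(x, col)) for col in zip(*w)]

def attend(q, keys, values):
    """Scaled dot-product attention for a single query over the given keys/values."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[d] for e, v in zip(exps, values))
            for d in range(len(values[0]))]

# Small fixed weights, chosen only for illustration.
W_Q = [[1.0, 0.0], [0.5, 1.0]]
W_K = [[0.0, 1.0], [1.0, 0.5]]
W_V = [[1.0, 1.0], [0.0, 2.0]]

def last_output_no_cache(xs):
    """Recompute K and V for every token in the prefix, then attend from the last one."""
    keys = [project(x, W_K) for x in xs]
    values = [project(x, W_V) for x in xs]
    return attend(project(xs[-1], W_Q), keys, values)

class KVCache:
    """Keep past keys and values; project only the newest token at each step."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x):
        self.keys.append(project(x, W_K))
        self.values.append(project(x, W_V))
        return attend(project(x, W_Q), self.keys, self.values)

xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # embeddings of three generated tokens
cache = KVCache()
cached = [cache.step(x) for x in xs]
full = [last_output_no_cache(xs[:t + 1]) for t in range(len(xs))]
```

Both paths produce the same vectors at every step; the cached path simply performs one K/V projection per step instead of one per token in the prefix.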


Sub-Layer 3: Feed-Forward Network

Same as the encoder's FFN: position-wise, same architecture, no cross-position mixing.
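"Position-wise" can be checked directly: reordering the positions only reorders the outputs. The sketch assumes the two-layer ReLU form used in the original Transformer, which this article does not spell out, and the weights are arbitrary illustrative values.

```python
def linear(x, w, b):
    """Affine map of one row vector: x @ w + b."""
    return [sum(xi * wij for xi, wij in zip(x, col)) + bj
            for col, bj in zip(zip(*w), b)]

def ffn_position(x, w1, b1, w2, b2):
    """Two linear layers with a ReLU in between, applied to a single position."""
    hidden = [max(0.0, h) for h in linear(x, w1, b1)]
    return linear(hidden, w2, b2)

def ffn(seq, w1, b1, w2, b2):
    """Apply the same weights to each position independently."""
    return [ffn_position(x, w1, b1, w2, b2) for x in seq]

# Arbitrary illustrative weights: d_model = 2, hidden size = 3.
W1 = [[1.0, -1.0, 0.5], [0.0, 2.0, -0.5]]
B1 = [0.0, 0.1, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
B2 = [0.0, 0.0]

seq = [[1.0, 2.0], [3.0, -1.0], [0.0, 0.5]]
out = ffn(seq, W1, B1, W2, B2)
reversed_out = ffn(seq[::-1], W1, B1, W2, B2)
```

Because no position reads from any other, `reversed_out` is simply `out` in reverse order; all mixing across positions happens in the two attention sub-layers.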


Decoder vs Encoder: Key Differences

| Property | Encoder | Decoder |
|----------|---------|---------|
| Self-attention masking | None (bidirectional) | Causal (current and earlier positions only) |
| Cross-attention layer | No | Yes in encoder-decoder models (attends to encoder output); absent in decoder-only models such as GPT |
| Context access | Full sequence | Current and past tokens only |
| Typical use | Understanding tasks | Generation tasks |
| Examples | BERT, RoBERTa | GPT (decoder-only), original Transformer decoder |


Interview Answer

"The decoder block has three sub-layers: (1) masked self-attention, where the causal mask prevents positions from attending to future tokens, enabling autoregressive generation; (2) cross-attention, where Q comes from the decoder and K and V come from the encoder output, allowing the decoder to condition on the input; (3) a feed-forward network, identical in structure to the encoder's. At inference, the decoder generates one token at a time, using a KV cache to avoid recomputing the keys and values of past tokens."
