Encoder-Decoder Models: T5 and BART
Encoder-Decoder Architecture
The original Transformer (Vaswani et al., 2017) is an encoder-decoder model. It has two separate stacks:
SOURCE SEQUENCE
↓
[Encoder Stack]
Bidirectional self-attention (full context)
Feed-forward network
(N encoder blocks)
↓
Encoder output: contextualised source representations
↓ (passed as K, V to all decoder cross-attention layers)
TARGET SEQUENCE (shifted right)
↓
[Decoder Stack]
Causal self-attention (past target tokens only)
Cross-attention: Q from decoder, K/V from encoder output
Feed-forward network
(N decoder blocks)
↓
Next target token distribution
The encoder reads the full input bidirectionally. The decoder generates the output autoregressively, using cross-attention to "look at" the encoder's representations at each step.
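The three attention patterns in the diagram differ only in which positions may attend to which. The sketch below builds the corresponding boolean masks with NumPy and shows what "shifted right" means for the decoder input. The function names and the choice of `0` as the start-token id are illustrative assumptions, not taken from any particular library.

```python
import numpy as np

def attention_masks(src_len: int, tgt_len: int):
    """Return boolean masks (True = may attend) for the three attention types."""
    # Encoder self-attention: every source position sees every source position.
    encoder_self = np.ones((src_len, src_len), dtype=bool)
    # Decoder self-attention: position i sees target positions 0..i only.
    decoder_self = np.tril(np.ones((tgt_len, tgt_len), dtype=bool))
    # Cross-attention: every target position sees every source position.
    cross = np.ones((tgt_len, src_len), dtype=bool)
    return encoder_self, decoder_self, cross

def shift_right(target_ids, start_id):
    """Decoder input: start token, then the target without its last token."""
    return [start_id] + list(target_ids[:-1])

enc_mask, dec_mask, cross_mask = attention_masks(src_len=5, tgt_len=3)
decoder_input = shift_right([11, 12, 13], start_id=0)  # hypothetical token ids
```

With this shift, the decoder input at step t is the target token from step t-1, so the model is trained to predict each target token from the tokens before it.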
Cross-Attention: The Bridge
Cross-attention is what distinguishes encoder-decoder from decoder-only:
For each decoder layer:
Q = decoder's current representation (what we're generating)
K = encoder's final output (the source)
V = encoder's final output
CrossAttn(Q, K_enc, V_enc) = softmax(Q·K_encᵀ / √dₖ) · V_enc
Each decoder position can attend to any encoder position: there is no causal mask on the cross-attention. The decoder learns which source tokens are most relevant for generating each target token.
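The formula above can be sketched as single-head cross-attention in NumPy. This is a minimal illustration, not a reference implementation: the projection matrices `W_q`, `W_k`, `W_v`, the dimensions, and the random inputs are assumptions made for the example, and multi-head splitting and the output projection are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_output, W_q, W_k, W_v):
    Q = decoder_states @ W_q          # queries come from the decoder
    K = encoder_output @ W_k          # keys come from the encoder output
    V = encoder_output @ W_v          # values come from the encoder output
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # shape (tgt_len, src_len), no causal mask
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
d_model, d_k = 8, 4                                # toy sizes for illustration
encoder_output = rng.normal(size=(5, d_model))     # 5 source tokens
decoder_states = rng.normal(size=(3, d_model))     # 3 target positions so far
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

context, weights = cross_attention(decoder_states, encoder_output, W_q, W_k, W_v)
```

Each row of `weights` is a distribution over the source positions for one target position, which is the quantity the translation example refers to when it says "high weight on h_daily".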
Example — translation:
Source (English): "The patient takes Warfarin daily"
Encoder output: [h_The, h_patient, h_takes, h_Warfarin, h_daily]
Decoder generating "täglich" (German for "daily"):
Cross-attention for "täglich" → high weight on h_daily
T5: Text-to-Text Transfer Transformer
T5 reframes all NLP tasks as text generation:
Translation: "translate English to German: The cat sat."
→ "Die Katze saß."
Summarisation: "summarize: [long article text]"
→ "Brief summary."
Classification: "Is this review positive or negative? Great product!"
→ "positive"
QA: "question: What drug does the patient take? context: ..."
→ "Warfarin"
This unified text-to-text format lets a single model be trained on every task with the same cross-entropy loss and the same decoding procedure.
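As a rough illustration of the text-to-text framing, the helper below turns different tasks into plain input strings. The function name, the task keys, and the template dictionary are hypothetical. The prefixes follow the examples above, with the American spelling `summarize:` that the released T5 checkpoints were trained with.

```python
# Hypothetical templates for illustration: every task becomes "prefix + input text".
TEMPLATES = {
    "translate_en_de": "translate English to German: {text}",
    "summarize": "summarize: {text}",
    "qa": "question: {question} context: {context}",
}

def to_text_to_text(task: str, **fields: str) -> str:
    """Format one example as a single input string for a text-to-text model."""
    return TEMPLATES[task].format(**fields)

# Every task reduces to an (input string, target string) pair.
pairs = [
    (to_text_to_text("translate_en_de", text="The cat sat."), "Die Katze saß."),
    (to_text_to_text("qa",
                     question="What drug does the patient take?",
                     context="The patient takes Warfarin daily."), "Warfarin"),
]
```

Because inputs and targets are both strings, the same tokenizer, model, and loss can be reused across tasks without task-specific output heads.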
T5 Variants
| Model | Params | Notes |
|-------|--------|-------|
| T5-small | 60M | 6 encoder + 6 decoder layers |
| T5-base | 220M | 12+12 layers |
| T5-large | 770M | 24+24 layers |
| T5-3B | 3B | |
| T5-11B | 11B | SOTA on many benchmarks at release |
| Flan-T5 | 80M–11B | Instruction fine-tuned on 1800+ tasks |
| BART (not a T5 variant, listed for comparison) | 140M–400M | Corrupted-text reconstruction pretraining |
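The T5-base figure in the table can be sanity-checked with a back-of-the-envelope count. The sketch below assumes the hyperparameters of the public T5-base configuration (`d_model=768`, `d_ff=3072`, a vocabulary of 32128, shared embeddings), none of which appear in this lesson. It also assumes the attention inner dimension equals `d_model`, which does not hold for the 3B and 11B models, and it ignores layer norms and relative position biases.

```python
def encoder_decoder_params(d_model, d_ff, n_enc, n_dec, vocab_size):
    """Approximate parameter count of a T5-style encoder-decoder (no biases)."""
    attn = 4 * d_model * d_model       # W_q, W_k, W_v, W_o
    ffn = 2 * d_model * d_ff           # two linear maps in the feed-forward network
    enc_block = attn + ffn             # self-attention + FFN
    dec_block = 2 * attn + ffn         # self-attention + cross-attention + FFN
    embeddings = vocab_size * d_model  # one embedding table shared across the model
    return n_enc * enc_block + n_dec * dec_block + embeddings

t5_base = encoder_decoder_params(d_model=768, d_ff=3072,
                                 n_enc=12, n_dec=12, vocab_size=32128)
print(f"{t5_base / 1e6:.0f}M")  # prints 223M
```

The estimate lands close to the 220M listed for T5-base. It also shows that each decoder block carries one more attention sub-layer than an encoder block, because of cross-attention.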
BART Pretraining
BART (Lewis et al., 2019) uses a different pretraining strategy — corrupt the input and train to reconstruct it:
Corruption types:
Token masking: "The [MASK] takes [MASK] daily"
Token deletion: "The takes Warfarin" (tokens removed)
Text infilling: "The [MASK] daily" (span replaced with one mask)
Sentence permutation: sentences shuffled
Document rotation: start at random token
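Target: always the original uncorrupted text
BART is often reported to perform strongly on summarisation. One common explanation is that its reconstruction objective, which regenerates a full fluent text from a degraded input, is close to what summarisation requires. Results depend on model size and dataset, so this is not a blanket advantage over T5.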
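The five corruption types listed above can be sketched in a few lines of plain Python. This is a simplified illustration, not BART's actual preprocessing: the function names and the masking probability are assumptions, and the infilling span is passed explicitly, whereas BART samples span lengths from a Poisson distribution.

```python
import random

MASK = "[MASK]"

def token_masking(tokens, rng, p=0.3):
    """Replace each token with a mask with probability p."""
    return [MASK if rng.random() < p else t for t in tokens]

def token_deletion(tokens, rng, p=0.3):
    """Drop each token with probability p; the model must infer where."""
    return [t for t in tokens if rng.random() >= p]

def text_infilling(tokens, start, length):
    """Replace the span tokens[start:start+length] with a single mask."""
    return tokens[:start] + [MASK] + tokens[start + length:]

def sentence_permutation(sentences, rng):
    """Shuffle sentence order without modifying the input list."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled

def document_rotation(tokens, rng):
    """Rotate the document so that it starts at a random token."""
    k = rng.randrange(len(tokens))
    return tokens[k:] + tokens[:k]

tokens = "The patient takes Warfarin daily".split()
rng = random.Random(0)  # seeded for reproducibility
infilled = text_infilling(tokens, start=1, length=3)
print(infilled)  # ['The', '[MASK]', 'daily']
```

In every case the training pair is (corrupted text, original text), so the decoder always learns to produce the clean sequence.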
Encoder-Decoder vs Decoder-Only for Generation
| Property | Encoder-Decoder | Decoder-Only |
|----------|-----------------|--------------|
| Source encoding | Bidirectional (full context) | Causal (left-to-right) |
| Cross-attention | Yes (explicit source-target link) | No |
| Compute | Higher (two stacks) | Lower per parameter |
| Good for | Structured seq2seq (translation, summarisation) | Open-ended generation, chat |
| Fine-tuning data | Efficient (supervised seq2seq) | Needs more examples |
| Modern trend | Less dominant | Dominant (GPT-style) |
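Decoder-only models can also do translation and summarisation by treating them as text generation tasks: the source becomes part of the prompt and the target is generated as its continuation. They simply lack a dedicated bidirectional encoder and cross-attention for the source.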
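One way to picture "seq2seq as text generation" is to concatenate source and target into a single sequence for a decoder-only model. The sketch below is an assumption-laden illustration: the function name, the separator token, and the choice to compute the loss only on target tokens are common conventions rather than requirements.

```python
def build_decoder_only_example(source_ids, target_ids, sep_id):
    """Pack a seq2seq pair into one sequence for a decoder-only model."""
    input_ids = source_ids + [sep_id] + target_ids
    # 1 marks tokens that count towards the loss (the target),
    # 0 marks the source and the separator, which act only as context.
    loss_mask = [0] * (len(source_ids) + 1) + [1] * len(target_ids)
    return input_ids, loss_mask

# Hypothetical token ids: three source tokens, two target tokens.
input_ids, loss_mask = build_decoder_only_example([5, 6, 7], [8, 9], sep_id=1)
```

Here the source is read with the same causal self-attention as the target, whereas an encoder-decoder reads it bidirectionally and reaches it through cross-attention.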
When to Choose Encoder-Decoder
Use encoder-decoder when:
- Task is inherently seq2seq with distinct input/output (translation, summarisation)
- Fine-tuning on small supervised datasets (cross-attention provides a strong inductive bias)
- Input and output lengths differ significantly
- A pretrained T5, BART, or Flan-T5 checkpoint (for example via Hugging Face) already fits your structured NLP task
Use decoder-only when:
- Open-ended generation, chat, code completion
- Large model + in-context learning (few-shot GPT-style prompting)
- Unified pretraining at scale
Interview Answer
"Encoder-decoder models have two separate stacks. The encoder processes the full source sequence with bidirectional self-attention, producing contextualised representations. The decoder generates the target autoregressively: each decoder block has causal self-attention (past target tokens only) plus cross-attention where Q comes from the decoder and K/V come from the encoder output. This cross-attention is the bridge — it lets the decoder attend to any source position. T5 unifies all NLP tasks into text-to-text generation. Encoder-decoder excels at structured seq2seq tasks; decoder-only dominates at scale for open-ended generation."