Encoder-Decoder Models: T5 and BART

Encoder-Decoder Architecture

The original Transformer (Vaswani et al., 2017) is an encoder-decoder model. It has two separate stacks:

SOURCE SEQUENCE
      ↓
[Encoder Stack]
  Bidirectional self-attention (full context)
  Feed-forward network
  (N encoder blocks)
      ↓
  Encoder output: contextualised source representations
      ↓ (passed as K, V to all decoder cross-attention layers)

TARGET SEQUENCE (shifted right)
      ↓
[Decoder Stack]
  Causal self-attention (past target tokens only)
  Cross-attention: Q from decoder, K/V from encoder output
  Feed-forward network
  (N decoder blocks)
      ↓
  Next target token distribution

The encoder reads the full input bidirectionally. The decoder generates the output autoregressively, using cross-attention to "look at" the encoder's representations at each step.
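The data flow above can be made concrete with a minimal sketch of the "shifted right" decoder input and the three attention masks involved. The token ids, `START_ID`, and the helper names are hypothetical and chosen only for illustration; real implementations build these masks as tensors.

```python
def shift_right(target_ids, start_id):
    """Decoder input: the target with a start token prepended and the last token dropped."""
    return [start_id] + target_ids[:-1]

def causal_mask(length):
    """mask[i][j] is True when position i may attend to position j (only j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]

def full_mask(rows, cols):
    """Every query position may attend to every key position."""
    return [[True] * cols for _ in range(rows)]

source_ids = [11, 12, 13, 14, 15]   # hypothetical ids for a 5-token source
target_ids = [21, 22, 23, 1]        # hypothetical target ids; 1 stands in for end-of-sequence
START_ID = 0                        # hypothetical start-of-sequence id

decoder_input = shift_right(target_ids, START_ID)
encoder_self_mask = full_mask(len(source_ids), len(source_ids))   # bidirectional
decoder_self_mask = causal_mask(len(decoder_input))               # past target tokens only
cross_mask = full_mask(len(decoder_input), len(source_ids))       # target rows x source columns

print(decoder_input)          # [0, 21, 22, 23]
print(decoder_self_mask[1])   # [True, True, False, False]
```

At training time the decoder sees `decoder_input` and is asked to predict `target_ids` position by position, which is why the causal mask is needed: without it, position i could simply read the answer at position i + 1.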


Cross-Attention: The Bridge

Cross-attention is what distinguishes encoder-decoder from decoder-only:

For each decoder layer:
  Q = learned projection of the decoder's current representation  (what we're generating)
  K = learned projection of the encoder's final output            (the source)
  V = learned projection of the encoder's final output

CrossAttn(Q, K_enc, V_enc) = softmax(Q·K_encᵀ / √dₖ) · V_enc

Each decoder position can attend to any encoder position — no causal mask on the cross-attention. The decoder learns which source tokens are most relevant for generating each target token.

Example — translation:

Source (English): "The patient takes Warfarin daily"
Encoder output:  [h_The, h_patient, h_takes, h_Warfarin, h_daily]

Decoder generating "täglich" (German for "daily"):
  Cross-attention for "täglich" → high weight on h_daily
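The cross-attention formula and the translation example can be sketched together in plain Python. This is a toy illustration under strong assumptions: the encoder outputs are hand-made one-hot-like vectors, the learned Q/K/V projections are omitted (treated as identity), and the query for the German word is constructed by hand to align with `h_daily`. A trained model learns these alignments rather than having them written in.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """softmax(Q K^T / sqrt(d_k)) V over lists of vectors, with no causal mask."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        out = [sum(w * v[d] for w, v in zip(weights, values))
               for d in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

source_tokens = ["The", "patient", "takes", "Warfarin", "daily"]
# Toy encoder output: one distinct direction per source token (assumption).
encoder_out = [[4.0 if i == j else 0.0 for j in range(5)] for i in range(5)]
# Toy decoder query for the German word, hand-aligned with h_daily (assumption).
query_taeglich = [[0.0, 0.0, 0.0, 0.0, 4.0]]

outputs, weights = cross_attention(query_taeglich, encoder_out, encoder_out)
best = max(range(len(source_tokens)), key=lambda j: weights[0][j])
print(source_tokens[best])   # daily
```

Every source position receives a non-zero weight, since nothing is masked; the softmax simply concentrates most of the mass on the position whose key best matches the query.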

T5: Text-to-Text Transfer Transformer

T5 reframes all NLP tasks as text generation:

Translation:      "translate English to German: The cat sat."
                  → "Die Katze saß."

Summarisation:    "summarize: [long article text]"
                  → "Brief summary."

Classification:   "Is this review positive or negative? Great product!"
                  → "positive"

QA:               "question: What drug does the patient take? context: ..."
                  → "Warfarin"

This unified text-to-text format means the same model, the same cross-entropy loss, and the same decoding procedure can be used for every task, which allows training on a mixture of tasks at once.
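As a rough sketch of the idea, a single helper can render different tasks into one input string each. `to_text_to_text` is a hypothetical function written for this lesson, not part of any library, and the prefixes follow the examples above (using the "summarize:" spelling that the released T5 checkpoints were trained with).

```python
def to_text_to_text(task, **fields):
    """Render one task instance as a single input string (hypothetical helper)."""
    if task == "translation":
        return f"translate {fields['src']} to {fields['tgt']}: {fields['text']}"
    if task == "summarisation":
        return f"summarize: {fields['text']}"
    if task == "qa":
        return f"question: {fields['question']} context: {fields['context']}"
    raise ValueError(f"unknown task: {task}")

inputs = [
    to_text_to_text("translation", src="English", tgt="German", text="The cat sat."),
    to_text_to_text("summarisation", text="[long article text]"),
    to_text_to_text("qa", question="What drug does the patient take?",
                    context="The patient takes Warfarin daily."),
]

for text in inputs:
    print(text)
```

Because every task ends up as "string in, string out", one tokenizer, one model, and one loss cover all of them; only the prefix tells the model which behaviour is wanted.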


T5 Variants

| Model | Params | Notes |
|-------|--------|-------|
| T5-small | 60M | 6 encoder + 6 decoder layers |
| T5-base | 220M | 12+12 layers |
| T5-large | 770M | 24+24 layers |
| T5-3B | 3B | |
| T5-11B | 11B | SOTA on many benchmarks at release |
| Flan-T5 | 60M–11B | Instruction fine-tuned on 1800+ tasks |
| BART | 140M–400M | Corrupted-text reconstruction pretraining |
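The 60M figure for T5-small can be sanity-checked with a back-of-the-envelope count of the weight matrices. The hyperparameters below (`vocab=32128`, `d_model=512`, `d_ff=2048`) are not stated in this lesson; they are assumptions taken from the publicly released T5-small configuration. The sketch also assumes the attention inner dimension equals `d_model`, which does not hold for the largest T5 variants, and it ignores small terms such as layer-norm scales and relative position biases.

```python
def encoder_decoder_weight_count(vocab, d_model, d_ff, n_enc, n_dec):
    """Approximate weight count for an encoder-decoder Transformer (no biases)."""
    attention = 4 * d_model * d_model        # W_Q, W_K, W_V, W_O
    feed_forward = 2 * d_model * d_ff        # two linear layers
    embedding = vocab * d_model              # shared input/output embedding
    encoder = n_enc * (attention + feed_forward)
    decoder = n_dec * (2 * attention + feed_forward)   # self- plus cross-attention
    return embedding + encoder + decoder

t5_small = encoder_decoder_weight_count(
    vocab=32128, d_model=512, d_ff=2048, n_enc=6, n_dec=6
)
print(f"{t5_small / 1e6:.1f}M")   # 60.5M
```

Note that each decoder block is larger than an encoder block because it carries a second attention sublayer for cross-attention.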


BART Pretraining

BART (Lewis et al., 2019) uses a different pretraining strategy — corrupt the input and train to reconstruct it:

Corruption types:
  Token masking:   "The [MASK] takes [MASK] daily"
  Token deletion:  "The takes Warfarin"     (tokens removed)
  Text infilling:  "The [MASK] daily"       (span replaced with one mask)
  Sentence permutation: sentences shuffled
  Document rotation: start at random token

Target: always the original uncorrupted text
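The corruption types can be sketched as small list operations. This is a simplified illustration, not BART's actual preprocessing: the positions and spans are passed in explicitly so the output is reproducible, whereas BART samples them at random (the paper draws infilling span lengths from a Poisson distribution). All function names are hypothetical.

```python
MASK = "[MASK]"

def mask_tokens(tokens, positions):
    """Token masking: replace the chosen positions with a mask token."""
    return [MASK if i in positions else t for i, t in enumerate(tokens)]

def delete_tokens(tokens, positions):
    """Token deletion: drop the chosen positions, leaving no marker behind."""
    return [t for i, t in enumerate(tokens) if i not in positions]

def infill_span(tokens, start, length):
    """Text infilling: replace a whole span with a single mask token."""
    return tokens[:start] + [MASK] + tokens[start + length:]

def permute_sentences(sentences, order):
    """Sentence permutation: reorder sentences according to `order`."""
    return [sentences[i] for i in order]

def rotate(tokens, start):
    """Document rotation: begin the document at position `start`."""
    return tokens[start:] + tokens[:start]

original = "The patient takes Warfarin daily".split()

print(" ".join(mask_tokens(original, {1, 3})))     # The [MASK] takes [MASK] daily
print(" ".join(delete_tokens(original, {1, 4})))   # The takes Warfarin
print(" ".join(infill_span(original, 1, 3)))       # The [MASK] daily
print(" ".join(rotate(original, 3)))               # Warfarin daily The patient takes

# One training pair: corrupted text goes to the encoder, the original is the decoder target.
training_pair = (infill_span(original, 1, 3), original)
```

Deletion and infilling are harder than plain masking because the model must also work out where text is missing and how much of it there was.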

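BART is particularly strong on summarisation and other generation tasks. A common explanation is that its reconstruction objective, which trains the decoder to write out complete, fluent text from a degraded input, closely resembles those tasks. Whether it beats T5 in a given setting depends on model size and benchmark.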


Encoder-Decoder vs Decoder-Only for Generation

| Property | Encoder-Decoder | Decoder-Only |
|----------|-----------------|--------------|
| Source encoding | Bidirectional (full context) | Causal (left-to-right) |
| Cross-attention | Yes (explicit source-target link) | No |
| Compute | Higher (two stacks) | Lower per parameter |
| Good for | Structured seq2seq (translation, summarisation) | Open-ended generation, chat |
| Fine-tuning data | Efficient (supervised seq2seq) | Needs more examples |
| Modern trend | Less dominant | Dominant (GPT-style) |

Decoder-only models can do translation and summarisation by treating them as text generation tasks — they just aren't architecturally optimised for it.
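One way to picture the difference is in how a single training example is laid out. The sketch below is illustrative only: the function names, the `<sep>` token, and the choice to compute loss on target tokens alone are assumptions, and real decoder-only setups vary on all three.

```python
def encoder_decoder_example(source, target):
    """Source and target stay separate: one feeds the encoder, one the decoder."""
    return {"encoder_input": source, "decoder_target": target}

def decoder_only_example(source, target, sep="<sep>"):
    """Source and target are concatenated into one left-to-right sequence."""
    tokens = source + [sep] + target
    # 1 marks positions that contribute to the loss (here: target tokens only).
    loss_mask = [0] * (len(source) + 1) + [1] * len(target)
    return {"tokens": tokens, "loss_mask": loss_mask}

source = ["The", "cat", "sat", "."]
target = ["Die", "Katze", "sass", "."]   # ASCII stand-in for the German target

seq2seq = encoder_decoder_example(source, target)
causal = decoder_only_example(source, target)

print(causal["tokens"])
print(causal["loss_mask"])   # [0, 0, 0, 0, 0, 1, 1, 1, 1]
```

In the decoder-only layout the source tokens are processed under the same causal mask as the target, so an early source token never sees a later one; in the encoder-decoder layout the source is encoded with full bidirectional context.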


When to Choose Encoder-Decoder

Use encoder-decoder when:

  • Task is inherently seq2seq with distinct input/output (translation, summarisation)
  • Fine-tuning on small supervised datasets (cross-attention provides a strong inductive bias)
  • Input and output lengths differ significantly
  • You plan to fine-tune an existing T5, BART, or Flan-T5 checkpoint (for example via Hugging Face) for a structured NLP task

Use decoder-only when:

  • Open-ended generation, chat, code completion
  • Large model + in-context learning (few-shot GPT-style prompting)
  • Unified pretraining at scale

Interview Answer

"Encoder-decoder models have two separate stacks. The encoder processes the full source sequence with bidirectional self-attention, producing contextualised representations. The decoder generates the target autoregressively: each decoder block has causal self-attention (past target tokens only) plus cross-attention where Q comes from the decoder and K/V come from the encoder output. This cross-attention is the bridge — it lets the decoder attend to any source position. T5 unifies all NLP tasks into text-to-text generation. Encoder-decoder excels at structured seq2seq tasks; decoder-only dominates at scale for open-ended generation."