Interview Q&A: Encoder, Decoder, and Architecture Variants

Q: What's the difference between an encoder and decoder block?

| Property | Encoder Block | Decoder Block |
|----------|---------------|---------------|
| Self-attention | Bidirectional (no mask) | Causal (left-only mask) |
| Cross-attention | No | Yes (Q from decoder, K/V from encoder) |
| Sub-layers | 2 | 3 |
| Context | Full sequence | Past tokens only |

The decoder's causal mask and cross-attention layer are what distinguish it from the encoder.
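The two masking patterns can be sketched in a few lines. This is an illustrative, standard-library-only example; the helper name `attention_mask` is made up for this note and is not taken from any library.

```python
def attention_mask(seq_len, causal):
    """Return a seq_len x seq_len grid; mask[i][j] is True if position i may attend to j."""
    return [[(j <= i) if causal else True for j in range(seq_len)]
            for i in range(seq_len)]

encoder_mask = attention_mask(4, causal=False)  # every position sees every position
decoder_mask = attention_mask(4, causal=True)   # position i sees positions 0..i only

for row in decoder_mask:
    print("".join("x" if allowed else "." for allowed in row))
# x...
# xx..
# xxx.
# xxxx
```

In real implementations the disallowed entries are typically set to a large negative value before the softmax, so they receive zero attention weight.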

Q: Why can't BERT generate text?

BERT is encoder-only with bidirectional attention. Autoregressive generation requires predicting the token at position i from the positions before i only. BERT's self-attention gives every position access to all other positions, including i+1, i+2, and so on, so its outputs are never conditioned on the left context alone. It has no built-in mechanism for generating tokens one at a time.

Also, BERT is pretrained with MLM (predict masked tokens from context), not with next-token prediction. Its output at each position is a probability distribution over the full vocabulary conditioned on all context — not the left context only.
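The difference between the two pretraining objectives shows up in how training examples are built. The sketch below is a simplification: the function names are hypothetical, and the masked positions are chosen by hand rather than sampled at random as in real MLM pretraining.

```python
def mlm_example(tokens, masked_positions, mask_token="[MASK]"):
    """Masked LM: hide some tokens; the model sees context on BOTH sides of each gap."""
    inputs = [mask_token if i in masked_positions else tok
              for i, tok in enumerate(tokens)]
    labels = {i: tokens[i] for i in masked_positions}
    return inputs, labels

def next_token_examples(tokens):
    """Causal LM: every prefix is paired with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

tokens = ["the", "cat", "sat", "down"]

mlm_inputs, mlm_labels = mlm_example(tokens, masked_positions={1})
print(mlm_inputs, mlm_labels)
# ['the', '[MASK]', 'sat', 'down'] {1: 'cat'}

for prefix, label in next_token_examples(tokens):
    print(prefix, "->", label)
# ['the'] -> cat
# ['the', 'cat'] -> sat
# ['the', 'cat', 'sat'] -> down
```

In the MLM example the prediction for "cat" can use "sat" and "down"; in the next-token examples it can use only "the".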

Q: Why can't GPT do bidirectional understanding as well as BERT?

GPT's causal mask prevents position i from seeing positions i+1, i+2, ... This means GPT never builds representations using right context. Tasks like NER (tagging a token based on full sentence context) or sentence similarity (comparing two full representations) benefit from bidirectional context that GPT structurally lacks.
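A small experiment makes the point concrete: under a causal mask, changing a later token cannot change the representation of an earlier position, while under bidirectional attention it does. This is a toy sketch that assumes single-head attention with Q = K = V equal to the input vectors (no learned projections); the function name is made up for illustration.

```python
import math

def self_attention(x, causal):
    """Toy single-head self-attention with Q = K = V = x (an assumption for brevity)."""
    n, d = len(x), len(x[0])
    out = []
    for i in range(n):
        # Positions that query i is allowed to attend to.
        visible = range(i + 1) if causal else range(n)
        scores = [sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(d) for j in visible]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(exps)
        out.append([sum(e / total * x[j][k] for e, j in zip(exps, visible))
                    for k in range(d)])
    return out

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
x_changed = [[1.0, 0.0], [0.0, 1.0], [-2.0, 3.0]]  # only the LAST token differs

# Compare the outputs at the first two positions.
causal_same = self_attention(x, True)[:2] == self_attention(x_changed, True)[:2]
bidir_same = self_attention(x, False)[:2] == self_attention(x_changed, False)[:2]
print(causal_same, bidir_same)  # True False
```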

In practice, large decoder-only models (GPT-3 and later) partially overcome this through scale and prompting: when the full input is placed in the prompt before the answer, the positions that produce the answer can attend to all of it. Architecturally, though, BERT-style bidirectionality remains the more natural fit for classification and NER.

Q: When would you choose encoder-only vs decoder-only vs encoder-decoder?

Encoder-only (BERT, RoBERTa, ClinicalBERT):
  ✓ Text classification (sentiment, ICD coding)
  ✓ Named entity recognition
  ✓ Extractive QA (find span in passage)
  ✓ Sentence similarity and embedding generation
  ✗ Text generation

Decoder-only (GPT, LLaMA, Mistral):
  ✓ Text generation, chat, code completion
  ✓ In-context learning (few-shot)
  ✓ Summarisation and translation (at large scale)
  ✗ Efficient fine-tuning on small supervised datasets (no bidirectional context)

Encoder-decoder (T5, BART, Flan-T5):
  ✓ Structured seq2seq (translation, summarisation)
  ✓ Efficient fine-tuning on small supervised datasets
  ✓ Abstractive QA
  ✗ Open-ended generation at large scale

Q: What is teacher forcing?

During training of a decoder, teacher forcing feeds the ground-truth target tokens as input instead of the model's own predictions:

Generating "The cat sat":

Without teacher forcing:
  Step 1: input=[BOS] → predict "The"  ✓
  Step 2: input=[BOS, The] → predict "dog"  ✗ (error propagates)
  Step 3: input=[BOS, The, dog] → model is off-track

With teacher forcing (training):
  Step 1: input=[BOS] → predict "The"  (compare to "The")
  Step 2: input=[BOS, The] → predict "cat"  (compare to "cat")
  Step 3: input=[BOS, The, cat] → predict "sat"  (compare to "sat")
  Ground truth inputs always used, regardless of model predictions

Teacher forcing makes training stable and fast (no error accumulation) but creates a train/inference mismatch. At inference, the model must use its own (possibly wrong) outputs.
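The contrast can be reproduced with a toy example. Everything here is hypothetical: `TOY_TABLE` stands in for a model that looks only at the last token and makes one mistake ("The" → "dog"), mirroring the walkthrough above.

```python
def teacher_forced_steps(target, bos="[BOS]"):
    """Pair each ground-truth prefix with the token the model should predict next."""
    decoder_input = [bos] + target[:-1]  # targets shifted right by one position
    return [(decoder_input[: t + 1], target[t]) for t in range(len(target))]

def free_running(next_token, steps, bos="[BOS]"):
    """Inference-style decoding: each prediction becomes part of the next input."""
    prefix = [bos]
    for _ in range(steps):
        prefix.append(next_token(prefix))
    return prefix[1:]

# Hypothetical toy "model": a lookup on the last token only.
TOY_TABLE = {"[BOS]": "The", "The": "dog", "cat": "sat", "dog": "ran"}

def toy_next_token(prefix):
    return TOY_TABLE[prefix[-1]]

target = ["The", "cat", "sat"]
forced = [toy_next_token(prefix) for prefix, _ in teacher_forced_steps(target)]
free = free_running(toy_next_token, steps=len(target))

print(forced)  # ['The', 'dog', 'sat']  one error, then back on track
print(free)    # ['The', 'dog', 'ran']  the error feeds into the next step
```

With teacher forcing the step-2 mistake is penalised but does not affect step 3, because step 3 receives the ground-truth "cat" as input.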

Q: What is exposure bias?

Exposure bias is the train/inference mismatch caused by teacher forcing. During training, the model always sees ground truth tokens. During inference, it sees its own generated tokens — which can be wrong. Over long sequences, small errors compound.

Mitigations:

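Scheduled sampling: during training, gradually replace ground-truth input tokens with the model's own predictions
Reinforcement learning (e.g. RLHF): optimise a sequence-level reward on the model's own sampled outputs, so training sees model-generated prefixes
Direct Preference Optimisation (DPO): contrastive training on preferred vs rejected outputs, which targets sequence-level quality rather than per-token likelihood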
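Scheduled sampling, the first mitigation listed above, can be sketched as follows. This is a simplified illustration, not a full training loop: the function names are hypothetical, the linear decay is one possible schedule, and the model's predictions are passed in as a fixed list, whereas in real training each prediction depends on the inputs chosen at earlier steps.

```python
import random

def linear_decay(step, total_steps, floor=0.0):
    """Probability of feeding the ground-truth token; falls from 1.0 towards `floor`."""
    return max(floor, 1.0 - step / total_steps)

def scheduled_sampling_inputs(gold, predicted, p_gold, rng):
    """Per position, feed the gold token with probability p_gold, else the model's token."""
    return [g if rng.random() < p_gold else p for g, p in zip(gold, predicted)]

gold = ["The", "cat", "sat"]
predicted = ["The", "dog", "ran"]  # assumed model outputs, for illustration only
rng = random.Random(0)

early = scheduled_sampling_inputs(gold, predicted, linear_decay(0, 10), rng)
late = scheduled_sampling_inputs(gold, predicted, linear_decay(10, 10), rng)
print(early)  # ['The', 'cat', 'sat']  p_gold = 1.0: pure teacher forcing
print(late)   # ['The', 'dog', 'ran']  p_gold = 0.0: the model's own tokens
```

Intermediate values of `p_gold` mix the two sources, which exposes the model to some of its own mistakes during training.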

Q: What is the bottleneck in encoder-decoder models?

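The encoder's final representation must capture all the information needed for generation. In the early RNN sequence-to-sequence models, the encoder compressed the whole source into a single fixed-size vector, and the decoder generated from that vector alone. For very long inputs, this fixed-size vector becomes a bottleneck.

Cross-attention largely solves this: in a Transformer encoder-decoder, the encoder output is a sequence of vectors (seq_len × d_model) that grows with the input, and the decoder can attend to any encoder position rather than to a single pooled vector. This is why the encoder-decoder with cross-attention at each decoder layer significantly outperforms the sequence-to-vector-to-sequence approach. What remains is that the decoder sees the source only through the encoder's final-layer outputs.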
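A minimal sketch of cross-attention shows why it is not a single-vector bottleneck: each decoder position gets its own weighting over all encoder positions. The code assumes a single head and uses the encoder states directly as keys and values (no learned projections); the function name is made up for illustration.

```python
import math

def cross_attention(decoder_states, encoder_states):
    """Each decoder state (query) takes a softmax-weighted average of the encoder states."""
    d = len(encoder_states[0])
    outputs, all_weights = [], []
    for q in decoder_states:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in encoder_states]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * v[i] for w, v in zip(weights, encoder_states))
                        for i in range(d)])
        all_weights.append(weights)
    return outputs, all_weights

# 5 source positions, 2 target positions, d_model = 2
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.0, 2.0]]
decoder_states = [[2.0, 0.0], [0.0, 2.0]]

out, weights = cross_attention(decoder_states, encoder_states)
print(len(out), len(out[0]), len(weights[0]))  # 2 2 5
```

The weight matrix has one row per decoder position and one column per encoder position, so a longer source adds columns instead of squeezing more information into a fixed-size vector.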

Interview Answer Template

"Encoder blocks use bidirectional self-attention with no masking — every token attends to every other. Decoder blocks use causal self-attention (left-only) plus cross-attention to the encoder output. Encoder-only (BERT) excels at understanding tasks with bidirectional context; decoder-only (GPT/LLaMA) excels at generation at scale; encoder-decoder (T5) excels at structured seq2seq. The choice depends on task structure: classification and NER → encoder-only; open generation → decoder-only; translation/summarisation with supervised fine-tuning → encoder-decoder."