Interview Q&A: Encoder, Decoder, and Architecture Variants
Common interview questions on encoder vs decoder blocks, encoder-only vs decoder-only vs encoder-decoder models, and when to choose each architecture.
Q: What's the difference between an encoder and decoder block?
| Property | Encoder Block | Decoder Block |
|----------|--------------|---------------|
| Self-attention | Bidirectional (no mask) | Causal (left-only mask) |
| Cross-attention | No | Yes (Q from decoder, K/V from encoder) |
| Sub-layers | 2 | 3 |
| Context | Full sequence | Past tokens only |
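The decoder's causal mask and cross-attention layer are what distinguish it from the encoder. Decoder-only models such as GPT keep the causal mask but drop cross-attention, since there is no encoder output to attend to.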
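To make the masking difference concrete, here is a minimal sketch in plain Python. The helper names `attention_mask` and `show` are hypothetical and chosen for illustration; real frameworks build these masks as tensors.

```python
def attention_mask(seq_len: int, causal: bool) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j."""
    return [[(j <= i) if causal else True for j in range(seq_len)]
            for i in range(seq_len)]


def show(mask: list[list[bool]]) -> str:
    """Render a mask as rows of 1 (visible) and 0 (hidden)."""
    return "\n".join(" ".join("1" if ok else "0" for ok in row) for row in mask)


encoder_mask = attention_mask(4, causal=False)  # every token sees every token
decoder_mask = attention_mask(4, causal=True)   # token i sees tokens 0..i only

print(show(encoder_mask))
print()
print(show(decoder_mask))
```

The encoder mask prints as a full block of ones, while the decoder mask is lower-triangular: row i has ones only in columns 0 through i.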
Q: Why can't BERT generate text?
BERT is encoder-only with bidirectional attention. To generate the token at position i autoregressively, a model must predict it from the positions before i only. BERT's attention gives every position access to all other positions, including i+1, i+2, and so on, so it has no built-in mechanism for producing tokens left to right.
Also, BERT is pretrained with MLM (predict masked tokens from context), not with next-token prediction. Its output at each position is a probability distribution over the full vocabulary conditioned on all context — not the left context only.
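The two pretraining objectives can be compared by looking at the training examples each one produces. The sketch below is a simplification under stated assumptions: the function names are hypothetical, the masked positions are passed in by hand, and real BERT pretraining selects positions at random and does not always substitute the mask token.

```python
MASK = "[MASK]"


def mlm_example(tokens, masked_positions):
    """BERT-style: hide some tokens and predict them from context on both sides."""
    inputs = [MASK if i in masked_positions else tok for i, tok in enumerate(tokens)]
    targets = {i: tokens[i] for i in sorted(masked_positions)}
    return inputs, targets


def next_token_examples(tokens):
    """GPT-style: predict token i from tokens 0..i-1 only."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


tokens = ["the", "cat", "sat", "down"]
print(mlm_example(tokens, {1}))
# (['the', '[MASK]', 'sat', 'down'], {1: 'cat'})
print(next_token_examples(tokens))
# [(['the'], 'cat'), (['the', 'cat'], 'sat'), (['the', 'cat', 'sat'], 'down')]
```

In the masked example the model sees "sat down" to the right of the gap. The next-token examples never include anything to the right of the target.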
Q: Why can't GPT do bidirectional understanding as well as BERT?
GPT's causal mask prevents position i from seeing positions i+1, i+2, ... This means GPT never builds representations using right context. Tasks like NER (tagging a token based on full sentence context) or sentence similarity (comparing two full representations) benefit from bidirectional context that GPT structurally lacks.
In practice, large decoder-only models (GPT-3 and later) partially compensate through scale and in-context learning: the whole input sits in the prompt, so by the time the model writes its answer it has attended to every input token. Architecturally, though, BERT-style bidirectionality is the better fit for classification and NER.
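The structural point can be checked with a toy single-head self-attention. This is a sketch under assumptions: `attend` is a hypothetical helper that uses identity projections (queries, keys, and values are the raw vectors), which no trained model does, but the masking behaviour is the same.

```python
import math


def attend(vectors, causal):
    """Toy dot-product self-attention; causal=True hides positions after i."""
    dim = len(vectors[0])
    outputs = []
    for i, query in enumerate(vectors):
        visible = vectors[: i + 1] if causal else vectors
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in visible]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w / total * value[d] for w, value in zip(weights, visible))
            for d in range(dim)
        ])
    return outputs


sentence_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sentence_b = [[1.0, 0.0], [0.0, 1.0], [-2.0, 3.0]]  # only the last token differs

# Causal: position 0 never sees the last token, so its output is unchanged.
print(attend(sentence_a, causal=True)[0] == attend(sentence_b, causal=True)[0])    # True
# Bidirectional: position 0 attends to the last token, so its output changes.
print(attend(sentence_a, causal=False)[0] == attend(sentence_b, causal=False)[0])  # False
```

Under the causal mask, the representation of an early token cannot depend on what comes later, which is exactly what a tagger or sentence encoder would want it to do.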
Q: When would you choose encoder-only vs decoder-only vs encoder-decoder?
Encoder-only (BERT, RoBERTa, ClinicalBERT):
✓ Text classification (sentiment, ICD coding)
✓ Named entity recognition
✓ Extractive QA (find span in passage)
✓ Sentence similarity and embedding generation
✗ Text generation
Decoder-only (GPT, LLaMA, Mistral):
✓ Text generation, chat, code completion
✓ In-context learning (few-shot)
✓ Summarisation and translation (at large scale)
✗ Efficient fine-tuning on small supervised datasets (no bidirectional context)
Encoder-decoder (T5, BART, Flan-T5):
✓ Structured seq2seq (translation, summarisation)
✓ Efficient fine-tuning on small supervised datasets
✓ Abstractive QA
✗ Open-ended generation at large scale
Q: What is teacher forcing?
During training of a decoder, teacher forcing feeds the ground-truth target tokens as input instead of the model's own predictions:
Generating "The cat sat":
Without teacher forcing:
Step 1: input=[BOS] → predict "The" ✓
Step 2: input=[BOS, The] → predict "dog" ✗ (error propagates)
Step 3: input=[BOS, The, dog] → model is off-track
With teacher forcing (training):
Step 1: input=[BOS] → predict "The" (compare to "The")
Step 2: input=[BOS, The] → predict "cat" (compare to "cat")
Step 3: input=[BOS, The, cat] → predict "sat" (compare to "sat")
Ground truth inputs always used, regardless of model predictions
Teacher forcing makes training stable and fast (no error accumulation) but creates a train/inference mismatch. At inference, the model must use its own (possibly wrong) outputs.
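The contrast between the two modes can be reproduced with a stand-in for the decoder. In the sketch below, `toy_model` is a hypothetical lookup table, not a trained network, and it is written to get the second word wrong so the effect is visible.

```python
BOS = "[BOS]"


def toy_model(prefix):
    """Hypothetical decoder stand-in: maps a prefix to its next-token guess."""
    table = {
        (BOS,): "The",
        (BOS, "The"): "dog",            # wrong: the target says "cat"
        (BOS, "The", "cat"): "sat",
        (BOS, "The", "dog"): "barked",
    }
    return table.get(tuple(prefix), "[UNK]")


def teacher_forced(target):
    """Inputs are always ground-truth prefixes; returns (prediction, label) pairs."""
    return [(toy_model([BOS] + target[:i]), target[i]) for i in range(len(target))]


def free_running(steps):
    """Inputs are the model's own earlier predictions, as at inference time."""
    prefix = [BOS]
    for _ in range(steps):
        prefix.append(toy_model(prefix))
    return prefix[1:]


target = ["The", "cat", "sat"]
print(teacher_forced(target))  # [('The', 'The'), ('dog', 'cat'), ('sat', 'sat')]
print(free_running(3))         # ['The', 'dog', 'barked']
```

With teacher forcing the mistake at step 2 costs one wrong prediction and step 3 still gets a correct prefix. In free-running mode the same mistake changes every later input.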
Q: What is exposure bias?
Exposure bias is the train/inference mismatch caused by teacher forcing. During training, the model always sees ground truth tokens. During inference, it sees its own generated tokens — which can be wrong. Over long sequences, small errors compound.
Mitigations:
- Scheduled sampling: gradually replace ground truth with model predictions during training
- Reinforcement learning (RLHF): train directly on generation quality
- Direct Preference Optimisation (DPO): contrastive training on preferred vs rejected outputs
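The first mitigation, scheduled sampling, can be sketched as a per-token coin flip between the ground-truth token and the model's prediction. This is a simplified illustration: the names `scheduled_inputs` and `linear_decay` are hypothetical, the linear schedule is one assumed choice among several, and the predictions are passed in as a fixed list, whereas a real training loop would produce them step by step.

```python
import random


def scheduled_inputs(target, predictions, truth_prob, rng):
    """Pick each decoder input: ground truth with probability truth_prob,
    otherwise the model's own prediction for that position."""
    return [
        gold if rng.random() < truth_prob else guess
        for gold, guess in zip(target, predictions)
    ]


def linear_decay(step, total_steps, floor=0.0):
    """Assumed schedule: truth_prob falls linearly from 1.0 down to floor."""
    return max(floor, 1.0 - step / total_steps)


rng = random.Random(0)  # seeded so runs are repeatable
target = ["The", "cat", "sat"]
predictions = ["The", "dog", "ran"]
for step in (0, 50, 100):
    truth_prob = linear_decay(step, total_steps=100)
    print(step, truth_prob, scheduled_inputs(target, predictions, truth_prob, rng))
```

At step 0 the inputs are pure teacher forcing, and at step 100 they are entirely the model's own predictions, so the model is gradually exposed to its own errors during training.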
Q: What is the bottleneck in encoder-decoder models?
The encoder's output must capture all information needed for generation, because it is the decoder's only view of the source. In a sequence-to-vector-to-sequence model that output is a single fixed-size vector, so the entire source has to be compressed into it, and for very long inputs this becomes a bottleneck. A Transformer encoder instead outputs one d_model-sized vector per source token (seq_len × d_model in total), so the representation grows with the input.
Cross-attention is what lets the decoder use that per-token output: at every step it can attend to any encoder position, not just a single pooled vector. This is why the original encoder-decoder (with cross-attention at each decoder layer) significantly outperforms the sequence-to-vector-to-sequence approach.
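A small sketch shows the difference between attending over encoder positions and reading one pooled summary. It assumes identity projections and hand-picked two-dimensional encoder states, and the helper names are hypothetical.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, encoder_states):
    """One decoder query attends over every encoder position."""
    scale = math.sqrt(len(query))
    weights = softmax([sum(q * k for q, k in zip(query, state)) / scale
                       for state in encoder_states])
    context = [sum(w * state[d] for w, state in zip(weights, encoder_states))
               for d in range(len(query))]
    return weights, context


def mean_pool(encoder_states):
    """Sequence-to-vector baseline: one summary, identical at every decoder step."""
    count = len(encoder_states)
    return [sum(state[d] for state in encoder_states) / count
            for d in range(len(encoder_states[0]))]


encoder_states = [[4.0, 0.0], [0.0, 4.0], [2.0, 2.0]]
for query in ([1.0, 0.0], [0.0, 1.0]):
    weights, context = cross_attention(query, encoder_states)
    print([round(w, 2) for w in weights], [round(c, 2) for c in context])
print(mean_pool(encoder_states))  # [2.0, 2.0]
```

The two queries put most of their weight on different source positions and receive different context vectors, while the pooled baseline hands every decoder step the same summary.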
Interview Answer Template
"Encoder blocks use bidirectional self-attention with no masking — every token attends to every other. Decoder blocks use causal self-attention (left-only) plus cross-attention to the encoder output. Encoder-only (BERT) excels at understanding tasks with bidirectional context; decoder-only (GPT/LLaMA) excels at generation at scale; encoder-decoder (T5) excels at structured seq2seq. The choice depends on task structure: classification and NER → encoder-only; open generation → decoder-only; translation/summarisation with supervised fine-tuning → encoder-decoder."