Transformer Architecture Q&A · Lesson 8 of 23
Encoder-Only Models: BERT and RoBERTa
What Encoder-Only Means
An encoder-only model is a stack of transformer encoder blocks with no decoder. Every token can attend to every other token in both directions — no causal masking.
Input: "The patient [MASK] Warfarin 5mg"
↓ full bidirectional attention ↓
each token attends left AND right
Output: contextualised representations for every position
Position 3 output → predicts "takes" (masked token)
The output is a sequence of embeddings (one per token) that capture rich bidirectional context. These embeddings are used for downstream tasks.
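To make the "no causal masking" point concrete, here is a minimal sketch that builds the visibility grid for the example sentence and contrasts it with a decoder-style causal grid. The helper names `attention_mask` and `visible` are hypothetical, and whitespace tokens stand in for real WordPiece tokens.

```python
def attention_mask(seq_len, causal=False):
    """Return a seq_len x seq_len grid; mask[i][j] is True if token i may attend to token j."""
    return [
        [(j <= i) if causal else True for j in range(seq_len)]
        for i in range(seq_len)
    ]

def visible(tokens, mask, i):
    """List the tokens that position i is allowed to attend to."""
    return [tok for tok, allowed in zip(tokens, mask[i]) if allowed]

# Whitespace tokens for illustration only; a real tokenizer would differ.
tokens = ["The", "patient", "[MASK]", "Warfarin", "5mg"]
mask_pos = tokens.index("[MASK]")

encoder_mask = attention_mask(len(tokens))              # encoder-only: everything visible
causal_mask = attention_mask(len(tokens), causal=True)  # decoder-style, for contrast

print(visible(tokens, encoder_mask, mask_pos))
# ['The', 'patient', '[MASK]', 'Warfarin', '5mg']
print(visible(tokens, causal_mask, mask_pos))
# ['The', 'patient', '[MASK]']
```

With the encoder grid, the masked position sees "Warfarin" and "5mg" to its right, which is the context that makes "takes" predictable. Under the causal grid that right-hand context is hidden.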
Pretraining: Masked Language Modelling (MLM)
BERT is pretrained with two objectives:
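Masked Language Modelling (MLM): Randomly select 15% of tokens and predict the original tokens from the surrounding context. In the original BERT recipe, 80% of the selected tokens are replaced with [MASK], 10% with a random token, and 10% are left unchanged.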
Input: "The [MASK] is prescribed for atrial fibrillation"
Target: "Warfarin"
Training signal: cross-entropy loss on masked positions only
This forces bidirectional context learning: the model must use both left and right context to predict each masked token.
Next Sentence Prediction (NSP): Given two sentences, predict whether they're consecutive. (Later found to be less useful; RoBERTa removed it.)
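The MLM objective can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not BERT's actual data pipeline: each token is selected independently with probability 0.15 (real implementations also skip special tokens), the 80/10/10 split among [MASK], a random token, and the unchanged token follows the original BERT recipe, and the names `mask_for_mlm` and `masked_cross_entropy` are hypothetical.

```python
import math
import random

MASK = "[MASK]"

def mask_for_mlm(tokens, vocab, select_prob=0.15, rng=None):
    """Return (corrupted_tokens, labels); labels[i] is None where no loss is taken."""
    rng = rng or random.Random()
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= select_prob:
            continue                          # not selected: no prediction target
        labels[i] = tok                       # the model must recover the original
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return corrupted, labels

def masked_cross_entropy(probs, labels):
    """Average -log p(original token) over selected positions only.

    probs[i] maps candidate tokens to predicted probabilities at position i.
    """
    losses = [-math.log(probs[i][lab]) for i, lab in enumerate(labels) if lab is not None]
    return sum(losses) / len(losses) if losses else 0.0

tokens = "The Warfarin is prescribed for atrial fibrillation".split()
vocab = ["aspirin", "patient", "daily", "dose"]  # toy vocabulary, illustration only
corrupted, labels = mask_for_mlm(tokens, vocab, rng=random.Random(0))
print(corrupted)
print(labels)
```

Positions whose label is `None` contribute nothing to the loss, which is what "masked positions only" means in practice.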
Pretraining vs Fine-Tuning
Pretraining (self-supervised, no labels):
Corpus: English Wikipedia + BooksCorpus (~3.3B words, used for both BERT-base and BERT-large)
Task: MLM + NSP
Output: General-purpose contextualised representations
Fine-tuning (supervised, task-specific labels):
Add a task head on top of the [CLS] token or all tokens
Train on small labelled dataset
Classification: [CLS] → linear layer → class probabilities
Token labelling: each token → linear layer → tag (NER, POS)
Similarity: [CLS₁] + [CLS₂] → linear → similarity score
BERT Variants
| Model | Params | Key difference |
|-------|--------|----------------|
| BERT-base | 110M | 12 layers, 768 hidden, 12 heads |
| BERT-large | 340M | 24 layers, 1024 hidden, 16 heads |
| RoBERTa (base) | 125M | More data, longer training, no NSP |
| DeBERTa (v3-base) | 184M | Disentangled attention (separate position/content) |
| ClinicalBERT | 110M | Further pretrained on clinical notes (MIMIC-III) |
| BioBERT | 110M | Further pretrained on biomedical literature (PubMed, PMC) |
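The parameter counts for the two BERT rows can be approximated from the layer and hidden sizes in the table. The sketch below assumes values that the table does not state: a 30,522-token vocabulary, 512 positions, 2 segment types, a feed-forward width of 4 × hidden, and a pooler layer on top. These match the commonly distributed uncased BERT checkpoints, but treat them as assumptions. The function name `bert_param_count` is hypothetical.

```python
def dense(n_in, n_out):
    """Parameters in a linear layer: weight matrix plus bias."""
    return n_in * n_out + n_out

def bert_param_count(layers, hidden, vocab_size=30522, max_positions=512,
                     type_vocab=2, ffn_mult=4, include_pooler=True):
    layer_norm = 2 * hidden  # scale and bias
    # Token, position, and segment embeddings, followed by one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value, and output projections, followed by one LayerNorm.
    attention = 4 * dense(hidden, hidden) + layer_norm
    # Two-layer feed-forward network, followed by one LayerNorm.
    ffn = dense(hidden, ffn_mult * hidden) + dense(ffn_mult * hidden, hidden) + layer_norm
    pooler = dense(hidden, hidden) if include_pooler else 0
    return embeddings + layers * (attention + ffn) + pooler

print(f"{bert_param_count(12, 768):,}")   # 109,482,240  (quoted as 110M)
print(f"{bert_param_count(24, 1024):,}")  # 335,141,888  (commonly rounded to 340M)
```

The number of attention heads does not appear in the calculation because splitting the hidden dimension across heads leaves the projection matrices the same size.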
When to Use Encoder-Only
Good for:
- Text classification (sentiment, intent, medical coding)
- Named entity recognition (drug names, patient IDs)
- Question answering (extractive — find the span)
- Semantic similarity and embedding generation
- Information retrieval (bi-encoder for dense retrieval)
Not suitable for:
- Text generation (no decoder, can't autoregressively generate)
- Abstractive summarisation (generative task)
- Translation (generative task)
- Open-ended chat
Rule of thumb: If the task is about understanding existing text → encoder-only. If the task requires generating new text → decoder-only or encoder-decoder.
The [CLS] Token
BERT prepends a special [CLS] (classification) token. After the full encoder stack, the [CLS] representation is used as a pooled sentence embedding:
Input: [CLS] The patient takes Warfarin [SEP]
Output: [h_CLS, h_The, h_patient, h_takes, h_Warfarin, h_SEP]
↑
h_CLS is used for classification tasks
(it attends to all tokens and aggregates sequence-level meaning)
For retrieval and similarity, mean pooling over all token representations often outperforms the [CLS] token. Mean pooling is the default in many sentence-transformers models.
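The two pooling strategies differ only in how they reduce the per-token outputs to one vector. Here is a minimal sketch in plain Python, using made-up two-dimensional vectors in place of real encoder outputs. The function names are hypothetical, and the attention mask marks padding positions with 0 so they are excluded from the mean.

```python
import math

def cls_pool(hidden_states):
    """Use the first position ([CLS]) as the sequence embedding."""
    return hidden_states[0]

def mean_pool(hidden_states, attention_mask):
    """Average the token vectors, ignoring padding positions (mask == 0)."""
    kept = [h for h, m in zip(hidden_states, attention_mask) if m]
    dim = len(hidden_states[0])
    return [sum(h[d] for h in kept) / len(kept) for d in range(dim)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy outputs: [CLS], two tokens, and one padding position (illustrative values).
hidden_states = [[1.0, 0.0], [0.0, 2.0], [2.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 1, 0]

print(cls_pool(hidden_states))                   # [1.0, 0.0]
print(mean_pool(hidden_states, attention_mask))  # [1.0, 2.0]
```

For similarity, two sequences are pooled the same way and compared with `cosine`. Which pooling works better is an empirical question for the model and task at hand.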
Interview Answer
"Encoder-only models like BERT use bidirectional self-attention — every token attends to all other tokens, giving full context from both directions. They're pretrained with Masked Language Modelling: randomly mask tokens and predict them from surrounding context. The outputs are contextualised embeddings suitable for classification, NER, extractive QA, and semantic similarity. They cannot generate text natively — for generation tasks, use decoder-only or encoder-decoder models."