Encoder-Only Models: BERT and RoBERTa — Transformer Architecture Q&A | Learnixo

What Encoder-Only Means

An encoder-only model is a stack of transformer encoder blocks with no decoder. Every token can attend to every other token in both directions — no causal masking.

Input:  "The patient [MASK] Warfarin 5mg"
        ↓ full bidirectional attention ↓
        each token attends left AND right
Output: contextualised representations for every position
        Position 3 output → predicts "takes" (masked token)

The output is a sequence of embeddings (one per token) that capture rich bidirectional context. These embeddings are used for downstream tasks.
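To make "no causal masking" concrete, here is a minimal sketch in plain Python that compares the attention weights at the [MASK] position with and without a causal mask. The function name `attention_weights` and the uniform scores are illustrative assumptions; a real encoder computes scores from learned query and key projections.

```python
import math

def attention_weights(scores, causal=False):
    """Row-wise softmax over raw attention scores.

    causal=False: every position sees every other position (encoder-only).
    causal=True:  position i only sees positions j <= i (decoder-style).
    """
    n = len(scores)
    weights = []
    for i in range(n):
        row = []
        for j in range(n):
            visible = (j <= i) or not causal
            row.append(scores[i][j] if visible else -math.inf)
        m = max(row)  # always finite: a token can see itself
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

tokens = ["The", "patient", "[MASK]", "Warfarin", "5mg"]
# Hypothetical uniform scores, so any difference comes from the mask alone.
scores = [[0.0] * len(tokens) for _ in tokens]

bidirectional = attention_weights(scores, causal=False)
causal = attention_weights(scores, causal=True)

mask_pos = tokens.index("[MASK]")
print(bidirectional[mask_pos])  # non-zero weight on tokens to the left and right
print(causal[mask_pos])         # zero weight on "Warfarin" and "5mg"
```

With the causal mask, the [MASK] position cannot see "Warfarin" or "5mg", which is exactly the right-hand context an encoder-only model is allowed to use.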

Pretraining: Masked Language Modelling (MLM)

BERT is pretrained with two objectives:

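Masked Language Modelling (MLM): Randomly select 15% of tokens and predict the originals from the surrounding context. In BERT, 80% of the selected tokens are replaced with [MASK], 10% with a random token, and 10% are left unchanged.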

Input:   "The [MASK] is prescribed for atrial fibrillation"
Target:  "Warfarin"

Training signal: cross-entropy loss on masked positions only

This forces bidirectional context learning — the model must use both left and right context to predict each masked token.
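The corruption step can be sketched as follows. This is a simplified illustration, not BERT's actual data pipeline: the 80/10/10 replacement split follows the original BERT recipe, while the function name `mask_tokens`, the toy vocabulary, and the fixed seed are assumptions made for the example. A real pipeline works on subword IDs and never selects special tokens such as [CLS] or [SEP].

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted_tokens, labels) for one MLM training example.

    labels[i] holds the original token at selected positions and None
    elsewhere, so the loss is computed on selected positions only.
    """
    rng = rng or random.Random()
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                          # position not selected
        labels[i] = tok                       # model must recover this token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return corrupted, labels

tokens = "The Warfarin is prescribed for atrial fibrillation".split()
vocab = sorted(set(tokens))  # toy vocabulary, for illustration only
corrupted, labels = mask_tokens(tokens, vocab, rng=random.Random(0))
print(corrupted)
print(labels)
```

Positions where `labels` is `None` contribute nothing to the cross-entropy loss, which matches the "masked positions only" training signal.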

Next Sentence Prediction (NSP): Given two sentences, predict whether the second follows the first in the original text. Later work found this objective adds little, and RoBERTa dropped it.
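A rough sketch of how one NSP training example might be assembled. The helper `make_nsp_example`, the sample sentences, and the whitespace tokenisation are hypothetical. Segment IDs are how BERT marks which sentence a token belongs to. BERT draws negative examples from other documents; this toy version samples from the same list for brevity.

```python
import random

def make_nsp_example(sentences, index, rng):
    """Build one NSP example from a list of consecutive sentences.

    About half the time sentence B is the true next sentence (label 1),
    otherwise it is a randomly chosen other sentence (label 0).
    """
    sent_a = sentences[index].split()
    if rng.random() < 0.5:
        sent_b, is_next = sentences[index + 1].split(), 1
    else:
        candidates = [s for i, s in enumerate(sentences)
                      if i not in (index, index + 1)]
        sent_b, is_next = rng.choice(candidates).split(), 0
    tokens = ["[CLS]"] + sent_a + ["[SEP]"] + sent_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and the first [SEP]; segment 1 the rest.
    segment_ids = [0] * (len(sent_a) + 2) + [1] * (len(sent_b) + 1)
    return tokens, segment_ids, is_next

# Toy sentences, for illustration only.
sentences = [
    "The patient takes Warfarin",
    "The dose is 5mg daily",
    "INR is checked weekly",
    "Atrial fibrillation raises stroke risk",
]
tokens, segment_ids, is_next = make_nsp_example(sentences, 0, random.Random(0))
print(tokens)
print(segment_ids)
print(is_next)
```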

Pretraining vs Fine-Tuning

Pretraining (self-supervised, no labels):
  Corpus: English Wikipedia + BooksCorpus (~3.3B words; same corpus for BERT-base and BERT-large)
  Task:   MLM + NSP
  Output: General-purpose contextualised representations

Fine-tuning (supervised, task-specific labels):
  Add a task head on top of the [CLS] token or all tokens
  Train on small labelled dataset

Classification:    [CLS] → linear layer → class probabilities
Token labelling:   each token → linear layer → tag (NER, POS)
Similarity:        [CLS₁] and [CLS₂] concatenated → linear layer → similarity score
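The first two heads can be sketched in plain Python. Everything numeric here is made up for illustration: the 4-dimensional hidden vectors stand in for encoder outputs, and the weights are untrained placeholders that fine-tuning would normally learn. The tag names "O" and "DRUG" are likewise hypothetical.

```python
import math

def linear(x, weight, bias):
    """y = W x + b for one vector x (weight has one row per output unit)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical encoder output: one vector per token, [CLS] first.
hidden = [
    [0.5, -0.1, 0.3, 0.9],   # [CLS]
    [0.2, 0.7, -0.4, 0.1],   # The
    [-0.3, 0.8, 0.6, 0.0],   # patient
    [0.9, 0.1, 0.2, -0.5],   # takes
    [0.4, -0.6, 0.7, 0.3],   # Warfarin
]

# Classification head: only the [CLS] vector feeds the linear layer.
cls_weight = [[0.1, 0.2, -0.3, 0.4], [-0.2, 0.1, 0.5, -0.1]]  # 2 classes
cls_bias = [0.0, 0.0]
class_probs = softmax(linear(hidden[0], cls_weight, cls_bias))

# Token-labelling head: the same linear layer is applied at every position.
tag_names = ["O", "DRUG"]
tag_weight = [[0.3, 0.5, 0.1, -0.2], [0.2, -0.7, 0.6, 0.4]]
tag_bias = [0.0, 0.0]
tags = []
for h in hidden[1:]:  # skip [CLS]
    probs = softmax(linear(h, tag_weight, tag_bias))
    tags.append(tag_names[probs.index(max(probs))])

print(class_probs)  # one probability per class
print(tags)         # one tag per token
```

The encoder is shared across tasks; only the small head on top changes, which is why fine-tuning works with relatively little labelled data.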

BERT Variants

| Model | Params | Key difference |
|-------|--------|----------------|
| BERT-base | 110M | 12 layers, 768 hidden, 12 heads |
| BERT-large | 340M | 24 layers, 1024 hidden, 16 heads |
| RoBERTa-base | 125M | More data, longer training, no NSP |
| DeBERTa | 184M | Disentangled attention (separate position/content) |
| ClinicalBERT | 110M | Further pretrained on clinical notes (MIMIC-III) |
| BioBERT | 110M | Further pretrained on biomedical literature |
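The BERT-base and BERT-large parameter counts in the table can be reproduced from the layer and hidden sizes. The sketch below assumes details the table does not state: a 30,522-token vocabulary (as in the uncased English checkpoints), 512 positions, 2 segment types, a feed-forward width of 4x hidden, and a pooler layer over [CLS]. The function name is illustrative.

```python
def bert_param_count(layers, hidden, vocab_size=30522, max_positions=512,
                     type_vocab=2, ffn_mult=4):
    """Count weights and biases in a BERT-style encoder, including the pooler."""
    layer_norm = 2 * hidden                       # scale + shift
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

    ffn = ffn_mult * hidden
    attention = 4 * (hidden * hidden + hidden)    # Q, K, V and output projections
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    per_layer = attention + feed_forward + 2 * layer_norm

    pooler = hidden * hidden + hidden             # dense layer over [CLS]
    return embeddings + layers * per_layer + pooler

base = bert_param_count(layers=12, hidden=768)
large = bert_param_count(layers=24, hidden=1024)
print(f"BERT-base:  {base:,}")    # 109,482,240
print(f"BERT-large: {large:,}")   # 335,141,888
```

Under these assumptions BERT-base comes to about 110M and BERT-large to about 335M, which is usually quoted as 340M. The number of heads does not appear in the count because the heads split the hidden dimension rather than adding to it.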

When to Use Encoder-Only

Good for:

Text classification (sentiment, intent, medical coding)
Named entity recognition (drug names, patient IDs)
Question answering (extractive — find the span)
Semantic similarity and embedding generation
Information retrieval (bi-encoder for dense retrieval)

Not suitable for:

Text generation (no decoder, can't autoregressively generate)
Abstractive summarisation (generative task)
Translation (generative task)
Open-ended chat

Rule of thumb: If the task is about understanding existing text → encoder-only. If the task requires generating new text → decoder-only or encoder-decoder.

The [CLS] Token

BERT prepends a special [CLS] (classification) token. After the full encoder stack, the [CLS] representation is used as a pooled sentence embedding:

Input: [CLS] The patient takes Warfarin [SEP]
Output: [h_CLS, h_The, h_patient, h_takes, h_Warfarin, h_SEP]
        ↑
        h_CLS is used for classification tasks
        (it attends to all tokens and aggregates sequence-level meaning)

For retrieval and similarity, mean pooling over all token representations often outperforms the [CLS] token; it is the default pooling in many sentence-transformers models.
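The two pooling strategies differ only in how they reduce the per-token outputs to a single vector. A minimal sketch, with made-up 2-dimensional hidden states and an attention mask that marks the last position as padding (function names are illustrative):

```python
def cls_pool(hidden):
    """Use the first position ([CLS]) as the sentence vector."""
    return list(hidden[0])

def mean_pool(hidden, attention_mask):
    """Average token vectors, skipping padding positions (mask value 0)."""
    dim = len(hidden[0])
    kept = [h for h, m in zip(hidden, attention_mask) if m]
    return [sum(h[d] for h in kept) / len(kept) for d in range(dim)]

def cosine(a, b):
    """Cosine similarity, the usual score for comparing sentence embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

# Hypothetical encoder outputs; the last row is a padding position.
hidden = [[1.0, 0.0], [0.0, 2.0], [2.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 1, 0]

cls_vec = cls_pool(hidden)
mean_vec = mean_pool(hidden, attention_mask)
print(cls_vec)                    # [1.0, 0.0]
print(mean_vec)                   # [1.0, 2.0]
print(cosine(cls_vec, mean_vec))  # the two pooled vectors generally differ
```

Respecting the attention mask matters: averaging over padding positions would pull every embedding in a batch toward the padding vector.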

Interview Answer

"Encoder-only models like BERT use bidirectional self-attention — every token attends to all other tokens, giving full context from both directions. They're pretrained with Masked Language Modelling: randomly mask tokens and predict them from surrounding context. The outputs are contextualised embeddings suitable for classification, NER, extractive QA, and semantic similarity. They cannot generate text natively — for generation tasks, use decoder-only or encoder-decoder models."