BERT vs GPT: Encoder vs Decoder Architectures
Compare BERT's bidirectional encoder and GPT's causal decoder. Understand masked language modeling vs next-token prediction, and which architecture fits which task.
The Core Architectural Split
Both BERT and GPT are transformers, but they differ in one critical way: which tokens each token can attend to.
BERT (encoder): Every token attends to every other token — bidirectional attention.
Token 5 can see: tokens 1, 2, 3, 4, [5], 6, 7, 8, 9
GPT (decoder): Each token can attend only to itself and the tokens before it: causal (left-to-right) attention.
Token 5 can see: tokens 1, 2, 3, 4, [5]
Token 5 CANNOT see: tokens 6, 7, 8, 9
This single difference drives everything else: the training objective, the tasks each architecture is good at, and how they're used.
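The visibility rule described above can be sketched without any deep learning library. This is a minimal illustration, not how either model is implemented: the helper names `visible_positions` and `attention_mask_rows` are hypothetical, and positions are 1-indexed to match the token numbering used in the example.

```python
def visible_positions(position: int, seq_len: int, causal: bool) -> list[int]:
    """Return the 1-indexed positions that `position` may attend to."""
    if not 1 <= position <= seq_len:
        raise ValueError("position must be between 1 and seq_len")
    # Causal attention stops at the current position; bidirectional sees everything.
    last = position if causal else seq_len
    return list(range(1, last + 1))


def attention_mask_rows(seq_len: int, causal: bool) -> list[str]:
    """Render the visibility matrix: 'x' = may attend, '.' = blocked."""
    rows = []
    for query in range(1, seq_len + 1):
        allowed = set(visible_positions(query, seq_len, causal))
        rows.append("".join("x" if key in allowed else "." for key in range(1, seq_len + 1)))
    return rows


if __name__ == "__main__":
    print("encoder:", visible_positions(5, 9, causal=False))
    print("decoder:", visible_positions(5, 9, causal=True))
    print("\n".join(attention_mask_rows(5, causal=True)))
```

Printing the causal matrix gives a lower-triangular pattern, which is the shape the attention mask takes in a decoder.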
Attention Masks
The causal mask is the only difference between the two attention functions below: it blocks each position from attending to later positions.
import torch
import torch.nn.functional as F


def bidirectional_attention(q, k, v):
    """BERT-style: all tokens can attend to all tokens."""
    scale = q.shape[-1] ** -0.5
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale
    # No mask: full attention matrix
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, v)


def causal_attention(q, k, v):
    """GPT-style: each token attends only to itself and earlier tokens."""
    seq_len = q.shape[-2]
    scale = q.shape[-1] ** -0.5
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale
    # Causal mask: upper triangle (future positions) = -inf → softmax → 0
    # Built on q's device so the function also works for GPU tensors
    mask = torch.triu(torch.ones(seq_len, seq_len, device=q.device), diagonal=1).bool()
    scores = scores.masked_fill(mask, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, v)

Training Objectives
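As a rough preview of the two objectives covered in this section, one practical difference is how many positions in a sequence contribute to the loss. The sketch below is simple arithmetic, assuming the 15% selection rate and 80/10/10 replacement split used for MLM in this section; the function names and the 512-token sequence length are illustrative assumptions.

```python
def mlm_expected_counts(n_tokens: int, mask_prob: float = 0.15) -> dict[str, float]:
    """Expected number of tokens per MLM treatment under an 80/10/10 split."""
    selected = n_tokens * mask_prob
    return {
        "selected_for_loss": selected,
        "replaced_with_mask": selected * 0.80,
        "replaced_with_random": selected * 0.10,
        "left_unchanged": selected * 0.10,
    }


def causal_lm_supervised_positions(n_tokens: int) -> int:
    """Every token except the first is the target of one next-token prediction."""
    return max(n_tokens - 1, 0)


if __name__ == "__main__":
    for name, count in mlm_expected_counts(512).items():
        print(f"MLM {name}: {count:.2f}")
    print("Causal LM supervised positions:", causal_lm_supervised_positions(512))
```

Under these assumptions, a causal LM gets a training signal from nearly every position, while MLM gets one from roughly one position in seven.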
BERT: Masked Language Modeling (MLM)
Randomly select 15% of input tokens for prediction, replace most of them with [MASK], and predict the original tokens from bidirectional context:
import random


def create_mlm_batch(token_ids: list[int], mask_token_id: int, vocab_size: int, mask_prob: float = 0.15):
    """Create masked inputs and labels for MLM training."""
    inputs = token_ids.copy()
    labels = [-100] * len(token_ids)  # -100 = ignore in loss
    for i, token in enumerate(token_ids):
        if random.random() < mask_prob:
            labels[i] = token  # What we want to predict
            r = random.random()
            if r < 0.80:
                inputs[i] = mask_token_id  # 80%: replace with [MASK]
            elif r < 0.90:
                inputs[i] = random.randint(0, vocab_size - 1)  # 10%: random token
            # else: 10%: keep original (still predicted in loss)
    return inputs, labels


# Training loss: predict masked tokens using full bidirectional context
# "The patient was prescribed [MASK] for anticoagulation"
# Model sees all tokens including those after [MASK] → bidirectional

GPT: Next-Token Prediction (Causal LM)
Predict every next token, given only past tokens:
def create_causal_lm_batch(token_ids: list[int]):
    """Create inputs and labels for causal LM training."""
    # For a sequence of N tokens (indices 0 to N-1):
    # Input: tokens 0 to N-2
    # Label: tokens 1 to N-1 (the same sequence, one step ahead)
    inputs = token_ids[:-1]
    labels = token_ids[1:]
    return inputs, labels


# "The patient was prescribed warfarin for anticoagulation"
# Position 0: predict "patient" given ["The"]
# Position 1: predict "was" given ["The", "patient"]
# Position 3: predict "warfarin" given ["The", "patient", "was", "prescribed"]
# No bidirectional context: the model can't see future tokens

Task Suitability
| Task | BERT-style encoder | GPT-style decoder | Why |
|---|---|---|---|
| Text classification | Excellent | Possible but worse | Bidirectional context → richer [CLS] representation |
| Named entity recognition | Excellent | Possible | Each token has full context from both sides |
| Question answering (extractive) | Excellent | Possible | Seeing the full context before answering a span |
| Semantic similarity | Excellent | Adequate | Can compare two sequences with both as context |
| Text generation | Cannot (natively) | Excellent | Causal structure enables autoregressive generation |
| Code completion | No | Excellent | Prediction of next tokens is exactly the task |
| Summarization | Encoder in seq2seq | Possible with prompting | Generation requires left-to-right prediction |
| Translation | Encoder in seq2seq | Possible with prompting | Same as summarization |
| Zero-shot tasks | Limited | Excellent | Prompting requires generation ability |
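The "Text generation" row rests on the autoregressive loop: predict one token, append it, and predict again from the longer sequence. The following is a minimal sketch of greedy decoding, assuming a stand-in scoring function instead of a real model; `greedy_generate` and `toy_scores` are hypothetical names, and `toy_scores` is a toy rule rather than a trained network.

```python
from typing import Callable, Sequence


def greedy_generate(
    next_token_scores: Callable[[Sequence[int]], Sequence[float]],
    prompt: Sequence[int],
    max_new_tokens: int,
    eos_token_id: int,
) -> list[int]:
    """Repeatedly append the highest-scoring next token until EOS or the length limit."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        scores = next_token_scores(tokens)  # the model only ever sees past tokens
        next_token = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(next_token)
        if next_token == eos_token_id:
            break
    return tokens


def toy_scores(tokens: Sequence[int]) -> list[float]:
    """Hypothetical stand-in for a model: favours (last token + 1) mod 5."""
    vocab_size = 5
    favoured = (tokens[-1] + 1) % vocab_size
    return [1.0 if token == favoured else 0.0 for token in range(vocab_size)]


if __name__ == "__main__":
    print(greedy_generate(toy_scores, prompt=[0], max_new_tokens=10, eos_token_id=4))
```

A causal model fits this loop naturally because the prediction at each step depends only on tokens already produced, which is the same condition it was trained under.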
Encoder-Decoder (T5, BART)
A third architecture combines both:
- Encoder: bidirectional; reads and contextualizes the input
- Decoder: causal; generates the output token by token
- Cross-attention: the decoder attends to the encoder output at each step

# T5/BART-style: encoder reads input, decoder generates output
# Input text → encoder → context representations → decoder → output text
# Used for:
# - Summarization: input=article, output=summary
# - Translation: input=English, output=French
# - Question answering (generative): input=question+context, output=answer

T5 treats every NLP task as text-to-text. BART adds noise to the input (masking, shuffling) and trains the decoder to reconstruct the original, which is useful for document denoising and summarization.
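Both ideas reduce to building (input text, target text) pairs, which can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the preprocessing either model actually uses: `to_text_to_text` and `bart_style_noise` are hypothetical helpers, the per-word masking is a simplification of BART's noising, and the `<mask>` string and 30% rate are placeholder choices.

```python
import random


def to_text_to_text(task_prefix: str, source: str, target: str) -> dict[str, str]:
    """T5-style framing: the task is named in the input text itself."""
    return {"input": f"{task_prefix}: {source}", "target": target}


def bart_style_noise(
    sentences: list[str],
    mask_token: str = "<mask>",
    mask_prob: float = 0.3,
    seed: int = 0,
) -> dict[str, str]:
    """Corrupt the input by shuffling sentences and masking words; the target is the clean text."""
    rng = random.Random(seed)
    shuffled = sentences.copy()
    rng.shuffle(shuffled)
    noisy_sentences = []
    for sentence in shuffled:
        words = [mask_token if rng.random() < mask_prob else word for word in sentence.split()]
        noisy_sentences.append(" ".join(words))
    return {"input": " ".join(noisy_sentences), "target": " ".join(sentences)}


if __name__ == "__main__":
    print(to_text_to_text("summarize", "Warfarin is an anticoagulant that ...", "Warfarin thins blood."))
    print(bart_style_noise(["Warfarin is an anticoagulant.", "It requires monitoring."]))
```

In both cases the encoder reads the `input` string with full bidirectional attention, and the decoder is trained to produce the `target` string one token at a time.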
Practical: Using BERT for Classification
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# BERT-based drug interaction classifier
# Note: the classification head is newly initialized on top of BioBERT, so the model
# must be fine-tuned on labeled interaction data before its predictions are meaningful.
tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.2")
model = AutoModelForSequenceClassification.from_pretrained(
    "dmis-lab/biobert-base-cased-v1.2",
    num_labels=3,  # minor, moderate, major (same order as `labels` below)
)


def classify_interaction(drug_a: str, drug_b: str, context: str) -> dict:
    text = f"Drug A: {drug_a}. Drug B: {drug_b}. Context: {context}"
    inputs = tokenizer(
        text,
        return_tensors="pt",
        max_length=512,
        truncation=True,
        padding=True,
    )
    with torch.no_grad():
        outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)[0]
    # Index order must match the label ids used during fine-tuning
    labels = ["minor", "moderate", "major"]
    predicted_label = labels[probs.argmax().item()]
    return {
        "prediction": predicted_label,
        "confidence": probs.max().item(),
        "probabilities": {label: prob.item() for label, prob in zip(labels, probs)},
    }

Practical: Using GPT for Generation
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# GPT-style drug information generation
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")


def generate_drug_summary(drug_name: str, max_new_tokens: int = 300) -> str:
    messages = [
        {"role": "system", "content": "You are a clinical pharmacology expert."},
        {"role": "user", "content": f"Summarize the key clinical information for {drug_name}."},
    ]
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,  # return input_ids and attention_mask together
    ).to(model.device)  # move the prompt to the same device as the model
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # Greedy decoding for factual tasks
            temperature=1.0,  # Has no effect when do_sample=False
        )
    # Decode only the generated tokens (not the prompt)
    prompt_length = inputs["input_ids"].shape[-1]
    generated_tokens = outputs[0][prompt_length:]
    return tokenizer.decode(generated_tokens, skip_special_tokens=True)

Modern Convergence
The strict encoder/decoder distinction has blurred:
- GPT models with longer contexts can now do many classification tasks via prompting ("classify this as X or Y")
- Instruction tuning makes decoder models behave like structured responders, covering most tasks BERT was used for
- The trend is toward decoder-only models: LLaMA and Mistral are decoder-only, and GPT-4 and Claude are generally understood to be as well
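Classification via prompting, as described in the first bullet, comes down to two steps: phrase the task as text, then map the free-form completion back to a known label. A minimal sketch follows, assuming a hypothetical `generate` callable that wraps whatever decoder model is in use; the prompt wording and helper names are illustrative, not a recommended template.

```python
from typing import Callable, Optional


def build_classification_prompt(text: str, labels: list[str]) -> str:
    """Phrase a classification task as a text completion."""
    options = ", ".join(labels)
    return (
        f"Classify the following text as one of: {options}.\n"
        f"Text: {text}\n"
        "Answer with the label only.\n"
        "Label:"
    )


def parse_label(completion: str, labels: list[str]) -> Optional[str]:
    """Map model output back to a known label, or None if nothing matches."""
    normalized = completion.strip().lower()
    for label in labels:
        if normalized.startswith(label.lower()):
            return label
    return None


def classify_with_prompt(text: str, labels: list[str], generate: Callable[[str], str]) -> Optional[str]:
    """`generate` is a hypothetical wrapper: prompt string in, completion string out."""
    return parse_label(generate(build_classification_prompt(text, labels)), labels)


if __name__ == "__main__":
    fake_generate = lambda prompt: " Major\n"  # stands in for a real model call
    print(classify_with_prompt("Warfarin with aspirin", ["minor", "moderate", "major"], fake_generate))
```

The `None` branch matters in practice: unlike a classification head, a generative model is not constrained to the label set, so unparseable output has to be handled explicitly.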
Encoder models (BERT, RoBERTa, DeBERTa) remain dominant for:
- Embedding models (used in RAG systems)
- Fast classification where latency matters
- Tasks requiring cross-sentence comparison (NLI, semantic similarity)
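For the embedding use case, a common pattern is to pool an encoder's per-token vectors into one sentence vector and compare vectors with cosine similarity. The sketch below uses plain Python lists in place of real model output, so the numbers are made up; mean pooling is one common choice among several (taking the [CLS] vector is another), and the function names are hypothetical.

```python
import math


def mean_pool(token_vectors: list[list[float]], attention_mask: list[int]) -> list[float]:
    """Average token vectors, skipping padding positions (mask == 0)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask excludes every token")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    # Made-up 2-dimensional "token embeddings"; the last token of each is padding.
    query = mean_pool([[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]], [1, 1, 0])
    document = mean_pool([[2.0, 1.0], [2.0, 1.0], [0.0, 0.0]], [1, 1, 0])
    print(cosine_similarity(query, document))
```

Because each document is encoded once and compared with cheap vector arithmetic, this setup suits retrieval over large collections.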
For new projects that need text generation or instruction following, choose a decoder-only GPT-style model. For embedding-based retrieval or low-latency classification, choose a BERT-style encoder.