BERT vs GPT: Encoder vs Decoder Architectures

The Core Architectural Split

Both BERT and GPT are transformers, but they differ in one critical way: which tokens each token can attend to.

BERT (encoder): Every token attends to every other token — bidirectional attention.

Token 5 can see: tokens 1, 2, 3, 4, [5], 6, 7, 8, 9

GPT (decoder): Each token can only attend to tokens that came before it — causal (left-to-right) attention.

Token 5 can see: tokens 1, 2, 3, 4, [5]
Token 5 CANNOT see: tokens 6, 7, 8, 9

This single difference drives everything else: the training objective, the tasks each architecture is good at, and how they're used.
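As a minimal sketch of the visibility rule above, the helper below lists which positions a token may attend to under each scheme. The name `visible_positions` is hypothetical and exists only for illustration; positions are 1-indexed to match the example.

```python
def visible_positions(seq_len: int, position: int, causal: bool) -> list[int]:
    """Return the 1-indexed positions that `position` may attend to."""
    # Causal: stop at the token itself. Bidirectional: the whole sequence.
    last = position if causal else seq_len
    return list(range(1, last + 1))


print(visible_positions(9, 5, causal=False))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(visible_positions(9, 5, causal=True))   # [1, 2, 3, 4, 5]
```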

Attention Masks

A causal mask is what turns standard (bidirectional) self-attention into GPT-style attention; the rest of the computation is identical:

Python

import torch
import torch.nn.functional as F

def bidirectional_attention(q, k, v):
    """BERT-style: all tokens can attend to all tokens."""
    scale = q.shape[-1] ** -0.5
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale
    # No mask — full attention matrix
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, v)

def causal_attention(q, k, v):
    """GPT-style: each token attends only to itself and earlier tokens."""
    seq_len = q.shape[-2]
    scale = q.shape[-1] ** -0.5
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale

    # Causal mask: True above the diagonal marks future positions.
    # Filling those scores with -inf makes softmax give them zero weight.
    # Build the mask on the same device as q so this also works on GPU.
    mask = torch.triu(torch.ones(seq_len, seq_len, device=q.device), diagonal=1).bool()
    scores = scores.masked_fill(mask, float("-inf"))

    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, v)
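To see what the mask does to the attention weights, here is a small worked sketch in plain Python (no PyTorch required). The helpers `softmax` and `causal_weights` are illustrative names, not library functions. With all scores equal, each row spreads its weight evenly over the visible prefix and gives exactly zero to future positions.

```python
import math


def softmax(row: list[float]) -> list[float]:
    """Numerically stable softmax; exp(-inf) evaluates to 0.0."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_weights(scores: list[list[float]]) -> list[list[float]]:
    """Mask positions j > i with -inf, then apply softmax row by row."""
    n = len(scores)
    masked = [
        [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        for i in range(n)
    ]
    return [softmax(row) for row in masked]


weights = causal_weights([[0.0] * 4 for _ in range(4)])
for row in weights:
    print([round(w, 3) for w in row])
# [1.0, 0.0, 0.0, 0.0]
# [0.5, 0.5, 0.0, 0.0]
# [0.333, 0.333, 0.333, 0.0]
# [0.25, 0.25, 0.25, 0.25]
```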

Training Objectives

BERT: Masked Language Modeling (MLM)

Randomly select 15% of input tokens as prediction targets, replace most of them with [MASK], and predict the originals from bidirectional context:

Python

import random

def create_mlm_batch(token_ids: list[int], mask_token_id: int, vocab_size: int, mask_prob: float = 0.15):
    """Create masked inputs and labels for MLM training."""

    inputs = token_ids.copy()
    labels = [-100] * len(token_ids)  # -100 = ignore in loss

    for i, token in enumerate(token_ids):
        if random.random() < mask_prob:
            labels[i] = token  # What we want to predict

            r = random.random()
            if r < 0.80:
                inputs[i] = mask_token_id      # 80%: replace with [MASK]
            elif r < 0.90:
                inputs[i] = random.randint(0, vocab_size - 1)  # 10%: random token
            # else: 10%: keep original (still predicted in loss)

    return inputs, labels

# Training loss: predict masked tokens using full bidirectional context
# "The patient was prescribed [MASK] for anticoagulation"
# Model sees all tokens including those after [MASK] → bidirectional
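The `-100` labels matter because the loss is averaged only over the selected positions. The sketch below shows that idea in plain Python; `masked_lm_loss` is a hypothetical helper, and returning 0.0 when nothing was selected is a choice made here for simplicity. In PyTorch, `-100` is the default `ignore_index` of the cross-entropy loss, which is why the value appears in the labels.

```python
import math

IGNORE_INDEX = -100  # label value for positions excluded from the loss


def masked_lm_loss(logits: list[list[float]], labels: list[int]) -> float:
    """Mean cross-entropy over positions whose label is not IGNORE_INDEX."""
    losses = []
    for row, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # unselected token: contributes nothing to the loss
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))  # log-sum-exp
        losses.append(log_z - row[label])  # -log softmax(row)[label]
    return sum(losses) / len(losses) if losses else 0.0


# Three positions, vocabulary of 4, only the middle position was selected.
uniform_logits = [[0.0] * 4 for _ in range(3)]
loss = masked_lm_loss(uniform_logits, [IGNORE_INDEX, 2, IGNORE_INDEX])
print(loss)  # equals log(4): a uniform guess over 4 tokens
```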

GPT: Next-Token Prediction (Causal LM)

Predict every next token, given only past tokens:

Python

def create_causal_lm_batch(token_ids: list[int]):
    """Create inputs and labels for causal LM training."""
    # For a sequence of N+1 tokens (indices 0 to N):
    # Input: tokens 0 to N-1
    # Label: tokens 1 to N (the same sequence, one position ahead)
    inputs = token_ids[:-1]
    labels = token_ids[1:]
    return inputs, labels

# "The patient was prescribed warfarin for anticoagulation"
# Position 0: predict "patient" given ["The"]
# Position 1: predict "was" given ["The", "patient"]
# Position 3: predict "warfarin" given ["The", "patient", "was", "prescribed"]
# No bidirectional context — can't see future tokens
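To make the shifted labels concrete, the sketch below expands the example sentence into explicit (context, target) pairs. It splits on whitespace as a stand-in for a real tokenizer, and `next_token_examples` is a hypothetical helper name.

```python
def next_token_examples(tokens: list[str]) -> list[tuple[list[str], str]]:
    """Pair every prefix of the sequence with the token that follows it."""
    return [(tokens[: i + 1], tokens[i + 1]) for i in range(len(tokens) - 1)]


# Whitespace split is an assumption; real models use subword tokenizers.
tokens = "The patient was prescribed warfarin for anticoagulation".split()
examples = next_token_examples(tokens)

for position, (context, target) in enumerate(examples):
    print(position, context, "->", target)
```

A sequence of 7 tokens yields 6 training examples, one per position, all computed in a single forward pass thanks to the causal mask.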

Task Suitability

| Task | BERT-style encoder | GPT-style decoder | Why |
|---|---|---|---|
| Text classification | Excellent | Possible but worse | Bidirectional context → richer [CLS] representation |
| Named entity recognition | Excellent | Possible | Each token has full context from both sides |
| Question answering (extractive) | Excellent | Possible | Seeing the full context before answering a span |
| Semantic similarity | Excellent | Adequate | Can compare two sequences with both as context |
| Text generation | Cannot (natively) | Excellent | Causal structure enables autoregressive generation |
| Code completion | No | Excellent | Prediction of next tokens is exactly the task |
| Summarization | Encoder in seq2seq | Possible with prompting | Generation requires left-to-right prediction |
| Translation | Encoder in seq2seq | Possible with prompting | Same as summarization |
| Zero-shot tasks | Limited | Excellent | Prompting requires generation ability |

Encoder-Decoder (T5, BART)

A third architecture combines both:

Encoder: bidirectional — reads and contextualizes the input
Decoder: causal — generates the output token by token
Cross-attention: decoder attends to encoder output at each step

Python

# T5/BART-style: encoder reads input, decoder generates output
# Input text → encoder → context representations → decoder → output text

# Used for:
# - Summarization: input=article, output=summary
# - Translation: input=English, output=French
# - Question answering (generative): input=question+context, output=answer

T5 treats every NLP task as text-to-text. BART corrupts the input (for example by masking or shuffling) and trains the decoder to reconstruct the original text, a denoising objective that is useful for document denoising and summarization.
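The three attention patterns in an encoder-decoder model can be sketched as boolean masks. This is an illustrative layout, not code from T5 or BART, and the helper names are hypothetical. Note the convention here is `True` = "may attend", the opposite of the PyTorch mask earlier, where `True` marked positions to block.

```python
def full_mask(n_queries: int, n_keys: int) -> list[list[bool]]:
    """Every query position may attend to every key position."""
    return [[True] * n_keys for _ in range(n_queries)]


def causal_mask(n: int) -> list[list[bool]]:
    """Position i may attend to positions 0..i only."""
    return [[j <= i for j in range(n)] for i in range(n)]


def encoder_decoder_masks(src_len: int, tgt_len: int) -> dict[str, list[list[bool]]]:
    return {
        "encoder_self": full_mask(src_len, src_len),  # bidirectional over the input
        "decoder_self": causal_mask(tgt_len),         # causal over the output so far
        "cross": full_mask(tgt_len, src_len),         # each output step sees the whole input
    }


masks = encoder_decoder_masks(src_len=4, tgt_len=3)
for name, mask in masks.items():
    print(name)
    for row in mask:
        print("  ", "".join("x" if allowed else "." for allowed in row))
```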

Practical: Using BERT for Classification

Python

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

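# BERT-based drug interaction classifier.
# Note: the classification head added on top of BioBERT is randomly initialized,
# so the model must be fine-tuned on labeled interactions before use.
tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.2")
model = AutoModelForSequenceClassification.from_pretrained(
    "dmis-lab/biobert-base-cased-v1.2",
    num_labels=3,  # index order must match `labels` below: minor, moderate, major
)
model.eval()  # disable dropout for inference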

def classify_interaction(drug_a: str, drug_b: str, context: str) -> dict:
    text = f"Drug A: {drug_a}. Drug B: {drug_b}. Context: {context}"

    inputs = tokenizer(
        text,
        return_tensors="pt",
        max_length=512,
        truncation=True,
        padding=True,
    )

    with torch.no_grad():
        outputs = model(**inputs)

    probs = torch.softmax(outputs.logits, dim=-1)[0]
    labels = ["minor", "moderate", "major"]
    predicted_label = labels[probs.argmax().item()]

    return {
        "prediction": predicted_label,
        "confidence": probs.max().item(),
        "probabilities": {label: prob.item() for label, prob in zip(labels, probs)},
    }

Practical: Using GPT for Generation

Python

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# GPT-style drug information generation
# (gated model: requires accepting the license on the Hugging Face Hub)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct", device_map="auto")

def generate_drug_summary(drug_name: str, max_new_tokens: int = 300) -> str:
    messages = [
        {"role": "system", "content": "You are a clinical pharmacology expert."},
        {"role": "user", "content": f"Summarize the key clinical information for {drug_name}."},
    ]

    # return_dict=True yields input_ids and attention_mask; move both to the model's device
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,
    ).to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            # Greedy decoding for factual tasks.
            # temperature is omitted because it has no effect without sampling.
            do_sample=False,
        )

    # Decode only the generated tokens (not the prompt)
    generated_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(generated_tokens, skip_special_tokens=True)
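Conceptually, greedy decoding is a loop: score the candidates for the next token, append the best one, repeat. The sketch below shows that loop with a toy stand-in for the model. The bigram table, its scores, and the names `greedy_generate` and `toy_scores` are all made up for illustration; a real model scores the full vocabulary from the entire context, not just the last token.

```python
from typing import Callable


def greedy_generate(
    next_token_scores: Callable[[list[str]], dict[str, float]],
    prompt: list[str],
    max_new_tokens: int,
    eos_token: str = "<eos>",
) -> list[str]:
    """Repeatedly append the highest-scoring next token (greedy decoding)."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        scores = next_token_scores(tokens)   # model sees only past tokens
        best = max(scores, key=scores.get)   # argmax, no sampling
        tokens.append(best)
        if best == eos_token:
            break
    return tokens[len(prompt):]              # generated tokens only, not the prompt


# Toy "model": made-up scores keyed by the last token only.
BIGRAMS = {
    "The": {"patient": 0.8, "drug": 0.2},
    "patient": {"was": 0.9, "is": 0.1},
    "was": {"prescribed": 0.6, "given": 0.4},
    "prescribed": {"warfarin": 0.7, "aspirin": 0.3},
    "warfarin": {"<eos>": 1.0},
}


def toy_scores(tokens: list[str]) -> dict[str, float]:
    return BIGRAMS.get(tokens[-1], {"<eos>": 1.0})


generated = greedy_generate(toy_scores, ["The"], max_new_tokens=10)
print(generated)  # ['patient', 'was', 'prescribed', 'warfarin', '<eos>']
```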

Modern Convergence

The strict encoder/decoder distinction has blurred:

GPT models with longer contexts can now do many classification tasks via prompting ("classify this as X or Y")
Instruction tuning makes decoder models behave like structured responders, covering most tasks BERT was used for
The trend is toward decoder-only: LLaMA and Mistral are decoder-only, and GPT-4 and Claude are generally understood to be as well

Encoder models (BERT, RoBERTa, DeBERTa) remain dominant for:

Embedding models (used in RAG systems)
Fast classification where latency matters
Tasks requiring cross-sentence comparison (NLI, semantic similarity)

For new projects that need text generation or instruction following, choose a decoder-only GPT-style architecture. For embedding-based retrieval or fast classification, choose a BERT-style encoder.
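For the retrieval case, one common way to turn an encoder's per-token outputs into a single embedding is mean pooling over the non-padding tokens, followed by cosine similarity for comparison. The sketch below uses plain Python and made-up vectors; `mean_pool` and `cosine_similarity` are illustrative helpers, and specific embedding models may pool differently (for example, using the [CLS] vector).

```python
import math


def mean_pool(token_vectors: list[list[float]], attention_mask: list[int]) -> list[float]:
    """Average the token vectors whose attention_mask entry is 1 (skip padding)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask excludes every token")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 2-dimensional "encoder outputs"; the last token is padding.
query = mean_pool([[1.0, 0.0], [3.0, 0.0], [100.0, 100.0]], [1, 1, 0])
doc_same_direction = [4.0, 0.0]
doc_orthogonal = [0.0, 5.0]

print(query)                                         # [2.0, 0.0]
print(cosine_similarity(query, doc_same_direction))  # 1.0
print(cosine_similarity(query, doc_orthogonal))      # 0.0
```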
