Transformer Architecture Q&A · Lesson 9 of 23
Decoder-Only Models: GPT, LLaMA, Mistral
What Decoder-Only Means
A decoder-only model is a stack of transformer blocks where each block uses causally masked self-attention — no cross-attention, no encoder. Every token can only attend to tokens that came before it (including itself):
Input: "The cat sat on"
↓ causal self-attention ↓
each token attends LEFT only
Output: distribution over vocabulary for NEXT token
Position 4 output → predicts "the" or "mat" etc.

The causal mask ensures that during training, position i cannot see positions i+1, i+2, ... This lets the model be trained and run autoregressively.
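The masking rule described above can be illustrated with a small PyTorch sketch. It is not taken from any particular model's code: the function name `causal_attention_weights` and the random scores are placeholders chosen for illustration, and real implementations apply the same idea per head and per batch element.

```python
import torch
import torch.nn.functional as F

def causal_attention_weights(scores: torch.Tensor) -> torch.Tensor:
    """Apply a causal mask to raw attention scores of shape (T, T), then normalize."""
    T = scores.size(-1)
    # allowed[i, j] is True when position i may attend to position j (j <= i)
    allowed = torch.tril(torch.ones(T, T, dtype=torch.bool))
    # Future positions get -inf so that softmax assigns them zero weight
    masked = scores.masked_fill(~allowed, float("-inf"))
    return F.softmax(masked, dim=-1)

torch.manual_seed(0)
# Random scores stand in for q·k products over 4 tokens ("The cat sat on")
weights = causal_attention_weights(torch.randn(4, 4))
```

Each row of `weights` sums to 1 and has zeros above the diagonal, so the first token attends only to itself and the last token attends to all four.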
GPT Architecture
GPT (Generative Pre-trained Transformer) is the canonical decoder-only model:
Token Embeddings + Positional Encodings
        ↓
[Decoder Block 1]
    Masked Multi-Head Self-Attention
    Feed-Forward Network
    (Residual + LayerNorm at each step)
        ↓
[Decoder Block 2]
    ...
[Decoder Block N]
        ↓
Linear + Softmax → next-token probability distribution

No cross-attention exists because there is no encoder. The decoder attends only to itself.
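One block of the stack above might look roughly like the following in PyTorch. This is a minimal sketch under stated assumptions, not GPT's actual code: the class name `DecoderBlock` and the sizes `d_model=64`, `n_heads=4`, `d_ff=256` are arbitrary, and LayerNorm is applied after each residual connection (the original post-LN placement), whereas GPT-2 and later models apply it before each sublayer.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One GPT-style block: masked self-attention, then a feed-forward network."""

    def __init__(self, d_model: int = 64, n_heads: int = 4, d_ff: int = 256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        T = x.size(1)
        # For a boolean attn_mask, True marks positions that must NOT be attended to
        future = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1
        )
        attn_out, _ = self.attn(x, x, x, attn_mask=future, need_weights=False)
        x = self.ln1(x + attn_out)      # residual + LayerNorm
        x = self.ln2(x + self.ffn(x))   # residual + LayerNorm
        return x

torch.manual_seed(0)
block = DecoderBlock().eval()
x = torch.randn(1, 5, 64)   # (batch, seq_len, d_model)
out = block(x)              # same shape as x
```

Because of the mask, changing the last input token leaves the outputs at earlier positions unchanged, which is the property that makes autoregressive training and generation consistent.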
Pretraining: Next-Token Prediction
GPT is pretrained with a simple objective — predict the next token given all previous tokens:
Input:  "The cat sat"
Target: "cat sat on" (the same sequence shifted by one position, so each target is the next token)
At each position i:
    model sees tokens 0..i
    predicts token i+1
Loss: cross-entropy over ALL positions simultaneously
(no masking of the loss, unlike MLM where only masked positions count)

This is called causal language modelling (CLM) or autoregressive language modelling. It makes dense use of the data: every token in the sequence provides a training signal, whereas MLM gets a signal only from the masked positions.
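The shift between inputs and targets is easy to get wrong, so here is a hedged sketch of the loss computation. The helper name `next_token_loss` is hypothetical, and random logits stand in for the output of a real causal model.

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, ids: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between the prediction at position i and the token at i+1.

    logits: (batch, seq_len, vocab_size), produced by a causal model run on `ids`
    ids:    (batch, seq_len) token ids
    """
    pred = logits[:, :-1, :]   # predictions made at positions 0..T-2
    target = ids[:, 1:]        # the tokens that actually follow: positions 1..T-1
    # Flatten batch and sequence so every position contributes to the mean loss
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))

# Toy check: random logits stand in for a model's output
torch.manual_seed(0)
vocab_size = 10
ids = torch.randint(0, vocab_size, (2, 6))
logits = torch.randn(2, 6, vocab_size)
loss = next_token_loss(logits, ids)
```

The last position's prediction is dropped because the sequence contains no token after it to compare against.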
GPT Variants
| Model | Params | Key difference |
|-------|--------|----------------|
| GPT-1 | 117M | Original, 12 layers |
| GPT-2 | 1.5B | Larger, zero-shot capabilities emerge |
| GPT-3 | 175B | Few-shot learning from context only |
| GPT-4 | Undisclosed (~1T est.) | Multimodal, RLHF-aligned |
| LLaMA 2 | 7B–70B | Open weights, grouped-query attention (70B model) |
| Mistral 7B | 7B | Sliding window attention, GQA |
| Falcon | 7B–180B | Multi-query attention |
All use the same fundamental architecture: stacked causally-masked self-attention blocks.
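As one illustration of how a variant changes the attention pattern without changing the overall architecture, the sliding window attention listed for Mistral 7B can be thought of as a narrower causal mask. The sketch below is an assumption-laden toy: the function name and the window size of 2 are made up for readability, and production implementations avoid building the full mask.

```python
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """Return allowed[i, j], True when i - window < j <= i."""
    ones = torch.ones(seq_len, seq_len, dtype=torch.bool)
    causal = torch.tril(ones)                           # j <= i (no future tokens)
    recent = torch.triu(ones, diagonal=-(window - 1))   # j >= i - (window - 1)
    return causal & recent

# Each token sees itself and at most one previous token
mask = sliding_window_causal_mask(seq_len=5, window=2)
```

With `window` equal to `seq_len`, this reduces to the ordinary causal mask.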
Autoregressive Generation
At inference, the model generates one token at a time:
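import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, prompt_ids, max_new_tokens=50, temperature=1.0, eos_token_id=None):
    """Sample tokens one at a time. Assumes batch size 1 and that model(ids) returns logits."""
    ids = prompt_ids.clone()
    for _ in range(max_new_tokens):
        # Forward pass on all tokens so far (this simple version has no KV cache)
        logits = model(ids)              # (1, seq_len, vocab_size)
        next_logits = logits[:, -1, :]   # last position only
        # Sample from the temperature-scaled distribution (temperature must be > 0)
        probs = F.softmax(next_logits / temperature, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)   # (1, 1)
        ids = torch.cat([ids, next_token], dim=1)
        # Stop once the end-of-sequence token is produced, if one was given
        if eos_token_id is not None and next_token.item() == eos_token_id:
            break
    return ids

Each step feeds all previously generated tokens back in, so the loop above recomputes attention over the whole sequence every time. A KV cache avoids this by storing the keys and values of already-processed tokens.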
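To make the KV cache idea concrete, here is a minimal single-head sketch that compares full recomputation with step-by-step decoding. The class `CachedSelfAttention` and its methods are hypothetical names, and real models cache keys and values per layer and per head, but the principle is the same.

```python
import torch
import torch.nn.functional as F

class CachedSelfAttention(torch.nn.Module):
    """Single-head causal self-attention that can reuse cached keys and values."""

    def __init__(self, d_model: int = 16):
        super().__init__()
        self.q = torch.nn.Linear(d_model, d_model)
        self.k = torch.nn.Linear(d_model, d_model)
        self.v = torch.nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5

    def forward_full(self, x):
        """Recompute attention for every position; x is (1, T, d_model)."""
        T = x.size(1)
        scores = self.q(x) @ self.k(x).transpose(-2, -1) * self.scale
        future = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        weights = F.softmax(scores.masked_fill(future, float("-inf")), dim=-1)
        return weights @ self.v(x)

    def forward_step(self, x_new, cache=None):
        """Process only the newest token; x_new is (1, 1, d_model)."""
        k_new, v_new = self.k(x_new), self.v(x_new)
        if cache is not None:
            # Append the new key/value to those stored from earlier steps
            k_all = torch.cat([cache[0], k_new], dim=1)
            v_all = torch.cat([cache[1], v_new], dim=1)
        else:
            k_all, v_all = k_new, v_new
        # The newest token attends to every cached position plus itself,
        # all of which are in its past, so no mask is needed here
        scores = self.q(x_new) @ k_all.transpose(-2, -1) * self.scale
        out = F.softmax(scores, dim=-1) @ v_all
        return out, (k_all, v_all)

torch.manual_seed(0)
attn = CachedSelfAttention()
x = torch.randn(1, 5, 16)
with torch.no_grad():
    full = attn.forward_full(x)
    cache, steps = None, []
    for t in range(5):
        out, cache = attn.forward_step(x[:, t:t + 1], cache)
        steps.append(out)
    incremental = torch.cat(steps, dim=1)
```

Both paths produce the same outputs up to floating-point tolerance, but the cached path computes keys and values for each token only once instead of once per generation step.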
Decoder-Only vs Encoder-Only
| Property | Encoder-Only (BERT) | Decoder-Only (GPT) |
|----------|---------------------|---------------------|
| Attention | Bidirectional | Causal (left-only) |
| Pretraining | Masked LM (15% masked) | Next-token prediction |
| Output | Contextualised embeddings | Next-token distribution |
| Context | Full sequence | Past tokens only |
| Good for | Classification, NER, QA | Generation, chat, summarisation |
| Cannot do natively | Generation | Rich bidirectional understanding |
Why Decoder-Only Dominates Modern LLMs
Despite the seemingly weaker unidirectional attention, decoder-only models dominate because:
- Scale: Next-token prediction scales seamlessly with data and compute
- Generality: Generation subsumes most tasks — even classification can be framed as "generate the label"
- In-context learning: The autoregressive format naturally supports few-shot prompting
- No architectural mismatch: No separate encoder/decoder to balance during training
Encoder-only models often remain the better choice for tasks that need rich bidirectional representations and have only small fine-tuning datasets (NER, sentence similarity).
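The "generate the label" framing from the list above can be sketched as a comparison of next-token logits restricted to the label tokens. Everything specific here is hypothetical: the function name, the prompt wording, and the token ids 17 and 42 are placeholders, and the toy logits stand in for a real model's output.

```python
import torch

def classify_by_generation(next_logits: torch.Tensor, label_token_ids: dict) -> str:
    """Pick the label whose token gets the highest next-token logit.

    next_logits:     (vocab_size,) logits for the token that follows the prompt
    label_token_ids: maps label name -> token id (hypothetical ids in this sketch)
    """
    names = list(label_token_ids)
    ids = torch.tensor([label_token_ids[n] for n in names])
    # Only the candidate label tokens compete; all other vocabulary entries are ignored
    return names[int(torch.argmax(next_logits[ids]))]

# Toy logits standing in for a model's output after a prompt such as
# "Review: great film. Sentiment:"
next_logits = torch.zeros(100)
next_logits[17] = 2.0   # hypothetical token id for " positive"
next_logits[42] = 0.5   # hypothetical token id for " negative"
label = classify_by_generation(next_logits, {"positive": 17, "negative": 42})
```

This assumes each label maps to a single token; multi-token labels would need their per-token log-probabilities summed instead.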
Interview Answer
"Decoder-only models like GPT use causally masked self-attention — each token can only attend to past tokens, enabling autoregressive generation. They're pretrained with next-token prediction: given all previous tokens, predict the next one, with a loss signal at every position. At inference, tokens are generated one at a time, feeding each output back as the next input. They can't do bidirectional understanding natively, but their generality at scale makes them the dominant architecture for modern LLMs."