Decoder-Only Models (GPT-Style)

What Decoder-Only Means

A decoder-only model is a stack of transformer blocks in which each block uses causally masked self-attention, with no cross-attention and no encoder. Each token can attend only to itself and to the tokens before it:

Input:  "The cat sat on"
        ↓ causal self-attention ↓
        each token attends LEFT only
Output: distribution over vocabulary for NEXT token
        Output at the last position ("on") → predicts "the" or "mat" etc.

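The causal mask ensures that position i cannot see positions i+1, i+2, ... Because no position can peek at its own target, the model can be trained on every position of a sequence in parallel and then run autoregressively at inference, with the same left-to-right conditioning in both cases.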
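As a minimal sketch of how the mask works, the snippet below implements single-head scaled dot-product attention with an upper-triangular mask. The function name `causal_self_attention` and the toy shapes are illustrative assumptions; real implementations add learned projections, multiple heads, and batching.

```python
import math
import torch
import torch.nn.functional as F

def causal_self_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v: tensors of shape (seq_len, d_head).
    """
    seq_len, d_head = q.shape
    scores = q @ k.T / math.sqrt(d_head)          # (seq_len, seq_len)

    # True above the diagonal marks "future" positions j > i.
    future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))

    weights = F.softmax(scores, dim=-1)           # row i spreads its weight over positions <= i
    return weights @ v, weights

torch.manual_seed(0)
x = torch.randn(4, 8)                             # 4 tokens, e.g. "The cat sat on"
out, weights = causal_self_attention(x, x, x)
print(weights)                                    # entries above the diagonal are exactly zero
```

Setting the masked scores to negative infinity before the softmax gives those positions zero weight, so changing a later token cannot change the output at an earlier position.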

GPT Architecture

GPT (Generative Pre-trained Transformer) is the canonical decoder-only model:

Token Embeddings + Positional Encodings
        ↓
[Decoder Block 1]
  Masked Multi-Head Self-Attention
  Feed-Forward Network
  (Residual + LayerNorm at each step)
        ↓
[Decoder Block 2]
...
[Decoder Block N]
        ↓
Linear + Softmax → next-token probability distribution

No cross-attention exists because there is no encoder. The decoder attends only to itself.
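The stack above can be sketched in PyTorch roughly as follows. This is an illustrative toy, not any released GPT implementation: the class names `DecoderBlock` and `TinyGPT`, the sizes, and the choice of pre-LayerNorm (GPT-2 style, where GPT-1 normalised after each sub-layer) are all assumptions made for the example.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One GPT-style block: masked self-attention, then a feed-forward network."""
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        seq_len = x.size(1)
        # Boolean mask: True means "may not attend" (future positions).
        mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1
        )
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out                    # residual around attention
        x = x + self.ffn(self.ln2(x))       # residual around feed-forward
        return x

class TinyGPT(nn.Module):
    def __init__(self, vocab_size=100, max_len=32, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)   # learned position embeddings
        self.blocks = nn.ModuleList(
            [DecoderBlock(d_model, n_heads) for _ in range(n_layers)]
        )
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, ids):
        positions = torch.arange(ids.size(1), device=ids.device)
        x = self.tok_emb(ids) + self.pos_emb(positions)
        for block in self.blocks:
            x = block(x)
        return self.head(self.ln_f(x))      # logits; softmax is applied by the loss or sampler

model = TinyGPT().eval()
ids = torch.randint(0, 100, (1, 10))
logits = model(ids)
print(logits.shape)                         # torch.Size([1, 10, 100])
```

Note that there is no second attention layer inside the block: with no encoder output to attend to, masked self-attention is the only attention mechanism.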

Pretraining: Next-Token Prediction

GPT is pretrained with a simple objective — predict the next token given all previous tokens:

Input:   "The cat sat"
Target:  "cat sat on"   (input shifted left by 1: each position's target is the token that follows it)

At each position i:
  model sees tokens 0..i
  predicts token i+1

Loss: cross-entropy over ALL positions simultaneously
      (no masking of the loss, unlike MLM where only masked positions count)

This is called causal language modelling (CLM) or autoregressive language modelling. It makes dense use of the data: every position in the sequence provides a training signal, whereas MLM learns only from the roughly 15% of positions that are masked.
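A minimal sketch of this objective, assuming logits of shape (batch, seq_len, vocab_size) from some causal model: the prediction at position i is scored against the token at position i+1, so the last prediction and the first token have no partner and are dropped. The helper name `next_token_loss` is hypothetical, and padding is ignored here for simplicity.

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits, ids):
    """Cross-entropy between the prediction at position i and the token at i+1.

    logits: (batch, seq_len, vocab_size) from a causal model run on ids
    ids:    (batch, seq_len) token ids
    """
    pred = logits[:, :-1, :]        # predictions made at positions 0..n-2
    target = ids[:, 1:]             # the tokens that actually follow: positions 1..n-1
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))

vocab_size = 10
ids = torch.tensor([[3, 1, 4, 1, 5]])
uniform_logits = torch.zeros(1, 5, vocab_size)   # stand-in for a model that knows nothing
print(next_token_loss(uniform_logits, ids))      # about 2.3026, i.e. ln(10)
```

A single five-token sequence yields four prediction targets in one forward pass, which is what "a training signal at every position" means in practice.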

GPT Variants

| Model | Params | Key difference |
|-------|--------|----------------|
| GPT-1 | 117M | Original, 12 layers |
| GPT-2 | 1.5B | Larger, zero-shot capabilities emerge |
| GPT-3 | 175B | Few-shot learning from context only |
| GPT-4 | Undisclosed (~1T is an unofficial estimate) | Multimodal, RLHF-aligned |
| LLaMA 2 | 7B–70B | Open weights, grouped-query attention (GQA) in the larger models |
| Mistral 7B | 7B | Sliding window attention, GQA |
| Falcon | 7B–180B | Multi-query attention |

All use the same fundamental architecture: stacked causally-masked self-attention blocks.

Autoregressive Generation

At inference, the model generates one token at a time:

Python

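import torch
import torch.nn.functional as F

@torch.no_grad()  # inference only: no gradients needed
def generate(model, prompt_ids, max_new_tokens=50, temperature=1.0, eos_token_id=None):
    # prompt_ids: (1, prompt_len) tensor of token ids.
    # model(ids) is assumed to return logits of shape (1, seq_len, vocab_size).
    ids = prompt_ids.clone()

    for _ in range(max_new_tokens):
        # Forward pass on all tokens generated so far (no KV cache in this simple loop)
        logits = model(ids)             # (1, seq_len, vocab_size)
        next_logits = logits[:, -1, :]  # last position only

        # Sample from the temperature-scaled distribution (temperature must be > 0)
        probs = F.softmax(next_logits / temperature, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)  # (1, 1)

        ids = torch.cat([ids, next_token], dim=1)

        # Stop early if an end-of-sequence id was supplied and has been generated
        if eos_token_id is not None and next_token.item() == eos_token_id:
            break

    return ids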

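Each step feeds all previously generated tokens back in, so this simple loop recomputes attention over the whole sequence every time. Production implementations add a KV cache, which stores the keys and values of already-processed tokens so that each step only computes attention for the new token.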
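To illustrate the KV cache idea, here is a single-head sketch that processes one token at a time while reusing stored keys and values, then checks the result against full causal attention. The projection matrices, the `cached_step` helper, and the dict-based cache are assumptions for the example; real libraries keep a cache per layer and per head.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 16
Wq, Wk, Wv = (torch.randn(d, d) / math.sqrt(d) for _ in range(3))

def full_causal_attention(x):
    """Recompute attention for every position. x: (seq_len, d)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / math.sqrt(d)
    future = torch.triu(torch.ones(len(x), len(x), dtype=torch.bool), diagonal=1)
    return F.softmax(scores.masked_fill(future, float("-inf")), dim=-1) @ v

def cached_step(x_new, cache):
    """Process one new token, reusing keys/values stored for earlier tokens.

    x_new: (1, d). cache: dict with "k" and "v" tensors of shape (t, d).
    """
    q = x_new @ Wq
    cache["k"] = torch.cat([cache["k"], x_new @ Wk])
    cache["v"] = torch.cat([cache["v"], x_new @ Wv])
    # No mask needed: the cache only ever contains the current and past tokens.
    scores = q @ cache["k"].T / math.sqrt(d)      # (1, t+1)
    return F.softmax(scores, dim=-1) @ cache["v"]

x = torch.randn(6, d)
cache = {"k": torch.empty(0, d), "v": torch.empty(0, d)}
stepwise = torch.cat([cached_step(x[t:t + 1], cache) for t in range(len(x))])

print(torch.allclose(stepwise, full_causal_attention(x), atol=1e-5))   # True
```

The cached version does one query's worth of work per step instead of re-running every position, at the cost of memory that grows with sequence length.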

Decoder-Only vs Encoder-Only

| Property | Encoder-Only (BERT) | Decoder-Only (GPT) |
|----------|---------------------|---------------------|
| Attention | Bidirectional | Causal (left-only) |
| Pretraining | Masked LM (15% masked) | Next-token prediction |
| Output | Contextualised embeddings | Next-token distribution |
| Context | Full sequence | Past tokens only |
| Good for | Classification, NER, QA | Generation, chat, summarisation |
| Cannot do natively | Generation | Rich bidirectional understanding |

Why Decoder-Only Dominates Modern LLMs

Despite the seemingly weaker unidirectional attention, decoder-only models dominate because:

Scale: Next-token prediction scales seamlessly with data and compute
Generality: Generation subsumes most tasks — even classification can be framed as "generate the label"
In-context learning: The autoregressive format naturally supports few-shot prompting
No architectural mismatch: No separate encoder/decoder to balance during training
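The "generate the label" framing can be sketched as follows: take the next-token logits at the end of a prompt and compare only the tokens that stand for each candidate label. The helper `classify_by_next_token`, the token ids, and the hand-built logits are hypothetical stand-ins for a real model and tokenizer, and the sketch assumes each label maps to a single token.

```python
import torch

def classify_by_next_token(next_logits, label_token_ids):
    """Pick the label whose token the model rates most likely as the next token.

    next_logits:     (vocab_size,) logits at the last prompt position
    label_token_ids: dict mapping label name -> token id (hypothetical ids)
    """
    names = list(label_token_ids)
    ids = torch.tensor([label_token_ids[name] for name in names])
    # Renormalise over the candidate labels only.
    probs = torch.softmax(next_logits[ids], dim=-1)
    return names[int(probs.argmax())], dict(zip(names, probs.tolist()))

# Stand-in for the logits a model might produce after a prompt such as
# "Review: great film! Sentiment:"
next_logits = torch.zeros(50)
next_logits[7] = 2.0     # pretend token 7 is " positive"
next_logits[9] = 0.5     # pretend token 9 is " negative"

label, probs = classify_by_next_token(next_logits, {"positive": 7, "negative": 9})
print(label)             # positive
```

Few-shot prompting uses the same mechanism: worked examples are placed in the prompt, and the model's next-token distribution at the end supplies the answer without any weight updates.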

Encoder-only models often still do better on tasks that need rich bidirectional representations and have only small fine-tuning datasets (NER, sentence similarity).

Interview Answer

"Decoder-only models like GPT use causally masked self-attention — each token can only attend to past tokens, enabling autoregressive generation. They're pretrained with next-token prediction: given all previous tokens, predict the next one, with a loss signal at every position. At inference, tokens are generated one at a time, feeding each output back as the next input. They can't do bidirectional understanding natively, but their generality at scale makes them the dominant architecture for modern LLMs."