Perplexity: The Core Language Model Metric — LLMs Deep Dive | Learnixo

Definition

Perplexity (PPL) measures how well a language model predicts a held-out text sequence. It's the exponentiated average negative log-likelihood per token:

PPL(W) = exp(-1/T Σᵢ log P(wᵢ | w₁, ..., wᵢ₋₁))

where:
  W = sequence of T tokens
  P(wᵢ | w₁,...,wᵢ₋₁) = model's predicted probability for the i-th token

Equivalently:
  PPL = exp(NLL)  where NLL is the mean negative log-likelihood (nats)
  PPL = 2^(cross-entropy)  where cross-entropy is in bits
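
As a quick numerical check of these identities, the following minimal sketch computes perplexity three ways from a handful of made-up token probabilities. The list `token_probs` and the helper names are illustrative assumptions, not model output or library functions.

```python
import math

# Hypothetical probabilities a model assigned to the 4 actual tokens of a sequence.
token_probs = [0.5, 0.25, 0.125, 0.5]

def perplexity_from_nats(probs):
    """exp of the mean negative log-likelihood, using natural logs (nats)."""
    nll_nats = -sum(math.log(p) for p in probs) / len(probs)
    return math.exp(nll_nats)

def perplexity_from_bits(probs):
    """2 raised to the mean cross-entropy, using base-2 logs (bits)."""
    ce_bits = -sum(math.log2(p) for p in probs) / len(probs)
    return 2.0 ** ce_bits

def perplexity_geometric(probs):
    """Inverse geometric mean of the token probabilities."""
    return math.prod(probs) ** (-1.0 / len(probs))

ppl_nats = perplexity_from_nats(token_probs)
ppl_bits = perplexity_from_bits(token_probs)
ppl_geo = perplexity_geometric(token_probs)
print(ppl_nats, ppl_bits, ppl_geo)
```

Here the mean cross-entropy is 1.75 bits, so all three values agree at 2^1.75 (about 3.36). The choice of log base only changes the unit of the intermediate quantity, not the perplexity itself.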

Intuition

Perplexity measures how "surprised" a model is by the text:

Low perplexity:
  Model assigned high probability to each actual token
  Model "expected" this text — it fits the distribution well
  Example: LLaMA 2 7B on Wikipedia text → PPL ≈ 5-8

High perplexity:
  Model assigned low probability to many tokens
  Model is "confused" by the text
  Example: LLaMA 2 7B on out-of-distribution (OOD) text → PPL ≈ 50+

Perfect model: PPL = 1 (assigns probability 1 to each token — impossible in practice)
Uniform model: PPL = vocabulary_size (~32000 for LLaMA) — random guessing
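
The two boundary cases can be checked directly. This is a minimal sketch: `perplexity` is a hypothetical helper, the sequence length is arbitrary, and the vocabulary size of 32000 is the approximate LLaMA figure quoted above.

```python
import math

def perplexity(probs):
    """Perplexity from the probabilities assigned to the actual tokens."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

vocab_size = 32000  # approximate LLaMA vocabulary size, as in the text
seq_len = 10        # arbitrary illustrative sequence length

# Uniform model: every token gets probability 1 / vocab_size (random guessing).
uniform_ppl = perplexity([1.0 / vocab_size] * seq_len)

# Perfect model: every actual token gets probability 1.
perfect_ppl = perplexity([1.0] * seq_len)

print(f"uniform: {uniform_ppl:.1f}, perfect: {perfect_ppl:.1f}")
```

One way to read this: a perplexity of k means the model is, on average, as uncertain as if it were choosing uniformly among k tokens.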

Computing Perplexity in PyTorch

Python

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

def compute_perplexity(model, tokenizer, text: str, stride: int = 512) -> float:
    """Sliding-window perplexity of `text`; assumes stride <= the model's max context."""
    device = next(model.parameters()).device
    encodings = tokenizer(text, return_tensors="pt")
    input_ids = encodings.input_ids.to(device)
    seq_len = input_ids.size(1)
    max_len = model.config.max_position_embeddings  # e.g., 4096

    total_nll = 0.0
    total_tokens = 0
    prev_end = 0  # end of the span that has already been scored

    # Sliding window for sequences longer than max_len
    for begin_idx in range(0, seq_len, stride):
        end_idx = min(begin_idx + max_len, seq_len)
        target_len = end_idx - prev_end  # tokens this window scores for the first time

        input_chunk = input_ids[:, begin_idx:end_idx]
        labels = input_chunk.clone()
        labels[:, :-target_len] = -100  # overlap is context only; exclude it from the loss

        with torch.no_grad():
            logits = model(input_chunk).logits

        # Logits at position i predict the token at position i + 1, so shift by one
        shift_logits = logits[:, :-1, :].float()
        shift_labels = labels[:, 1:]
        nll = F.cross_entropy(shift_logits.reshape(-1, shift_logits.size(-1)),
                              shift_labels.reshape(-1),
                              ignore_index=-100, reduction="sum")

        total_nll += nll.item()
        total_tokens += (shift_labels != -100).sum().item()

        prev_end = end_idx
        if end_idx == seq_len:
            break  # the final window reached the end of the text

    if total_tokens == 0:
        raise ValueError("text must contain at least two tokens")
    return torch.exp(torch.tensor(total_nll / total_tokens)).item()

# Usage
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf").cuda()
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
ppl = compute_perplexity(model, tokenizer, "The patient takes Warfarin 5mg daily.")
print(f"PPL: {ppl:.2f}")
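
The part of sliding-window evaluation that is easiest to get wrong is the index bookkeeping: each token should be scored exactly once, with overlapping tokens serving only as context. One common way to do this is to remember where the previous window ended. The sketch below shows just that bookkeeping with no model involved; `scored_spans` is a hypothetical helper and the tiny sizes are chosen only for readability.

```python
def scored_spans(seq_len, max_len, stride):
    """Return (begin, end, first_scored) for each window.

    The window covers tokens begin..end-1, and tokens first_scored..end-1
    are the ones it scores for the first time. Assumes stride <= max_len.
    """
    spans = []
    prev_end = 0  # end of the region already scored by earlier windows
    for begin in range(0, seq_len, stride):
        end = min(begin + max_len, seq_len)
        spans.append((begin, end, prev_end))
        prev_end = end
        if end == seq_len:
            break  # last window reached the end of the sequence
    return spans

spans = scored_spans(seq_len=10, max_len=4, stride=2)
for begin, end, first_scored in spans:
    print(f"window [{begin}, {end}) scores tokens [{first_scored}, {end})")
```

In a real evaluation the very first token has no preceding context, so it has no prediction and drops out of the token count. A smaller stride gives every scored token more context (usually lowering the reported PPL) at the cost of more forward passes.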

Perplexity Values in Context

GPT-2 (small, 117M) on WikiText-103: PPL ≈ 37.5
GPT-2 (large, 774M) on WikiText-103: PPL ≈ 22.0
GPT-3 (175B) on Penn Treebank:       PPL ≈ 20.5
LLaMA 2 7B on WikiText-2:            PPL ≈ 5.47
LLaMA 2 70B on WikiText-2:           PPL ≈ 3.32

Lower is better, but only compare rows that share a dataset and a tokeniser. Within a model family, scaling reduces PPL predictably (scaling laws).

Limitations of Perplexity

1. Tokeniser dependence:
   Different tokenisers produce different per-token NLL values.
   GPT-2 and LLaMA 2 cannot be compared directly with PPL
   even on the same text.

2. Doesn't measure task performance:
   A model with PPL=5 may be worse at medical QA than one with PPL=7
   if the lower-PPL model was trained on different domains.

3. Doesn't measure factual accuracy:
   A model can assign high probability to plausible-but-wrong text.

4. Doesn't measure instruction following:
   PPL is a pretraining metric. Aligned models (with RLHF) often
   have HIGHER PPL than their base models on standard benchmarks
   but are much more useful.

5. Long-context PPL requires sliding window:
   Sequences longer than max_len need careful windowed evaluation.
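
Limitation 1 can be illustrated with a toy calculation. The numbers below are assumptions chosen for illustration, not measurements: two models assign the same total log-probability to the same text, but their tokenisers split it into different numbers of tokens.

```python
import math

# Assumed for illustration: identical total NLL on the same text.
total_nll_nats = 120.0
tokens_model_a = 40  # coarser tokeniser: fewer, longer tokens
tokens_model_b = 60  # finer tokeniser: more, shorter tokens

# Per-token perplexity divides the same total by different token counts.
ppl_a = math.exp(total_nll_nats / tokens_model_a)
ppl_b = math.exp(total_nll_nats / tokens_model_b)

print(f"model A PPL: {ppl_a:.2f}, model B PPL: {ppl_b:.2f}")
```

Both models predict the text equally well as a whole, yet model B reports a much lower per-token perplexity simply because its tokens are shorter.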

Bits Per Byte (BPB)

A tokeniser-independent alternative:

BPB = (total NLL in nats) / (ln 2 × sequence length in bytes)
    = (mean per-token cross-entropy in bits) / (average bytes per token)

Since different tokenisers have different tokens/character ratios,
BPB normalises by byte count — allowing fair cross-tokeniser comparison.

Lower BPB = better compression = better model.
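
A minimal sketch of the computation, assuming the total NLL over the text is already available in nats (for example, accumulated during a perplexity evaluation) and that the text is measured in UTF-8 bytes. The function name `bits_per_byte` and the example NLL value are hypothetical.

```python
import math

def bits_per_byte(total_nll_nats: float, text: str) -> float:
    """Convert a total NLL in nats to bits, then normalise by UTF-8 byte count."""
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("text must not be empty")
    total_bits = total_nll_nats / math.log(2)  # nats -> bits
    return total_bits / n_bytes

text = "The patient takes Warfarin 5mg daily."
example_nll = 30.0  # hypothetical total NLL in nats, not a real measurement
print(f"BPB: {bits_per_byte(example_nll, text):.3f}")
```

The token count never appears in the function, so two models with different tokenisers that assign the same total probability to the text get the same BPB.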

Interview Answer

"Perplexity (PPL) is exp(-1/T Σ log P(wᵢ|context)) — the exponentiated mean negative log-likelihood per token. It measures how surprised the model is by the test text: lower is better. It's the standard metric for pretraining evaluation and correlates with downstream task performance. Limitations: it's tokeniser-dependent (can't compare across tokenisers directly), it doesn't measure factual correctness or instruction following, and RLHF-aligned models often have higher PPL than base models despite being more useful. Bits-per-byte is a tokeniser-independent alternative for fair cross-model comparisons."