LLMs Deep Dive · Lesson 17 of 24
Perplexity: The Core Language Model Metric
Definition
Perplexity (PPL) measures how well a language model predicts a held-out text sequence. It's the exponentiated average negative log-likelihood per token:
PPL(W) = exp(-1/T Σᵢ log P(wᵢ | w₁, ..., wᵢ₋₁))
where:
W = sequence of T tokens
P(wᵢ | w₁,...,wᵢ₋₁) = model's predicted probability for the i-th token
Equivalently:
PPL = exp(NLL) where NLL is the mean negative log-likelihood (nats)
PPL = 2^(cross-entropy) where cross-entropy is in bits
Intuition
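A short derivation, offered as a sketch, gives two readings of the definition. First, perplexity is the inverse geometric mean of the per-token probabilities. Second, the nats and bits forms agree because the cross-entropy in bits is the mean NLL in nats divided by ln 2. The symbols H_nats (mean NLL in nats) and H_bits (cross-entropy in bits) are introduced here for illustration only.

```latex
\begin{align*}
\mathrm{PPL}(W)
  &= \exp\!\Big(-\frac{1}{T}\sum_{i=1}^{T} \log P(w_i \mid w_1,\dots,w_{i-1})\Big)
   = \Big(\prod_{i=1}^{T} P(w_i \mid w_1,\dots,w_{i-1})\Big)^{-1/T} \\[4pt]
H_{\text{bits}} &= \frac{H_{\text{nats}}}{\ln 2}
  \quad\Longrightarrow\quad
  2^{H_{\text{bits}}}
   = e^{(\ln 2)\,H_{\text{nats}}/\ln 2}
   = e^{H_{\text{nats}}}
   = \mathrm{PPL}(W)
\end{align*}
```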
Perplexity measures how "surprised" a model is by the text:
Low perplexity:
Model assigned high probability to each actual token
Model "expected" this text — it fits the distribution well
Example: LLaMA 2 7B on Wikipedia text → PPL ≈ 5-8
High perplexity:
Model assigned low probability to many tokens
Model is "confused" by the text
Example: LLaMA 2 7B on out-of-distribution (OOD) text, such as a domain or language rare in its training data → PPL ≈ 50+
Perfect model: PPL = 1 (assigns probability 1 to each token — impossible in practice)
Uniform model: PPL = vocabulary_size (~32000 for LLaMA 2), i.e. random guessing
Computing Perplexity in PyTorch
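Before involving a real model, the formula and its two boundary cases can be checked in plain Python. This is a minimal sketch: the function name `perplexity`, the hand-picked probabilities, and the sequence length of 10 are illustrative assumptions, and the vocabulary size simply reuses the LLaMA 2 figure quoted above.

```python
import math

def perplexity(token_probs):
    """Perplexity from the probabilities assigned to the tokens that actually occurred."""
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)

VOCAB_SIZE = 32000  # illustrative, reusing the vocabulary size quoted in the text

perfect = perplexity([1.0] * 10)               # every token predicted with certainty
uniform = perplexity([1.0 / VOCAB_SIZE] * 10)  # uniform guess over the vocabulary
mixed = perplexity([0.5, 0.25, 0.5, 0.25])     # hand-picked probabilities

print(f"perfect={perfect:.2f}  uniform={uniform:.0f}  mixed={mixed:.3f}")
```

The perfect case gives 1 and the uniform case recovers the vocabulary size, so a perplexity of k can be read as the model being as uncertain as a uniform choice among k tokens.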
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer


def compute_perplexity(model, tokenizer, text: str, stride: int = 512) -> float:
    device = next(model.parameters()).device
    encodings = tokenizer(text, return_tensors="pt")
    input_ids = encodings.input_ids.to(device)
    seq_len = input_ids.size(1)
    max_len = model.config.max_position_embeddings  # e.g., 4096

    total_nll = 0.0
    total_tokens = 0
    prev_end = 0  # end of the span already scored by earlier windows

    # Sliding window for sequences longer than max_len (assumes stride <= max_len)
    for begin_idx in range(0, seq_len, stride):
        end_idx = min(begin_idx + max_len, seq_len)
        input_chunk = input_ids[:, begin_idx:end_idx]

        with torch.no_grad():
            logits = model(input_chunk).logits

        # The logit at position j predicts token j + 1, so drop the last
        # logit and the first token to align predictions with targets.
        log_probs = F.log_softmax(logits[:, :-1].float(), dim=-1)
        targets = input_chunk[:, 1:]

        # Only score tokens not covered by an earlier window (avoid counting overlap)
        n_new = min(end_idx - prev_end, targets.size(1))
        nll = F.nll_loss(
            log_probs[:, -n_new:].reshape(-1, log_probs.size(-1)),
            targets[:, -n_new:].reshape(-1),
            reduction="sum",
        )
        total_nll += nll.item()
        total_tokens += n_new

        prev_end = end_idx
        if end_idx == seq_len:
            break

    return torch.exp(torch.tensor(total_nll / total_tokens)).item()


# Usage
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf").cuda()
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
ppl = compute_perplexity(model, tokenizer, "The patient takes Warfarin 5mg daily.")
print(f"PPL: {ppl:.2f}")
Perplexity Values in Context
GPT-2 (small, 117M) on WikiText-103: PPL ≈ 37.5
GPT-2 (large, 774M) on WikiText-103: PPL ≈ 22.0
GPT-3 (175B) on Penn Treebank: PPL ≈ 20.5
LLaMA 2 7B on WikiText-2: PPL ≈ 5.47
LLaMA 2 70B on WikiText-2: PPL ≈ 3.32
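One way to read these numbers, sketched directly from the definition: log2(PPL) is the average cross-entropy in bits per token, and 1/PPL is the geometric mean of the probabilities the model assigned to the correct tokens. The helper name `interpret` is illustrative and not part of any library; the two values are copied from the list above.

```python
import math

# Values copied from the list above
reported = {
    "LLaMA 2 7B on WikiText-2": 5.47,
    "LLaMA 2 70B on WikiText-2": 3.32,
}

def interpret(ppl):
    """Return (bits per token, geometric-mean token probability) for a perplexity."""
    bits_per_token = math.log2(ppl)  # cross-entropy in bits
    geo_mean_prob = 1.0 / ppl        # inverse of perplexity
    return bits_per_token, geo_mean_prob

for name, ppl in reported.items():
    bits, prob = interpret(ppl)
    print(f"{name}: {bits:.2f} bits/token, geometric-mean probability {prob:.3f}")
```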
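Lower is better, and within a model family scaling reduces PPL predictably (scaling laws). The figures above are not comparable across families, because the tokenisers and datasets differ.
Limitations of Perplexity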
1. Tokeniser dependence:
Different tokenisers produce different per-token NLL values.
GPT-2 and LLaMA 2 cannot be compared directly with PPL
even on the same text.
2. Doesn't measure task performance:
A model with PPL=5 may be worse at medical QA than one with PPL=7
if the lower-PPL model was trained on different domains.
3. Doesn't measure factual accuracy:
A model can assign high probability to plausible-but-wrong text.
4. Doesn't measure instruction following:
PPL is a pretraining metric. Aligned models (with RLHF) often
have HIGHER PPL than their base models on standard benchmarks
but are much more useful.
5. Long-context PPL requires sliding window:
Sequences longer than max_len need careful windowed evaluation.
Bits Per Byte (BPB)
A tokeniser-independent alternative:
BPB = (total NLL in nats) / (ln 2 × sequence length in bytes)
= (cross-entropy per token in bits) / (average bytes per token)
Since different tokenisers have different tokens/character ratios,
BPB normalises by byte count — allowing fair cross-tokeniser comparison.
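The effect can be sketched with made-up numbers. The example below assumes two hypothetical tokenisers that split the same sentence into 10 and 15 tokens, and a hypothetical total NLL of 30 nats shared by both models; none of these figures come from a real model, and the helper names are illustrative. Dividing by ln 2 converts nats to bits.

```python
import math

def perplexity(total_nll_nats, num_tokens):
    """Per-token perplexity from a total NLL in nats."""
    return math.exp(total_nll_nats / num_tokens)

def bits_per_byte(total_nll_nats, text):
    """Total NLL converted to bits, normalised by the UTF-8 byte length of the text."""
    num_bytes = len(text.encode("utf-8"))
    return total_nll_nats / (math.log(2) * num_bytes)

text = "The patient takes Warfarin 5mg daily."
total_nll = 30.0  # hypothetical total NLL in nats, assumed equal for both models

# Hypothetical token counts for the same text under two tokenisers
ppl_a = perplexity(total_nll, num_tokens=10)
ppl_b = perplexity(total_nll, num_tokens=15)
bpb = bits_per_byte(total_nll, text)

print(f"PPL (tokeniser A): {ppl_a:.2f}")
print(f"PPL (tokeniser B): {ppl_b:.2f}")
print(f"BPB (both):        {bpb:.3f}")
```

Per-token perplexity differs purely because of the token count, while BPB is identical for both, since neither the total NLL nor the byte count depends on how the text was split.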
Lower BPB = better compression = better model.
Interview Answer
"Perplexity (PPL) is exp(-1/T Σ log P(wᵢ|context)) — the exponentiated mean negative log-likelihood per token. It measures how surprised the model is by the test text: lower is better. It's the standard metric for pretraining evaluation and correlates with downstream task performance. Limitations: it's tokeniser-dependent (can't compare across tokenisers directly), it doesn't measure factual correctness or instruction following, and RLHF-aligned models often have higher PPL than base models despite being more useful. Bits-per-byte is a tokeniser-independent alternative for fair cross-model comparisons."