Perplexity as a Language Model Metric
Understand what perplexity measures, how to compute it, and when it is — and isn't — a useful signal for evaluating language models.
Perplexity is one of the oldest language model metrics. It predates the transformer era and is still widely used today — but often misapplied. This lesson explains what it actually measures, how to compute it, and when you should (and should not) use it.
What Perplexity Measures
A language model assigns a probability to every sequence of tokens. Intuitively, a well-trained model should assign high probability to text that sounds natural and low probability to gibberish.
Perplexity measures how "surprised" the model is by a piece of text. Low perplexity means the model found the text unsurprising — consistent with what it has seen during training. High perplexity means the model was not expecting this text.
Formally, for a sequence of N tokens t1, ..., tN, where the model assigns each token the conditional probability p(ti | t1...t_{i-1}):

```
Perplexity = exp( -1/N * sum_i( log p(ti | t1...t_{i-1}) ) )
```

This is the exponential of the average negative log-likelihood per token.
```python
import math


def compute_perplexity_from_log_probs(
    log_probs: list[float],
    n_tokens: int,
) -> float:
    """
    Compute perplexity from per-token log probabilities.

    Args:
        log_probs: List of log P(token | context) for each token
        n_tokens: Total number of tokens

    Returns:
        Perplexity score (lower is better)
    """
    avg_neg_log_prob = -sum(log_probs) / n_tokens
    return math.exp(avg_neg_log_prob)


# Example: model assigns probabilities to each token in a sentence
example_log_probs = [
    -0.1,   # "The": very common, model is confident
    -0.3,   # "dog": common after "The"
    -0.4,   # "sat": plausible after "The dog"
    -0.2,   # "on": highly likely after "sat"
    -0.1,   # "the": nearly certain after "sat on"
    -0.6,   # "mat": less common than "floor" etc.
    -0.05,  # ".": almost certain at end of sentence
]
ppl = compute_perplexity_from_log_probs(example_log_probs, len(example_log_probs))
print(f"Perplexity: {ppl:.2f}")  # about 1.28: very low, model is not surprised

# Compare: model is surprised by the text
surprising_log_probs = [
    -2.5,  # "Quantum"
    -3.1,  # "eels"
    -4.2,  # "photosynthetically"
    -3.8,  # "disambiguate"
    -4.5,  # "retrocognition"
    -3.2,  # "vestigially"
    -2.9,  # "."
]
ppl_surprising = compute_perplexity_from_log_probs(
    surprising_log_probs, len(surprising_log_probs)
)
print(f"Perplexity (surprising text): {ppl_surprising:.2f}")  # much higher
```

A model that guesses uniformly at random over a vocabulary of V tokens has perplexity exactly V: it is equally surprised by every token. A perfect model that always assigns probability 1 to the next token has a perplexity of 1.
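The two boundary cases can be checked numerically. The following is a minimal sketch; the vocabulary size and sequence length are arbitrary values chosen only for illustration, and the `perplexity` helper is a hypothetical stand-in for the function defined earlier.

```python
import math


def perplexity(log_probs: list[float]) -> float:
    """Perplexity = exp of the mean negative log-probability per token."""
    return math.exp(-sum(log_probs) / len(log_probs))


vocab_size = 50  # hypothetical vocabulary size, for illustration only
n_tokens = 20    # hypothetical sequence length

# Uniform model: every token gets probability 1/V regardless of context.
uniform_log_probs = [math.log(1.0 / vocab_size)] * n_tokens
ppl_uniform = perplexity(uniform_log_probs)

# Perfect model: the correct next token always gets probability 1.
perfect_log_probs = [math.log(1.0)] * n_tokens
ppl_perfect = perplexity(perfect_log_probs)

print(f"Uniform model perplexity: {ppl_uniform:.2f}")
print(f"Perfect model perplexity: {ppl_perfect:.2f}")
```

This is why perplexity is sometimes read as an effective branching factor: a perplexity of 50 behaves like choosing uniformly among 50 options at every step.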
Computing Perplexity with Hugging Face Transformers
In practice, you extract log probabilities from a model and aggregate them across a text.
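For texts longer than the context window, the aggregation is usually done over overlapping windows, where each window scores only the tokens that no earlier window has scored. The sketch below isolates that window arithmetic with no model involved; `strided_windows` is a hypothetical helper name and the sizes are made up for illustration.

```python
def strided_windows(
    seq_len: int, max_length: int, stride: int
) -> list[tuple[int, int, int]]:
    """Return (begin, end, n_scored) for each evaluation window.

    n_scored is the number of tokens at the end of the window that are newly
    scored; the tokens before them in the window serve only as context.
    """
    windows = []
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        windows.append((begin, end, end - prev_end))
        prev_end = end
        if end == seq_len:
            break
    return windows


# Hypothetical sizes, chosen only to keep the output short.
windows = strided_windows(seq_len=10, max_length=4, stride=2)
for begin, end, n_scored in windows:
    print(f"tokens [{begin}:{end}) -> score the last {n_scored}")
```

Every token is scored exactly once, and a smaller stride gives each scored token more preceding context at the cost of more forward passes.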
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def compute_perplexity(
    text: str,
    model_name: str = "gpt2",
    device: str = "cpu",
    stride: int = 512,
) -> float:
    """
    Compute perplexity of text under a causal language model.

    Uses strided evaluation to handle texts longer than the model's context window.

    Args:
        text: The text to evaluate
        model_name: HuggingFace model identifier
        device: "cpu" or "cuda"
        stride: Step between window starts; a stride smaller than the
            context window gives each scored token more preceding context

    Returns:
        Perplexity (lower = model finds text more natural)
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model = model.to(device)
    model.eval()

    encodings = tokenizer(text, return_tensors="pt")
    input_ids = encodings.input_ids.to(device)

    # Context window; n_positions is the GPT-2 config name for it
    max_length = model.config.n_positions
    seq_len = input_ids.size(1)

    nlls = []
    prev_end = 0
    for begin_loc in range(0, seq_len, stride):
        end_loc = min(begin_loc + max_length, seq_len)
        trg_len = end_loc - prev_end  # tokens not yet scored by an earlier window
        input_ids_chunk = input_ids[:, begin_loc:end_loc]
        target_ids = input_ids_chunk.clone()
        # Mask prefix tokens that are repeated from previous window
        target_ids[:, :-trg_len] = -100

        with torch.no_grad():
            outputs = model(input_ids_chunk, labels=target_ids)
            # outputs.loss is the mean NLL over the unmasked targets
            neg_log_likelihood = outputs.loss

        # Rescale the mean back to an approximate sum for this window
        nlls.append(neg_log_likelihood * trg_len)
        prev_end = end_loc
        if end_loc == seq_len:
            break

    # Dividing by seq_len is a slight approximation: the model shifts labels
    # internally, so the very first token is never actually scored
    total_nll = torch.stack(nlls).sum()
    avg_nll = total_nll / seq_len
    perplexity = torch.exp(avg_nll)
    return perplexity.item()


# Usage
text_fluent = "The patient presented with acute chest pain and shortness of breath, consistent with possible myocardial infarction."
text_garbled = "Chest the acute patient pain with presented and breath myocardial shortness possible infarction."

ppl_fluent = compute_perplexity(text_fluent, model_name="gpt2")
ppl_garbled = compute_perplexity(text_garbled, model_name="gpt2")
print(f"Fluent text perplexity: {ppl_fluent:.1f}")
print(f"Garbled text perplexity: {ppl_garbled:.1f}")
# Fluent text will have significantly lower perplexity
```

Batch Evaluation for Speed
Computing perplexity one example at a time is slow. Batch it:
```python
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer


class TextDataset(Dataset):
    def __init__(self, texts: list[str], tokenizer, max_length: int = 512):
        self.encodings = tokenizer(
            texts,
            truncation=True,
            padding=True,
            max_length=max_length,
            return_tensors="pt",
        )

    def __len__(self):
        return len(self.encodings["input_ids"])

    def __getitem__(self, idx):
        return {k: v[idx] for k, v in self.encodings.items()}


def batch_perplexity(
    texts: list[str],
    model_name: str = "gpt2",
    batch_size: int = 8,
    device: str = "cpu",
) -> list[float]:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # GPT-2 ships without a pad token, so reuse EOS for padding
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
    model.eval()

    dataset = TextDataset(texts, tokenizer)
    loader = DataLoader(dataset, batch_size=batch_size)

    per_example_ppl = []
    for batch in loader:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)

        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)

        # Compute per-example loss from logits:
        # shift so that position i predicts token i + 1
        shift_logits = outputs.logits[..., :-1, :].contiguous()
        shift_labels = input_ids[..., 1:].contiguous()
        loss_fn = torch.nn.CrossEntropyLoss(reduction="none")
        loss = loss_fn(
            shift_logits.view(-1, shift_logits.size(-1)),
            shift_labels.view(-1),
        )

        # Reshape to (batch_size, seq_len - 1) and mask padding.
        # Use the attention mask rather than comparing to pad_token_id:
        # pad and EOS share an id here, so real EOS tokens would be dropped too.
        loss = loss.view(input_ids.size(0), -1)
        mask = attention_mask[..., 1:].float()
        per_token_loss = (loss * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
        ppl_batch = torch.exp(per_token_loss).tolist()
        per_example_ppl.extend(ppl_batch)

    return per_example_ppl
```

Using Perplexity to Compare Models
The canonical use of perplexity is comparing two models on the same held-out test set. Lower perplexity means the model better predicts the text distribution.
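One subtlety when summarizing a test set: the mean of per-document perplexities is generally not the same number as the corpus-level perplexity obtained by pooling token log-likelihoods across documents. The sketch below uses made-up totals (not measurements from any model) to show the gap.

```python
import math

# Hypothetical (total NLL in nats, token count) per document; values are made up.
docs = [(20.0, 10), (300.0, 100)]

# Option 1: perplexity per document, then an unweighted mean.
per_doc_ppl = [math.exp(nll / n) for nll, n in docs]
mean_of_ppls = sum(per_doc_ppl) / len(per_doc_ppl)

# Option 2: pool tokens across documents, then exponentiate once.
total_nll = sum(nll for nll, _ in docs)
total_tokens = sum(n for _, n in docs)
corpus_ppl = math.exp(total_nll / total_tokens)

print(f"Mean of per-document perplexities: {mean_of_ppls:.2f}")
print(f"Pooled corpus perplexity: {corpus_ppl:.2f}")
```

Either summary can be reasonable, provided the same aggregation is applied to every model being compared and the choice is reported alongside the numbers.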
```python
import statistics


def compare_models_on_corpus(
    test_texts: list[str],
    model_names: list[str],
    device: str = "cpu",
) -> dict:
    results = {}
    for model_name in model_names:
        ppls = batch_perplexity(test_texts, model_name=model_name, device=device)
        results[model_name] = {
            "mean_ppl": round(sum(ppls) / len(ppls), 2),
            # statistics.median averages the two middle values for even counts
            "median_ppl": round(statistics.median(ppls), 2),
            "n_examples": len(ppls),
        }

    # Rank models by mean perplexity (lower is better)
    ranked = sorted(results.items(), key=lambda x: x[1]["mean_ppl"])
    print("Model comparison (lower perplexity = better):")
    for rank, (name, stats) in enumerate(ranked, 1):
        print(f"  {rank}. {name}: PPL={stats['mean_ppl']}")
    return results


# Example: compare a base model vs fine-tuned model on medical text
# compare_models_on_corpus(
#     test_texts=medical_eval_texts,
#     model_names=["gpt2", "microsoft/biogpt"],
# )
```

Domain Mismatch: Why Perplexity Can Mislead
Perplexity is computed relative to a specific model's training distribution. A model trained on web text will have high perplexity on legal text — not because the legal text is low quality, but because the model hasn't seen that domain.
```python
# Illustrative example of domain mismatch
domain_examples = {
    "everyday_english": [
        "The weather today is warm and sunny.",
        "I had coffee and toast for breakfast.",
        "The kids played in the park after school.",
    ],
    "medical_terminology": [
        "The patient was diagnosed with pneumococcal bacteremia.",
        "Echocardiography revealed severe mitral regurgitation.",
        "Histopathological analysis confirmed adenocarcinoma.",
    ],
    "legal_text": [
        "The party of the first part hereinafter referred to as the Licensor.",
        "Notwithstanding any provision to the contrary contained herein.",
        "The indemnified party shall be held harmless from all claims.",
    ],
}

# A general-purpose model will show:
# - Low perplexity on everyday_english (in-domain)
# - Higher perplexity on medical_terminology (out-of-domain vocabulary)
# - High perplexity on legal_text (dense, archaic phrasing)
# This does NOT mean medical/legal text is worse, just unfamiliar to the model
print("Domain mismatch warning: always evaluate on in-domain text")
print("Compare models trained on the same or similar domains")
```

Limitations of Perplexity
1. Good perplexity does not mean good answers
Perplexity scores how probable a text is under the model, not whether the text is true. A fluent but factually wrong sentence can score well: "The capital of France is Berlin" is grammatically perfect and might have low perplexity under a poorly trained model.
2. Cannot compare across models with different tokenizers
GPT-4's perplexity is not comparable to Llama 3's perplexity because they use different tokenizers (different vocabulary, different token boundaries).
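The effect of token boundaries can be seen with plain arithmetic. In this sketch, two hypothetical models assign the same total log-probability to the same text, but their tokenizers split it into different numbers of tokens; all numbers are made up for illustration.

```python
import math

# Hypothetical: both models assign the SAME total log-probability to a text.
total_log_prob = -60.0  # nats, made-up value
tokens_model_a = 20     # coarser tokenizer (hypothetical token count)
tokens_model_b = 30     # finer tokenizer (hypothetical token count)

# Per-token perplexity divides by each tokenizer's own token count.
ppl_a = math.exp(-total_log_prob / tokens_model_a)
ppl_b = math.exp(-total_log_prob / tokens_model_b)

print(f"Model A per-token perplexity: {ppl_a:.2f}")
print(f"Model B per-token perplexity: {ppl_b:.2f}")
```

Model B reports the lower perplexity even though both models find the text equally likely overall, which is why per-token scores from different tokenizers should not be ranked against each other.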
```python
# Why you cannot compare perplexity across different tokenizers
# This code illustrates the problem conceptually
tokenizers = {
    "gpt2": "50,257 tokens, byte-level BPE",
    "llama3": "128,256 tokens, different byte-pair encoding",
    "claude": "Different tokenization entirely",
}

# Each tokenizer splits text differently
# The "probability of a token" is therefore a different thing
# Perplexity scores are only comparable within the same tokenizer family


def warn_cross_model_comparison(model_a: str, model_b: str) -> None:
    print(f"WARNING: Comparing perplexity of {model_a} vs {model_b}")
    print("These models likely use different tokenizers.")
    print("Perplexity comparison is only valid within the same tokenizer.")
```

3. Not useful for instruction-following quality
Perplexity measures how well the model predicts the next token. It says nothing about whether the model follows instructions, answers questions correctly, or produces safe outputs.
When to Use Perplexity
| Use Case | Appropriate? |
|----------|-------------|
| Comparing two versions of the same model architecture | Yes |
| Measuring how well a fine-tuned model fits the target domain | Yes |
| Detecting distribution shift in incoming requests | Yes |
| Measuring hallucination rate | No |
| Evaluating instruction following | No |
| Comparing GPT-4 vs Llama 3 | No (different tokenizers) |
| Measuring generation quality for end users | No |
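For the distribution-shift row, one possible approach is to compare each incoming request's perplexity against a baseline measured on known in-domain text. The sketch below is an assumption about how this could be done, not a prescribed method: `fit_baseline`, `is_shifted`, the z-score threshold, and all perplexity values are hypothetical.

```python
import math
import statistics


def fit_baseline(baseline_ppls: list[float]) -> tuple[float, float]:
    """Mean and standard deviation of log-perplexity on known in-domain text."""
    logs = [math.log(p) for p in baseline_ppls]
    return statistics.mean(logs), statistics.stdev(logs)


def is_shifted(
    ppl: float, mean_log: float, std_log: float, z_threshold: float = 3.0
) -> bool:
    """Flag a request whose log-perplexity sits far above the baseline."""
    z = (math.log(ppl) - mean_log) / std_log
    return z > z_threshold


# Made-up perplexities standing in for values measured on in-domain traffic.
baseline = [18.0, 22.0, 20.0, 25.0, 19.0, 21.0, 23.0, 20.0]
mean_log, std_log = fit_baseline(baseline)

flag_typical = is_shifted(21.0, mean_log, std_log)
flag_unusual = is_shifted(400.0, mean_log, std_log)
print(f"Typical request flagged: {flag_typical}")
print(f"Unusual request flagged: {flag_unusual}")
```

Working in log-perplexity keeps a few very high values from dominating the baseline statistics; the threshold itself would need tuning on real traffic.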
Key Takeaways
- Perplexity = how surprised the model is by a text. Lower is better, relative to a baseline.
- Formula: exp(-1/N times the sum of log probabilities of each token given its context).
- Compute it with Hugging Face transformers using strided evaluation for long texts.
- Only compare perplexity across models with the same tokenizer.
- Good perplexity does not imply factual accuracy, helpfulness, or safety.
- Best use: comparing fine-tuned vs base model on the target domain, or detecting distribution shift.
What's Next
In eval-bleu.mdx, you will learn about BLEU score — one of the oldest and most widely cited text generation metrics, its strengths, and its significant weaknesses.