Perplexity as a Language Model Metric

Perplexity is one of the oldest language model metrics. It predates the transformer era and is still widely used today — but often misapplied. This lesson explains what it actually measures, how to compute it, and when you should (and should not) use it.

What Perplexity Measures

A language model assigns a probability to every sequence of tokens. Intuitively, a well-trained model should assign high probability to text that sounds natural and low probability to gibberish.

Perplexity measures how "surprised" the model is by a piece of text. Low perplexity means the model found the text unsurprising — consistent with what it has seen during training. High perplexity means the model was not expecting this text.

Formally, for a sequence of N tokens t1, t2, ..., tN, where the model assigns token ti the conditional probability p(ti | t1...t_{i-1}):

Perplexity = exp( -1/N * sum_i( log p(ti | t1...t_{i-1}) ) )

This is the exponential of the average negative log-likelihood per token (using the natural logarithm).
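An equivalent reading of the formula is that perplexity is the inverse geometric mean of the per-token probabilities. The sketch below checks this on made-up probabilities; the function names are illustrative and not part of any library.

```python
import math

def perplexity_from_log_probs(log_probs):
    """exp of the average negative log-probability (natural log)."""
    return math.exp(-sum(log_probs) / len(log_probs))

def perplexity_from_probs(probs):
    """Inverse geometric mean of the per-token probabilities."""
    return math.prod(probs) ** (-1.0 / len(probs))

# Made-up conditional probabilities for a four-token sequence
probs = [0.5, 0.25, 0.5, 0.25]
log_probs = [math.log(p) for p in probs]

ppl_a = perplexity_from_log_probs(log_probs)
ppl_b = perplexity_from_probs(probs)
print(ppl_a, ppl_b)  # the two forms agree up to floating-point error
```

The product form underflows quickly on long sequences, which is one reason implementations usually work with summed log probabilities instead.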

Python

import math

def compute_perplexity_from_log_probs(
    log_probs: list[float],
    n_tokens: int,
) -> float:
    """
    Compute perplexity from per-token log probabilities.
    
    Args:
        log_probs: List of log P(token | context) for each token
        n_tokens: Total number of tokens
    
    Returns:
        Perplexity score (lower is better)
    """
    avg_neg_log_prob = -sum(log_probs) / n_tokens
    return math.exp(avg_neg_log_prob)


# Example: model assigns probabilities to each token in a sentence
example_log_probs = [
    -0.1,   # "The"    — very common, model is confident
    -0.3,   # "dog"    — common after "The"
    -0.4,   # "sat"    — plausible after "The dog"
    -0.2,   # "on"     — highly likely after "sat"
    -0.1,   # "the"    — nearly certain after "sat on"
    -0.6,   # "mat"    — less common than "floor" etc.
    -0.05,  # "."      — almost certain at end of sentence
]

ppl = compute_perplexity_from_log_probs(example_log_probs, len(example_log_probs))
print(f"Perplexity: {ppl:.2f}")  # roughly 1.3 — very low, model is not surprised

# Compare: model is surprised by the text
surprising_log_probs = [
    -2.5,   # "Quantum"
    -3.1,   # "eels"
    -4.2,   # "photosynthetically"
    -3.8,   # "disambiguate"
    -4.5,   # "retrocognition"
    -3.2,   # "vestigially"
    -2.9,   # "."
]

ppl_surprising = compute_perplexity_from_log_probs(
    surprising_log_probs, len(surprising_log_probs)
)
print(f"Perplexity (surprising text): {ppl_surprising:.2f}")  # roughly 32, far higher

A model that assigns uniform probability 1/V to every token in a vocabulary of size V has perplexity exactly V: it is equally surprised by every token. A perfect model that always assigns probability 1 to the token that comes next has a perplexity of 1.
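Both boundary cases can be checked directly from the formula. This is a minimal sketch; the vocabulary size and sequence length are arbitrary values chosen for illustration.

```python
import math

def perplexity(log_probs):
    """exp of the average negative log-probability."""
    return math.exp(-sum(log_probs) / len(log_probs))

V = 50          # hypothetical vocabulary size
n_tokens = 12   # hypothetical sequence length

# Uniform model: every token gets probability 1/V regardless of context
uniform_ppl = perplexity([math.log(1.0 / V)] * n_tokens)

# Perfect model: the observed token always gets probability 1
perfect_ppl = perplexity([math.log(1.0)] * n_tokens)

print(uniform_ppl, perfect_ppl)
```

This is why perplexity is sometimes read as an effective branching factor: a perplexity of k means the model is, on average, as uncertain as if it were choosing uniformly among k tokens.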

Computing Perplexity with Hugging Face Transformers

In practice, you extract log probabilities from a model and aggregate them across a text.

Python

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

def compute_perplexity(
    text: str,
    model_name: str = "gpt2",
    device: str = "cpu",
    stride: int = 512,
) -> float:
    """
    Compute perplexity of text under a causal language model.
    
    Uses strided evaluation to handle texts longer than the model's context window.
    
    Args:
        text: The text to evaluate
        model_name: HuggingFace model identifier
        device: "cpu" or "cuda"
        stride: Step between window starts; a stride below the context length lets each scored token see earlier context
    
    Returns:
        Perplexity (lower = model finds text more natural)
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model = model.to(device)
    model.eval()
    
    encodings = tokenizer(text, return_tensors="pt")
    input_ids = encodings.input_ids.to(device)
    
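    # Context window: most configs expose max_position_embeddings; GPT-2 also uses n_positions
    max_length = getattr(model.config, "max_position_embeddings", None) or model.config.n_positions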
    seq_len = input_ids.size(1)
    
    nlls = []
    n_scored = 0  # number of tokens that actually contribute to the loss
    prev_end = 0
    
    for begin_loc in range(0, seq_len, stride):
        end_loc = min(begin_loc + max_length, seq_len)
        trg_len = end_loc - prev_end  # tokens not yet scored by an earlier window
        input_ids_chunk = input_ids[:, begin_loc:end_loc]
        
        target_ids = input_ids_chunk.clone()
        # Mask context tokens that were already scored in a previous window
        target_ids[:, :-trg_len] = -100
        
        with torch.no_grad():
            outputs = model(input_ids_chunk, labels=target_ids)
            neg_log_likelihood = outputs.loss  # mean NLL over the scored tokens
        
        # Labels are shifted inside the model, so position 0 is never scored
        n_window = int((target_ids[:, 1:] != -100).sum())
        nlls.append(neg_log_likelihood * n_window)
        n_scored += n_window
        prev_end = end_loc
        
        if end_loc == seq_len:
            break
    
    total_nll = torch.stack(nlls).sum()
    avg_nll = total_nll / n_scored
    perplexity = torch.exp(avg_nll)
    
    return perplexity.item()


# Usage
text_fluent = "The patient presented with acute chest pain and shortness of breath, consistent with possible myocardial infarction."
text_garbled = "Chest the acute patient pain with presented and breath myocardial shortness possible infarction."

ppl_fluent = compute_perplexity(text_fluent, model_name="gpt2")
ppl_garbled = compute_perplexity(text_garbled, model_name="gpt2")

print(f"Fluent text perplexity: {ppl_fluent:.1f}")
print(f"Garbled text perplexity: {ppl_garbled:.1f}")
# Fluent text will have significantly lower perplexity
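To see what the strided loop in compute_perplexity is doing, it can help to print the window boundaries without loading a model. The helper below is an illustrative sketch (strided_windows is not a library function) that reproduces the same begin/end arithmetic on a toy sequence.

```python
def strided_windows(seq_len, max_length, stride):
    """Yield (begin, end, n_new) for each evaluation window.

    n_new is the number of tokens at the end of the window that have not
    been scored by an earlier window; the rest serve only as context.
    """
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        yield begin, end, end - prev_end
        prev_end = end
        if end == seq_len:
            break

# Toy sizes chosen for readability, not realistic model settings
windows = list(strided_windows(seq_len=10, max_length=4, stride=2))
for begin, end, n_new in windows:
    context = end - begin - n_new
    print(f"tokens [{begin}:{end}) -> score last {n_new}, context only: {context}")
```

Every token is scored exactly once, and a smaller stride gives each scored token more preceding context at the cost of more forward passes.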

Batch Evaluation for Speed

Computing perplexity one example at a time is slow. Batch it:

Python

import torch
from torch.utils.data import DataLoader, Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM

class TextDataset(Dataset):
    def __init__(self, texts: list[str], tokenizer, max_length: int = 512):
        self.encodings = tokenizer(
            texts,
            truncation=True,
            padding=True,
            max_length=max_length,
            return_tensors="pt",
        )
    
    def __len__(self):
        return len(self.encodings["input_ids"])
    
    def __getitem__(self, idx):
        return {k: v[idx] for k, v in self.encodings.items()}


def batch_perplexity(
    texts: list[str],
    model_name: str = "gpt2",
    batch_size: int = 8,
    device: str = "cpu",
) -> list[float]:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
    model.eval()
    
    dataset = TextDataset(texts, tokenizer)
    loader = DataLoader(dataset, batch_size=batch_size)
    
    per_example_ppl = []
    
    for batch in loader:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        
        with torch.no_grad():
            # No labels are passed: the per-example loss is computed from the logits below
            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask,
            )
        
        # Compute per-example loss from logits
        shift_logits = outputs.logits[..., :-1, :].contiguous()
        shift_labels = input_ids[..., 1:].contiguous()
        
        loss_fn = torch.nn.CrossEntropyLoss(reduction="none")
        loss = loss_fn(
            shift_logits.view(-1, shift_logits.size(-1)),
            shift_labels.view(-1),
        )
        
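        # Reshape to (batch_size, seq_len - 1) and mask padding. The attention mask
        # is used rather than comparing against pad_token_id, because the pad token
        # is set to EOS above and real EOS tokens would otherwise be masked too.
        loss = loss.view(input_ids.size(0), -1)
        mask = attention_mask[..., 1:].float()
        per_token_loss = (loss * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)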
        
        ppl_batch = torch.exp(per_token_loss).tolist()
        per_example_ppl.extend(ppl_batch)
    
    return per_example_ppl
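The key step in the batched version is excluding padding positions from each example's average. A minimal pure-Python sketch of that masking, with invented loss values and a hypothetical helper name, might look like this:

```python
import math

def masked_perplexities(token_losses, attention_mask):
    """Per-example perplexity from padded per-token losses.

    token_losses[i][j] is the negative log-likelihood of token j in example i;
    attention_mask[i][j] is 1 for real tokens and 0 for padding.
    """
    results = []
    for losses, mask in zip(token_losses, attention_mask):
        n_real = sum(mask)
        total = sum(loss * m for loss, m in zip(losses, mask))
        results.append(math.exp(total / max(n_real, 1)))
    return results

# Two examples padded to length 4; all values are invented for illustration
token_losses = [
    [1.0, 2.0, 3.0, 2.0],   # four real tokens
    [1.0, 3.0, 9.9, 9.9],   # two real tokens, two padding positions
]
attention_mask = [
    [1, 1, 1, 1],
    [1, 1, 0, 0],
]

ppls = masked_perplexities(token_losses, attention_mask)
print(ppls)
```

Both examples average 2.0 nats per real token, so they receive the same perplexity; without the mask, the arbitrary losses at the padding positions would inflate the second score.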

Using Perplexity to Compare Models

The canonical use of perplexity is comparing two models that share a tokenizer on the same held-out test set. Lower perplexity means the model better predicts the text distribution.

Python

def compare_models_on_corpus(
    test_texts: list[str],
    model_names: list[str],
    device: str = "cpu",
) -> dict:
    results = {}
    
    for model_name in model_names:
        ppls = batch_perplexity(test_texts, model_name=model_name, device=device)
        results[model_name] = {
            "mean_ppl": round(sum(ppls) / len(ppls), 2),
            "median_ppl": round(sorted(ppls)[len(ppls) // 2], 2),
            "n_examples": len(ppls),
        }
    
    # Rank models by mean perplexity (lower is better)
    ranked = sorted(results.items(), key=lambda x: x[1]["mean_ppl"])
    
    print("Model comparison (lower perplexity = better):")
    for rank, (name, stats) in enumerate(ranked, 1):
        print(f"  {rank}. {name}: PPL={stats['mean_ppl']}")
    
    return results


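# Example: compare two checkpoints that share a tokenizer on medical text.
# (A pairing such as "gpt2" vs "microsoft/biogpt" would not be comparable
# token-for-token, because the two models use different tokenizers.)
# compare_models_on_corpus(
#     test_texts=medical_eval_texts,
#     model_names=["gpt2", "gpt2-medium"],
# )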
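Note that averaging per-example perplexities, as compare_models_on_corpus does, is not the same as the corpus-level perplexity usually reported in benchmarks, which weights every token equally. The sketch below contrasts the two on invented numbers; the function names are illustrative.

```python
import math

def corpus_perplexity(total_nlls, token_counts):
    """Token-weighted perplexity over a whole corpus."""
    return math.exp(sum(total_nlls) / sum(token_counts))

def mean_example_perplexity(total_nlls, token_counts):
    """Unweighted mean of per-example perplexities."""
    ppls = [math.exp(nll / n) for nll, n in zip(total_nlls, token_counts)]
    return sum(ppls) / len(ppls)

# Invented numbers: summed NLL (in nats) and token count for two documents
total_nlls = [4.0, 300.0]
token_counts = [2, 100]

corpus_ppl = corpus_perplexity(total_nlls, token_counts)
mean_ppl = mean_example_perplexity(total_nlls, token_counts)
print(f"corpus-level: {corpus_ppl:.2f}, mean of examples: {mean_ppl:.2f}")
```

Here the short document pulls the unweighted mean well below the corpus-level figure. Either statistic can be reasonable, as long as the same one is used for every model being compared.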

Domain Mismatch: Why Perplexity Can Mislead

Perplexity is always computed relative to a specific model, and that model's predictions reflect its training distribution. A model trained on web text will have high perplexity on legal text, not because the legal text is low quality, but because the model has seen little of that domain.

Python

# Illustrative example of domain mismatch
domain_examples = {
    "everyday_english": [
        "The weather today is warm and sunny.",
        "I had coffee and toast for breakfast.",
        "The kids played in the park after school.",
    ],
    "medical_terminology": [
        "The patient was diagnosed with pneumococcal bacteremia.",
        "Echocardiography revealed severe mitral regurgitation.",
        "Histopathological analysis confirmed adenocarcinoma.",
    ],
    "legal_text": [
        "The party of the first part hereinafter referred to as the Licensor.",
        "Notwithstanding any provision to the contrary contained herein.",
        "The indemnified party shall be held harmless from all claims.",
    ],
}

# A general-purpose model will show:
# - Low perplexity on everyday_english (in-domain)
# - Higher perplexity on medical_terminology (out-of-domain vocabulary)
# - High perplexity on legal_text (dense, archaic phrasing)

# This does NOT mean medical/legal text is worse — just unfamiliar to the model
print("Domain mismatch warning: always evaluate on in-domain text")
print("Compare models trained on the same or similar domains")

Limitations of Perplexity

1. Good perplexity does not mean good answers

A model can assign low perplexity to fluent but factually wrong text. "The capital of France is Berlin" is grammatically perfect and might have low perplexity under a poorly trained model.

2. Cannot compare across models with different tokenizers

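GPT-4's perplexity is not comparable to Llama 3's perplexity because they use different tokenizers (different vocabularies and different token boundaries). The same text is split into a different number of tokens, so the per-token average in the formula is taken over different units.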

Python

# Why you cannot compare perplexity across different tokenizers
# This code illustrates the problem conceptually

tokenizers = {
    "gpt2": "50,257 tokens, byte-level byte-pair encoding",
    "llama3": "roughly 128,000 tokens, a different byte-pair encoding vocabulary",
    "claude": "a different tokenizer again",
}

# Each tokenizer splits text differently
# The "probability of a token" is therefore a different thing
# Per-token perplexity scores are only comparable between models that share a tokenizer

def warn_cross_model_comparison(model_a: str, model_b: str) -> None:
    print(f"WARNING: Comparing perplexity of {model_a} vs {model_b}")
    print("These models likely use different tokenizers.")
    print("Perplexity comparison is only valid within the same tokenizer.")
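One common workaround, sketched here under invented numbers, is to normalize the total negative log-likelihood by something that does not depend on the tokenizer, such as the number of UTF-8 bytes in the text (bits per byte). The helper names below are hypothetical.

```python
import math

def bits_per_byte(total_nll_nats, text):
    """Convert a summed negative log-likelihood (in nats) to bits per UTF-8 byte."""
    n_bytes = len(text.encode("utf-8"))
    return total_nll_nats / (n_bytes * math.log(2))

def token_perplexity(total_nll_nats, n_tokens):
    """Ordinary per-token perplexity."""
    return math.exp(total_nll_nats / n_tokens)

text = "Echocardiography revealed severe mitral regurgitation."

# Invented scenario: two hypothetical models assign the SAME total likelihood
# to the text but split it into different numbers of tokens.
total_nll = 40.0
ppl_a = token_perplexity(total_nll, n_tokens=8)    # coarse tokenizer
ppl_b = token_perplexity(total_nll, n_tokens=16)   # fine-grained tokenizer

bpb_a = bits_per_byte(total_nll, text)
bpb_b = bits_per_byte(total_nll, text)

print(f"per-token perplexity: {ppl_a:.1f} vs {ppl_b:.1f}")
print(f"bits per byte:        {bpb_a:.3f} vs {bpb_b:.3f}")
```

The per-token perplexities differ widely even though both models explain the text equally well, while the byte-normalized figure is identical. This is an assumption-laden illustration, not a replacement for evaluating on a shared tokenizer where that is possible.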

3. Not useful for instruction-following quality

Perplexity measures how well the model predicts the next token. It says nothing about whether the model follows instructions, answers questions correctly, or produces safe outputs.

When to Use Perplexity

| Use Case | Appropriate? |
|----------|-------------|
| Comparing two versions of the same model architecture | Yes |
| Measuring how well a fine-tuned model fits the target domain | Yes |
| Detecting distribution shift in incoming requests | Yes |
| Measuring hallucination rate | No |
| Evaluating instruction following | No |
| Comparing GPT-4 vs Llama 3 | No (different tokenizers) |
| Measuring generation quality for end users | No |
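As a rough illustration of the distribution-shift use case, one simple approach is to record perplexities on known in-distribution requests and flag new requests that land far above that baseline. The threshold, helper names, and numbers below are assumptions made for the sketch, not recommended settings.

```python
import statistics

def fit_baseline(perplexities):
    """Summarize perplexities measured on known in-distribution requests."""
    return statistics.mean(perplexities), statistics.stdev(perplexities)

def is_shifted(ppl, baseline, z_threshold=3.0):
    """Flag a request whose perplexity is far above the baseline mean."""
    mean, std = baseline
    return (ppl - mean) / std > z_threshold

# Invented perplexities for illustration only
baseline = fit_baseline([18.0, 22.0, 20.0, 19.0, 21.0])

typical_flag = is_shifted(21.5, baseline)   # close to the baseline
unusual_flag = is_shifted(95.0, baseline)   # far above the baseline
print(typical_flag, unusual_flag)
```

Perplexity distributions tend to be heavy-tailed, so in practice it may work better to apply the same check to log-perplexity (the average negative log-likelihood) or to use percentile thresholds.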

Key Takeaways

Perplexity = how surprised the model is by a text. Lower is better, relative to a baseline.
Formula: exp(-1/N times the sum of log probabilities of each token given its context).
Compute it with Hugging Face transformers using strided evaluation for long texts.
Only compare perplexity across models with the same tokenizer.
Good perplexity does not imply factual accuracy, helpfulness, or safety.
Best use: comparing fine-tuned vs base model on the target domain, or detecting distribution shift.

What's Next

In eval-bleu.mdx, you will learn about BLEU score — one of the oldest and most widely cited text generation metrics, its strengths, and its significant weaknesses.
