LLM Benchmarks: What They Measure and What They Don't
Deep dive into LLM benchmarks: MMLU, HumanEval, GSM8K, HellaSwag, MATH, and more. How to interpret benchmark scores, their limitations, and how to build your own evaluations.
Why Benchmarks Are Hard
Every benchmark has a fundamental problem: once a benchmark is widely known, models are trained on it (intentionally or through data contamination), and scores inflate without reflecting real capability improvement. This is Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure."
Benchmark lifecycle:
- Benchmark is created to measure capability X
- Models improve on it through training on the benchmark distribution
- A model scores 95% — the benchmark is "saturated"
- Researchers create a harder benchmark to measure capability X
Understanding what each benchmark actually tests — and where it fails — is essential for interpreting LLM comparisons.
MMLU: Multitask Language Understanding
What it tests: Multiple-choice questions across 57 subjects (anatomy, law, physics, history, etc.)
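# MMLU example structure
MMLU_EXAMPLE = {
    "subject": "clinical_pharmacology",
    "question": "Warfarin's anticoagulant effect is primarily due to inhibition of:",
    "choices": [
        "A. Thrombin directly",
        "B. Vitamin K-dependent clotting factors",
        "C. Platelet aggregation",
        "D. Fibrinogen synthesis",
    ],
    "answer": "B",
}

# Common evaluation approach: 5-shot prompting
def format_item(ex: dict, with_answer: bool) -> str:
    """Render the question, its lettered choices, and optionally the answer."""
    choices = "\n".join(ex["choices"])
    answer = f" {ex['answer']}" if with_answer else ""
    return f"Q: {ex['question']}\n{choices}\nA:{answer}"

def evaluate_mmlu(model, examples: list[dict], n_shot: int = 5) -> float:
    # Hold out the first n_shot items as demonstrations so that no test
    # question appears, together with its answer, inside its own prompt.
    shots, test_items = examples[:n_shot], examples[n_shot:]
    few_shot = "\n\n".join(format_item(ex, with_answer=True) for ex in shots)
    correct = 0
    for item in test_items:
        prompt = f"{few_shot}\n\n{format_item(item, with_answer=False)}"
        response = model.generate(prompt, max_tokens=1)
        predicted = response.strip()[:1].upper()  # First letter; "" if the response is empty
        correct += predicted == item["answer"]
    return correct / len(test_items) if test_items else 0.0
MMLU limitations: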
- Multiple-choice format rewards systematic guessing strategies
- Some questions have disputed correct answers
- Contamination: MMLU questions are widely available online
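- Doesn't test reasoning directly: only the final letter is scored, so a model can reach 80% through recall and elimination heuristics without understanding the material
- Measures breadth, not depth: about 14,000 test questions spread over 57 subjects (a few hundred per subject on average) can't probe deep expertise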
Typical scores: Random baseline 25%; GPT-4 86%; Claude 3.5 Sonnet 88%; Llama-3-70B 82%
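One way to probe the first limitation above (guessing strategies tied to answer position) is to re-run the evaluation with the choices shuffled and see whether accuracy moves. The sketch below is illustrative only: `shuffle_choices` is a hypothetical helper, and it assumes each choice string carries a letter prefix such as "A. ", as in the example structure above.

```python
import random

LETTERS = "ABCD"

def shuffle_choices(item: dict, rng: random.Random) -> dict:
    """Return a copy of an MMLU-style item with its choices in a new order.

    Assumes every choice string starts with a letter prefix such as "A. ".
    """
    texts = [choice.split(". ", 1)[1] for choice in item["choices"]]
    correct_text = texts[LETTERS.index(item["answer"])]
    rng.shuffle(texts)
    return {
        **item,
        "choices": [f"{LETTERS[i]}. {text}" for i, text in enumerate(texts)],
        "answer": LETTERS[texts.index(correct_text)],  # Letter follows the text
    }

item = {
    "question": "Warfarin's anticoagulant effect is primarily due to inhibition of:",
    "choices": [
        "A. Thrombin directly",
        "B. Vitamin K-dependent clotting factors",
        "C. Platelet aggregation",
        "D. Fibrinogen synthesis",
    ],
    "answer": "B",
}
shuffled = shuffle_choices(item, random.Random(0))
```

If a model's accuracy drops noticeably on shuffled items, part of its score probably came from position or letter preferences rather than from the content of the answers.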
HumanEval: Code Generation
What it tests: Write a Python function given docstring + signature; tests run automatically.
from math import comb

# HumanEval example
HUMANEVAL_EXAMPLE = {
    "task_id": "HumanEval/7",
    "prompt": """def filter_by_substring(strings: List[str], substring: str) -> List[str]:
    \"\"\" Filter an input list of strings only for ones that contain given substring
    >>> filter_by_substring([], 'a')
    []
    >>> filter_by_substring(['abc', 'bacd', 'cde', 'array'], 'a')
    ['abc', 'bacd', 'array']
    \"\"\"
""",
    "canonical_solution": "    return [x for x in strings if substring in x]\n",
    "test": """
def check(filter_by_substring):
    assert filter_by_substring([], 'a') == []
    assert filter_by_substring(['abc', 'bacd', 'cde', 'array'], 'a') == ['abc', 'bacd', 'array']
""",
    "entry_point": "filter_by_substring",
}

# pass@k metric: draw k samples; the problem counts as solved if at least 1 is correct
def pass_at_k(n_correct: int, n_samples: int, k: int) -> float:
    """
    Unbiased estimate of pass@k from n_samples >= k generations.
    n_correct: number of correct samples out of n_samples.
    """
    if n_correct == 0:
        return 0.0
    if n_correct == n_samples:
        return 1.0
    # Complement: probability that all k drawn samples are wrong
    p_all_wrong = comb(n_samples - n_correct, k) / comb(n_samples, k)
    return 1 - p_all_wrong

# pass@1 for GPT-4: ~87%
# pass@10 for GPT-4: ~95%
HumanEval limitations:
- Only 164 problems — small, high variance
- Problems are mostly algorithmic puzzles, not real-world code
- Unit tests can be gamed (hardcoding known test cases)
- Doesn't test debugging, refactoring, or large codebase navigation
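The "small, high variance" point can be made concrete with a binomial standard error. The sketch below treats each problem as an independent pass/fail trial, which is a simplifying assumption, and uses the 87% figure quoted above purely as an illustrative input. The function names are hypothetical.

```python
from math import sqrt

def accuracy_standard_error(accuracy: float, n_problems: int) -> float:
    """Binomial standard error of an accuracy measured on n_problems items."""
    return sqrt(accuracy * (1 - accuracy) / n_problems)

def approx_95_interval(accuracy: float, n_problems: int) -> tuple[float, float]:
    """Normal-approximation 95% interval, clipped to [0, 1]."""
    half_width = 1.96 * accuracy_standard_error(accuracy, n_problems)
    return max(0.0, accuracy - half_width), min(1.0, accuracy + half_width)

low, high = approx_95_interval(0.87, 164)
print(f"pass@1 of 87% on 164 problems: roughly {low:.1%} to {high:.1%}")
```

With only 164 problems the interval spans about five percentage points either side of the measured score, so two models a couple of points apart on HumanEval may not be distinguishable.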
GSM8K: Grade School Math
What it tests: Multi-step arithmetic word problems at grade-school level
GSM8K_EXAMPLE = {
    "question": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did she sell altogether in April and May?",
    "answer": "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n#### 72",
}

def evaluate_gsm8k(model, problems: list[dict]) -> dict:
    """Evaluate on GSM8K using chain-of-thought prompting."""
    COT_EXAMPLES = [
        {
            "question": "There are 15 trees in the grove...",
            "reasoning": "Let's think step by step...",
            "answer": "6",
        }
    ]
    correct = 0
    for problem in problems:
        prompt = build_cot_prompt(COT_EXAMPLES, problem["question"])
        response = model.generate(prompt)
        # Extract final answer after "####"
        if "####" in response:
            predicted = response.split("####")[-1].strip()
        else:
            predicted = extract_number(response)
        if predicted == problem["answer"].split("####")[-1].strip():
            correct += 1
    return {
        "accuracy": correct / len(problems),
        "n_problems": len(problems),
    }
GSM8K limitations:
- About 8.5K problems in total, of which only around 1.3K form the test split
- Grade-school level: tests multi-step arithmetic, not advanced mathematical reasoning
- Models can often memorize the training set patterns
- Modern models score above 90%, so it no longer discriminates between them
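`evaluate_gsm8k` above calls `build_cot_prompt` and `extract_number` without defining them. A minimal sketch of what they might look like follows. The prompt layout is an assumption, not the article's or the dataset's official format, and `answers_match` is a hypothetical extra that compares answers numerically, so that "72" and "72.0" are not counted as different.

```python
import re

def build_cot_prompt(cot_examples: list[dict], question: str) -> str:
    """Render worked examples followed by the new question (assumed layout)."""
    parts = [
        f"Q: {ex['question']}\nA: {ex['reasoning']}\n#### {ex['answer']}"
        for ex in cot_examples
    ]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

def extract_number(text: str) -> str:
    """Return the last number in the text as a string, or "" if there is none."""
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    # Drop thousands separators (and any trailing comma picked up by the pattern)
    return matches[-1].replace(",", "") if matches else ""

def answers_match(predicted: str, reference: str) -> bool:
    """Compare the last number in each string by value rather than by spelling."""
    try:
        return float(extract_number(predicted)) == float(extract_number(reference))
    except ValueError:  # One side contained no number
        return False
```

Taking the last number in the response is a common heuristic but not a safe one: a model that restates an intermediate value after its final answer will be scored on the wrong number.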
MATH: Competition Mathematics
Much harder than GSM8K — AMC, AIME, and competition problems across 7 categories:
MATH_CATEGORIES = {
    "Prealgebra": "Arithmetic, ratios, basic algebra",
    "Algebra": "Polynomials, equations, inequalities",
    "Number Theory": "Divisibility, modular arithmetic, primes",
    "Counting & Probability": "Combinatorics, expected value",
    "Geometry": "Euclidean and coordinate geometry",
    "Intermediate Algebra": "Complex numbers, sequences, logarithms",
    "Precalculus": "Trigonometry, vectors, matrices",
}
# Difficulty levels 1-5
# GPT-4 pass@1: ~69% (level 1-3 problems)
# GPT-4 pass@1: ~38% (level 4-5 problems)
# State of the art (2025): ~90%+ with verification

def extract_boxed_answer(response: str) -> str:
    """Extract the contents of the last \\boxed{...} in a model response."""
    # Answers in the MATH dataset are in \boxed{} format. Braces can nest
    # (e.g. \boxed{\frac{1}{2}}), so track depth rather than using a regex.
    marker = "\\boxed{"
    start = response.rfind(marker)
    if start == -1:
        return ""
    begin = start + len(marker)
    depth = 1
    for i in range(begin, len(response)):
        depth += {"{": 1, "}": -1}.get(response[i], 0)
        if depth == 0:
            return response[begin:i]
    return ""  # Unbalanced braces
HellaSwag and WinoGrande
HellaSwag: Commonsense reasoning. Pick the most plausible continuation of a short scene description drawn from ActivityNet video captions or WikiHow articles:
Prompt: "A woman is shown cutting and styling a man's hair. The man watches as..."
A: "...his hair falls to the floor."
B: "...the stylist hands him some food."
C: "...a child walks by."
D: "...someone opens the door."
Random baseline 25%; GPT-4: 95%; human performance: 95%. Now saturated.
WinoGrande: Pronoun resolution requiring world knowledge:
"The trophy didn't fit in the suitcase because __ was too big."
A: the trophy B: the suitcase
These tests were designed to require "real" commonsense; they're now largely solved by large models.
LMSYS Chatbot Arena: Human Preference Evaluation
Arguably the most robust of these benchmarks, because it uses blind human preference rather than automated scoring:
# Chatbot Arena approach (not reproducible programmatically — it's a live platform)
# Two models are shown side-by-side, anonymized
# Humans pick which response they prefer
# Elo rating system aggregates preferences
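To make the aggregation step concrete, here is a sketch of the textbook Elo update applied to pairwise votes. It is not the Arena's production code: the starting rating of 1000, the step size `k=32`, and the model names are illustrative assumptions.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(
    rating_a: float, rating_b: float, a_wins: bool, k: float = 32
) -> tuple[float, float]:
    """Return both ratings after one vote; k (assumed value) sets the step size."""
    delta = k * ((1.0 if a_wins else 0.0) - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta

# Hypothetical models and votes: (model A, model B, did A win?)
ratings = {"model_x": 1000.0, "model_y": 1000.0}
votes = [
    ("model_x", "model_y", True),
    ("model_x", "model_y", True),
    ("model_x", "model_y", False),
]
for a, b, a_wins in votes:
    ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], a_wins)
```

Each vote moves the two ratings by equal and opposite amounts, and an upset (the lower-rated model winning) moves them further than an expected result. Sequential updates like this depend on vote order, which is one reason a leaderboard might fit all votes jointly instead.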
# Arena characteristics:
# - Self-selected evaluators (not representative of all users)
# - Users choose their own prompts (real use cases)
# - Blind evaluation reduces brand bias (voters don't know which model wrote which response)
# - Large scale (millions of votes)
# Limitations:
# - Users tend to prefer longer, more elaborate responses (verbosity bias)
# - English-language bias (most users write in English)
# - Sycophantic responses may score well with casual users
# - Complex technical questions underrepresented vs casual chat
Building Domain-Specific Evaluations
For clinical AI, generic benchmarks are insufficient. Build targeted evals:
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class DomainEvalCase:
    question: str
    expected_answer: str
    category: str    # e.g., "drug_interaction", "dosing", "safety"
    difficulty: int  # 1-3
    reference: str   # Source (e.g., "Lexicomp 2024")

def create_clinical_eval_suite() -> list[DomainEvalCase]:
    return [
        DomainEvalCase(
            question="What is the severity of the interaction between warfarin and clarithromycin?",
            expected_answer="major",
            category="drug_interaction",
            difficulty=2,
            reference="Lexicomp Drug Interactions",
        ),
        DomainEvalCase(
            question="What dose adjustment is needed for metformin at eGFR 30 mL/min/1.73m²?",
            expected_answer="contraindicated",
            category="renal_dosing",
            difficulty=2,
            reference="KDIGO CKD Guidelines 2022",
        ),
        DomainEvalCase(
            question="Warfarin and alcohol: is this a major interaction?",
            expected_answer="moderate",
            category="drug_interaction",
            difficulty=1,
            reference="Lexicomp Drug Interactions",
        ),
    ]

def run_domain_eval(
    model_fn: Callable[[str], str],
    eval_suite: list[DomainEvalCase],
    judge_fn: Optional[Callable[[str, str], float]] = None,
) -> dict:
    """Run domain evaluation with optional LLM judge scoring."""
    results = {
        "overall": {"correct": 0, "total": 0},
        "by_category": {},
        "by_difficulty": {level: {"correct": 0, "total": 0} for level in (1, 2, 3)},
    }
    for case in eval_suite:
        response = model_fn(case.question)
        if judge_fn:
            score = judge_fn(response, case.expected_answer)
            is_correct = score >= 0.8
        else:
            # Simple string match (case-insensitive, substring). Crude: a
            # response saying "not major" still matches "major".
            is_correct = case.expected_answer.lower() in response.lower()
        # Update the overall, per-category, and per-difficulty tallies
        buckets = (
            results["overall"],
            results["by_category"].setdefault(case.category, {"correct": 0, "total": 0}),
            results["by_difficulty"].setdefault(case.difficulty, {"correct": 0, "total": 0}),
        )
        for bucket in buckets:
            bucket["total"] += 1
            bucket["correct"] += is_correct
    # Compute accuracy rates, skipping empty buckets to avoid dividing by zero
    total = results["overall"]["total"]
    results["overall"]["accuracy"] = results["overall"]["correct"] / total if total else 0.0
    for group in ("by_category", "by_difficulty"):
        for d in results[group].values():
            if d["total"]:
                d["accuracy"] = d["correct"] / d["total"]
    return results
Benchmark Contamination Detection
def check_benchmark_contamination(
    training_dataset: list[str],
    benchmark_questions: list[str],
    threshold: float = 0.8,
) -> list[dict]:
    """
    Check if benchmark questions appear in training data.
    Uses difflib's character-level similarity ratio as a simple proxy for
    n-gram overlap. It compares whole strings, so a question embedded in a
    much longer document scores low, and the pairwise scan only suits small datasets.
    """
    from difflib import SequenceMatcher
    contaminated = []
    for bq in benchmark_questions:
        for train_text in training_dataset:
            ratio = SequenceMatcher(None, bq.lower(), train_text.lower()).ratio()
            if ratio > threshold:
                contaminated.append({
                    "benchmark_question": bq,
                    "training_text": train_text[:200],
                    "similarity": ratio,
                })
                break  # One match is enough to flag this question
    contamination_rate = len(contaminated) / len(benchmark_questions) if benchmark_questions else 0.0
    print(f"Contamination rate: {contamination_rate:.1%}")
    return contaminated
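The docstring above mentions n-gram overlap, while the function itself relies on `SequenceMatcher`. For comparison, a word-level n-gram check might look like the sketch below. The function names, the regex tokenizer, and the default `n=8` are assumptions chosen for illustration, not a standard.

```python
import re

def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Set of lowercase word n-grams in the text (punctuation ignored)."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def ngram_overlap(question: str, document: str, n: int = 8) -> float:
    """Fraction of the question's n-grams that also occur in the document."""
    question_grams = word_ngrams(question, n)
    if not question_grams:
        return 0.0  # Question is shorter than n words
    return len(question_grams & word_ngrams(document, n)) / len(question_grams)

question = "Natalia sold clips to 48 of her friends in April"
document = "Forum post: natalia sold clips to 48 of her friends in April, and then more in May."
overlap = ngram_overlap(question, document, n=5)
```

Unlike a whole-string similarity ratio, this score stays high when the question is quoted inside a much longer document, which is the usual shape of contamination in web-scraped training data.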
# Models with high contamination rates on a benchmark:
# - Score reflects memorization, not generalization
# - Solution: use held-out, newly created benchmarks
# - LiveBench (updated monthly), MMLU-Pro (harder version) attempt to address this