LLM Benchmarks: What They Measure and What They Don't

Why Benchmarks Are Hard

Every benchmark has a fundamental problem: once a benchmark is widely known, models are trained on it (intentionally or through data contamination), and scores inflate without reflecting real capability improvement. This is Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure."

Benchmark lifecycle:

1. A benchmark is created to measure capability X
2. Models improve on it, partly through training on the benchmark distribution
3. A model scores 95% and the benchmark is declared "saturated"
4. Researchers create a harder benchmark to measure capability X, and the cycle repeats

Understanding what each benchmark actually tests — and where it fails — is essential for interpreting LLM comparisons.

MMLU: Multitask Language Understanding

What it tests: Multiple-choice questions across 57 subjects (anatomy, law, physics, history, etc.)

Python

# MMLU example structure
MMLU_EXAMPLE = {
    "subject": "clinical_pharmacology",
    "question": "Warfarin's anticoagulant effect is primarily due to inhibition of:",
    "choices": [
        "A. Thrombin directly",
        "B. Vitamin K-dependent clotting factors",
        "C. Platelet aggregation",
        "D. Fibrinogen synthesis",
    ],
    "answer": "B",
}

# Common evaluation approach: 5-shot prompting
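def format_mmlu_question(item: dict) -> str:
    """Render the question together with its lettered choices."""
    return f"Q: {item['question']}\n" + "\n".join(item["choices"])

def evaluate_mmlu(model, dev_examples: list[dict], test_examples: list[dict], n_shot: int = 5) -> float:
    # Few-shot demonstrations come from a separate dev split so that
    # no test item (and its answer) leaks into its own prompt.
    few_shot = "\n\n".join(
        f"{format_mmlu_question(ex)}\nA: {ex['answer']}"
        for ex in dev_examples[:n_shot]
    )

    correct = 0
    for item in test_examples:
        prompt = f"{few_shot}\n\n{format_mmlu_question(item)}\nA:"

        response = model.generate(prompt, max_tokens=1).strip()
        predicted = response[:1].upper()  # First letter; "" if the model returned nothing
        correct += predicted == item["answer"]

    return correct / len(test_examples) if test_examples else 0.0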

MMLU limitations:

Multiple-choice format rewards systematic guessing strategies
Some questions have disputed correct answers
Contamination: MMLU questions are widely available online
Doesn't isolate reasoning: a high score can come from recalling memorized facts rather than from multi-step inference
Measures breadth, not depth: a few hundred multiple-choice questions per subject on average can't probe deep expertise
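
One way to probe the guessing and position-bias concern is to shuffle the answer positions and re-score. The sketch below is illustrative only: `shuffle_choices` and `position_baseline` are hypothetical helper names, and the code assumes items follow the `MMLU_EXAMPLE` structure shown earlier (choices prefixed with "A. " through "D. ").

```python
import random

LETTERS = "ABCD"

def shuffle_choices(item: dict, rng: random.Random) -> dict:
    """Return a copy of an MMLU-style item with answer positions shuffled.

    Assumes each choice is formatted as "<letter>. <text>" and that the
    choice texts are distinct.
    """
    texts = [choice.split(". ", 1)[1] for choice in item["choices"]]
    correct_text = texts[LETTERS.index(item["answer"])]
    rng.shuffle(texts)
    return {
        **item,
        "choices": [f"{LETTERS[i]}. {text}" for i, text in enumerate(texts)],
        "answer": LETTERS[texts.index(correct_text)],
    }

def position_baseline(items: list[dict], letter: str = "B") -> float:
    """Accuracy of a degenerate strategy that always answers the same letter."""
    return sum(item["answer"] == letter for item in items) / len(items)

item = {
    "question": "Warfarin's anticoagulant effect is primarily due to inhibition of:",
    "choices": [
        "A. Thrombin directly",
        "B. Vitamin K-dependent clotting factors",
        "C. Platelet aggregation",
        "D. Fibrinogen synthesis",
    ],
    "answer": "B",
}

rng = random.Random(0)
shuffled = [shuffle_choices(item, rng) for _ in range(1000)]
print("always-B on original:", position_baseline([item] * 1000))
print("always-B on shuffled:", position_baseline(shuffled))
```

After shuffling, the always-B strategy should fall back toward the 25% random baseline. A model whose accuracy drops noticeably under the same shuffle is probably relying on answer position rather than content.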

Typical scores: Random baseline 25%; GPT-4 86%; Claude 3.5 Sonnet 88%; Llama-3-70B 82%

HumanEval: Code Generation

What it tests: Write a Python function given docstring + signature; tests run automatically.

Python

# HumanEval example
HUMANEVAL_EXAMPLE = {
    "task_id": "HumanEval/7",
    "prompt": """from typing import List


def filter_by_substring(strings: List[str], substring: str) -> List[str]:
    \"\"\" Filter an input list of strings only for ones that contain given substring
    >>> filter_by_substring([], 'a')
    []
    >>> filter_by_substring(['abc', 'bacd', 'cde', 'array'], 'a')
    ['abc', 'bacd', 'array']
    \"\"\"
""",
    "canonical_solution": "    return [x for x in strings if substring in x]\n",
    "test": """
def check(filter_by_substring):
    assert filter_by_substring([], 'a') == []
    assert filter_by_substring(['abc', 'bacd', 'cde', 'array'], 'a') == ['abc', 'bacd', 'array']
""",
    "entry_point": "filter_by_substring",
}

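from math import comb

# pass@k: probability that at least one of k samples passes the tests,
# estimated from n_samples >= k generations per problem
def pass_at_k(n_correct: int, n_samples: int, k: int) -> float:
    """
    Unbiased estimate of pass@k.
    n_correct: number of correct samples out of n_samples.
    Requires 1 <= k <= n_samples.
    """
    if not 1 <= k <= n_samples:
        raise ValueError("k must satisfy 1 <= k <= n_samples")

    # Complement: probability that k samples drawn without replacement
    # from the n_samples generations are all wrong. comb() returns 0 when
    # fewer than k wrong samples exist, so the estimate is then 1.0.
    p_all_wrong = comb(n_samples - n_correct, k) / comb(n_samples, k)
    return 1 - p_all_wrong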

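# pass@1 for GPT-4: ~67% in the original technical report, ~87% for later versions
# pass@10 for GPT-4: ~95% (approximate; figures vary with model version and prompt)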
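
To tie the problem format and the metric together, here is a minimal end-to-end sketch: run each sampled completion against the problem's `check` function, count the passes, and feed the count into pass@k. `passes_tests`, `pass_at_k_product`, and the toy `add` problem are illustrative names and data, not part of the HumanEval release. The product form used here is algebraically the same as the `comb` ratio but avoids very large intermediate integers.

```python
def passes_tests(problem: dict, completion: str) -> bool:
    """Run prompt + completion + tests in a fresh namespace.

    WARNING: exec() runs untrusted model output. A real harness should use
    a sandboxed subprocess with a timeout; this sketch has neither.
    """
    program = (
        problem["prompt"]
        + completion
        + "\n"
        + problem["test"]
        + f"\ncheck({problem['entry_point']})\n"
    )
    namespace: dict = {}
    try:
        exec(program, namespace)
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as failures
        return False
    return True

def pass_at_k_product(n_correct: int, n_samples: int, k: int) -> float:
    """pass@k as 1 - prod(1 - k/i) for i in (n_samples - n_correct, n_samples]."""
    if n_samples - n_correct < k:
        return 1.0  # Every size-k subset contains at least one correct sample
    p_all_wrong = 1.0
    for i in range(n_samples - n_correct + 1, n_samples + 1):
        p_all_wrong *= 1.0 - k / i
    return 1.0 - p_all_wrong

# Toy problem in the HumanEval layout (hypothetical, for illustration only)
TOY_PROBLEM = {
    "prompt": 'def add(a: int, b: int) -> int:\n    """Return the sum of a and b."""\n',
    "test": "def check(candidate):\n    assert candidate(1, 2) == 3\n    assert candidate(-1, 1) == 0\n",
    "entry_point": "add",
}

# Three sampled completions: two correct, one wrong
SAMPLES = ["    return a + b\n", "    return a - b\n", "    return b + a\n"]

n_correct = sum(passes_tests(TOY_PROBLEM, s) for s in SAMPLES)
print("correct samples:", n_correct)
print("pass@1:", pass_at_k_product(n_correct, len(SAMPLES), 1))
print("pass@2:", pass_at_k_product(n_correct, len(SAMPLES), 2))
```

A benchmark-level score is the mean of the per-problem estimates across all problems.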

HumanEval limitations:

Only 164 problems — small, high variance
Problems are mostly algorithmic puzzles, not real-world code
Unit tests can be gamed (hardcoding known test cases)
Doesn't test debugging, refactoring, or large codebase navigation

GSM8K: Grade School Math

What it tests: Multi-step arithmetic word problems at grade-school level

Python

GSM8K_EXAMPLE = {
    "question": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did she sell altogether in April and May?",
    "answer": "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n#### 72",
}

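import re

def build_cot_prompt(cot_examples: list[dict], question: str) -> str:
    """Format the worked examples, then the new question."""
    shots = "\n\n".join(
        f"Q: {ex['question']}\nA: {ex['reasoning']}\n#### {ex['answer']}"
        for ex in cot_examples
    )
    return f"{shots}\n\nQ: {question}\nA:"

def extract_number(text: str) -> str:
    """Return the last number mentioned in the text, or "" if there is none."""
    numbers = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    return numbers[-1] if numbers else ""

def normalize_number(text: str) -> str:
    """Make "1,000", "$1000" and "1000.0" compare equal."""
    number = extract_number(text).replace(",", "")
    return str(float(number)) if number else ""

def evaluate_gsm8k(model, problems: list[dict]) -> dict:
    """Evaluate on GSM8K using chain-of-thought prompting."""
    COT_EXAMPLES = [
        {
            "question": "There are 15 trees in the grove...",
            "reasoning": "Let's think step by step...",
            "answer": "6",
        }
    ]

    correct = 0
    for problem in problems:
        prompt = build_cot_prompt(COT_EXAMPLES, problem["question"])
        response = model.generate(prompt)

        # Prefer the line after "####"; otherwise fall back to the last number
        if "####" in response:
            tail = response.split("####")[-1].strip().split("\n")[0]
        else:
            tail = response
        predicted = normalize_number(tail)
        expected = normalize_number(problem["answer"].split("####")[-1])

        if predicted and predicted == expected:
            correct += 1

    return {
        "accuracy": correct / len(problems) if problems else 0.0,
        "n_problems": len(problems),
    }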

GSM8K limitations:

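8.5K problems in total, about 1.3K of them in the test split: small
Grade school level: tests multi-step arithmetic, not advanced mathematical reasoning
Models can memorize problem templates from the training set, so scores may overstate generalization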
Pass rate above 90% (modern models) makes it no longer discriminative
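
The `<<48/2=24>>` markers in the reference answer above are calculator annotations, and the final answer follows `####`. As a rough sketch of how that format can be parsed and sanity-checked, the helpers below (hypothetical names) extract the final answer and re-compute each annotation with a small arithmetic evaluator that only accepts numbers and `+ - * /`.

```python
import ast
import operator
import re

_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

def _eval_arithmetic(node: ast.AST) -> float:
    """Evaluate a parsed expression made only of numbers and + - * /."""
    if isinstance(node, ast.Expression):
        return _eval_arithmetic(node.body)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval_arithmetic(node.left), _eval_arithmetic(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_eval_arithmetic(node.operand)
    raise ValueError("unsupported expression")

def final_answer(solution: str) -> str:
    """Return the text after the last '####' marker."""
    return solution.split("####")[-1].strip()

def check_calculator_annotations(solution: str) -> list[tuple[str, bool]]:
    """Re-compute every <<expr=result>> annotation and report whether it holds."""
    results = []
    for expr, claimed in re.findall(r"<<([^=<>]+)=([^<>]+)>>", solution):
        value = _eval_arithmetic(ast.parse(expr.strip(), mode="eval"))
        results.append((expr, abs(value - float(claimed)) < 1e-6))
    return results

solution = (
    "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\n"
    "Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n"
    "#### 72"
)
print(final_answer(solution))
print(check_calculator_annotations(solution))
```

The same check can be applied to model outputs that imitate the annotation format, which helps separate arithmetic slips from errors in setting up the problem.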

MATH: Competition Mathematics

Much harder than GSM8K — AMC, AIME, and competition problems across 7 categories:

Python

MATH_CATEGORIES = {
    "Prealgebra": "Arithmetic, ratios, basic algebra",
    "Algebra": "Polynomials, equations, inequalities",
    "Number Theory": "Divisibility, modular arithmetic, primes",
    "Counting & Probability": "Combinatorics, expected value",
    "Geometry": "Euclidean and coordinate geometry",
    "Intermediate Algebra": "Complex numbers, sequences, logarithms",
    "Precalculus": "Trigonometry, vectors, matrices",
}

# Difficulty levels 1-5
# GPT-4 pass@1: ~69% (level 1-3 problems)
# GPT-4 pass@1: ~38% (level 4-5 problems)
# State-of-art (2025): ~90%+ with verification

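def extract_boxed_answer(response: str) -> str:
    """Extract the LaTeX boxed answer from a model response."""
    # Answers in the MATH dataset are in \boxed{} format. Braces are matched
    # by counting depth so nested LaTeX such as \boxed{\frac{1}{2}} works;
    # a pattern like [^}]+ would stop at the first closing brace.
    marker = "\\boxed{"
    start = response.rfind(marker)  # Use the last boxed expression
    if start == -1:
        return ""
    content_start = start + len(marker)
    depth = 1
    for i in range(content_start, len(response)):
        if response[i] == "{":
            depth += 1
        elif response[i] == "}":
            depth -= 1
            if depth == 0:
                return response[content_start:i]
    return ""  # Unbalanced braces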

HellaSwag and WinoGrande

HellaSwag: Commonsense reasoning. Choose the most plausible ending for a context drawn from ActivityNet video captions or WikiHow articles:

Prompt: "A woman is shown cutting and styling a man's hair. The man watches as..."
A: "...his hair falls to the floor."
B: "...the stylist hands him some food."
C: "...a child walks by."
D: "...someone opens the door."

Random baseline 25%; GPT-4: 95%; human performance: 95%. Now saturated.

WinoGrande: Pronoun resolution requiring world knowledge, in the style of the Winograd Schema Challenge (the classic schema below illustrates the format):

"The trophy didn't fit in the suitcase because __ was too big."
A: the trophy  B: the suitcase

These tests were designed to require "real" commonsense; large models now match or approach human scores on them.

LMSYS Chatbot Arena: Human Preference Evaluation

Often regarded as one of the more robust evaluations because it relies on blind human preference rather than automated scoring:

Python

# Chatbot Arena approach (not reproducible programmatically — it's a live platform)
# Two models are shown side-by-side, anonymized
# Humans pick which response they prefer
# Elo rating system aggregates preferences

# Arena characteristics:
# - Self-selected evaluators (not representative of all users)
# - Users choose their own prompts (real use cases)
# - Blind evaluation reduces brand bias (model names are hidden until after the vote)
# - Large scale (millions of votes)

# Limitations:
# - Users tend to prefer longer, more elaborate responses (verbosity bias)
# - English-language bias (most users write in English)
# - Sycophantic responses may score well with casual users
# - Complex technical questions underrepresented vs casual chat
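
To illustrate how an Elo-style system turns pairwise votes into ratings, here is a minimal sketch of the standard Elo update. The starting rating of 1000, the K-factor of 32, the function names, and the sample votes are all assumptions for illustration; the live leaderboard's actual computation may differ.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0) -> tuple[float, float]:
    """Update both ratings after one vote.

    score_a is 1.0 if A was preferred, 0.0 if B was preferred, 0.5 for a tie.
    """
    expected_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

def rate_models(votes: list[tuple[str, str, str]], initial: float = 1000.0, k: float = 32.0) -> dict[str, float]:
    """Process (model_a, model_b, outcome) votes in order; outcome is 'a', 'b' or 'tie'."""
    outcome_to_score = {"a": 1.0, "b": 0.0, "tie": 0.5}
    ratings: dict[str, float] = {}
    for model_a, model_b, outcome in votes:
        rating_a = ratings.setdefault(model_a, initial)
        rating_b = ratings.setdefault(model_b, initial)
        ratings[model_a], ratings[model_b] = update_elo(
            rating_a, rating_b, outcome_to_score[outcome], k
        )
    return ratings

# Hypothetical votes between three anonymized models
votes = [
    ("model_x", "model_y", "a"),
    ("model_y", "model_z", "tie"),
    ("model_x", "model_z", "a"),
    ("model_z", "model_y", "b"),
]
for name, rating in sorted(rate_models(votes).items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.1f}")
```

Sequential Elo depends on the order in which votes arrive, which is one reason large-scale leaderboards often fit all votes jointly instead.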

Building Domain-Specific Evaluations

For clinical AI, generic benchmarks are insufficient. Build targeted evals:

Python

from dataclasses import dataclass
from typing import Callable

@dataclass
class DomainEvalCase:
    question: str
    expected_answer: str
    category: str          # e.g., "drug_interaction", "dosing", "safety"
    difficulty: int        # 1-3
    reference: str         # Source (e.g., "Lexicomp 2024")

def create_clinical_eval_suite() -> list[DomainEvalCase]:
    return [
        DomainEvalCase(
            question="What is the severity of the interaction between warfarin and clarithromycin?",
            expected_answer="major",
            category="drug_interaction",
            difficulty=2,
            reference="Lexicomp Drug Interactions",
        ),
        DomainEvalCase(
            question="What dose adjustment is needed for metformin at eGFR below 30 mL/min/1.73m²?",
            expected_answer="contraindicated",
            category="renal_dosing",
            difficulty=2,
            reference="KDIGO CKD Guidelines 2022",
        ),
        DomainEvalCase(
            question="Warfarin and alcohol: is this a major interaction?",
            expected_answer="moderate",
            category="drug_interaction",
            difficulty=1,
            reference="Lexicomp Drug Interactions",
        ),
    ]

def run_domain_eval(
    model_fn: Callable[[str], str],
    eval_suite: list[DomainEvalCase],
    judge_fn: Callable[[str, str], float] = None,
) -> dict:
    """Run domain evaluation with optional LLM judge scoring."""
    results = {
        "overall": {"correct": 0, "total": 0},
        "by_category": {},
        "by_difficulty": {1: {"correct": 0, "total": 0}, 2: {"correct": 0, "total": 0}, 3: {"correct": 0, "total": 0}},
    }

    for case in eval_suite:
        response = model_fn(case.question)

        if judge_fn:
            score = judge_fn(response, case.expected_answer)
            is_correct = score >= 0.8
        else:
            # Simple string match (case-insensitive, substring).
            # Caveat: a response saying "not major" still matches "major".
            is_correct = case.expected_answer.lower() in response.lower()

        results["overall"]["total"] += 1
        results["overall"]["correct"] += is_correct

        cat = case.category
        if cat not in results["by_category"]:
            results["by_category"][cat] = {"correct": 0, "total": 0}
        results["by_category"][cat]["total"] += 1
        results["by_category"][cat]["correct"] += is_correct

        # setdefault avoids a KeyError for difficulty levels outside 1-3
        diff = results["by_difficulty"].setdefault(case.difficulty, {"correct": 0, "total": 0})
        diff["total"] += 1
        diff["correct"] += is_correct

    # Compute accuracy rates, skipping empty buckets to avoid division by zero
    buckets = [
        results["overall"],
        *results["by_category"].values(),
        *results["by_difficulty"].values(),
    ]
    for bucket in buckets:
        if bucket["total"]:
            bucket["accuracy"] = bucket["correct"] / bucket["total"]

    return results
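
The `judge_fn` parameter accepts any callable that maps a response and an expected answer to a score. As a deterministic stand-in for an LLM judge, the sketch below matches the expected answer on whole words and rejects matches that follow a nearby negation. `keyword_judge`, the negation list, and the three-token window are illustrative assumptions, not a validated clinical scoring method.

```python
import re

# Small illustrative negation list; a real evaluator would need a broader one
NEGATIONS = {"not", "no", "never", "isn't", "without"}

def _tokenize(text: str) -> list[str]:
    """Lowercase word tokens, keeping apostrophes so "isn't" stays one token."""
    return re.findall(r"[a-z0-9']+", text.lower())

def keyword_judge(response: str, expected: str, window: int = 3) -> float:
    """Return 1.0 if `expected` appears as whole words and is not negated.

    A match is treated as negated when one of the `window` tokens before it
    is in NEGATIONS.
    """
    tokens = _tokenize(response)
    target = _tokenize(expected)
    if not target:
        return 0.0
    n = len(target)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == target:
            preceding = tokens[max(0, i - window):i]
            if not any(tok in NEGATIONS for tok in preceding):
                return 1.0
    return 0.0

response = "This is not a major interaction; it is moderate."
print(keyword_judge(response, "major"))
print(keyword_judge(response, "moderate"))
print(keyword_judge("This is a major interaction.", "major"))
```

Because it returns 0.0 or 1.0, this function works with the `score >= 0.8` threshold in `run_domain_eval`. It still cannot handle paraphrases or hedged answers, which is where a model-based judge or human review is needed.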

Benchmark Contamination Detection

Python

def check_benchmark_contamination(
    training_dataset: list[str],
    benchmark_questions: list[str],
    threshold: float = 0.8,
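) -> list[dict]:
    """
    Check if benchmark questions appear in training data.
    Uses character-level similarity (difflib.SequenceMatcher), so it only
    catches near-duplicate strings: a question embedded in a much longer
    training document scores low. Cost grows with
    len(benchmark_questions) * len(training_dataset).
    """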
    from difflib import SequenceMatcher

    contaminated = []

    for bq in benchmark_questions:
        for train_text in training_dataset:
            ratio = SequenceMatcher(None, bq.lower(), train_text.lower()).ratio()
            if ratio > threshold:
                contaminated.append({
                    "benchmark_question": bq,
                    "training_text": train_text[:200],
                    "similarity": ratio,
                })
                break

    # Guard against an empty benchmark list
    contamination_rate = len(contaminated) / len(benchmark_questions) if benchmark_questions else 0.0
    print(f"Contamination rate: {contamination_rate:.1%}")
    return contaminated

# Models with high contamination rates on a benchmark:
# - Score reflects memorization, not generalization
# - Solution: use held-out, newly created benchmarks
# - LiveBench (updated monthly), MMLU-Pro (harder version) attempt to address this
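
Whole-string similarity misses a benchmark question that is quoted inside a longer training document. A word n-gram overlap check handles that case. The sketch below is one simple variant: the 8-gram size, the 0.5 threshold, and the function names are assumptions chosen for illustration, and a production check would also need to scale beyond in-memory sets.

```python
import re

def word_ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Set of lowercase word n-grams in the text (empty if fewer than n words)."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_contamination(
    training_dataset: list[str],
    benchmark_questions: list[str],
    n: int = 8,
    threshold: float = 0.5,
) -> list[dict]:
    """Flag questions whose n-grams largely appear somewhere in the training data."""
    train_ngrams: set[tuple[str, ...]] = set()
    for document in training_dataset:
        train_ngrams |= word_ngrams(document, n)

    flagged = []
    for question in benchmark_questions:
        question_ngrams = word_ngrams(question, n)
        if not question_ngrams:
            continue  # Too short to judge with n-grams of this size
        overlap = len(question_ngrams & train_ngrams) / len(question_ngrams)
        if overlap >= threshold:
            flagged.append({"benchmark_question": question, "overlap": overlap})
    return flagged

training = [
    "Forum post: here is a fun one. Natalia sold clips to 48 of her friends in "
    "April, and then she sold half as many clips in May. Can anyone solve it?"
]
benchmark = [
    "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May.",
    "A train leaves the station at noon traveling east at sixty miles per hour toward the coast.",
]
for hit in ngram_contamination(training, benchmark):
    print(f"{hit['overlap']:.0%}  {hit['benchmark_question'][:50]}...")
```

Building the n-gram set once makes each question lookup independent of the number of training documents, unlike the pairwise comparison above. Paraphrased or translated copies of a question will still slip through either method.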
