Popular LLM Benchmarks Explained
Understand MMLU, HellaSwag, HumanEval, MT-Bench, Chatbot Arena, and other standard benchmarks. Learn what each measures and how to use them for model selection.
Why Benchmarks Matter for Model Selection
When choosing a base model or evaluating fine-tuning results, standardized benchmarks give you an objective comparison point. Without them, you're comparing models on subjective impressions.
The key insight: no single benchmark captures everything. Different benchmarks measure different capabilities — use multiple to get a complete picture.
Knowledge and Reasoning Benchmarks
MMLU (Massive Multitask Language Understanding)
What it measures: Academic knowledge across 57 subjects: medicine, law, mathematics, history, computer science, and more. Multiple-choice format.
Format: 4-choice questions, 5-shot by default.
Why it matters: Good proxy for retained knowledge. A drop in MMLU after fine-tuning suggests the model has forgotten some general knowledge.
Scores (approximate, as publicly reported; exact values depend on prompt format and shot count): GPT-4o: ~88%, Llama-3.1-70B: ~86%, Llama-3.1-8B: ~73%
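As a rough illustration of the 5-shot, 4-choice format described above, the sketch below builds a few-shot prompt. The header wording and the trailing `Answer:` follow a commonly used template, but evaluation harnesses differ in the details. The function names and sample items are hypothetical, not taken from MMLU.

```python
CHOICE_LABELS = ["A", "B", "C", "D"]


def format_question(question, choices, answer=None):
    """Render one 4-choice item; solved examples include the answer letter."""
    lines = [question]
    lines += [f"{label}. {choice}" for label, choice in zip(CHOICE_LABELS, choices)]
    lines.append("Answer:" if answer is None else f"Answer: {answer}")
    return "\n".join(lines)


def build_few_shot_prompt(examples, target, subject):
    """Concatenate k solved examples followed by the unsolved target question."""
    header = f"The following are multiple choice questions (with answers) about {subject}."
    shots = [format_question(q, c, a) for q, c, a in examples]
    return "\n\n".join([header, *shots, format_question(*target)])


# Hypothetical items for illustration only (1-shot here; MMLU typically uses 5).
examples = [("What is 2 + 2?", ["3", "4", "5", "6"], "B")]
target = ("What is 3 * 3?", ["6", "8", "9", "12"])
prompt = build_few_shot_prompt(examples, target, "elementary mathematics")
print(prompt)
```

The model's answer is then read from whichever of the letters A to D it ranks highest after the final `Answer:`.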
ARC (AI2 Reasoning Challenge)
What it measures: Science questions at Grade 3–9 level. ARC-Easy and ARC-Challenge (harder questions that simple retrieval methods fail on).
Why it matters: Tests genuine reasoning rather than memorization.
HellaSwag
What it measures: Commonsense reasoning. Given the beginning of a scenario, pick the most plausible continuation.
Why it matters: Tests whether the model understands how the physical and social world works.
TruthfulQA
What it measures: Whether models give truthful answers to questions that are often answered incorrectly due to misconceptions or hallucination tendencies.
Why it matters: Tests the tendency to repeat common falsehoods, which is one facet of hallucination. Models with RLHF alignment generally score higher.
Code and Math Benchmarks
HumanEval
What it measures: Python coding ability. 164 programming problems with unit tests. Model generates code; tests determine correctness.
Metric: pass@1 (fraction of problems solved by a single generated sample) and pass@k (fraction of problems where at least one of k generated samples passes the tests).
Scores: GPT-4o: ~90%, Claude 3.5 Sonnet: ~92%, Llama-3.1-8B: ~72%
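In practice, pass@k is usually estimated by generating n ≥ k samples per problem and counting how many pass, rather than by sampling exactly k. The sketch below uses the standard unbiased estimator, 1 − C(n−c, k) / C(n, k). The function name and the sample counts are illustrative assumptions.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, of which c passed."""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical per-problem results: (samples generated, samples that passed).
per_problem = [(10, 3), (10, 0), (10, 10)]
scores = [pass_at_k(n, c, k=1) for n, c in per_problem]
benchmark_pass_at_1 = sum(scores) / len(scores)
print(f"pass@1 = {benchmark_pass_at_1:.3f}")
```

The benchmark-level score is the mean of the per-problem estimates.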
# Simplified illustration of a HumanEval-style problem record
problem = {
    "prompt": "def add(a: int, b: int) -> int:\n    \"\"\"Return a + b.\"\"\"\n",
    "test": "assert add(1, 2) == 3\nassert add(-1, 1) == 0",
    "entry_point": "add",
}

GSM8K (Grade School Math)
What it measures: Grade school math word problems requiring multi-step reasoning.
Why it matters: Tests arithmetic reasoning and multi-step problem decomposition.
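GSM8K reference solutions end with a line of the form `#### <final answer>`, so scoring usually comes down to extracting a final number from the model output and comparing it to that value. The sketch below takes the last number in the generation as the prediction, which is a common heuristic but an assumption here. The helper names and the sample problem are hypothetical.

```python
import re


def extract_reference_answer(solution: str) -> str:
    """Return the text after the final '####' marker, without thousands separators."""
    return solution.split("####")[-1].strip().replace(",", "")


def extract_predicted_answer(generation: str):
    """Heuristic: treat the last number in the model output as its final answer."""
    numbers = re.findall(r"-?\d[\d,]*(?:\.\d+)?", generation)
    return numbers[-1].replace(",", "") if numbers else None


def is_correct(generation: str, solution: str) -> bool:
    """Exact numeric match between predicted and reference final answers."""
    predicted = extract_predicted_answer(generation)
    if predicted is None:
        return False
    return float(predicted) == float(extract_reference_answer(solution))


# Hypothetical problem written in the GSM8K solution style.
solution = "3 boxes of 12 pencils is 3 * 12 = 36. After giving away 6: 36 - 6 = 30.\n#### 30"
generation = "There are 3 * 12 = 36 pencils. 36 - 6 = 30. The answer is 30."
print(is_correct(generation, solution))
```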
MATH
What it measures: Competition-level math problems, harder than GSM8K.
Why it matters: Tests whether models can handle complex symbolic reasoning.
Instruction Following Benchmarks
MT-Bench
What it measures: Multi-turn conversational ability across 8 categories: writing, roleplay, extraction, reasoning, math, coding, STEM, humanities. Scored by GPT-4 as judge.
Format: 80 multi-turn questions. GPT-4 scores each response 1–10.
Why it matters: Tests real-world usefulness, not just academic knowledge.
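Once the judge has scored each response, the results are typically summarized as a per-category average and an overall average. The sketch below shows that aggregation. The record layout (`category`, `turn`, `score`) is a hypothetical schema, and the scores are made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def summarize_judgments(judgments):
    """Average 1-10 judge scores per category and across all responses."""
    by_category = defaultdict(list)
    for judgment in judgments:
        by_category[judgment["category"]].append(judgment["score"])
    per_category = {cat: mean(scores) for cat, scores in by_category.items()}
    overall = mean(judgment["score"] for judgment in judgments)
    return per_category, overall


# Hypothetical judge outputs: one question per category, two turns each.
judgments = [
    {"category": "coding", "turn": 1, "score": 8},
    {"category": "coding", "turn": 2, "score": 6},
    {"category": "math", "turn": 1, "score": 4},
    {"category": "math", "turn": 2, "score": 5},
]
per_category, overall = summarize_judgments(judgments)
print(per_category, overall)
```

Looking at the per-category breakdown is often more useful than the single overall number, since a model can be strong at writing and weak at math.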
AlpacaEval
What it measures: Single-turn instruction following. Compares model responses against GPT-4 reference outputs using an LLM judge.
Metric: Win rate vs GPT-4 reference.
Human Preference Benchmarks
Chatbot Arena (LMSYS)
What it measures: Direct human preference via blind pairwise comparisons. Users chat with two anonymous models and vote for the better response.
Why it matters: Widely treated as the reference point for real-world quality, since it draws on actual users, real conversations, and blind evaluation. The Elo rating from Chatbot Arena is one of the most trusted overall quality signals.
Limitation: Slow to update, focuses on general conversation rather than specific domains.
Domain-Specific Benchmarks
For specialized applications, standard benchmarks may not reflect your use case. Examples:
| Domain | Benchmark | What it tests |
|---|---|---|
| Medical | MedQA (USMLE) | Clinical medicine MCQs |
| Medical | MedMCQA | Indian medical entrance questions |
| Legal | LegalBench | Legal reasoning and analysis |
| Code | SWE-bench | Real GitHub issues (harder than HumanEval) |
| Safety | BBQ | Bias in question answering |
| Long context | SCROLLS | Long document comprehension |
Running Benchmarks with lm-evaluation-harness
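pip install lm-eval

import json
from lm_eval import evaluator

# Evaluate a model on multiple benchmarks.
# num_fewshot applies to every listed task; omit it to use each task's default.
results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3.1-8B-Instruct,dtype=bfloat16",
    tasks=["mmlu", "hellaswag", "arc_challenge", "truthfulqa_mc1"],
    num_fewshot=5,
    batch_size=8,
    device="cuda",
)

# In recent lm-eval releases, output_path is a CLI flag rather than a
# simple_evaluate argument, so the scores are saved manually here.
with open("benchmark_results.json", "w") as f:
    json.dump(results["results"], f, indent=2, default=str)

# Print results
for task, task_results in results["results"].items():
    for metric, value in task_results.items():
        if isinstance(value, float):
            print(f"{task} | {metric}: {value:.4f}")

Benchmark Limitations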
Contamination: Many models have seen benchmark data during pre-training. MMLU scores may reflect memorization, not reasoning. Newer, held-out benchmarks (like GPQA) reduce contamination.
Distribution mismatch: Benchmark performance doesn't guarantee performance on your specific use case. A model that scores high on MMLU medical questions may still fail on specialized pharmaceutical interaction queries.
Saturation: Frontier models score 85–90%+ on MMLU. The benchmark no longer meaningfully differentiates top models. Newer benchmarks like MMLU-Pro and GPQA are designed to be harder.
Single-shot vs real use: Benchmarks test isolated questions. Real applications involve multi-turn context, system prompts, tools, and domain-specific formats. Always supplement benchmark evaluation with real task evaluation.
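For the contamination point above, one simple check when you have access to some training or fine-tuning text is to measure n-gram overlap between each benchmark item and that text. The sketch below is a minimal version using whitespace tokens. The choice of n = 8, the function names, and the sample strings are assumptions, and real contamination audits use more careful normalization.

```python
def ngrams(text: str, n: int):
    """Set of lowercase whitespace-token n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(benchmark_item: str, corpus_text: str, n: int = 8) -> float:
    """Fraction of the item's n-grams that also appear in the corpus text."""
    item_ngrams = ngrams(benchmark_item, n)
    if not item_ngrams:
        return 0.0  # item shorter than n tokens: nothing to compare
    return len(item_ngrams & ngrams(corpus_text, n)) / len(item_ngrams)


# Hypothetical benchmark question and corpus snippets.
item = "the quick brown fox jumps over the lazy dog"
leaked = "as the saying goes, the quick brown fox jumps over the lazy dog every day"
clean = "a completely unrelated passage about evaluating language models"
print(overlap_fraction(item, leaked, n=3), overlap_fraction(item, clean, n=3))
```

A high overlap fraction is a signal to investigate, not proof of memorization: short or formulaic questions can overlap with unrelated text by chance.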
How to Use Benchmarks in Practice
- Base model selection: Use Chatbot Arena Elo + MMLU + task-specific benchmark to choose between Llama, Mistral, Gemma variants
- Post-fine-tuning check: Run MMLU before and after fine-tuning, and accept a regression of less than 3%
- Domain validation: Build a custom MCQ benchmark from your domain (100–200 questions), which is more informative than MMLU for your use case
- Monthly tracking: Run your benchmark suite monthly on the deployed model version to catch degradation from model updates
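The post-fine-tuning check and the monthly tracking step both come down to comparing two sets of scores against a tolerance. The sketch below flags tasks whose score dropped by more than a threshold. It interprets "3%" as 3 absolute percentage points, which is an assumption, and the scores and function name are hypothetical.

```python
def check_regression(before, after, max_drop_points=3.0):
    """Return tasks whose accuracy fell by more than max_drop_points.

    Scores are fractions in [0, 1]; drops are reported in percentage points.
    """
    failures = {}
    for task, baseline in before.items():
        if task not in after:
            continue  # task was not re-run; nothing to compare
        drop = (baseline - after[task]) * 100
        if drop > max_drop_points:
            failures[task] = round(drop, 2)
    return failures


# Hypothetical scores before and after fine-tuning.
before = {"mmlu": 0.73, "hellaswag": 0.80}
after = {"mmlu": 0.69, "hellaswag": 0.79}
failures = check_regression(before, after)
print(failures or "No regressions beyond tolerance")
```

The same function can compare this month's scores against last month's for a deployed model.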