LLM Benchmarks
The key benchmarks used to evaluate LLMs — MMLU, HumanEval, GSM8K, HellaSwag, TruthfulQA — what they test, their limitations, and how to interpret leaderboard claims.
Core Benchmarks
MMLU (Massive Multitask Language Understanding):
57 subjects: math, science, medicine, law, history, etc.
4-choice multiple choice
14,042 questions
Measures: knowledge breadth
Random baseline: 25%
Human expert: ~89%
LLaMA 2 70B: 68.9% | GPT-4: 86.4%
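To make the 4-choice format and the accuracy metric concrete, here is a minimal sketch of how one question might be rendered and scored. The `MCQuestion` class, the prompt layout, and the toy items are illustrative assumptions, not the official MMLU harness or real MMLU questions.

```python
from dataclasses import dataclass
from typing import List

LETTERS = "ABCD"


@dataclass
class MCQuestion:
    """Hypothetical container for one 4-choice item."""
    question: str
    choices: List[str]  # exactly four options
    answer: int         # index of the correct option (0-3)


def format_prompt(q: MCQuestion) -> str:
    """Render a question in a common 'A. ... D. ... Answer:' layout."""
    lines = [q.question]
    lines += [f"{LETTERS[i]}. {choice}" for i, choice in enumerate(q.choices)]
    lines.append("Answer:")
    return "\n".join(lines)


def accuracy(questions: List[MCQuestion], predictions: List[str]) -> float:
    """Fraction of predicted letters that match the gold answer letter."""
    correct = sum(
        pred.strip().upper()[:1] == LETTERS[q.answer]
        for q, pred in zip(questions, predictions)
    )
    return correct / len(questions)


# Toy items for illustration only (not taken from MMLU)
toy = [
    MCQuestion("What is 2 + 2?", ["3", "4", "5", "6"], 1),
    MCQuestion("Which planet is closest to the Sun?",
               ["Venus", "Earth", "Mercury", "Mars"], 2),
]

print(format_prompt(toy[0]))
print(accuracy(toy, ["B", "A"]))  # one of two correct
```

With four options, a model that guesses uniformly at random converges to 25% under this metric, which is why that figure is the floor for interpreting MMLU scores.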
HumanEval (Code):
164 Python programming problems
Model must write code that passes unit tests
Measures: functional code generation
LLaMA 2 70B: 29.9% | GPT-4: 67.0% | Claude 3 Sonnet: 73.0%
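HumanEval results are usually reported as pass@k: the probability that at least one of k sampled solutions passes the unit tests. The sketch below uses the unbiased estimator popularised alongside the benchmark, assuming you have generated n samples per problem and counted c that pass. The function name and argument checks are my own.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of samples generated
    c: number of those samples that pass all unit tests
    k: budget of attempts being scored (k <= n)
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k draw contains a pass.
        return 1.0
    # 1 - P(all k drawn samples are failures)
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 10 samples, 3 of them pass
print(pass_at_k(10, 3, 1))
print(pass_at_k(10, 3, 5))
```

The benchmark score is the mean of this value over all 164 problems. For k = 1 the estimate reduces to c / n, the plain fraction of passing samples.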
GSM8K (Grade School Math):
8,500 math word problems requiring multi-step arithmetic reasoning
Chain-of-thought prompting is all but required for strong scores
LLaMA 2 70B: 56.8% | GPT-4: 92.0%
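Because models answer GSM8K with free-form chain-of-thought, scoring needs to pull a final number out of the text. A rough sketch follows, assuming reference solutions end with a `#### <answer>` marker (the dataset's convention) and that the last number in the model's completion is its answer. That last-number heuristic is one common choice, not the only one, and the helper names are hypothetical.

```python
import re
from typing import Optional

# Integers or decimals, optionally negative, with optional thousands separators
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def normalise(text: str) -> str:
    """Strip whitespace, currency signs, separators and trailing full stops."""
    return text.strip().replace(",", "").replace("$", "").rstrip(".")


def gold_answer(solution: str) -> str:
    """Return the reference answer that follows the '####' marker."""
    return normalise(solution.split("####")[-1])


def predicted_answer(completion: str) -> Optional[str]:
    """Heuristic: treat the last number in the completion as the answer."""
    matches = NUMBER.findall(completion)
    return normalise(matches[-1]) if matches else None


def is_correct(completion: str, solution: str) -> bool:
    """Compare prediction and reference numerically."""
    pred = predicted_answer(completion)
    if pred is None:
        return False
    return float(pred) == float(gold_answer(solution))


reference = "48 / 2 = 24 clips in May.\n48 + 24 = 72\n#### 72"
completion = "She sold 48 in April and 24 in May, so the total is 48 + 24 = 72."
print(is_correct(completion, reference))
```

Details like this extraction rule are part of the evaluation setup, so two harnesses can report different GSM8K scores for the same model.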
HellaSwag:
10K sentence completion tasks requiring commonsense reasoning
Models must pick the most likely continuation
LLaMA 2 70B: 87.1% | GPT-4: ~95%
Human: 95.6%
Medical and Domain Benchmarks
MedQA (USMLE-style):
Questions based on US medical licensing exam style
LLaMA 2 70B: ~65% | GPT-4: ~87% | Med-PaLM 2: ~86%
MedMCQA:
194K multiple-choice medical questions
Tests clinical knowledge specifically
PubMedQA:
500 biomedical research questions requiring yes/no/maybe + justification
Tests biomedical reading comprehension
ClinicalBench (emerging):
Questions derived from real clinical note understanding
More realistic than USMLE-style
Reasoning and Alignment Benchmarks
TruthfulQA:
817 questions designed to elicit answers that repeat common human misconceptions (health, law, finance, and more)
Measures: truthfulness, not just accuracy
GPT-4: ~59% truthful | LLaMA 2 70B: ~64%
(A higher truthfulness score does not imply a more capable model: more capable models can be less truthful if they are not aligned, because they reproduce human misconceptions more faithfully.)
BIG-bench Hard (BBH):
23 challenging reasoning tasks from BIG-bench
Tests reasoning beyond pattern-matching
LLaMA 2 70B: 41.9% | GPT-4: 83.1%
ARC (AI2 Reasoning Challenge):
Science questions at Grade 3-9 level (Easy and Challenge sets)
Challenge set requires inference, not just retrieval
WinoGrande:
44K Winograd schema-style commonsense problems
Requires pronoun disambiguation using world knowledge
Interpreting Benchmarks
Common issues with benchmark leaderboards:
1. Data contamination:
LLMs trained on internet data may have seen benchmark questions.
MMLU questions appear in many datasets — models may memorise answers.
Reported scores can be inflated vs. true generalisation.
2. Prompt sensitivity:
"Answer with A, B, C, or D." vs "The answer is: ___" can change
MMLU scores by 5-10 points.
Reported scores depend heavily on evaluation harness details.
3. Few-shot vs zero-shot:
5-shot and 0-shot MMLU scores for the same model can differ by 3-7 points.
Always check the evaluation setup.
4. Benchmark saturation:
GPT-4 is near human on HellaSwag and several MMLU subjects.
New harder benchmarks are needed (MMLU-Pro, GPQA, FrontierMath).
5. Goodhart's Law:
Models are increasingly trained to optimise benchmarks specifically.
Benchmark score ≠ general capability.
MMLU-Pro and GPQA
Harder replacements for MMLU:
MMLU-Pro:
12,032 questions across 14 domains
10-choice instead of 4-choice (harder to guess)
Questions require deeper reasoning
GPT-4o: 72.6% | LLaMA 3 70B: 62.9%
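The effect of moving from 4 to 10 options can be made concrete by computing the random-guess floor and rescaling accuracy so that chance maps to 0 and a perfect score maps to 1. This is a small sketch: the chance-corrected score is a generic normalisation, not something MMLU-Pro itself reports, and the function names are my own.

```python
def chance_accuracy(num_choices: int) -> float:
    """Expected accuracy of uniform random guessing."""
    return 1.0 / num_choices


def chance_corrected(accuracy: float, num_choices: int) -> float:
    """Rescale accuracy so random guessing scores 0 and perfect scores 1."""
    chance = chance_accuracy(num_choices)
    return (accuracy - chance) / (1.0 - chance)


# Scores quoted in this article, as fractions
for name, acc, k in [("MMLU (4-choice)", 0.864, 4),
                     ("MMLU-Pro (10-choice)", 0.726, 10)]:
    print(f"{name}: chance={chance_accuracy(k):.0%}, "
          f"corrected={chance_corrected(acc, k):.3f}")
```

Raw accuracies on the two benchmarks are not directly comparable, since guessing alone earns 25% on MMLU but only 10% on MMLU-Pro.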
GPQA (Graduate-Level Google-Proof Q&A):
448 expert-level questions in biology, chemistry, physics
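Domain experts (holding or pursuing a PhD in the field): ~65% (74% after discounting mistakes the experts identified in retrospect)
Skilled non-experts with unrestricted web access: ~34%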
GPT-4: 53% | Claude 3 Opus: 50%
Measures: genuine expert knowledge vs. retrieval
Evaluation Harnesses
# Using lm-evaluation-harness (EleutherAI)
# pip install lm-eval
# Command line:
# lm_eval --model hf \
# --model_args pretrained=meta-llama/Llama-2-7b-hf \
# --tasks mmlu,gsm8k,hellaswag \
# --num_fewshot 5 \
# --batch_size 8 \
# --output_path results/llama-7b
# Python:
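from lm_eval import simple_evaluate

# Run 5-shot MMLU and HellaSwag on a Hugging Face model
results = simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-2-7b-hf",
    tasks=["mmlu", "hellaswag"],
    num_fewshot=5,
)
# Per-task metrics are stored under the "results" key
print(results["results"])
Interview Answer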
"Key LLM benchmarks: MMLU (57-subject knowledge, 4-choice, human expert ~89%); HumanEval (Python code generation, pass@1); GSM8K (multi-step math word problems); HellaSwag (commonsense sentence completion); TruthfulQA (elicits false beliefs — measures truthfulness). Medical: MedQA (USMLE-style), MedMCQA, PubMedQA. Interpreting benchmark claims requires caution: data contamination, prompt format sensitivity, few-shot vs zero-shot differences, and Goodhart's Law — models trained specifically on benchmark tasks inflate scores without genuine improvement. MMLU-Pro and GPQA are harder replacements that resist saturation better."