Popular Benchmarks: MMLU, HumanEval, HELM

Core Benchmarks

MMLU (Massive Multitask Language Understanding):
  57 subjects: math, science, medicine, law, history, etc.
  4-choice multiple choice
  14,042 test questions
  Measures: knowledge breadth
  Random baseline: 25%
  Human expert: ~89%
  LLaMA 2 70B: 68.9%  |  GPT-4: 86.4%
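
Because MMLU spans many subjects of different sizes, a single headline number can hide whether it is averaged over questions or over subjects. The sketch below scores 4-choice predictions both ways. The record layout (subject, predicted letter, gold letter) and the function name are assumptions made for illustration, not the format of any particular harness.

```python
from collections import defaultdict

def mmlu_accuracy(records):
    """Return (micro, macro) accuracy.

    records: iterable of (subject, predicted_letter, gold_letter) tuples.
    This layout is hypothetical; real harnesses store results differently.
    """
    per_subject = defaultdict(lambda: [0, 0])  # subject -> [correct, total]
    for subject, pred, gold in records:
        hit = pred.strip().upper() == gold.strip().upper()
        per_subject[subject][0] += int(hit)
        per_subject[subject][1] += 1

    correct = sum(c for c, _ in per_subject.values())
    total = sum(t for _, t in per_subject.values())
    micro = correct / total                                          # every question weighs the same
    macro = sum(c / t for c, t in per_subject.values()) / len(per_subject)  # every subject weighs the same
    return micro, macro

# Toy data: 4 law questions (3 correct) and 2 math questions (1 correct).
records = [
    ("law", "A", "A"), ("law", "B", "C"), ("law", "D", "D"), ("law", "C", "C"),
    ("math", "b", "B"), ("math", "A", "D"),
]
micro, macro = mmlu_accuracy(records)
print(f"micro={micro:.3f} macro={macro:.3f}")
```

The two averages diverge whenever subjects differ in size, so it is worth checking which one a reported score uses.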

HumanEval (Code):
  164 Python programming problems
  Model must write code that passes unit tests
  Measures: functional code generation (scores below are pass@1)
  LLaMA 2 70B: 29.9%  |  GPT-4: 67.0%  |  Claude 3 Sonnet: 73.0%
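
HumanEval results are usually reported as pass@k: the probability that at least one of k sampled solutions passes the unit tests. A minimal sketch of the standard unbiased estimator follows, assuming n samples were generated per problem and c of them passed. The function name and the toy numbers are illustrative only.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed the tests, k: budget.
    Computes 1 - C(n - c, k) / C(n, k).
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Toy per-problem outcomes as (n, c); the benchmark score is the mean over problems.
outcomes = [(10, 3), (10, 0), (10, 10)]
score = sum(pass_at_k(n, c, k=1) for n, c in outcomes) / len(outcomes)
print(f"pass@1 = {score:.3f}")
```

With k=1 the estimator reduces to the fraction of samples that pass, which is why pass@1 is often read as plain accuracy.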

GSM8K (Grade School Math):
  ~8,500 grade-school math word problems requiring multi-step arithmetic reasoning
  Chain-of-thought prompting is close to essential for strong scores
  LLaMA 2 70B: 56.8%  |  GPT-4: 92.0%
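
Since the model reasons in free text, scoring GSM8K means pulling a final number out of the completion and comparing it with the reference. GSM8K reference solutions end with a line of the form "#### <answer>". The "take the last number in the completion" rule below is a common heuristic but is an assumption here, and the helper names are hypothetical.

```python
import re

# Integers or decimals, optionally negative, with optional thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def gold_answer(reference: str) -> str:
    """Extract the final answer that follows '####' in a reference solution."""
    return reference.split("####")[-1].strip().replace(",", "")

def predicted_answer(completion: str):
    """Heuristic: treat the last number in the model output as its answer."""
    matches = _NUMBER.findall(completion)
    return matches[-1].replace(",", "") if matches else None

def is_correct(completion: str, reference: str) -> bool:
    pred = predicted_answer(completion)
    return pred is not None and float(pred) == float(gold_answer(reference))

# Made-up example in the GSM8K style.
reference = "3 boxes of 12 eggs is 3 * 12 = 36 eggs.\n#### 36"
completion = "She has 3 boxes with 12 eggs each, so 3 * 12 = 36. The answer is 36."
print(is_correct(completion, reference))
```

Different extraction rules can score the same completion differently, which is one reason reported GSM8K numbers vary between harnesses.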

HellaSwag:
  10K sentence completion tasks requiring commonsense reasoning
  Models must pick the most likely continuation
  LLaMA 2 70B: 87.1%  |  GPT-4: ~95%
  Human: 95.6%
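
Picking "the most likely continuation" is typically done by scoring each candidate ending with the model's log-probabilities and choosing the best one. The sketch below assumes the per-token log-probabilities have already been obtained from a model; the numbers are made up, and per-token length normalisation is one possible choice rather than the only one used in practice.

```python
def pick_continuation(token_logprobs, normalize=True):
    """Return the index of the highest-scoring candidate ending.

    token_logprobs: one non-empty list of per-token log-probabilities
    per candidate. With normalize=True the score is the mean log-prob,
    which avoids penalising longer endings.
    """
    scores = [
        sum(lps) / len(lps) if normalize else sum(lps)
        for lps in token_logprobs
    ]
    return max(range(len(scores)), key=scores.__getitem__)

# Made-up log-probabilities for four candidate endings of different lengths.
candidates = [
    [-2.0, -2.0],
    [-0.5, -0.6, -0.7, -0.8],
    [-1.0],
    [-3.0, -0.1],
]
print(pick_continuation(candidates, normalize=True))
print(pick_continuation(candidates, normalize=False))
```

In this toy case the two scoring rules pick different endings, so the normalisation choice is part of the evaluation setup and affects the reported accuracy.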

Medical and Domain Benchmarks

MedQA (USMLE-style):
  Multiple-choice questions in the style of the US Medical Licensing Examination (USMLE)
  LLaMA 2 70B: ~65%  |  GPT-4: ~87%  |  Med-PaLM 2: ~86%

MedMCQA:
  194K multiple-choice medical questions
  Drawn from Indian medical entrance exams (AIIMS, NEET PG); tests clinical knowledge

PubMedQA:
  500 biomedical research questions requiring yes/no/maybe + justification
  Tests biomedical reading comprehension

ClinicalBench (emerging):
  Questions that test understanding of real clinical notes
  Closer to real clinical work than USMLE-style exam questions

Reasoning and Alignment Benchmarks

TruthfulQA:
  817 questions designed to elicit common false beliefs (health, law, finance, and more)
  Measures: truthfulness, not just accuracy
  GPT-4: ~59% truthful  |  LLaMA 2 70B Chat: ~64%
  (Higher truthfulness ≠ higher capability: a more capable model
   can be less truthful if it is not aligned.)

BIG-bench Hard (BBH):
  23 challenging reasoning tasks from BIG-bench
  Tests reasoning beyond pattern-matching
  LLaMA 2 70B: 41.9%  |  GPT-4: 83.1%

ARC (AI2 Reasoning Challenge):
  Science questions at Grade 3-9 level (Easy and Challenge sets)
  Challenge set requires inference, not just retrieval

WinoGrande:
  44K Winograd schema-style commonsense problems
  Requires pronoun disambiguation using world knowledge

Interpreting Benchmarks

Common issues with benchmark leaderboards:

1. Data contamination:
   LLMs trained on internet data may have seen benchmark questions.
   MMLU questions appear in many datasets — models may memorise answers.
   Reported scores can be inflated vs. true generalisation.

2. Prompt sensitivity:
   "Answer with A, B, C, or D." vs "The answer is: ___" can change
   MMLU scores by 5-10 points.
   Reported scores depend heavily on evaluation harness details.

3. Few-shot vs zero-shot:
   5-shot and 0-shot MMLU scores can differ by 3-7 points.
   Always check the evaluation setup.

4. Benchmark saturation:
   GPT-4 is near human level on HellaSwag and several MMLU subjects.
   New harder benchmarks are needed (MMLU-Pro, GPQA, FrontierMath).

5. Goodhart's Law:
   Models are increasingly trained to optimise benchmarks specifically.
   Benchmark score ≠ general capability.
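
Issues 1 to 3 above can be made concrete with a small sketch. The first helper flags possible contamination by measuring word n-gram overlap between a benchmark item and a training document; the second pair builds the same multiple-choice question in different prompt styles and shot counts. All function names, the n=8 window, and the whitespace tokenisation are assumptions for illustration; real contamination checks and harness templates are more elaborate.

```python
def ngrams(text, n):
    """Set of lowercase word n-grams (naive whitespace tokenisation)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_fraction(benchmark_item, training_doc, n=8):
    """Share of the item's n-grams that also occur in the training document."""
    item = ngrams(benchmark_item, n)
    if not item:
        return 0.0
    return len(item & ngrams(training_doc, n)) / len(item)

def format_question(question, choices, style="letter"):
    """Render one 4-choice question; `style` switches the answer cue."""
    lines = [question] + [f"{letter}. {c}" for letter, c in zip("ABCD", choices)]
    cue = "Answer with A, B, C, or D." if style == "letter" else "The answer is:"
    return "\n".join(lines + [cue])

def few_shot_prompt(shots, question, choices, style="letter"):
    """shots: list of (question, choices, gold_letter); [] gives a zero-shot prompt."""
    blocks = [format_question(q, c, style) + " " + gold for q, c, gold in shots]
    blocks.append(format_question(question, choices, style))
    return "\n\n".join(blocks)

item = "Which planet in the solar system has the largest number of known moons"
leaked = "quiz dump: " + item.lower() + " answer saturn"
print(overlap_fraction(item, leaked), overlap_fraction(item, "saturn has many moons"))

shot = ("2 + 2 = ?", ["3", "4", "5", "6"], "B")
print(few_shot_prompt([shot], "3 + 3 = ?", ["5", "6", "7", "8"], style="cloze"))
```

Running the same model over prompts built with different `style` values or shot counts is a simple way to see how much of a reported score depends on the evaluation setup.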

MMLU-Pro and GPQA

Harder replacements for MMLU:

MMLU-Pro:
  12,032 questions across 14 domains
  10-choice instead of 4-choice (harder to guess)
  Questions require deeper reasoning
  GPT-4o: 72.6%  |  LLaMA 3 70B: 62.9%
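
Moving from 4 to 10 choices lowers the random-guess baseline from 25% to 10%, so raw accuracies on the two benchmarks are not directly comparable. One way to put them on a common scale is to rescale accuracy so that chance maps to 0 and a perfect score to 1. This normalisation is an illustrative assumption, not part of either benchmark's official scoring, and it assumes every question has exactly the stated number of options.

```python
def chance_baseline(n_choices: int) -> float:
    """Expected accuracy of uniform random guessing."""
    return 1.0 / n_choices

def chance_adjusted(accuracy: float, n_choices: int) -> float:
    """Rescale accuracy so that chance -> 0.0 and perfect -> 1.0."""
    chance = chance_baseline(n_choices)
    return (accuracy - chance) / (1.0 - chance)

# The 86.4% MMLU and 72.6% MMLU-Pro figures quoted in this article.
mmlu = chance_adjusted(0.864, n_choices=4)
mmlu_pro = chance_adjusted(0.726, n_choices=10)
print(f"MMLU: {mmlu:.3f}  MMLU-Pro: {mmlu_pro:.3f}")
```

The gap between the two adjusted scores is smaller than the gap between the raw accuracies, because part of a raw 4-choice score comes from guessing.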

GPQA (Graduate-Level Google-Proof Q&A):
  448 expert-level questions in biology, chemistry, physics
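  Domain experts (PhD holders or candidates in the field): ~65%
  (74% after discounting mistakes the experts later recognised)
  Skilled non-experts with unrestricted web access: ~34%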
  GPT-4: 53%  |  Claude 3 Opus: 50%
  Measures: expert-level knowledge that cannot be looked up with a quick web search
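
With only 448 questions, GPQA scores carry noticeable sampling noise. The sketch below puts a rough 95% interval around an accuracy using the normal approximation to the binomial. Treating questions as independent draws and using z = 1.96 are simplifying assumptions; the helper names are hypothetical.

```python
from math import sqrt

def accuracy_interval(accuracy: float, n_questions: int, z: float = 1.96):
    """Approximate confidence interval for an accuracy measured on n questions."""
    se = sqrt(accuracy * (1.0 - accuracy) / n_questions)  # binomial standard error
    return accuracy - z * se, accuracy + z * se

def intervals_overlap(a, b) -> bool:
    return a[0] <= b[1] and b[0] <= a[1]

# The two model scores quoted above, on 448 questions.
first = accuracy_interval(0.53, 448)
second = accuracy_interval(0.50, 448)
print(f"53%: {first[0]:.3f} to {first[1]:.3f}")
print(f"50%: {second[0]:.3f} to {second[1]:.3f}")
print("overlap:", intervals_overlap(first, second))
```

Under these assumptions each interval spans several points either side of the score, so a three-point gap on a benchmark this small is weak evidence of a real difference.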

Evaluation Harnesses


# Using lm-evaluation-harness (EleutherAI)
# pip install lm-eval

# Command line:
# lm_eval --model hf \
#         --model_args pretrained=meta-llama/Llama-2-7b-hf \
#         --tasks mmlu,gsm8k,hellaswag \
#         --num_fewshot 5 \
#         --batch_size 8 \
#         --output_path results/llama-7b

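# Python API (same settings as the command line, with fewer tasks):
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",                                        # Hugging Face transformers backend
    model_args="pretrained=meta-llama/Llama-2-7b-hf",  # checkpoint to load
    tasks=["mmlu", "hellaswag"],                       # task names as registered in the harness
    num_fewshot=5,                                     # in-context examples per question
)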
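
Once an evaluation finishes, the returned dictionary can be flattened into a small table. The sketch below assumes the dictionary has a "results" key mapping task names to metric dictionaries; the metric key names and values in the stand-in data are invented for illustration and may differ between harness versions.

```python
def summarize(results):
    """Flatten results["results"] into sorted (task, metric, value) rows."""
    rows = []
    for task, metrics in sorted(results["results"].items()):
        for name, value in sorted(metrics.items()):
            # Keep numeric metrics only; skip aliases and other string fields.
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                rows.append((task, name, float(value)))
    return rows

# Hypothetical stand-in for the dictionary returned by simple_evaluate.
# Key names and numbers are made up, not real evaluation output.
fake_results = {
    "results": {
        "hellaswag": {"alias": "hellaswag", "acc,none": 0.5, "acc_norm,none": 0.75},
        "mmlu": {"alias": "mmlu", "acc,none": 0.25},
    }
}

for task, name, value in summarize(fake_results):
    print(f"{task:<12} {name:<16} {value:.3f}")
```

Saving this table alongside the few-shot count and prompt settings makes it easier to compare runs on equal terms.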

Interview Answer

"Key LLM benchmarks: MMLU (57-subject knowledge, 4-choice, human expert ~89%); HumanEval (Python code generation, pass@1); GSM8K (multi-step math word problems); HellaSwag (commonsense sentence completion); TruthfulQA (elicits false beliefs — measures truthfulness). Medical: MedQA (USMLE-style), MedMCQA, PubMedQA. Interpreting benchmark claims requires caution: data contamination, prompt format sensitivity, few-shot vs zero-shot differences, and Goodhart's Law — models trained specifically on benchmark tasks inflate scores without genuine improvement. MMLU-Pro and GPQA are harder replacements that resist saturation better."