Popular Benchmarks: MMLU, HumanEval, TruthfulQA

Why Benchmarks Matter for Model Selection

When choosing a base model or evaluating fine-tuning results, standardized benchmarks give you an objective comparison point. Without them, you're comparing models on subjective impressions.

The key insight: no single benchmark captures everything. Different benchmarks measure different capabilities — use multiple to get a complete picture.


Knowledge and Reasoning Benchmarks

MMLU (Massive Multitask Language Understanding)

What it measures: Academic knowledge across 57 subjects: medicine, law, mathematics, history, computer science, and more. Multiple-choice format.

Format: 4-choice questions, 5-shot by default.

Why it matters: Good proxy for retained knowledge. If MMLU drops noticeably after fine-tuning, the model has likely lost some general knowledge (catastrophic forgetting).

Scores (approximate, as reported; they vary with prompt format and shot count): GPT-4o: ~88%, Llama-3.1-70B: ~86%, Llama-3.1-8B: ~73%
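To make the "4-choice, 5-shot" format concrete, here is a minimal sketch of how a few-shot multiple-choice prompt can be assembled. The `MCQ` class and both function names are hypothetical, and the header wording approximates the commonly used MMLU prompt; evaluation harnesses may format it slightly differently.

```python
from __future__ import annotations

from dataclasses import dataclass

CHOICE_LABELS = ["A", "B", "C", "D"]


@dataclass
class MCQ:
    """One multiple-choice item (hypothetical container for this sketch)."""
    question: str
    choices: list[str]  # exactly four options
    answer: str         # gold label, one of "A".."D"


def format_question(item: MCQ, include_answer: bool) -> str:
    """Render a question; solved examples show the answer, the target does not."""
    lines = [item.question]
    lines += [f"{label}. {choice}" for label, choice in zip(CHOICE_LABELS, item.choices)]
    lines.append(f"Answer: {item.answer}" if include_answer else "Answer:")
    return "\n".join(lines)


def build_few_shot_prompt(subject: str, shots: list[MCQ], target: MCQ) -> str:
    """Prepend k solved examples (k = len(shots)) to the unanswered target."""
    header = f"The following are multiple choice questions (with answers) about {subject}."
    blocks = [format_question(shot, include_answer=True) for shot in shots]
    blocks.append(format_question(target, include_answer=False))
    return header + "\n\n" + "\n\n".join(blocks)
```

With five items in `shots` this produces a 5-shot prompt. The model is then scored on whether the letter it ranks highest after the final "Answer:" matches the gold label.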

ARC (AI2 Reasoning Challenge)

What it measures: Science questions at Grade 3–9 level. ARC-Easy and ARC-Challenge (harder questions that simple retrieval methods fail on).

Why it matters: Tests genuine reasoning rather than memorization.

HellaSwag

What it measures: Commonsense reasoning. Given the beginning of a scenario, pick the most plausible continuation.

Why it matters: Tests whether the model understands how the physical and social world works.

TruthfulQA

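What it measures: Whether a model avoids repeating common human misconceptions. Its 817 questions are written so that a model imitating popular but false beliefs from its training data will answer incorrectly.

Why it matters: It probes one important source of confidently wrong answers: falsehoods learned from training text. It does not cover every kind of hallucination. Models with RLHF alignment generally score higher.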


Code and Math Benchmarks

HumanEval

What it measures: Python coding ability. 164 programming problems with unit tests. Model generates code; tests determine correctness.

Metric: pass@1 (fraction of problems whose first sampled solution passes all tests) and pass@k (fraction of problems for which at least one of k sampled solutions passes).
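In practice pass@k is usually estimated by drawing n ≥ k samples per problem, counting the c samples that pass, and applying the unbiased estimator 1 − C(n−c, k) / C(n, k). The sketch below assumes that estimator; the function names are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k includes a pass.
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(per_problem: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over (n, c) pairs, one pair per benchmark problem."""
    return sum(pass_at_k(n, c, k) for n, c in per_problem) / len(per_problem)
```

For k = 1 the estimator reduces to c / n, the plain fraction of passing samples.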

Scores: GPT-4o: ~90%, Claude 3.5 Sonnet: ~92%, Llama-3.1-8B: ~72%

Python
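# Simplified HumanEval-style problem (illustrative, not a real dataset entry).
# Real entries also carry "task_id" and "canonical_solution", and wrap the
# asserts in a check(candidate) function.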
problem = {
    "prompt": "def add(a: int, b: int) -> int:\n    \"\"\"Return a + b.\"\"\"\n",
    "test": "assert add(1, 2) == 3\nassert add(-1, 1) == 0",
    "entry_point": "add",
}
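The "tests determine correctness" step can be sketched as follows: concatenate the prompt, the model's completion, and the tests, then execute the result. This is a simplified illustration using the bare-assert test style shown above; `passes_tests` is a hypothetical helper, and a real harness would run untrusted code in a sandboxed subprocess with a timeout.

```python
def passes_tests(prompt: str, completion: str, test: str) -> bool:
    """Return True if prompt + completion runs and satisfies every assert."""
    program = prompt + completion + "\n\n" + test
    namespace: dict = {}
    try:
        # WARNING: exec runs arbitrary code. Only acceptable here because
        # the inputs are trusted; model output should be sandboxed.
        exec(program, namespace)
    except Exception:
        # Syntax errors, runtime errors, and failed asserts all count as a miss.
        return False
    return True


problem = {
    "prompt": 'def add(a: int, b: int) -> int:\n    """Return a + b."""\n',
    "test": "assert add(1, 2) == 3\nassert add(-1, 1) == 0",
    "entry_point": "add",
}

# Two candidate completions (function bodies) a model might produce.
correct_completion = "    return a + b\n"
wrong_completion = "    return a - b\n"

results = [
    passes_tests(problem["prompt"], correct_completion, problem["test"]),
    passes_tests(problem["prompt"], wrong_completion, problem["test"]),
]
```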

GSM8K (Grade School Math)

What it measures: Grade school math word problems requiring multi-step reasoning.

Why it matters: Tests arithmetic reasoning and multi-step problem decomposition.
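GSM8K reference solutions end with a line of the form `#### <answer>`, so scoring usually comes down to pulling a final number out of the model's output and comparing it with that value. A minimal sketch, assuming the model's last number is its answer (the helper names are illustrative, and real harnesses use more careful extraction rules):

```python
import re
from typing import Optional

# An integer or decimal, optionally negative, possibly with thousands separators.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def normalize(number_text: str) -> str:
    """Strip whitespace and thousands separators."""
    return number_text.strip().replace(",", "")


def reference_answer(solution: str) -> str:
    """Take the text after the final '####' marker of a reference solution."""
    return normalize(solution.split("####")[-1])


def predicted_answer(model_output: str) -> Optional[str]:
    """Assumption: the last number in the output is the model's final answer."""
    matches = NUMBER.findall(model_output)
    return normalize(matches[-1]) if matches else None


def is_correct(model_output: str, solution: str) -> bool:
    """Compare predicted and reference answers numerically."""
    predicted = predicted_answer(model_output)
    if predicted is None:
        return False
    return float(predicted) == float(reference_answer(solution))
```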

MATH

Harder than GSM8K — competition math problems. Tests whether models can handle complex symbolic reasoning.


Instruction Following Benchmarks

MT-Bench

What it measures: Multi-turn conversational ability across 8 categories: writing, roleplay, extraction, reasoning, math, coding, STEM, humanities. Scored by GPT-4 as judge.

Format: 80 multi-turn questions. GPT-4 scores each response 1–10.

Why it matters: Tests real-world usefulness, not just academic knowledge.

AlpacaEval

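What it measures: Single-turn instruction following. An LLM judge compares the model's responses against reference outputs from a fixed baseline model (GPT-4 Turbo in AlpacaEval 2.0).

Metric: Win rate against the baseline's outputs. AlpacaEval 2.0 also reports a length-controlled win rate to reduce the judge's bias toward longer answers.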


Human Preference Benchmarks

Chatbot Arena (LMSYS)

What it measures: Direct human preference via blind pairwise comparisons. Users chat with two anonymous models and vote for the better response.

Why it matters: Widely treated as the gold standard for real-world quality because it uses actual users, real conversations, and blind evaluation. The Arena's Elo-style rating is one of the most trusted overall quality signals.

Limitation: Slow to update, focuses on general conversation rather than specific domains.
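To show how pairwise votes turn into a rating, here is a minimal sketch of the classic Elo update. The K-factor of 32 and the starting rating of 1000 are assumptions for illustration; the live leaderboard's exact computation may differ.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(
    rating_a: float, rating_b: float, score_a: float, k: float = 32.0
) -> tuple[float, float]:
    """Update both ratings after one vote.

    score_a is 1.0 if A won, 0.0 if B won, and 0.5 for a tie.
    """
    expected_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b


# Two models start level (assumed initial rating); a user prefers model A.
new_a, new_b = update_elo(1000.0, 1000.0, score_a=1.0)
```

An upset moves ratings more than an expected result: beating a higher-rated model yields a larger gain than beating a lower-rated one.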


Domain-Specific Benchmarks

For specialized applications, standard benchmarks may not reflect your use case. Examples:

| Domain | Benchmark | What it tests |
|---|---|---|
| Medical | MedQA (USMLE) | Clinical medicine MCQs |
| Medical | MedMCQA | Indian medical entrance questions |
| Legal | LegalBench | Legal reasoning and analysis |
| Code | SWE-bench | Real GitHub issues (harder than HumanEval) |
| Safety | BBQ | Bias in question answering |
| Long context | SCROLLS | Long document comprehension |


Running Benchmarks with lm-evaluation-harness

Bash
pip install lm-eval

Python
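import json

from lm_eval import evaluator

# Evaluate a model on multiple benchmarks.
# Note: num_fewshot applies to every task listed. Published scores often use
# different shot counts per task, so match the setup you are comparing against.
results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3.1-8B-Instruct,dtype=bfloat16",
    tasks=["mmlu", "hellaswag", "arc_challenge", "truthfulqa_mc1"],
    num_fewshot=5,
    batch_size=8,
    device="cuda",
)

# simple_evaluate returns a dict; output_path is a CLI flag, not an argument
# of this function (as of lm-eval 0.4.x), so save the results explicitly.
with open("benchmark_results.json", "w") as f:
    json.dump(results["results"], f, indent=2, default=str)

# Print numeric metrics (keys look like "acc,none" in recent versions)
for task, task_results in results["results"].items():
    for metric, value in task_results.items():
        if isinstance(value, float):
            print(f"{task} | {metric}: {value:.4f}")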

Benchmark Limitations

Contamination: Many models have seen benchmark data during pre-training, so MMLU scores may partly reflect memorization rather than reasoning. Newer benchmarks such as GPQA are less likely to appear in training data, which reduces (but does not eliminate) this risk.

Distribution mismatch: Benchmark performance doesn't guarantee performance on your specific use case. A model that scores high on MMLU medical questions may still fail on specialized pharmaceutical interaction queries.

Saturation: Frontier models score 85–90%+ on MMLU. The benchmark no longer meaningfully differentiates top models. Newer benchmarks like MMLU-Pro and GPQA are designed to be harder.

Isolated questions vs real use: Most benchmarks test standalone questions. Real applications involve multi-turn context, system prompts, tools, and domain-specific formats. Always supplement benchmark evaluation with real task evaluation.
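For the contamination concern above, one rough check is to measure how many of a benchmark item's word n-grams also occur in a sample of training text. The sketch below is a simplified illustration: whitespace tokenization, the default n of 8, and the function names are all assumptions, and a high overlap is a signal to investigate rather than proof of contamination.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Lowercased word n-grams using simple whitespace tokenization."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(benchmark_item: str, corpus_docs: list[str], n: int = 8) -> float:
    """Fraction of the item's n-grams that appear anywhere in the corpus sample."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item shorter than n tokens: nothing to compare
    corpus_grams: set[tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)
```

Items with a fraction near 1.0 probably appear verbatim in the sampled text and are worth excluding or reporting separately.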


How to Use Benchmarks in Practice

  1. Base model selection: Use Chatbot Arena Elo, MMLU, and a task-specific benchmark together to choose between Llama, Mistral, and Gemma variants.
  2. Post-fine-tuning check: Run MMLU before and after fine-tuning, and accept a regression of less than 3%.
  3. Domain validation: Build a custom MCQ benchmark from your domain (100–200 questions). This is more informative than MMLU for your use case.
  4. Monthly tracking: Run your benchmark suite monthly on the deployed model version to catch degradation from model updates.
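The post-fine-tuning and monthly checks above can share one small comparison step. The sketch below flags benchmarks whose score fell by more than a threshold; it assumes scores are stored as fractions in [0, 1] and reads the 3% tolerance as an absolute drop of 0.03, which is one interpretation. The function name and the example scores are hypothetical.

```python
def find_regressions(
    before: dict[str, float],
    after: dict[str, float],
    max_drop: float = 0.03,
) -> dict[str, float]:
    """Return {benchmark: drop} for every score that fell by more than max_drop.

    Scores are fractions in [0, 1]; max_drop is an absolute difference
    (assumption: 0.03 corresponds to the 3% tolerance).
    """
    regressions = {}
    for name, old_score in before.items():
        if name not in after:
            continue  # benchmark was not re-run; nothing to compare
        drop = old_score - after[name]
        if drop > max_drop:
            regressions[name] = drop
    return regressions


# Hypothetical scores before and after fine-tuning.
baseline = {"mmlu": 0.70, "hellaswag": 0.80, "custom_domain_mcq": 0.55}
finetuned = {"mmlu": 0.65, "hellaswag": 0.79, "custom_domain_mcq": 0.81}

flagged = find_regressions(baseline, finetuned)
```

Here only `mmlu` is flagged: it fell by about 0.05, while the `hellaswag` drop of about 0.01 stays within tolerance and the custom benchmark improved.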