Popular LLM Benchmarks Explained

Understand MMLU, HellaSwag, HumanEval, MT-Bench, Chatbot Arena, and other standard benchmarks. Learn what each measures and how to use them for model selection.

Asma Hafeez Khan | May 16, 2026 | 5 min read
Tags: Evaluation, Benchmarks, MMLU, HumanEval, Python

Why Benchmarks Matter for Model Selection

When choosing a base model or evaluating fine-tuning results, standardized benchmarks give you an objective comparison point. Without them, you're comparing models on subjective impressions.

The key insight: no single benchmark captures everything. Different benchmarks measure different capabilities — use multiple to get a complete picture.


Knowledge and Reasoning Benchmarks

MMLU (Massive Multitask Language Understanding)

What it measures: Academic knowledge across 57 subjects: medicine, law, mathematics, history, computer science, and more. Multiple-choice format.

Format: 4-choice questions, usually reported 5-shot.

Why it matters: Good proxy for retained knowledge. A clear MMLU drop after fine-tuning suggests the model has forgotten some general knowledge (catastrophic forgetting).

Scores (approximate published figures; prompting setups differ between reports): GPT-4o: ~88%, Llama-3.1-70B: ~86%, Llama-3.1-8B: ~73%
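
To make the "4-choice, 5-shot" format concrete, here is a minimal sketch of how a few-shot multiple-choice prompt is commonly assembled. The helper names and the toy questions are invented for illustration, and the exact template varies between evaluation frameworks.

```python
CHOICES = ["A", "B", "C", "D"]

def format_question(question, options, answer=None):
    """Render one question; the answer letter is filled in only for solved examples."""
    lines = [question]
    lines += [f"{letter}. {text}" for letter, text in zip(CHOICES, options)]
    lines.append("Answer:" + (f" {answer}" if answer else ""))
    return "\n".join(lines)

def build_prompt(subject, shots, target):
    """Header, then k solved examples, then the unsolved target question."""
    header = f"The following are multiple choice questions (with answers) about {subject}."
    blocks = [format_question(q, opts, ans) for q, opts, ans in shots]
    blocks.append(format_question(*target))
    return header + "\n\n" + "\n\n".join(blocks)

# Toy data, invented for illustration (not taken from MMLU).
shots = [("What is 2 + 2?", ["3", "4", "5", "6"], "B")]
target = ("What is 3 * 3?", ["6", "8", "9", "12"])

prompt = build_prompt("elementary mathematics", shots, target)
print(prompt)
```

The model is then scored on which answer letter it considers most likely after the final "Answer:". A real 5-shot run would pass five solved examples instead of one.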

ARC (AI2 Reasoning Challenge)

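What it measures: Multiple-choice science questions at Grade 3–9 level, split into ARC-Easy and ARC-Challenge. The Challenge set keeps only questions that simple retrieval and word co-occurrence baselines answered incorrectly.

Why it matters: Because the Challenge set filters out questions solvable by surface-level lookup, it leans more on reasoning than on memorization.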

HellaSwag

What it measures: Commonsense reasoning. Given the beginning of a scenario, pick the most plausible continuation.

Why it matters: Tests whether the model understands how the physical and social world works.
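
Benchmarks like HellaSwag are typically scored without asking the model to "choose" anything: the evaluator computes the log-likelihood of each candidate ending and picks the highest. Longer endings accumulate more negative log-probability, so a length-normalized variant is often reported as well (lm-evaluation-harness calls it `acc_norm`). The sketch below assumes the log-likelihoods have already been computed; the endings and numbers are made up.

```python
def pick_ending(endings, logprobs, normalize=True):
    """Return the index of the ending with the best (optionally per-byte) log-likelihood."""
    def score(i):
        length = len(endings[i].encode("utf-8")) if normalize else 1
        return logprobs[i] / length
    return max(range(len(endings)), key=score)

# Hypothetical context: "A person mixes cake batter in a bowl. The person"
endings = [
    "pours the batter into a pan and puts it in the oven.",
    "eats the spoon.",
]
# Hypothetical summed log-likelihoods of each ending given the context.
logprobs = [-26.0, -12.0]

raw_choice = pick_ending(endings, logprobs, normalize=False)
norm_choice = pick_ending(endings, logprobs, normalize=True)
print(raw_choice, norm_choice)
```

With these numbers the raw score favors the short, implausible ending, while the normalized score favors the longer, plausible one. This is why the two accuracy variants can differ for the same model.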

TruthfulQA

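What it measures: Whether a model answers truthfully on 817 questions written so that a common misconception or false belief is the tempting answer.

Why it matters: It probes a specific failure mode: repeating popular falsehoods learned from training data. That is narrower than hallucination in general, so treat it as one signal rather than a complete hallucination test. Models with RLHF alignment generally score higher.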


Code and Math Benchmarks

HumanEval

What it measures: Python coding ability. 164 programming problems with unit tests. Model generates code; tests determine correctness.

Metric: pass@k, the probability that at least one of k sampled completions passes all unit tests. pass@1 is the single-attempt success rate and is the figure most often quoted.

Scores (approximate published pass@1 figures): GPT-4o: ~90%, Claude 3.5 Sonnet: ~92%, Llama-3.1-8B: ~72%
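
In practice pass@k is usually estimated by drawing n ≥ k samples per problem, counting the c samples that pass, and applying the unbiased estimator 1 − C(n−c, k) / C(n, k), then averaging over problems. A minimal sketch follows; the per-problem counts are invented for illustration.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them passed."""
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical (n, c) counts for three problems.
counts = [(10, 3), (10, 0), (10, 10)]

for k in (1, 5):
    score = sum(pass_at_k(n, c, k) for n, c in counts) / len(counts)
    print(f"pass@{k}: {score:.3f}")
```

Sampling more than k completions and using this estimator gives a lower-variance number than literally running k attempts once.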

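Python
# HumanEval record (field names as in the released dataset; toy problem for illustration)
problem = {
    "task_id": "Example/0",
    "prompt": "def add(a: int, b: int) -> int:\n    \"\"\"Return a + b.\"\"\"\n",
    "canonical_solution": "    return a + b\n",
    # Tests are wrapped in check(candidate), which is then called on the entry point
    "test": "def check(candidate):\n    assert candidate(1, 2) == 3\n    assert candidate(-1, 1) == 0\n",
    "entry_point": "add",
}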
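
To show how "tests determine correctness" works, here is a simplified sketch of checking one completion. It assumes the `check(candidate)` convention of the released dataset, where the tests live in a function that is called on the entry point. The `passes_tests` helper is hypothetical, and a real harness runs this in a sandboxed subprocess with a timeout.

```python
problem = {
    "prompt": 'def add(a: int, b: int) -> int:\n    """Return a + b."""\n',
    "test": "def check(candidate):\n    assert candidate(1, 2) == 3\n    assert candidate(-1, 1) == 0\n",
    "entry_point": "add",
}

def passes_tests(problem, completion):
    """Join prompt + completion + tests into one program and run it."""
    program = (
        problem["prompt"]
        + completion
        + "\n\n"
        + problem["test"]
        + f"\ncheck({problem['entry_point']})\n"
    )
    namespace = {}
    try:
        # WARNING: exec runs arbitrary code. Only use on trusted input or inside a sandbox.
        exec(program, namespace)
        return True
    except Exception:
        return False

good = passes_tests(problem, "    return a + b\n")
bad = passes_tests(problem, "    return a - b\n")
print(good, bad)
```

The model only ever sees `prompt`; the completion is whatever it generates as the function body.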

GSM8K (Grade School Math)

What it measures: Grade school math word problems requiring multi-step reasoning.

Why it matters: Tests arithmetic reasoning and multi-step problem decomposition.
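
GSM8K reference solutions end with a line of the form `#### <answer>`, so scoring usually reduces to extracting a final number from the model's output and comparing it with that value. The sketch below uses a simple "last number in the text" heuristic, which is an assumption rather than an official scoring rule; the example problem is invented.

```python
import re

def reference_answer(solution):
    """Final answer from a GSM8K-style solution that ends in '#### <number>'."""
    return solution.split("####")[-1].strip().replace(",", "")

def predicted_answer(text):
    """Heuristic: treat the last number in the model output as its answer."""
    numbers = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    return numbers[-1].replace(",", "") if numbers else None

def is_correct(model_output, solution):
    pred = predicted_answer(model_output)
    return pred is not None and float(pred) == float(reference_answer(solution))

# Invented example in the GSM8K solution format.
solution = (
    "Tom has 3 boxes of 12 pencils, so 3 * 12 = 36 pencils.\n"
    "He gives away 10, leaving 36 - 10 = 26.\n"
    "#### 26"
)
model_output = "3 boxes of 12 is 36. After giving away 10 he has 26 pencils."

print(is_correct(model_output, solution))
```

Answer extraction is a common source of score differences between evaluation setups, so keep it fixed when comparing models.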

MATH

What it measures: Competition-level math problems, considerably harder than GSM8K.

Why it matters: Tests whether models can handle complex symbolic reasoning rather than just grade-school arithmetic.


Instruction Following Benchmarks

MT-Bench

What it measures: Multi-turn conversational ability across 8 categories: writing, roleplay, extraction, reasoning, math, coding, STEM, humanities.

Format: 80 questions, each with a follow-up turn. GPT-4 acts as the judge and scores each response from 1 to 10.

Why it matters: Closer to real chat usage than multiple-choice knowledge tests, though the scores inherit whatever biases the judge model has.
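
Once the judge has scored every response, the headline number is simply an average, and a per-category breakdown is often more useful than the overall figure. A minimal sketch of that aggregation follows; the judgment tuples are hypothetical and cover only three of the eight categories.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical judge outputs: (category, turn, score on a 1-10 scale).
judgments = [
    ("writing", 1, 9), ("writing", 2, 7),
    ("math", 1, 6), ("math", 2, 4),
    ("coding", 1, 8), ("coding", 2, 6),
]

def summarize(judgments):
    """Return (average score per category, overall average)."""
    by_category = defaultdict(list)
    for category, _turn, score in judgments:
        by_category[category].append(score)
    per_category = {name: mean(scores) for name, scores in by_category.items()}
    overall = mean(score for _, _, score in judgments)
    return per_category, overall

per_category, overall = summarize(judgments)
for name, value in per_category.items():
    print(f"{name}: {value:.2f}")
print(f"overall: {overall:.2f}")
```

Two models with the same overall score can have very different category profiles, for example strong writing but weak math.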

AlpacaEval

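What it measures: Single-turn instruction following. An LLM judge compares the model's responses against a fixed reference model's outputs (GPT-4 Turbo in AlpacaEval 2.0).

Metric: Win rate against the reference outputs. AlpacaEval 2.0 also reports a length-controlled win rate, because LLM judges tend to favor longer answers.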


Human Preference Benchmarks

Chatbot Arena (LMSYS)

What it measures: Direct human preference via blind pairwise comparisons. Users chat with two anonymous models and vote for the better response.

Why it matters: Widely treated as the closest thing to a gold standard for real-world quality: actual users, real conversations, blind evaluation. The Elo-style rating from Chatbot Arena is one of the most trusted overall quality signals.

Limitations: New models need time to collect enough votes, and the ratings reflect general conversation rather than specific domains.
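
To illustrate how pairwise votes become a rating, here is a sketch of the classic online Elo update. The starting rating of 1000, the K-factor, the model names, and the votes are all assumptions chosen for the example, not the leaderboard's actual settings or data.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(ratings, model_a, model_b, outcome, k=32.0):
    """Apply one vote. outcome: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    expected_a = expected_score(ratings[model_a], ratings[model_b])
    delta = k * (outcome - expected_a)
    ratings[model_a] += delta
    ratings[model_b] -= delta  # zero-sum: B loses exactly what A gains

# Hypothetical models and votes.
ratings = {"model-x": 1000.0, "model-y": 1000.0}
votes = [("model-x", "model-y", 1.0), ("model-x", "model-y", 1.0), ("model-x", "model-y", 0.0)]

for model_a, model_b, outcome in votes:
    update(ratings, model_a, model_b, outcome)

print(ratings)
```

An upset (a low-rated model beating a high-rated one) moves the ratings more than an expected result. Online Elo also depends on the order of votes, which is one reason a rating needs many votes before it stabilizes.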


Domain-Specific Benchmarks

For specialized applications, standard benchmarks may not reflect your use case. Examples:

| Domain | Benchmark | What it tests |
|---|---|---|
| Medical | MedQA (USMLE) | Clinical medicine MCQs |
| Medical | MedMCQA | Indian medical entrance questions |
| Legal | LegalBench | Legal reasoning and analysis |
| Code | SWE-bench | Real GitHub issues (harder than HumanEval) |
| Safety | BBQ | Bias in question answering |
| Long context | SCROLLS | Long document comprehension |


Running Benchmarks with lm-evaluation-harness

Bash
pip install lm-eval
Python
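import json

from lm_eval import evaluator

# Evaluate a model on multiple benchmarks.
# Note: num_fewshot applies to every task listed, overriding each task's default.
results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3.1-8B-Instruct,dtype=bfloat16",
    tasks=["mmlu", "hellaswag", "arc_challenge", "truthfulqa_mc1"],
    num_fewshot=5,
    batch_size=8,
    device="cuda",
)

# In recent lm-eval releases (0.4.x), simple_evaluate returns a dict and takes no
# output_path argument (that is a CLI option), so save the scores explicitly.
with open("benchmark_results.json", "w") as f:
    json.dump(results["results"], f, indent=2, default=str)

# Print results (metric keys look like "acc,none"; non-numeric entries are skipped)
for task, task_results in results["results"].items():
    for metric, value in task_results.items():
        if isinstance(value, float):
            print(f"{task} | {metric}: {value:.4f}")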

Benchmark Limitations

Contamination: Many models have seen benchmark data during pre-training, so a high MMLU score may partly reflect memorization rather than reasoning. Newer benchmarks (like GPQA) are less likely to have leaked into training corpora, but no public benchmark stays clean indefinitely.
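
If you have access to a training or fine-tuning corpus, a rough first check for contamination is n-gram overlap between benchmark items and corpus documents. The sketch below is a deliberately simple heuristic, with whitespace tokenization and an n-gram size chosen as assumptions; the texts are invented.

```python
def ngrams(text, n):
    """Set of lowercase whitespace-token n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(benchmark_item, corpus_doc, n=8):
    """Fraction of the item's n-grams that also appear in the corpus document."""
    item = ngrams(benchmark_item, n)
    if not item:
        return 0.0  # item shorter than n tokens: nothing to compare
    return len(item & ngrams(corpus_doc, n)) / len(item)

# Invented texts; n=3 only because the toy item is short.
item = "which planet is known as the red planet"
leaked_doc = "trivia: which planet is known as the red planet ? mars"
clean_doc = "the recipe needs two cups of flour and one egg"

print(overlap_ratio(item, leaked_doc, n=3))
print(overlap_ratio(item, clean_doc, n=3))
```

A high ratio flags an item for manual review. A low ratio does not prove the item is clean, because paraphrased or translated copies will not match.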

Distribution mismatch: Benchmark performance doesn't guarantee performance on your specific use case. A model that scores high on MMLU medical questions may still fail on specialized pharmaceutical interaction queries.

Saturation: Frontier models score 85–90%+ on MMLU. The benchmark no longer meaningfully differentiates top models. Newer benchmarks like MMLU-Pro and GPQA are designed to be harder.

Single-shot vs real use: Benchmarks test isolated questions. Real applications involve multi-turn context, system prompts, tools, and domain-specific formats. Always supplement benchmark evaluation with real task evaluation.


How to Use Benchmarks in Practice

  1. Base model selection: Combine Chatbot Arena Elo, MMLU, and a task-specific benchmark to choose between Llama, Mistral, and Gemma variants.
  2. Post-fine-tuning check: Run MMLU before and after fine-tuning with identical settings, and accept a regression of less than 3%.
  3. Domain validation: Build a custom MCQ benchmark from your domain (100–200 questions). For your use case, this is more informative than MMLU.
  4. Monthly tracking: Run your benchmark suite monthly on the deployed model version to catch degradation from model updates.
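
The before/after check in step 2 is easy to automate. The sketch below reads the 3% threshold as 3 absolute accuracy points, which is an assumption, and the function name and scores are hypothetical.

```python
def find_regressions(before, after, max_drop=0.03):
    """Return {task: drop} for tasks whose score fell by more than max_drop (absolute)."""
    flagged = {}
    for task, base_score in before.items():
        if task not in after:
            continue  # task was not re-run; nothing to compare
        drop = base_score - after[task]
        if drop > max_drop:
            flagged[task] = drop
    return flagged

# Hypothetical accuracy scores before and after fine-tuning.
before = {"mmlu": 0.69, "hellaswag": 0.790}
after = {"mmlu": 0.64, "hellaswag": 0.785}

flagged = find_regressions(before, after)
for task, drop in flagged.items():
    print(f"{task}: dropped {drop:.3f}")
```

The same function works for step 4: store each month's scores and compare them against the previous run.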
