Benchmarking Fine-Tuned Models

Why Benchmarks Matter

Internal evaluation on your training distribution tells you whether your model learned the training data. Standard benchmarks tell you whether fine-tuning damaged general capabilities.

A common failure mode: fine-tuning improves task performance by 20% but reduces the MMLU score by 15%. The model has traded general reasoning for narrow specialization.

Standard LLM Benchmarks

MMLU (Massive Multitask Language Understanding)

Tests knowledge across 57 academic subjects (medicine, law, history, STEM). A good proxy for retained general knowledge after fine-tuning.

Python

from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=./fine-tuned-model",
    tasks=["mmlu"],
    num_fewshot=5,
    device="cuda",
)

# Assumes lm-eval 0.4+, where metric keys carry a filter suffix such as "acc,none"
print(f"MMLU: {results['results']['mmlu']['acc,none']:.3f}")

Baseline: Many recent 7–8B instruction models score roughly 0.60–0.65 on 5-shot MMLU. If fine-tuning drops this by more than 3–5 percentage points, the model may be catastrophically forgetting general knowledge.
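To make the threshold concrete, the sketch below separates the absolute drop (in percentage points) from the relative drop, since the two are easy to confuse. The function name `mmlu_regression` and the default 3-point budget are illustrative assumptions, not part of lm-eval or any other library.

```python
def mmlu_regression(base_acc: float, tuned_acc: float, max_drop_points: float = 3.0) -> dict:
    """Compare two accuracy scores given as fractions in [0, 1]."""
    # Absolute drop in percentage points, rounded to absorb float noise
    drop_points = round((base_acc - tuned_acc) * 100, 2)
    # Drop relative to the base score, in percent
    relative_drop_pct = round((base_acc - tuned_acc) / base_acc * 100, 2)
    return {
        "drop_points": drop_points,
        "relative_drop_pct": relative_drop_pct,
        "within_budget": drop_points <= max_drop_points,
    }


report = mmlu_regression(base_acc=0.63, tuned_acc=0.59)
print(report)
```

A drop from 0.63 to 0.59 is 4 percentage points but about 6.3% relative to the base score, so it is worth stating which of the two a threshold refers to.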

HellaSwag

Tests commonsense reasoning by asking the model to pick the most plausible continuation of a short scenario. Useful for checking that fine-tuning didn't damage reasoning ability.

TruthfulQA

Tests whether the model gives truthful answers to questions written to elicit common misconceptions and false beliefs. Critical for domain-specific models where factual accuracy matters.

MT-Bench

Conversational multi-turn benchmark of 80 two-turn questions, typically scored by a strong LLM judge. Tests instruction following, coding, math, and reasoning across 8 categories.

Using lm-evaluation-harness

EleutherAI's lm-evaluation-harness is the de facto standard tool for running these benchmarks:

Bash

pip install lm-eval

Python

from lm_eval import evaluator

# Evaluate fine-tuned model on a suite of benchmarks
results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=./fine-tuned-model,dtype=bfloat16",
    tasks=["mmlu", "hellaswag", "truthfulqa_mc1", "arc_challenge"],
    num_fewshot=0,  # Zero-shot
    batch_size=8,
    device="cuda",
)

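# Print summary (lm-eval 0.4+ metric keys look like "acc,none" and "acc_stderr,none")
for task_name, task_results in results["results"].items():
    for metric, value in task_results.items():
        # Skip standard errors and non-numeric entries such as "alias"
        if "_stderr" in metric or not isinstance(value, (int, float)):
            continue
        print(f"{task_name} | {metric}: {value:.4f}")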

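Run the same evaluation, with identical settings (few-shot count, dtype, batch size), on the base model and your fine-tuned model. The per-metric difference is your fine-tuning delta. Scores from different few-shot settings are not comparable, so a zero-shot result should not be judged against a 5-shot baseline.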
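A minimal sketch of computing that delta, assuming both runs use the lm-eval 0.4+ result layout (`results["results"][task][metric]`, with keys such as `"acc,none"`). The helper name `finetuning_delta` is hypothetical, and the two dictionaries are placeholders shaped like harness output, not real measurements.

```python
def finetuning_delta(base: dict, tuned: dict) -> dict:
    """Return tuned minus base for every numeric metric present in both runs."""
    deltas = {}
    for task, base_metrics in base["results"].items():
        tuned_metrics = tuned["results"].get(task, {})
        for metric, base_value in base_metrics.items():
            # Standard errors are not scores, so leave them out of the delta
            if "_stderr" in metric:
                continue
            tuned_value = tuned_metrics.get(metric)
            # Skip non-numeric fields (e.g. "alias") and metrics missing from one run
            if not isinstance(base_value, (int, float)):
                continue
            if not isinstance(tuned_value, (int, float)):
                continue
            deltas[f"{task}/{metric}"] = round(tuned_value - base_value, 4)
    return deltas


# Placeholder dictionaries shaped like simple_evaluate() output
base_run = {"results": {"mmlu": {"acc,none": 0.630, "acc_stderr,none": 0.004, "alias": "mmlu"}}}
tuned_run = {"results": {"mmlu": {"acc,none": 0.621, "acc_stderr,none": 0.004, "alias": "mmlu"}}}

for name, delta in finetuning_delta(base_run, tuned_run).items():
    print(f"{name}: {delta:+.4f}")
```

Negative values indicate a regression relative to the base model; positive values indicate an improvement.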

Domain-Specific Benchmark Construction

Standard benchmarks measure general ability. Build a domain benchmark for your specific task:

Python

# Example: clinical pharmacology benchmark
CLINICAL_BENCHMARK = [
    {
        "id": "pharm_001",
        "category": "drug_interactions",
        "question": "Which of the following best describes the mechanism of the warfarin-aspirin interaction?",
        "choices": [
            "A. Aspirin inhibits warfarin metabolism via CYP2C9",
            "B. Aspirin inhibits platelet aggregation and displaces warfarin from protein binding",
            "C. Aspirin increases vitamin K availability, antagonizing warfarin",
            "D. Aspirin directly inhibits vitamin K epoxide reductase",
        ],
        "answer": "B",
    },
    {
        "id": "pharm_002",
        "category": "pharmacokinetics",
        "question": "A patient with eGFR 25 mL/min is prescribed metformin. What is the correct action?",
        "choices": [
            "A. Continue at normal dose",
            "B. Reduce dose by 50%",
            "C. Contraindicated — discontinue metformin",
            "D. Use with caution, monitor lactate monthly",
        ],
        "answer": "C",
    },
]

import re

def evaluate_mcq_benchmark(model, tokenizer, benchmark: list[dict]) -> dict:
    """Evaluate model on multiple-choice benchmark."""
    correct = 0
    by_category = {}

    for item in benchmark:
        choices_text = "\n".join(item["choices"])
        prompt = f"{item['question']}\n\n{choices_text}\n\nAnswer:"

        # generate_response is a helper assumed to be defined elsewhere;
        # it should return only the newly generated text, not the prompt.
        response = generate_response(model, tokenizer, prompt, max_new_tokens=10)

        # Extract the first standalone A-D letter. A plain substring check
        # would wrongly match the "A" in a reply such as "Answer: B".
        match = re.search(r"\b([A-D])\b", response[:20])
        predicted = match.group(1) if match else None

        is_correct = predicted == item["answer"]
        if is_correct:
            correct += 1

        cat = item.get("category", "unknown")
        if cat not in by_category:
            by_category[cat] = {"correct": 0, "total": 0}
        by_category[cat]["total"] += 1
        if is_correct:
            by_category[cat]["correct"] += 1

    total = len(benchmark)
    return {
        "overall_accuracy": correct / total,
        "total_questions": total,
        "by_category": {
            cat: {"accuracy": v["correct"]/v["total"], "n": v["total"]}
            for cat, v in by_category.items()
        },
    }
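A hand-built domain benchmark is usually small, so its accuracy carries noticeable sampling noise. The sketch below estimates a 95% Wilson score interval for an accuracy figure, assuming questions are independent. The function name `wilson_interval` is illustrative, and the counts in the usage line are made up for demonstration.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% coverage)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


# Hypothetical example: 74 correct answers out of 100 questions
low, high = wilson_interval(correct=74, total=100)
print(f"accuracy 0.74, 95% interval: [{low:.3f}, {high:.3f}]")
```

With 100 questions the interval spans roughly 0.65 to 0.82, so a difference of a few points between two model variants may be within noise unless the benchmark is considerably larger.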

Benchmark-Driven Development

Use benchmarks to guide training decisions (here r is the LoRA rank):

Iteration 1: Fine-tune r=8, epochs=3
    → Task accuracy: 0.72, MMLU: 0.61 (base: 0.63)
    → MMLU dropped 2 points: acceptable

Iteration 2: Fine-tune r=16, epochs=3
    → Task accuracy: 0.78, MMLU: 0.59 (base: 0.63)
    → MMLU dropped 4 points: too much forgetting

Iteration 3: Fine-tune r=8, epochs=2 (stop earlier)
    → Task accuracy: 0.74, MMLU: 0.62 (base: 0.63)
    → Best balance: +2 points task accuracy vs. iteration 1, only -1 point MMLU vs. base

This is the core trade-off: more fine-tuning → better task performance but more forgetting. Benchmarks make this trade-off visible and quantifiable.
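One way to make the trade-off mechanical is to fix a forgetting budget and pick the run with the best task accuracy inside it. The sketch below does this with the three iterations above. The `Run` class, the `pick_best_run` helper, and the 2-point default budget are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Run:
    name: str
    task_accuracy: float
    mmlu: float


def pick_best_run(runs: list, base_mmlu: float, max_drop_points: float = 2.0) -> Optional[Run]:
    """Return the run with the highest task accuracy whose MMLU drop fits the budget."""
    eligible = [
        r for r in runs
        # Round to absorb float noise before comparing against the budget
        if round((base_mmlu - r.mmlu) * 100, 2) <= max_drop_points
    ]
    if not eligible:
        return None
    return max(eligible, key=lambda r: r.task_accuracy)


runs = [
    Run("r8_ep3", task_accuracy=0.72, mmlu=0.61),
    Run("r16_ep3", task_accuracy=0.78, mmlu=0.59),
    Run("r8_ep2", task_accuracy=0.74, mmlu=0.62),
]

best = pick_best_run(runs, base_mmlu=0.63)
print(best.name if best else "no run fits the budget")
```

Loosening the budget to 5 points would select the r=16 run instead, which shows that the budget itself is a product decision rather than something the benchmark decides for you.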

Comparing Fine-Tuned Models

Python

import pandas as pd

def compare_models(model_results: dict[str, dict]) -> pd.DataFrame:
    """Compare multiple fine-tuned model variants side by side."""
    rows = []
    for model_name, results in model_results.items():
        row = {"model": model_name}
        row.update(results)
        rows.append(row)

    df = pd.DataFrame(rows).set_index("model")
    return df

results = {
    "base_llama_8b": {"mmlu": 0.630, "hellaswag": 0.810, "task_accuracy": 0.61},
    "ft_r8_ep2": {"mmlu": 0.621, "hellaswag": 0.805, "task_accuracy": 0.74},
    "ft_r16_ep3": {"mmlu": 0.590, "hellaswag": 0.797, "task_accuracy": 0.78},
    "ft_r8_ep3": {"mmlu": 0.615, "hellaswag": 0.803, "task_accuracy": 0.76},
}

df = compare_models(results)
print(df.to_string())

When to Stop Iterating

Stop when:

Task benchmark accuracy meets your minimum threshold (e.g., 80% on domain MCQ)
General benchmark regression is within acceptable limits (less than a 3-point MMLU drop)
Marginal improvements from additional training rounds are under 1–2%
Human evaluation confirms real-world quality on representative queries
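The checklist above can be sketched as a small function. This reads the four criteria as all needing to hold at once, which is one possible interpretation. The function name `stopping_report`, its default thresholds, and the sample values are assumptions chosen to mirror the examples in the list.

```python
def stopping_report(
    task_acc: float,
    prev_task_acc: float,
    base_mmlu: float,
    tuned_mmlu: float,
    human_eval_ok: bool,
    min_task_acc: float = 0.80,
    max_mmlu_drop_points: float = 3.0,
    min_gain_points: float = 1.0,
) -> dict:
    """Evaluate each stopping criterion and combine them into a single decision."""
    checks = {
        # Task accuracy meets the minimum threshold
        "task_threshold_met": task_acc >= min_task_acc,
        # MMLU drop versus the base model stays under the limit (percentage points)
        "regression_within_limit": round((base_mmlu - tuned_mmlu) * 100, 2) < max_mmlu_drop_points,
        # The latest round improved task accuracy by less than the minimum gain
        "gains_have_plateaued": round((task_acc - prev_task_acc) * 100, 2) < min_gain_points,
        # Outcome of a manual review; this cannot be computed automatically
        "human_eval_ok": human_eval_ok,
    }
    checks["stop"] = all(checks.values())
    return checks


# Hypothetical values for illustration only
decision = stopping_report(
    task_acc=0.81, prev_task_acc=0.805, base_mmlu=0.63, tuned_mmlu=0.615, human_eval_ok=True
)
print(decision)
```

Returning the individual checks alongside the final decision makes it easy to see which criterion is still blocking.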

Don't optimize benchmarks indefinitely. They're proxies for real-world quality, not the goal itself.
