Benchmarking Fine-Tuned Models
Use standard benchmarks and domain-specific evals to measure fine-tuned model quality. Understand MMLU, HellaSwag, TruthfulQA, and how to build custom benchmark suites.
Why Benchmarks Matter
Internal evaluation on your training distribution tells you if your model learned the training data. Standard benchmarks tell you if fine-tuning damaged general capabilities.
A common failure mode: fine-tuning improves task performance by 20% but reduces MMLU score by 15% — the model traded general reasoning for narrow specialization.
Standard LLM Benchmarks
MMLU (Massive Multitask Language Understanding)
Tests knowledge across 57 academic subjects (medicine, law, history, STEM). A good proxy for retained general knowledge after fine-tuning.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=./fine-tuned-model",
    tasks=["mmlu"],
    num_fewshot=5,
    device="cuda",
)
# lm-eval 0.4+ names metrics "<metric>,<filter>"; older releases use plain "acc".
print(f"MMLU: {results['results']['mmlu']['acc,none']:.3f}")

Baseline: Most 7–8B instruction models score roughly 0.60–0.65 on MMLU. If fine-tuning drops this by more than 3–5 percentage points, the model may be catastrophically forgetting general knowledge.
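Before treating a drop as forgetting, it helps to check that it exceeds sampling noise. The sketch below approximates the standard error of an accuracy score with the binomial formula and expresses the base-to-fine-tuned gap in those units. The function names are hypothetical, the question count of 14,042 is assumed as the size of the MMLU test split, and the two runs are treated as independent samples, which overstates the noise when both models answer the same questions.

```python
import math


def accuracy_stderr(acc: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy measured on n_questions items."""
    return math.sqrt(acc * (1.0 - acc) / n_questions)


def drop_in_stderrs(base_acc: float, tuned_acc: float, n_questions: int) -> float:
    """Return the accuracy drop expressed in standard errors of the difference."""
    se_diff = math.sqrt(
        accuracy_stderr(base_acc, n_questions) ** 2
        + accuracy_stderr(tuned_acc, n_questions) ** 2
    )
    return (base_acc - tuned_acc) / se_diff


# Hypothetical scores; 14,042 is assumed as the MMLU test-set size.
z = drop_in_stderrs(base_acc=0.63, tuned_acc=0.60, n_questions=14042)
print(f"Drop is {z:.1f} standard errors wide")
```

A value above roughly 2 suggests the drop is unlikely to be noise. On a 50-question domain set, the same 3-point gap would sit well inside the noise, which is one reason small custom benchmarks need more questions before their deltas can be trusted.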
HellaSwag
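Tests commonsense reasoning by asking the model to pick the most plausible continuation of a short scenario from four candidates. Useful for checking that fine-tuning didn't damage everyday reasoning ability.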
TruthfulQA
Tests whether the model gives truthful answers to questions written to elicit common misconceptions and popular falsehoods. Critical for domain-specific models where factual accuracy matters.
MT-Bench
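Conversational multi-turn benchmark. Tests instruction following across 8 categories, including writing, coding, math, and reasoning. Responses are scored by a strong LLM acting as judge, so it is typically run with its own tooling.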
Using lm-evaluation-harness
The standard tool for benchmarking:
pip install lm-eval

from lm_eval import evaluator

# Evaluate fine-tuned model on a suite of benchmarks
results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=./fine-tuned-model,dtype=bfloat16",
    tasks=["mmlu", "hellaswag", "truthfulqa_mc1", "arc_challenge"],
    num_fewshot=0,  # Zero-shot
    batch_size=8,
    device="cuda",
)

# Print summary, skipping standard errors and non-numeric entries such as "alias"
for task_name, task_results in results["results"].items():
    for metric, value in task_results.items():
        if "_stderr" not in metric and isinstance(value, (int, float)):
            print(f"{task_name} | {metric}: {value:.4f}")

Run the same evaluation, with identical tasks and few-shot settings, on the base model and your fine-tuned model. The difference is your fine-tuning delta.
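One way to compute that delta is to diff the two results dictionaries metric by metric. This is a minimal sketch: `fine_tuning_delta` is a hypothetical helper, the sample dictionaries are made-up stand-ins for `results["results"]` from each run, and the `acc,none` key style assumes a recent lm-eval release.

```python
def fine_tuning_delta(base: dict, tuned: dict) -> dict:
    """Return tuned-minus-base for every numeric metric present in both runs."""
    delta = {}
    for task, base_metrics in base.items():
        tuned_metrics = tuned.get(task, {})
        for metric, base_value in base_metrics.items():
            tuned_value = tuned_metrics.get(metric)
            # Skip standard errors and non-numeric entries such as aliases.
            if "_stderr" in metric:
                continue
            if not isinstance(base_value, (int, float)):
                continue
            if not isinstance(tuned_value, (int, float)):
                continue
            delta.setdefault(task, {})[metric] = tuned_value - base_value
    return delta


# Made-up numbers standing in for results["results"] from each run.
base_scores = {"mmlu": {"alias": "mmlu", "acc,none": 0.63, "acc_stderr,none": 0.004}}
tuned_scores = {"mmlu": {"alias": "mmlu", "acc,none": 0.61, "acc_stderr,none": 0.004}}

deltas = fine_tuning_delta(base_scores, tuned_scores)
for task, metrics in deltas.items():
    for metric, change in metrics.items():
        print(f"{task} | {metric}: {change:+.4f}")
```

Negative values mark regressions against the base model. Metrics that appear in only one run are dropped, not guessed.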
Domain-Specific Benchmark Construction
Standard benchmarks measure general ability. Build a domain benchmark for your specific task:
# Example: clinical pharmacology benchmark
CLINICAL_BENCHMARK = [
    {
        "id": "pharm_001",
        "category": "drug_interactions",
        "question": "Which of the following best describes the mechanism of the warfarin-aspirin interaction?",
        "choices": [
            "A. Aspirin inhibits warfarin metabolism via CYP2C9",
            "B. Aspirin inhibits platelet aggregation and displaces warfarin from protein binding",
            "C. Aspirin increases vitamin K availability, antagonizing warfarin",
            "D. Aspirin directly inhibits vitamin K epoxide reductase",
        ],
        "answer": "B",
    },
    {
        "id": "pharm_002",
        "category": "pharmacokinetics",
        "question": "A patient with eGFR 25 mL/min is prescribed metformin. What is the correct action?",
        "choices": [
            "A. Continue at normal dose",
            "B. Reduce dose by 50%",
            "C. Contraindicated: discontinue metformin",
            "D. Use with caution, monitor lactate monthly",
        ],
        "answer": "C",
    },
]
import re

def evaluate_mcq_benchmark(model, tokenizer, benchmark: list[dict]) -> dict:
    """Evaluate model on multiple-choice benchmark."""
    correct = 0
    by_category = {}
    for item in benchmark:
        choices_text = "\n".join(item["choices"])
        prompt = f"{item['question']}\n\n{choices_text}\n\nAnswer:"
        # generate_response is assumed to be defined elsewhere and to
        # return only the newly generated text, not the prompt
        response = generate_response(model, tokenizer, prompt, max_new_tokens=10)
        # Extract the first standalone letter A-D; a plain substring test
        # would also match the "A" in words such as "Answer"
        match = re.search(r"\b([A-D])\b", response[:20])
        predicted = match.group(1) if match else None
        is_correct = predicted == item["answer"]
        if is_correct:
            correct += 1
        cat = item.get("category", "unknown")
        if cat not in by_category:
            by_category[cat] = {"correct": 0, "total": 0}
        by_category[cat]["total"] += 1
        if is_correct:
            by_category[cat]["correct"] += 1
    total = len(benchmark)
    return {
        "overall_accuracy": correct / total,
        "total_questions": total,
        "by_category": {
            cat: {"accuracy": v["correct"] / v["total"], "n": v["total"]}
            for cat, v in by_category.items()
        },
    }

Benchmark-Driven Development
Use benchmarks to guide training decisions:
Iteration 1: Fine-tune r=8, epochs=3
  → Task accuracy: 0.72, MMLU: 0.61 (base: 0.63)
  → MMLU dropped 2 points: acceptable
Iteration 2: Fine-tune r=16, epochs=3
  → Task accuracy: 0.78, MMLU: 0.59 (base: 0.63)
  → MMLU dropped 4 points: too much forgetting
Iteration 3: Fine-tune r=8, epochs=2 (stop earlier)
  → Task accuracy: 0.74, MMLU: 0.62 (base: 0.63)
  → Best balance: +2 points task accuracy over iteration 1, only 1 point of MMLU lost

This is the core trade-off: more aggressive fine-tuning generally buys better task performance at the cost of more forgetting. Benchmarks make this trade-off visible and quantifiable.
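That selection rule can be written down explicitly: keep only the runs whose MMLU drop stays within a budget, then take the one with the highest task accuracy. The sketch below uses the three iterations above. `Run`, `pick_best_run`, and the 0.03 budget are illustrative assumptions, not a recommended threshold.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Run:
    name: str
    task_accuracy: float
    mmlu: float


def pick_best_run(runs: list, base_mmlu: float, max_mmlu_drop: float) -> Optional[Run]:
    """Highest task accuracy among runs whose MMLU drop is within budget."""
    allowed = [r for r in runs if base_mmlu - r.mmlu <= max_mmlu_drop]
    if not allowed:
        return None
    return max(allowed, key=lambda r: r.task_accuracy)


# The three iterations from the log, with an assumed 3-point MMLU budget.
runs = [
    Run("r8_ep3", task_accuracy=0.72, mmlu=0.61),
    Run("r16_ep3", task_accuracy=0.78, mmlu=0.59),
    Run("r8_ep2", task_accuracy=0.74, mmlu=0.62),
]
best = pick_best_run(runs, base_mmlu=0.63, max_mmlu_drop=0.03)
print(best.name if best else "no run within budget")
```

Treating forgetting as a hard constraint instead of folding it into a weighted score keeps the decision easy to explain: the r=16 run is excluded outright, however good its task accuracy looks.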
Comparing Fine-Tuned Models
import pandas as pd

def compare_models(model_results: dict[str, dict]) -> pd.DataFrame:
    """Compare multiple fine-tuned model variants side by side."""
    rows = []
    for model_name, results in model_results.items():
        row = {"model": model_name}
        row.update(results)
        rows.append(row)
    df = pd.DataFrame(rows).set_index("model")
    return df

# Illustrative scores; ft_r8_ep3 matches iteration 1 in the log above
results = {
    "base_llama_8b": {"mmlu": 0.630, "hellaswag": 0.810, "task_accuracy": 0.61},
    "ft_r8_ep2": {"mmlu": 0.621, "hellaswag": 0.805, "task_accuracy": 0.74},
    "ft_r16_ep3": {"mmlu": 0.590, "hellaswag": 0.797, "task_accuracy": 0.78},
    "ft_r8_ep3": {"mmlu": 0.610, "hellaswag": 0.803, "task_accuracy": 0.72},
}
df = compare_models(results)
print(df.to_string())

When to Stop Iterating
Stop when:
- Task benchmark accuracy meets your minimum threshold (e.g., 80% on domain MCQ)
- General benchmark regression is within acceptable limits (e.g., an MMLU drop of less than 3 percentage points)
- Marginal improvements from additional training rounds are under 1–2 percentage points
- Human evaluation confirms real-world quality on representative queries
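
The first three criteria are mechanical enough to check in code; the fourth still needs people. A sketch, where `should_stop` and its default thresholds are assumptions taken from the examples in the list and are not fixed rules:

```python
def should_stop(
    task_history: list,
    base_mmlu: float,
    tuned_mmlu: float,
    min_task_accuracy: float = 0.80,
    max_mmlu_drop: float = 0.03,
    min_gain: float = 0.02,
) -> dict:
    """Evaluate the automatable stopping criteria for the latest iteration."""
    latest = task_history[-1]
    # Gain over the previous iteration; None when there is only one run.
    gain = latest - task_history[-2] if len(task_history) > 1 else None
    checks = {
        "meets_task_threshold": latest >= min_task_accuracy,
        "regression_within_limit": base_mmlu - tuned_mmlu < max_mmlu_drop,
        "gains_have_plateaued": gain is not None and gain < min_gain,
    }
    checks["stop"] = all(checks.values())
    return checks


# Hypothetical task accuracy across four training rounds.
status = should_stop([0.72, 0.78, 0.80, 0.81], base_mmlu=0.63, tuned_mmlu=0.61)
print(status)
```

Returning each check separately, alongside the combined `stop` flag, shows which criterion is still blocking. A `stop` of `True` only means the run is ready for human evaluation.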
Don't optimize benchmarks indefinitely — they're proxies for real-world quality, not the goal itself.