Fine-Tuning LLMs · Lesson 14 of 16
Domain-Specific Benchmarks for Fine-Tuning
Why Benchmarks Matter
Internal evaluation on your training distribution tells you if your model learned the training data. Standard benchmarks tell you if fine-tuning damaged general capabilities.
A common failure mode: fine-tuning improves task performance by 20% but reduces MMLU score by 15% — the model traded general reasoning for narrow specialization.
Standard LLM Benchmarks
MMLU (Massive Multitask Language Understanding)
Tests knowledge across 57 academic subjects (medicine, law, history, STEM). A good proxy for retained general knowledge after fine-tuning.
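from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=./fine-tuned-model",
    tasks=["mmlu"],
    num_fewshot=5,
    device="cuda",
)
# lm-eval 0.4+ stores metrics under "name,filter" keys such as "acc,none";
# older releases used plain "acc". Try both.
mmlu = results["results"]["mmlu"]
acc = mmlu.get("acc,none", mmlu.get("acc"))
print(f"MMLU: {acc:.3f}")
Baseline: Many 7–8B instruction models score roughly 0.60–0.65 on MMLU. If fine-tuning drops this by more than 3–5 percentage points, the model may be catastrophically forgetting general knowledge.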
HellaSwag
Tests commonsense reasoning by asking the model to pick the most plausible continuation of a short everyday scenario. Useful for checking that fine-tuning didn't damage reasoning ability.
TruthfulQA
Tests whether the model gives truthful answers to questions written to elicit common misconceptions and plausible-sounding falsehoods. Critical for domain-specific models where factual accuracy matters.
MT-Bench
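Multi-turn conversational benchmark: 80 two-turn questions across 8 categories (including writing, reasoning, math, and coding). Responses are graded by a strong LLM judge rather than by exact-match accuracy, so it tests instruction following in dialogue rather than multiple-choice knowledge.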
Using lm-evaluation-harness
The standard tool for benchmarking:
pip install lm-eval

from lm_eval import evaluator

# Evaluate fine-tuned model on a suite of benchmarks
results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=./fine-tuned-model,dtype=bfloat16",
    tasks=["mmlu", "hellaswag", "truthfulqa_mc1", "arc_challenge"],
    num_fewshot=0,  # Zero-shot
    batch_size=8,
    device="cuda",
)

# Print summary, skipping standard errors and non-numeric fields such as "alias"
for task_name, task_results in results["results"].items():
    for metric, value in task_results.items():
        if "_stderr" in metric or not isinstance(value, (int, float)):
            continue
        print(f"{task_name} | {metric}: {value:.4f}")
Run the same evaluation, with identical settings, on the base model and your fine-tuned model. The difference is your fine-tuning delta.
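As a minimal sketch of computing that delta, the helper below compares two flat `{task: score}` mappings that you would extract from the harness output yourself. The function name `finetune_delta` and the scores are hypothetical, chosen only for illustration. It reports both the absolute change in percentage points and the relative change, since the two are easy to confuse.

```python
def finetune_delta(base: dict, tuned: dict) -> dict:
    """Per-task change from base to fine-tuned, for tasks present in both."""
    delta = {}
    for task in sorted(base.keys() & tuned.keys()):
        absolute = tuned[task] - base[task]
        relative = absolute / base[task] * 100 if base[task] else float("nan")
        delta[task] = {
            "base": base[task],
            "tuned": tuned[task],
            "abs_points": round(absolute * 100, 2),  # percentage points
            "rel_percent": round(relative, 2),       # percent of the base score
        }
    return delta


# Hypothetical scores, for illustration only
base_scores = {"mmlu": 0.630, "hellaswag": 0.810}
tuned_scores = {"mmlu": 0.610, "hellaswag": 0.805}

for task, d in finetune_delta(base_scores, tuned_scores).items():
    print(f"{task}: {d['abs_points']:+.1f} points ({d['rel_percent']:+.1f}% relative)")
```

A 2-point absolute drop on a 0.63 baseline is roughly a 3% relative drop, so state which one a threshold refers to.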
Domain-Specific Benchmark Construction
Standard benchmarks measure general ability. Build a domain benchmark for your specific task:
# Example: clinical pharmacology benchmark
CLINICAL_BENCHMARK = [
    {
        "id": "pharm_001",
        "category": "drug_interactions",
        "question": "Which of the following best describes the mechanism of the warfarin-aspirin interaction?",
        "choices": [
            "A. Aspirin inhibits warfarin metabolism via CYP2C9",
            "B. Aspirin inhibits platelet aggregation and displaces warfarin from protein binding",
            "C. Aspirin increases vitamin K availability, antagonizing warfarin",
            "D. Aspirin directly inhibits vitamin K epoxide reductase",
        ],
        "answer": "B",
    },
    {
        "id": "pharm_002",
        "category": "pharmacokinetics",
        "question": "A patient with eGFR 25 mL/min is prescribed metformin. What is the correct action?",
        "choices": [
            "A. Continue at normal dose",
            "B. Reduce dose by 50%",
            "C. Contraindicated — discontinue metformin",
            "D. Use with caution, monitor lactate monthly",
        ],
        "answer": "C",
    },
]
import re

def evaluate_mcq_benchmark(model, tokenizer, benchmark: list[dict]) -> dict:
    """Evaluate model on multiple-choice benchmark."""
    # generate_response(model, tokenizer, prompt, max_new_tokens) is assumed to be
    # defined elsewhere and to return only the newly generated text.
    correct = 0
    by_category = {}
    for item in benchmark:
        choices_text = "\n".join(item["choices"])
        prompt = f"{item['question']}\n\n{choices_text}\n\nAnswer:"
        response = generate_response(model, tokenizer, prompt, max_new_tokens=10)
        # Extract the first standalone answer letter. A plain substring check
        # would match the "A" in words like "Answer" and always try "A" first.
        match = re.search(r"\b([ABCD])\b", response[:20])
        predicted = match.group(1) if match else None
        is_correct = predicted == item["answer"]
        if is_correct:
            correct += 1
        cat = item.get("category", "unknown")
        if cat not in by_category:
            by_category[cat] = {"correct": 0, "total": 0}
        by_category[cat]["total"] += 1
        if is_correct:
            by_category[cat]["correct"] += 1
    total = len(benchmark)
    return {
        # Guard against an empty benchmark
        "overall_accuracy": correct / total if total else 0.0,
        "total_questions": total,
        "by_category": {
            cat: {"accuracy": v["correct"] / v["total"], "n": v["total"]}
            for cat, v in by_category.items()
        },
    }
Benchmark-Driven Development
Use benchmarks to guide training decisions:
Iteration 1: Fine-tune r=8, epochs=3
  → Task accuracy: 0.72, MMLU: 0.61 (base: 0.63)
  → MMLU dropped 2 points: acceptable
Iteration 2: Fine-tune r=16, epochs=3
  → Task accuracy: 0.78, MMLU: 0.59 (base: 0.63)
  → MMLU dropped 4 points: too much forgetting
Iteration 3: Fine-tune r=8, epochs=2 (stop earlier)
  → Task accuracy: 0.74, MMLU: 0.62 (base: 0.63)
  → Best balance: +2 points task accuracy over iteration 1, -1 point MMLU vs. base
This is the core trade-off: more fine-tuning → better task performance but more forgetting. Benchmarks make this trade-off visible and quantifiable.
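One way to make the choice among iterations explicit is a selection rule: keep only the runs whose MMLU drop stays within a forgetting budget, then take the one with the highest task accuracy. The sketch below uses the three iterations from the log above. The `Run` class, the `pick_best_run` helper, and the 3-point budget are assumptions made for illustration, not part of any library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Run:
    name: str
    task_accuracy: float
    mmlu: float


def pick_best_run(runs: list, base_mmlu: float, max_mmlu_drop: float) -> Optional[Run]:
    """Highest task accuracy among runs whose MMLU drop stays within budget."""
    # Small tolerance so a drop exactly at the budget is not lost to float rounding
    eligible = [r for r in runs if base_mmlu - r.mmlu <= max_mmlu_drop + 1e-9]
    return max(eligible, key=lambda r: r.task_accuracy, default=None)


runs = [
    Run("r=8, epochs=3", task_accuracy=0.72, mmlu=0.61),
    Run("r=16, epochs=3", task_accuracy=0.78, mmlu=0.59),
    Run("r=8, epochs=2", task_accuracy=0.74, mmlu=0.62),
]

# Assumed budget: at most 3 percentage points of MMLU regression
best = pick_best_run(runs, base_mmlu=0.63, max_mmlu_drop=0.03)
print(best.name if best else "no run within budget")
```

With this budget the r=16 run is excluded despite its higher task accuracy, and the rule picks the r=8, epochs=2 run. Loosening the budget to 5 points would select r=16 instead, which is why the budget should be fixed before looking at results.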
Comparing Fine-Tuned Models
import pandas as pd

def compare_models(model_results: dict[str, dict]) -> pd.DataFrame:
    """Compare multiple fine-tuned model variants side by side."""
    rows = []
    for model_name, results in model_results.items():
        row = {"model": model_name}
        row.update(results)
        rows.append(row)
    df = pd.DataFrame(rows).set_index("model")
    return df

results = {
    "base_llama_8b": {"mmlu": 0.630, "hellaswag": 0.810, "task_accuracy": 0.61},
    "ft_r8_ep2": {"mmlu": 0.621, "hellaswag": 0.805, "task_accuracy": 0.74},
    "ft_r16_ep3": {"mmlu": 0.590, "hellaswag": 0.797, "task_accuracy": 0.78},
    "ft_r8_ep3": {"mmlu": 0.615, "hellaswag": 0.803, "task_accuracy": 0.76},
}
df = compare_models(results)
print(df.to_string())
When to Stop Iterating
Stop when:
- Task benchmark accuracy meets your minimum threshold (e.g., 80% on domain MCQ)
- General benchmark regression is within acceptable limits (e.g., less than a 3-point MMLU drop)
- Marginal improvements from additional training rounds are under 1–2 percentage points
- Human evaluation confirms real-world quality on representative queries
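The checklist above can be sketched as a small function that reports each criterion separately. This is only an illustration: the name `stopping_report`, the default thresholds, and the decision to require all four conditions at once are assumptions, and the human-evaluation result is passed in as a boolean because it cannot be computed from benchmark scores.

```python
def stopping_report(
    task_accuracy: float,
    previous_task_accuracy: float,
    mmlu: float,
    base_mmlu: float,
    human_eval_passed: bool,
    min_task_accuracy: float = 0.80,  # assumed minimum domain accuracy
    max_mmlu_drop: float = 0.03,      # assumed regression limit (3 points)
    min_gain: float = 0.01,           # assumed plateau threshold (1 point)
) -> dict:
    """Evaluate each stopping criterion; 'stop' is True only if all hold."""
    checks = {
        "task_threshold_met": task_accuracy >= min_task_accuracy,
        "regression_within_limit": (base_mmlu - mmlu) < max_mmlu_drop,
        "gains_have_plateaued": (task_accuracy - previous_task_accuracy) < min_gain,
        "human_eval_passed": human_eval_passed,
    }
    checks["stop"] = all(checks.values())
    return checks


# Hypothetical numbers, for illustration only
report = stopping_report(
    task_accuracy=0.81,
    previous_task_accuracy=0.805,
    mmlu=0.62,
    base_mmlu=0.63,
    human_eval_passed=True,
)
for name, ok in report.items():
    print(f"{name}: {ok}")
```

Returning the individual checks rather than a single boolean shows which criterion is blocking when the answer is "keep iterating".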
Don't optimize benchmarks indefinitely — they're proxies for real-world quality, not the goal itself.