Human Evaluation vs Automated Evaluation
When to use human evaluators, when to use automated metrics, and how to combine both for reliable, scalable LLM quality assurance.
Every LLM evaluation program faces the same trade-off: human evaluation is the gold standard but it is slow, expensive, and hard to scale. Automated evaluation is fast and cheap but may miss the nuance that matters most.
This lesson shows you how to combine both intelligently.
Human Evaluation: Strengths and Weaknesses
Human evaluators can understand intent, detect subtle errors, assess tone, and apply contextual judgment that no metric can replicate.
Strengths:
- Captures nuance: appropriateness, empathy, hedging, cultural context
- Detects hallucinations that "sound right" but are factually wrong
- Can handle novel tasks with no established metric
- Is the ground truth for user experience
Weaknesses:
- Slow: a single human can rate maybe 50-100 responses per hour
- Expensive: domain experts cost significantly more than general annotators
- Not reproducible: the same rater may give different scores on different days
- Not scalable: you cannot run human eval on every commit
```python
# Rough cost model for human evaluation
def estimate_human_eval_cost(
    n_examples: int,
    rate_per_hour: float,
    examples_per_hour: int,
    n_raters_per_example: int = 3,
) -> dict:
    total_ratings = n_examples * n_raters_per_example
    hours = total_ratings / examples_per_hour
    cost = hours * rate_per_hour
    return {
        "examples": n_examples,
        "total_ratings": total_ratings,
        "hours": round(hours, 1),
        "cost_usd": round(cost, 2),
    }


# General annotator (e.g., for tone/clarity)
general = estimate_human_eval_cost(
    n_examples=500,
    rate_per_hour=25,
    examples_per_hour=60,
)
print("General annotator:", general)
# {'examples': 500, 'total_ratings': 1500, 'hours': 25.0, 'cost_usd': 625.0}

# Domain expert (e.g., for medical accuracy)
expert = estimate_human_eval_cost(
    n_examples=500,
    rate_per_hour=150,
    examples_per_hour=30,
)
print("Domain expert:", expert)
# {'examples': 500, 'total_ratings': 1500, 'hours': 50.0, 'cost_usd': 7500.0}
```

For a 500-example medical eval with 3 expert raters: roughly $7,500 and 50 hours. This is why you do not run human eval on every pull request.
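When planning, the inverse question is often more useful: given a fixed budget, how many examples can you afford to rate? The sketch below reuses the assumptions of the cost model above (flat hourly rate, constant throughput, every example rated by every rater). The function name `max_examples_for_budget` is illustrative, not part of any library.

```python
import math


def max_examples_for_budget(
    budget_usd: float,
    rate_per_hour: float,
    examples_per_hour: int,
    n_raters_per_example: int = 3,
) -> int:
    """Largest number of examples whose full set of ratings fits in the budget."""
    if min(budget_usd, rate_per_hour, examples_per_hour, n_raters_per_example) <= 0:
        raise ValueError("All inputs must be positive")
    # Cost of one fully rated example is rate * raters / throughput,
    # so the affordable count is budget * throughput / (rate * raters).
    affordable = budget_usd * examples_per_hour / (rate_per_hour * n_raters_per_example)
    # Round down: a partially rated example is not usable.
    return math.floor(affordable)


# A $2,000 budget with the domain-expert assumptions from the cost model
print(max_examples_for_budget(2000, rate_per_hour=150, examples_per_hour=30))
# 133
```

With expert raters, a few thousand dollars buys only a small sample, which is one reason to reserve human review for the examples where it matters most.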
Automated Evaluation: Strengths and Weaknesses
Automated metrics compute a score from the model output, either by comparing it to a reference answer or by applying a heuristic.
Strengths:
- Fast: thousands of examples in seconds
- Cheap: compute cost only
- Reproducible: same script, same score every time
- Scalable: runs in CI on every commit
Weaknesses:
- May penalize valid paraphrases
- Cannot detect factual errors that are fluently expressed
- No contextual judgment (tone, appropriateness, safety)
- Optimizing for the metric can diverge from actual quality
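To make the first two weaknesses concrete, here is a minimal sketch of a token-overlap F1 metric with the `(output, reference) -> float` shape. The name `token_f1`, the whitespace tokenization, and the example sentences are illustrative assumptions, not a recommended production metric.

```python
from collections import Counter


def token_f1(output: str, reference: str) -> float:
    """Token-level F1 between an output and a reference (lowercased whitespace tokens)."""
    out_tokens = output.lower().split()
    ref_tokens = reference.lower().split()
    if not out_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(out_tokens == ref_tokens)
    # Multiset intersection: each shared token counts at most as often as it appears in both.
    overlap = sum((Counter(out_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(out_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "Take the medication twice a day with food"
paraphrase = "You should have the pills with meals, two times daily"
wrong = "Take the medication twice a week with food"

print("paraphrase:", token_f1(paraphrase, reference))
print("wrong dose:", token_f1(wrong, reference))
```

The valid paraphrase shares only two tokens with the reference, while the factually wrong answer shares seven of eight, so the wrong answer gets the higher score. A human rater would rank them the other way around.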
```python
# Simple automated eval pipeline
import json


def run_automated_eval(
    dataset_path: str,
    model_fn,   # function that takes a prompt and returns a string
    metric_fn,  # function that takes (output, reference) and returns a float
) -> dict:
    # Load a JSONL dataset, skipping blank lines and metadata records
    examples = []
    with open(dataset_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            obj = json.loads(line)
            if not obj.get("_metadata"):
                examples.append(obj)

    scores = []
    failures = []
    for ex in examples:
        try:
            output = model_fn(ex["prompt"])
            score = metric_fn(output, ex["ideal_response"])
            scores.append({
                "id": ex["id"],
                "score": score,
                "output": output,
            })
        except Exception as e:
            # Record the failure and continue so one bad example does not abort the run
            failures.append({"id": ex.get("id"), "error": str(e)})

    all_scores = [s["score"] for s in scores]
    return {
        "n_examples": len(examples),
        "n_scored": len(scores),
        "n_failed": len(failures),
        "mean_score": sum(all_scores) / len(all_scores) if all_scores else 0,
        "min_score": min(all_scores) if all_scores else 0,
        "max_score": max(all_scores) if all_scores else 0,
        "failures": failures,
    }
```

Inter-Annotator Agreement: Measuring Human Consistency
Before trusting human eval results, measure whether your raters agree with each other. Low agreement means your annotation guidelines are ambiguous or your task is too subjective.
The most common measures are Cohen's Kappa, for two raters assigning categorical labels, and Krippendorff's Alpha, which also handles ordinal or continuous scales and more than two raters.
```python
from collections import Counter


def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Compute Cohen's Kappa for two raters giving categorical scores."""
    assert len(rater_a) == len(rater_b), "Raters must score the same examples"
    assert rater_a, "Need at least one rated example"
    n = len(rater_a)
    categories = sorted(set(rater_a) | set(rater_b))

    # Observed agreement
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Expected agreement by chance
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    p_e = sum(
        (count_a.get(c, 0) / n) * (count_b.get(c, 0) / n)
        for c in categories
    )

    if p_e == 1.0:
        return 1.0  # both raters used one identical category; avoid division by zero
    kappa = (p_o - p_e) / (1 - p_e)
    return round(kappa, 4)


# Example: two medical experts rate 10 responses on helpfulness (1-3)
rater_a = [3, 2, 3, 1, 2, 3, 2, 1, 3, 2]
rater_b = [3, 2, 2, 1, 2, 3, 3, 1, 3, 2]
kappa = cohens_kappa(rater_a, rater_b)
print(f"Cohen's Kappa: {kappa}")
# Cohen's Kappa: 0.6875


# Interpretation guide
def interpret_kappa(k: float) -> str:
    if k < 0:
        return "Poor (worse than chance)"
    elif k < 0.20:
        return "Slight agreement"
    elif k < 0.40:
        return "Fair agreement"
    elif k < 0.60:
        return "Moderate agreement"
    elif k < 0.80:
        return "Substantial agreement"
    else:
        return "Almost perfect agreement"


print(f"Interpretation: {interpret_kappa(kappa)}")
# Interpretation: Substantial agreement
```

Target kappa values by task:
| Task | Acceptable Kappa |
|------|-----------------|
| Safety classification | 0.80+ |
| Factual accuracy | 0.70+ |
| Helpfulness (ordinal) | 0.60+ |
| Tone/style | 0.50+ |
| Overall quality | 0.55+ |
If kappa falls below target, run a calibration session with all raters before collecting more data.
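Cohen's Kappa compares exactly two raters, while the cost model earlier in this lesson assumes three raters per example. One common workaround is to average kappa over every pair of raters and compare the result with the target for the task. The sketch below assumes that approach. `mean_pairwise_kappa`, `needs_calibration`, the `TARGET_KAPPA` keys, and the third rater's scores are illustrative, and the thresholds are copied from the table above.

```python
from collections import Counter
from itertools import combinations

# Thresholds copied from the target table in this lesson (key names are illustrative).
TARGET_KAPPA = {
    "safety_classification": 0.80,
    "factual_accuracy": 0.70,
    "helpfulness": 0.60,
    "tone_style": 0.50,
    "overall_quality": 0.55,
}


def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Cohen's kappa for two raters scoring the same examples."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Raters must score the same non-empty set of examples")
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of the two raters' marginal frequencies per category.
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical category
    return (p_o - p_e) / (1 - p_e)


def mean_pairwise_kappa(ratings: dict[str, list[int]]) -> float:
    """Average Cohen's kappa over every pair of raters."""
    pairs = list(combinations(ratings.values(), 2))
    if not pairs:
        raise ValueError("Need at least two raters")
    return sum(cohens_kappa(a, b) for a, b in pairs) / len(pairs)


def needs_calibration(ratings: dict[str, list[int]], task: str) -> bool:
    """True when average agreement falls below the target for the task."""
    return mean_pairwise_kappa(ratings) < TARGET_KAPPA[task]


ratings = {
    "rater_a": [3, 2, 3, 1, 2, 3, 2, 1, 3, 2],
    "rater_b": [3, 2, 2, 1, 2, 3, 3, 1, 3, 2],
    "rater_c": [3, 2, 3, 1, 1, 3, 2, 1, 3, 3],  # hypothetical third rater
}
print(round(mean_pairwise_kappa(ratings), 3))
# 0.593
print(needs_calibration(ratings, "helpfulness"))
# True
```

Averaging can hide one outlier rater, so it is worth looking at the individual pairs too. Here raters B and C agree far less with each other than either does with rater A.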
The Hybrid Approach
The practical answer is to use both, at different frequencies and for different purposes.
| Frequency | Method | Purpose |
|-----------|--------|---------|
| Every commit | Unit assertions | Catch hard regressions |
| Every PR | Automated eval | Catch metric regressions |
| Weekly | LLM-as-judge | Catch nuanced quality changes |
| Monthly/Quarterly | Human eval | Ground truth calibration + safety audit |

Calibrating Automated Metrics Against Human Scores
The most important hybrid practice: periodically check that your automated metric correlates with human scores. If they diverge, your automated metric is misleading you.
```python
import numpy as np
from scipy import stats


def check_metric_human_correlation(
    automated_scores: list[float],
    human_scores: list[float],
    metric_name: str,
) -> dict:
    """Compute Pearson and Spearman correlation between automated and human scores."""
    assert len(automated_scores) == len(human_scores)

    pearson_r, pearson_p = stats.pearsonr(automated_scores, human_scores)
    spearman_r, spearman_p = stats.spearmanr(automated_scores, human_scores)

    # Cast to plain floats so the printed dict stays readable
    result = {
        "metric": metric_name,
        "n": len(automated_scores),
        "pearson_r": round(float(pearson_r), 3),
        "pearson_p": round(float(pearson_p), 4),
        "spearman_r": round(float(spearman_r), 3),
        "spearman_p": round(float(spearman_p), 4),
        "correlation_strength": "",
    }

    # Classify on rank correlation, which does not assume a linear relationship
    r = abs(spearman_r)
    if r >= 0.7:
        result["correlation_strength"] = "Strong: metric is reliable"
    elif r >= 0.5:
        result["correlation_strength"] = "Moderate: use with caution"
    else:
        result["correlation_strength"] = "Weak: consider a different metric"
    return result


# Simulated data: 50 examples with ROUGE-L and human helpfulness scores
np.random.seed(42)
rouge_scores = np.random.uniform(0.2, 0.9, 50)
# Simulate moderate correlation with human scores
human_scores = (
    0.6 * rouge_scores
    + 0.4 * np.random.uniform(1, 5, 50) / 5
    + np.random.normal(0, 0.05, 50)
)
human_scores = np.clip(human_scores, 0, 1)

result = check_metric_human_correlation(
    rouge_scores.tolist(),
    human_scores.tolist(),
    "ROUGE-L",
)
print(result)
```

When to Rely on Human Eval
Some situations demand human evaluation regardless of cost:
Safety-critical outputs: Medical advice, legal information, mental health support. A model that confidently gives wrong advice on 0.1% of queries can cause real harm, and aggregate automated metrics tend to miss these rare cases.
Novel tasks: If you have no established metric and no reference answers, humans are the only option.
Regulatory requirements: In regulated domains such as healthcare, finance, and legal services, you may need documented evidence of expert review.
Model release decisions: Before deploying a new model to production, run a human evaluation. Automated metrics alone are not sufficient due diligence.
```python
# Example: flagging responses for mandatory human review
HUMAN_REVIEW_TRIGGERS = [
    "suicide", "self-harm", "overdose", "emergency",
    "lawsuit", "illegal", "criminal",
    "classified", "confidential",
]


def requires_human_review(response: str, context: dict) -> bool:
    text_lower = response.lower()

    # Safety triggers (simple case-insensitive substring match)
    for trigger in HUMAN_REVIEW_TRIGGERS:
        if trigger in text_lower:
            return True

    # High-stakes tasks always require review
    if context.get("task_type") in ["medical_diagnosis", "legal_advice"]:
        return True

    # Low confidence from automated scorer
    if context.get("auto_score", 1.0) < 0.4:
        return True

    return False


# Example usage in a production pipeline
response = "For severe overdose symptoms, call 911 immediately."
context = {"task_type": "medical_qa", "auto_score": 0.72}
if requires_human_review(response, context):
    # This branch runs: the response contains the trigger "overdose"
    print("FLAGGED: Queued for human review")
else:
    print("OK: Passed automated checks")
```

Building an Annotation Workflow
When you do run human eval, make it efficient.
```python
# Annotation interface structure (backend data model)
from collections import Counter, defaultdict
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass
class AnnotationTask:
    task_id: str
    prompt: str
    response: str
    criteria: list[str]  # e.g., ["accuracy", "safety", "helpfulness"]
    scale: dict  # e.g., {"min": 1, "max": 5, "labels": {1: "very bad", 5: "excellent"}}
    assigned_to: str
    deadline: str


@dataclass
class Annotation:
    task_id: str
    annotator_id: str
    scores: dict[str, int]  # {"accuracy": 4, "safety": 5, "helpfulness": 3}
    notes: Optional[str]
    flagged_for_review: bool
    completed_at: str


def aggregate_annotations(
    annotations: list[Annotation],
    method: str = "mean",
) -> dict[str, float]:
    """Aggregate multiple annotations for the same task."""
    if method not in {"mean", "median", "majority"}:
        raise ValueError(f"Unknown aggregation method: {method}")

    # Collect every annotator's score per criterion
    scores_by_criterion = defaultdict(list)
    for ann in annotations:
        for criterion, score in ann.scores.items():
            scores_by_criterion[criterion].append(score)

    result = {}
    for criterion, scores in scores_by_criterion.items():
        if method == "mean":
            result[criterion] = round(sum(scores) / len(scores), 2)
        elif method == "median":
            # statistics.median averages the two middle values when the count is even
            result[criterion] = median(scores)
        else:  # majority: most frequent score, first seen wins ties
            result[criterion] = Counter(scores).most_common(1)[0][0]
    return result
```

Key Takeaways
- Human eval is the gold standard but too slow and expensive for every commit.
- Automated eval is scalable and reproducible but misses nuance and safety issues.
- Measure inter-annotator agreement (Cohen's Kappa) before trusting human scores.
- Calibrate automated metrics against human scores at least quarterly.
- Always use human eval for safety-critical, legal, or medical outputs.
- Build a hybrid pyramid: automated in CI, human on a periodic schedule.
What's Next
In eval-task-types.mdx, you will learn which specific metrics to use for classification, generation, RAG, code generation, and conversation tasks.