Human Evaluation vs Automated Evaluation

Every LLM evaluation program faces the same trade-off: human evaluation is the gold standard but it is slow, expensive, and hard to scale. Automated evaluation is fast and cheap but may miss the nuance that matters most.

This lesson shows you how to combine both intelligently.

Human Evaluation: Strengths and Weaknesses

Human evaluators can understand intent, detect subtle errors, assess tone, and apply contextual judgment that automated metrics struggle to replicate.

Strengths:

Captures nuance: appropriateness, empathy, hedging, cultural context
Detects hallucinations that "sound right" but are factually wrong
Can handle novel tasks with no established metric
Is the ground truth for user experience

Weaknesses:

Slow: a single rater gets through roughly 50-100 short responses per hour, and fewer when the task needs expert judgment
Expensive: domain experts cost significantly more than general annotators
Not reproducible: the same rater may give different scores on different days
Not scalable: you cannot run human eval on every commit
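
The reproducibility weakness can be measured by slipping a few repeated items into a later batch and checking how often a rater matches their own earlier score. The following is a minimal sketch, assuming scores are stored as dicts keyed by example ID; the function name `self_agreement_rate` and the sample scores are hypothetical.

```python
def self_agreement_rate(first_pass: dict[str, int], second_pass: dict[str, int]) -> float:
    """Fraction of repeated examples where a rater gave the same score both times."""
    # Only examples that appear in both passes can be compared
    repeated = first_pass.keys() & second_pass.keys()
    if not repeated:
        raise ValueError("No repeated examples to compare")
    matches = sum(first_pass[ex_id] == second_pass[ex_id] for ex_id in repeated)
    return matches / len(repeated)


# Hypothetical scores from one rater on two different days (1-5 scale)
first_day = {"ex1": 4, "ex2": 2, "ex3": 5, "ex4": 3}
later_day = {"ex1": 4, "ex2": 3, "ex3": 5, "ex4": 3, "ex9": 1}

rate = self_agreement_rate(first_day, later_day)
print(f"Self-agreement: {rate:.2f}")  # 3 of the 4 repeated items match -> 0.75
```

A low rate on repeated items suggests the scoring guidelines leave too much room for interpretation, independent of how well different raters agree with each other.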

Python

# Rough cost model for human evaluation
def estimate_human_eval_cost(
    n_examples: int,
    rate_per_hour: float,
    examples_per_hour: int,
    n_raters_per_example: int = 3,
) -> dict:
    total_ratings = n_examples * n_raters_per_example
    hours = total_ratings / examples_per_hour
    cost = hours * rate_per_hour
    
    return {
        "examples": n_examples,
        "total_ratings": total_ratings,
        "hours": round(hours, 1),
        "cost_usd": round(cost, 2),
    }

# General annotator (e.g., for tone/clarity)
general = estimate_human_eval_cost(
    n_examples=500,
    rate_per_hour=25,
    examples_per_hour=60,
)
print("General annotator:", general)
# {'examples': 500, 'total_ratings': 1500, 'hours': 25.0, 'cost_usd': 625.0}

# Domain expert (e.g., for medical accuracy)
expert = estimate_human_eval_cost(
    n_examples=500,
    rate_per_hour=150,
    examples_per_hour=30,
)
print("Domain expert:", expert)
# {'examples': 500, 'total_ratings': 1500, 'hours': 50.0, 'cost_usd': 7500.0}

For a 500-example medical eval with 3 expert raters per example, that comes to roughly $7,500 and 50 rater-hours in total. This is why you do not run human eval on every pull request.

Automated Evaluation: Strengths and Weaknesses

Automated metrics compute a score from the model output, either by comparing it to a reference answer or by applying a heuristic.

Strengths:

Fast: thousands of examples in seconds
Cheap: compute cost only
Reproducible: same script, same score every time
Scalable: runs in CI on every commit

Weaknesses:

May penalize valid paraphrases
Cannot detect factual errors that are fluently expressed
No contextual judgment (tone, appropriateness, safety)
Optimizing for the metric can diverge from actual quality

Python

# Simple automated eval pipeline
import json

def run_automated_eval(
    dataset_path: str,
    model_fn,  # function that takes prompt, returns string
    metric_fn,  # function that takes (output, reference) -> float
) -> dict:
    examples = []
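    with open(dataset_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines, which json.loads would reject
            obj = json.loads(line)
            if not obj.get("_metadata"):  # skip records flagged as metadata
                examples.append(obj)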
    
    scores = []
    failures = []
    
    for ex in examples:
        try:
            output = model_fn(ex["prompt"])
            score = metric_fn(output, ex["ideal_response"])
            scores.append({
                "id": ex["id"],
                "score": score,
                "output": output,
            })
        except Exception as e:
            # Record the failure and keep going; .get avoids a second KeyError if "id" is missing
            failures.append({"id": ex.get("id"), "error": str(e)})
    
    all_scores = [s["score"] for s in scores]
    return {
        "n_examples": len(examples),
        "n_scored": len(scores),
        "n_failed": len(failures),
        "mean_score": sum(all_scores) / len(all_scores) if all_scores else 0,
        "min_score": min(all_scores) if all_scores else 0,
        "max_score": max(all_scores) if all_scores else 0,
        "failures": failures,
    }
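
The pipeline above accepts any `metric_fn`. As one illustration of the "may penalize valid paraphrases" weakness, here is a sketch of a token-overlap F1 metric. The function name `token_f1`, the whitespace tokenization, and the example sentences are assumptions made for this example, not a recommended metric.

```python
from collections import Counter


def token_f1(output: str, reference: str) -> float:
    """Token-overlap F1 between a model output and a reference answer."""
    out_tokens = output.lower().split()
    ref_tokens = reference.lower().split()
    if not out_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection: a shared token counts only as often as it appears in both
    overlap = sum((Counter(out_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(out_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "take the medication twice a day with food"
verbatim = "take the medication twice a day with food"
paraphrase = "have one dose every morning and evening at mealtimes"

print(token_f1(verbatim, reference))    # 1.0
print(token_f1(paraphrase, reference))  # 0.0, although the meaning is close
```

The paraphrase shares no tokens with the reference, so it scores zero even though a human rater would likely accept it. This is the kind of gap that calibration against human scores is meant to expose.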

Inter-Annotator Agreement: Measuring Human Consistency

Before trusting human eval results, measure whether your raters agree with each other. Low agreement means your annotation guidelines are ambiguous or your task is too subjective.

The most common measures are Cohen's Kappa, for categorical ratings from exactly two raters, and Krippendorff's Alpha, which extends to ordinal or continuous scales and to more than two raters.

Python

from collections import Counter

def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Compute Cohen's Kappa for two raters giving categorical scores."""
    assert len(rater_a) == len(rater_b), "Raters must score the same examples"
    n = len(rater_a)
    
    categories = sorted(set(rater_a) | set(rater_b))
    
    # Observed agreement
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    
    # Expected agreement by chance
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    p_e = sum(
        (count_a.get(c, 0) / n) * (count_b.get(c, 0) / n)
        for c in categories
    )
    
    if p_e == 1.0:
        # Both raters used one identical category throughout; avoid division by zero
        return 1.0
    
    kappa = (p_o - p_e) / (1 - p_e)
    return round(kappa, 4)


# Example: two medical experts rate 10 responses on helpfulness (1-3)
rater_a = [3, 2, 3, 1, 2, 3, 2, 1, 3, 2]
rater_b = [3, 2, 2, 1, 2, 3, 3, 1, 3, 2]

kappa = cohens_kappa(rater_a, rater_b)
print(f"Cohen's Kappa: {kappa}")
# Cohen's Kappa: 0.6875  (observed agreement 0.80, chance agreement 0.36)

# Interpretation guide
def interpret_kappa(k: float) -> str:
    if k < 0:
        return "Poor (worse than chance)"
    elif k < 0.20:
        return "Slight agreement"
    elif k < 0.40:
        return "Fair agreement"
    elif k < 0.60:
        return "Moderate agreement"
    elif k < 0.80:
        return "Substantial agreement"
    else:
        return "Almost perfect agreement"

print(f"Interpretation: {interpret_kappa(kappa)}")
# Interpretation: Substantial agreement
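
Cohen's Kappa compares exactly two raters, while the cost model earlier assumes three raters per example. One simple way to bridge that gap is to compute kappa for every pair of raters and look for the weakest pair. The sketch below is self-contained, so it restates a compact kappa function; `pairwise_kappas` and the third rater's scores are hypothetical additions for illustration.

```python
from collections import Counter
from itertools import combinations


def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Unrounded Cohen's Kappa for two raters scoring the same examples."""
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of each rater's marginal frequency per category
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return 1.0 if p_e == 1.0 else (p_o - p_e) / (1 - p_e)


def pairwise_kappas(ratings: dict[str, list[int]]) -> dict[tuple[str, str], float]:
    """Kappa for every pair of raters, keyed by the (sorted) pair of rater names."""
    return {
        (a, b): round(cohens_kappa(ratings[a], ratings[b]), 4)
        for a, b in combinations(sorted(ratings), 2)
    }


ratings = {
    "rater_a": [3, 2, 3, 1, 2, 3, 2, 1, 3, 2],
    "rater_b": [3, 2, 2, 1, 2, 3, 3, 1, 3, 2],
    "rater_c": [3, 2, 3, 1, 1, 3, 2, 1, 3, 2],  # hypothetical third rater
}

kappas = pairwise_kappas(ratings)
for pair, k in kappas.items():
    print(pair, k)

weakest = min(kappas, key=kappas.get)
print("Lowest agreement:", weakest)  # ('rater_b', 'rater_c')
```

The weakest pair is a reasonable place to start a calibration discussion. Dedicated multi-rater statistics exist as well; pairwise kappa is only the most direct extension of the two-rater function.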

Target kappa values by task:

| Task | Acceptable Kappa |
|------|------------------|
| Safety classification | 0.80+ |
| Factual accuracy | 0.70+ |
| Helpfulness (ordinal) | 0.60+ |
| Tone/style | 0.50+ |
| Overall quality | 0.55+ |

If kappa falls below target, run a calibration session with all raters before collecting more data.
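
The targets in the table can be turned into an automatic check that runs after each annotation batch. This is a minimal sketch; the dictionary keys and the function name `needs_calibration` are hypothetical, and the thresholds are copied from the table above.

```python
# Thresholds from the target table; the key names are illustrative
KAPPA_TARGETS = {
    "safety_classification": 0.80,
    "factual_accuracy": 0.70,
    "helpfulness": 0.60,
    "tone_style": 0.50,
    "overall_quality": 0.55,
}


def needs_calibration(task: str, kappa: float) -> bool:
    """True when measured agreement falls below the target for this task."""
    if task not in KAPPA_TARGETS:
        raise KeyError(f"No kappa target defined for task: {task}")
    return kappa < KAPPA_TARGETS[task]


print(needs_calibration("helpfulness", 0.6875))       # False: 0.6875 >= 0.60
print(needs_calibration("factual_accuracy", 0.6875))  # True: 0.6875 < 0.70
```

The same kappa can pass for one task and fail for another, which is why the target should be chosen per task rather than applied as one global cutoff.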

The Hybrid Approach

The practical answer is to use both, at different frequencies and for different purposes.

Frequency         | Method          | Purpose
------------------|-----------------|----------------------------------------
Every commit      | Unit assertions | Catch hard regressions
Every PR          | Automated eval  | Catch metric regressions
Weekly            | LLM-as-judge    | Catch nuanced quality changes
Monthly/Quarterly | Human eval      | Ground truth calibration + safety audit
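
The "unit assertions" tier covers checks that have a clear pass or fail answer and need no reference scoring. The sketch below assumes a hypothetical output contract, a JSON object with an `"answer"` key and a length limit; substitute whatever hard constraints your own application has.

```python
import json


def check_hard_constraints(output: str, max_chars: int = 2000) -> list[str]:
    """Return the violated constraints; an empty list means the output passes."""
    violations = []
    if not output.strip():
        violations.append("empty output")
    if len(output) > max_chars:
        violations.append(f"output longer than {max_chars} characters")
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        violations.append("output is not valid JSON")
    else:
        # Only inspect the structure when parsing succeeded
        if not isinstance(parsed, dict) or "answer" not in parsed:
            violations.append('JSON object is missing the "answer" key')
    return violations


good = '{"answer": "Paris", "sources": []}'
bad = "The capital of France is Paris."

assert check_hard_constraints(good) == []
print(check_hard_constraints(bad))  # ['output is not valid JSON']
```

Checks like these are cheap enough to run on every commit, and a failure points at a concrete regression rather than a drifting score.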

Calibrating Automated Metrics Against Human Scores

The most important hybrid practice: periodically check that your automated metric correlates with human scores. If they diverge, your automated metric is misleading you.

Python

import numpy as np
from scipy import stats

def check_metric_human_correlation(
    automated_scores: list[float],
    human_scores: list[float],
    metric_name: str,
) -> dict:
    """Compute Pearson and Spearman correlation between automated and human scores."""
    assert len(automated_scores) == len(human_scores)
    
    pearson_r, pearson_p = stats.pearsonr(automated_scores, human_scores)
    spearman_r, spearman_p = stats.spearmanr(automated_scores, human_scores)
    
    result = {
        "metric": metric_name,
        "n": len(automated_scores),
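        # Cast NumPy scalars to plain floats so the dict prints and serializes cleanly
        "pearson_r": round(float(pearson_r), 3),
        "pearson_p": round(float(pearson_p), 4),
        "spearman_r": round(float(spearman_r), 3),
        "spearman_p": round(float(spearman_p), 4),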
        "correlation_strength": "",
    }
    
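    # Judge strength by magnitude: lower-is-better metrics correlate negatively with
    # human scores. Check the sign separately to confirm it points the way you expect.
    r = abs(spearman_r)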
    if r >= 0.7:
        result["correlation_strength"] = "Strong — metric is reliable"
    elif r >= 0.5:
        result["correlation_strength"] = "Moderate — use with caution"
    else:
        result["correlation_strength"] = "Weak — consider a different metric"
    
    return result


# Simulated data: 50 examples with ROUGE-L and human helpfulness scores
np.random.seed(42)
rouge_scores = np.random.uniform(0.2, 0.9, 50)
# Simulate human scores that partly track ROUGE-L, plus unrelated variation and noise
human_scores = (
    0.6 * rouge_scores
    + 0.4 * np.random.uniform(1, 5, 50) / 5
    + np.random.normal(0, 0.05, 50)
)
human_scores = np.clip(human_scores, 0, 1)

result = check_metric_human_correlation(
    rouge_scores.tolist(),
    human_scores.tolist(),
    "ROUGE-L"
)
print(result)

When to Rely on Human Eval

Some situations demand human evaluation regardless of cost:

Safety-critical outputs: Medical advice, legal information, mental health support. A model that confidently gives wrong advice on even 0.1% of queries can cause real harm, and aggregate automated metrics rarely surface failures that rare.

Novel tasks: If you have no established metric and no reference answers, humans are the only option.

Regulatory requirements: In regulated domains such as healthcare, finance, and law, you may need documented evidence that qualified experts reviewed model outputs.

Model release decisions: Before deploying a new model to production, run a human evaluation. Automated metrics alone are not sufficient due diligence.

Python

# Example: flagging responses for mandatory human review
HUMAN_REVIEW_TRIGGERS = [
    "suicide", "self-harm", "overdose", "emergency",
    "lawsuit", "illegal", "criminal",
    "classified", "confidential",
]

def requires_human_review(response: str, context: dict) -> bool:
    text_lower = response.lower()
    
    # Safety triggers
    for trigger in HUMAN_REVIEW_TRIGGERS:
        if trigger in text_lower:
            return True
    
    # High-stakes tasks always require review
    if context.get("task_type") in ["medical_diagnosis", "legal_advice"]:
        return True
    
    # Low score from the automated scorer (a missing score defaults to 1.0 and passes)
    if context.get("auto_score", 1.0) < 0.4:
        return True
    
    return False


# Example usage in a production pipeline
response = "For severe overdose symptoms, call 911 immediately."
context = {"task_type": "medical_qa", "auto_score": 0.72}

if requires_human_review(response, context):
    print("FLAGGED: Queued for human review")
else:
    print("OK: Passed automated checks")
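
Plain substring matching, as used above, also fires when a trigger sits inside a longer word: "unclassified" contains "classified". One possible refinement is to match whole words with a regular expression. This is a sketch with a shortened, hypothetical trigger list; `matched_triggers` is not part of the pipeline above.

```python
import re

TRIGGERS = ["overdose", "classified", "confidential"]

# \b anchors each trigger to word boundaries so it cannot match inside a longer word
TRIGGER_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(t) for t in TRIGGERS) + r")\b",
    re.IGNORECASE,
)


def matched_triggers(response: str) -> list[str]:
    """Return the triggers that appear in the response as whole words, lowercased."""
    return sorted({m.group(0).lower() for m in TRIGGER_PATTERN.finditer(response)})


print(matched_triggers("This dataset is unclassified and public."))   # []
print(matched_triggers("Signs of an overdose require urgent care."))  # ['overdose']
```

The trade-off runs in both directions: whole-word matching avoids the "unclassified" false positive but also skips inflected forms such as "overdosed", which substring matching would catch. For safety triggers, erring toward over-flagging is often the safer default.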

Building an Annotation Workflow

When you do run human eval, make it efficient.

Python

# Annotation interface structure (backend data model)
from dataclasses import dataclass
from typing import Optional

@dataclass
class AnnotationTask:
    task_id: str
    prompt: str
    response: str
    criteria: list[str]  # e.g., ["accuracy", "safety", "helpfulness"]
    scale: dict  # e.g., {"min": 1, "max": 5, "labels": {1: "very bad", 5: "excellent"}}
    assigned_to: str
    deadline: str


@dataclass
class Annotation:
    task_id: str
    annotator_id: str
    scores: dict[str, int]  # {"accuracy": 4, "safety": 5, "helpfulness": 3}
    notes: Optional[str]
    flagged_for_review: bool
    completed_at: str


def aggregate_annotations(
    annotations: list[Annotation],
    method: str = "mean",
) -> dict[str, float]:
    """Aggregate multiple annotations for the same task, one value per criterion."""
    from collections import Counter, defaultdict
    from statistics import median

    if method not in {"mean", "median", "majority"}:
        raise ValueError(f"Unknown aggregation method: {method}")

    # Group the scores from all annotators by criterion
    scores_by_criterion = defaultdict(list)
    for ann in annotations:
        for criterion, score in ann.scores.items():
            scores_by_criterion[criterion].append(score)

    result = {}
    for criterion, scores in scores_by_criterion.items():
        if method == "mean":
            result[criterion] = round(sum(scores) / len(scores), 2)
        elif method == "median":
            # statistics.median averages the two middle values when the count is even
            result[criterion] = median(scores)
        else:  # "majority"
            # Ties go to the score that was encountered first
            result[criterion] = Counter(scores).most_common(1)[0][0]

    return result
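
Aggregation hides disagreement: a mean of 4 can come from three raters who all said 4 or from scores of 5, 5, and 2. One way to keep human eval efficient is to send only the contested items to a second look. The sketch below uses plain dicts instead of the dataclasses above so that it runs on its own; `criteria_needing_adjudication` and the `max_spread` threshold are hypothetical.

```python
def criteria_needing_adjudication(
    scores_by_rater: dict[str, dict[str, int]],
    max_spread: int = 1,
) -> list[str]:
    """List criteria where the highest and lowest rater scores differ by more than max_spread."""
    by_criterion: dict[str, list[int]] = {}
    for scores in scores_by_rater.values():
        for criterion, score in scores.items():
            by_criterion.setdefault(criterion, []).append(score)
    return sorted(
        criterion
        for criterion, values in by_criterion.items()
        if max(values) - min(values) > max_spread
    )


# Hypothetical scores from three raters for a single task (1-5 scale)
task_scores = {
    "rater_1": {"accuracy": 4, "safety": 5, "helpfulness": 3},
    "rater_2": {"accuracy": 4, "safety": 2, "helpfulness": 4},
    "rater_3": {"accuracy": 5, "safety": 5, "helpfulness": 3},
}

print(criteria_needing_adjudication(task_scores))  # ['safety']
```

Here only the safety score needs adjudication, so a senior reviewer can resolve one criterion instead of re-rating the whole task.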

Key Takeaways

Human eval is the gold standard but too slow and expensive for every commit.
Automated eval is scalable and reproducible but can miss nuance and safety issues.
Measure inter-annotator agreement (for example, with Cohen's Kappa) before trusting human scores.
Calibrate automated metrics against human scores at least quarterly.
Always use human eval for safety-critical, legal, or medical outputs.
Build a hybrid pyramid: automated in CI, human on a periodic schedule.

What's Next

In eval-task-types.mdx, you will learn which specific metrics to use for classification, generation, RAG, code generation, and conversation tasks.