Why Evaluating LLMs Is Hard

Evaluating a logistic regression model is straightforward: run it on a held-out test set, compute accuracy. Done.

Evaluating an LLM is nothing like that. This lesson explains why — and sets the foundation for every evaluation technique in this course.

The Core Problem

Traditional ML models produce a single, deterministic output for a given input. Given an image, a classifier outputs a class label. Given a structured input, a regression model outputs a number. You compare the output to a ground-truth label, compute a metric, and move on.

LLMs operate on open-ended text. The output space is practically infinite. Two responses can be completely different in wording yet both correct. One response can be shorter, more precise, more polite — and still lose to a longer one on a naive metric.

This creates five interlocking problems.

Problem 1: Non-Deterministic Outputs

The same prompt, run twice, can produce different answers.

Python

import anthropic

client = anthropic.Anthropic()

prompt = "Explain why the sky is blue in two sentences."

responses = []
for i in range(5):
    message = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}]
    )
    responses.append(message.content[0].text)

for i, r in enumerate(responses):
    print(f"Run {i+1}: {r[:100]}...")
    print()

Even with temperature=0 (greedy decoding), many model APIs still introduce slight variation through:

Floating-point non-determinism across GPU hardware
Batching effects when multiple requests are served together
Model updates deployed silently by providers
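One rough way to quantify this variation is to compare repeated responses to the same prompt pairwise. The sketch below uses the standard-library `difflib.SequenceMatcher`; the helper name `response_consistency` and the sample strings are illustrative assumptions standing in for real API responses.

```python
from difflib import SequenceMatcher
from itertools import combinations


def response_consistency(responses: list[str]) -> dict:
    """Summarize how much repeated responses to one prompt differ.

    Hypothetical helper for illustration only.
    """
    if len(responses) < 2:
        raise ValueError("Need at least two responses to compare")
    # Character-level similarity ratio in [0, 1] for every pair of runs
    ratios = [
        SequenceMatcher(None, a, b).ratio()
        for a, b in combinations(responses, 2)
    ]
    return {
        "n_runs": len(responses),
        "n_distinct": len(set(responses)),
        "mean_pairwise_similarity": sum(ratios) / len(ratios),
    }


# Made-up responses standing in for repeated API calls
sample = [
    "The sky is blue because air scatters short wavelengths of sunlight.",
    "The sky is blue because air scatters short wavelengths of sunlight.",
    "Sunlight is scattered by air molecules, and blue light scatters most.",
]
summary = response_consistency(sample)
print(summary["n_distinct"], round(summary["mean_pairwise_similarity"], 2))
```

Character-level similarity only captures surface differences. Two responses can score low here and still mean the same thing, so treat this as a quick signal of instability rather than a quality metric.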

This means you cannot simply run an eval once and trust the result. You need to run it multiple times and report means and confidence intervals.

Python

import numpy as np

def eval_with_confidence(scores: list[float]) -> dict:
    arr = np.array(scores, dtype=float)
    n = len(arr)
    mean = float(arr.mean())
    # Sample standard deviation (ddof=1), since the runs are a sample
    std = float(arr.std(ddof=1))
    # Approximate 95% confidence interval half-width (normal approximation, z=1.96).
    # For small n, a Student's t critical value gives a wider, more honest interval.
    ci = 1.96 * std / np.sqrt(n)
    return {
        "mean": round(mean, 4),
        "std": round(std, 4),
        "ci_95": round(float(ci), 4),
        "n": n,
    }

# Simulated scores from 10 eval runs on same dataset
scores = [0.82, 0.85, 0.81, 0.83, 0.84, 0.86, 0.82, 0.80, 0.85, 0.83]
print(eval_with_confidence(scores))
# {'mean': 0.831, 'std': 0.0191, 'ci_95': 0.0119, 'n': 10}

Rule of thumb: if your eval dataset has fewer than 200 examples, run the eval at least 3 times and average the results.
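As a minimal sketch of that rule of thumb, the wrapper below repeats an eval and summarizes the scores. `repeat_eval` and `fake_eval_run` are hypothetical names; the fake run simulates a fixed score plus seeded noise so the example works without any model call.

```python
import random
import statistics
from typing import Callable


def repeat_eval(run_eval_once: Callable[[], float], n_runs: int = 3) -> dict:
    """Run an eval several times and summarize the scores."""
    if n_runs < 2:
        raise ValueError("n_runs must be at least 2 to estimate spread")
    scores = [run_eval_once() for _ in range(n_runs)]
    return {
        "scores": scores,
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),  # sample standard deviation
    }


# Hypothetical stand-in for a real eval run: a fixed "true" score plus noise
rng = random.Random(0)


def fake_eval_run() -> float:
    return 0.83 + rng.uniform(-0.02, 0.02)


result = repeat_eval(fake_eval_run, n_runs=3)
print(f"mean={result['mean']:.3f} stdev={result['stdev']:.3f}")
```

In a real pipeline, `run_eval_once` would run the full dataset through the model and return the aggregate score. Keeping the individual scores makes it possible to check later whether one run was an outlier.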

Problem 2: No Single Ground Truth

For a question like "What is 2 + 2?", there is one correct answer. For a question like "Summarize this article about climate change in 3 bullet points", there are hundreds of valid answers.

Python

# All of these are valid answers to the same summarization prompt
valid_answers = [
    "• Arctic ice is melting faster than models predicted\n• Carbon emissions hit record highs in 2025\n• Renewable energy adoption is accelerating but not fast enough",
    "1. Global temperatures rose 1.5°C above pre-industrial levels\n2. Extreme weather events are increasing in frequency\n3. Policy action remains insufficient",
    "The article covers accelerating climate impacts, record emissions, and slow policy response.",
]

This is sometimes called the one-to-many problem. There is one input but many correct outputs. Any metric that compares the model output to a single reference answer will penalize valid paraphrases.
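One partial mitigation is to score against several references and keep the best match. The sketch below assumes a simple set-based token F1; the function names and example strings are illustrative, not a standard library API.

```python
def token_f1(output: str, reference: str) -> float:
    """Set-based token F1 between an output and one reference."""
    out_tokens = set(output.lower().split())
    ref_tokens = set(reference.lower().split())
    overlap = out_tokens & ref_tokens
    if not overlap:
        return 0.0
    precision = len(overlap) / len(out_tokens)
    recall = len(overlap) / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def multi_reference_f1(output: str, references: list[str]) -> float:
    """Score against every reference and keep the best match."""
    if not references:
        raise ValueError("At least one reference is required")
    return max(token_f1(output, ref) for ref in references)


references = [
    "Arctic ice is melting faster than models predicted",
    "Polar ice loss is outpacing earlier forecasts",
]
output = "Polar ice loss is outpacing forecasts"

single = token_f1(output, references[0])
best = multi_reference_f1(output, references)
print(f"single-reference F1: {single:.2f}, multi-reference F1: {best:.2f}")
# single-reference F1: 0.29, multi-reference F1: 0.92
```

Multiple references only help with paraphrases that someone has already written down. A valid answer unlike every reference is still penalized, which is why overlap metrics are usually paired with other signals.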

The deeper issue: even defining "correct" is a judgment call. A medical AI might give a response that is technically accurate but lacks appropriate caution. Is that correct? Who decides?

Problem 3: Task Diversity Requires Different Metrics

LLMs are used for radically different tasks. Each task needs different evaluation criteria.

| Task | What Matters | Wrong Metric |
|------|-------------|-------------|
| Summarization | Coverage, concision, no hallucination | Exact match |
| Classification | Accuracy, F1 per class | BLEU score |
| Code generation | Execution success, correctness | ROUGE |
| Medical Q&A | Accuracy, safety, appropriate uncertainty | Verbosity |
| Conversation | Coherence, helpfulness, tone | Perplexity |
| RAG | Faithfulness to context, relevance | N-gram overlap |

There is no universal LLM metric. The eval strategy must be designed per task.

Python

from dataclasses import dataclass

@dataclass
class EvalStrategy:
    task_type: str
    primary_metric: str
    secondary_metrics: list[str]
    requires_reference: bool
    requires_human: bool

strategies = {
    "summarization": EvalStrategy(
        task_type="summarization",
        primary_metric="rouge_l",
        secondary_metrics=["bertscore_f1", "faithfulness"],
        requires_reference=True,
        requires_human=False,
    ),
    "code_generation": EvalStrategy(
        task_type="code_generation",
        primary_metric="pass_at_1",
        secondary_metrics=["syntax_valid", "test_coverage"],
        requires_reference=False,
        requires_human=False,
    ),
    "medical_qa": EvalStrategy(
        task_type="medical_qa",
        primary_metric="factual_accuracy",
        secondary_metrics=["safety_score", "appropriate_uncertainty"],
        requires_reference=True,
        requires_human=True,
    ),
    "conversation": EvalStrategy(
        task_type="conversation",
        primary_metric="user_satisfaction",
        secondary_metrics=["coherence", "helpfulness", "safety"],
        requires_reference=False,
        requires_human=True,
    ),
}

def select_strategy(task: str) -> EvalStrategy:
    if task not in strategies:
        raise ValueError(f"No eval strategy defined for task: {task}")
    return strategies[task]

strategy = select_strategy("medical_qa")
print(f"Primary metric: {strategy.primary_metric}")
print(f"Requires human: {strategy.requires_human}")
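A strategy only names its metrics; something still has to map those names to functions. One possible shape for that mapping is a small registry, sketched below. The registry, the decorator, and both metric functions are hypothetical illustrations, and the real metrics named in the strategies above (such as `rouge_l`) would need their own implementations.

```python
from typing import Callable

# A metric takes (model_output, reference) and returns a score in [0, 1]
MetricFn = Callable[[str, str], float]

METRICS: dict[str, MetricFn] = {}


def register_metric(name: str) -> Callable[[MetricFn], MetricFn]:
    """Decorator that adds a metric function to the registry under `name`."""
    def decorator(fn: MetricFn) -> MetricFn:
        METRICS[name] = fn
        return fn
    return decorator


@register_metric("exact_match")
def exact_match(output: str, reference: str) -> float:
    return float(output.strip() == reference.strip())


@register_metric("contains_reference")
def contains_reference(output: str, reference: str) -> float:
    return float(reference.lower() in output.lower())


def run_metrics(names: list[str], output: str, reference: str) -> dict[str, float]:
    """Look up each named metric and apply it; unknown names fail loudly."""
    missing = [n for n in names if n not in METRICS]
    if missing:
        raise KeyError(f"No metric registered for: {missing}")
    return {n: METRICS[n](output, reference) for n in names}


scores = run_metrics(
    ["exact_match", "contains_reference"],
    output="The capital of France is Paris.",
    reference="Paris",
)
print(scores)  # {'exact_match': 0.0, 'contains_reference': 1.0}
```

Failing on an unknown metric name, rather than silently skipping it, keeps a typo in a strategy definition from quietly removing a metric from the eval.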

Problem 4: The Reference-Free Problem

Most traditional metrics require a reference answer to compare against. But reference answers:

Are expensive to collect (domain expert time)
Age out quickly (drug dosages change, laws change)
May not exist at all for novel queries

This leads to reference-free evaluation, where you assess quality without a gold standard.

Python

# Reference-based: compare model output to known correct answer
def reference_based_score(output: str, reference: str) -> float:
    # Simplified example
    output_words = set(output.lower().split())
    reference_words = set(reference.lower().split())
    overlap = output_words & reference_words
    recall = len(overlap) / len(reference_words) if reference_words else 0
    precision = len(overlap) / len(output_words) if output_words else 0
    f1 = 2 * precision * recall / (precision + recall + 1e-8)
    return f1

import json

# Reference-free: use heuristics or a judge model
def reference_free_score(prompt: str, output: str, judge_client) -> float:
    judge_prompt = f"""Rate the following response on a scale from 1 to 5.

Question: {prompt}

Response: {output}

Rate ONLY on: accuracy, helpfulness, clarity.
Return a JSON object: {{"score": <integer 1-5>, "reason": "<one sentence>"}}"""

    result = judge_client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=100,
        messages=[{"role": "user", "content": judge_prompt}]
    )
    raw = result.content[0].text
    # Judge models sometimes wrap the JSON in prose or code fences,
    # so parse only the outermost {...} span
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError(f"Judge did not return JSON: {raw!r}")
    data = json.loads(raw[start:end + 1])
    return data["score"] / 5.0  # maps the 1-5 scale to 0.2-1.0

Reference-free evaluation is powerful but introduces its own biases. If you use an LLM as the judge, you inherit that model's biases (covered later in eval-judge-bias.mdx).
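The heuristic side of reference-free evaluation needs no judge model at all. The sketch below combines a few cheap checks into one score; the specific checks, the 150-word limit, and the 0.5 repetition threshold are illustrative assumptions rather than established cutoffs.

```python
def heuristic_checks(output: str, max_words: int = 150) -> dict[str, bool]:
    """Cheap reference-free checks; thresholds are illustrative assumptions."""
    words = output.split()
    stripped = output.strip()
    unique_ratio = len(set(w.lower() for w in words)) / len(words) if words else 0.0
    return {
        "non_empty": bool(words),
        "within_length": len(words) <= max_words,
        # A response cut off by a token limit often ends without punctuation
        "ends_cleanly": stripped.endswith((".", "!", "?")),
        # Heavy repetition is a common symptom of degenerate output
        "not_repetitive": unique_ratio >= 0.5,
    }


def heuristic_score(output: str) -> float:
    """Fraction of checks passed, in [0, 1]."""
    checks = heuristic_checks(output)
    return sum(checks.values()) / len(checks)


good = "The sky looks blue because air molecules scatter short wavelengths most."
bad = "the the the the the the the the"
print(heuristic_score(good), heuristic_score(bad))  # 1.0 0.5
```

Heuristics like these catch malformed output but say nothing about whether the content is right, so they complement a judge model rather than replace it.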

Problem 5: Traditional ML Metrics Don't Apply Directly

Consider accuracy: what is the "accuracy" of a chatbot response? Accuracy counts how often a prediction matches a discrete ground-truth label, and open-ended text has no such label.

Consider F1 score: it requires true positives, false positives, and false negatives. For a generation task, how do you define a false positive?

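Consider mean squared error (MSE): it measures the squared distance between numeric predictions and numeric targets. LLM outputs are text, so there is no numeric difference to square.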

Python

# Why you can't naively apply accuracy to generation tasks

generated = "The capital of France is Paris, a beautiful city."
reference = "Paris is the capital of France."

# Exact match accuracy: 0 (strings are different)
exact_match = int(generated.strip() == reference.strip())
print(f"Exact match: {exact_match}")  # 0

# But the generated answer is correct! The metric failed.

# Token-level accuracy is also misleading
def token_accuracy(gen: str, ref: str) -> float:
    gen_tokens = gen.lower().split()
    ref_tokens = ref.lower().split()
    if len(ref_tokens) == 0:
        return 0.0
    correct = sum(1 for g, r in zip(gen_tokens, ref_tokens) if g == r)
    return correct / len(ref_tokens)

print(f"Token accuracy: {token_accuracy(generated, reference):.2f}")  # 0.00
# Zero because tokens are compared position by position and the word order differs,
# even though the meaning is correct
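For short factual answers, a common workaround in QA-style evals is to normalize both strings and check whether the answer appears anywhere in the output. The sketch below assumes the reference can be reduced to a short answer key such as "Paris"; `normalize` and `contains_answer` are hypothetical helper names.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def contains_answer(generated: str, answer: str) -> bool:
    """True if the normalized answer appears as whole words in the output."""
    gen_tokens = normalize(generated).split()
    ans_tokens = normalize(answer).split()
    n = len(ans_tokens)
    if n == 0:
        return False
    # Slide a window of n tokens across the output and compare
    return any(
        gen_tokens[i:i + n] == ans_tokens
        for i in range(len(gen_tokens) - n + 1)
    )


generated = "The capital of France is Paris, a beautiful city."
print(contains_answer(generated, "Paris"))   # True
print(contains_answer(generated, "Berlin"))  # False
```

This tolerates extra wording and different word order, but it breaks down once the expected answer is longer than a phrase, and it cannot tell "Paris" from "not Paris".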

The right framing is: fitness for purpose. A good LLM response is one that would satisfy the user's intent. That is a richer criterion than any single number can capture.

The Evaluation Pyramid

A practical LLM evaluation strategy layers multiple approaches:

               Human Audit (quarterly)
              /                        \
         LLM-as-Judge (weekly)
        /                    \
   Automated Metrics (every PR)
  /                            \
Unit Tests on Known Examples (every commit)

Layer 1 — Unit tests: Hard-coded assertions. "Given prompt X, the response must contain Y." Fast, deterministic, catches regressions.

Layer 2 — Automated metrics: ROUGE, BERTScore, pass-rate on golden dataset. Runs in CI, takes 1-2 minutes.

Layer 3 — LLM-as-judge: A capable model scores responses on a rubric. Runs weekly or on major changes.

Layer 4 — Human audit: Domain experts review a sample. Runs quarterly or before major releases.

Python

# Layer 1: Unit test style assertion
def assert_response_contains(response: str, required: str) -> bool:
    return required.lower() in response.lower()

# Example: test that medical AI always includes a disclaimer
response = "Ibuprofen is used for pain relief. Always consult your doctor before use."
assert assert_response_contains(response, "consult your doctor"), \
    "Medical response missing required disclaimer"
print("Safety assertion passed")


# Layer 2: Automated metric
def compute_length_appropriateness(response: str, min_words: int, max_words: int) -> bool:
    word_count = len(response.split())
    return min_words <= word_count <= max_words

# Layer 3 and 4 are covered in dedicated lessons
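The "pass-rate on golden dataset" metric from Layer 2 can be sketched in a few lines. Everything here is an assumption for illustration: the three golden examples, the `must_contain` field, the canned `stub_model`, and the 0.6 threshold.

```python
from typing import Callable

# Hypothetical golden dataset: each entry pairs a prompt with a required phrase
GOLDEN = [
    {"prompt": "What is the capital of France?", "must_contain": "paris"},
    {"prompt": "Is ibuprofen safe to take daily?", "must_contain": "consult your doctor"},
    {"prompt": "What is 2 + 2?", "must_contain": "4"},
]


def pass_rate(model_fn: Callable[[str], str], dataset: list[dict]) -> float:
    """Fraction of golden examples whose response contains the required phrase."""
    if not dataset:
        raise ValueError("Golden dataset is empty")
    passed = sum(
        ex["must_contain"].lower() in model_fn(ex["prompt"]).lower()
        for ex in dataset
    )
    return passed / len(dataset)


def stub_model(prompt: str) -> str:
    """Canned responses standing in for a real model call."""
    canned = {
        "What is the capital of France?": "Paris is the capital of France.",
        "Is ibuprofen safe to take daily?": "It can be, depending on dosage.",
        "What is 2 + 2?": "2 + 2 = 4",
    }
    return canned[prompt]


rate = pass_rate(stub_model, GOLDEN)
print(f"Pass rate: {rate:.2f}")  # Pass rate: 0.67
assert rate >= 0.6, "Pass rate fell below the CI threshold"
```

The stub fails the ibuprofen example because its response lacks the disclaimer, which is the kind of regression a CI gate on pass rate is meant to surface.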

Key Takeaways

LLM outputs are non-deterministic. Always run evals multiple times and report confidence intervals.
There is rarely a single ground truth. Design metrics that tolerate valid variation.
Choose metrics per task. No universal metric exists.
Combine reference-based and reference-free evaluation.
Traditional ML metrics (accuracy, F1, MSE) need adaptation or replacement for generation tasks.
Build a layered evaluation pyramid: unit tests, automated metrics, LLM judge, human audit.

What's Next

In the next lesson (eval-golden-dataset.mdx), you will learn how to build the foundation of any automated eval: a golden dataset of curated prompt/response pairs.
