Building Prompt Evaluations

Why Evals Are Non-Negotiable

Prompts that "look right" often fail on edge cases. Without a structured evaluation, the gap between belief and reality looks like this:

You think the prompt works because:
  It works on the 3 examples you tested
  The output looked reasonable at a glance
  A colleague also thought it looked fine

It actually fails on:
  Clinical notes with unusual formatting
  Notes written by non-native English speakers
  Comorbid cases where the primary diagnosis is ambiguous
  Notes that contain adversarial content from patients

An eval set catches these before deployment.
A missing eval means you discover failures in production.

What an Eval Set Contains

Python

from dataclasses import dataclass
from typing import Any

@dataclass
class EvalCase:
    id: str
    input: Any              # the actual input (note, message, document)
    expected_output: Any    # ground truth output (can be JSON, string, class label)
    difficulty: str         # 'easy', 'medium', 'hard'
    tags: list[str]         # ['comorbidity', 'long_note', 'edge_case']
    notes: str              # explanation of what this case tests

# Example
case = EvalCase(
    id="med_001",
    input="Pt with AF. On warfarin, NSAIDS. No INR in 6 months.",
    expected_output={
        "primary_diagnosis": "Atrial fibrillation",
        "medications": ["Warfarin", "NSAIDs"],
        "flags": ["NSAID-Warfarin interaction risk", "INR overdue"]
    },
    difficulty="medium",
    tags=["drug_interaction", "monitoring_gap"],
    notes="Tests detection of NSAID+Warfarin interaction and overdue monitoring."
)
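
An eval set is only as useful as its coverage, so it can help to count cases per difficulty level and per tag before trusting a pass rate. The following is a minimal sketch under stated assumptions: coverage_report is a hypothetical helper, the EvalCase copy gives notes a default only for brevity, and the sample cases are placeholders.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Any


@dataclass
class EvalCase:  # trimmed copy of the dataclass above; notes default is an assumption
    id: str
    input: Any
    expected_output: Any
    difficulty: str
    tags: list[str]
    notes: str = ""


def coverage_report(cases: list[EvalCase]) -> dict:
    """Count cases per difficulty level and per tag (hypothetical helper)."""
    by_difficulty = Counter(c.difficulty for c in cases)
    by_tag = Counter(tag for c in cases for tag in c.tags)
    return {"by_difficulty": dict(by_difficulty), "by_tag": dict(by_tag)}


# Placeholder cases, only the metadata matters here
cases = [
    EvalCase("med_001", "placeholder note", {}, "medium",
             ["drug_interaction", "monitoring_gap"]),
    EvalCase("med_002", "placeholder note", {}, "hard", ["comorbidity"]),
    EvalCase("med_003", "placeholder note", {}, "hard", ["comorbidity", "long_note"]),
]
report = coverage_report(cases)
print(report["by_difficulty"])  # {'medium': 1, 'hard': 2}
print(report["by_tag"])
```

A tag or difficulty level with zero or one case is a hint that the set cannot yet say much about that category.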

Automated Eval Pipeline

Python

import json
from anthropic import Anthropic
from dataclasses import dataclass
from typing import Any   # used by EvalResult.actual_output

@dataclass
class EvalResult:
    case_id: str
    passed: bool
    score: float          # 0.0-1.0
    actual_output: Any
    failure_reason: str | None

def run_eval(
    prompt_fn,          # function(input) -> str
    eval_cases: list[EvalCase],
    grade_fn,           # function(case, actual) -> EvalResult
) -> list[EvalResult]:
    results = []
    for case in eval_cases:
        try:
            actual = prompt_fn(case.input)
            result = grade_fn(case, actual)
        except Exception as e:
            result = EvalResult(case.id, False, 0.0, None, str(e))
        results.append(result)
    return results

def summarise_results(results: list[EvalResult]) -> dict:
    passed = sum(1 for r in results if r.passed)
    total = len(results)
    mean_score = sum(r.score for r in results) / total if total else 0.0
    failures = [r for r in results if not r.passed]
    return {
        "pass_rate": passed / total if total else 0.0,  # guard against an empty run
        "mean_score": mean_score,
        "total": total,
        "failures": [{"id": f.case_id, "reason": f.failure_reason} for f in failures]
    }
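
A single overall pass rate can hide a category that fails consistently. As a rough sketch, the results can be joined back to the case tags to get a pass rate per tag. pass_rate_by_tag is a hypothetical helper, the dataclasses are trimmed copies of the ones in this article, and the sample data is made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class EvalCase:  # trimmed copy; notes default is an assumption for brevity
    id: str
    input: Any
    expected_output: Any
    difficulty: str
    tags: list[str]
    notes: str = ""


@dataclass
class EvalResult:  # same fields as the article's dataclass
    case_id: str
    passed: bool
    score: float
    actual_output: Any
    failure_reason: Optional[str]


def pass_rate_by_tag(cases: list[EvalCase], results: list[EvalResult]) -> dict[str, float]:
    """Return the fraction of passing results for each tag (hypothetical helper)."""
    tags_by_id = {c.id: c.tags for c in cases}
    passed, total = defaultdict(int), defaultdict(int)
    for r in results:
        # Results whose case id is unknown contribute to no tag
        for tag in tags_by_id.get(r.case_id, []):
            total[tag] += 1
            passed[tag] += int(r.passed)
    return {tag: passed[tag] / total[tag] for tag in total}


cases = [
    EvalCase("med_001", "note A", {}, "medium", ["drug_interaction"]),
    EvalCase("med_002", "note B", {}, "hard", ["comorbidity"]),
    EvalCase("med_003", "note C", {}, "hard", ["comorbidity", "drug_interaction"]),
]
results = [
    EvalResult("med_001", True, 1.0, {}, None),
    EvalResult("med_002", False, 0.0, {}, "Missing field: flags"),
    EvalResult("med_003", True, 1.0, {}, None),
]
rates = pass_rate_by_tag(cases, results)
print(rates)  # {'drug_interaction': 1.0, 'comorbidity': 0.5}
```

Here two of three cases pass overall, yet the comorbidity tag passes only half the time, which is the kind of signal an aggregate number hides.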

Grading Functions

Python

def exact_match_grader(case: EvalCase, actual_str: str) -> EvalResult:
    """Grade structured JSON output field by field against expected_output."""
    try:
        actual = json.loads(actual_str)
    except json.JSONDecodeError as e:
        return EvalResult(case.id, False, 0.0, actual_str, f"Invalid JSON: {e}")
    if not isinstance(actual, dict):
        return EvalResult(case.id, False, 0.0, actual, "Output is not a JSON object")
    # Check each required field
    for key, expected_val in case.expected_output.items():
        if key not in actual:
            return EvalResult(case.id, False, 0.0, actual, f"Missing field: {key}")
        if isinstance(expected_val, list):
            # Check all expected items are in actual (order-independent)
            missing = [v for v in expected_val if v not in actual[key]]
            if missing:
                return EvalResult(case.id, False, 0.5, actual,
                                  f"Missing items in {key}: {missing}")
        elif actual[key] != expected_val:
            # Non-list fields must match exactly
            return EvalResult(case.id, False, 0.0, actual,
                              f"Wrong value for {key}: {actual[key]!r}")
    return EvalResult(case.id, True, 1.0, actual, None)

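def llm_grader(case: EvalCase, actual_str: str, judge_client) -> EvalResult:
    """Grade free-text or complex outputs using an LLM judge.

    run_eval calls grade_fn(case, actual) with two arguments, so bind the
    client first, e.g. functools.partial(llm_grader, judge_client=client).
    """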
    prompt = f"""Grade the following response against the expected output.

Task: {case.notes}
Expected output: {json.dumps(case.expected_output, indent=2)}
Actual output: {actual_str}

Score 0.0-1.0:
  1.0 = perfect match or equivalent
  0.7 = mostly correct, minor omissions
  0.5 = partially correct
  0.0 = wrong or missing key information

Respond with JSON: {{"score": float, "reason": string}}"""

    response = judge_client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=150,
        messages=[{"role": "user", "content": prompt}]
    )
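    try:
        result = json.loads(response.content[0].text)
        score = float(result["score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError) as e:
        # A judge reply that is not the requested JSON counts as a failed grade
        return EvalResult(case.id, False, 0.0, actual_str, f"Unparseable judge reply: {e}")
    return EvalResult(case.id, score >= 0.7, score,
                      actual_str, result.get("reason"))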
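
LLM judges do not always return bare JSON; the object is sometimes wrapped in extra prose. The sketch below shows one defensive way to parse such a reply, assuming the judge was asked for the score and reason object described in the grading prompt. parse_judge_reply is a hypothetical helper, not part of the Anthropic SDK, and scoring an unparseable reply as 0.0 is an assumption you may want to change (for example, to retry instead).

```python
import json
import re


def parse_judge_reply(text: str) -> tuple[float, str]:
    """Extract (score, reason) from a judge reply; return 0.0 on unusable output."""
    # Grab the span from the first "{" to the last "}" so surrounding prose is ignored
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if not match:
        return 0.0, "Judge reply contained no JSON object"
    try:
        data = json.loads(match.group(0))
        score = float(data["score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError) as e:
        return 0.0, f"Unparseable judge reply: {e}"
    # Clamp to the 0.0-1.0 range the grading prompt asks for
    score = min(1.0, max(0.0, score))
    return score, str(data.get("reason", ""))


clean = parse_judge_reply('{"score": 0.7, "reason": "minor omissions"}')
wrapped = parse_judge_reply('Here is my grade: {"score": 1.5, "reason": "ok"} Hope that helps.')
broken = parse_judge_reply("The answer looks good to me.")
print(clean)    # (0.7, 'minor omissions')
print(wrapped)  # (1.0, 'ok')
print(broken)   # (0.0, 'Judge reply contained no JSON object')
```

Treating a malformed judge reply as a failed grade keeps one bad response from aborting the whole eval run, at the cost of occasionally under-counting passes.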

Eval-Driven Development Workflow

1. Build eval set BEFORE iterating on the prompt
   20-50 cases: easy + medium + hard + edge cases
   Include cases you know the current prompt fails on

2. Baseline: run the current prompt → record pass rate

3. Make a prompt change

4. Run eval → compare pass rate
   Improvement with no new failures on previously passing cases? Deploy.
   Regression? Revert or investigate.

5. Add new failure cases to the eval set when you discover them in production

6. Run eval on every prompt change in CI/CD
   "Prompt tests" should be as automatic as unit tests

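Steps 2 to 4 and the CI check in step 6 come down to comparing two runs case by case rather than comparing two pass rates. The following is a minimal sketch, assuming each run has been reduced to a mapping of case id to pass/fail. compare_runs and should_deploy are hypothetical helpers, and the 0.9 threshold is an arbitrary placeholder.

```python
def compare_runs(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict:
    """Compare per-case pass/fail maps (case id -> passed) from two eval runs."""
    shared = baseline.keys() & candidate.keys()
    # Regression: passed with the old prompt, fails with the new one
    regressions = sorted(cid for cid in shared if baseline[cid] and not candidate[cid])
    # Fix: failed with the old prompt, passes with the new one
    fixes = sorted(cid for cid in shared if not baseline[cid] and candidate[cid])

    def rate(run: dict[str, bool]) -> float:
        return sum(run.values()) / len(run) if run else 0.0

    return {
        "baseline_pass_rate": rate(baseline),
        "candidate_pass_rate": rate(candidate),
        "regressions": regressions,
        "fixes": fixes,
    }


def should_deploy(report: dict, min_pass_rate: float = 0.9) -> bool:
    """Gate a prompt change: no regressions and a pass rate above the threshold."""
    return (
        not report["regressions"]
        and report["candidate_pass_rate"] >= min_pass_rate
        and report["candidate_pass_rate"] >= report["baseline_pass_rate"]
    )


baseline = {"med_001": True, "med_002": False, "med_003": True, "med_004": True}
candidate = {"med_001": True, "med_002": True, "med_003": False, "med_004": True}
report = compare_runs(baseline, candidate)
print(report["regressions"])   # ['med_003']
print(report["fixes"])         # ['med_002']
print(should_deploy(report))   # False
```

In this example both runs pass 3 of 4 cases, so the pass rate alone shows no change, while the per-case view shows that one case was fixed and another broke. In CI, a failing should_deploy check can be turned into a failing test or a non-zero exit code.
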
Interview Answer

"Prompt evaluation requires a structured test set: EvalCase objects with input, expected output, difficulty, tags, and notes explaining what each case tests. The eval pipeline calls the prompt function on each case and grades the output via exact match (for structured JSON), partial match (field-level), or LLM-as-judge (for free text). Run evals on every prompt change and in CI/CD. The workflow is eval-driven: build the eval set first, establish a baseline, make changes, measure the delta. A prompt that improves on your 3 happy-path examples but regresses on comorbidity or edge-case inputs should not be deployed; the eval catches this before production does."
