Prompt Engineering Mastery · Lesson 21 of 24
Building a Prompt Eval Set
Why Evals Are Non-Negotiable
Prompts that "look right" often fail on edge cases. Without a structured evaluation, the gap between what you believe and what actually happens stays hidden:
You think the prompt works because:
It works on the 3 examples you tested
The output looked reasonable at a glance
A colleague also thought it looked fine
It actually fails on:
Clinical notes with unusual formatting
Notes written by non-native English speakers
Comorbid cases where the primary diagnosis is ambiguous
Notes that contain adversarial content from patients
An eval set catches these before deployment.
A missing eval means you discover failures in production.
What an Eval Set Contains
from dataclasses import dataclass
from typing import Any

@dataclass
class EvalCase:
    id: str
    input: Any              # the actual input (note, message, document)
    expected_output: Any    # ground truth output (can be JSON, string, class label)
    difficulty: str         # 'easy', 'medium', 'hard'
    tags: list[str]         # ['comorbidity', 'long_note', 'edge_case']
    notes: str              # explanation of what this case tests

# Example
case = EvalCase(
    id="med_001",
    input="Pt with AF. On warfarin, NSAIDS. No INR in 6 months.",
    expected_output={
        "primary_diagnosis": "Atrial fibrillation",
        "medications": ["Warfarin", "NSAIDs"],
        "flags": ["NSAID-Warfarin interaction risk", "INR overdue"]
    },
    difficulty="medium",
    tags=["drug_interaction", "monitoring_gap"],
    notes="Tests detection of NSAID+Warfarin interaction and overdue monitoring."
)
Automated Eval Pipeline
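Before wiring cases into a pipeline, it can be worth sanity-checking the eval set itself. The sketch below is one possible check, not part of the lesson's code: `check_eval_set` and the stand-in `Case` class are hypothetical names, and the defaults are assumptions based on the difficulty values used in EvalCase ('easy', 'medium', 'hard') and the 20-case lower bound suggested in the workflow later in this lesson.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Case:
    """Hypothetical stand-in with only the fields this check reads."""
    id: str
    difficulty: str


def check_eval_set(cases, min_cases=20, levels=("easy", "medium", "hard")):
    """Return a list of human-readable problems; an empty list means the set looks sane."""
    problems = []

    # Too few cases gives a noisy pass rate.
    if len(cases) < min_cases:
        problems.append(f"only {len(cases)} cases, expected at least {min_cases}")

    # Duplicate ids make failure reports ambiguous.
    id_counts = Counter(c.id for c in cases)
    dupes = sorted(cid for cid, n in id_counts.items() if n > 1)
    if dupes:
        problems.append(f"duplicate ids: {dupes}")

    # Every difficulty level should be represented at least once.
    seen = Counter(c.difficulty for c in cases)
    for level in levels:
        if seen[level] == 0:
            problems.append(f"no '{level}' cases")

    # Catch typos such as 'meduim'.
    unknown = sorted(set(seen) - set(levels))
    if unknown:
        problems.append(f"unknown difficulty values: {unknown}")

    return problems
```

Any object with `id` and `difficulty` attributes works here, so the same function could be pointed at a list of EvalCase objects.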
import json
from dataclasses import dataclass
from typing import Any

from anthropic import Anthropic  # client class for the LLM judge

@dataclass
class EvalResult:
    case_id: str
    passed: bool
    score: float                 # 0.0-1.0
    actual_output: Any
    failure_reason: str | None

def run_eval(
    prompt_fn,                   # function(input) -> str
    eval_cases: list[EvalCase],
    grade_fn,                    # function(case, actual) -> EvalResult
) -> list[EvalResult]:
    results = []
    for case in eval_cases:
        try:
            actual = prompt_fn(case.input)
            result = grade_fn(case, actual)
        except Exception as e:
            # A crash in the prompt call or the grader counts as a failed case
            result = EvalResult(case.id, False, 0.0, None, str(e))
        results.append(result)
    return results

def summarise_results(results: list[EvalResult]) -> dict:
    passed = sum(1 for r in results if r.passed)
    total = len(results)
    mean_score = sum(r.score for r in results) / total if total else 0.0
    failures = [r for r in results if not r.passed]
    return {
        "pass_rate": passed / total if total else 0.0,  # avoid dividing by zero on an empty run
        "mean_score": mean_score,
        "total": total,
        "failures": [{"id": f.case_id, "reason": f.failure_reason} for f in failures]
    }
Grading Functions
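For structured output, one option is field-level partial credit: score a response by the fraction of expected fields it gets right instead of stopping at the first mismatch. The sketch below is an illustration under stated assumptions, not the lesson's grader: `field_level_score` is a hypothetical helper, it assumes the expected output is a flat dict, and it treats list fields as order-independent.

```python
def field_level_score(expected: dict, actual: dict) -> tuple[float, list[str]]:
    """Return (fraction of expected fields matched, list of mismatch descriptions)."""
    if not expected:
        return 1.0, []

    problems = []
    matched = 0
    for key, want in expected.items():
        if key not in actual:
            problems.append(f"missing field: {key}")
            continue
        got = actual[key]
        if isinstance(want, list):
            # Every expected item must appear in the actual list, in any order.
            got_items = got if isinstance(got, list) else []
            missing = [v for v in want if v not in got_items]
            if missing:
                problems.append(f"missing items in {key}: {missing}")
                continue
        elif got != want:
            # Scalar fields must be equal.
            problems.append(f"wrong value for {key}: {got!r}")
            continue
        matched += 1

    return matched / len(expected), problems
```

A grader could wrap this helper and choose its own pass threshold, for example passing only when the score is 1.0.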
def exact_match_grader(case: EvalCase, actual_str: str) -> EvalResult:
    """Grade structured output field by field against the expected JSON."""
    try:
        actual = json.loads(actual_str)
    except json.JSONDecodeError as e:
        return EvalResult(case.id, False, 0.0, actual_str, f"Invalid JSON: {e}")
    if not isinstance(actual, dict):
        return EvalResult(case.id, False, 0.0, actual, "Output is not a JSON object")
    # Check each required field
    for key, expected_val in case.expected_output.items():
        if key not in actual:
            return EvalResult(case.id, False, 0.0, actual, f"Missing field: {key}")
        if isinstance(expected_val, list):
            # Check all expected items are in actual (order-independent)
            got = actual[key] if isinstance(actual[key], list) else []
            missing = [v for v in expected_val if v not in got]
            if missing:
                return EvalResult(case.id, False, 0.5, actual,
                                  f"Missing items in {key}: {missing}")
        elif actual[key] != expected_val:
            # Scalar fields must match exactly
            return EvalResult(case.id, False, 0.0, actual,
                              f"Wrong value for {key}: {actual[key]!r}")
    return EvalResult(case.id, True, 1.0, actual, None)

def llm_grader(case: EvalCase, actual_str: str, judge_client) -> EvalResult:
    """Grade free-text or complex outputs using an LLM judge."""
    # run_eval calls grade_fn(case, actual), so bind the client first, e.g.
    # functools.partial(llm_grader, judge_client=client)
    prompt = f"""Grade the following response against the expected output.
Task: {case.notes}
Expected output: {json.dumps(case.expected_output, indent=2)}
Actual output: {actual_str}
Score 0.0-1.0:
1.0 = perfect match or equivalent
0.7 = mostly correct, minor omissions
0.5 = partially correct
0.0 = wrong or missing key information
Respond with JSON: {{"score": float, "reason": string}}"""
    response = judge_client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=150,
        messages=[{"role": "user", "content": prompt}]
    )
    # If the judge returns something that is not valid JSON, json.loads raises
    # and run_eval records the case as failed with the error message.
    result = json.loads(response.content[0].text)
    score = float(result["score"])
    return EvalResult(case.id, score >= 0.7, score, actual_str, result["reason"])
Eval-Driven Development Workflow
1. Build eval set BEFORE iterating on the prompt
   20-50 cases: easy + medium + hard + edge cases
   Include cases you know the current prompt fails on
2. Baseline: run the current prompt → record pass rate
3. Make a prompt change
4. Run eval → compare pass rate
   Improvement? Deploy.
   Regression? Revert or investigate.
5. Add new failure cases to the eval set when you discover them in production
6. Run eval on every prompt change in CI/CD
   "Prompt tests" should be as automatic as unit tests
Interview Answer
"Prompt evaluation requires a structured test set — EvalCase objects with input, expected output, difficulty tags, and notes explaining what each case tests. The eval pipeline calls the prompt function on each case and grades the output via exact match (for structured JSON), partial match (field-level), or LLM-as-judge (for free text). Run evals before every prompt change and in CI/CD. The workflow is eval-driven: build the eval set first, establish a baseline, make changes, measure delta. A prompt that improves on your 3 happy-path examples but regresses on comorbidity or edge-case inputs should not be deployed — the eval catches this before production does."