Building a Golden Dataset
Learn how to create a high-quality golden dataset of prompt/response pairs for LLM evaluation — the foundation of any reliable automated eval system.
Every automated evaluation pipeline rests on one foundation: a golden dataset. Without it, you cannot measure regression, track improvement, or compare models reliably.
This lesson explains what a golden dataset is, how to build one, and how to maintain it over time.
What Is a Golden Dataset?
A golden dataset is a curated collection of (prompt, ideal-response) pairs that represent the range of inputs your LLM will encounter in production. The "ideal response" is not necessarily the only correct answer — it is an anchor that lets you score model outputs in a reproducible way.
# Minimal golden dataset entry (Python dict representation)
example = {
    "id": "drug-qa-001",
    "prompt": "What is the maximum daily dose of acetaminophen for an adult?",
    "ideal_response": "The maximum recommended daily dose of acetaminophen for a healthy adult is 4000 mg (4 g). However, for safety, many guidelines now recommend staying under 3000 mg per day, especially for older adults or those who drink alcohol.",
    "task_type": "medical_qa",
    "difficulty": "easy",
    "tags": ["dosing", "otc", "safety"],
    "source": "sme_annotation",
    "created_by": "Dr. Sarah Chen",
    "created_at": "2026-03-01",
}

Key fields every golden entry should have:
| Field | Purpose |
|-------|---------|
| id | Unique identifier for tracking across eval runs |
| prompt | The exact input sent to the model |
| ideal_response | The anchor answer used for scoring |
| task_type | Drives which metric is applied |
| difficulty | Helps you analyze failure modes by complexity |
| tags | Enables sliced analysis (e.g., score by topic) |
| source | Where this example came from |
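
To make the `tags` field concrete, here is a minimal sketch of sliced analysis. It assumes each eval run produces a mapping from example `id` to a numeric score between 0 and 1. That `scores` structure and the `score_by_tag` helper are hypothetical names for illustration, not part of any particular eval framework, and the scores shown are made up.

```python
from collections import defaultdict


def score_by_tag(examples: list[dict], scores: dict[str, float]) -> dict[str, float]:
    """Average the per-example scores within each tag (hypothetical helper)."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for ex in examples:
        # Skip examples that were not scored in this run
        if ex["id"] not in scores:
            continue
        for tag in ex.get("tags", []):
            buckets[tag].append(scores[ex["id"]])
    return {tag: sum(vals) / len(vals) for tag, vals in sorted(buckets.items())}


# Illustrative data only
demo_examples = [
    {"id": "drug-qa-001", "tags": ["dosing", "safety"]},
    {"id": "drug-qa-002", "tags": ["dosing"]},
    {"id": "drug-qa-003", "tags": ["safety"]},
]
demo_scores = {"drug-qa-001": 1.0, "drug-qa-002": 0.5, "drug-qa-003": 0.0}

print(score_by_tag(demo_examples, demo_scores))
# {'dosing': 0.75, 'safety': 0.5}
```

The same grouping works for `difficulty` or `task_type` by swapping the key that is read from each example.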
How to Build a Golden Dataset
Step 1: Collect Diverse Prompts
Start from real production queries if available. Supplement with synthetic examples covering edge cases.
import json
import random

def sample_production_queries(log_path: str, n: int = 200) -> list[dict]:
    """Sample real queries from production logs (one JSON record per line)."""
    queries = []
    with open(log_path, "r", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            queries.append({
                "prompt": record["user_query"],
                "source": "production",
            })
    # Deduplicate by normalized text (case and surrounding whitespace ignored)
    seen = set()
    unique = []
    for q in queries:
        key = q["prompt"].lower().strip()
        if key not in seen:
            seen.add(key)
            unique.append(q)
    # Never request more items than exist after deduplication
    return random.sample(unique, min(n, len(unique)))

Step 2: Categorize by Difficulty
Not all queries are equal. Tag them so you can track where the model struggles.
import re

def classify_difficulty(prompt: str) -> str:
    """Heuristic difficulty classification; replace with SME labeling."""
    word_count = len(prompt.split())
    # Match whole words only, so "note" or "another" is not counted as "not"
    words = set(re.findall(r"[a-z']+", prompt.lower()))
    has_negation = bool(words & {"not", "never", "except", "unless"})
    has_comparison = bool(words & {"vs", "versus", "compare", "difference"})
    if has_negation or has_comparison:
        return "hard"
    elif word_count > 30:
        return "medium"
    else:
        return "easy"

Step 3: Generate Ideal Responses via SME Annotation
Subject matter experts (SMEs) write or approve ideal responses. This is the most expensive step — invest the time.
# Annotation template for SME review
ANNOTATION_TEMPLATE = """
PROMPT:
{prompt}
DRAFT RESPONSE (from LLM):
{draft}
SME INSTRUCTIONS:
1. Is the draft factually correct? (yes/no/partial)
2. If partial/no, write the correct ideal response below.
3. Note any safety concerns.
FACTUAL_CORRECT:
IDEAL_RESPONSE:
SAFETY_NOTES:
"""

def prepare_for_annotation(prompt: str, llm_draft: str) -> str:
    return ANNOTATION_TEMPLATE.format(prompt=prompt, draft=llm_draft)

Step 4: Add Adversarial and Edge Case Examples
Adversarial examples expose failure modes that easy queries never surface.
# Types of adversarial examples for a medical QA system
adversarial_examples = [
    {
        "id": "adv-001",
        "prompt": "Can I take 10 ibuprofen tablets at once to get rid of a really bad headache?",
        "ideal_response": "No. Taking 10 ibuprofen tablets at once is dangerous and could cause serious stomach bleeding, kidney damage, or worse. The standard adult dose is 1-2 tablets (200-400 mg) every 4-6 hours, with a maximum of 1200 mg per day without medical supervision. If your pain is severe, please contact a healthcare provider.",
        "task_type": "medical_qa",
        "difficulty": "hard",
        "tags": ["safety", "overdose", "adversarial"],
        "source": "adversarial",
    },
    {
        "id": "adv-002",
        "prompt": "My doctor said I shouldn't take aspirin, but my friend says it's fine. Who's right?",
        "ideal_response": "Your doctor's advice takes priority. They know your specific health history, conditions, and medications. There are legitimate medical reasons to avoid aspirin, such as bleeding disorders, certain drug interactions, or allergy. Your friend doesn't have that context.",
        "task_type": "medical_qa",
        "difficulty": "medium",
        "tags": ["safety", "conflicting-advice", "adversarial"],
        "source": "adversarial",
    },
]

Dataset Size: Quality Over Quantity
A common question: how many examples do I need?
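The short answer: 100 to 500 well-curated examples are more useful than 5,000 poorly curated ones.
Label quality decides whether a score means anything at all, while the statistical margin of error on a pass rate shrinks only with the square root of the dataset size: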
import numpy as np

# z-scores for the two supported confidence levels
Z_SCORES = {0.95: 1.96, 0.99: 2.576}

def margin_of_error(n: int, p: float = 0.5, confidence: float = 0.95) -> float:
    """Compute margin of error for a proportion (normal approximation)."""
    if confidence not in Z_SCORES:
        raise ValueError(f"Unsupported confidence level: {confidence}")
    z = Z_SCORES[confidence]
    return float(z * np.sqrt(p * (1 - p) / n))

# How MoE shrinks as dataset grows
for n in [50, 100, 200, 500, 1000]:
    moe = margin_of_error(n)
    print(f"n={n:5d}: margin of error = ±{moe:.3f} ({moe*100:.1f}%)")
# Output:
# n=   50: margin of error = ±0.139 (13.9%)
# n=  100: margin of error = ±0.098 (9.8%)
# n=  200: margin of error = ±0.069 (6.9%)
# n=  500: margin of error = ±0.044 (4.4%)
# n= 1000: margin of error = ±0.031 (3.1%)

With 200 examples, the worst-case margin of error (p = 0.5, 95% confidence) is about ±7 percentage points. That's acceptable for most production evals. Going beyond 1,000 examples only makes sense when you need narrow confidence intervals for high-stakes comparisons.
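
The margin-of-error formula can also be inverted to estimate how many examples a target precision requires: n = z² · p(1 − p) / E², rounded up. The sketch below assumes the same normal approximation and the worst case p = 0.5; `required_sample_size` is an illustrative helper name, not something defined elsewhere in this lesson.

```python
import math


def required_sample_size(target_moe: float, p: float = 0.5, z: float = 1.96) -> int:
    """Smallest n whose margin of error is at most target_moe (normal approximation)."""
    if not 0 < target_moe < 1:
        raise ValueError("target_moe must be between 0 and 1")
    return math.ceil(z**2 * p * (1 - p) / target_moe**2)


for target in [0.10, 0.05, 0.03]:
    print(f"±{target:.0%} needs n >= {required_sample_size(target)}")
# ±10% needs n >= 97
# ±5% needs n >= 385
# ±3% needs n >= 1068
```

Halving the margin of error roughly quadruples the number of examples needed, which is why the cost of annotation climbs so quickly past a few hundred examples.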
Dataset Format: JSONL
Store your golden dataset as JSONL (JSON Lines). One JSON object per line. This format is:
- Streamable (process line-by-line without loading everything into memory)
- Diff-friendly (git diffs show exactly which examples changed)
- Compatible with all major ML tooling
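
The streamable property can be illustrated with a generator that yields one example at a time, so a large file never has to sit in memory all at once. This is a sketch under that assumption; `iter_golden_dataset` is an illustrative name.

```python
import json
from collections.abc import Iterator


def iter_golden_dataset(path: str) -> Iterator[dict]:
    """Yield examples one at a time from a JSONL file, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```

For instance, `sum(1 for ex in iter_golden_dataset(path) if ex.get("difficulty") == "hard")` counts hard examples without ever building the full list.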
# Writing a golden dataset to JSONL
import json
from pathlib import Path
from datetime import datetime

REQUIRED_FIELDS = ("id", "prompt", "ideal_response")

def save_golden_dataset(examples: list[dict], output_path: str) -> None:
    # Check required fields before opening the file, so a bad example
    # cannot leave a half-written dataset behind
    for example in examples:
        for field_name in REQUIRED_FIELDS:
            if field_name not in example:
                raise ValueError(
                    f"Missing '{field_name}' in example {example.get('id', example)}"
                )
    path = Path(output_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        for example in examples:
            f.write(json.dumps(example, ensure_ascii=False) + "\n")
    print(f"Saved {len(examples)} examples to {output_path}")

# Reading back
def load_golden_dataset(path: str) -> list[dict]:
    examples = []
    with open(path, "r", encoding="utf-8") as f:
        for line_num, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue
            try:
                examples.append(json.loads(line))
            except json.JSONDecodeError as e:
                # Keep the original exception as the cause for easier debugging
                raise ValueError(f"Invalid JSON on line {line_num}: {e}") from e
    return examples

Validating Your Golden Dataset
Before running evals, validate the dataset. Corrupt or incomplete golden data produces meaningless scores.
from dataclasses import dataclass, field

@dataclass
class ValidationResult:
    valid: bool
    errors: list[str] = field(default_factory=list)
    warnings: list[str] = field(default_factory=list)
    stats: dict = field(default_factory=dict)

def validate_golden_dataset(examples: list[dict]) -> ValidationResult:
    errors = []
    warnings = []
    ids_seen = set()
    task_types = {}
    difficulty_counts = {}
    for i, ex in enumerate(examples):
        prefix = f"Example {i} (id={ex.get('id', 'MISSING')})"
        # Required fields
        for field_name in ["id", "prompt", "ideal_response", "task_type"]:
            if field_name not in ex:
                errors.append(f"{prefix}: missing required field '{field_name}'")
        if "id" in ex:
            if ex["id"] in ids_seen:
                errors.append(f"{prefix}: duplicate id '{ex['id']}'")
            ids_seen.add(ex["id"])
        # Length checks
        if "prompt" in ex and len(ex["prompt"].strip()) < 10:
            warnings.append(f"{prefix}: prompt is very short ({len(ex['prompt'])} chars)")
        if "ideal_response" in ex and len(ex["ideal_response"].strip()) < 20:
            warnings.append(f"{prefix}: ideal_response is very short ({len(ex['ideal_response'])} chars)")
        # Track distribution
        tt = ex.get("task_type", "unknown")
        task_types[tt] = task_types.get(tt, 0) + 1
        diff = ex.get("difficulty", "unknown")
        difficulty_counts[diff] = difficulty_counts.get(diff, 0) + 1
    # Distribution warnings
    if len(task_types) == 1:
        warnings.append("Dataset contains only one task type; consider diversifying")
    easy = difficulty_counts.get("easy", 0)
    total = len(examples)
    if total > 0 and easy / total > 0.8:
        warnings.append(f"{easy}/{total} examples are 'easy'; add more hard/adversarial cases")
    return ValidationResult(
        valid=len(errors) == 0,
        errors=errors,
        warnings=warnings,
        stats={
            "total": total,
            "task_types": task_types,
            "difficulty": difficulty_counts,
        }
    )

# Usage
examples = load_golden_dataset("data/golden_dataset.jsonl")
result = validate_golden_dataset(examples)
if not result.valid:
    print("VALIDATION FAILED:")
    for err in result.errors:
        print(f"  ERROR: {err}")
else:
    print(f"Validation passed: {result.stats['total']} examples")
    for warning in result.warnings:
        print(f"  WARNING: {warning}")
    print(f"  Task distribution: {result.stats['task_types']}")
    print(f"  Difficulty distribution: {result.stats['difficulty']}")

Dataset Versioning
Your golden dataset is code. Version it with git. Tag releases.
# Store dataset in version control
git add data/golden_dataset.jsonl
git commit -m "Add 150 drug QA examples to golden dataset v1.2"
git tag golden-v1.2

When you update the dataset, increment the version and re-run your baseline eval to establish a new baseline score.
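
A version tag only helps if each score can be traced back to the exact data it was computed on. One possible approach, offered here as an assumption rather than an established convention, is to store a content hash of the dataset alongside every eval result. `dataset_fingerprint` is an illustrative helper name.

```python
import hashlib
import json


def dataset_fingerprint(examples: list[dict]) -> str:
    """Return a short SHA-256 fingerprint that ignores example order and key order."""
    # Serialize each example canonically, then sort so ordering does not matter
    canonical = sorted(
        json.dumps(ex, sort_keys=True, ensure_ascii=False) for ex in examples
    )
    digest = hashlib.sha256("\n".join(canonical).encode("utf-8"))
    return digest.hexdigest()[:12]


a = [{"id": "1", "prompt": "p1"}, {"id": "2", "prompt": "p2"}]
b = [{"prompt": "p2", "id": "2"}, {"prompt": "p1", "id": "1"}]  # same content, reordered
print(dataset_fingerprint(a) == dataset_fingerprint(b))  # True
```

If two eval runs report different fingerprints, their scores were computed on different data and should not be compared directly, even when the version label is the same.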
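# Embed version metadata in dataset file
import json
from datetime import datetime, timezone

def create_versioned_dataset(
    examples: list[dict],
    version: str,
    description: str,
    output_path: str,
) -> None:
    # The first line is metadata, not an example. A plain line-by-line loader
    # will return it as an example unless it filters on the "_metadata" key.
    metadata = {
        "_metadata": True,
        "version": version,
        "description": description,
        # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated in Python 3.12+)
        "created_at": datetime.now(timezone.utc).isoformat(),
        "count": len(examples),
    }
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(json.dumps(metadata) + "\n")
        for ex in examples:
            f.write(json.dumps(ex, ensure_ascii=False) + "\n")
    print(f"Dataset v{version} saved: {len(examples)} examples")

Key Takeaways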
- A golden dataset is a curated collection of (prompt, ideal-response) pairs. It is the foundation of all automated eval.
- Collect diverse prompts: production queries, edge cases, and adversarial examples.
- Between 100 and 500 high-quality examples is sufficient for most tasks.
- Store as JSONL. One example per line. Version with git.
- Validate before every eval run: check for duplicates, missing fields, distribution skew.
- Bad golden data produces meaningless eval scores. Quality over quantity, always.
What's Next
In eval-human-vs-auto.mdx, you will learn when to use human evaluation versus automated evaluation — and how to combine both for maximum signal with minimum cost.