A/B Testing Prompts
How to run A/B tests on prompt versions in production — traffic splitting, measuring quality metrics, statistical significance, and gradual rollout strategies.
Why A/B Test Prompts
Offline evals tell you which prompt is better on your eval set. A/B testing tells you which prompt is better on actual production traffic:
Offline eval:
    Your 50-case eval set → Prompt A: 85%, Prompt B: 89%
    Prompt B looks better
But in production:
    Real users send queries you didn't anticipate
    Real users interact differently than your eval cases suggest
    Model behaviour on real-time traffic may differ from your test set
A/B testing tells you whether Prompt B is actually better on the real distribution.
Traffic Splitting Architecture
import hashlib


class PromptRouter:
    def __init__(self, prompts: dict[str, str], weights: dict[str, float]):
        """
        prompts: {"prompt_a": "...", "prompt_b": "..."}
        weights: {"prompt_a": 0.5, "prompt_b": 0.5}
        """
        if set(prompts) != set(weights):
            raise ValueError("prompts and weights must name the same variants")
        if abs(sum(weights.values()) - 1.0) > 1e-6:
            raise ValueError("Weights must sum to 1")
        self.prompts = prompts
        self.weights = weights

    def get_variant(self, user_id: str) -> str:
        """Deterministic assignment: same user always gets same variant."""
        # Hash user_id to a bucket in [0, 1). MD5 is used only for its even
        # spread of values, not for security.
        h = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
        bucket = (h % 10000) / 10000.0
        cumulative = 0.0
        for variant, weight in self.weights.items():
            cumulative += weight
            if bucket < cumulative:
                return variant
        # Fallback for floating-point rounding: last variant in weight order
        return list(self.weights)[-1]

    def get_prompt(self, user_id: str) -> tuple[str, str]:
        """Returns (variant_name, prompt_text)."""
        variant = self.get_variant(user_id)
        return variant, self.prompts[variant]


# Usage (PROMPT_V1, PROMPT_V2 and user_id come from your application)
router = PromptRouter(
    prompts={"control": PROMPT_V1, "treatment": PROMPT_V2},
    weights={"control": 0.8, "treatment": 0.2},  # 80/20 split
)
variant, prompt = router.get_prompt(user_id)
Metrics to Measure
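Before reading any per-variant metric, it is worth confirming that the split itself behaves as configured: if the observed share of users per variant drifts from the weights (a sample ratio mismatch), the comparison is suspect. The following is a minimal sketch, assuming the same MD5 bucketing as the router above and a two-variant control/treatment split. The helper names `bucket_for` and `observed_split` and the synthetic user IDs are hypothetical, introduced only for illustration.

```python
import hashlib
from collections import Counter


def bucket_for(user_id: str) -> float:
    """Map a user ID to [0, 1) with the same MD5 bucketing as the router."""
    h = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return (h % 10000) / 10000.0


def observed_split(user_ids: list[str], control_weight: float) -> dict[str, float]:
    """Share of users per variant for a two-variant control/treatment split."""
    counts = Counter(
        "control" if bucket_for(uid) < control_weight else "treatment"
        for uid in user_ids
    )
    total = len(user_ids)
    return {name: counts[name] / total for name in ("control", "treatment")}


# Hypothetical synthetic IDs; in production you would use logged user IDs.
synthetic_ids = [f"user-{i}" for i in range(20_000)]
split = observed_split(synthetic_ids, control_weight=0.8)
print(split)  # each share should sit close to its configured weight
```

If the observed shares are far from the configured weights, look for a bug in assignment or logging before interpreting any quality metric.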
from dataclasses import dataclass, field


@dataclass
class VariantMetrics:
    variant: str
    requests: int = 0
    success_count: int = 0
    latency_ms: list[float] = field(default_factory=list)
    # e.g. a 1-5 rating, or 0/1 for thumbs down/up
    user_ratings: list[int] = field(default_factory=list)
    schema_failures: int = 0
    injection_detections: int = 0
    token_count: list[int] = field(default_factory=list)

    @property
    def success_rate(self) -> float:
        return self.success_count / max(self.requests, 1)

    @property
    def mean_latency_ms(self) -> float:
        return sum(self.latency_ms) / max(len(self.latency_ms), 1)

    @property
    def mean_user_rating(self) -> float:
        return sum(self.user_ratings) / max(len(self.user_ratings), 1)

    @property
    def mean_tokens(self) -> float:
        return sum(self.token_count) / max(len(self.token_count), 1)


metrics_store: dict[str, VariantMetrics] = {}


def get_metrics(variant: str) -> VariantMetrics:
    """Return the record for a variant, creating it (correctly named) on first use."""
    if variant not in metrics_store:
        metrics_store[variant] = VariantMetrics(variant=variant)
    return metrics_store[variant]
Statistical Significance
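How many samples count as enough depends on the baseline success rate and on the smallest lift worth detecting. The following is a minimal sketch of the standard normal-approximation sample-size formula for comparing two proportions. The function name `samples_per_variant` and its parameters are illustrative, not taken from any library, and the formula uses the unpooled variance, so the results are approximate.

```python
import math
from statistics import NormalDist


def samples_per_variant(
    baseline_rate: float,
    absolute_lift: float,
    alpha: float = 0.05,
    power: float = 0.80,
) -> int:
    """Approximate per-variant sample size for a two-sided two-proportion test."""
    p1 = baseline_rate
    p2 = baseline_rate + absolute_lift
    if not (0 < p1 < 1 and 0 < p2 < 1) or absolute_lift == 0:
        raise ValueError("rates must lie in (0, 1) and the lift must be non-zero")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / absolute_lift ** 2)


# Same 5-point absolute lift, different baselines
for baseline in (0.50, 0.85, 0.90):
    print(baseline, samples_per_variant(baseline, absolute_lift=0.05))
```

Under these assumptions, a 5-point lift needs a few hundred samples per variant when the baseline is already around 90%, and well over a thousand when the baseline is near 50%.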
Don't make decisions based on small samples:
import numpy as np
from scipy import stats


def is_significant(
    control_successes: int, control_n: int,
    treatment_successes: int, treatment_n: int,
    alpha: float = 0.05,
) -> dict:
    """Two-proportion z-test for A/B comparison of success rates."""
    if control_n <= 0 or treatment_n <= 0:
        raise ValueError("Both variants need at least one sample")
    p_control = control_successes / control_n
    p_treatment = treatment_successes / treatment_n
    # Pooled proportion under the null hypothesis of equal rates
    p_pool = (control_successes + treatment_successes) / (control_n + treatment_n)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / control_n + 1 / treatment_n))
    z = (p_treatment - p_control) / se if se > 0 else 0.0
    # Two-tailed p-value; sf(x) is 1 - cdf(x), computed more accurately in the tail
    p_value = float(2 * stats.norm.sf(abs(z)))
    return {
        "control_rate": p_control,
        "treatment_rate": p_treatment,
        "lift": p_treatment - p_control,
        "relative_lift": (p_treatment - p_control) / max(p_control, 1e-9),
        "p_value": p_value,
        "is_significant": p_value < alpha,
        "z_score": z,
    }


# The required sample size depends on the baseline rate. For 80% power at 5%
# significance and a 5-point absolute improvement, you need roughly 430+ samples
# per variant at a 90% baseline (90% → 95%), and about 1,560 near a 50% baseline.
Gradual Rollout
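Gradual rollout interacts with hash-based assignment: when the treatment weight grows from one stage to the next, users already on the new prompt should stay on it rather than flip back to control. The following is a minimal sketch that checks this property, assuming the same MD5 bucketing as the router above with control listed first, so that control occupies the low buckets. The `assign` helper, the stage weights, and the synthetic user IDs are hypothetical.

```python
import hashlib


def assign(user_id: str, treatment_weight: float) -> str:
    """Two-variant assignment with control occupying the low buckets."""
    h = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    bucket = (h % 10000) / 10000.0
    return "control" if bucket < 1.0 - treatment_weight else "treatment"


# Hypothetical stage weights and synthetic user IDs, for illustration only.
stage_weights = [0.05, 0.20, 0.50, 1.00]
user_ids = [f"user-{i}" for i in range(5_000)]

previous: set[str] = set()
for weight in stage_weights:
    treated = {uid for uid in user_ids if assign(uid, weight) == "treatment"}
    # Users already on the new prompt must stay on it at the next stage.
    assert previous <= treated
    print(f"{weight:.0%} stage: {len(treated)} of {len(user_ids)} users on treatment")
    previous = treated
```

The property holds because raising the treatment weight only lowers the bucket threshold. If treatment were listed first instead, the same users would still stay in treatment, but any scheme that re-randomises per stage would break this.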
Roll out new prompts progressively to limit blast radius:
Stage 1: Shadow mode (0% live traffic)
    Run new prompt alongside current in background
    Compare outputs; no real users affected
Stage 2: Canary (1-5%)
    Small fraction of real traffic
    Monitor for errors, regressions, unexpected output patterns
    Stop immediately if anything looks wrong
Stage 3: Limited rollout (10-20%)
    Enough traffic for meaningful stats in reasonable time
    Monitor metrics daily
Stage 4: Controlled experiment (50/50)
    Full A/B test for statistical significance
    Typically 1-2 weeks of real traffic
Stage 5: Full rollout (100%)
    After statistical significance confirmed
    Keep old prompt available for rollback
Common Pitfalls
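The first pitfall in this section, multiple comparisons, is easy to quantify. The following is a minimal sketch, assuming the tests are independent. The helper names `family_wise_error_rate` and `bonferroni_alpha` are illustrative and not taken from any library.

```python
def family_wise_error_rate(alpha: float, num_tests: int) -> float:
    """Chance of at least one false positive across independent tests."""
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_alpha(alpha: float, num_tests: int) -> float:
    """Per-test threshold that keeps the family-wise rate at or below alpha."""
    return alpha / num_tests


alpha, num_tests = 0.05, 10
print(f"uncorrected: {family_wise_error_rate(alpha, num_tests):.1%}")  # 40.1%
per_test = bonferroni_alpha(alpha, num_tests)
print(f"corrected:   {family_wise_error_rate(per_test, num_tests):.1%}")  # 4.9%
```

Bonferroni is conservative: it controls false positives at the cost of needing more samples per comparison, which is one reason to pre-register a single primary metric instead.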
1. Multiple hypothesis problem:
    Testing 10 prompt variants simultaneously, each at a 5% false-positive rate, gives roughly a 40% chance of at least one false positive (1 - 0.95^10, assuming independent tests)
    → Use Bonferroni correction (divide alpha by the number of comparisons) or pre-register one primary metric
2. Novelty effect:
    Users engage more with new experiences initially
    → Run for at least 1-2 weeks to see steady-state behaviour
3. Selection bias in user ratings:
    Only engaged users rate, so ratings may not represent all users
    → Use objective metrics (schema success rate, error rate) alongside ratings
4. Confounded samples:
    Same user gets variant A one day and B the next
    → Assign users consistently via hashed user ID
5. Optimising the wrong metric:
    Success rate might be high but user satisfaction low
    → Track multiple metrics; define primary metric before testing
Interview Answer
"A/B testing prompts in production uses deterministic traffic splitting: hash the user ID so the same user always gets the same variant, which prevents cross-contamination. Measure objective metrics (schema success rate, meaning whether the output can be parsed, plus latency, token cost, injection detection rate, and error rate) alongside user satisfaction signals like thumbs-up/down. Run a two-proportion z-test to check statistical significance. The sample size depends on the baseline rate: detecting a 5-point absolute improvement with 80% power at 5% significance takes roughly 430+ samples per variant at a 90% baseline, and well over 1,000 near a 50% baseline. Deploy progressively: shadow mode → canary (1-5%) → limited (10-20%) → 50/50 A/B → 100%. Always define the primary metric before starting, to avoid multiple-hypothesis false positives."