A/B Testing Prompts

Why A/B Test Prompts

Offline evals tell you which prompt is better on your eval set. A/B testing tells you which prompt is better on actual production traffic:

Offline eval:
  Your 50-case eval set → Prompt A: 85%, Prompt B: 89%
  Prompt B looks better

But in production:
  Real users send queries you didn't anticipate
  Real users interact differently than your eval cases suggest
  Model behaviour on real-time traffic may differ from your test set

A/B testing checks whether Prompt B's offline advantage actually holds on the real traffic distribution.

Traffic Splitting Architecture

Python

import hashlib

class PromptRouter:
    def __init__(self, prompts: dict[str, str], weights: dict[str, float]):
        """
        prompts: {"prompt_a": "...", "prompt_b": "..."}
        weights: {"prompt_a": 0.5, "prompt_b": 0.5}
        """
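        # Raise explicitly: assert statements are stripped when Python runs with -O
        if prompts.keys() != weights.keys():
            raise ValueError("prompts and weights must name the same variants")
        if abs(sum(weights.values()) - 1.0) > 1e-6:
            raise ValueError("weights must sum to 1")
        self.prompts = prompts
        self.weights = weights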

    def get_variant(self, user_id: str) -> str:
        """Deterministic assignment: same user always gets same variant."""
        # Hash user_id to [0, 1)
        h = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
        bucket = (h % 10000) / 10000.0

        cumulative = 0.0
        for variant, weight in self.weights.items():
            cumulative += weight
            if bucket < cumulative:
                return variant

        # Fallback for floating-point rounding: use the last variant in weight order
        return list(self.weights)[-1]

    def get_prompt(self, user_id: str) -> tuple[str, str]:
        """Returns (variant_name, prompt_text)."""
        variant = self.get_variant(user_id)
        return variant, self.prompts[variant]

# Usage
router = PromptRouter(
    prompts={"control": PROMPT_V1, "treatment": PROMPT_V2},
    weights={"control": 0.8, "treatment": 0.2}  # 80/20 split
)
variant, prompt = router.get_prompt(user_id)
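
Before sending live traffic through the router, it can help to check offline that hashing gives a stable assignment and a split close to the configured weights. The sketch below is a standalone version of the same bucketing logic. The `assign_variant` helper, the `salt` parameter, and the synthetic `user-<i>` IDs are assumptions for illustration, not part of the router above.

```python
import hashlib
from collections import Counter

def assign_variant(user_id: str, weights: dict[str, float], salt: str = "") -> str:
    """Map a user to a variant via a hash bucket in [0, 1)."""
    # Salting with an experiment name (an assumption here) keeps assignments
    # in one experiment from lining up with assignments in another.
    digest = hashlib.md5(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = (int(digest, 16) % 10000) / 10000.0

    cumulative = 0.0
    for variant, weight in weights.items():
        cumulative += weight
        if bucket < cumulative:
            return variant
    return list(weights)[-1]  # guard against floating-point rounding

weights = {"control": 0.8, "treatment": 0.2}

# Simulate 20,000 synthetic users and measure the observed split.
counts = Counter(
    assign_variant(f"user-{i}", weights, salt="prompt-exp-1") for i in range(20000)
)
treatment_share = counts["treatment"] / 20000
print(f"treatment share: {treatment_share:.3f}")  # should land close to 0.20
```

If the observed share drifts far from the configured weight, the usual suspects are a skewed ID format or a bug in the bucketing, and it is cheaper to find that here than in the experiment results.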

Metrics to Measure

Python

from dataclasses import dataclass, field
from collections import defaultdict

@dataclass
class VariantMetrics:
    variant: str
    requests: int = 0
    success_count: int = 0
    latency_ms: list[float] = field(default_factory=list)
    user_ratings: list[int] = field(default_factory=list)  # feedback scores, e.g. a 1-5 scale (or 0/1 for thumbs down/up)
    schema_failures: int = 0
    injection_detections: int = 0
    token_count: list[int] = field(default_factory=list)

    @property
    def success_rate(self) -> float:
        return self.success_count / max(self.requests, 1)

    @property
    def mean_latency_ms(self) -> float:
        return sum(self.latency_ms) / max(len(self.latency_ms), 1)

    @property
    def mean_user_rating(self) -> float:
        return sum(self.user_ratings) / max(len(self.user_ratings), 1)

    @property
    def mean_tokens(self) -> float:
        return sum(self.token_count) / max(len(self.token_count), 1)

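metrics_store: dict[str, VariantMetrics] = {}

def get_metrics(variant: str) -> VariantMetrics:
    """Return the record for a variant, creating it with the right name on first use."""
    return metrics_store.setdefault(variant, VariantMetrics(variant=variant))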
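
One way to feed counters like these is a small recording helper called once per request. The sketch below is self-contained and uses a trimmed-down record; `RequestLog`, `record`, and `p95_latency` are hypothetical names. It also adds a nearest-rank 95th-percentile latency, since a mean can hide a slow tail.

```python
import math
from dataclasses import dataclass, field

@dataclass
class RequestLog:
    requests: int = 0
    success_count: int = 0
    latency_ms: list[float] = field(default_factory=list)

store: dict[str, RequestLog] = {}

def record(variant: str, latency_ms: float, success: bool) -> None:
    """Update the per-variant counters for one completed request."""
    log = store.setdefault(variant, RequestLog())
    log.requests += 1
    log.success_count += int(success)
    log.latency_ms.append(latency_ms)

def p95_latency(values: list[float]) -> float:
    """Nearest-rank 95th percentile; returns 0.0 when there is no data."""
    if not values:
        return 0.0
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))  # 1-based rank
    return ordered[rank - 1]

record("control", latency_ms=120.0, success=True)
record("control", latency_ms=340.0, success=False)
record("treatment", latency_ms=95.0, success=True)
```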

Statistical Significance

Don't make decisions based on small samples:

Python

from scipy import stats
import numpy as np

def is_significant(
    control_successes: int, control_n: int,
    treatment_successes: int, treatment_n: int,
    alpha: float = 0.05
) -> dict:
    """Two-proportion z-test for A/B comparison of success rates."""
    p_control = control_successes / control_n
    p_treatment = treatment_successes / treatment_n

    # Pooled proportion
    p_pool = (control_successes + treatment_successes) / (control_n + treatment_n)
    se = np.sqrt(p_pool * (1 - p_pool) * (1/control_n + 1/treatment_n))
    z = (p_treatment - p_control) / se if se > 0 else 0.0
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))  # two-tailed

    return {
        "control_rate": p_control,
        "treatment_rate": p_treatment,
        "lift": p_treatment - p_control,
        "relative_lift": (p_treatment - p_control) / max(p_control, 1e-9),
        "p_value": p_value,
        "is_significant": p_value < alpha,
        "z_score": z,
    }

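# Rule of thumb: detecting a 5-point absolute improvement with 80% power at
# alpha = 0.05 takes roughly 430 samples per variant at a 90% baseline,
# about 680 at an 85% baseline, and about 1,560 at a 50% baseline.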
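
The required sample size depends on the baseline rate as well as the lift, so it is worth computing rather than memorising. The sketch below uses the standard normal-approximation formula for comparing two proportions, n = (z_{α/2} + z_β)² · [p₁(1 − p₁) + p₂(1 − p₂)] / (p₂ − p₁)². The function name `samples_per_variant` is an assumption, and the result is an estimate, not an exact power calculation.

```python
import math
from statistics import NormalDist

def samples_per_variant(
    p_control: float, p_treatment: float,
    alpha: float = 0.05, power: float = 0.80,
) -> int:
    """Approximate per-variant sample size for a two-sided two-proportion test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    delta = p_treatment - p_control
    if delta == 0:
        raise ValueError("the two rates must differ")
    return math.ceil((z_alpha + z_beta) ** 2 * variance / delta ** 2)

# The same 5-point lift needs very different sample sizes at different baselines.
for baseline in (0.90, 0.85, 0.50):
    n = samples_per_variant(baseline, baseline + 0.05)
    print(f"{baseline:.0%} -> {baseline + 0.05:.0%}: {n} per variant")
```

Baselines near 50% have the highest variance, which is why they need the most traffic for the same absolute lift.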

Gradual Rollout

Roll out new prompts progressively to limit blast radius:

Stage 1: Shadow mode (0% live traffic)
  Run new prompt alongside current in background
  Compare outputs — no real users affected

Stage 2: Canary (1-5%)
  Small fraction of real traffic
  Monitor for errors, regressions, unexpected output patterns
  Stop immediately if anything looks wrong

Stage 3: Limited rollout (10-20%)
  Enough traffic for meaningful stats in reasonable time
  Monitor metrics daily

Stage 4: Controlled experiment (50/50)
  Full A/B test for statistical significance
  Typically 1-2 weeks of real traffic

Stage 5: Full rollout (100%)
  After statistical significance confirmed
  Keep old prompt available for rollback
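
The stages above can be sketched as a small table of treatment weights plus a gate that decides whether to advance. Everything here is an illustrative assumption: the `Stage` type, the specific weights chosen within each range, the `max_regression` threshold, and the choice to fall back to shadow mode on a regression.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    treatment_weight: float  # share of live traffic sent to the new prompt

STAGES = [
    Stage("shadow", 0.0),
    Stage("canary", 0.05),
    Stage("limited", 0.20),
    Stage("experiment", 0.50),
    Stage("full", 1.0),
]

def next_stage_index(
    current: int,
    treatment_error_rate: float,
    control_error_rate: float,
    significant: bool = False,
    max_regression: float = 0.01,
) -> int:
    """Return the index of the stage to run next."""
    # Guardrail: any clear regression sends the new prompt back to shadow mode.
    if treatment_error_rate - control_error_rate > max_regression:
        return 0
    # The 50/50 experiment only graduates once the result is significant.
    if STAGES[current].name == "experiment" and not significant:
        return current
    return min(current + 1, len(STAGES) - 1)

idx = next_stage_index(1, treatment_error_rate=0.02, control_error_rate=0.02)
print(STAGES[idx].name)  # canary was healthy, so this moves to "limited"
```

Keeping the stage weights in data rather than code makes rollback a configuration change: set the index back to an earlier stage and the old prompt takes the traffic again.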

Common Pitfalls

1. Multiple hypothesis problem:
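   Testing 10 prompt variants simultaneously → each comparison has a 5% false-positive chance,
   so the chance of at least one false positive is about 40% (1 - 0.95^10)
   → Use Bonferroni correction (divide alpha by the number of comparisons) or pre-register one primary metric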

2. Novelty effect:
   Users engage more with new experiences initially
   → Run for at least 1-2 weeks to see steady-state behaviour

3. Selection bias in user ratings:
   Only engaged users rate — may not represent all users
   → Use objective metrics (schema success rate, error rate) alongside ratings

4. Confounded samples:
   Same user gets variant A one day and B the next
   → Assign users consistently via hashed user ID

5. Optimising the wrong metric:
   Success rate might be high but user satisfaction low
   → Track multiple metrics; define primary metric before testing
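
For the multiple hypothesis problem in pitfall 1, a minimal sketch of the Bonferroni correction looks like this. The function names and the example p-values are made up for illustration.

```python
def family_wise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** num_tests

def bonferroni(p_values: dict[str, float], alpha: float = 0.05) -> dict[str, bool]:
    """Flag a variant as significant only if p < alpha / number of comparisons."""
    threshold = alpha / len(p_values)
    return {name: p < threshold for name, p in p_values.items()}

# Hypothetical p-values from comparing each of 10 variants against control.
p_values = {f"variant_{i}": 0.20 for i in range(10)}
p_values["variant_3"] = 0.030   # under 0.05, but not under 0.05 / 10
p_values["variant_7"] = 0.001   # survives the correction

print(f"uncorrected family-wise error: {family_wise_error_rate(10):.2f}")
winners = [name for name, ok in bonferroni(p_values).items() if ok]
print(winners)
```

Bonferroni is conservative: it trades some power for a guarantee that the overall false-positive rate stays at or below alpha, which is why pre-registering a single primary metric is often the simpler option.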

Interview Answer

"A/B testing prompts in production uses deterministic traffic splitting: hash the user ID so the same user always lands in the same variant, which prevents cross-contamination. Measure objective metrics (schema success rate, i.e. whether the output can be parsed, plus latency, token cost, injection detection rate, and error rate) alongside user satisfaction signals such as thumbs-up/down. Run a two-proportion z-test to determine statistical significance; for 80% power at 5% significance with a 5-point absolute improvement you need roughly 430 samples per variant at a 90% baseline success rate, and more as the baseline approaches 50%. Deploy progressively: shadow mode → canary (1-5%) → limited (10-20%) → 50/50 A/B test → 100%. Always define the primary metric before starting, to avoid multiple-hypothesis false positives."
