Pointwise vs Pairwise Evaluation
Understand the difference between scoring individual responses (pointwise) and comparing two responses directly (pairwise). Learn when each approach is more reliable.
Two Approaches to LLM Evaluation
When evaluating model outputs with a judge (human or LLM), you have two fundamental approaches:
Pointwise: Score each response independently on a numeric scale (e.g., 1–5).
Pairwise: Show two responses side by side and ask which is better.
Each has distinct strengths and failure modes.
Pointwise Evaluation
Score one response at a time on defined criteria:

    import json

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment


    def pointwise_score(question: str, response: str) -> dict:
        """Score a single response from 1-5 on multiple criteria."""
        prompt = f"""Rate this clinical pharmacology response on three criteria (1-5 each), plus an overall score:

    Question: {question}

    Response: {response}

    Criteria:
    - accuracy (1=wrong facts, 5=fully accurate)
    - completeness (1=missing key info, 5=covers all important aspects)
    - clarity (1=confusing, 5=very clear and actionable)

    JSON output:
    {{"accuracy": <1-5>, "completeness": <1-5>, "clarity": <1-5>, "overall": <1-5>}}"""
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"},
            temperature=0.1,
        )
        return json.loads(resp.choices[0].message.content)


    # Evaluate model performance across a test set.
    # test_cases is assumed to be a list of dicts with "question" and "response" keys.
    scores = [pointwise_score(case["question"], case["response"]) for case in test_cases]
    avg_overall = sum(s["overall"] for s in scores) / len(scores)
    print(f"Average overall: {avg_overall:.2f}/5")

Pointwise strengths:
- Produces absolute scores you can track over time
- Scales well: each example is evaluated independently
- Easy to aggregate and compute averages
Pointwise weaknesses:
- Calibration differences: the same score can mean different things to different judges, or to the same judge over time
- Hard to distinguish between similar-quality responses (both score 4/5)
- Judges anchor on arbitrary numbers and show inconsistency near boundaries
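
One way to check whether two pointwise averages are actually distinguishable is to put a bootstrap confidence interval around each mean. The sketch below is illustrative only: `bootstrap_mean_ci` is a hypothetical helper (not part of any library), and the score lists are made-up numbers rather than real judge output.

```python
import random
import statistics


def bootstrap_mean_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of pointwise scores."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    means = []
    for _ in range(n_resamples):
        # Resample the scores with replacement and record the resampled mean
        sample = [rng.choice(scores) for _ in scores]
        means.append(statistics.fmean(sample))
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(scores), lo, hi


# Made-up "overall" scores for two model versions (illustrative only)
scores_v1 = [4, 4, 3, 5, 4, 4, 3, 4, 5, 4]
scores_v2 = [4, 5, 4, 4, 4, 3, 4, 5, 4, 4]

for name, vals in [("v1", scores_v1), ("v2", scores_v2)]:
    mean, lo, hi = bootstrap_mean_ci(vals)
    print(f"{name}: mean={mean:.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
```

If the two intervals overlap heavily, the difference in averages is probably within judge noise, which is the situation where a direct comparison of the two responses tends to be more informative.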
Pairwise Evaluation
Compare two responses directly:

    def pairwise_compare(question: str, response_a: str, response_b: str) -> dict:
        """Compare two responses and report which is better."""
        prompt = f"""Compare two clinical pharmacology responses.

    Question: {question}

    Response A:
    {response_a}

    Response B:
    {response_b}

    Which response better serves a clinical professional? Consider accuracy, completeness, and actionability.

    JSON output:
    {{"winner": "A" or "B" or "tie", "margin": "clear" or "slight", "reason": "one sentence"}}"""
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"},
            temperature=0.1,
        )
        return json.loads(resp.choices[0].message.content)


    # A/B test: base model vs fine-tuned model
    def run_ab_evaluation(
        test_cases: list[dict],
        model_a_responses: list[str],
        model_b_responses: list[str],
    ) -> dict:
        results = {"A": 0, "B": 0, "tie": 0}
        for case, resp_a, resp_b in zip(test_cases, model_a_responses, model_b_responses):
            comparison = pairwise_compare(case["question"], resp_a, resp_b)
            winner = comparison.get("winner")
            # Count an unexpected label from the judge as a tie instead of raising KeyError
            results[winner if winner in results else "tie"] += 1
        total = sum(results.values())
        if total == 0:
            raise ValueError("No comparisons to evaluate")
        return {
            "A_win_rate": results["A"] / total,
            "B_win_rate": results["B"] / total,
            "tie_rate": results["tie"] / total,
            "total": total,
        }

Pairwise strengths:
- More reliable for distinguishing similar-quality models
- Aligns with how humans naturally compare things (relative vs absolute)
- Less sensitive to judge calibration: the judge only needs to decide which response is better
Pairwise weaknesses:
- Doesn't produce an absolute quality measure
- Requires O(n²) comparisons for n models (mitigated by choosing which pairs to compare)
- Position bias: judges often favor a response because of where it appears, frequently the one shown first
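
To make the comparison-count trade-off concrete, the sketch below counts the pairs needed for a full round-robin versus comparing every model against a single baseline. The model names are placeholders, and `comparison_plan` is a hypothetical helper introduced only for illustration.

```python
from itertools import combinations


def comparison_plan(models, baseline=None):
    """Return the list of (model_x, model_y) pairs to judge.

    With no baseline, every unordered pair is compared: n * (n - 1) / 2 pairs.
    With a baseline, each other model is compared only to it: n - 1 pairs.
    """
    if baseline is None:
        return list(combinations(models, 2))
    if baseline not in models:
        raise ValueError(f"unknown baseline: {baseline}")
    return [(m, baseline) for m in models if m != baseline]


models = ["base", "ft-v1", "ft-v2", "ft-v3", "ft-v4"]  # placeholder names
round_robin = comparison_plan(models)
vs_baseline = comparison_plan(models, baseline="base")
print(len(round_robin), "pairs for a full round-robin")  # 10
print(len(vs_baseline), "pairs against the baseline")    # 4
```

Each pair is still judged once per test case, so the total number of judge calls is the pair count multiplied by the test-set size.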
Controlling for Position Bias in Pairwise
Randomize which order is shown first, alternate the order across evaluations, and take a majority vote:

    import random


    def debiased_pairwise(
        question: str,
        response_a: str,
        response_b: str,
        n_evaluations: int = 2,
    ) -> dict:
        """Run pairwise evaluation in alternating orders to reduce position bias."""
        votes = {"A": 0, "B": 0, "tie": 0}
        swap = random.random() < 0.5  # randomize which order is shown first
        for _ in range(n_evaluations):
            if not swap:
                winner = pairwise_compare(question, response_a, response_b)["winner"]
            else:
                raw_winner = pairwise_compare(question, response_b, response_a)["winner"]
                # Flip winner back to original labeling
                winner = {"A": "B", "B": "A"}.get(raw_winner, "tie")
            votes[winner if winner in votes else "tie"] += 1
            swap = not swap  # alternate so both orders are covered
        # Majority wins; a split vote is reported as a tie rather than defaulting to "A"
        top = max(votes.values())
        leaders = [label for label, count in votes.items() if count == top]
        final_winner = leaders[0] if len(leaders) == 1 else "tie"
        return {
            "winner": final_winner,
            "votes": votes,
            "confident": votes[final_winner] == n_evaluations,
        }

When to Use Each
| Situation | Recommended |
|---|---|
| Tracking model quality over time | Pointwise |
| Comparing two specific models | Pairwise |
| Detecting small performance differences | Pairwise |
| Producing a single quality metric | Pointwise |
| Evaluating many models against a baseline | Pairwise (each vs baseline) |
| Detecting regression after a change | Pointwise (compare averages) |
For model development: use pairwise to decide between model variants (A vs B comparisons). Use pointwise for ongoing quality monitoring (track score trend over product versions).
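
When a pairwise A/B result is used to pick a variant, it helps to check that the win margin is larger than chance. The following is a minimal sketch, assuming ties are dropped and each remaining comparison behaves like an independent fair coin flip under the null hypothesis (an exact two-sided sign test). `sign_test_p_value` is a hypothetical helper, and the counts are invented for illustration.

```python
from math import comb


def sign_test_p_value(wins_a: int, wins_b: int) -> float:
    """Exact two-sided sign test p-value for wins_a vs wins_b (ties excluded)."""
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = max(wins_a, wins_b)
    # Probability of a split at least this lopsided in one direction under a fair coin
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Invented counts: the same 60/40 split is weaker evidence with fewer comparisons
print(f"40 vs 60 wins: p = {sign_test_p_value(40, 60):.3f}")
print(f" 4 vs  6 wins: p = {sign_test_p_value(4, 6):.3f}")
```

A small p-value suggests the preferred variant is winning more often than chance would explain; a large one suggests collecting more comparisons before deciding.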
Combining Both: Bradley-Terry Model
For a robust ranking of multiple models, collect pairwise comparisons and fit a Bradley-Terry model — it converts pairwise wins into a consistent ranking:

    import numpy as np
    from scipy.optimize import minimize


    def bradley_terry_ranking(win_matrix: np.ndarray) -> list[float]:
        """
        Fit Bradley-Terry model from win counts.
        win_matrix[i][j] = number of times model i beat model j.
        Returns log-strength scores (higher is better), centered to mean zero.
        """
        n = win_matrix.shape[0]

        def neg_log_likelihood(params):
            strength = np.exp(params)
            ll = 0.0
            for i in range(n):
                for j in range(n):
                    if i == j:
                        continue
                    if win_matrix[i, j] > 0:
                        ll += win_matrix[i, j] * np.log(strength[i] / (strength[i] + strength[j]))
            return -ll

        result = minimize(neg_log_likelihood, np.zeros(n), method="L-BFGS-B")
        # Strengths are only identified up to a common scale, so center the log-strengths
        log_strengths = result.x - result.x.mean()
        return log_strengths.tolist()  # Log-strengths (higher = better model)


    # Example with 3 models
    wins = np.array([
        [0, 30, 45],   # Model 0 beat model 1 30 times, model 2 45 times
        [20, 0, 35],   # Model 1 beat model 0 20 times, model 2 35 times
        [5, 15, 0],    # Model 2 beat model 0 5 times, model 1 15 times
    ])
    scores = bradley_terry_ranking(wins)
    ranked = sorted(enumerate(scores), key=lambda x: -x[1])
    for rank, (model_idx, score) in enumerate(ranked):
        print(f"Rank {rank+1}: Model {model_idx} (score: {score:.3f})")

This approach underlies Chatbot Arena (LMSYS), a widely cited public LLM ranking based on pairwise human votes.
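
Under the Bradley-Terry model, the probability that model i beats model j is s_i / (s_i + s_j), which in terms of log-strengths is a logistic function of their difference. The sketch below converts two log-strengths into a predicted win probability. `win_probability` is a hypothetical helper, and the example values are placeholders rather than output from a real fit.

```python
import math


def win_probability(log_strength_i: float, log_strength_j: float) -> float:
    """P(model i beats model j) under Bradley-Terry, given log-strengths."""
    # exp(a) / (exp(a) + exp(b)) simplifies to 1 / (1 + exp(-(a - b)))
    return 1.0 / (1.0 + math.exp(-(log_strength_i - log_strength_j)))


# Placeholder log-strengths for two models (illustrative only)
p = win_probability(0.5, -0.5)
print(f"P(i beats j) = {p:.3f}")
```

Only the difference between two log-strengths matters, which is why shifting every score by the same constant leaves the predicted win rates unchanged.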