LLM Evaluation Q&A · Lesson 10 of 16
Pointwise vs Pairwise Evaluation
Two Approaches to LLM Evaluation
When evaluating model outputs with a judge (human or LLM), you have two fundamental approaches:
Pointwise: Score each response independently on a numeric scale (e.g., 1–5).
Pairwise: Show two responses side by side and ask which is better.
Each has distinct strengths and failure modes.
Pointwise Evaluation
Score one response at a time on defined criteria:
from openai import OpenAI
import json

client = OpenAI()

def pointwise_score(question: str, response: str) -> dict:
    """Score a single response from 1-5 on multiple criteria."""
    prompt = f"""Rate this clinical pharmacology response on three criteria (1-5 each), then give an overall score (1-5):

Question: {question}

Response: {response}

Criteria:
- accuracy (1=wrong facts, 5=fully accurate)
- completeness (1=missing key info, 5=covers all important aspects)
- clarity (1=confusing, 5=very clear and actionable)

JSON output:
{{"accuracy": <1-5>, "completeness": <1-5>, "clarity": <1-5>, "overall": <1-5>}}"""

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0.1,
    )
    return json.loads(resp.choices[0].message.content)

# Evaluate model performance across a test set.
# test_cases is assumed to be a list of dicts with "question" and "response" keys.
scores = [pointwise_score(case["question"], case["response"]) for case in test_cases]
avg_overall = sum(s["overall"] for s in scores) / len(scores)
print(f"Average overall: {avg_overall:.2f}/5")

Pointwise strengths:
- Produces absolute scores you can track over time
- Scales well: each example is evaluated independently
- Easy to aggregate and compute averages
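Because pointwise scores aggregate easily, it can be useful to report more than a single average. The sketch below summarizes each criterion with a mean and a standard error, so that a change between versions can be compared against the noise in the scores. The helper name `summarize_scores` and the sample data are hypothetical and exist only for illustration.

```python
import math
import statistics

def summarize_scores(scores: list[dict], criteria: list[str]) -> dict:
    """Return mean, standard error, and count per criterion for a list of score dicts."""
    summary = {}
    for criterion in criteria:
        values = [s[criterion] for s in scores]
        mean = statistics.fmean(values)
        # Standard error of the mean; undefined for fewer than two values
        if len(values) > 1:
            sem = statistics.stdev(values) / math.sqrt(len(values))
        else:
            sem = float("nan")
        summary[criterion] = {"mean": mean, "sem": sem, "n": len(values)}
    return summary

# Hypothetical judge outputs, for illustration only
sample_scores = [
    {"accuracy": 4, "completeness": 3, "clarity": 5, "overall": 4},
    {"accuracy": 5, "completeness": 4, "clarity": 4, "overall": 4},
    {"accuracy": 3, "completeness": 3, "clarity": 4, "overall": 3},
]
summary = summarize_scores(sample_scores, ["accuracy", "completeness", "clarity", "overall"])
for name, stats in summary.items():
    print(f"{name}: {stats['mean']:.2f} +/- {stats['sem']:.2f} (n={stats['n']})")
```

A difference between two versions that is small relative to the standard errors is probably not worth acting on without more test cases.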
Pointwise weaknesses:
- Calibration differences: the same score can mean different things to different judges, and a single judge's standard can drift over time
- Hard to distinguish between similar-quality responses (both score 4/5)
- Judges anchor on arbitrary numbers and show inconsistency near boundaries
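One common way to reduce calibration differences between judges is to standardize each judge's scores before combining them. The sketch below z-scores every rating against the mean and spread of the judge who produced it. This is an assumption about how one might handle the problem, not a technique the lesson prescribes, and the function name and sample ratings are hypothetical.

```python
import statistics
from collections import defaultdict

def normalize_per_judge(ratings: list[tuple[str, str, float]]) -> dict[str, float]:
    """
    ratings: (judge_id, item_id, raw_score) tuples.
    Returns the mean z-score per item, where each score is standardized
    against the mean and standard deviation of the judge who gave it.
    """
    by_judge = defaultdict(list)
    for judge, _, score in ratings:
        by_judge[judge].append(score)

    judge_stats = {
        judge: (statistics.fmean(values), statistics.pstdev(values))
        for judge, values in by_judge.items()
    }

    z_by_item = defaultdict(list)
    for judge, item, score in ratings:
        mean, spread = judge_stats[judge]
        # A judge who gives every item the same score carries no ranking signal
        z = (score - mean) / spread if spread > 0 else 0.0
        z_by_item[item].append(z)

    return {item: statistics.fmean(zs) for item, zs in z_by_item.items()}

# Hypothetical ratings: the "strict" judge scores everything one point lower
ratings = [
    ("lenient", "resp_1", 5), ("lenient", "resp_2", 4), ("lenient", "resp_3", 3),
    ("strict", "resp_1", 4), ("strict", "resp_2", 3), ("strict", "resp_3", 2),
]
normalized = normalize_per_judge(ratings)
print(normalized)
```

After normalization the two judges agree exactly on the relative standing of the three responses, even though their raw averages differ by a full point. The trade-off is that z-scores are relative to each judge's own distribution, so the absolute scale is lost.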
Pairwise Evaluation
Compare two responses directly:
def pairwise_compare(question: str, response_a: str, response_b: str) -> dict:
    """Compare two responses and report which is better."""
    prompt = f"""Compare two clinical pharmacology responses.

Question: {question}

Response A:
{response_a}

Response B:
{response_b}

Which response better serves a clinical professional? Consider accuracy, completeness, and actionability.

JSON output:
{{"winner": "A" or "B" or "tie", "margin": "clear" or "slight", "reason": "one sentence"}}"""

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0.1,
    )
    return json.loads(resp.choices[0].message.content)

# A/B test: base model vs fine-tuned model
def run_ab_evaluation(
    test_cases: list[dict],
    model_a_responses: list[str],
    model_b_responses: list[str],
) -> dict:
    results = {"A": 0, "B": 0, "tie": 0}
    for case, resp_a, resp_b in zip(test_cases, model_a_responses, model_b_responses):
        comparison = pairwise_compare(case["question"], resp_a, resp_b)
        winner = comparison.get("winner")
        # Skip malformed judge output instead of raising KeyError
        if winner in results:
            results[winner] += 1
    total = sum(results.values())
    if total == 0:
        raise ValueError("No valid comparisons were recorded")
    return {
        "A_win_rate": results["A"] / total,
        "B_win_rate": results["B"] / total,
        "tie_rate": results["tie"] / total,
        "total": total,
    }

Pairwise strengths:
- More reliable for distinguishing similar-quality models
- Aligns with how humans naturally compare things (relative vs absolute)
- Less sensitive to judge calibration: the judge only needs to decide which response is better
Pairwise weaknesses:
- Doesn't produce an absolute quality measure
- Requires O(n²) comparisons for n models (mitigated by choosing which pairs to compare)
- Position bias: judges tend to favor a response because of where it appears (often the one shown first) rather than its quality
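To make the comparison-count trade-off concrete, the sketch below enumerates the pairs needed per test case under two schedules: a full round robin, and each model against a single baseline. The model names and helper functions are hypothetical, chosen only to illustrate the counts.

```python
from itertools import combinations

def round_robin_pairs(models: list[str]) -> list[tuple[str, str]]:
    """Every unordered pair of models: n * (n - 1) / 2 comparisons per test case."""
    return list(combinations(models, 2))

def baseline_pairs(models: list[str], baseline: str) -> list[tuple[str, str]]:
    """Each model against one baseline only: n - 1 comparisons per test case."""
    return [(baseline, m) for m in models if m != baseline]

models = ["base", "ft_v1", "ft_v2", "ft_v3", "ft_v4"]  # hypothetical model names
all_pairs = round_robin_pairs(models)
vs_base = baseline_pairs(models, "base")
print(len(all_pairs), len(vs_base))  # prints: 10 4
```

Comparing against a baseline grows linearly with the number of models, at the cost of never comparing two non-baseline models directly.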
Controlling for Position Bias in Pairwise
Evaluate each pair in both presentation orders and aggregate the votes:

import random

def debiased_pairwise(
    question: str,
    response_a: str,
    response_b: str,
    n_evaluations: int = 2,
) -> dict:
    """Run pairwise evaluation in alternating orders to reduce position bias."""
    votes = {"A": 0, "B": 0, "tie": 0}
    # Randomize which order comes first, then alternate so that an even
    # n_evaluations shows each response first equally often
    swapped = random.random() < 0.5
    for _ in range(n_evaluations):
        if not swapped:
            result = pairwise_compare(question, response_a, response_b)
            winner = result["winner"]
        else:
            result = pairwise_compare(question, response_b, response_a)
            # Flip winner back to original labeling
            raw_winner = result["winner"]
            if raw_winner == "A":
                winner = "B"
            elif raw_winner == "B":
                winner = "A"
            else:
                winner = "tie"
        votes[winner] += 1
        swapped = not swapped
    # Majority wins; an even split between options is reported as a tie
    top = max(votes.values())
    leaders = [label for label, count in votes.items() if count == top]
    final_winner = leaders[0] if len(leaders) == 1 else "tie"
    return {
        "winner": final_winner,
        "votes": votes,
        "confident": votes[final_winner] == n_evaluations,
    }

When to Use Each
| Situation | Recommended |
|---|---|
| Tracking model quality over time | Pointwise |
| Comparing two specific models | Pairwise |
| Detecting small performance differences | Pairwise |
| Producing a single quality metric | Pointwise |
| Evaluating many models against a baseline | Pairwise (each vs baseline) |
| Detecting regression after a change | Pointwise (compare averages) |
For model development: use pairwise to decide between model variants (A vs B comparisons). Use pointwise for ongoing quality monitoring (track score trend over product versions).
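When a pairwise A/B result is used to choose between variants, it helps to check whether the gap in wins could plausibly be chance. A minimal sketch, assuming test cases are independent and ties are dropped, is an exact two-sided sign test on the win counts. The function name is hypothetical and the test is one reasonable choice among several, not something the lesson specifies.

```python
from math import comb

def sign_test_p_value(wins_a: int, wins_b: int) -> float:
    """
    Two-sided exact binomial (sign) test of the null hypothesis that
    A and B are equally likely to win. Ties are excluded beforehand.
    """
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = max(wins_a, wins_b)
    # Probability of a result at least this lopsided in one direction
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    # Double for the two-sided test, capped at 1
    return min(1.0, 2 * tail)

print(sign_test_p_value(8, 2))  # prints: 0.109375
```

In this example, 8 wins out of 10 decided comparisons gives a p-value of about 0.11, so even a lopsided-looking result on a small test set may not be conclusive.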
Combining Both: Bradley-Terry Model
For a robust ranking of multiple models, collect pairwise comparisons and fit a Bradley-Terry model. It converts pairwise win counts into one strength score per model, which yields a consistent ranking:
from scipy.optimize import minimize
import numpy as np

def bradley_terry_ranking(win_matrix: np.ndarray) -> list[float]:
    """
    Fit Bradley-Terry model from win counts.
    win_matrix[i][j] = number of times model i beat model j.
    Returns log-strength scores (higher is better), centered to mean zero.
    """
    n = win_matrix.shape[0]

    def neg_log_likelihood(params):
        strength = np.exp(params)
        ll = 0.0
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                if win_matrix[i, j] > 0:
                    ll += win_matrix[i, j] * np.log(strength[i] / (strength[i] + strength[j]))
        return -ll

    result = minimize(neg_log_likelihood, np.zeros(n), method="L-BFGS-B")
    # Scores are only identified up to an additive constant, so center them
    log_strengths = result.x - result.x.mean()
    return log_strengths.tolist()  # Log-strengths (higher = better model)

# Example with 3 models
wins = np.array([
    [0, 30, 45],  # Model 0 beat model 1 30 times, model 2 45 times
    [20, 0, 35],  # Model 1 beat model 0 20 times, model 2 35 times
    [5, 15, 0],   # Model 2 beat model 0 5 times, model 1 15 times
])
scores = bradley_terry_ranking(wins)
ranked = sorted(enumerate(scores), key=lambda x: -x[1])
for rank, (model_idx, score) in enumerate(ranked):
    print(f"Rank {rank+1}: Model {model_idx} (score: {score:.3f})")

A Bradley-Terry-style model underlies the Chatbot Arena (LMSYS) leaderboard, a widely cited public LLM ranking based on pairwise human votes.
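Bradley-Terry log-strength scores can also be read as predicted win probabilities: under the model, the chance that model i beats model j is exp(s_i) / (exp(s_i) + exp(s_j)), which depends only on the difference s_i - s_j. The sketch below computes that probability; the function name and the example scores are hypothetical and are not fitted values from the data above.

```python
import math

def win_probability(score_i: float, score_j: float) -> float:
    """Bradley-Terry probability that model i beats model j, given log-strengths."""
    # exp(s_i) / (exp(s_i) + exp(s_j)) simplifies to a logistic function of the difference
    return 1.0 / (1.0 + math.exp(-(score_i - score_j)))

# Hypothetical log-strengths, for illustration only
log_strengths = {"model_x": math.log(3), "model_y": 0.0}
p = win_probability(log_strengths["model_x"], log_strengths["model_y"])
print(f"P(model_x beats model_y) = {p:.2f}")  # prints: P(model_x beats model_y) = 0.75
```

Because only score differences matter, adding the same constant to every score leaves all predicted probabilities unchanged, which is why the scores are meaningful only relative to one another.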