Pointwise vs Pairwise Evaluation

Two Approaches to LLM Evaluation

When evaluating model outputs with a judge (human or LLM), you have two fundamental approaches:

Pointwise: Score each response independently on a numeric scale (e.g., 1–5).

Pairwise: Show two responses side by side and ask which is better.

Each has distinct strengths and failure modes.

Pointwise Evaluation

Score one response at a time on defined criteria:

Python

from openai import OpenAI
import json

client = OpenAI()

def pointwise_score(question: str, response: str) -> dict:
    """Score a single response from 1-5 on multiple criteria."""
    prompt = f"""Rate this clinical pharmacology response on three criteria (1-5 each):

Question: {question}
Response: {response}

Criteria:
- accuracy (1=wrong facts, 5=fully accurate)
- completeness (1=missing key info, 5=covers all important aspects)
- clarity (1=confusing, 5=very clear and actionable)

Also give an overall score (1-5) reflecting your holistic judgment of the response.

JSON output:
{{"accuracy": <1-5>, "completeness": <1-5>, "clarity": <1-5>, "overall": <1-5>}}"""

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0.1,
    )
    return json.loads(resp.choices[0].message.content)

# Evaluate model performance across a test set
# (test_cases is assumed to be a list of {"question": ..., "response": ...} dicts)
scores = [pointwise_score(case["question"], case["response"]) for case in test_cases]
avg_overall = sum(s["overall"] for s in scores) / len(scores)
print(f"Average overall: {avg_overall:.2f}/5")

Pointwise strengths:

Produces absolute scores you can track over time
Scales well — each example evaluated independently
Easy to aggregate and compute averages

Pointwise weaknesses:

Calibration differences and drift: the same score can mean different things across judges, or for the same judge over time
Hard to distinguish between similar-quality responses (both score 4/5)
Judges anchor on arbitrary numbers and show inconsistency near boundaries
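
Because pointwise scores are easy to aggregate but often cluster around the same values, it can help to report the spread alongside the average. The following is a minimal sketch, not part of the original pipeline: `summarize_scores` is a hypothetical helper that assumes the list of score dicts returned by `pointwise_score`, and it uses only the standard library.

```python
import math
import statistics

def summarize_scores(scores: list[dict], criterion: str = "overall") -> dict:
    """Mean, sample standard deviation, and standard error for one criterion."""
    values = [s[criterion] for s in scores]
    n = len(values)
    mean = statistics.fmean(values)
    # The sample standard deviation needs at least two values.
    stdev = statistics.stdev(values) if n > 1 else 0.0
    return {"n": n, "mean": mean, "stdev": stdev, "stderr": stdev / math.sqrt(n)}

# Illustrative data only, not real evaluation results
example_scores = [{"overall": 4}, {"overall": 5}, {"overall": 3}, {"overall": 4}]
summary = summarize_scores(example_scores)
print(f"{summary['mean']:.2f} +/- {summary['stderr']:.2f} (n={summary['n']})")
```

If the standard error is about as large as the gap between two models' averages, the pointwise scores alone probably cannot separate them.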

Pairwise Evaluation

Compare two responses directly:

Python

def pairwise_compare(question: str, response_a: str, response_b: str) -> dict:
    """Compare two responses — which is better?"""
    prompt = f"""Compare two clinical pharmacology responses.

Question: {question}

Response A:
{response_a}

Response B:
{response_b}

Which response better serves a clinical professional? Consider accuracy, completeness, and actionability.

JSON output:
{{"winner": "A" or "B" or "tie", "margin": "clear" or "slight", "reason": "one sentence"}}"""

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0.1,
    )
    return json.loads(resp.choices[0].message.content)

# A/B test: base model vs fine-tuned model
def run_ab_evaluation(
    test_cases: list[dict],
    model_a_responses: list[str],
    model_b_responses: list[str],
) -> dict:
    results = {"A": 0, "B": 0, "tie": 0}

    for case, resp_a, resp_b in zip(test_cases, model_a_responses, model_b_responses):
        comparison = pairwise_compare(case["question"], resp_a, resp_b)
        results[comparison["winner"]] += 1

    total = sum(results.values())
    return {
        "A_win_rate": results["A"] / total,
        "B_win_rate": results["B"] / total,
        "tie_rate": results["tie"] / total,
        "total": total,
    }
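
A raw win rate says little about whether the difference could be noise. One common check is an exact sign test on the non-tied comparisons. The sketch below is an illustration under assumptions: `sign_test_p_value` is a hypothetical helper, ties are simply dropped, and the comparisons are treated as independent.

```python
from math import comb

def sign_test_p_value(wins_a: int, wins_b: int) -> float:
    """Two-sided exact binomial test of H0: A and B are equally likely to win.

    Ties are excluded before calling this function.
    """
    n = wins_a + wins_b
    if n == 0:
        return 1.0  # no decisive comparisons, so no evidence either way
    k = max(wins_a, wins_b)
    # One-sided tail P(X >= k) for X ~ Binomial(n, 0.5)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    # Double the tail for a two-sided test, capped at 1
    return min(1.0, 2 * tail)

# Illustrative counts only
print(sign_test_p_value(wins_a=60, wins_b=40))
```

A small p-value suggests the preference is unlikely to come from a judge that picks either side with equal probability.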

Pairwise strengths:

More reliable for distinguishing similar-quality models
Aligns with how humans naturally compare things (relative vs absolute)
Less sensitive to judge calibration — just need to know which is better

Pairwise weaknesses:

Doesn't produce an absolute quality measure
Requires O(n²) comparisons for n models if every pair is compared (mitigated by choosing which pairs to compare)
Position bias: judges may favor a response because of where it appears (often the one shown first) rather than its quality
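
To make the comparison-count point concrete, the sketch below counts the pairs needed for a full round-robin versus comparing each model against a single baseline. The function names and model names are hypothetical and serve only as illustration.

```python
from itertools import combinations

def all_pairs(models: list[str]) -> list[tuple[str, str]]:
    """Every unordered pair of models: n * (n - 1) / 2 comparisons."""
    return list(combinations(models, 2))

def baseline_pairs(models: list[str], baseline: str) -> list[tuple[str, str]]:
    """Each non-baseline model against the baseline: n - 1 comparisons."""
    return [(baseline, m) for m in models if m != baseline]

# Hypothetical model names
models = ["base", "ft-v1", "ft-v2", "ft-v3"]
print(len(all_pairs(models)), "round-robin pairs")
print(len(baseline_pairs(models, "base")), "baseline pairs")
```

Each pair is then run over the whole test set, so the saving multiplies by the number of test cases.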

Controlling for Position Bias in Pairwise

Evaluate each pair in both presentation orders and take a majority vote, treating a split vote as a tie. Alternating the order, rather than drawing it at random, guarantees that both orders are covered even with only two evaluations:

Python

def debiased_pairwise(
    question: str,
    response_a: str,
    response_b: str,
    n_evaluations: int = 2,
) -> dict:
    """Run pairwise evaluation in alternating orders to reduce position bias."""
    votes = {"A": 0, "B": 0, "tie": 0}

    for i in range(n_evaluations):
        # Alternate the order so both presentations are covered
        # (use an even n_evaluations to keep the two orders balanced)
        if i % 2 == 0:
            result = pairwise_compare(question, response_a, response_b)
            winner = result["winner"]
        else:
            result = pairwise_compare(question, response_b, response_a)
            # Flip winner back to original labeling
            winner = {"A": "B", "B": "A"}.get(result["winner"], "tie")

        votes[winner] += 1

    # Plurality wins; if the top vote count is shared, call it a tie
    top = max(votes.values())
    leaders = [label for label, count in votes.items() if count == top]
    final_winner = leaders[0] if len(leaders) == 1 else "tie"
    return {
        "winner": final_winner,
        "votes": votes,
        "confident": votes[final_winner] == n_evaluations,
    }
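
Before relying on any debiasing scheme, it can be useful to measure how strong the position bias is for a given judge. The sketch below is a hypothetical diagnostic: it assumes a `judge` callable with the same signature and return shape as `pairwise_compare`, and the stub judge exists only so the example runs without API calls.

```python
from typing import Callable

FLIP = {"A": "B", "B": "A", "tie": "tie"}

def position_consistency(
    cases: list[tuple[str, str, str]],
    judge: Callable[[str, str, str], dict],
) -> dict:
    """Judge every (question, resp_a, resp_b) case in both orders.

    consistency_rate: share of cases where both orders agree on the winner.
    first_position_rate: share of all judgments won by the first-shown response.
    """
    if not cases:
        raise ValueError("cases must not be empty")
    consistent = 0
    first_slot_wins = 0
    for question, resp_a, resp_b in cases:
        forward = judge(question, resp_a, resp_b)["winner"]
        backward_raw = judge(question, resp_b, resp_a)["winner"]
        # Map the swapped verdict back to the original A/B labels
        if forward == FLIP[backward_raw]:
            consistent += 1
        # In each raw verdict, "A" is whichever response was shown first
        first_slot_wins += (forward == "A") + (backward_raw == "A")
    return {
        "consistency_rate": consistent / len(cases),
        "first_position_rate": first_slot_wins / (2 * len(cases)),
    }

def always_first_judge(question: str, first: str, second: str) -> dict:
    """Stub judge with maximal position bias: the first response always wins."""
    return {"winner": "A"}

cases = [("q1", "short", "a longer answer"), ("q2", "detailed answer", "ok")]
print(position_consistency(cases, always_first_judge))
```

An unbiased judge should show a high consistency rate and a first-position rate near 0.5 when ties are rare; large deviations suggest that more evaluations per pair are needed.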

When to Use Each

| Situation | Recommended |
|---|---|
| Tracking model quality over time | Pointwise |
| Comparing two specific models | Pairwise |
| Detecting small performance differences | Pairwise |
| Producing a single quality metric | Pointwise |
| Evaluating many models against a baseline | Pairwise (each vs baseline) |
| Detecting regression after a change | Pointwise (compare averages) |

For model development: use pairwise to decide between model variants (A vs B comparisons). Use pointwise for ongoing quality monitoring (track score trend over product versions).
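
For the regression-monitoring case, comparing averages is more informative with an uncertainty estimate. The sketch below uses a paired percentile bootstrap over per-case score differences. It is an illustration under assumptions: `bootstrap_mean_diff` is a hypothetical helper, both versions are assumed to be scored on the same test cases in the same order, and the 95% level and resample count are arbitrary defaults.

```python
import random
import statistics

def bootstrap_mean_diff(
    old_scores: list[float],
    new_scores: list[float],
    n_resamples: int = 2000,
    seed: int = 0,
) -> dict:
    """95% percentile bootstrap interval for mean(new - old), paired by test case."""
    if len(old_scores) != len(new_scores):
        raise ValueError("score lists must be paired and of equal length")
    diffs = [new - old for old, new in zip(old_scores, new_scores)]
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    # Resample the per-case differences with replacement and record each mean
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=len(diffs)))
        for _ in range(n_resamples)
    )
    return {
        "mean_diff": statistics.fmean(diffs),
        "ci_low": means[int(0.025 * n_resamples)],
        "ci_high": means[int(0.975 * n_resamples) - 1],
    }

# Illustrative scores only
old = [4, 4, 5, 3, 4, 5, 4, 4]
new = [4, 3, 4, 3, 4, 4, 3, 4]
print(bootstrap_mean_diff(old, new))
```

If the whole interval sits below zero, the drop is unlikely to be resampling noise and is worth investigating as a regression.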

Combining Both: Bradley-Terry Model

For a robust ranking of multiple models, collect pairwise comparisons and fit a Bradley-Terry model — it converts pairwise wins into a consistent ranking:

Python

from scipy.optimize import minimize
import numpy as np

def bradley_terry_ranking(win_matrix: np.ndarray) -> list[float]:
    """
    Fit Bradley-Terry model from win counts.
    win_matrix[i][j] = number of times model i beat model j.
    Returns log-strength scores centered at zero (higher is better).
    """
    n = win_matrix.shape[0]

    def neg_log_likelihood(params):
        ll = 0.0
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                if win_matrix[i, j] > 0:
                    # log P(i beats j) = s_i - log(exp(s_i) + exp(s_j)), computed stably
                    ll += win_matrix[i, j] * (params[i] - np.logaddexp(params[i], params[j]))
        return -ll

    result = minimize(neg_log_likelihood, np.zeros(n), method="L-BFGS-B")
    # Scores are only identified up to an additive constant, so center them at zero
    log_strengths = result.x - result.x.mean()
    return log_strengths.tolist()  # Log-strengths (higher = better model)

# Example with 3 models
wins = np.array([
    [0, 30, 45],   # Model 0 beat model 1 30 times, model 2 45 times
    [20, 0, 35],   # Model 1 beat model 0 20 times, model 2 35 times
    [5, 15, 0],    # Model 2 beat model 0 5 times, model 1 15 times
])

scores = bradley_terry_ranking(wins)
ranked = sorted(enumerate(scores), key=lambda x: -x[1])
for rank, (model_idx, score) in enumerate(ranked):
    print(f"Rank {rank+1}: Model {model_idx} (score: {score:.3f})")

A Bradley-Terry-style model underlies the Chatbot Arena (LMSYS) leaderboard, a widely cited public LLM ranking based on pairwise human votes.
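
Two small helpers make the fitted scores easier to use. Under the Bradley-Terry model, the probability that model i beats model j is the logistic function of the difference in their log-strengths. The sketch below shows that conversion, plus one way to build the win matrix from raw comparison records. Both function names and the `(winner_index, loser_index)` record format are assumptions for illustration, and ties are assumed to have been dropped.

```python
import math

def build_win_matrix(comparisons: list[tuple[int, int]], n_models: int) -> list[list[int]]:
    """Count wins from (winner_index, loser_index) records."""
    matrix = [[0] * n_models for _ in range(n_models)]
    for winner, loser in comparisons:
        matrix[winner][loser] += 1
    return matrix

def win_probability(log_strengths: list[float], i: int, j: int) -> float:
    """P(model i beats model j) = sigmoid(s_i - s_j) under Bradley-Terry."""
    return 1.0 / (1.0 + math.exp(-(log_strengths[i] - log_strengths[j])))

# Illustrative records: model 0 beat model 1 twice, model 1 beat model 0 once
records = [(0, 1), (0, 1), (1, 0)]
print(build_win_matrix(records, n_models=2))

# Illustrative log-strengths, not fitted values
print(f"{win_probability([0.4, -0.4], 0, 1):.3f}")
```

The nested list can be wrapped with `np.array(...)` before passing it to `bradley_terry_ranking`. Because only score differences matter, adding a constant to every log-strength leaves all predicted win probabilities unchanged.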
