Pointwise vs Pairwise Evaluation

Two Approaches to LLM Evaluation

When evaluating model outputs with a judge (human or LLM), you have two fundamental approaches:

Pointwise: Score each response independently on a numeric scale (e.g., 1–5).

Pairwise: Show two responses side by side and ask which is better.

Each has distinct strengths and failure modes.


Pointwise Evaluation

Score one response at a time on defined criteria:

Python
from openai import OpenAI
import json

client = OpenAI()

def pointwise_score(question: str, response: str) -> dict:
    """Score a single response from 1-5 on multiple criteria."""
    prompt = f"""Rate this clinical pharmacology response on three criteria (1-5 each):

Question: {question}
Response: {response}

Criteria:
- accuracy (1=wrong facts, 5=fully accurate)
- completeness (1=missing key info, 5=covers all important aspects)
- clarity (1=confusing, 5=very clear and actionable)

Also give an overall score (1-5) summarizing the three criteria.

JSON output:
{{"accuracy": <1-5>, "completeness": <1-5>, "clarity": <1-5>, "overall": <1-5>}}"""

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0.1,
    )
    return json.loads(resp.choices[0].message.content)

# Evaluate model performance across a test set.
# test_cases is assumed to be a list of dicts with "question" and "response" keys.
scores = [pointwise_score(case["question"], case["response"]) for case in test_cases]
avg_overall = sum(s["overall"] for s in scores) / len(scores)
print(f"Average overall: {avg_overall:.2f}/5")

Pointwise strengths:

  • Produces absolute scores you can track over time
  • Scales well — each example evaluated independently
  • Easy to aggregate and compute averages

Pointwise weaknesses:

  • Calibration differences: the same score can mean different things to different judges, and a single judge's scale can drift over time
  • Hard to distinguish between similar-quality responses (both score 4/5)
  • Judges anchor on arbitrary numbers and show inconsistency near boundaries
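
One way to make the boundary inconsistency visible is to score the same response several times and look at the spread. The sketch below is only illustrative: it assumes you have already collected repeated `overall` scores per response, and both the `repeated_scores` layout and the `score_stability` name are hypothetical rather than part of any library.

```python
from statistics import mean, pstdev

def score_stability(repeated_scores: dict[str, list[int]]) -> dict[str, dict]:
    """Summarize repeated pointwise scores per response.

    repeated_scores maps a response id to the scores the judge gave it
    across several independent runs (hypothetical data layout).
    """
    summary = {}
    for response_id, scores in repeated_scores.items():
        summary[response_id] = {
            "mean": mean(scores),
            "spread": pstdev(scores),          # 0.0 means the judge never changed its score
            "unstable": len(set(scores)) > 1,  # the judge gave at least two different scores
        }
    return summary

# Made-up scores for illustration: r1 is stable, r2 flips between 3 and 4
example = {"r1": [4, 4, 4], "r2": [3, 4, 3, 4]}
stability = score_stability(example)
```

Responses flagged as unstable are good candidates for a pairwise comparison or a more detailed rubric, since a single pointwise score for them may not be trustworthy.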

Pairwise Evaluation

Compare two responses directly:

Python
def pairwise_compare(question: str, response_a: str, response_b: str) -> dict:
    """Compare two responses — which is better?"""
    prompt = f"""Compare two clinical pharmacology responses.

Question: {question}

Response A:
{response_a}

Response B:
{response_b}

Which response better serves a clinical professional? Consider accuracy, completeness, and actionability.

JSON output:
{{"winner": "A" or "B" or "tie", "margin": "clear" or "slight", "reason": "one sentence"}}"""

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0.1,
    )
    return json.loads(resp.choices[0].message.content)

# A/B test: base model vs fine-tuned model
def run_ab_evaluation(
    test_cases: list[dict],
    model_a_responses: list[str],
    model_b_responses: list[str],
) -> dict:
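    results = {"A": 0, "B": 0, "tie": 0}

    for case, resp_a, resp_b in zip(test_cases, model_a_responses, model_b_responses):
        comparison = pairwise_compare(case["question"], resp_a, resp_b)
        winner = comparison.get("winner")
        # Judges sometimes return unexpected labels; counting those as ties is one simple policy
        results[winner if winner in results else "tie"] += 1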

    total = sum(results.values())
    return {
        "A_win_rate": results["A"] / total,
        "B_win_rate": results["B"] / total,
        "tie_rate": results["tie"] / total,
        "total": total,
    }
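
A win rate on its own does not say whether the gap could be noise. As a rough check, one option is an exact two-sided sign test on the decisive comparisons (ties dropped). This is a minimal sketch using only the standard library; the function name `sign_test_p_value` is hypothetical, and it expects raw win counts (the `results` dict inside `run_ab_evaluation`, before it is converted to rates).

```python
from math import comb

def sign_test_p_value(wins_a: int, wins_b: int) -> float:
    """Two-sided exact sign test on decisive comparisons (ties excluded).

    Null hypothesis: A and B are equally likely to win a comparison.
    """
    n = wins_a + wins_b
    if n == 0:
        return 1.0  # no decisive comparisons, so no evidence either way
    k = max(wins_a, wins_b)
    # Probability of seeing k or more wins out of n under a fair coin
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2**n
    return min(1.0, 2 * tail)

p_even = sign_test_p_value(5, 5)     # evenly split verdicts
p_sweep = sign_test_p_value(10, 0)   # one model wins every decisive comparison
```

A small p-value suggests the preference is unlikely to be chance alone, though it says nothing about judge bias, which has to be handled separately.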

Pairwise strengths:

  • More reliable for distinguishing similar-quality models
  • Aligns with how humans naturally compare things (relative vs absolute)
  • Less sensitive to judge calibration — just need to know which is better

Pairwise weaknesses:

  • Doesn't produce absolute quality measure
  • Requires O(n²) comparisons for n models (mitigated by choosing which pairs to compare)
  • Position bias: judges, LLM judges in particular, tend to favor a response because of where it appears (often the one shown first)
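
To make the comparison-count trade-off concrete, the sketch below contrasts comparing every pair of models with comparing each model against a single baseline. The model names and the helper names `all_pairs` and `baseline_pairs` are hypothetical, chosen only for illustration.

```python
from itertools import combinations

def all_pairs(models: list[str]) -> list[tuple[str, str]]:
    """Every unordered pair: n * (n - 1) / 2 comparisons per test case."""
    return list(combinations(models, 2))

def baseline_pairs(models: list[str], baseline: str) -> list[tuple[str, str]]:
    """Each candidate against one baseline: n - 1 comparisons per test case."""
    return [(baseline, m) for m in models if m != baseline]

models = ["base", "ft-v1", "ft-v2", "ft-v3", "ft-v4"]  # hypothetical model names
full = all_pairs(models)
versus_base = baseline_pairs(models, "base")
print(len(full), len(versus_base))  # 10 4
```

The baseline scheme grows linearly with the number of models, at the cost of never comparing two candidates directly.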

Controlling for Position Bias in Pairwise

Evaluate each pair in both orders and aggregate the votes:

Python
def debiased_pairwise(
    question: str,
    response_a: str,
    response_b: str,
    n_evaluations: int = 2,
) -> dict:
    """Run pairwise evaluation in alternating order to reduce position bias.

    Use an even n_evaluations so each order is judged equally often.
    """
    votes = {"A": 0, "B": 0, "tie": 0}
    flip = {"A": "B", "B": "A", "tie": "tie"}

    for i in range(n_evaluations):
        if i % 2 == 0:
            # Original order: A is shown first
            result = pairwise_compare(question, response_a, response_b)
            winner = result["winner"]
        else:
            # Swapped order: B is shown first, so map the verdict back to the original labels
            result = pairwise_compare(question, response_b, response_a)
            winner = flip.get(result["winner"], "tie")

        # Unexpected labels from the judge are counted as ties
        votes[winner if winner in votes else "tie"] += 1

    # Plurality wins; a split vote (e.g. one for A, one for B) is a tie, not a default win for A
    top = max(votes.values())
    leaders = [label for label, count in votes.items() if count == top]
    final_winner = leaders[0] if len(leaders) == 1 else "tie"

    return {
        "winner": final_winner,
        "votes": votes,
        "confident": votes[final_winner] == n_evaluations,
    }
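
To check how much position bias a judge actually shows, one option is to judge every pair in both orders and count how often the verdict depends on the order. The sketch below assumes you have stored both verdicts per pair, each already mapped back to the original A/B labels; the `position_bias_report` name and the tuple layout are hypothetical.

```python
def position_bias_report(verdicts: list[tuple[str, str]]) -> dict:
    """Summarize how often the verdict depends on presentation order.

    Each tuple is (winner_in_original_order, winner_in_swapped_order),
    with both winners expressed in the original A/B labels.
    """
    total = len(verdicts)
    if total == 0:
        return {"total": 0, "consistent_rate": 0.0,
                "first_slot_flips": 0, "second_slot_flips": 0}

    consistent = sum(1 for first, second in verdicts if first == second)
    # Original order shows A first; swapped order shows B first.
    # ("A", "B") therefore means whichever response was shown first won both times.
    first_slot = sum(1 for v in verdicts if v == ("A", "B"))
    # ("B", "A") means whichever response was shown second won both times.
    second_slot = sum(1 for v in verdicts if v == ("B", "A"))

    return {
        "total": total,
        "consistent_rate": consistent / total,
        "first_slot_flips": first_slot,
        "second_slot_flips": second_slot,
    }

# Made-up verdicts for illustration
sample = [("A", "A"), ("A", "B"), ("A", "B"), ("B", "A"), ("tie", "tie")]
report = position_bias_report(sample)
```

A large imbalance between `first_slot_flips` and `second_slot_flips` points to a systematic preference for one position, while a low `consistent_rate` suggests the judge's verdicts should not be trusted from a single ordering.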

When to Use Each

| Situation | Recommended |
|---|---|
| Tracking model quality over time | Pointwise |
| Comparing two specific models | Pairwise |
| Detecting small performance differences | Pairwise |
| Producing a single quality metric | Pointwise |
| Evaluating many models against a baseline | Pairwise (each vs baseline) |
| Detecting regression after a change | Pointwise (compare averages) |

For model development: use pairwise to decide between model variants (A vs B comparisons). Use pointwise for ongoing quality monitoring (track score trend over product versions).


Combining Both: Bradley-Terry Model

For a robust ranking of multiple models, collect pairwise comparisons and fit a Bradley-Terry model — it converts pairwise wins into a consistent ranking:

Python
from scipy.optimize import minimize
import numpy as np

def bradley_terry_ranking(win_matrix: np.ndarray) -> list[float]:
    """
    Fit Bradley-Terry model from win counts.
    win_matrix[i][j] = number of times model i beat model j.
    Returns log-strength scores (higher is better).
    """
    n = win_matrix.shape[0]

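    def neg_log_likelihood(params):
        # params are log-strengths; P(i beats j) = exp(p_i) / (exp(p_i) + exp(p_j))
        ll = 0.0
        for i in range(n):
            for j in range(n):
                if i == j or win_matrix[i, j] == 0:
                    continue
                # log P(i beats j) = -log(1 + exp(p_j - p_i)), computed stably
                ll -= win_matrix[i, j] * np.logaddexp(0.0, params[j] - params[i])
        return -ll

    result = minimize(neg_log_likelihood, np.zeros(n), method="L-BFGS-B")
    # Only differences between scores are identified, so center them at zero
    log_strengths = result.x - result.x.mean()
    return log_strengths.tolist()  # Log-strengths (higher = better model)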

# Example with 3 models
wins = np.array([
    [0, 30, 45],   # Model 0 beat model 1 30 times, model 2 45 times
    [20, 0, 35],   # Model 1 beat model 0 20 times, model 2 35 times
    [5, 15, 0],    # Model 2 beat model 0 5 times, model 1 15 times
])

scores = bradley_terry_ranking(wins)
ranked = sorted(enumerate(scores), key=lambda x: -x[1])
for rank, (model_idx, score) in enumerate(ranked):
    print(f"Rank {rank+1}: Model {model_idx} (score: {score:.3f})")
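
The fitted scores are easier to interpret once converted back into head-to-head win probabilities. Under the Bradley-Terry model, the probability that model i beats model j depends only on the difference of their log-strengths. The helper below is a minimal sketch; the name `win_probability` is hypothetical.

```python
import math

def win_probability(log_strength_i: float, log_strength_j: float) -> float:
    """Bradley-Terry probability that model i beats model j.

    P(i beats j) = exp(s_i) / (exp(s_i) + exp(s_j)) = 1 / (1 + exp(s_j - s_i))
    """
    return 1.0 / (1.0 + math.exp(log_strength_j - log_strength_i))

# Equal strengths give an even matchup
even = win_probability(0.0, 0.0)
# A model three times as strong (log-strength ln 3 vs 0) is expected to win 3 out of 4
favored = win_probability(math.log(3), 0.0)
```

Passing two entries of the `scores` list returned by `bradley_terry_ranking` gives the model's predicted win rate for that pair, which can be compared against the observed counts as a sanity check on the fit.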

Chatbot Arena (LMSYS), a widely cited public LLM leaderboard built on pairwise human votes, has used a Bradley-Terry model of this kind to compute its rankings.