Learnixo

LLMs Deep Dive · Lesson 19 of 24

LLM-as-Judge: Using GPT-4 to Evaluate Outputs

Why LLM-as-Judge

Traditional overlap metrics (BLEU, ROUGE) correlate poorly with quality on open-ended generation, where many different wordings can be equally good. Human evaluation is accurate but expensive and slow. LLM-as-judge uses a powerful model (typically GPT-4 or Claude) to evaluate outputs:

Human evaluation:
  Gold standard, captures nuance
  Cost: $0.50-5.00 per evaluation
  Time: hours to days for a benchmark
  Reliability: ~80-90% agreement between annotators

LLM-as-judge:
  Cost: $0.01-0.10 per evaluation (API call)
  Time: minutes
  Reliability: 80-85% agreement with human preferences
  Scalable to millions of evaluations
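
To make the cost gap concrete, the per-evaluation ranges above can be multiplied out for a whole benchmark. This is a minimal sketch: the benchmark size of 10,000 evaluations is an illustrative assumption, not a figure from this lesson.

```python
# Per-evaluation cost ranges in USD, taken from the comparison above.
HUMAN_COST_RANGE = (0.50, 5.00)
LLM_JUDGE_COST_RANGE = (0.01, 0.10)


def total_cost_range(cost_range: tuple[float, float], n_evaluations: int) -> tuple[float, float]:
    """Return the (low, high) total cost for n_evaluations."""
    low, high = cost_range
    return (low * n_evaluations, high * n_evaluations)


# Hypothetical benchmark size, chosen only for illustration.
n = 10_000
human_low, human_high = total_cost_range(HUMAN_COST_RANGE, n)
llm_low, llm_high = total_cost_range(LLM_JUDGE_COST_RANGE, n)

print(f"Human:     ${human_low:,.0f} to ${human_high:,.0f}")
print(f"LLM judge: ${llm_low:,.0f} to ${llm_high:,.0f}")
```

At that assumed size, human evaluation would cost $5,000 to $50,000 against $100 to $1,000 for an LLM judge.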

Single-Answer Grading

Ask the judge model to score a single response on a scale (e.g., 1-10):

Python
import re


def grade_single_answer(question: str, answer: str, judge_client) -> dict:
    # judge_client is expected to expose the Anthropic Messages API
    prompt = f"""You are an expert evaluator. Rate the following answer on a scale
of 1-10 for quality, accuracy, and completeness.

Question: {question}

Answer: {answer}

Provide a rating from 1-10 and a brief explanation.
Format: Rating: [score]\nExplanation: [reason]"""

    response = judge_client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}]
    )

    text = response.content[0].text
    # Parse the score from "Rating: 8\nExplanation: ..."; also tolerate "Rating: [8]"
    match = re.search(r"Rating:\s*\[?(\d+)", text)
    score = int(match.group(1)) if match else None
    # "explanation" holds the judge's full reply; score is None if parsing failed
    return {"score": score, "explanation": text}
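
Judge output does not always follow the requested format, so it can help to validate the parsed score before using it. The sketch below is one possible approach, not part of the lesson's code: `parse_rating` and `mean_rating` are hypothetical helper names, and discarding out-of-range scores (rather than clamping them) is an assumption.

```python
import re
from typing import Optional

# Accepts "Rating: 8" as well as "Rating: [8]".
RATING_PATTERN = re.compile(r"Rating:\s*\[?(\d+)")


def parse_rating(text: str, low: int = 1, high: int = 10) -> Optional[int]:
    """Extract the score from judge output, or None if missing or out of range."""
    match = RATING_PATTERN.search(text)
    if match is None:
        return None
    score = int(match.group(1))
    return score if low <= score <= high else None


def mean_rating(judge_outputs: list[str]) -> Optional[float]:
    """Average the valid scores, ignoring outputs that failed to parse."""
    scores = [s for s in map(parse_rating, judge_outputs) if s is not None]
    return sum(scores) / len(scores) if scores else None
```

Averaging several judge calls for the same answer is a simple way to smooth out run-to-run variation, at the price of extra API calls.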

Pairwise Comparison

Pairwise comparison is generally more reliable than absolute scoring: instead of assigning a number, the judge is asked which of two responses is better.

Python
def pairwise_compare(question: str, response_a: str, response_b: str, judge_client) -> str:
    prompt = f"""Compare two responses to the following question.
Determine which is better (or if they are tied).

Question: {question}

Response A:
{response_a}

Response B:
{response_b}

Which response is better? Reply with exactly one of: A, B, or TIE
Then explain your reasoning in 1-2 sentences."""

    response = judge_client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=150,
        messages=[{"role": "user", "content": prompt}]
    )
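    text = response.content[0].text.strip()

    # Compare the whole first token, not just the first character, so a reply
    # such as "Both are weak..." is not misread as a vote for B.
    first_token = text.split(maxsplit=1)[0] if text else ""
    verdict = first_token.strip(".,:;*").upper()
    # Anything other than a clear A or B (including unparseable output) is a tie
    return verdict if verdict in {"A", "B"} else "TIE"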
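
A single pairwise verdict says little on its own; in practice the verdicts are tallied over a whole question set. The following sketch shows one way to do that. `summarize_verdicts` is a hypothetical helper, and counting a tie as half a win for each side is a common convention assumed here rather than something the lesson prescribes.

```python
from collections import Counter


def summarize_verdicts(verdicts: list[str]) -> dict:
    """Tally "A" / "B" / "TIE" verdicts and compute A's win rate.

    Ties count as half a win for each side (an assumption, see above).
    """
    counts = Counter(verdicts)
    total = len(verdicts)
    if total == 0:
        return {"A": 0, "B": 0, "TIE": 0, "win_rate_a": None}
    win_rate_a = (counts["A"] + 0.5 * counts["TIE"]) / total
    return {
        "A": counts["A"],
        "B": counts["B"],
        "TIE": counts["TIE"],
        "win_rate_a": win_rate_a,
    }


# Hypothetical verdicts from five questions, for illustration only.
summary = summarize_verdicts(["A", "A", "B", "TIE", "A"])
print(summary)
```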

MT-Bench

MT-Bench (Zheng et al., 2023) is a multi-turn benchmark evaluated by GPT-4:

Structure:
  80 multi-turn questions across 8 categories:
    Writing, Roleplay, Reasoning, Math, Coding,
    Extraction, STEM, Humanities

  Turn 1: initial question
  Turn 2: follow-up requiring consistency with Turn 1

Evaluation:
  GPT-4 scores each response 1-10
  Reference answers provided for math/coding categories
  Each turn receives its own score; the judge sees the full conversation when grading Turn 2

Results (approximate):
  GPT-4:            8.99
  Claude 3 Sonnet:  8.57 (estimated)
  LLaMA 2 70B chat: 6.86
  LLaMA 2 13B chat: 6.65
  Vicuna 13B:       6.57
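
A model's headline score in a setup like this is an average of per-response judge scores. The sketch below shows that aggregation along with per-turn and per-category breakdowns. The `judgments` data is made up for illustration and is not real MT-Bench output, and `aggregate_scores` is a hypothetical helper rather than part of any official tooling.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical judge scores as (category, turn, score). Not real MT-Bench data.
judgments = [
    ("Writing", 1, 9), ("Writing", 2, 8),
    ("Math", 1, 6), ("Math", 2, 4),
    ("Coding", 1, 7), ("Coding", 2, 5),
]


def aggregate_scores(judgments):
    """Return the overall mean plus per-turn and per-category means."""
    by_turn = defaultdict(list)
    by_category = defaultdict(list)
    for category, turn, score in judgments:
        by_turn[turn].append(score)
        by_category[category].append(score)
    return {
        "overall": mean(score for _, _, score in judgments),
        "by_turn": {turn: mean(scores) for turn, scores in by_turn.items()},
        "by_category": {cat: mean(scores) for cat, scores in by_category.items()},
    }


report = aggregate_scores(judgments)
print(report)
```

Breaking the average down by category is often more informative than the single number, since a model can be strong at writing and weak at math.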

Positional Bias

LLM judges have a known bias toward the first or second response in pairwise comparison:

Mitigation: judge both orderings and accept only a verdict that survives the swap

Python
def debiased_compare(question, response_a, response_b, judge):
    # judge(question, first, second) returns "A" (first wins), "B" (second wins), or "TIE"
    verdict = judge(question, response_a, response_b)
    # In the swapped call the labels flip: "A" now refers to response_b
    verdict_swapped = judge(question, response_b, response_a)

    if verdict == "A" and verdict_swapped == "B":
        return "A"  # response_a won in both positions
    if verdict == "B" and verdict_swapped == "A":
        return "B"  # response_b won in both positions
    # The same label twice means the judge favoured a position, not a response;
    # every other combination is inconclusive as well
    return "TIE"

This doubles the cost but significantly improves reliability.
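
Before trusting a judge, it can be useful to measure how often its verdict actually survives the swap. The sketch below assumes you have already collected (original, swapped) verdict pairs; `position_consistency` is a hypothetical helper, and the sample pairs are invented for illustration.

```python
def position_consistency(pairs: list[tuple[str, str]]) -> dict:
    """Summarise (original, swapped) verdict pairs from a pairwise judge.

    Each verdict is "A", "B", or "TIE", labelled by position in that call,
    so a consistent judge returns opposite labels across the two orderings.
    """
    flipped = {"A": "B", "B": "A", "TIE": "TIE"}
    consistent = sum(1 for first, second in pairs if second == flipped[first])
    # Same label in both orderings: the judge picked a slot, not a response.
    first_slot_twice = sum(1 for first, second in pairs if first == second == "A")
    second_slot_twice = sum(1 for first, second in pairs if first == second == "B")
    total = len(pairs)
    return {
        "consistency_rate": consistent / total if total else None,
        "favoured_first_slot": first_slot_twice,
        "favoured_second_slot": second_slot_twice,
    }


# Hypothetical verdict pairs, for illustration only.
stats = position_consistency(
    [("A", "B"), ("A", "A"), ("B", "A"), ("B", "B"), ("TIE", "TIE")]
)
print(stats)
```

A lopsided count between the two "favoured slot" fields suggests which position the judge leans toward.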


Verbosity Bias

LLM judges also tend to prefer longer, more detailed responses regardless of accuracy:

Longer response: "There are several factors to consider. First, Warfarin
works by inhibiting Vitamin K epoxide reductase, which reduces the synthesis
of clotting factors II, VII, IX, and X..."

Shorter response: "Warfarin is an anticoagulant that blocks Vitamin K,
reducing clotting factor production."

The longer response may score higher even if the shorter one is more accurate.

Mitigations:
  Explicitly instruct the judge to prioritise accuracy over length
  Add criteria: "Do not reward verbosity. Prefer concise, accurate answers."
  Control for length: where possible, compare responses of similar length
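
One way to check whether a judge is affected is to measure how often the longer response wins. This is a rough diagnostic sketch, not a method from the lesson: `longer_response_win_rate` is a hypothetical helper, length is measured in whitespace-separated words, and a high rate is only a warning sign, since longer answers can also be better on merit.

```python
from typing import Optional


def longer_response_win_rate(comparisons: list[tuple[str, str, str]]) -> Optional[float]:
    """Fraction of decisive verdicts won by the longer response.

    Each item is (response_a, response_b, verdict) with verdict "A", "B", or "TIE".
    Ties and equal-length pairs are skipped.
    """
    longer_wins = 0
    decisive = 0
    for response_a, response_b, verdict in comparisons:
        len_a, len_b = len(response_a.split()), len(response_b.split())
        if verdict not in ("A", "B") or len_a == len_b:
            continue
        decisive += 1
        longer = "A" if len_a > len_b else "B"
        if verdict == longer:
            longer_wins += 1
    return longer_wins / decisive if decisive else None
```

Comparing this rate before and after adding the anti-verbosity instruction to the judge prompt gives a quick sense of whether the instruction is having any effect.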

Medical LLM Evaluation

For clinical applications, LLM-as-judge requires domain-specific rubrics:

Medical accuracy rubric:
  10: Clinically accurate, safe, complete
   8: Accurate with minor omissions
   6: Mostly accurate, one factual error
   4: Significant inaccuracies, potentially harmful
   2: Dangerous or fundamentally wrong
   1: Do not use

Additional dimensions:
  Harm avoidance: does the response avoid dangerous recommendations?
  Referral: does it recommend consulting a clinician?
  Uncertainty: does it acknowledge what it doesn't know?

Use a clinically informed judge: a fine-tuned medical model, or GPT-4 with a
detailed medical system prompt, backed by review from a clinical expert.
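
The rubric and extra checks above can be embedded directly in the judge prompt. The sketch below shows one possible way to assemble it; `build_medical_judge_prompt` is a hypothetical helper, and the prompt wording around the rubric is an assumption, not a validated clinical evaluation prompt.

```python
# Rubric levels copied from the lesson text above.
MEDICAL_RUBRIC = {
    10: "Clinically accurate, safe, complete",
    8: "Accurate with minor omissions",
    6: "Mostly accurate, one factual error",
    4: "Significant inaccuracies, potentially harmful",
    2: "Dangerous or fundamentally wrong",
    1: "Do not use",
}

# Additional dimensions from the lesson, phrased as yes/no checks.
EXTRA_CHECKS = [
    "Does the response avoid dangerous recommendations?",
    "Does it recommend consulting a clinician?",
    "Does it acknowledge what it doesn't know?",
]


def build_medical_judge_prompt(question: str, answer: str) -> str:
    """Assemble a judge prompt that embeds the rubric and the extra checks."""
    rubric_lines = "\n".join(
        f"  {score}: {description}" for score, description in MEDICAL_RUBRIC.items()
    )
    check_lines = "\n".join(f"  - {check}" for check in EXTRA_CHECKS)
    return (
        "You are evaluating an answer to a clinical question.\n\n"
        f"Scoring rubric:\n{rubric_lines}\n\n"
        f"Also answer yes or no to each check:\n{check_lines}\n\n"
        f"Question: {question}\n\n"
        f"Answer: {answer}\n\n"
        "Format: Rating: [score]\nExplanation: [reason]"
    )
```

Keeping the rubric in a data structure rather than hard-coding it in the prompt string makes it easier for a clinical reviewer to audit and revise the criteria.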

Interview Answer

"LLM-as-judge uses a capable model (GPT-4, Claude) to evaluate LLM outputs — either grading single responses 1-10 or comparing pairs. It achieves 80-85% agreement with human preferences at a fraction of the cost. MT-Bench is a standard 80-question multi-turn benchmark with GPT-4 as the judge. Key biases: positional bias (preference for the first or second response — mitigate by swapping and aggregating), verbosity bias (preference for longer responses — mitigate with explicit instructions). For medical AI evaluation, extend the rubric with clinical accuracy, harm avoidance, and uncertainty acknowledgment dimensions."