
LLM-as-Judge

Using a capable LLM to evaluate other LLM outputs — single-answer grading, pairwise comparison, the MT-Bench framework, and reliability considerations.

Asma Hafeez Khan · May 16, 2026 · 4 min read
Tags: LLMs, Evaluation, LLM-as-Judge, MT-Bench, Interview

Why LLM-as-Judge

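Traditional metrics such as BLEU and ROUGE measure n-gram overlap with a reference text, so they work poorly for open-ended generation, where many differently worded answers can be equally good. Human evaluation is accurate but expensive and slow. LLM-as-judge uses a strong model (typically GPT-4 or Claude) to evaluate outputs instead: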

Human evaluation:
  Gold standard, captures nuance
  Cost: $0.50-5.00 per evaluation
  Time: hours to days for a benchmark
  Reliability: ~80-90% agreement between annotators

LLM-as-judge:
  Cost: $0.01-0.10 per evaluation (API call)
  Time: minutes
  Reliability: 80-85% agreement with human preferences
  Scalable to millions of evaluations
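
The "agreement with human preferences" figure above is simply the share of items on which the judge and a human annotator reach the same verdict. A minimal sketch of that calculation follows; the function name `agreement_rate` and the sample labels are illustrative assumptions, not data from any benchmark.

```python
from typing import Sequence


def agreement_rate(judge_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Fraction of items where the judge's verdict matches the human verdict."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not judge_labels:
        raise ValueError("need at least one labelled item")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


# Hypothetical verdicts on five pairwise comparisons ("A", "B", or "TIE").
human = ["A", "B", "A", "TIE", "B"]
judge = ["A", "B", "B", "TIE", "B"]
print(agreement_rate(judge, human))  # 4 of 5 verdicts match -> 0.8
```

Raw agreement does not correct for chance, so it is best read alongside the human-to-human agreement on the same items.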

Single-Answer Grading

Ask the judge model to score a single response on a scale (e.g., 1-10):

Python
import re

def grade_single_answer(question: str, answer: str, judge_client) -> dict:
    """Ask the judge model for a 1-10 rating and a short explanation."""
    prompt = f"""You are an expert evaluator. Rate the following answer on a scale
of 1-10 for quality, accuracy, and completeness.

Question: {question}

Answer: {answer}

Provide a rating from 1-10 and a brief explanation.
Format: Rating: [score]\nExplanation: [reason]"""

    # judge_client is assumed to be an Anthropic SDK client
    response = judge_client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}]
    )

    text = response.content[0].text
    # Parse the score from "Rating: 8\nExplanation: ..."
    match = re.search(r"Rating:\s*(\d+)", text)
    score = int(match.group(1)) if match else None
    # Treat out-of-range values as a parse failure, not as a valid score
    if score is not None and not 1 <= score <= 10:
        score = None
    # "explanation" holds the judge's full reply, including the rating line
    return {"score": score, "explanation": text}
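
Because the grader returns `None` when the rating cannot be parsed, a batch of results needs to separate usable scores from failed ones before averaging. The helper below is a small sketch of that step; `summarise_scores` and the sample values are hypothetical and not part of any library.

```python
from typing import Iterable, Optional


def summarise_scores(scores: Iterable[Optional[int]]) -> dict:
    """Average the parsed scores and count the replies that failed to parse."""
    scores = list(scores)
    parsed = [s for s in scores if s is not None]
    return {
        "mean": sum(parsed) / len(parsed) if parsed else None,
        "graded": len(parsed),
        "unparsed": len(scores) - len(parsed),
    }


# Hypothetical "score" values collected from four grading calls.
summary = summarise_scores([8, None, 7, 9])
print(summary)
```

Reporting the unparsed count alongside the mean makes it visible when a prompt change starts breaking the output format.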

Pairwise Comparison

Pairwise comparison is often more reliable than absolute scoring: instead of assigning a number, the judge only has to say which of two responses is better.

Python
import re

def pairwise_compare(question: str, response_a: str, response_b: str, judge_client) -> str:
    """Return "A", "B", or "TIE" according to the judge model."""
    prompt = f"""Compare two responses to the following question.
Determine which is better (or if they are tied).

Question: {question}

Response A:
{response_a}

Response B:
{response_b}

Which response is better? Reply with exactly one of: A, B, or TIE
Then explain your reasoning in 1-2 sentences."""

    # judge_client is assumed to be an Anthropic SDK client
    response = judge_client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=150,
        messages=[{"role": "user", "content": prompt}]
    )
    text = response.content[0].text.strip()

    # Match the verdict as a whole leading token, so replies that start with
    # "Although ..." or "Both ..." are not misread as "A" or "B".
    match = re.match(r"(A|B|TIE)\b", text)
    # Replies that cannot be parsed are treated as a tie.
    return match.group(1) if match else "TIE"
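
Individual verdicts are usually rolled up into win rates across a set of questions. The sketch below assumes the verdicts are the strings "A", "B", and "TIE" returned by a pairwise judge; `win_rates` is an illustrative name.

```python
from collections import Counter
from typing import Iterable

VALID_VERDICTS = ("A", "B", "TIE")


def win_rates(verdicts: Iterable[str]) -> dict:
    """Share of comparisons won by A, won by B, and tied."""
    counts = Counter(verdicts)
    unknown = set(counts) - set(VALID_VERDICTS)
    if unknown:
        raise ValueError(f"unexpected verdicts: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no verdicts to summarise")
    return {label: counts[label] / total for label in VALID_VERDICTS}


# Hypothetical verdicts from four comparisons.
print(win_rates(["A", "A", "B", "TIE"]))
```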

MT-Bench

MT-Bench (Zheng et al., 2023) is a multi-turn benchmark evaluated by GPT-4:

Structure:
  80 multi-turn questions across 8 categories:
    Writing, Roleplay, Reasoning, Math, Coding,
    Extraction, STEM, Humanities

  Turn 1: initial question
  Turn 2: follow-up requiring consistency with Turn 1

Evaluation:
  GPT-4 scores each response 1-10
  Reference answers provided for math, reasoning, and coding questions
  Each turn gets its own score; the judge sees the first turn when grading the second

Results (approximate):
  GPT-4:            8.99
  Claude 3 Sonnet:  8.57 (estimated)
  LLaMA 2 70B chat: 6.86
  LLaMA 2 13B chat: 6.65
  Vicuna 13B:       6.57
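
A model's headline score is an average of the per-turn judge scores, and breaking it down by category and by turn shows where a model is weak. The following is a sketch of that aggregation over (category, turn, score) records; the record layout and the numbers are assumptions for illustration, not real MT-Bench output.

```python
from collections import defaultdict
from statistics import mean


def aggregate_scores(records):
    """Average judge scores overall, per category, and per turn.

    records: list of (category, turn, score) tuples.
    """
    by_category = defaultdict(list)
    by_turn = defaultdict(list)
    for category, turn, score in records:
        by_category[category].append(score)
        by_turn[turn].append(score)
    return {
        "overall": mean(score for _, _, score in records),
        "by_category": {c: mean(s) for c, s in by_category.items()},
        "by_turn": {t: mean(s) for t, s in by_turn.items()},
    }


# Hypothetical scores for two questions, two turns each.
records = [
    ("math", 1, 6), ("math", 2, 4),
    ("writing", 1, 9), ("writing", 2, 9),
]
result = aggregate_scores(records)
print(result["overall"], result["by_category"], result["by_turn"])
```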

Positional Bias

LLM judges have a known position bias in pairwise comparison: the verdict can change depending on which response is shown first.

Mitigation: judge both orderings and keep only consistent verdicts

Python
# judge(question, first, second) returns "A" (first wins), "B" (second wins), or "TIE"
score = judge(question, response_a, response_b)
score_swapped = judge(question, response_b, response_a)  # response_b is now shown as "A"

if score == "A" and score_swapped == "B":
    winner = "A"    # response_a preferred in both orderings
elif score == "B" and score_swapped == "A":
    winner = "B"    # response_b preferred in both orderings
else:
    # "A"/"A" or "B"/"B" means the same position won twice (positional bias),
    # so the two verdicts contradict each other; treat the pair as a tie.
    winner = "TIE"

This doubles the cost but significantly improves reliability.


Verbosity Bias

LLM judges also tend to prefer longer, more detailed responses regardless of accuracy:

Longer response: "There are several factors to consider. First, Warfarin
works by inhibiting Vitamin K epoxide reductase, which reduces the synthesis
of clotting factors II, VII, IX, and X..."

Shorter response: "Warfarin is an anticoagulant that blocks Vitamin K,
reducing clotting factor production."

The longer response may score higher even if the shorter one is more accurate.

Mitigations:
  Explicitly instruct the judge to prioritise accuracy over length
  Add criteria: "Do not reward verbosity. Prefer concise, accurate answers."
  Control for length: compare responses of similar length, or set a word limit for the answers being judged
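
One way to check whether these mitigations are working is to count how often the longer response wins among decided comparisons. This is a rough diagnostic sketch; `longer_wins_rate`, the word-count measure, and the sample records are assumptions for illustration.

```python
def longer_wins_rate(records) -> float:
    """Share of decided comparisons in which the longer response won.

    records: iterable of (response_a, response_b, verdict) tuples,
    where verdict is "A", "B", or "TIE". Length is measured in words.
    """
    decided = 0
    longer_won = 0
    for resp_a, resp_b, verdict in records:
        len_a, len_b = len(resp_a.split()), len(resp_b.split())
        # Skip ties and equal-length pairs: they say nothing about length preference.
        if verdict not in ("A", "B") or len_a == len_b:
            continue
        decided += 1
        winner_len, loser_len = (len_a, len_b) if verdict == "A" else (len_b, len_a)
        longer_won += winner_len > loser_len
    if decided == 0:
        raise ValueError("no decided comparisons with differing lengths")
    return longer_won / decided


# Hypothetical judged pairs.
records = [
    ("one two three", "one", "A"),       # longer response won
    ("one", "one two", "B"),             # longer response won
    ("one two three four", "one", "B"),  # shorter response won
    ("same", "size", "TIE"),             # ignored
]
rate = longer_wins_rate(records)
print(round(rate, 3))
```

A rate well above 0.5 is only a hint of verbosity bias, since longer answers can also be genuinely better; it is most informative when compared with human verdicts on the same pairs.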

Medical LLM Evaluation

For clinical applications, LLM-as-judge requires domain-specific rubrics:

Medical accuracy rubric:
  10: Clinically accurate, safe, complete
   8: Accurate with minor omissions
   6: Mostly accurate, one factual error
   4: Significant inaccuracies, potentially harmful
   2: Dangerous or fundamentally wrong
   1: Do not use

Additional dimensions:
  Harm avoidance: does the response avoid dangerous recommendations?
  Referral: does it recommend consulting a clinician?
  Uncertainty: does it acknowledge what it doesn't know?

Use a judge with clinical expertise: a fine-tuned medical model, or GPT-4 with
a detailed medical system prompt, combined with expert review of the verdicts.
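
The rubric and the additional dimensions can be written into the judge prompt so that every evaluation uses the same criteria. Below is a minimal sketch of such a prompt builder; the function name, the prompt wording, and the reply format are assumptions, and the rubric text is copied from the list above.

```python
MEDICAL_RUBRIC = {
    10: "Clinically accurate, safe, complete",
    8: "Accurate with minor omissions",
    6: "Mostly accurate, one factual error",
    4: "Significant inaccuracies, potentially harmful",
    2: "Dangerous or fundamentally wrong",
    1: "Do not use",
}

EXTRA_CHECKS = [
    "Does the response avoid dangerous recommendations?",
    "Does it recommend consulting a clinician?",
    "Does it acknowledge what it doesn't know?",
]


def build_medical_judge_prompt(question: str, answer: str) -> str:
    """Assemble a judge prompt that embeds the rubric and the extra checks."""
    rubric = "\n".join(
        f"{score:>2}: {text}"
        for score, text in sorted(MEDICAL_RUBRIC.items(), reverse=True)
    )
    checks = "\n".join(f"- {check}" for check in EXTRA_CHECKS)
    return (
        "You are evaluating a response to a clinical question.\n"
        "Score it using this rubric:\n"
        f"{rubric}\n\n"
        "Also answer yes or no to each check:\n"
        f"{checks}\n\n"
        f"Question: {question}\n\n"
        f"Answer: {answer}\n\n"
        "Reply as: Rating: [score], then one line per check, "
        "then a brief explanation."
    )


prompt = build_medical_judge_prompt("What is warfarin?", "An anticoagulant.")
print(prompt)
```

Keeping the rubric in one data structure means clinicians can review and revise the criteria without touching the evaluation code.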

Interview Answer

"LLM-as-judge uses a capable model (GPT-4, Claude) to evaluate LLM outputs — either grading single responses 1-10 or comparing pairs. It achieves 80-85% agreement with human preferences at a fraction of the cost. MT-Bench is a standard 80-question multi-turn benchmark with GPT-4 as the judge. Key biases: positional bias (preference for the first or second response — mitigate by swapping and aggregating), verbosity bias (preference for longer responses — mitigate with explicit instructions). For medical AI evaluation, extend the rubric with clinical accuracy, harm avoidance, and uncertainty acknowledgment dimensions."
