LLMs Deep Dive · Lesson 19 of 24
LLM-as-Judge: Using GPT-4 to Evaluate Outputs
Why LLM-as-Judge
Traditional metrics (BLEU, ROUGE) fail for open-ended generation. Human evaluation is accurate but expensive and slow. LLM-as-judge uses a powerful model (typically GPT-4 or Claude) to evaluate outputs:
Human evaluation:
Gold standard, captures nuance
Cost: $0.50-5.00 per evaluation
Time: hours to days for a benchmark
Reliability: ~80-90% agreement between annotators
LLM-as-judge:
Cost: $0.01-0.10 per evaluation (API call)
Time: minutes
Reliability: 80-85% agreement with human preferences
Scalable to millions of evaluations
Single-Answer Grading
Ask the judge model to score a single response on a scale (e.g., 1-10):
import re

def grade_single_answer(question: str, answer: str, judge_client) -> dict:
    """Ask the judge model for a 1-10 rating and parse it from the reply."""
    prompt = f"""You are an expert evaluator. Rate the following answer on a scale
of 1-10 for quality, accuracy, and completeness.
Question: {question}
Answer: {answer}
Provide a rating from 1-10 and a brief explanation.
Format: Rating: [score]\nExplanation: [reason]"""
    # judge_client is assumed to be an Anthropic SDK client (anthropic.Anthropic()).
    response = judge_client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}],
    )
    text = response.content[0].text
    # Parse score from "Rating: 8\nExplanation: ..."
    match = re.search(r"Rating:\s*(\d+)", text)
    score = int(match.group(1)) if match else None  # None if the reply is malformed
    return {"score": score, "explanation": text}
Pairwise Comparison
More reliable than absolute scoring — ask which of two responses is better:
import re

def pairwise_compare(question: str, response_a: str, response_b: str, judge_client) -> str:
    """Return "A", "B", or "TIE" according to the judge's verdict."""
    prompt = f"""Compare two responses to the following question.
Determine which is better (or if they are tied).
Question: {question}
Response A:
{response_a}
Response B:
{response_b}
Which response is better? Reply with exactly one of: A, B, or TIE
Then explain your reasoning in 1-2 sentences."""
    response = judge_client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=150,
        messages=[{"role": "user", "content": prompt}],
    )
    text = response.content[0].text.strip()
    # Match the verdict as a whole word so that replies starting with
    # "Both..." or "Although..." are not misread as "B" or "A".
    match = re.match(r"(A|B|TIE)\b", text)
    return match.group(1) if match else "TIE"  # unparseable replies count as a tie
MT-Bench
MT-Bench (Zheng et al., 2023) is a multi-turn benchmark evaluated by GPT-4:
Structure:
80 multi-turn questions across 8 categories:
Writing, Roleplay, Reasoning, Math, Coding,
Extraction, STEM, Humanities
Turn 1: initial question
Turn 2: follow-up requiring consistency with Turn 1
Evaluation:
GPT-4 scores each response 1-10
Reference answers provided to the judge for math, reasoning, and coding questions
Both turns scored independently
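As a rough illustration of how per-turn judge scores can roll up into a single benchmark number, the sketch below averages 1-10 scores overall, per turn, and per category. The record layout (`category`, `turn`, `score`), the function name `summarize_scores`, and the sample values are hypothetical; they are not actual MT-Bench data or tooling.

```python
from collections import defaultdict
from statistics import mean

def summarize_scores(records):
    """Average judge scores overall, per turn, and per category."""
    by_turn = defaultdict(list)
    by_category = defaultdict(list)
    for r in records:
        by_turn[r["turn"]].append(r["score"])
        by_category[r["category"]].append(r["score"])
    return {
        "overall": mean(r["score"] for r in records),
        "per_turn": {t: mean(s) for t, s in sorted(by_turn.items())},
        "per_category": {c: mean(s) for c, s in sorted(by_category.items())},
    }

# Hypothetical judge scores for two questions with two turns each.
sample = [
    {"category": "math", "turn": 1, "score": 8},
    {"category": "math", "turn": 2, "score": 6},
    {"category": "writing", "turn": 1, "score": 9},
    {"category": "writing", "turn": 2, "score": 9},
]
summary = summarize_scores(sample)
```

Reporting the per-turn averages alongside the overall mean makes it easier to see whether a model degrades on follow-up questions.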
Results (approximate):
GPT-4: 8.99
Claude 3 Sonnet: 8.57 (estimated)
LLaMA 2 70B chat: 6.86
LLaMA 2 13B chat: 6.65
Vicuna 13B: 6.57
Positional Bias
LLM judges show a known positional bias in pairwise comparison: they tend to favour a response because of its position (first or second) rather than its content.
Mitigation: run the comparison in both orders and accept a winner only when the two verdicts agree:
verdict = judge(question, response_a, response_b)
verdict_swapped = judge(question, response_b, response_a)  # labels now refer to the swapped slots
if verdict == "A" and verdict_swapped == "B":
    winner = "A"    # consistent: response_a preferred in both orders
elif verdict == "B" and verdict_swapped == "A":
    winner = "B"    # consistent: response_b preferred in both orders
else:
    winner = "TIE"  # same slot preferred both times (positional bias), or an explicit tie
This doubles the cost but significantly improves reliability.
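The order-swapping idea can be sketched as a small runnable helper. This is a minimal sketch, assuming the judge is any callable that takes `(question, first, second)` and returns "A", "B", or "TIE" for the slot it prefers; `debiased_compare` and the two stand-in judges are hypothetical names used only for illustration.

```python
from typing import Callable

Verdict = str  # "A", "B", or "TIE", naming the slot the judge preferred

def debiased_compare(
    judge: Callable[[str, str, str], Verdict],
    question: str,
    response_a: str,
    response_b: str,
) -> Verdict:
    """Run the judge in both orders; accept a winner only if both runs agree."""
    first = judge(question, response_a, response_b)
    swapped = judge(question, response_b, response_a)
    # Map the swapped verdict back onto the original labels.
    swapped_mapped = {"A": "B", "B": "A"}.get(swapped, "TIE")
    if first == swapped_mapped and first in ("A", "B"):
        return first
    return "TIE"  # the runs disagree, or at least one was an explicit tie

# Stand-in judges, for illustration only (no API calls).
def always_first(question, first, second):
    return "A"  # pure positional bias: always prefers the first slot

def prefers_longer(question, first, second):
    return "A" if len(first) > len(second) else "B"

biased_result = debiased_compare(always_first, "q", "x", "y")
consistent_result = debiased_compare(prefers_longer, "q", "long answer", "short")
```

A judge that always picks the first slot ends up with a tie, while a judge whose preference follows the content returns the same winner in both orders. Treating a mixed result (one win, one tie) as a tie is a conservative design choice; other aggregation rules are possible.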
Verbosity Bias
LLM judges also tend to prefer longer, more detailed responses regardless of accuracy:
Longer response: "There are several factors to consider. First, Warfarin
works by inhibiting Vitamin K epoxide reductase, which reduces the synthesis
of clotting factors II, VII, IX, and X..."
Shorter response: "Warfarin is an anticoagulant that blocks Vitamin K,
reducing clotting factor production."
The longer response may score higher even if the shorter one is more accurate.
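One way to check whether a judge shows this bias on your own data is to measure how often the longer response wins. The sketch below assumes a hypothetical list of `(response_a, response_b, verdict)` tuples collected from earlier pairwise runs; `longer_win_rate` and the sample data are illustrative only. A rate well above 50% on pairs of comparable quality is a warning sign, not proof of bias.

```python
def longer_win_rate(results):
    """Fraction of decisive comparisons won by the longer response.

    results: iterable of (response_a, response_b, verdict) tuples, where
    verdict is "A", "B", or "TIE". Ties and equal-length pairs are skipped.
    Returns None when there are no decisive comparisons.
    """
    longer_wins = 0
    decisive = 0
    for response_a, response_b, verdict in results:
        if verdict not in ("A", "B") or len(response_a) == len(response_b):
            continue
        decisive += 1
        longer = "A" if len(response_a) > len(response_b) else "B"
        if verdict == longer:
            longer_wins += 1
    return longer_wins / decisive if decisive else None

# Hypothetical verdicts, for illustration only.
sample = [
    ("a long and detailed answer", "short", "A"),
    ("short", "a long and detailed answer", "B"),
    ("brief but right", "a much longer but wrong answer", "A"),
    ("same", "size", "TIE"),
]
rate = longer_win_rate(sample)
```

Character length is used here as a crude proxy; token counts would serve the same purpose.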
Mitigations:
Explicitly instruct the judge to prioritise accuracy over length
Add criteria: "Do not reward verbosity. Prefer concise, accurate answers."
Normalise by length in your prompt
Medical LLM Evaluation
For clinical applications, LLM-as-judge requires domain-specific rubrics:
Medical accuracy rubric:
10: Clinically accurate, safe, complete
8: Accurate with minor omissions
6: Mostly accurate, one factual error
4: Significant inaccuracies, potentially harmful
2: Dangerous or fundamentally wrong
1: Do not use
Additional dimensions:
Harm avoidance: does the response avoid dangerous recommendations?
Citation: does it recommend consulting a clinician?
Uncertainty: does it acknowledge what it doesn't know?
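The rubric and dimensions above could be embedded in a judge prompt and the reply parsed per dimension. The following is a minimal sketch, not a validated clinical tool: the function names, the `Dimension: score` reply format, and the choice of three dimensions (accuracy, harm avoidance, uncertainty) are assumptions made for illustration, and no API call is made.

```python
import re

# Rubric text taken from the lesson; the dict layout is an assumption.
RUBRIC = {
    10: "Clinically accurate, safe, complete",
    8: "Accurate with minor omissions",
    6: "Mostly accurate, one factual error",
    4: "Significant inaccuracies, potentially harmful",
    2: "Dangerous or fundamentally wrong",
    1: "Do not use",
}
DIMENSIONS = ["Accuracy", "Harm avoidance", "Uncertainty"]

def build_medical_prompt(question: str, answer: str) -> str:
    """Assemble a judge prompt that embeds the rubric and the reply format."""
    rubric_lines = "\n".join(f"{score}: {desc}" for score, desc in RUBRIC.items())
    format_lines = "\n".join(f"{d}: [score 1-10]" for d in DIMENSIONS)
    return (
        "You are a clinical expert evaluating a medical answer.\n"
        f"Scoring rubric:\n{rubric_lines}\n\n"
        f"Question: {question}\nAnswer: {answer}\n\n"
        f"Reply with one line per dimension:\n{format_lines}"
    )

def parse_dimension_scores(text: str) -> dict:
    """Extract 'Dimension: N' lines; missing or out-of-range scores become None."""
    scores = {}
    for dim in DIMENSIONS:
        pattern = rf"^{re.escape(dim)}:\s*(\d+)\b"
        match = re.search(pattern, text, re.MULTILINE | re.IGNORECASE)
        value = int(match.group(1)) if match else None
        scores[dim] = value if value is not None and 1 <= value <= 10 else None
    return scores

# Hypothetical judge reply, for illustration only.
reply = "Accuracy: 8\nHarm avoidance: 10\nUncertainty: 12"
parsed = parse_dimension_scores(reply)
```

Returning `None` for a missing or out-of-range score, rather than guessing, lets a downstream step route that evaluation to expert review.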
Use a clinical expert as the judge (fine-tuned medical model or
GPT-4 with detailed medical system prompt + expert review).
Interview Answer
"LLM-as-judge uses a capable model (GPT-4, Claude) to evaluate LLM outputs — either grading single responses 1-10 or comparing pairs. It achieves 80-85% agreement with human preferences at a fraction of the cost. MT-Bench is a standard 80-question multi-turn benchmark with GPT-4 as the judge. Key biases: positional bias (preference for the first or second response — mitigate by swapping and aggregating), verbosity bias (preference for longer responses — mitigate with explicit instructions). For medical AI evaluation, extend the rubric with clinical accuracy, harm avoidance, and uncertainty acknowledgment dimensions."