LLM-as-Judge: Using AI to Evaluate AI
Use a stronger LLM to evaluate the quality of another model's outputs. Design effective judge prompts, score on multiple dimensions, and understand the limitations.
Why LLM-as-Judge?
Traditional metrics (BLEU, ROUGE, BERTScore) measure lexical or embedding similarity to reference answers. LLM-as-judge uses a capable model (GPT-4o, Claude) to evaluate outputs more like a human expert would, on dimensions such as accuracy, completeness, clarity, and appropriateness.
This lets you approximate expert-style evaluation across thousands of examples, at a fraction of the cost of manual review.
Single-Criterion Judge
The simplest form scores one response on one criterion:
from openai import OpenAI
import json

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def score_response(
    question: str,
    response: str,
    criterion: str,
    criterion_description: str,
) -> dict:
    """Score a single response on a single criterion (1-5)."""
    prompt = f"""You are evaluating a clinical pharmacology assistant.

Question: {question}

Response being evaluated:
{response}

Criterion: {criterion}
Description: {criterion_description}

Score from 1 to 5:
- 5: Excellent, fully satisfies this criterion
- 4: Good, mostly satisfies with minor gaps
- 3: Adequate, partially satisfies, notable gaps
- 2: Poor, mostly fails this criterion
- 1: Unacceptable, completely fails this criterion

Return JSON only:
{{"score": <1-5>, "reasoning": "one sentence explaining the score"}}"""
    response_obj = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        # JSON mode: the prompt itself must mention JSON, as it does above
        response_format={"type": "json_object"},
        temperature=0.1,
    )
    return json.loads(response_obj.choices[0].message.content)


# Usage
result = score_response(
    question="What is the drug interaction between warfarin and ibuprofen?",
    response="Warfarin and ibuprofen can interact; both increase bleeding risk. Use with caution.",
    criterion="clinical_completeness",
    criterion_description="Does the response include the mechanism, clinical significance, and management recommendation?",
)
print(f"Score: {result['score']}/5. {result['reasoning']}")

Multi-Criteria Judge
Evaluate multiple dimensions in one call:
CLINICAL_CRITERIA = {
    "factual_accuracy": "Is every factual claim in the response medically correct? Are there any errors?",
    "clinical_completeness": "Does the response cover mechanism, clinical significance, management, and monitoring as appropriate?",
    "appropriate_tone": "Is the tone professional, evidence-based, and suitable for a clinical audience?",
    "actionability": "Does the response give the clinician clear, actionable guidance?",
    "safety": "Does the response appropriately flag serious risks and avoid potentially harmful advice?",
}


def judge_response_multi_criteria(
    question: str,
    response: str,
    criteria: dict[str, str],
) -> dict:
    """Score a response on multiple criteria simultaneously."""
    criteria_text = "\n".join(
        f"- {name}: {desc}"
        for name, desc in criteria.items()
    )
    # Build a JSON template so the judge returns one key per criterion.
    # "<1-5>" is a placeholder the judge should replace with an integer.
    example_output = {name: "<1-5>" for name in criteria}
    example_output["overall"] = "<1-5>"
    example_output["strengths"] = "..."
    example_output["weaknesses"] = "..."
    prompt = f"""You are an expert clinical pharmacology evaluator.

Question asked: {question}

Response to evaluate:
{response}

Score this response on each criterion (1-5 where 5=excellent):
{criteria_text}

Return JSON only:
{json.dumps(example_output, indent=2)}"""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0.1,
    )
    return json.loads(resp.choices[0].message.content)


# Evaluate a batch
def batch_evaluate(
    test_cases: list[dict],
    criteria: dict[str, str],
) -> list[dict]:
    results = []
    for case in test_cases:
        judgment = judge_response_multi_criteria(
            question=case["question"],
            response=case["model_response"],
            criteria=criteria,
        )
        results.append({
            "question": case["question"][:60],  # truncated for display
            **judgment,
        })
    return results

Reference-Graded Evaluation
When you have a reference answer, include it for grounding:
def grade_against_reference(
    question: str,
    candidate: str,
    reference: str,
) -> dict:
    """Grade candidate response relative to a reference answer."""
    prompt = f"""You are grading a clinical pharmacology AI assistant's response.

Question: {question}

Reference answer (expert-written, correct):
{reference}

Candidate response (to be graded):
{candidate}

Evaluate the candidate relative to the reference:
1. Is the candidate factually consistent with the reference?
2. Does the candidate cover the key points from the reference?
3. Does the candidate add any incorrect information?

Return JSON:
{{
  "factual_consistency": <1-5>,
  "coverage": <1-5>,
  "hallucinations": <0 = none, 1 = minor, 2 = major>,
  "overall_grade": <1-5>,
  "key_differences": "what the candidate missed or got wrong"
}}"""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0.1,
    )
    return json.loads(resp.choices[0].message.content)

Comparative Evaluation (A/B)
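Ask the judge which of two responses is better. Pairwise preferences tend to be more consistent than absolute scores, although they are sensitive to the order in which the responses are shown: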
def compare_responses(
    question: str,
    response_a: str,
    response_b: str,
) -> dict:
    """Determine which response is better and by how much."""
    prompt = f"""Compare two responses to a clinical pharmacology question.

Question: {question}

Response A:
{response_a}

Response B:
{response_b}

Which response is better for a clinical audience? Consider: accuracy, completeness, actionability, and safety.

Return JSON:
{{
  "winner": "A" or "B" or "tie",
  "confidence": "clear" or "slight" or "marginal",
  "reasoning": "2-3 sentences explaining the choice",
  "criteria_comparison": {{
    "accuracy": "A better / B better / equal",
    "completeness": "A better / B better / equal",
    "actionability": "A better / B better / equal"
  }}
}}"""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0.1,
    )
    return json.loads(resp.choices[0].message.content)


# Use A/B for fine-tuned vs base model comparison
def evaluate_ab_improvement(
    test_cases: list[dict],
    base_responses: list[str],
    ft_responses: list[str],
) -> dict:
    wins = {"A_base": 0, "B_ft": 0, "tie": 0}
    # Note: the base model is always shown as A here, so position bias
    # (see Limitations) can skew the win rate.
    for case, base, ft in zip(test_cases, base_responses, ft_responses):
        result = compare_responses(case["question"], response_a=base, response_b=ft)
        winner = result["winner"]
        if winner == "A":
            wins["A_base"] += 1
        elif winner == "B":
            wins["B_ft"] += 1
        else:
            wins["tie"] += 1
    total = sum(wins.values())  # comparisons actually run
    return {
        "base_wins": wins["A_base"],
        "ft_wins": wins["B_ft"],
        "ties": wins["tie"],
        "ft_win_rate": wins["B_ft"] / total if total else 0.0,  # avoid dividing by zero
        "total": total,
    }

Limitations of LLM-as-Judge
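Position bias: LLM judges, GPT-4o included, often favor a particular position in pairwise comparisons, frequently the response shown first (A). Mitigate by running each comparison in both orders (or randomizing the order) and aggregating the verdicts.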
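One way to apply this mitigation is sketched below. It is a minimal illustration, not a definitive implementation: `debiased_compare` and the `Judge` type are hypothetical names, and the judge is passed in as a plain callable that returns "A", "B", or "tie" (for example, a thin wrapper that returns `compare_responses(...)["winner"]`). A response only wins if it is preferred in both orderings.

```python
from typing import Callable

# Hypothetical judge signature: (question, response_a, response_b) -> "A" | "B" | "tie"
Judge = Callable[[str, str, str], str]


def debiased_compare(question: str, first: str, second: str, judge: Judge) -> str:
    """Judge both orderings and return "first", "second", or "tie"."""
    forward = judge(question, first, second)   # `first` is shown as A
    backward = judge(question, second, first)  # `first` is shown as B

    # Translate position labels back to the underlying responses.
    forward_pick = {"A": "first", "B": "second"}.get(forward, "tie")
    backward_pick = {"A": "second", "B": "first"}.get(backward, "tie")

    # Count a win only when both orderings agree; otherwise call it a tie.
    return forward_pick if forward_pick == backward_pick else "tie"
```

Treating disagreements as ties is a conservative choice. It doubles the number of judge calls, so the cost of an A/B run doubles as well.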
Verbosity bias: Judges tend to prefer longer, more detailed responses even when a shorter answer is better. Instruct the judge explicitly not to reward length for its own sake.
Self-enhancement bias: A judge may rate outputs from its own model family higher, so GPT-4o may favor GPT-4o-generated responses. When evaluating GPT-4o outputs, use Claude or another model family as the judge.
Hallucination detection: LLM judges struggle to detect subtle factual errors in specialized domains. Supplement with domain expert review.
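Cost: Judging 1,000 examples with GPT-4o at about 500 tokens per evaluation (roughly 500,000 tokens in total) costs roughly $1.50–$3.00, depending on the split between input and output tokens and on current pricing. Budget accordingly for large evaluation sets.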
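The arithmetic behind an estimate like this can be sketched as follows. `estimate_judge_cost` is a hypothetical helper, and the prices in the usage example are illustrative assumptions only; check the provider's current pricing before budgeting.

```python
def estimate_judge_cost(
    n_examples: int,
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Estimated cost in dollars; prices are given per million tokens."""
    total_input = n_examples * input_tokens
    total_output = n_examples * output_tokens
    return (total_input * input_price_per_m + total_output * output_price_per_m) / 1_000_000


# Assumed split: 400 prompt tokens + 100 completion tokens per evaluation.
# Assumed prices: $2.50 per million input tokens, $10.00 per million output tokens.
cost = estimate_judge_cost(1_000, 400, 100, 2.50, 10.00)
print(f"Estimated cost: ${cost:.2f}")
```

Under these assumptions the input and output tokens each contribute $1.00, for a total of $2.00. Because output tokens are typically priced higher, asking the judge for long reasoning raises the cost faster than lengthening the prompt does.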
For mission-critical evaluation (medical, legal), always include a human review sample alongside LLM-as-judge.
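A human review sample can be drawn and compared against the judge with a few lines of code. The sketch below is one possible approach under stated assumptions: `sample_for_review` and `agreement_rate` are hypothetical helpers, both judge and human scores are on the same 1-5 scale, and a difference of at most one point counts as agreement.

```python
import random


def sample_for_review(results: list[dict], k: int, seed: int = 0) -> list[dict]:
    """Draw a reproducible random sample of judged cases for human review."""
    rng = random.Random(seed)  # fixed seed so the same sample can be re-drawn
    return rng.sample(results, min(k, len(results)))


def agreement_rate(
    judge_scores: list[int],
    human_scores: list[int],
    tolerance: int = 1,
) -> float:
    """Fraction of cases where judge and human scores differ by <= tolerance."""
    if len(judge_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")
    if not judge_scores:
        raise ValueError("need at least one scored case")
    hits = sum(abs(j - h) <= tolerance for j, h in zip(judge_scores, human_scores))
    return hits / len(judge_scores)
```

If agreement on the sample is low, treat the judge's scores on the rest of the set with caution and revisit the judge prompt or criteria before relying on them.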