LLM Evaluation Q&A · Lesson 9 of 16

LLM-as-Judge: Using GPT-4 to Evaluate Outputs

Why LLM-as-Judge?

Traditional metrics score an output by its similarity to a reference answer: BLEU and ROUGE count n-gram overlap, and BERTScore compares contextual embeddings. LLM-as-judge instead uses a capable model (GPT-4o, Claude) to evaluate outputs the way a human expert would, on dimensions such as accuracy, completeness, clarity, and appropriateness.

This makes it practical to apply something close to expert-level evaluation across thousands of examples.


Single-Criterion Judge

The simplest form scores one response on one criterion:

Python
from openai import OpenAI
import json

client = OpenAI()

def score_response(
    question: str,
    response: str,
    criterion: str,
    criterion_description: str,
) -> dict:
    """Score a single response on a single criterion (1-5)."""

    prompt = f"""You are evaluating a clinical pharmacology assistant.

Question: {question}

Response being evaluated:
{response}

Criterion: {criterion}
Description: {criterion_description}

Score from 1 to 5:
- 5: Excellent — fully satisfies this criterion
- 4: Good — mostly satisfies with minor gaps
- 3: Adequate — partially satisfies, notable gaps
- 2: Poor — mostly fails this criterion
- 1: Unacceptable — completely fails this criterion

Return JSON only:
{{"score": <1-5>, "reasoning": "one sentence explaining the score"}}"""

    response_obj = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0.1,
    )

    return json.loads(response_obj.choices[0].message.content)

# Usage
result = score_response(
    question="What is the drug interaction between warfarin and ibuprofen?",
    response="Warfarin and ibuprofen can interact — both increase bleeding risk. Use with caution.",
    criterion="clinical_completeness",
    criterion_description="Does the response include the mechanism, clinical significance, and management recommendation?",
)
print(f"Score: {result['score']}/5 — {result['reasoning']}")
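
The function above returns whatever JSON the judge produces, so a malformed or out-of-range score would pass through silently. A minimal sketch of a validation step follows, assuming the reply should contain an integer `score` from 1 to 5 as the prompt requests. The helper name `parse_judgment` is hypothetical and not part of the OpenAI SDK.

```python
import json


def parse_judgment(raw: str, min_score: int = 1, max_score: int = 5) -> dict:
    """Parse a judge's JSON reply and check that the score is a valid integer."""
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")

    score = data.get("score")
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(score, bool) or not isinstance(score, int):
        raise ValueError(f"score must be an integer, got {score!r}")
    if not min_score <= score <= max_score:
        raise ValueError(f"score {score} is outside {min_score}-{max_score}")

    return {"score": score, "reasoning": str(data.get("reasoning", ""))}
```

In `score_response`, this could replace the bare `json.loads(...)` call, with a retry or a logged failure when a `ValueError` is raised.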

Multi-Criteria Judge

Evaluate multiple dimensions in one call:

Python
CLINICAL_CRITERIA = {
    "factual_accuracy": "Is every factual claim in the response medically correct? Are there any errors?",
    "clinical_completeness": "Does the response cover mechanism, clinical significance, management, and monitoring as appropriate?",
    "appropriate_tone": "Is the tone professional, evidence-based, and suitable for a clinical audience?",
    "actionability": "Does the response give the clinician clear, actionable guidance?",
    "safety": "Does the response appropriately flag serious risks and avoid potentially harmful advice?",
}

def judge_response_multi_criteria(
    question: str,
    response: str,
    criteria: dict[str, str],
) -> dict:
    """Score a response on multiple criteria simultaneously."""

    criteria_text = "\n".join(
        f"- {name}: {desc}"
        for name, desc in criteria.items()
    )

    # Build the JSON template shown to the judge; placeholders describe each expected value
    example_output = {name: "<1-5>" for name in criteria}
    example_output["overall"] = "<1-5>"
    example_output["strengths"] = "<one sentence>"
    example_output["weaknesses"] = "<one sentence>"

    prompt = f"""You are an expert clinical pharmacology evaluator.

Question asked: {question}

Response to evaluate:
{response}

Score this response on each criterion (1-5 where 5=excellent):
{criteria_text}

Return JSON only, in this shape, with every score as an integer:
{json.dumps(example_output, indent=2)}"""

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0.1,
    )

    return json.loads(resp.choices[0].message.content)

# Evaluate a batch
def batch_evaluate(
    test_cases: list[dict],
    criteria: dict[str, str],
) -> list[dict]:
    results = []
    for case in test_cases:
        judgment = judge_response_multi_criteria(
            question=case["question"],
            response=case["model_response"],
            criteria=criteria,
        )
        results.append({
            "question": case["question"][:60],
            **judgment,
        })
    return results
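
Once a batch has been judged, the per-case scores are usually reduced to one number per criterion. The following is a rough sketch of that aggregation, assuming each result dict maps criterion names to numeric scores as requested in the prompt. `summarize_scores` is a hypothetical helper, and non-numeric values are skipped rather than treated as errors.

```python
def summarize_scores(results: list[dict], criteria: list[str]) -> dict[str, float]:
    """Average each criterion over the results that hold a numeric score for it."""
    summary: dict[str, float] = {}
    for name in criteria:
        scores = [
            r[name]
            for r in results
            if isinstance(r.get(name), (int, float)) and not isinstance(r.get(name), bool)
        ]
        if scores:  # skip criteria with no usable scores
            summary[name] = round(sum(scores) / len(scores), 2)
    return summary


# Illustrative data only, not real evaluation results
sample_results = [
    {"factual_accuracy": 5, "safety": 4},
    {"factual_accuracy": 3, "safety": "n/a"},
]
print(summarize_scores(sample_results, ["factual_accuracy", "safety"]))
# {'factual_accuracy': 4.0, 'safety': 4.0}
```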

Reference-Graded Evaluation

When you have a reference answer, include it for grounding:

Python
def grade_against_reference(
    question: str,
    candidate: str,
    reference: str,
) -> dict:
    """Grade candidate response relative to a reference answer."""

    prompt = f"""You are grading a clinical pharmacology AI assistant's response.

Question: {question}

Reference answer (expert-written, correct):
{reference}

Candidate response (to be graded):
{candidate}

Evaluate the candidate relative to the reference:
1. Is the candidate factually consistent with the reference?
2. Does the candidate cover the key points from the reference?
3. Does the candidate add any incorrect information?

Return JSON:
{{
  "factual_consistency": <1-5>,
  "coverage": <1-5>,
  "hallucinations": <0 = none, 1 = minor, 2 = major>,
  "overall_grade": <1-5>,
  "key_differences": "what the candidate missed or got wrong"
}}"""

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0.1,
    )
    return json.loads(resp.choices[0].message.content)

Comparative Evaluation (A/B)

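Ask the judge which of two responses is better. Pairwise comparison is often more reliable than absolute scoring, because the judge makes a relative call instead of calibrating to a numeric scale: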

Python
def compare_responses(
    question: str,
    response_a: str,
    response_b: str,
) -> dict:
    """Determine which response is better and by how much."""

    prompt = f"""Compare two responses to a clinical pharmacology question.

Question: {question}

Response A:
{response_a}

Response B:
{response_b}

Which response is better for a clinical audience? Consider: accuracy, completeness, actionability, and safety.

Return JSON:
{{
  "winner": "A" or "B" or "tie",
  "confidence": "clear" or "slight" or "marginal",
  "reasoning": "2-3 sentences explaining the choice",
  "criteria_comparison": {{
    "accuracy": "A better / B better / equal",
    "completeness": "A better / B better / equal",
    "actionability": "A better / B better / equal",
    "safety": "A better / B better / equal"
  }}
}}"""

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0.1,
    )
    return json.loads(resp.choices[0].message.content)

# Use A/B for fine-tuned vs base model comparison
def evaluate_ab_improvement(
    test_cases: list[dict],
    base_responses: list[str],
    ft_responses: list[str],
) -> dict:
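    # zip() silently truncates to the shortest input, so check lengths up front
    if not (len(test_cases) == len(base_responses) == len(ft_responses)):
        raise ValueError("test_cases, base_responses, and ft_responses must have equal length")

    wins = {"A_base": 0, "B_ft": 0, "tie": 0}

    # Note: the base model is always shown as A here, so position bias can skew the counts
    for case, base, ft in zip(test_cases, base_responses, ft_responses):
        result = compare_responses(case["question"], response_a=base, response_b=ft)
        winner = result.get("winner")
        if winner == "A":
            wins["A_base"] += 1
        elif winner == "B":
            wins["B_ft"] += 1
        else:
            wins["tie"] += 1  # also covers missing or unexpected values

    total = len(test_cases)
    return {
        "base_wins": wins["A_base"],
        "ft_wins": wins["B_ft"],
        "ties": wins["tie"],
        "ft_win_rate": wins["B_ft"] / total if total else 0.0,  # avoid division by zero
        "total": total,
    }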

Limitations of LLM-as-Judge

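Position bias: Judges such as GPT-4o tend to favor a response because of where it appears, often the one shown first (A), in pairwise comparisons. Mitigate by judging each pair in both orders (or randomizing the order) and aggregating the verdicts.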
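
One way to apply this mitigation is to judge each pair twice with the positions swapped and count a win only when both verdicts agree. The sketch below assumes a judge callable with the same positional signature and `"winner"` field as `compare_responses`. The function name `compare_both_orders` and the "disagreement counts as a tie" rule are illustrative choices, not an established standard.

```python
from typing import Callable


def compare_both_orders(
    question: str,
    response_1: str,
    response_2: str,
    judge: Callable[[str, str, str], dict],
) -> str:
    """Return "1", "2", or "tie"; a win counts only if both orderings agree."""
    first = judge(question, response_1, response_2)["winner"]   # response_1 shown as A
    second = judge(question, response_2, response_1)["winner"]  # response_1 shown as B

    # Translate the swapped verdict back into the first run's labels
    second_mapped = {"A": "B", "B": "A"}.get(second, "tie")

    if first == second_mapped and first in ("A", "B"):
        return "1" if first == "A" else "2"
    return "tie"  # disagreement, explicit tie, or unexpected value
```

A judge that always answers "A" regardless of content ends up with a tie under this scheme, which is the behavior you want from a position-biased verdict. The cost is two judge calls per pair.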

Verbosity bias: Judges tend to prefer longer, more detailed responses even when shorter answers are better. Instruct the judge explicitly not to reward length for its own sake.
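
To check whether verbosity bias is affecting your own results, one simple diagnostic is the correlation between response length and judge score. This is a sketch, with word count standing in as a rough proxy for length and a hypothetical helper name. A strong positive correlation is only a hint, since longer answers can also be genuinely better.

```python
from math import sqrt


def length_score_correlation(responses: list[str], scores: list[float]) -> float:
    """Pearson correlation between response word count and judge score."""
    if len(responses) != len(scores) or len(responses) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")

    lengths = [len(r.split()) for r in responses]
    mean_len = sum(lengths) / len(lengths)
    mean_score = sum(scores) / len(scores)

    cov = sum((l - mean_len) * (s - mean_score) for l, s in zip(lengths, scores))
    var_len = sum((l - mean_len) ** 2 for l in lengths)
    var_score = sum((s - mean_score) ** 2 for s in scores)

    if var_len == 0 or var_score == 0:
        return 0.0  # correlation is undefined when one variable is constant
    return cov / sqrt(var_len * var_score)
```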

Self-enhancement bias: GPT-4o may rate GPT-4o-generated responses higher than comparable responses from other models. When evaluating GPT-4o outputs, use Claude or another model family as the judge.

Hallucination detection: LLM judges struggle to detect subtle factual errors in specialized domains. Supplement with domain expert review.

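Cost: Judging 1,000 examples with GPT-4o at 500 tokens per evaluation (about 500,000 tokens in total) costs roughly $1.50–$3.00, depending on the split between input and output tokens and on current pricing. Budget accordingly for large evaluation sets.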
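
The arithmetic behind such an estimate can be sketched as below. The per-million-token prices and the 400/100 input/output split are illustrative assumptions, not quoted rates, so substitute the current figures from your provider's pricing page.

```python
def estimate_judge_cost(
    n_examples: int,
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Estimated cost in dollars for n_examples judge calls."""
    per_example = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return n_examples * per_example / 1_000_000


# Assumed for illustration: 400 input + 100 output tokens per evaluation,
# priced at $2.50 (input) and $10.00 (output) per million tokens.
cost = estimate_judge_cost(1_000, 400, 100, 2.50, 10.00)
print(f"${cost:.2f}")  # $2.00
```

Output tokens are typically priced higher than input tokens, so asking the judge for long reasoning raises the cost faster than lengthening the prompt does.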

For mission-critical evaluation (medical, legal), always include a human review sample alongside LLM-as-judge.
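
A minimal sketch of drawing that review sample follows, assuming the judged results are held in a list of dicts. The 10% default and the helper name `sample_for_human_review` are arbitrary choices for illustration. A fixed seed keeps the sample reproducible, so reviewers and later re-runs see the same cases.

```python
import random


def sample_for_human_review(
    results: list[dict],
    fraction: float = 0.1,
    seed: int = 0,
) -> list[dict]:
    """Pick a reproducible random subset of judged cases for expert review."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not results:
        return []
    k = max(1, round(len(results) * fraction))  # always review at least one case
    return random.Random(seed).sample(results, k)
```

Comparing the expert's scores with the judge's scores on this sample gives a direct check on how far the automated scores can be trusted for the rest of the set.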