LLM Evaluation Q&A · Lesson 16 of 16
Interview: LLM Evaluation Scenario Questions
Q1: Why is LLM evaluation hard compared to traditional ML evaluation?
A: Traditional ML has ground truth labels — you predict a class, you either match it or don't. Accuracy is unambiguous.
LLMs generate open-ended text where many valid outputs exist. There's no single correct answer to "Explain warfarin's mechanism" — multiple responses at different levels of detail are all valid. Standard metrics like accuracy don't apply.
Problems specific to LLM evaluation:
- Reference dependence: Metrics like BLEU need reference answers, but a good response that uses different words scores low
- Dimensionality: Quality has multiple dimensions — accuracy, completeness, clarity, tone — and these can trade off
- Subjectivity: What's "good" depends on the audience and use case
- Non-determinism: With sampling enabled, the same input can produce different outputs on each run, making reproducibility harder
The field has converged on three main approaches: automatic metrics (such as BERTScore), LLM-as-judge, and human evaluation. Each compensates for the others' weaknesses.
Q2: What is BERTScore and when should you use it?
A: BERTScore measures semantic similarity between generated text and reference text using contextual embeddings from a pre-trained model (for example, RoBERTa or DeBERTa). Unlike BLEU/ROUGE, which count exact n-gram overlaps, BERTScore matches tokens based on meaning in context.
Use BERTScore when:
- You have reference answers and want a fast, automatic semantic similarity score
- You need to detect paraphrases (same meaning, different words)
- LLM-as-judge is too expensive at your evaluation volume
BERTScore generally correlates better with human judgments than BLEU or ROUGE, especially for longer, more complex text.
Limitations: doesn't detect factual errors or hallucinations (only measures semantic similarity to reference), and requires a good reference answer.
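To make the token-matching idea concrete, here is a minimal sketch of the greedy matching behind a BERTScore-style F1. It assumes the token embeddings are already computed; the 2-d vectors below are made-up toy values, not output from any real model, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def greedy_match_f1(candidate_vecs, reference_vecs):
    """Return (precision, recall, f1) using greedy best-match similarity."""
    # Precision: each candidate token is matched to its most similar reference token.
    precision = sum(
        max(cosine(c, r) for r in reference_vecs) for c in candidate_vecs
    ) / len(candidate_vecs)
    # Recall: each reference token is matched to its most similar candidate token.
    recall = sum(
        max(cosine(r, c) for c in candidate_vecs) for r in reference_vecs
    ) / len(reference_vecs)
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Toy "embeddings" (assumed values for illustration only).
reference = [[1.0, 0.0], [0.0, 1.0]]
paraphrase = [[0.9, 0.1], [0.1, 0.9]]   # close in meaning, different "words"
unrelated = [[-1.0, 0.0], [0.0, -1.0]]  # points away from the reference

print(greedy_match_f1(paraphrase, reference))
print(greedy_match_f1(unrelated, reference))
```

The paraphrase scores close to 1 even though no vector is identical to a reference vector, which is the behavior exact-overlap metrics cannot provide. Note that nothing here checks whether a claim is true, which is why the limitation above still applies.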
Q3: How do you design a judge prompt for LLM-as-judge evaluation?
A: A good judge prompt has five elements:
- Role: "You are an expert clinical pharmacology evaluator"
- Context: The question being evaluated
- Response to evaluate: The model output (clearly labeled)
- Criteria: Specific, defined dimensions (factual_accuracy 1-5, completeness 1-5)
- Output format: JSON for reliable parsing
prompt = f"""You are an expert clinical pharmacology evaluator.
Question: {question}
Response: {response}
Score on these criteria (1=poor, 5=excellent):
- factual_accuracy: Is every claim medically correct?
- completeness: Are mechanism, significance, and management covered?
- actionability: Is there clear clinical guidance?
Return JSON only:
{{"factual_accuracy": <1-5>, "completeness": <1-5>, "actionability": <1-5>, "overall": <1-5>}}"""
Key rules: use temperature=0 or 0.1 for consistency. Always request JSON output. Avoid criteria that can't be objectively scored.
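Requesting JSON only helps if the reply is validated before it is used. The sketch below parses a judge reply and checks that each score is an integer from 1 to 5. The function name `parse_judge_scores` and the tolerance for text around the JSON object are assumptions for illustration, not part of any library.

```python
import json

CRITERIA = ("factual_accuracy", "completeness", "actionability", "overall")

def parse_judge_scores(raw: str) -> dict:
    """Parse the judge's reply and validate that every criterion is an int in 1..5."""
    # Judges sometimes wrap the JSON in extra text, so slice from the
    # first "{" to the last "}" before parsing.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in judge output")
    scores = json.loads(raw[start:end + 1])  # JSONDecodeError is a ValueError
    for name in CRITERIA:
        value = scores.get(name)
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if not is_int or not 1 <= value <= 5:
            raise ValueError(f"invalid score for {name}: {value!r}")
    return {name: scores[name] for name in CRITERIA}

reply = 'Sure: {"factual_accuracy": 5, "completeness": 4, "actionability": 3, "overall": 4}'
print(parse_judge_scores(reply))
```

Raising on a malformed reply, and then retrying or logging the case, is usually safer than silently defaulting to a score, because a default would bias the aggregate.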
Q4: What are the main biases in LLM-as-judge evaluation?
A:
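Position bias: Judges tend to favor a response because of where it appears (often the first, "A", position) in pairwise comparisons. Fix: run each comparison twice with the order swapped, and count a win only when both runs agree; otherwise record a tie.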
Verbosity bias: Judges prefer longer responses. Fix: add explicit instruction "do not prefer responses based on length."
Self-enhancement bias: Judges tend to prefer outputs from their own model family (for example, a GPT-4o judge favoring GPT-4o outputs, or a Claude judge favoring Claude outputs). Fix: use a different model family as judge.
Calibration inconsistency: The same judge gives different scores on the same input across runs. Fix: use temperature=0, run 2–3 times and average.
Domain blind spots: LLM judges may not catch subtle factual errors in specialized domains. Fix: supplement with domain expert review for critical applications.
Measuring bias: run the judge on identical responses in different positions, with different lengths. Deviations from expected behavior quantify the bias.
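The order-swap mitigation and the identical-response probe can be sketched as follows. The `judge` callable is a hypothetical stand-in for a real LLM call; it is assumed to take `(question, response_a, response_b)` and return `"A"` or `"B"`.

```python
def debiased_preference(judge, question, response_1, response_2):
    """Run the comparison in both orders; return "1", "2", or "tie"."""
    first = judge(question, response_1, response_2)   # response_1 sits in slot A
    second = judge(question, response_2, response_1)  # response_1 sits in slot B
    if first == "A" and second == "B":
        return "1"
    if first == "B" and second == "A":
        return "2"
    return "tie"  # verdict flipped with position, so it is not trusted

def position_bias_rate(judge, question, responses):
    """Share of trials where slot A wins although both slots hold the same text."""
    picks = [judge(question, r, r) for r in responses]
    return picks.count("A") / len(picks)

# Stub judges for illustration only (a real judge would call an LLM).
def always_first(question, a, b):
    return "A"

def prefers_longer(question, a, b):
    return "A" if len(a) >= len(b) else "B"

print(debiased_preference(always_first, "q", "short", "a much longer answer"))
print(debiased_preference(prefers_longer, "q", "short", "a much longer answer"))
print(position_bias_rate(always_first, "q", ["x", "y", "z"]))
```

With identical responses in both slots, an unbiased judge should pick A about half the time, so a rate far from 0.5 quantifies the position bias. The same pattern, with padded versus unpadded copies of one answer, can be used to probe verbosity bias.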
Q5: What are the four RAGAS metrics and what does each measure?
A:
Faithfulness: Does the answer stick to retrieved context? High faithfulness = no hallucination beyond what was retrieved. Measured by checking each claim in the answer against the context.
Answer Relevancy: Does the answer address the question asked? High relevancy = answer directly answers the question. Measured by reverse-generating questions from the answer and checking how similar they are to the original question (typically by embedding similarity).
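Context Precision: Is the retrieved context relevant, and are the relevant chunks ranked first? High precision = the useful chunks appear at the top of the retrieved list with little noise above them. Measured by judging each retrieved chunk for relevance to the question and reference answer, then rewarding relevant chunks that rank higher.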
Context Recall: Did retrieval capture all needed information? High recall = context contains everything needed to answer. Measured against ground truth (requires gold answers).
Use these together: low faithfulness → LLM hallucination problem. Low context recall → retrieval problem. Low precision → chunking or retrieval noise problem.
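As a rough illustration of how two of these scores reduce to simple arithmetic once per-claim and per-chunk verdicts exist, consider the sketch below. It assumes the verdicts (booleans) were already produced by an LLM judge, and the rank-aware precision formula is an approximation of the mean precision-at-k form, not a copy of the RAGAS implementation.

```python
def faithfulness_score(claim_supported):
    """Fraction of answer claims that the retrieved context supports."""
    return sum(claim_supported) / len(claim_supported)

def context_precision_score(chunk_relevant):
    """Average precision@k over the ranks k that hold a relevant chunk.

    chunk_relevant is ordered as retrieved: index 0 is the top-ranked chunk.
    """
    hits = 0
    total = 0.0
    for k, relevant in enumerate(chunk_relevant, start=1):
        if relevant:
            hits += 1
            total += hits / k  # precision at rank k
    return total / hits if hits else 0.0

# 3 of 4 claims are supported by the context.
print(faithfulness_score([True, True, False, True]))
# Same two relevant chunks, ranked well versus ranked poorly.
print(context_precision_score([True, True, False]))
print(context_precision_score([False, True, True]))
```

The second and third calls contain the same number of relevant chunks, but the ranking with noise at the top scores lower, which is the signal that points to a retrieval or chunking problem.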
Q6: How do you run evaluations in CI/CD to catch regressions?
A: The pattern:
- Maintain a golden dataset (100–500 test cases with expected answers)
- On every PR that changes prompts, retrieval, or model config: run the eval suite
- Score responses with LLM-as-judge on relevant criteria
- Compare against a threshold (e.g., mean overall quality, normalized to 0–1, must be at least 0.82)
- If any metric drops below threshold: fail the build, block the merge
# GitHub Actions
- name: Run eval suite
  run: python evals/run_evals.py
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
# The eval script exits with code 1 on failure, which blocks the merge
import sys
if not report["passed"]:
    sys.exit(1)
Also maintain regression tests: specific scenarios that must always pass (safety refusals, never-hallucinate-about-specific-drugs). These have no threshold; they're binary pass/fail.
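One way the `report` dictionary used above might be assembled is sketched here. The function names, the 0 to 1 score normalization, and the shape of the inputs are assumptions for illustration; the only behavior taken from the text is that thresholded metrics and binary regression tests both gate the build.

```python
def build_report(scores, thresholds, regression_results):
    """Combine thresholded metrics and binary regression tests into one verdict.

    scores: metric name -> list of per-case scores normalized to 0..1
    thresholds: metric name -> minimum acceptable mean
    regression_results: test name -> True if the scenario passed
    """
    means = {metric: sum(values) / len(values) for metric, values in scores.items()}
    below = {m: means[m] for m, minimum in thresholds.items() if means[m] < minimum}
    failed = [name for name, ok in regression_results.items() if not ok]
    return {
        "means": means,
        "below_threshold": below,
        "failed_regressions": failed,
        "passed": not below and not failed,
    }

def exit_code(report):
    """0 lets the CI job succeed; 1 fails the build and blocks the merge."""
    return 0 if report["passed"] else 1

report = build_report(
    scores={"overall": [0.9, 0.8, 0.85]},
    thresholds={"overall": 0.82},
    regression_results={"refuses_unsafe_dose_request": True},
)
print(report["passed"], exit_code(report))
# In the real script: sys.exit(exit_code(report))
```

Keeping the regression results separate from the averaged metrics means a single failed safety scenario cannot be hidden by a high mean score.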
Q7: What is the difference between offline and online evaluation?
A:
Offline evaluation: Evaluate on a fixed dataset before deployment. Fast, cheap, controlled, reproducible. Limited because the dataset may not represent real user behavior.
Online evaluation: Measure quality in production with real users. Methods: user thumbs up/down, task completion rates, return visit rates, A/B testing. Represents actual usage but is slower and noisier.
Use offline evals to gate deployment (does this change break anything?). Use online evals to measure actual impact on users (did this change make the product better?).
Offline + online is the complete picture. Offline without online: you might deploy something that passes benchmarks but users don't like. Online without offline: you catch problems only after users experience them.
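Because online signals are noisy, an A/B comparison of thumbs-up rates usually needs a significance check before a change is declared better. The following is a minimal sketch using a two-proportion z-test with only the standard library; the counts are made-up example numbers, and the function name is hypothetical.

```python
import math

def two_proportion_z(up_a, n_a, up_b, n_b):
    """Return (z, two-sided p-value) for the difference in thumbs-up rates B - A."""
    p_a, p_b = up_a / n_a, up_b / n_b
    pooled = (up_a + up_b) / (n_a + n_b)
    # Standard error under the null hypothesis that both variants share one rate.
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: variant A got 400/1000 thumbs up, variant B got 460/1000.
z, p = two_proportion_z(400, 1000, 460, 1000)
print(round(z, 2), round(p, 4))
```

This sketch assumes independent ratings and a pooled rate strictly between 0 and 1 (otherwise the standard error is zero). Users who choose to rate are rarely a random sample, so the result is a signal to weigh, not proof.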
Q8: How do you choose between pointwise and pairwise evaluation?
A:
Pointwise: Score each response individually (e.g., 1–5). Good for tracking quality over time and computing absolute scores. Problematic for distinguishing similar-quality responses.
Pairwise: Compare two responses directly, pick the better one. More reliable for close comparisons. Used in Chatbot Arena. Doesn't give absolute scores.
Use pairwise when: comparing two specific model variants, selecting between fine-tuning configurations, or when responses are similar in quality and you need a reliable winner.
Use pointwise when: tracking quality trend over product versions, computing summary statistics across a test set, or evaluating many variants against a fixed standard.
For fine-tuning comparison: pairwise between fine-tuned and base model. For tracking quality over 6 months of product development: pointwise on the same test set.
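The two modes also aggregate differently, which a small sketch can show. It assumes pairwise verdicts are stored as labels (`"fine_tuned"`, `"base"`, or `"tie"`) and pointwise scores as 1-5 integers per product version; both layouts and all names are hypothetical.

```python
from collections import Counter
from statistics import mean

def pairwise_win_rate(verdicts, candidate="fine_tuned"):
    """Share of comparisons won by `candidate`; a tie counts as half a win."""
    counts = Counter(verdicts)
    return (counts[candidate] + 0.5 * counts["tie"]) / len(verdicts)

def pointwise_trend(scores_by_version):
    """Mean 1-5 score per product version, on the same fixed test set."""
    return {version: mean(scores) for version, scores in scores_by_version.items()}

verdicts = ["fine_tuned", "base", "tie", "fine_tuned"]
print(pairwise_win_rate(verdicts))
print(pointwise_trend({"v1": [3, 4, 3], "v2": [4, 4, 5]}))
```

The win rate says which of two variants is better but nothing about how good either one is, while the per-version means can be tracked over months as long as the test set and rubric stay fixed.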
Q9: How do you handle evaluation for a RAG system end-to-end?
A: A RAG system has two components with distinct failure modes: retrieval and generation.
Retrieval evaluation: Context precision (how much retrieved content is relevant) and context recall (did we retrieve all needed information). Run offline on test queries with known relevant documents.
Generation evaluation: Faithfulness (does answer stick to context?), answer relevancy (does it address the question?).
End-to-end: Ground truth answers evaluated via BERTScore or LLM-as-judge against the final generated response.
Full pipeline:
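from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
ragas_scores = evaluate(dataset, metrics=[
    faithfulness, answer_relevancy, context_precision, context_recall
])
Where to look when scores are low: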
- Low faithfulness → generation problem (hallucination)
- Low context recall → retrieval problem (missing documents)
- Low context precision → chunking/retrieval noise problem
- Low answer relevancy → prompt or LLM issue
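The triage list above can be turned into a small lookup so that a low score is reported with its likely cause. This is a sketch only: the threshold values are hypothetical, and `scores` is assumed to be a plain dict of metric means, not a RAGAS result object.

```python
# Likely cause for each metric, taken from the triage list above.
DIAGNOSES = {
    "faithfulness": "generation problem (hallucination)",
    "context_recall": "retrieval problem (missing documents)",
    "context_precision": "chunking/retrieval noise problem",
    "answer_relevancy": "prompt or LLM issue",
}

def diagnose(scores, thresholds):
    """Return the likely causes for every metric that falls below its threshold."""
    return [
        f"{metric}={scores[metric]:.2f}: {cause}"
        for metric, cause in DIAGNOSES.items()
        if scores[metric] < thresholds[metric]
    ]

# Hypothetical thresholds and scores for illustration.
thresholds = {"faithfulness": 0.90, "context_recall": 0.80,
              "context_precision": 0.70, "answer_relevancy": 0.80}
scores = {"faithfulness": 0.95, "context_recall": 0.62,
          "context_precision": 0.75, "answer_relevancy": 0.88}

for finding in diagnose(scores, thresholds):
    print(finding)
```

Checking retrieval metrics before generation metrics is often the practical order, since a model cannot answer faithfully from context that was never retrieved.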
Q10: What makes a good golden dataset?
A: A golden dataset is only as good as its coverage and quality.
Coverage: Include examples from every important query type in your application. For a drug information system: interactions, contraindications, mechanisms, dosing, adverse effects, monitoring. Gaps in coverage mean you miss whole failure modes.
Diversity: Different phrasings of similar questions, different drugs, different complexity levels. A dataset of 200 warfarin questions doesn't tell you about other drugs.
Difficulty calibration: Include easy, medium, and hard questions. An all-easy dataset won't catch subtle model failures.
Ground truth quality: Reference answers must be correct and comprehensive. Errors in the golden dataset lead to wrong evaluation conclusions — you might reject a better model because it disagrees with wrong reference answers.
Refresh cadence: The golden dataset needs updating as your application evolves. Add failing cases from production (when users report problems), newly discovered edge cases, and new capability areas.
Size: 100–200 examples for a focused domain, 500+ for a broader application.
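Coverage and difficulty calibration are easy to audit automatically. The sketch below assumes each golden example is a dict with `category` and `difficulty` fields; that schema, the function names, and the minimum of 10 examples per category are assumptions for illustration.

```python
from collections import Counter

# Query types named above for a drug information system.
REQUIRED_CATEGORIES = [
    "interactions", "contraindications", "mechanisms",
    "dosing", "adverse effects", "monitoring",
]

def coverage_gaps(examples, min_per_category=10):
    """Return categories with fewer than `min_per_category` examples, with their counts."""
    counts = Counter(ex["category"] for ex in examples)
    return {c: counts[c] for c in REQUIRED_CATEGORIES if counts[c] < min_per_category}

def difficulty_mix(examples):
    """Share of easy, medium, and hard examples in the dataset."""
    counts = Counter(ex["difficulty"] for ex in examples)
    return {level: counts[level] / len(examples) for level in ("easy", "medium", "hard")}

# Tiny hypothetical dataset: heavy on interactions, nothing on monitoring.
golden = (
    [{"category": "interactions", "difficulty": "easy"}] * 12
    + [{"category": "dosing", "difficulty": "hard"}] * 4
)
print(coverage_gaps(golden))
print(difficulty_mix(golden))
```

Running a check like this whenever the dataset is refreshed makes coverage gaps visible before they turn into missed failure modes.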
Q11: How do you measure calibration of an LLM's confidence?
A: Calibration: when the model says it's 90% confident, it should be right 90% of the time.
For classification-type outputs, extract logprobs:
import math
from openai import OpenAI
client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    logprobs=True,
    top_logprobs=5,
)
# Probability of the first token (for classification)
first_token_logprob = response.choices[0].logprobs.content[0].logprob
probability = math.exp(first_token_logprob)
For generation, calibration is harder because the model rarely gives explicit confidence scores. Approaches: verbal elicitation ("How confident are you? 1-10"), multi-sample consistency (if 9/10 samples give the same answer, high confidence), or trained uncertainty estimators.
Medical AI consideration: a confidently wrong answer about drug interactions is worse than a correctly uncertain answer. Prefer models that say "I'm not certain about this specific drug combination" over ones that always sound authoritative.
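Once a confidence value exists for each answer, calibration can be summarized with expected calibration error (ECE), and multi-sample consistency can serve as a confidence proxy for generated text. The following is a minimal sketch with hypothetical function names, assuming confidences in the range 0 to 1 and equal-width bins.

```python
from collections import Counter

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += len(bucket) / total * abs(avg_conf - accuracy)
    return ece

def consistency_confidence(samples):
    """Most common answer and the share of samples that agree with it."""
    answer, count = Counter(samples).most_common(1)[0]
    return answer, count / len(samples)

# A model that says 0.9 and is right 9 times out of 10 is well calibrated.
print(expected_calibration_error([0.9] * 10, [True] * 9 + [False]))
# The same stated confidence with only 5 correct answers is overconfident.
print(expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5))
print(consistency_confidence(["yes"] * 9 + ["no"]))
```

An ECE near 0 means stated confidence tracks accuracy. In the medical setting described above, the overconfident case is the one to watch for.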
Q12: System design — design an evaluation system for a medical AI assistant.
A: Requirements: the AI answers drug information questions. Evaluation must catch factual errors, hallucinations, and safety failures.
Components:
Offline eval suite (CI):
- 500-question golden dataset across 10 drug question categories
- Expert pharmacist reviewed 100% of reference answers
- LLM-as-judge scoring on a 5-dimension rubric that includes accuracy, completeness, and safety
- Thresholds: overall 0.85+, safety dimension 0.95+
- Regression test suite: 30 safety scenarios that must all pass (binary)
RAGAS monitoring (continuous):
- Run RAGAS on 100 sampled production queries weekly
- Alert if faithfulness drops below 0.90 (hallucination signal)
- Alert if context recall drops below 0.80 (retrieval signal)
Online feedback:
- Pharmacist users rate responses (thumbs up/down + optional comment)
- Weekly report: share of responses that received a rating, approval rate by category, free text analysis
Human review:
- Random 50 queries reviewed weekly by senior clinical pharmacist
- Tracks categories: factual error, incomplete, correct
- Monthly calibration of LLM judge against human scores
Alert escalation: Any single factual error in a safety-critical drug category (high-alert medications: warfarin, insulin, opioids) triggers immediate review and potential system pause.
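The monitoring and escalation rules above can be expressed as one weekly check. This sketch uses the thresholds stated in the design (faithfulness 0.90, context recall 0.80) and the high-alert categories named in the escalation rule; the input shapes, field names, and function name are hypothetical.

```python
HIGH_ALERT_CATEGORIES = {"warfarin", "insulin", "opioids"}

def weekly_alerts(ragas_means, human_reviews):
    """Return alert messages for the weekly monitoring run.

    ragas_means: dict with mean "faithfulness" and "context_recall" scores
    human_reviews: list of dicts with "label" and "drug_category" fields
    """
    alerts = []
    if ragas_means["faithfulness"] < 0.90:
        alerts.append("ALERT: faithfulness below 0.90 (hallucination signal)")
    if ragas_means["context_recall"] < 0.80:
        alerts.append("ALERT: context recall below 0.80 (retrieval signal)")
    for review in human_reviews:
        is_error = review["label"] == "factual error"
        if is_error and review["drug_category"] in HIGH_ALERT_CATEGORIES:
            # A single error in a high-alert category escalates immediately.
            alerts.append(
                f"ESCALATE: factual error in {review['drug_category']} response"
            )
    return alerts

alerts = weekly_alerts(
    {"faithfulness": 0.93, "context_recall": 0.76},
    [
        {"label": "correct", "drug_category": "warfarin"},
        {"label": "factual error", "drug_category": "insulin"},
        {"label": "incomplete", "drug_category": "statins"},
    ],
)
for message in alerts:
    print(message)
```

The escalation rule is deliberately not averaged: one confirmed error in a high-alert category produces its own message regardless of how good the aggregate scores look.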