Interview: LLM Evaluation Scenario Questions — LLM Evaluation Q&A | Learnixo

Q1: Why is LLM evaluation hard compared to traditional ML evaluation?

A: Traditional ML has ground truth labels — you predict a class, you either match it or don't. Accuracy is unambiguous.

LLMs generate open-ended text where many valid outputs exist. There's no single correct answer to "Explain warfarin's mechanism" — multiple responses at different levels of detail are all valid. Standard metrics like accuracy don't apply.

Problems specific to LLM evaluation:

Reference dependence: Metrics like BLEU need reference answers, but a good response that uses different words scores low
Dimensionality: Quality has multiple dimensions — accuracy, completeness, clarity, tone — and these can trade off
Subjectivity: What's "good" depends on the audience and use case
Non-determinism: Same input produces different outputs each run, making reproducibility harder
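
To make the reference-dependence point concrete, here is a minimal sketch that uses clipped unigram precision as a simplified stand-in for BLEU (real BLEU also uses higher-order n-grams and a brevity penalty). The example sentences are illustrative assumptions, not taken from any benchmark.

```python
from collections import Counter


def unigram_precision(candidate: str, reference: str) -> float:
    """Fraction of candidate tokens that also appear in the reference (clipped counts)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    if not cand:
        return 0.0
    overlap = sum(min(count, ref[token]) for token, count in cand.items())
    return overlap / sum(cand.values())


reference = "warfarin blocks vitamin k recycling"
verbatim = "warfarin blocks vitamin k recycling"
paraphrase = "the drug inhibits regeneration of a clotting cofactor"

print(unigram_precision(verbatim, reference))    # 1.0
print(unigram_precision(paraphrase, reference))  # 0.0
```

The paraphrase shares no tokens with the reference, so an overlap metric scores it at zero regardless of whether it is a good answer.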

The field has converged on three main approaches: automatic metrics (BERTScore), LLM-as-judge, and human evaluation. Each compensates for the others' weaknesses.

Q2: What is BERTScore and when should you use it?

A: BERTScore measures semantic similarity between generated text and reference text using contextual embeddings from a pre-trained transformer model (for example RoBERTa or DeBERTa). Unlike BLEU and ROUGE, which count exact n-gram overlaps, BERTScore matches tokens by the cosine similarity of their embeddings in context, so paraphrases can still score well.

Use BERTScore when:

You have reference answers and want a fast, automatic semantic similarity score
You need to detect paraphrases (same meaning, different words)
LLM-as-judge is too expensive at your evaluation volume

BERTScore generally correlates better with human judgments than BLEU or ROUGE, especially on longer, more complex text where exact word overlap is rare.

Limitations: it doesn't reliably detect factual errors or hallucinations, because it only measures semantic similarity to the reference (a response with one wrong dose can still score high), and it requires a good reference answer.
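
The matching step can be sketched in a few lines. This is a simplified illustration of greedy token matching, not the library implementation: the 2-d vectors in `TOY` are hypothetical stand-ins for contextual embeddings, and options such as IDF weighting are left out.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def greedy_match_scores(cand_vecs, ref_vecs):
    """Return (precision, recall, f1) from greedy max-similarity matching."""
    # Precision: each candidate token is matched to its most similar reference token.
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # Recall: each reference token is matched to its most similar candidate token.
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical embeddings; a real run would take them from a transformer.
TOY = {
    "inhibits": (1.0, 0.1),
    "blocks": (0.9, 0.2),
    "clotting": (0.1, 1.0),
    "coagulation": (0.2, 0.9),
}

candidate = [TOY["blocks"], TOY["coagulation"]]
reference = [TOY["inhibits"], TOY["clotting"]]
p, r, f1 = greedy_match_scores(candidate, reference)
print(round(p, 3), round(r, 3), round(f1, 3))
```

Because "blocks" sits close to "inhibits" and "coagulation" close to "clotting" in the toy space, the paraphrase scores high even though no word is shared.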

Q3: How do you design a judge prompt for LLM-as-judge evaluation?

A: A good judge prompt has five elements:

Role: "You are an expert clinical pharmacology evaluator"
Context: The question being evaluated
Response to evaluate: The model output (clearly labeled)
Criteria: Specific, defined dimensions (factual_accuracy 1-5, completeness 1-5)
Output format: JSON for reliable parsing

Python

# question and response are the strings being evaluated
prompt = f"""You are an expert clinical pharmacology evaluator.

Question: {question}
Response: {response}

Score on these criteria (1=poor, 5=excellent):
- factual_accuracy: Is every claim medically correct?
- completeness: Are mechanism, significance, and management covered?
- actionability: Is there clear clinical guidance?
- overall: Overall quality, weighing the criteria above

Return JSON only:
{{"factual_accuracy": <1-5>, "completeness": <1-5>, "actionability": <1-5>, "overall": <1-5>}}"""

Key rules: use temperature=0 or 0.1 for consistency. Always request JSON output. Avoid criteria that can't be objectively scored.
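
Requesting JSON only helps if the reply is validated before it is used. A minimal sketch of that step, assuming the judge returns the four keys from the prompt above; the function name `parse_judge_scores` is hypothetical:

```python
import json

CRITERIA = ("factual_accuracy", "completeness", "actionability", "overall")


def parse_judge_scores(raw: str) -> dict:
    """Parse the judge's JSON reply and check each score is an integer from 1 to 5."""
    scores = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(scores, dict):
        raise ValueError("judge reply is not a JSON object")
    for name in CRITERIA:
        value = scores.get(name)
        # bool is a subclass of int in Python, so exclude it explicitly
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"invalid score for {name}: {value!r}")
    return {name: scores[name] for name in CRITERIA}


reply = '{"factual_accuracy": 5, "completeness": 4, "actionability": 3, "overall": 4}'
print(parse_judge_scores(reply))
```

Failing loudly on a malformed reply keeps bad parses from silently dragging down (or inflating) the averages.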

Q4: What are the main biases in LLM-as-judge evaluation?

Position bias: Judges tend to favor a response because of where it appears (often the first, or A, position) in pairwise comparisons. Fix: evaluate each pair in both orders and count a win only when both runs agree; treat disagreements as ties.

Verbosity bias: Judges prefer longer responses. Fix: add explicit instruction "do not prefer responses based on length."

Self-enhancement bias: Judges tend to rate outputs from their own model family higher, for example a GPT-4o judge favoring GPT-4o outputs or a Claude judge favoring Claude outputs. Fix: use a different model family as judge.

Calibration inconsistency: The same judge gives different scores on the same input across runs. Fix: use temperature=0, run 2–3 times and average.

Domain blind spots: LLM judges may not catch subtle factual errors in specialized domains. Fix: supplement with domain expert review for critical applications.

Measuring bias: run the judge on identical responses in different positions, with different lengths. Deviations from expected behavior quantify the bias.
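
One way to neutralize position bias is to judge each pair in both orders and accept a winner only when the two verdicts agree. The sketch below assumes a judge callable that returns "first" or "second"; the two stub judges are hypothetical stand-ins for an LLM call.

```python
from typing import Callable

Judge = Callable[[str, str], str]  # returns "first" or "second"


def debiased_compare(judge: Judge, response_a: str, response_b: str) -> str:
    """Judge both orderings; declare a winner only when the verdicts agree."""
    forward = judge(response_a, response_b)   # A shown first
    backward = judge(response_b, response_a)  # B shown first
    a_wins_forward = forward == "first"
    a_wins_backward = backward == "second"
    if a_wins_forward and a_wins_backward:
        return "A"
    if not a_wins_forward and not a_wins_backward:
        return "B"
    return "tie"  # the verdict flipped with position


def always_first(first: str, second: str) -> str:
    """Stub judge with maximal position bias."""
    return "first"


def prefers_longer(first: str, second: str) -> str:
    """Stub judge that ignores position and picks the longer response."""
    return "first" if len(first) >= len(second) else "second"


print(debiased_compare(always_first, "short", "a longer answer"))    # tie
print(debiased_compare(prefers_longer, "short", "a longer answer"))  # B
```

The share of pairs that come back as "tie" because the verdict flipped is itself a usable estimate of how position-biased the judge is.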

Q5: What are the four RAGAS metrics and what does each measure?

Faithfulness: Does the answer stick to retrieved context? High faithfulness = no hallucination beyond what was retrieved. Measured by checking each claim in the answer against the context.

Answer Relevancy: Does the answer address the question asked? High relevancy = answer directly answers the question. Measured by reverse-generating questions from the answer and checking if they match the original.

Context Precision: Is the retrieved context relevant? High precision = the retrieved chunks are useful and the relevant ones rank near the top. Measured by judging each retrieved chunk's relevance to the question and reference answer, then scoring how highly the relevant chunks are ranked.

Context Recall: Did retrieval capture all needed information? High recall = context contains everything needed to answer. Measured against ground truth (requires gold answers).

Use these together: low faithfulness → LLM hallucination problem. Low context recall → retrieval problem. Low context precision → chunking or retrieval noise problem.
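
The arithmetic behind two of these metrics is simple once the per-claim and per-chunk verdicts exist. The sketch below assumes those verdicts have already been produced (in RAGAS an LLM produces them) and uses the rank-aware form of context precision described in the RAGAS documentation; the function names are hypothetical.

```python
def faithfulness_score(claim_supported: list[bool]) -> float:
    """Share of answer claims that the retrieved context supports."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


def context_precision_score(chunk_relevant: list[bool]) -> float:
    """Mean of precision@k taken at each rank k that holds a relevant chunk."""
    hits = 0
    total = 0.0
    for rank, relevant in enumerate(chunk_relevant, start=1):
        if relevant:
            hits += 1
            total += hits / rank  # precision@rank at this relevant position
    return total / hits if hits else 0.0


# Three claims, one unsupported: 2/3 of the answer is grounded.
print(round(faithfulness_score([True, True, False]), 3))       # 0.667
# Relevant chunks at ranks 1 and 3: (1/1 + 2/3) / 2.
print(round(context_precision_score([True, False, True]), 3))  # 0.833
```

Moving the irrelevant chunk to the end of the ranking raises context precision to 1.0, which is why the metric is sensitive to reranking as well as filtering.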

Q6: How do you run evaluations in CI/CD to catch regressions?

A: The pattern:

Maintain a golden dataset (100–500 test cases with expected answers)
On every PR that changes prompts, retrieval, or model config: run the eval suite
Score responses with LLM-as-judge on relevant criteria
Compare against threshold (e.g., mean overall quality, normalized to 0–1, must be 0.82+)
If any metric drops below threshold: fail the build, block the merge

YAML

# GitHub Actions
- name: Run eval suite
  run: python evals/run_evals.py
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

Python

import sys

# The eval script exits with code 1 on failure, which fails the CI step and blocks the merge
if not report["passed"]:
    sys.exit(1)

Also maintain regression tests — specific scenarios that must always pass (safety refusals, never-hallucinate-about-specific-drugs). These have no threshold — they're binary pass/fail.
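
A minimal sketch of how the report could be assembled, combining thresholded metrics with binary regression tests. The metric names, the faithfulness threshold, and the function names are assumptions for illustration; only the 0.82 overall threshold comes from the pattern above.

```python
import json

# Hypothetical thresholds on scores normalized to 0-1.
THRESHOLDS = {"overall": 0.82, "faithfulness": 0.90}


def build_report(metrics: dict, regressions_passed: list) -> dict:
    """Compare metric averages to thresholds and require every regression test to pass."""
    failures = [
        f"{name}: {metrics.get(name, 0.0):.2f} < {minimum:.2f}"
        for name, minimum in THRESHOLDS.items()
        if metrics.get(name, 0.0) < minimum
    ]
    failed_regressions = regressions_passed.count(False)
    if failed_regressions:
        failures.append(f"{failed_regressions} regression test(s) failed")
    return {"passed": not failures, "failures": failures}


def exit_code(report: dict) -> int:
    """0 lets the CI step succeed; 1 fails it, which blocks the merge."""
    return 0 if report["passed"] else 1


report = build_report({"overall": 0.80, "faithfulness": 0.93}, [True, True])
print(json.dumps(report, indent=2))
print("exit code:", exit_code(report))  # a real script would pass this to sys.exit
```

Printing the failure list before exiting matters in practice: the CI log should say which metric regressed, not just that the build failed.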

Q7: What is the difference between offline and online evaluation?

Offline evaluation: Evaluate on a fixed dataset before deployment. Fast, cheap, controlled, reproducible. Limited because the dataset may not represent real user behavior.

Online evaluation: Measure quality in production with real users. Methods: user thumbs up/down, task completion rates, return visit rates, A/B testing. Represents actual usage but is slower and noisier.

Use offline evals to gate deployment (does this change break anything?). Use online evals to measure actual impact on users (did this change make the product better?).

Offline + online is the complete picture. Offline without online: you might deploy something that passes benchmarks but users don't like. Online without offline: you catch problems only after users experience them.
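
Because online signals are noisy, an A/B comparison of thumbs-up rates needs a significance check before anyone declares a winner. A minimal sketch using a two-proportion z-test; the counts are made-up illustration values and the function name is hypothetical.

```python
import math


def two_proportion_z_test(up_a: int, n_a: int, up_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference in thumbs-up rates, B minus A."""
    rate_a, rate_b = up_a / n_a, up_b / n_b
    pooled = (up_a + up_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 0.0, 1.0  # every vote identical: no evidence of a difference
    z = (rate_b - rate_a) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return z, p_value


# Variant A: 400 of 1000 thumbs up. Variant B: 500 of 1000.
z, p = two_proportion_z_test(400, 1000, 500, 1000)
print(f"z={z:.2f}, p={p:.6f}")
```

A small p-value says the difference is unlikely to be noise; it does not say the difference is large enough to matter for the product.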

Q8: How do you choose between pointwise and pairwise evaluation?

Pointwise: Score each response individually (e.g., 1–5). Good for tracking quality over time and computing absolute scores. Problematic for distinguishing similar-quality responses.

Pairwise: Compare two responses directly, pick the better one. More reliable for close comparisons. Used in Chatbot Arena. Doesn't give absolute scores.

Use pairwise when: comparing two specific model variants, selecting between fine-tuning configurations, or when responses are similar in quality and you need a reliable winner.

Use pointwise when: tracking quality trend over product versions, computing summary statistics across a test set, or evaluating many variants against a fixed standard.

For fine-tuning comparison: pairwise between fine-tuned and base model. For tracking quality over 6 months of product development: pointwise on the same test set.
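
The two styles also aggregate differently: pairwise verdicts roll up into a win rate, pointwise scores into summary statistics. A minimal sketch, assuming verdicts are recorded as "A", "B", or "tie" and scores use a 1-5 scale; the function names are hypothetical.

```python
def win_rate(verdicts: list[str], model: str = "A") -> float:
    """Share of non-tie pairwise verdicts won by `model`."""
    decided = [v for v in verdicts if v != "tie"]
    if not decided:
        return 0.5  # nothing decided: call it even
    return decided.count(model) / len(decided)


def pointwise_summary(scores: list[int]) -> dict:
    """Mean score and share of responses scoring 4 or higher on a 1-5 scale."""
    return {
        "mean": sum(scores) / len(scores),
        "share_good": sum(s >= 4 for s in scores) / len(scores),
    }


print(win_rate(["A", "B", "A", "tie", "A"]))  # 0.75
print(pointwise_summary([5, 4, 3, 4]))        # {'mean': 4.0, 'share_good': 0.75}
```

The win rate only has meaning relative to the specific opponent, while the pointwise mean can be compared across versions as long as the test set and rubric stay fixed.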

Q9: How do you handle evaluation for a RAG system end-to-end?

A: A RAG system has two components with distinct failure modes: retrieval and generation.

Retrieval evaluation: Context precision (how much retrieved content is relevant) and context recall (did we retrieve all needed information). Run offline on test queries with known relevant documents.

Generation evaluation: Faithfulness (does answer stick to context?), answer relevancy (does it address the question?).

End-to-end: Ground truth answers evaluated via BERTScore or LLM-as-judge against the final generated response.

Full pipeline:

Python

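# Import paths as in ragas 0.1.x; dataset holds question, contexts, answer, ground truth
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, context_recall, faithfulness

ragas_scores = evaluate(dataset, metrics=[
    faithfulness, answer_relevancy, context_precision, context_recall
])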

Where to look when scores are low:

Low faithfulness → generation problem (hallucination)
Low context recall → retrieval problem (missing documents)
Low context precision → chunking/retrieval noise problem
Low answer relevancy → prompt or LLM issue
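
The triage table above is easy to automate so that a failing eval run points at a component. A minimal sketch; the 0.8 cutoff and the function name are assumptions, and the diagnosis strings are taken from the list above.

```python
DIAGNOSES = {
    "faithfulness": "generation problem (hallucination)",
    "context_recall": "retrieval problem (missing documents)",
    "context_precision": "chunking/retrieval noise problem",
    "answer_relevancy": "prompt or LLM issue",
}


def diagnose(scores: dict, threshold: float = 0.8) -> list:
    """List the likely problem area for every metric scoring below `threshold`."""
    return [
        f"{metric}={scores[metric]:.2f}: {cause}"
        for metric, cause in DIAGNOSES.items()
        if metric in scores and scores[metric] < threshold
    ]


example_scores = {
    "faithfulness": 0.93,
    "answer_relevancy": 0.88,
    "context_precision": 0.85,
    "context_recall": 0.61,
}
for line in diagnose(example_scores):
    print(line)  # context_recall=0.61: retrieval problem (missing documents)
```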

Q10: What makes a good golden dataset?

A: A golden dataset is only as good as its coverage and quality.

Coverage: Include examples from every important query type in your application. For a drug information system: interactions, contraindications, mechanisms, dosing, adverse effects, monitoring. Gaps in coverage mean you miss whole failure modes.

Diversity: Different phrasings of similar questions, different drugs, different complexity levels. A dataset of 200 warfarin questions doesn't tell you about other drugs.

Difficulty calibration: Include easy, medium, and hard questions. An all-easy dataset won't catch subtle model failures.

Ground truth quality: Reference answers must be correct and comprehensive. Errors in the golden dataset lead to wrong evaluation conclusions — you might reject a better model because it disagrees with wrong reference answers.

Refresh cadence: The golden dataset needs updating as your application evolves. Add failing cases from production (when users report problems), newly discovered edge cases, and new capability areas.

Size: 100–200 examples for a focused domain, 500+ for a broader application.
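
Coverage and difficulty balance can be checked mechanically before the dataset is trusted. A minimal sketch, assuming each example is a dict with hypothetical `category` and `difficulty` fields and that 10 examples per category is the minimum you want:

```python
from collections import Counter

# Query types taken from the drug-information example above.
REQUIRED_CATEGORIES = (
    "interactions", "contraindications", "mechanisms",
    "dosing", "adverse effects", "monitoring",
)


def coverage_gaps(examples: list, min_per_category: int = 10) -> dict:
    """Map each under-represented category to its current example count."""
    counts = Counter(example["category"] for example in examples)
    return {
        category: counts[category]
        for category in REQUIRED_CATEGORIES
        if counts[category] < min_per_category
    }


def difficulty_mix(examples: list) -> dict:
    """Share of examples at each difficulty level."""
    counts = Counter(example["difficulty"] for example in examples)
    return {level: counts[level] / len(examples) for level in ("easy", "medium", "hard")}


golden = (
    [{"category": "dosing", "difficulty": "easy"}] * 12
    + [{"category": "interactions", "difficulty": "hard"}] * 3
)
print(coverage_gaps(golden))
print(difficulty_mix(golden))
```

Here only dosing meets the minimum, so the check would flag the other five categories before any model is scored against this set.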

Q11: How do you measure calibration of an LLM's confidence?

A: Calibration: when the model says it's 90% confident, it should be right 90% of the time.

For classification-type outputs, extract logprobs:

Python

import math

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    logprobs=True,
    top_logprobs=5,
)

# Probability of the first token (for classification)
first_token_logprob = response.choices[0].logprobs.content[0].logprob
probability = math.exp(first_token_logprob)
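
Once each prediction has a probability and a correctness label, calibration can be summarized with expected calibration error (ECE), a common measure: bin predictions by confidence and average the gap between confidence and accuracy in each bin. A minimal sketch, assuming 10 equal-width bins:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy across equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the top bin
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(avg_conf - accuracy)
    return ece


# Says 90% and is right 9 times out of 10: well calibrated.
calibrated = expected_calibration_error([0.9] * 10, [True] * 9 + [False])
# Says 100% and is right half the time: badly over-confident.
overconfident = expected_calibration_error([1.0] * 4, [True, True, False, False])
print(round(calibrated, 3), round(overconfident, 3))
```

An ECE near 0 means stated confidence tracks accuracy; the over-confident case lands at 0.5, the full gap between 100% confidence and 50% accuracy.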

For generation, calibration is harder — the model rarely gives explicit confidence scores. Approaches: verbal elicitation ("How confident are you? 1-10"), multi-sample consistency (if 9/10 samples give the same answer, high confidence), or trained uncertainty estimators.
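
Multi-sample consistency is the easiest of these to implement. A minimal sketch, assuming the sampled answers have already been reduced to short comparable strings (for free text you would first need to cluster equivalent answers); the function name is hypothetical.

```python
from collections import Counter


def consistency_confidence(answers: list[str]):
    """Return (majority answer, share of samples agreeing with it)."""
    normalized = [a.strip().lower() for a in answers]
    answer, count = Counter(normalized).most_common(1)[0]
    return answer, count / len(normalized)


samples = ["Yes"] * 9 + ["no"]
print(consistency_confidence(samples))  # ('yes', 0.9)
```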

Medical AI consideration: a confidently wrong answer about drug interactions is worse than a correctly uncertain answer. Prefer models that say "I'm not certain about this specific drug combination" over ones that always sound authoritative.

Q12: System design — design an evaluation system for a medical AI assistant.

A: Requirements: the AI answers drug information questions. Evaluation must catch factual errors, hallucinations, and safety failures.

Components:

Offline eval suite (CI):

500-question golden dataset across 10 drug question categories
Expert pharmacist reviewed 100% of reference answers
LLM-as-judge scoring on a 5-dimension rubric that includes accuracy, completeness, and safety
Thresholds (scores normalized to 0–1): overall 0.85+, safety dimension 0.95+
Regression test suite: 30 safety scenarios that must all pass (binary)

RAGAS monitoring (continuous):

Run RAGAS on 100 sampled production queries weekly
Alert if faithfulness drops below 0.90 (hallucination signal)
Alert if context recall drops below 0.80 (retrieval signal)
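
The weekly alert check can be sketched as follows, using the two floors listed above. The row format (one dict of metric scores per sampled query) and the function name are assumptions.

```python
# Alert floors from the monitoring plan above.
ALERT_FLOORS = {"faithfulness": 0.90, "context_recall": 0.80}


def weekly_alerts(sample_scores: list) -> list:
    """Average each metric over the weekly sample and report any below its floor."""
    alerts = []
    for metric, floor in ALERT_FLOORS.items():
        values = [row[metric] for row in sample_scores]
        average = sum(values) / len(values)
        if average < floor:
            alerts.append(f"{metric} averaged {average:.2f}, below {floor:.2f}")
    return alerts


week = [
    {"faithfulness": 0.95, "context_recall": 0.90},
    {"faithfulness": 0.85, "context_recall": 0.85},
    {"faithfulness": 0.80, "context_recall": 0.80},
]
print(weekly_alerts(week))  # ['faithfulness averaged 0.87, below 0.90']
```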

Online feedback:

Pharmacist users rate responses (thumbs up/down + optional comment)
Weekly report: rating rate (share of responses rated), approval rate by category, free-text analysis

Human review:

Random 50 queries reviewed weekly by senior clinical pharmacist
Tracks categories: factual error, incomplete, correct
Monthly calibration of LLM judge against human scores
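
For the monthly judge calibration, two simple numbers go a long way: how often the judge matches the pharmacist exactly, and how far off it is on average. A minimal sketch on paired 1-5 scores; the function name and example scores are hypothetical.

```python
def judge_agreement(judge_scores: list[int], human_scores: list[int]) -> dict:
    """Return exact-agreement rate and mean absolute difference on paired scores."""
    if len(judge_scores) != len(human_scores) or not judge_scores:
        raise ValueError("need two equal-length, non-empty score lists")
    pairs = list(zip(judge_scores, human_scores))
    exact = sum(j == h for j, h in pairs) / len(pairs)
    mean_abs_diff = sum(abs(j - h) for j, h in pairs) / len(pairs)
    return {"exact_agreement": exact, "mean_abs_diff": mean_abs_diff}


print(judge_agreement([5, 4, 3, 2], [5, 3, 3, 4]))
# {'exact_agreement': 0.5, 'mean_abs_diff': 0.75}
```

If agreement drifts down month over month, that is a signal to revise the judge prompt or rubric before trusting the automated scores further.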

Alert escalation: Any single factual error in a safety-critical drug category (high-alert medications: warfarin, insulin, opioids) triggers immediate review and potential system pause.