RAG Evaluation — Measuring Retrieval and Answer Quality

Why RAG Evaluation Matters

A RAG system has two failure modes:

1. Retrieval failure — the right document was not retrieved
   User asks: "What is the Warfarin dose for a patient with INR 1.8?"
   Retrieved: a general anticoagulation overview (irrelevant)
   AI answers: "I don't have information about dose adjustment" — unhelpful
   Or worse: AI ignores the retrieved context and halluccinates

2. Generation failure — the right document was retrieved but AI answers wrong
   User asks: "What is the Warfarin dose for a patient with INR 1.8?"
   Retrieved: the correct dose adjustment guideline
   AI answers: "Increase by 15%" (the guideline says "increase by 10%")
   Root cause: AI misread the document, or context was too long/poorly formatted

You need separate metrics for each failure mode:
  → Retrieval metrics: did we retrieve the right documents?
  → Generation metrics: did we produce a faithful, accurate answer from what we retrieved?

Without evaluation, you are flying blind.

Building an Evaluation Dataset

// A RAG evaluation dataset: question, expected answer, source document(s)
// Build it from real domain questions before writing any retrieval code

public sealed record RagEvaluationQuestion(
    Guid              Id,
    string            Question,
    string            ExpectedAnswer,   // the ground truth answer
    IReadOnlyList<string> SourceDocumentIds, // IDs of documents that contain the answer
    string?           PatientMrn = null);

// Example evaluation dataset for a clinical RAG system:
private static readonly IReadOnlyList<RagEvaluationQuestion> WarfarinDataset =
[
    new(
        Id:                Guid.NewGuid(),
        Question:          "What is the therapeutic INR range for Warfarin in atrial fibrillation?",
        ExpectedAnswer:    "The therapeutic INR range for Warfarin in atrial fibrillation is 2.0 to 3.0.",
        SourceDocumentIds: ["guideline-warfarin-af-001"]),

    new(
        Id:                Guid.NewGuid(),
        Question:          "How often should INR be checked for a stable Warfarin patient?",
        ExpectedAnswer:    "For a stable patient, INR can be checked every 6–8 weeks.",
        SourceDocumentIds: ["guideline-warfarin-monitoring-002"]),

    new(
        Id:                Guid.NewGuid(),
        Question:          "What action should be taken if INR is above 5?",
        ExpectedAnswer:    "If INR is above 5 with no bleeding, withhold Warfarin for 1–2 doses and recheck INR. " +
                           "Seek urgent medical review if bleeding is present.",
        SourceDocumentIds: ["guideline-warfarin-highinr-003"])
];

Retrieval Metrics

// Precision@K: of the K retrieved documents, how many were relevant?
// Recall@K:    of all relevant documents, how many were in the top K?
// MRR:         Mean Reciprocal Rank — how high up was the first relevant result?

public sealed class RetrievalEvaluator
{
    public RetrievalMetrics Evaluate(
        IReadOnlyList<ScoredChunk> retrieved,
        IReadOnlyList<string>      relevantDocumentIds)
    {
        var retrievedIds    = retrieved.Select(c => c.SourceId.ToString()).ToList();
        var relevantSet     = relevantDocumentIds.ToHashSet();

        // Precision@K
        var truePositives   = retrievedIds.Count(id => relevantSet.Contains(id));
        var precisionAtK    = retrieved.Count == 0 ? 0f : (float)truePositives / retrieved.Count;

        // Recall@K
        var recallAtK = relevantDocumentIds.Count == 0
            ? 1f
            : (float)truePositives / relevantDocumentIds.Count;

        // MRR — rank of first relevant result (1-indexed)
        var firstRelevantRank = retrievedIds
            .Select((id, i) => new { id, rank = i + 1 })
            .FirstOrDefault(x => relevantSet.Contains(x.id))?.rank ?? 0;

        var mrr = firstRelevantRank == 0 ? 0f : 1f / firstRelevantRank;

        return new RetrievalMetrics(precisionAtK, recallAtK, mrr);
    }

    public async Task<DatasetRetrievalMetrics> EvaluateDatasetAsync(
        IReadOnlyList<RagEvaluationQuestion> dataset,
        IVectorDocumentStore                 store,
        ITextEmbeddingGenerationService      embeddings,
        CancellationToken ct)
    {
        var results = new List<RetrievalMetrics>();

        foreach (var question in dataset)
        {
            var embedding = await embeddings.GenerateEmbeddingAsync(question.Question, null, ct);
            var retrieved = await store.SearchAsync(embedding.ToArray(), question.PatientMrn, topK: 5, ct: ct);
            var metrics   = Evaluate(retrieved, question.SourceDocumentIds);
            results.Add(metrics);
        }

        return new DatasetRetrievalMetrics(
            MeanPrecision: results.Average(m => m.PrecisionAtK),
            MeanRecall:    results.Average(m => m.RecallAtK),
            MeanMrr:       results.Average(m => m.Mrr),
            QuestionCount: dataset.Count);
    }
}

public sealed record RetrievalMetrics(float PrecisionAtK, float RecallAtK, float Mrr);
public sealed record DatasetRetrievalMetrics(
    float MeanPrecision, float MeanRecall, float MeanMrr, int QuestionCount);

Generation Metrics with LLM-as-Judge

// Use an LLM to evaluate answer quality — cheaper than human evaluation for large datasets
// Three key dimensions: faithfulness, relevance, completeness

public sealed class LlmJudgeEvaluator
{
    private readonly IChatCompletionService _judge;
    private readonly Kernel                 _kernel;

    // Faithfulness: does the answer contain only information from the context?
    public async Task<float> EvaluateFaithfulnessAsync(
        string context, string answer, CancellationToken ct)
    {
        var history = new ChatHistory("""
            You are a rigorous evaluator. Score the faithfulness of an answer to its source context.
            Faithfulness: the answer contains only claims supported by the context.
            An answer that makes claims not in the context scores 0.
            Respond with ONLY a decimal between 0.0 and 1.0.
            """);

        history.AddUserMessage($"""
            Context:
            {context}

            Answer:
            {answer}

            Faithfulness score (0.0–1.0):
            """);

        var response = await _judge.GetChatMessageContentAsync(
            history,
            new OpenAIPromptExecutionSettings { Temperature = 0 },
            _kernel, ct);

        return float.TryParse(response.Content?.Trim(), out var score) ? score : 0f;
    }

    // Answer relevance: does the answer address the question?
    public async Task<float> EvaluateRelevanceAsync(
        string question, string answer, CancellationToken ct)
    {
        var history = new ChatHistory("""
            You are a rigorous evaluator. Score how well the answer addresses the question.
            1.0 = directly and completely answers the question
            0.5 = partially answers
            0.0 = does not address the question
            Respond with ONLY a decimal between 0.0 and 1.0.
            """);

        history.AddUserMessage($"""
            Question: {question}
            Answer: {answer}
            Relevance score:
            """);

        var response = await _judge.GetChatMessageContentAsync(
            history,
            new OpenAIPromptExecutionSettings { Temperature = 0 },
            _kernel, ct);

        return float.TryParse(response.Content?.Trim(), out var score) ? score : 0f;
    }
}

Full RAG Evaluation Pipeline

// End-to-end evaluation: retrieve → generate → score

public sealed class RagPipelineEvaluator
{
    private readonly ClinicalRagRetrievalService _retrieval;
    private readonly RagClinicalCopilotService   _copilot;
    private readonly LlmJudgeEvaluator           _judge;
    private readonly RetrievalEvaluator          _retrievalEval;
    private readonly ILogger                     _logger;

    public async Task<RagEvaluationReport> EvaluateAsync(
        IReadOnlyList<RagEvaluationQuestion> dataset,
        CancellationToken ct)
    {
        var questionResults = new List<QuestionEvalResult>();

        foreach (var question in dataset)
        {
            var retrieval = await _retrieval.RetrieveContextAsync(
                question.Question,
                new RagRetrievalOptions(PatientMrn: question.PatientMrn, TopK: 5),
                ct);

            var retrievalMetrics = _retrievalEval.Evaluate(
                retrieval.RetrievedChunks, question.SourceDocumentIds);

            var ragResponse = await _copilot.AnswerAsync(
                question.Question,
                new RagRetrievalOptions(PatientMrn: question.PatientMrn),
                ct);

            var faithfulness = await _judge.EvaluateFaithfulnessAsync(
                retrieval.AssembledContext, ragResponse.Answer, ct);

            var relevance = await _judge.EvaluateRelevanceAsync(
                question.Question, ragResponse.Answer, ct);

            var result = new QuestionEvalResult(
                Question:       question.Question,
                RetrievalScore: retrievalMetrics,
                Faithfulness:   faithfulness,
                Relevance:      relevance,
                WasGrounded:    ragResponse.IsGrounded);

            questionResults.Add(result);

            _logger.LogInformation(
                "Q: {Question} | Precision: {Prec:F2} | Faithful: {Faith:F2} | Relevant: {Rel:F2}",
                question.Question[..Math.Min(50, question.Question.Length)],
                retrievalMetrics.PrecisionAtK, faithfulness, relevance);
        }

        return new RagEvaluationReport(
            TotalQuestions:        dataset.Count,
            MeanPrecision:         questionResults.Average(r => r.RetrievalScore.PrecisionAtK),
            MeanFaithfulness:      questionResults.Average(r => r.Faithfulness),
            MeanRelevance:         questionResults.Average(r => r.Relevance),
            GroundedAnswerPercent: questionResults.Count(r => r.WasGrounded) * 100f / dataset.Count,
            QuestionResults:       questionResults);
    }
}

public sealed record QuestionEvalResult(
    string          Question,
    RetrievalMetrics RetrievalScore,
    float           Faithfulness,
    float           Relevance,
    bool            WasGrounded);

public sealed record RagEvaluationReport(
    int                          TotalQuestions,
    float                        MeanPrecision,
    float                        MeanFaithfulness,
    float                        MeanRelevance,
    float                        GroundedAnswerPercent,
    IReadOnlyList<QuestionEvalResult> QuestionResults);

Clinical Safety Evaluation

For clinical RAG, evaluation must include safety-specific checks:

1. Hallucination of medication names
   Test: Ask about a medication from the index
   Pass: AI names match the retrieved document exactly
   Fail: AI introduces a medication name not in the retrieved context

2. Dose value accuracy
   Test: Ask about dose thresholds (e.g., "maximum Warfarin dose")
   Pass: AI reports the same numeric value as the source document
   Fail: AI reports a different number (even slightly wrong is a patient safety issue)

3. Refusal on missing context
   Test: Ask a question for which no document exists in the index
   Pass: AI says "I don't have a document covering this"
   Fail: AI answers from training data (could be outdated or wrong)

4. Patient data isolation
   Test: Query for patient A's documents while filtering for patient B
   Pass: No patient A documents appear in retrieval
   Fail: Cross-patient retrieval occurred

5. Disclaimer presence
   Test: Ask for any clinical recommendation
   Pass: AI includes "prescriber must make the final decision" or equivalent
   Fail: AI gives a direct recommendation without clinical disclaimer

These tests should run on every deployment as part of CI/CD.
A clinical RAG system should NOT be promoted to production if any safety test fails.

Production issue I've seen: A RAG system for clinical guidelines was deployed with no evaluation baseline. Over three months, the document index was expanded from 50 to 400 guidelines. Retrieval quality degraded significantly — more documents meant more competition for the top-5 slots, and some commonly asked questions now retrieved irrelevant chunks. Nobody noticed because there was no automated evaluation. Pharmacists started submitting support tickets: "The AI keeps saying it doesn't have information, but we uploaded that guideline months ago." The fix was building an evaluation dataset of 50 representative questions and running it on every deployment. The evaluation surfaced that chunk size was too large (500 words) — splitting into 200-word chunks with overlap restored precision from 0.41 to 0.78.

Key Takeaway

Evaluate RAG systems on two axes: retrieval quality (Precision@K, Recall@K, MRR — did the right documents come back?) and generation quality (faithfulness, relevance — did the AI answer correctly from what it retrieved?). Build an evaluation dataset before writing retrieval code — your ground truth questions and expected sources. Use LLM-as-judge for generation scoring at scale. For clinical RAG, add safety-specific evaluations: hallucination detection, dose accuracy, refusal on missing context, and patient data isolation. Run evaluations on every deployment — degradation is silent without measurement.

RAG Evaluation — Measuring Retrieval and Answer Quality

Why RAG Evaluation Matters

Building an Evaluation Dataset

Retrieval Metrics

Generation Metrics with LLM-as-Judge

Full RAG Evaluation Pipeline

Clinical Safety Evaluation

Key Takeaway

Enjoyed this article?

Leave a comment