Learnixo

RAGAS: Faithfulness, Relevance, Completeness

Why RAG Needs Its Own Evaluation

Standard LLM metrics (BLEU, ROUGE, perplexity) don't capture RAG-specific failures:

RAG failure modes:
  1. Faithfulness: answer contradicts or goes beyond the retrieved context
     ("hallucination from context")
  
  2. Context precision: retrieved documents contain irrelevant noise
     ("retrieved the wrong documents")
  
  3. Context recall: relevant documents were not retrieved
     ("missed the key documents")
  
  4. Answer relevancy: answer doesn't address the question
     ("answered a different question")

A high BLEU score says nothing about whether the answer is grounded
in the retrieved context or whether you retrieved the right documents.
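
To make this concrete, here is a minimal sketch using a token-overlap F1 as a simplified stand-in for ROUGE-1 (an assumption for illustration, not the official ROUGE implementation). The `grounded` and `hallucinated` answers are made-up examples. The answer with the wrong INR range shares almost every token with the reference, so it scores higher than the correct paraphrase.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1 on whitespace tokens (simplified ROUGE-1 stand-in)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # tokens shared by both texts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "The INR target range for AF patients on Warfarin is 2.0-3.0."
# Hypothetical answers: one correct paraphrase, one with a wrong range.
grounded = "For AF patients on Warfarin, aim for an INR between 2.0 and 3.0."
hallucinated = "The INR target range for AF patients on Warfarin is 3.5-4.5."

score_grounded = unigram_f1(grounded, reference)
score_hallucinated = unigram_f1(hallucinated, reference)
print(f"grounded paraphrase: {score_grounded:.2f}")
print(f"wrong range:         {score_hallucinated:.2f}")
```

The overlap metric rewards surface similarity, which is exactly the property that makes it blind to a wrong number.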

RAGAS Metrics

RAGAS (Retrieval Augmented Generation Assessment, Es et al., 2023) defines four core metrics:

1. Faithfulness [0, 1]:
   Is every claim in the answer supported by the retrieved context?
   High faithfulness = no hallucination beyond the context

2. Answer Relevance [0, 1]:
   Does the answer address the question?
   Measures if the answer is on-topic, not just factually correct

3. Context Precision [0, 1]:
   Are the retrieved chunks relevant to the question, and do the relevant ones rank first?
   High precision = mostly useful documents, with the useful ones near the top

4. Context Recall [0, 1] (requires ground truth):
   Was all the necessary information present in the retrieved context?
   High recall = didn't miss key information

RAGAS Implementation

Python
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

# Prepare evaluation dataset.
# Column names follow the ragas 0.1.x schema; later releases may expect different names.
samples = [
    {
        "question": "What is the INR target range for AF patients on Warfarin?",
        "answer": "The recommended INR target for AF patients is 2.0-3.0.",
        "contexts": [
            "Warfarin dosing for atrial fibrillation: the therapeutic INR range is 2.0-3.0. "
            "Higher targets (2.5-3.5) may be appropriate for mechanical heart valves.",
            "Anticoagulation with vitamin K antagonists in AF reduces stroke risk by 64%."
        ],
        "ground_truth": "The INR target range for AF patients on Warfarin is 2.0-3.0."
    },
    # More samples...
]

dataset = Dataset.from_list(samples)

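# evaluate() calls an LLM judge and an embedding model, so credentials for the
# configured provider (OpenAI by default) must be available in the environment.
result = evaluate(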
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall]
)
print(result)
# Example output (illustrative values):
# {'faithfulness': 0.92, 'answer_relevancy': 0.88,
#  'context_precision': 0.71, 'context_recall': 0.85}

How Each Metric Is Computed

Faithfulness:

1. LLM breaks the answer into atomic claims (shown here for an answer that
   also mentions mechanical valves):
   Claim 1: "The INR target for AF is 2.0-3.0"
   Claim 2: "Higher targets may be used for mechanical valves"

2. LLM checks each claim against the retrieved context:
   Claim 1: supported by context 1 ✓
   Claim 2: supported by context 1 ✓

3. Score = supported_claims / total_claims
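
The three steps above can be sketched as follows. This is a minimal illustration, not the ragas implementation: the claims are written by hand instead of being extracted by an LLM, and `toy_judge` is a hypothetical stand-in that treats a claim as supported only if it appears verbatim in a context chunk. The third claim is made up to show an unsupported case.

```python
from typing import Callable, List


def faithfulness_score(
    claims: List[str],
    contexts: List[str],
    is_supported: Callable[[str, List[str]], bool],
) -> float:
    """Fraction of answer claims that the judge finds supported by the contexts."""
    if not claims:
        return 0.0
    supported = sum(1 for claim in claims if is_supported(claim, contexts))
    return supported / len(claims)


def toy_judge(claim: str, contexts: List[str]) -> bool:
    """Hypothetical stand-in for the LLM verdict: verbatim substring match."""
    return any(claim.lower() in chunk.lower() for chunk in contexts)


contexts = [
    "Warfarin dosing for atrial fibrillation: the therapeutic INR range is 2.0-3.0. "
    "Higher targets (2.5-3.5) may be appropriate for mechanical heart valves.",
]
claims = [
    "the therapeutic INR range is 2.0-3.0",
    "Higher targets (2.5-3.5) may be appropriate for mechanical heart valves",
    "Warfarin should be paused before surgery",  # not in the context
]

faithfulness = faithfulness_score(claims, contexts, toy_judge)
print(f"faithfulness = {faithfulness:.2f}")
```

Passing the judge in as a function keeps the scoring arithmetic separate from the verdict, so an LLM-backed judge could replace `toy_judge` without touching the formula.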

Answer Relevance:

1. LLM generates n questions that the answer would address
2. Embed the generated questions and the original question
3. Score = mean cosine similarity between original and generated questions
   (If the answer is on-topic, the generated questions match the original)
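
A rough sketch of the scoring step, under stated assumptions: `embed` is a toy bag-of-words vector standing in for a real embedding model, and the generated questions are written by hand instead of being produced by an LLM. Only the mean-cosine-similarity arithmetic matches the description above.

```python
import math
import re
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real pipeline would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def answer_relevance(question: str, generated_questions: List[str]) -> float:
    """Mean cosine similarity between the original and the generated questions."""
    if not generated_questions:
        return 0.0
    q_vec = embed(question)
    sims = [cosine(q_vec, embed(g)) for g in generated_questions]
    return sum(sims) / len(sims)


question = "What is the INR target range for AF patients on Warfarin?"
# Hand-written stand-ins for questions an LLM might derive from two answers.
from_on_topic_answer = [
    "What INR range is targeted for AF patients on Warfarin?",
    "Which INR target applies to AF patients taking Warfarin?",
]
from_off_topic_answer = [
    "How does vitamin K affect blood clotting?",
    "Which foods are rich in vitamin K?",
]

relevant = answer_relevance(question, from_on_topic_answer)
off_topic = answer_relevance(question, from_off_topic_answer)
print(f"on-topic: {relevant:.2f}, off-topic: {off_topic:.2f}")
```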

Context Precision:

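1. LLM judges whether each retrieved chunk is relevant to the question
2. Simple view: score = (relevant retrieved chunks) / (total retrieved chunks)
3. RAGAS itself is rank-aware: it averages precision@k over the positions k
   that hold a relevant chunk, so relevant chunks buried low in the ranking score worse
Penalises retrieval of noisy, irrelevant documents, especially near the top of the ranking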
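
The precision calculation can be sketched in two ways: the plain relevant/total ratio, and a rank-weighted variant (the mean of precision@k over the ranks that hold a relevant chunk), which is how the ragas documentation describes the metric at the time of writing. The relevance flags are hard-coded here as an assumption; in practice an LLM judge would produce them.

```python
from typing import List


def plain_precision(relevance: List[bool]) -> float:
    """Relevant chunks divided by total chunks, ignoring rank."""
    return sum(relevance) / len(relevance) if relevance else 0.0


def rank_weighted_precision(relevance: List[bool]) -> float:
    """Mean of precision@k taken at each rank k that holds a relevant chunk."""
    hits = 0
    total = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / k  # precision@k at this relevant position
    return total / hits if hits else 0.0


# Same two chunks, different order: relevance flags per rank (assumed verdicts).
relevant_first = [True, False]
relevant_last = [False, True]

for name, flags in [("relevant first", relevant_first), ("relevant last", relevant_last)]:
    print(name, plain_precision(flags), rank_weighted_precision(flags))
```

The plain ratio gives both orderings 0.5, while the rank-weighted variant separates them (1.0 versus 0.5), which matters when the generator pays most attention to the first chunks.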

Context Recall:

1. Break ground truth answer into atomic claims
2. Check which claims can be attributed to the retrieved context
3. Score = attributable_claims / total_ground_truth_claims
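
Context recall uses the same ratio as faithfulness, but the claims come from the ground truth rather than from the generated answer. A minimal sketch, assuming hand-written claims and a hypothetical keyword check in place of the LLM attribution step; the second ground-truth claim is invented to show a retrieval gap.

```python
from typing import Dict, List


def context_recall_score(claim_keywords: Dict[str, List[str]], contexts: List[str]) -> float:
    """Fraction of ground-truth claims attributable to the retrieved context.

    A claim counts as attributable when all of its keywords occur in the
    context (a toy substitute for an LLM attribution verdict).
    """
    if not claim_keywords:
        return 0.0
    joined = " ".join(contexts).lower()
    attributable = sum(
        1
        for keywords in claim_keywords.values()
        if all(keyword.lower() in joined for keyword in keywords)
    )
    return attributable / len(claim_keywords)


contexts = [
    "Warfarin dosing for atrial fibrillation: the therapeutic INR range is 2.0-3.0.",
    "Anticoagulation with vitamin K antagonists in AF reduces stroke risk by 64%.",
]
# Ground-truth claims mapped to the keywords the toy check looks for.
ground_truth_claims = {
    "The INR target range for AF patients is 2.0-3.0": ["INR", "2.0-3.0"],
    "INR should be monitored regularly": ["monitor"],  # hypothetical, not retrieved
}

recall = context_recall_score(ground_truth_claims, contexts)
print(f"context recall = {recall:.2f}")
```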

Interpreting RAGAS Scores

The bands below are rules of thumb rather than thresholds defined by RAGAS; calibrate them against your own data.

Faithfulness:
  < 0.7: high hallucination risk — model goes beyond retrieved context
  0.7-0.85: acceptable for most uses, some unsupported claims
  > 0.85: well-grounded answers

Answer Relevance:
  < 0.7: answers often off-topic or too vague
  > 0.85: consistently on-topic

Context Precision:
  < 0.6: retrieval is noisy — many irrelevant chunks in context
  > 0.8: clean retrieval

Context Recall:
  < 0.7: retrieval missing key information
  > 0.85: retrieval covers the necessary content

Diagnostic combinations:
  Low faithfulness, high recall: model hallucinates even with good context
  High faithfulness, low recall: model stays within context but context is incomplete
  Low precision, high recall: retrieving too many irrelevant documents (noise)
  Low precision, low faithfulness: both retrieval and generation need fixing
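
The diagnostic combinations can be turned into a small triage helper. This is a sketch under assumptions: the `diagnose` function and its messages are illustrative, and the cut-offs are simply the low/high bands listed above, not values prescribed by the library.

```python
from typing import Dict, List

# Cut-offs taken from the interpretation bands above.
LOW = {"faithfulness": 0.7, "context_precision": 0.6, "context_recall": 0.7}
HIGH = {"faithfulness": 0.85, "context_recall": 0.85}


def diagnose(scores: Dict[str, float]) -> List[str]:
    """Map a set of RAGAS scores to the diagnostic combinations they match."""

    def low(metric: str) -> bool:
        return scores[metric] < LOW[metric]

    def high(metric: str) -> bool:
        return scores[metric] > HIGH[metric]

    findings = []
    if low("faithfulness") and high("context_recall"):
        findings.append("generation: model hallucinates even with good context")
    if high("faithfulness") and low("context_recall"):
        findings.append("retrieval: answers are grounded but context is incomplete")
    if low("context_precision") and high("context_recall"):
        findings.append("retrieval: too many irrelevant documents (noise)")
    if low("context_precision") and low("faithfulness"):
        findings.append("both retrieval and generation need fixing")
    return findings


# Hypothetical scores for a pipeline with noisy retrieval and weak grounding.
problem_run = {"faithfulness": 0.62, "context_precision": 0.55, "context_recall": 0.90}
healthy_run = {"faithfulness": 0.92, "context_precision": 0.71, "context_recall": 0.85}

problem_findings = diagnose(problem_run)
healthy_findings = diagnose(healthy_run)
for finding in problem_findings:
    print(finding)
```

Several combinations can fire at once, which is why the helper returns a list instead of a single label.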

Clinical-Specific RAGAS Extensions

For medical AI, extend RAGAS with domain-specific checks:

Python
import re
from dataclasses import dataclass

from ragas.metrics.base import Metric

# Phrases that signal a referral to a healthcare professional.
DISCLAIMER_PHRASES = ("consult", "physician", "doctor", "healthcare provider", "pharmacist")
# A number followed by "mg", e.g. "5 mg" or "5mg" (matched on lower-cased text).
DOSE_PATTERN = re.compile(r"\d+\s*mg")

@dataclass
class ClinicalSafetyScore(Metric):
    """Heuristic check that the answer avoids direct clinical advice.

    Sketch only: the members a Metric subclass must define depend on the
    installed ragas version, so check the base class before using this.
    """
    name = "clinical_safety"

    async def _ascore(self, row, callbacks) -> float:
        answer = row.get("answer", "").lower()
        # Does the answer recommend consulting a professional?
        has_disclaimer = any(phrase in answer for phrase in DISCLAIMER_PHRASES)
        # Does it state a specific dose?
        has_specific_dose = bool(DOSE_PATTERN.search(answer))

        if has_specific_dose:
            return 0.0  # Fail: gave a specific dosage
        if has_disclaimer:
            return 1.0  # Pass: has an appropriate disclaimer
        return 0.5      # Neutral: no dose, no disclaimer
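
To see how the heuristic behaves without installing ragas, the same rules can be applied as a plain function. This is a standalone sketch; `clinical_safety_score` and the sample answers are illustrative and not part of any library.

```python
import re

DISCLAIMER_PHRASES = ("consult", "physician", "doctor", "healthcare provider", "pharmacist")


def clinical_safety_score(answer: str) -> float:
    """Same rules as the metric above: dose -> 0.0, disclaimer -> 1.0, else 0.5."""
    text = answer.lower()
    if re.search(r"\d+\s*mg", text):
        return 0.0
    if any(phrase in text for phrase in DISCLAIMER_PHRASES):
        return 1.0
    return 0.5


# Made-up sample answers covering each branch.
sample_answers = [
    "Take 5 mg of Warfarin daily.",
    "Please consult your physician about Warfarin dosing.",
    "Warfarin is a vitamin K antagonist.",
    "Take five milligrams of Warfarin daily.",  # dose spelled out in words
]
safety_scores = [clinical_safety_score(answer) for answer in sample_answers]
for answer, score in zip(sample_answers, safety_scores):
    print(f"{score:.1f}  {answer}")
```

The last sample shows the limit of a regex-based check: a dose written in words is scored as neutral, so a heuristic like this complements an LLM or human review and does not replace it.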

Interview Answer

"RAGAS is a widely used framework for evaluating RAG pipelines. It defines four metrics: faithfulness (are all answer claims supported by the context?), answer relevance (does the answer address the question?), context precision (are the retrieved chunks relevant, and are the relevant ones ranked first?), and context recall (was all necessary information present in the retrieved context?). Each metric relies on an LLM judge, and answer relevance additionally uses embeddings. Typical target thresholds: faithfulness > 0.85 for clinical applications (low hallucination risk), context precision > 0.8 (clean retrieval), context recall > 0.85 (not missing key information). The combination of scores tells you which component of the pipeline needs improvement: low faithfulness → generation problem; low recall → retrieval problem; low precision → chunking or retrieval strategy problem."