Advanced RAG · Lesson 12 of 14
RAGAS: Faithfulness, Relevance, Completeness
Why RAG Needs Its Own Evaluation
Standard LLM metrics (BLEU, ROUGE, perplexity) don't capture RAG-specific failures:
RAG failure modes:
1. Faithfulness: answer contradicts or goes beyond the retrieved context
("hallucination from context")
2. Context precision: retrieved documents contain irrelevant noise
("retrieved the wrong documents")
3. Context recall: relevant documents were not retrieved
("missed the key documents")
4. Answer relevancy: answer doesn't address the question
("answered a different question")
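To make the gap concrete, here is a minimal sketch using a hand-rolled unigram-overlap F1. It is a simplified stand-in for ROUGE-1, not an official implementation, and the function name and example strings are illustrative assumptions. An answer that gets the one critical number wrong still scores about 0.91:

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1 between two strings (simplified stand-in for ROUGE-1)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # tokens shared by both strings
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "The INR target range for AF patients on Warfarin is 2.0-3.0."
grounded = "The INR target range for AF patients on Warfarin is 2.0-3.0."
# Hypothetical hallucinated answer: only the range differs from the reference.
hallucinated = "The INR target range for AF patients on Warfarin is 3.5-4.5."

grounded_score = unigram_f1(grounded, reference)          # 1.0
hallucinated_score = unigram_f1(hallucinated, reference)  # 10/11, about 0.91
print(f"grounded={grounded_score:.2f} hallucinated={hallucinated_score:.2f}")
```

Ten of the eleven tokens match, so the overlap metric barely moves even though the answer is clinically wrong.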
A high BLEU score says nothing about whether the answer is grounded
in the retrieved context or whether you retrieved the right documents.
RAGAS Metrics
RAGAS (Retrieval Augmented Generation Assessment; Es et al., 2023) defines four core metrics:
1. Faithfulness [0, 1]:
Is every claim in the answer supported by the retrieved context?
High faithfulness = no hallucination beyond the context
2. Answer Relevance [0, 1]:
Does the answer address the question?
Measures if the answer is on-topic, not just factually correct
3. Context Precision [0, 1]:
How many of the retrieved chunks were actually relevant?
High precision = retrieved mostly useful documents
4. Context Recall [0, 1] (requires ground truth):
Was all the necessary information present in the retrieved context?
High recall = didn't miss key information
RAGAS Implementation
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

# Prepare evaluation dataset.
# Column names follow the ragas 0.1-style schema; newer releases may expect
# different names, so check the version you have installed.
samples = [
    {
        "question": "What is the INR target range for AF patients on Warfarin?",
        "answer": "The recommended INR target for AF patients is 2.0-3.0.",
        "contexts": [
            "Warfarin dosing for atrial fibrillation: the therapeutic INR range is 2.0-3.0. "
            "Higher targets (2.5-3.5) may be appropriate for mechanical heart valves.",
            "Anticoagulation with vitamin K antagonists in AF reduces stroke risk by 64%.",
        ],
        "ground_truth": "The INR target range for AF patients on Warfarin is 2.0-3.0.",
    },
    # More samples...
]
dataset = Dataset.from_list(samples)

# evaluate() calls an LLM judge and an embedding model, so credentials for the
# configured provider must be available.
result = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)
# Illustrative output over a larger sample set:
# {'faithfulness': 0.92, 'answer_relevancy': 0.88,
#  'context_precision': 0.71, 'context_recall': 0.85}
How Each Metric Is Computed
Faithfulness:
1. LLM breaks the answer into atomic claims. For an answer such as
"The INR target for AF is 2.0-3.0; higher targets may be used for mechanical valves":
Claim 1: "The INR target for AF is 2.0-3.0"
Claim 2: "Higher targets may be used for mechanical valves"
2. LLM checks each claim against the retrieved context:
Claim 1: supported by context 1 ✓
Claim 2: supported by context 1 ✓
3. Score = supported_claims / total_claims (here 2 / 2 = 1.0)
Answer Relevance:
1. LLM generates n questions that the answer would address
2. Embed the generated questions and the original question
3. Score = mean cosine similarity between original and generated questions
(If the answer is on-topic, the generated questions match the original)
Context Precision:
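Simplified: Score = (relevant retrieved chunks) / (total retrieved chunks)
Uses an LLM to judge the relevance of each chunk to the question
The RAGAS implementation is rank-aware: it averages precision@k over the ranks k that hold a relevant chunk, so relevant chunks ranked near the top score higher
Penalises retrieval of noisy, irrelevant documents
Context Recall: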
1. Break ground truth answer into atomic claims
2. Check which claims can be attributed to the retrieved context
3. Score = attributable_claims / total_ground_truth_claims
Interpreting RAGAS Scores
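Scores are easier to interpret with the arithmetic in view. The sketch below reproduces the computations from the previous section in plain Python, with the LLM verdicts replaced by hand-labelled booleans and the embeddings replaced by toy 2-D vectors. All function names are illustrative and none of this is the RAGAS library's own code. The rank-aware precision variant reflects my understanding of how the library weights chunk order; treat it as an assumption.

```python
import math


def claim_ratio(verdicts: list[bool]) -> float:
    """Supported (or attributable) claims divided by total claims.

    Faithfulness and context recall both reduce to this ratio; they differ
    only in whether the claims come from the answer or from the ground truth.
    """
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


def simple_precision(relevant: list[bool]) -> float:
    """Relevant retrieved chunks divided by total retrieved chunks."""
    return sum(relevant) / len(relevant) if relevant else 0.0


def rank_aware_precision(relevant: list[bool]) -> float:
    """Mean of precision@k over the ranks k that hold a relevant chunk."""
    hits, total = 0, 0.0
    for k, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            hits += 1
            total += hits / k  # precision@k at this relevant rank
    return total / hits if hits else 0.0


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def answer_relevance(question_vec: list[float], generated_vecs: list[list[float]]) -> float:
    """Mean cosine similarity between the original and generated questions."""
    return sum(cosine(question_vec, g) for g in generated_vecs) / len(generated_vecs)


faithfulness_score = claim_ratio([True, True, False])     # 2 of 3 claims supported
recall_score = claim_ratio([True])                        # the one ground-truth claim is covered
relevant_first = rank_aware_precision([True, False])      # relevant chunk at rank 1
relevant_last = rank_aware_precision([False, True])       # relevant chunk at rank 2
relevance_score = answer_relevance([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

Both orderings have a simple precision of 0.5, but the rank-aware version gives 1.0 when the relevant chunk comes first and 0.5 when it comes last, which is why reranking can move context precision without changing what was retrieved.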
Faithfulness:
< 0.7: high hallucination risk — model goes beyond retrieved context
0.7-0.85: acceptable for most uses, some unsupported claims
> 0.85: well-grounded answers
Answer Relevance:
< 0.7: answers often off-topic or too vague
> 0.85: consistently on-topic
Context Precision:
< 0.6: retrieval is noisy — many irrelevant chunks in context
> 0.8: clean retrieval
Context Recall:
< 0.7: retrieval missing key information
> 0.85: retrieval covers the necessary content
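The bands above can be turned into a small lookup for triaging an evaluation run. This is a sketch only: the cut-offs are the rules of thumb listed above, while the band labels ("poor", "borderline", "good") and the function name are assumptions made for illustration.

```python
# (lower, upper) cut-offs per metric, taken from the bands listed above.
BANDS = {
    "faithfulness": (0.70, 0.85),
    "answer_relevancy": (0.70, 0.85),
    "context_precision": (0.60, 0.80),
    "context_recall": (0.70, 0.85),
}


def band(metric: str, score: float) -> str:
    """Map a score to a coarse band using the rule-of-thumb cut-offs."""
    low, high = BANDS[metric]
    if score < low:
        return "poor"
    if score > high:
        return "good"
    return "borderline"


# Scores from the example evaluate() output earlier in this lesson.
scores = {
    "faithfulness": 0.92,
    "answer_relevancy": 0.88,
    "context_precision": 0.71,
    "context_recall": 0.85,
}
report = {metric: band(metric, value) for metric, value in scores.items()}
print(report)
```

With these inputs, faithfulness and answer relevancy land in "good", while context precision (0.71) and context recall (exactly 0.85, not above it) land in "borderline".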
Diagnostic combinations:
Low faithfulness, high recall: model hallucinates even with good context
High faithfulness, low recall: model stays within context but context is incomplete
Low precision, high recall: retrieving too many irrelevant documents (noise)
Low precision, low faithfulness: both retrieval and generation need fixing
Clinical-Specific RAGAS Extensions
For medical AI, extend RAGAS with domain-specific checks:
import re
from dataclasses import dataclass

from ragas.metrics.base import Metric

# Phrases that signal a referral to a clinician
DISCLAIMER_PHRASES = ["consult", "physician", "doctor", "healthcare provider", "pharmacist"]


@dataclass
class ClinicalSafetyScore(Metric):
    """Checks if the answer appropriately avoids direct clinical advice."""
    name = "clinical_safety"
    # Depending on the installed ragas version, the Metric base class may
    # require further members before this class can be instantiated.

    async def _ascore(self, row, callbacks) -> float:
        answer = row.get("answer", "").lower()
        # Does the answer recommend consulting a physician?
        has_disclaimer = any(phrase in answer for phrase in DISCLAIMER_PHRASES)
        # Does it avoid specific dosage recommendations (e.g. "5 mg")?
        has_specific_dose = bool(re.search(r"\d+\s*mg", answer))
        if has_specific_dose:
            return 0.0  # Fail: gave specific dosage
        if has_disclaimer:
            return 1.0  # Pass: has appropriate disclaimer
        return 0.5  # Neutral: no dose, no disclaimer
Interview Answer
"RAGAS is a widely used framework for evaluating RAG pipelines. It defines four metrics: faithfulness (are all answer claims supported by the context?), answer relevance (does the answer address the question?), context precision (what fraction of retrieved chunks were relevant?), and context recall (was all necessary information present in the retrieved context?). Each metric relies on an LLM as the judge, and answer relevance additionally uses embeddings. Typical target thresholds: faithfulness > 0.85 for clinical applications (low hallucination risk), context precision > 0.7 (clean retrieval), context recall > 0.8 (not missing key information). The diagnostic combination of scores tells you which component of the pipeline needs improvement: low faithfulness → generation problem; low recall → retrieval problem; low precision → chunking or retrieval strategy problem."
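The diagnostic mapping at the end of that answer can be sketched as a small helper. The targets are the ones quoted in the answer; the function name, the hint wording, and the example scores are illustrative assumptions.

```python
# Target thresholds quoted in the interview answer.
TARGETS = {"faithfulness": 0.85, "context_precision": 0.70, "context_recall": 0.80}

# Which pipeline component a low score points to.
HINTS = {
    "faithfulness": "generation: answers go beyond the retrieved context",
    "context_recall": "retrieval: key information is not being retrieved",
    "context_precision": "chunking or retrieval strategy: too much noise in context",
}


def diagnose(scores: dict[str, float], targets: dict[str, float] = TARGETS) -> list[str]:
    """Return a hint for every metric that falls below its target."""
    return [
        HINTS[metric]
        for metric in HINTS
        if metric in scores and scores[metric] < targets[metric]
    ]


# Hypothetical run: good retrieval, but the model hallucinates.
generation_issue = diagnose(
    {"faithfulness": 0.62, "context_precision": 0.75, "context_recall": 0.90}
)
# Hypothetical run: noisy retrieval and hallucination together.
both_issues = diagnose(
    {"faithfulness": 0.60, "context_precision": 0.50, "context_recall": 0.90}
)
print(generation_issue)
print(both_issues)
```

The first run flags only the generation step; the second flags both generation and chunking or retrieval strategy, matching the "both need fixing" combination.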