RAG Systems · Lesson 23 of 24
RAGAS: Automated RAG Evaluation Framework
What RAGAS Is
RAGAS (Retrieval Augmented Generation Assessment) is an open-source framework for evaluating RAG pipelines. It provides four core metrics, most of which are reference-free and need no human-labelled answers:
Faithfulness: Does the answer stick to the retrieved context?
Answer Relevance: Does the answer address the question?
Context Precision: Are the retrieved chunks relevant to the question?
Context Recall: Does the retrieved context cover the answer? (Requires a reference answer.)
"Reference-free" means RAGAS uses an LLM (typically GPT-4 or Claude) as a judge to compute the metric, so no manually labelled answer dataset is required.
The Four Metrics in Detail
Faithfulness (0–1):
Each factual claim in the answer is extracted
Each claim is checked: is it supported by the retrieved context?
Score = supported claims / total claims
Example:
Answer: "Warfarin is contraindicated in pregnancy. Dose is 5mg daily."
Context: Only mentions contraindication in pregnancy (not dose)
Faithfulness = 1/2 = 0.50 ← poor
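The ratio above can be sketched in a few lines. This is a minimal illustration, not RAGAS's implementation: in RAGAS an LLM extracts the claims and produces the verdicts, whereas here the verdicts are hard-coded and the function name `faithfulness_score` is hypothetical.

```python
def faithfulness_score(claim_supported: list[bool]) -> float:
    """Fraction of answer claims that the retrieved context supports."""
    if not claim_supported:
        raise ValueError("the answer must contain at least one claim")
    return sum(claim_supported) / len(claim_supported)


# Verdicts for the warfarin example: the contraindication claim is supported
# by the context, the dose claim is not (hard-coded here, LLM-judged in RAGAS).
verdicts = [True, False]
print(faithfulness_score(verdicts))  # 0.5
```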
Answer Relevance (0–1):
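Generate N questions from the answer alone (the questions the answer appears to be responding to)
Embed those questions and the original query, then compute the cosine similarity of each pair
Score = mean cosine similarity across the N generated questions
Penalises: incomplete answers, off-topic answers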
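The scoring step can be sketched as follows, assuming the questions have already been generated and embedded. The vectors are toy 2-dimensional stand-ins for real embeddings, and the helper names are hypothetical rather than part of the RAGAS API.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def answer_relevance_score(
    query_vec: list[float],
    generated_question_vecs: list[list[float]],
) -> float:
    """Mean cosine similarity between the query and each generated question."""
    sims = [cosine_similarity(query_vec, q) for q in generated_question_vecs]
    return sum(sims) / len(sims)


# Toy embeddings: one generated question matches the query exactly,
# the other is orthogonal to it (completely off-topic).
query = [1.0, 0.0]
generated = [[1.0, 0.0], [0.0, 1.0]]
print(answer_relevance_score(query, generated))  # 0.5
```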
Context Precision (0–1):
An LLM judges whether each retrieved chunk, in its retrieved order, is useful for answering the question
Score = rank-weighted precision: the mean of precision@k over the ranks k that hold a useful chunk
High score: most useful chunks are ranked first
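One common way to compute a rank-weighted precision is to average precision@k over the ranks that hold a relevant chunk. The sketch below assumes that formulation and takes the per-chunk relevance verdicts as given; the function name is hypothetical, and the exact formula may differ between RAGAS releases.

```python
def context_precision_score(relevant: list[bool]) -> float:
    """Average of precision@k taken at each rank k that holds a relevant chunk."""
    hits = 0
    precision_sum = 0.0
    for k, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / k  # precision@k at this rank
    return precision_sum / hits if hits else 0.0


# Same two chunks, different order: ranking the useful chunk first scores higher.
print(context_precision_score([True, False]))  # 1.0
print(context_precision_score([False, True]))  # 0.5
```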
Context Recall (0–1):
Requires a reference answer
Checks: what fraction of reference answer claims are covered by retrieved context?
Score = (claims in reference supported by context) / total claims in reference
Installation and Basic Usage
# pip install ragas datasets langchain-anthropic langchain-openai
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

# Build evaluation dataset: one record per question, holding the generated
# answer, the retrieved contexts, and a reference answer (ground_truth).
# Column names follow the schema used in this lesson; other RAGAS releases
# may name them differently.
samples = [
    {
        "question": "What is the INR target range for AF patients on warfarin?",
        "answer": "The INR target range for AF patients on warfarin is 2.0–3.0.",
        "contexts": [
            "For patients with atrial fibrillation on warfarin, the target INR is 2.0–3.0.",
            "INR monitoring should occur every 4 weeks when stable.",
        ],
        "ground_truth": "The target INR for AF patients on warfarin is 2.0–3.0.",
    },
    {
        "question": "What are the symptoms of warfarin overdose?",
        "answer": "Symptoms include unusual bleeding, bruising, and prolonged clotting.",
        "contexts": [
            "Warfarin toxicity presents with excessive bleeding from minor wounds.",
            "Signs of overdose include haematuria, epistaxis, and prolonged prothrombin time.",
        ],
        "ground_truth": "Warfarin overdose causes excessive bleeding, haematuria, epistaxis, and prolonged clotting time.",
    },
]
dataset = Dataset.from_list(samples)

# Run RAGAS evaluation with the default evaluator (an OpenAI model,
# so OPENAI_API_KEY must be set)
result = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)
# Example output (values are illustrative):
# {'faithfulness': 0.92, 'answer_relevancy': 0.88, 'context_precision': 0.85, 'context_recall': 0.79}
Using Claude as the Evaluator
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_anthropic import ChatAnthropic
from langchain_openai import OpenAIEmbeddings

# Use Claude as the judge LLM instead of the default OpenAI model.
# Requires ANTHROPIC_API_KEY; the embeddings below still need OPENAI_API_KEY.
claude_llm = LangchainLLMWrapper(
    ChatAnthropic(model="claude-sonnet-4-6", temperature=0)
)
# answer_relevancy uses embeddings to compare generated questions with the query
openai_embs = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

result = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    llm=claude_llm,
    embeddings=openai_embs,
)
End-to-End RAG Evaluation Pipeline
from dataclasses import dataclass
from typing import Callable

# Reuses evaluate, Dataset, and the four metrics imported in the basic usage example.


@dataclass
class RAGPipeline:
    retrieve: Callable[[str], list[str]]       # query → list of context strings
    generate: Callable[[str, list[str]], str]  # query + context → answer


def evaluate_rag_pipeline(
    pipeline: RAGPipeline,
    questions: list[str],
    ground_truths: list[str],
) -> dict:
    """Run the pipeline on every question and score the outputs with RAGAS."""
    if len(questions) != len(ground_truths):
        raise ValueError("questions and ground_truths must have the same length")

    # Run pipeline on all questions
    answers = []
    contexts_list = []
    for question in questions:
        contexts = pipeline.retrieve(question)
        answer = pipeline.generate(question, contexts)
        answers.append(answer)
        contexts_list.append(contexts)

    dataset = Dataset.from_dict({
        "question": questions,
        "answer": answers,
        "contexts": contexts_list,
        "ground_truth": ground_truths,
    })
    result = evaluate(
        dataset=dataset,
        metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    )

    # Assumes result[...] yields one aggregate float per metric; some RAGAS
    # releases may return per-sample lists instead, which need averaging first.
    return {
        "faithfulness": result["faithfulness"],
        "answer_relevancy": result["answer_relevancy"],
        "context_precision": result["context_precision"],
        "context_recall": result["context_recall"],
        "mean_score": sum([
            result["faithfulness"],
            result["answer_relevancy"],
            result["context_precision"],
            result["context_recall"],
        ]) / 4,
    }
Interpreting RAGAS Results
Low faithfulness (< 0.8):
The LLM is adding information not in the context
Fix: strengthen grounding instructions ("ONLY use the provided context"); lower temperature; add an output classifier
Low answer relevancy (< 0.8):
The answer is off-topic or incomplete
Fix: improve prompt structure; add explicit "answer the question directly" instruction
Low context precision (< 0.7):
Retrieved chunks contain irrelevant content
Fix: improve embedding model; add metadata filters; increase similarity threshold
Low context recall (< 0.7):
Retrieved context doesn't cover the answer
Fix: add more documents; improve chunking; lower similarity threshold;
add query expansion; use hybrid retrieval
Clinical minimum thresholds:
Faithfulness: > 0.90 (safety-critical)
Answer Relevance: > 0.85
Context Precision: > 0.75
Context Recall: > 0.80
Interview Answer
"RAGAS provides four reference-free RAG evaluation metrics: faithfulness (answer claims supported by context — the most safety-critical metric), answer relevancy (answer addresses the question), context precision (retrieved chunks are relevant, ranked well), and context recall (retrieved context covers the reference answer — the one metric requiring a reference). RAGAS uses an LLM to compute these automatically, removing the need for a manually labelled evaluation dataset. I run RAGAS in CI on a curated test set of 50–100 questions after any pipeline change: a faithfulness drop below 0.90 blocks deployment, as unfaithful clinical answers are a safety risk."