RAGAS: Evaluating RAG Pipelines
Use the RAGAS framework to measure RAG pipeline quality across four dimensions: faithfulness, answer relevancy, context precision, and context recall.
What is RAGAS?
RAGAS (Retrieval Augmented Generation Assessment) is a framework specifically designed to evaluate RAG pipelines. Standard LLM evaluation metrics don't capture the unique failure modes of RAG: bad retrieval, hallucination of non-retrieved facts, and answers that don't match retrieved context.
RAGAS decomposes RAG quality into four metrics, each targeting a different failure mode.
The Four RAGAS Metrics
1. Faithfulness
Measures: Does the answer stick to the retrieved context, or does it hallucinate beyond it?
An answer is faithful if every claim can be traced back to the retrieved context.
Score: 0 to 1 (1 = fully faithful, 0 = entirely hallucinated)
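To make the score concrete: RAGAS breaks the answer into individual claims, asks a judge LLM whether each claim is supported by the retrieved context, and reports the supported fraction. The sketch below reproduces only that final arithmetic. The `faithfulness_score` helper and the hand-written verdicts are illustrative assumptions, not part of the RAGAS API.

```python
def faithfulness_score(claim_supported: list[bool]) -> float:
    """Fraction of answer claims that are supported by the retrieved context."""
    if not claim_supported:
        raise ValueError("at least one claim is required")
    return sum(claim_supported) / len(claim_supported)


# Hypothetical verdicts for an answer with four claims, one of them unsupported.
verdicts = [True, True, True, False]
print(faithfulness_score(verdicts))  # 0.75
```

In the real metric the verdicts come from the judge LLM, so the score inherits whatever errors that model makes.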
2. Answer Relevancy
Measures: Does the answer address the actual question asked?
A relevant answer directly answers the question. An irrelevant answer may be factually accurate but doesn't address what was asked.
Score: 0 to 1 (1 = highly relevant, 0 = not relevant)
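RAGAS estimates answer relevancy roughly as follows: an LLM generates questions that the answer would fit, those questions are embedded, and the score is their mean cosine similarity to the original question. The sketch below shows only the similarity step, using made-up two-dimensional vectors in place of real embeddings. The function names are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def answer_relevancy_score(
    question_vec: list[float], generated_question_vecs: list[list[float]]
) -> float:
    """Mean similarity between the original question and the generated ones."""
    sims = [cosine_similarity(question_vec, g) for g in generated_question_vecs]
    return sum(sims) / len(sims)


# Toy "embeddings": one generated question matches the original, one is unrelated.
original = [1.0, 0.0]
generated = [[1.0, 0.0], [0.0, 1.0]]
print(answer_relevancy_score(original, generated))  # 0.5
```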
3. Context Precision
Measures: Is the retrieved context relevant to the question? Are we retrieving noise?
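High precision: all retrieved chunks are relevant. Low precision: many irrelevant chunks were retrieved. The RAGAS score also depends on ranking: relevant chunks near the top of the retrieved list score higher than the same chunks ranked lower.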
Score: 0 to 1 (1 = all context relevant, 0 = no relevant context)
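In RAGAS this score is computed as the mean of precision@k over the ranks k that hold a relevant chunk. The sketch below implements that formula for a list of relevance flags given in retrieval order. In practice the flags come from a judge LLM; here they are hand-written, and the helper name is hypothetical.

```python
def context_precision_score(relevant: list[bool]) -> float:
    """Mean precision@k over the ranks that hold a relevant chunk."""
    hits = 0
    weighted_sum = 0.0
    for k, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            hits += 1
            weighted_sum += hits / k  # precision@k, counted only at relevant ranks
    return weighted_sum / hits if hits else 0.0


# Same three chunks, two different orderings.
print(context_precision_score([True, True, False]))  # 1.0
print(context_precision_score([False, True, True]))  # (1/2 + 2/3) / 2
```

The second ordering scores lower even though the same chunks were retrieved, which is why a reranker can raise context precision without changing what is retrieved.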
4. Context Recall
Measures: Did we retrieve all the information needed to answer the question?
High recall: the retrieved context contains all information needed. Low recall: the answer requires information not in retrieved chunks.
Score: 0 to 1 (1 = all needed info retrieved, 0 = nothing useful retrieved)
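Context recall is measured against the ground-truth answer: each ground-truth statement is checked for support in the retrieved context, and the score is the supported fraction. The sketch below assumes those per-statement verdicts already exist (in RAGAS a judge LLM produces them) and also returns the unsupported statements, which is often the more useful output when debugging a retriever. The helper and the verdicts are illustrative.

```python
def context_recall_score(attributions: dict[str, bool]) -> tuple[float, list[str]]:
    """Return (recall, missing statements) for a set of ground-truth statements."""
    if not attributions:
        raise ValueError("at least one ground-truth statement is required")
    missing = [stmt for stmt, found in attributions.items() if not found]
    recall = (len(attributions) - len(missing)) / len(attributions)
    return recall, missing


# Hypothetical verdicts: was each ground-truth statement found in the retrieved chunks?
verdicts = {
    "eGFR below 30 is a contraindication": True,
    "metabolic acidosis is a contraindication": True,
    "hold before iodinated contrast procedures": False,
}
recall, missing = context_recall_score(verdicts)
print(round(recall, 2), missing)
```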
RAGAS Evaluation Setup
```shell
pip install ragas langchain-openai
```

```python
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

# Your RAG pipeline test data
data = {
    "question": [
        "What are the contraindications for metformin?",
        "How does warfarin interact with aspirin?",
        "What is the mechanism of action of beta-blockers?",
    ],
    "answer": [
        "Metformin is contraindicated in patients with eGFR below 30 mL/min/1.73m2, metabolic acidosis, and those undergoing iodinated contrast imaging.",
        "Warfarin and aspirin together significantly increase bleeding risk. Aspirin inhibits platelet aggregation and can displace warfarin from protein binding sites, elevating free warfarin levels.",
        "Beta-blockers competitively block beta-adrenergic receptors, reducing heart rate and myocardial contractility.",
    ],
    "contexts": [
        # List of retrieved chunks for each question
        [
            "Metformin is contraindicated when eGFR is below 30 mL/min/1.73m2 due to risk of lactic acidosis.",
            "Metformin should be held before iodinated contrast procedures.",
            "Metabolic acidosis is a contraindication for metformin therapy.",
        ],
        [
            "NSAIDs like aspirin inhibit platelet aggregation via COX-1 inhibition.",
            "Aspirin can displace warfarin from albumin binding sites, increasing free warfarin concentration.",
            "Combined warfarin and NSAID use significantly increases GI bleeding risk.",
        ],
        [
            "Beta-adrenergic blockers competitively antagonize catecholamines at beta-1 and beta-2 receptors.",
            "The result of beta-blockade is decreased heart rate, reduced myocardial contractility, and lowered blood pressure.",
        ],
    ],
    "ground_truth": [
        "Metformin contraindications include eGFR below 30, metabolic acidosis, and iodinated contrast procedures.",
        "Warfarin and aspirin increase bleeding risk through platelet inhibition and protein displacement.",
        "Beta-blockers work by blocking beta-adrenergic receptors, reducing heart rate and contractility.",
    ],
}

dataset = Dataset.from_dict(data)

results = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(results)
# {'faithfulness': 0.95, 'answer_relevancy': 0.88, 'context_precision': 0.92, 'context_recall': 0.85}
```

Interpreting Results
| Score | Faithfulness | Answer Relevancy | Context Precision | Context Recall |
|---|---|---|---|---|
| 0.90+ | Almost no hallucination | Highly on-topic | Retrieval is clean | Captures nearly all needed info |
| 0.75–0.90 | Minor hallucination | Mostly relevant | Some irrelevant chunks | Minor gaps in coverage |
| 0.60–0.75 | Noticeable hallucination | Often off-topic | Many irrelevant chunks | Missing important info |
| Below 0.60 | Significant hallucination | Frequently off-topic | Retrieval largely irrelevant | Severe coverage gaps |
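
The bands in this table can be applied mechanically to a results dictionary, which is handy in a report or a CI check. A minimal sketch follows. The `score_band` helper is hypothetical, and because the table's ranges share their boundary values, it assumes each lower bound is inclusive.

```python
def score_band(score: float) -> str:
    """Map a 0-1 score to the matching row label of the interpretation table."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0 and 1, got {score}")
    if score >= 0.90:
        return "0.90+"
    if score >= 0.75:
        return "0.75-0.90"
    if score >= 0.60:
        return "0.60-0.75"
    return "Below 0.60"


# Example scores, shaped like the dictionary printed by the setup code.
example = {
    "faithfulness": 0.95,
    "answer_relevancy": 0.88,
    "context_precision": 0.92,
    "context_recall": 0.85,
}
for metric, value in example.items():
    print(f"{metric}: {value:.2f} -> {score_band(value)}")
```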
Using Custom LLM for Evaluation
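By default, RAGAS uses OpenAI models as the judge LLM and for embeddings. To evaluate with a different model, pass your own `llm` and `embeddings` to `evaluate`: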
```python
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings

# Use a specific model for evaluation
eval_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o", temperature=0))
eval_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# `dataset` is the Dataset built in the setup section
results = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    llm=eval_llm,
    embeddings=eval_embeddings,
)
```

Diagnosing RAG Problems with RAGAS
RAGAS metrics point directly at which component of your RAG pipeline needs improvement:
| Problem Symptoms | Likely Cause | Fix |
|---|---|---|
| Low faithfulness only | LLM hallucinating beyond retrieved content | Stronger grounding prompt; temperature reduction |
| Low answer relevancy only | LLM ignoring the question | Improve system prompt; add explicit instruction to answer the question |
| Low context precision only | Retriever returning irrelevant chunks | Improve chunking; add metadata filtering |
| Low context recall only | Retriever missing relevant content | Increase top-k; improve embedding model; add hybrid search |
| Low precision AND recall | Fundamental retrieval problem | Review chunking strategy and embedding model |
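
The same lookup can be automated so that each evaluation run prints its own triage notes. The sketch below encodes the table rows as a dictionary. The `diagnose` function, the 0.75 threshold for "low", and the choice to report each hint separately when several unrelated metrics are low are all assumptions made for illustration.

```python
HINTS = {
    "faithfulness": "LLM hallucinating beyond retrieved content: strengthen the grounding prompt, lower temperature",
    "answer_relevancy": "LLM ignoring the question: improve the system prompt",
    "context_precision": "Retriever returning irrelevant chunks: improve chunking, add metadata filtering",
    "context_recall": "Retriever missing relevant content: increase top-k, improve embedding model, add hybrid search",
}
RETRIEVAL_HINT = "Fundamental retrieval problem: review chunking strategy and embedding model"


def diagnose(scores: dict[str, float], threshold: float = 0.75) -> list[str]:
    """Return likely causes for every metric that falls below the threshold."""
    low = {name for name, value in scores.items() if name in HINTS and value < threshold}
    findings = []
    # Low precision AND recall is treated as one combined retrieval finding.
    if {"context_precision", "context_recall"} <= low:
        findings.append(RETRIEVAL_HINT)
        low -= {"context_precision", "context_recall"}
    findings.extend(HINTS[name] for name in HINTS if name in low)
    return findings


example = {
    "faithfulness": 0.62,
    "answer_relevancy": 0.91,
    "context_precision": 0.88,
    "context_recall": 0.86,
}
for finding in diagnose(example):
    print(finding)
```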
Continuous Monitoring Pipeline
```python
import json
from datetime import datetime, timezone

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)


def run_ragas_evaluation(rag_pipeline, test_questions: list[str]) -> dict:
    """Run full RAGAS evaluation on a RAG pipeline."""
    data = {"question": [], "answer": [], "contexts": [], "ground_truth": []}
    for q in test_questions:
        result = rag_pipeline.query(q)  # Your RAG pipeline
        data["question"].append(q)
        data["answer"].append(result["answer"])
        data["contexts"].append(result["source_chunks"])
        # Context recall compares against ground_truth, so an empty string
        # here makes that score meaningless; supply real reference answers.
        data["ground_truth"].append(result.get("expected_answer", ""))

    dataset = Dataset.from_dict(data)
    scores = evaluate(
        dataset=dataset,
        metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    )
    # Assumes indexing the result by metric name returns the aggregate score.
    return {
        # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated)
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "faithfulness": float(scores["faithfulness"]),
        "answer_relevancy": float(scores["answer_relevancy"]),
        "context_precision": float(scores["context_precision"]),
        "context_recall": float(scores["context_recall"]),
        "n_questions": len(test_questions),
    }


# Run weekly and save results
eval_results = run_ragas_evaluation(my_rag_pipeline, golden_questions)
with open("ragas_results.jsonl", "a") as f:
    f.write(json.dumps(eval_results) + "\n")

print(f"Faithfulness: {eval_results['faithfulness']:.3f}")
print(f"Context Recall: {eval_results['context_recall']:.3f}")
```

Track these metrics over time. A drop in faithfulness after a retrieval change suggests the new retrieved context is confusing the LLM. A drop in context recall after re-chunking suggests important information is now split across chunk boundaries.
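
Tracking only helps if someone notices the drop, so it can be worth comparing each run with the previous one automatically. The sketch below assumes each line of `ragas_results.jsonl` is a JSON object with the four metric keys written by the monitoring code. The `load_runs` and `detect_drops` helpers and the 0.05 tolerance are hypothetical choices, not RAGAS features.

```python
import json

METRICS = ("faithfulness", "answer_relevancy", "context_precision", "context_recall")


def load_runs(path: str) -> list[dict]:
    """Read one evaluation record per non-empty line of a JSONL file."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]


def detect_drops(previous: dict, current: dict, tolerance: float = 0.05) -> dict[str, float]:
    """Return metrics whose score fell by more than `tolerance` between two runs."""
    drops = {}
    for metric in METRICS:
        delta = current[metric] - previous[metric]
        if delta < -tolerance:
            drops[metric] = round(delta, 3)
    return drops


# Two made-up runs: faithfulness falls sharply, context recall only slightly.
last_week = {"faithfulness": 0.95, "answer_relevancy": 0.88, "context_precision": 0.92, "context_recall": 0.85}
this_week = {"faithfulness": 0.84, "answer_relevancy": 0.89, "context_precision": 0.92, "context_recall": 0.83}
print(detect_drops(last_week, this_week))  # {'faithfulness': -0.11}
```

With real data, the two most recent records can be taken from `load_runs("ragas_results.jsonl")[-2:]`. Because judge-LLM scores vary from run to run, the tolerance should be wide enough to ignore that noise.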