RAGAS: RAG Evaluation Framework

What RAGAS Is

RAGAS (Retrieval Augmented Generation Assessment) is an open-source framework for evaluating RAG pipelines. It provides four core metrics, most of which are reference-free and need no human labels (context recall, described below, needs a reference answer):

Faithfulness:        Does the answer stick to the retrieved context?
Answer Relevance:    Does the answer address the question?
Context Precision:   Are the retrieved chunks relevant to the question?
Context Recall:      Does the retrieved context cover the reference answer?

"Reference-free" means RAGAS uses an LLM (typically GPT-4 or Claude) as a judge to compute the metric, so no manually labelled answer dataset is required for it.

The Four Metrics in Detail

Faithfulness (0–1):
  Each factual claim in the answer is extracted
  Each claim is checked: is it supported by the retrieved context?
  Score = supported claims / total claims
  
  Example:
    Answer: "Warfarin is contraindicated in pregnancy. Dose is 5mg daily."
    Context: Only mentions contraindication in pregnancy (not dose)
    Faithfulness = 1/2 = 0.50  ← poor
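
The scoring step above reduces to a ratio once each claim has a verdict. The following is a minimal sketch of that arithmetic only: in RAGAS the claim extraction and the supported/unsupported verdicts come from the evaluator LLM, whereas here the verdicts are supplied by hand and `faithfulness_score` is a hypothetical helper name, not a RAGAS function.

```python
def faithfulness_score(claim_supported: list[bool]) -> float:
    """Fraction of answer claims that the retrieved context supports."""
    if not claim_supported:
        raise ValueError("at least one claim is required")
    return sum(claim_supported) / len(claim_supported)


# Hand-written verdicts for the warfarin example:
# the contraindication claim is supported, the dose claim is not.
verdicts = [True, False]
score = faithfulness_score(verdicts)
print(score)  # 0.5
```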

Answer Relevance (0–1):
  An LLM generates N questions that the answer could be a response to ("reverse questions")
  Those questions and the original query are embedded, and the cosine similarity of each pair is computed
  Score = mean cosine similarity
  
  Penalises: incomplete answers, off-topic answers
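
To make the averaging step concrete, here is a small sketch of the mean cosine similarity calculation. It assumes the embeddings already exist: the two-dimensional vectors are made up for illustration, and `answer_relevance_score` is a hypothetical helper name rather than part of RAGAS, which obtains both the generated questions and their embeddings from the configured models.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def answer_relevance_score(
    query_vec: list[float],
    generated_question_vecs: list[list[float]],
) -> float:
    """Mean similarity between the query and each generated question."""
    sims = [cosine_similarity(query_vec, q) for q in generated_question_vecs]
    return sum(sims) / len(sims)


# Toy embeddings: one generated question matches the query exactly,
# the other is orthogonal to it (completely off-topic).
query = [1.0, 0.0]
generated = [[1.0, 0.0], [0.0, 1.0]]
score = answer_relevance_score(query, generated)
print(score)  # 0.5
```

An answer that drifts off-topic yields generated questions that point away from the original query, which pulls the mean down.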

Context Precision (0–1):
  An LLM judges each retrieved chunk, in retrieval order, as useful or not for answering the question
  Score = rank-weighted precision, so useful chunks near the top count for more
  High score: most useful chunks are ranked first
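
One common way to turn "weighted precision over the ranking" into a number is to average precision@k over the ranks that hold a useful chunk. The sketch below assumes that formulation; the exact computation can differ between RAGAS versions, the relevance verdicts are supplied by hand rather than by an LLM, and `context_precision_score` is a hypothetical helper name.

```python
def context_precision_score(relevant: list[bool]) -> float:
    """Mean of precision@k over the ranks k that hold a relevant chunk."""
    hits = 0
    precision_sum = 0.0
    for k, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / k  # precision@k at this rank
    return precision_sum / hits if hits else 0.0


# The same two relevant chunks, ranked first versus ranked last.
good_ranking = context_precision_score([True, True, False])
bad_ranking = context_precision_score([False, True, True])
print(good_ranking > bad_ranking)  # True
```

Both rankings retrieve the same chunks, yet the second scores lower because the irrelevant chunk occupies the top position.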
  
Context Recall (0–1):
  Requires a reference answer
  Checks: what fraction of reference answer claims are covered by retrieved context?
  Score = (claims in reference supported by context) / total claims in reference
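
The recall ratio can be sketched in the same spirit. This example is deliberately crude: it uses case-insensitive substring matching as a stand-in for the LLM judgement that RAGAS relies on, the reference claims are split by hand, and `context_recall_score` is a hypothetical helper name.

```python
def context_recall_score(reference_claims: list[str], contexts: list[str]) -> float:
    """Fraction of reference claims found in the retrieved context.

    Substring matching is a crude stand-in for an LLM judge.
    """
    if not reference_claims:
        raise ValueError("at least one reference claim is required")
    joined = " ".join(contexts).lower()
    supported = [claim.lower() in joined for claim in reference_claims]
    return sum(supported) / len(supported)


contexts = [
    "Warfarin toxicity presents with excessive bleeding from minor wounds.",
    "Signs of overdose include haematuria, epistaxis, and prolonged prothrombin time.",
]
reference_claims = [
    "excessive bleeding",
    "haematuria",
    "epistaxis",
    "prolonged clotting time",
]
score = context_recall_score(reference_claims, contexts)
print(score)  # 0.75
```

The substring check misses "prolonged clotting time" because the context words it as "prolonged prothrombin time". Handling paraphrases like this is the reason the judgement is delegated to an LLM rather than to string matching.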

Installation and Basic Usage

Python

# pip install ragas datasets langchain-anthropic langchain-openai

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

# Build evaluation dataset
samples = [
    {
        "question": "What is the INR target range for AF patients on warfarin?",
        "answer": "The INR target range for AF patients on warfarin is 2.0–3.0.",
        "contexts": [
            "For patients with atrial fibrillation on warfarin, the target INR is 2.0–3.0.",
            "INR monitoring should occur every 4 weeks when stable.",
        ],
        "ground_truth": "The target INR for AF patients on warfarin is 2.0–3.0.",
    },
    {
        "question": "What are the symptoms of warfarin overdose?",
        "answer": "Symptoms include unusual bleeding, bruising, and prolonged clotting.",
        "contexts": [
            "Warfarin toxicity presents with excessive bleeding from minor wounds.",
            "Signs of overdose include haematuria, epistaxis, and prolonged prothrombin time.",
        ],
        "ground_truth": "Warfarin overdose causes excessive bleeding, haematuria, epistaxis, and prolonged clotting time.",
    },
]

dataset = Dataset.from_list(samples)

# Run RAGAS evaluation (with no llm argument RAGAS falls back to an OpenAI model,
# so OPENAI_API_KEY must be set)
result = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)

print(result)
# Illustrative output; actual scores vary with the evaluator LLM and RAGAS version:
# {'faithfulness': 0.92, 'answer_relevancy': 0.88, 'context_precision': 0.85, 'context_recall': 0.79}

Using Claude as the Evaluator

Python

from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_anthropic import ChatAnthropic
from langchain_openai import OpenAIEmbeddings

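# Use Claude as the evaluator LLM instead of the default OpenAI model.
# Reuses `evaluate`, `dataset`, and the metric imports from the previous example;
# expects ANTHROPIC_API_KEY and OPENAI_API_KEY (for the embeddings) in the environment.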
claude_llm = LangchainLLMWrapper(
    ChatAnthropic(model="claude-sonnet-4-6", temperature=0)
)
openai_embs = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

result = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    llm=claude_llm,
    embeddings=openai_embs,
)

End-to-End RAG Evaluation Pipeline

Python

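from dataclasses import dataclass
from typing import Callable

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)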

@dataclass
class RAGPipeline:
    retrieve: Callable[[str], list[str]]   # query → list of context strings
    generate: Callable[[str, list[str]], str]  # query + context → answer

def evaluate_rag_pipeline(
    pipeline: RAGPipeline,
    questions: list[str],
    ground_truths: list[str],
) -> dict:
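    if len(questions) != len(ground_truths):
        raise ValueError("questions and ground_truths must have the same length")

    # Run pipeline on all questions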
    answers = []
    contexts_list = []
    
    for question in questions:
        contexts = pipeline.retrieve(question)
        answer = pipeline.generate(question, contexts)
        answers.append(answer)
        contexts_list.append(contexts)
    
    dataset = Dataset.from_dict({
        "question": questions,
        "answer": answers,
        "contexts": contexts_list,
        "ground_truth": ground_truths,
    })
    
    result = evaluate(
        dataset=dataset,
        metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    )
    
    # Assumes result[name] is the dataset-level mean for that metric;
    # this can differ between RAGAS versions, so check the one you have installed.
    metric_names = [
        "faithfulness",
        "answer_relevancy",
        "context_precision",
        "context_recall",
    ]
    scores = {name: result[name] for name in metric_names}
    # Unweighted mean across the four metrics
    scores["mean_score"] = sum(scores.values()) / len(metric_names)
    return scores

Interpreting RAGAS Results

Low faithfulness (< 0.8):
  The LLM is adding information not in the context
  Fix: strengthen grounding instructions ("ONLY use the provided context");
       lower temperature; add an output classifier that flags unsupported claims

Low answer relevancy (< 0.8):
  The answer is off-topic or incomplete
  Fix: improve prompt structure; add explicit "answer the question directly" instruction

Low context precision (< 0.7):
  Retrieved chunks contain irrelevant content
  Fix: improve embedding model; add metadata filters; increase similarity threshold

Low context recall (< 0.7):
  Retrieved context doesn't cover the answer
  Fix: add more documents; improve chunking; lower similarity threshold;
       add query expansion; use hybrid retrieval

Clinical minimum thresholds:
  Faithfulness: > 0.90 (safety-critical)
  Answer Relevance: > 0.85
  Context Precision: > 0.75
  Context Recall: > 0.80
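
These thresholds lend themselves to an automated gate. The sketch below is one possible way to check a score dictionary against them; `CLINICAL_MINIMUMS` and `failing_metrics` are hypothetical names, and the sample scores are the illustrative values from the basic usage example, not measured results.

```python
# Minimum thresholds from the list above, keyed by RAGAS metric name.
CLINICAL_MINIMUMS = {
    "faithfulness": 0.90,
    "answer_relevancy": 0.85,
    "context_precision": 0.75,
    "context_recall": 0.80,
}


def failing_metrics(
    scores: dict[str, float],
    minimums: dict[str, float] = CLINICAL_MINIMUMS,
) -> dict[str, float]:
    """Return the metrics whose score does not exceed its minimum.

    Raises KeyError if a required metric is missing from `scores`.
    """
    return {
        name: scores[name]
        for name, minimum in minimums.items()
        if scores[name] <= minimum  # thresholds are strict ("> 0.90")
    }


scores = {
    "faithfulness": 0.92,
    "answer_relevancy": 0.88,
    "context_precision": 0.85,
    "context_recall": 0.79,
}
failures = failing_metrics(scores)
print(failures)  # {'context_recall': 0.79}
```

In a CI job, a non-empty result could be turned into a failing exit code, for example with `sys.exit(1 if failures else 0)`, so that a regression blocks the deployment.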

Interview Answer

"RAGAS provides four RAG evaluation metrics: faithfulness (answer claims are supported by the context; the most safety-critical metric), answer relevancy (the answer addresses the question), context precision (retrieved chunks are relevant and ranked well), and context recall (the retrieved context covers the reference answer; the one metric that requires a reference). The other three are reference-free: RAGAS uses an LLM to compute them automatically, removing the need for a manually labelled evaluation dataset. I run RAGAS in CI on a curated test set of 50–100 questions after any pipeline change: a faithfulness drop below 0.90 blocks deployment, because unfaithful clinical answers are a safety risk."
