RAG Ablation Studies

How to systematically test which RAG components matter most — ablation methodology, what to test, and how to interpret results to guide architectural decisions.

Asma Hafeez Khan · May 16, 2026 · 4 min read

What Is an Ablation Study?

An ablation study removes or disables one component at a time to measure its contribution to overall performance:

Full system: hybrid retrieval + reranking + contextual compression + parent document
  → RAGAS score: 0.85

Remove reranking → keep everything else:
  → RAGAS score: 0.79  (−0.06 = reranking contributes 0.06)

Remove compression → keep everything else:
  → RAGAS score: 0.83  (−0.02 = compression contributes 0.02)

Remove parent document → use flat chunking:
  → RAGAS score: 0.81  (−0.04 = parent document contributes 0.04)

Conclusion: of the three components tested, reranking contributes the most (0.06).
Don't remove it. Compression contributes the least (0.02).
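
The arithmetic above can be reproduced in a few lines. This is a minimal sketch that uses the illustrative scores from the example, not measured results; the `contributions` helper is a hypothetical name and not part of RAGAS.

```python
# Illustrative scores copied from the example above (not measured results).
full_score = 0.85
ablated_scores = {
    "reranking": 0.79,
    "compression": 0.83,
    "parent_document": 0.81,
}


def contributions(full: float, ablated: dict[str, float]) -> dict[str, float]:
    """Return each component's score drop when removed, largest drop first."""
    deltas = {name: round(full - score, 4) for name, score in ablated.items()}
    return dict(sorted(deltas.items(), key=lambda item: item[1], reverse=True))


ranking = contributions(full_score, ablated_scores)
for component, drop in ranking.items():
    print(f"{component:16} contributes {drop:.2f}")
```

Each delta is a marginal contribution measured with the other components still enabled. Components can interact, so the deltas will not necessarily add up to the gap between the full system and a bare baseline.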

RAG Component Ablation Matrix

Python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RAGConfig:
    name: str
    use_hybrid: bool = True
    use_reranking: bool = True
    use_compression: bool = True
    use_parent_doc: bool = True
    use_query_rewriting: bool = True
    chunk_size: int = 512
    top_k: int = 5

# Define all ablation configurations
configs = [
    RAGConfig("full_system"),
    RAGConfig("no_reranking", use_reranking=False),
    RAGConfig("no_hybrid", use_hybrid=False),
    RAGConfig("no_compression", use_compression=False),
    RAGConfig("no_parent_doc", use_parent_doc=False),
    RAGConfig("no_query_rewriting", use_query_rewriting=False),
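    # A "sparse_only" run needs its own RAGConfig flag (e.g. BM25-only retrieval);
    # with use_hybrid=False alone it would be identical to "no_hybrid", so it is omitted.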
    RAGConfig("small_chunks", chunk_size=128),
    RAGConfig("large_chunks", chunk_size=1024),
    RAGConfig("top_3", top_k=3),
    RAGConfig("top_10", top_k=10),
]
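
Before spending compute on the matrix, it can help to check that every ablation changes exactly what its name says and that no two configs are identical. The sketch below assumes a trimmed copy of `RAGConfig`; `changed_fields` and `find_duplicates` are hypothetical helper names introduced for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class RAGConfig:  # trimmed copy of the config above, so the example is self-contained
    name: str
    use_hybrid: bool = True
    use_reranking: bool = True
    chunk_size: int = 512
    top_k: int = 5


def changed_fields(baseline: RAGConfig, config: RAGConfig) -> list[str]:
    """List the fields (ignoring name) where config differs from baseline."""
    return [
        f.name
        for f in fields(baseline)
        if f.name != "name" and getattr(baseline, f.name) != getattr(config, f.name)
    ]


def find_duplicates(configs: list[RAGConfig]) -> list[tuple[str, str]]:
    """Return pairs of config names whose settings are identical."""
    seen: dict[tuple, str] = {}
    duplicates = []
    for config in configs:
        key = tuple(getattr(config, f.name) for f in fields(config) if f.name != "name")
        if key in seen:
            duplicates.append((seen[key], config.name))
        else:
            seen[key] = config.name
    return duplicates


baseline = RAGConfig("full_system")
demo_configs = [
    baseline,
    RAGConfig("no_reranking", use_reranking=False),
    RAGConfig("no_hybrid", use_hybrid=False),
    RAGConfig("sparse_only", use_hybrid=False),  # same settings as "no_hybrid"
    RAGConfig("top_3", top_k=3),
]

for config in demo_configs[1:]:
    print(config.name, "changes", changed_fields(baseline, config))
print("duplicates:", find_duplicates(demo_configs))
```

Two configs that only set `use_hybrid=False` are the same run under different names, which this check reports as a duplicate pair.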

Running the Ablation

Python
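# Assumes the ragas 0.1.x API (question / answer / contexts / ground_truth columns,
# dict-like result); newer ragas releases may use different column names and result types.
# RAGConfig is the dataclass defined in the previous block.
from typing import Callable

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset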

def run_ablation(
    configs: list[RAGConfig],
    eval_questions: list[dict],
    build_rag_fn: Callable[[RAGConfig], Callable],
) -> dict[str, dict]:
    results = {}

    for config in configs:
        print(f"Running: {config.name}")

        # Build the RAG pipeline for this config
        rag_fn = build_rag_fn(config)

        # Run all eval questions
        samples = []
        for q in eval_questions:
            result = rag_fn(q["question"])
            samples.append({
                "question": q["question"],
                "answer": result["answer"],
                "contexts": result["contexts"],
                "ground_truth": q["ground_truth"]
            })

        # Score
        dataset = Dataset.from_list(samples)
        scores = evaluate(dataset, metrics=[
            faithfulness, answer_relevancy, context_precision, context_recall
        ])
        results[config.name] = dict(scores)

    return results

def print_ablation_table(results: dict[str, dict]) -> None:
    headers = ["Config", "Faithfulness", "Relevance", "Precision", "Recall"]
    print(" | ".join(f"{h:20}" for h in headers))
    print("-" * (23 * len(headers) - 3))  # match the width of the header row
    baseline = results.get("full_system", {})
    for config_name, scores in results.items():
        row = [config_name]
        for metric in ["faithfulness", "answer_relevancy", "context_precision", "context_recall"]:
            val = scores.get(metric, 0)
            baseline_val = baseline.get(metric, 0)
            delta = f"({val - baseline_val:+.3f})" if config_name != "full_system" else ""
            row.append(f"{val:.3f} {delta}")
        print(" | ".join(f"{r:20}" for r in row))
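
`run_ablation` expects a `build_rag_fn` that turns a config into a callable returning `{"answer", "contexts"}`, but the article does not show one. The following is a rough sketch of that shape only: `dense_retrieve`, `sparse_retrieve`, `rerank`, and `generate` are hypothetical stand-ins that return placeholder strings, and a real pipeline would replace them with its own retriever, reranker, and LLM call.

```python
from types import SimpleNamespace
from typing import Callable


# Hypothetical stand-ins for real pipeline stages; none of these come from a library.
def dense_retrieve(question: str, top_k: int) -> list[str]:
    return [f"dense chunk {i} for: {question}" for i in range(top_k)]


def sparse_retrieve(question: str, top_k: int) -> list[str]:
    return [f"bm25 chunk {i} for: {question}" for i in range(top_k)]


def rerank(question: str, chunks: list[str]) -> list[str]:
    return sorted(chunks)  # placeholder ordering, not a real relevance model


def generate(question: str, contexts: list[str]) -> str:
    return f"answer to '{question}' from {len(contexts)} chunks"


def build_rag_fn(config) -> Callable[[str], dict]:
    """Return a question -> {"answer", "contexts"} function for one config."""

    def rag_fn(question: str) -> dict:
        chunks = dense_retrieve(question, config.top_k)
        if config.use_hybrid:  # add the sparse candidates only when enabled
            chunks += sparse_retrieve(question, config.top_k)
        if config.use_reranking:  # reorder candidates only when enabled
            chunks = rerank(question, chunks)
        contexts = chunks[: config.top_k]
        return {"answer": generate(question, contexts), "contexts": contexts}

    return rag_fn


# Any object with the right attributes works here, including the RAGConfig dataclass.
demo_config = SimpleNamespace(top_k=3, use_hybrid=True, use_reranking=False)
demo_result = build_rag_fn(demo_config)("What is the maximum dose?")
print(demo_result["contexts"])
```

Keeping every toggle inside one builder means each ablation run shares the same code path and differs only in the flags, which is what makes the deltas comparable.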

What to Ablate and What It Reveals

Ablation                 What it reveals
─────────────────────────────────────────────────────────────────
Remove reranking:        How much of the precision comes from reranking
                         vs. the initial retrieval?
Remove hybrid retrieval: Do we actually benefit from BM25 in our domain?
                         (for highly semantic queries, BM25 may add noise)
Remove compression:      Does contextual compression actually help, or
                         does it drop important nuance?
Remove parent doc:       Is full-context retrieval worth the storage cost?
Remove query rewriting:  How much do users benefit from query expansion?
                         (e.g. medical abbreviations → clinical terms)
Vary chunk size:         What's the optimal granularity for our corpus?
Vary top_k:              Does more context help or hurt (noise vs. coverage)?
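
For the "vary" rows, one option is to generate the sweep from a base config so that only a single field changes per run. This is a small sketch using `dataclasses.replace`; the `sweep` helper and the specific values are illustrative assumptions, and `RAGConfig` is a trimmed copy of the one defined earlier.

```python
from dataclasses import dataclass, replace


@dataclass
class RAGConfig:  # trimmed copy of the config above, so the example is self-contained
    name: str
    chunk_size: int = 512
    top_k: int = 5


def sweep(base: RAGConfig, field_name: str, values: list[int]) -> list[RAGConfig]:
    """Create one config per value, changing only field_name relative to base."""
    return [
        replace(base, name=f"{field_name}_{value}", **{field_name: value})
        for value in values
    ]


base = RAGConfig("full_system")
chunk_sweep = sweep(base, "chunk_size", [128, 256, 512, 1024])  # illustrative values
top_k_sweep = sweep(base, "top_k", [3, 5, 10])  # illustrative values
for config in chunk_sweep + top_k_sweep:
    print(config)
```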

Common Ablation Findings for Clinical RAG

Typical findings in clinical document RAG:

1. Reranking gives the biggest single improvement
   +0.08-0.15 faithfulness and precision
   Worth the latency and cost for clinical use cases

2. Hybrid retrieval helps significantly
   +0.05-0.10 recall for medical abbreviations and specific drug names

3. Query rewriting matters a lot for lay user queries
   Less important for clinical staff queries (already use correct terminology)

4. Contextual compression helps for long documents
   Clinical guidelines (50+ pages): helps significantly
   EHR notes (1-5 pages): less important

5. Parent document vs flat chunking
   Structured guidelines: parent document is significantly better
   Unstructured notes: difference is smaller

6. chunk_size = 300-500 tokens is often optimal for clinical text
   Small enough for precision, large enough for context
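
Deltas in these ranges can sit within noise on a small eval set, so it is worth checking whether a difference is stable before acting on it. One possible check is a paired bootstrap over per-question scores, sketched below. The `paired_bootstrap_ci` helper is a hypothetical name, the scores are made up for illustration, and how you obtain per-question scores depends on your evaluation tooling.

```python
import random
from statistics import mean


def paired_bootstrap_ci(
    full_scores: list[float],
    ablated_scores: list[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float, float]:
    """Return (mean delta, CI lower, CI upper) for ablated minus full, per question."""
    if len(full_scores) != len(ablated_scores):
        raise ValueError("score lists must cover the same questions")
    deltas = [a - f for f, a in zip(full_scores, ablated_scores)]
    rng = random.Random(seed)
    # Resample questions with replacement and record the mean delta each time.
    resampled_means = sorted(
        mean(rng.choices(deltas, k=len(deltas))) for _ in range(n_resamples)
    )
    lower = resampled_means[int((alpha / 2) * n_resamples)]
    upper = resampled_means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(deltas), lower, upper


# Made-up per-question faithfulness scores, for illustration only.
full = [0.90, 0.80, 0.95, 0.70, 0.85, 0.90, 0.75, 0.80]
no_rerank = [0.80, 0.75, 0.90, 0.70, 0.70, 0.85, 0.70, 0.80]
delta, low, high = paired_bootstrap_ci(full, no_rerank)
print(f"delta={delta:+.3f}, 95% CI=[{low:+.3f}, {high:+.3f}]")
```

If the interval excludes zero, the component's effect is unlikely to be an artifact of which questions happened to be in the eval set. If it straddles zero, a larger eval set is needed before drawing a conclusion.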

Interview Answer

"Ablation studies systematically disable one RAG component at a time and measure the RAGAS score delta — this tells you which components contribute most to performance. The methodology: build a full pipeline, run your eval set, then remove one component and re-run. The delta is the contribution of that component. In clinical RAG, typical findings are: reranking gives the largest single improvement (+0.08-0.15 on faithfulness/precision); hybrid retrieval helps significantly for medical abbreviations; query rewriting helps for lay users; chunk size around 300-500 tokens is usually optimal. Ablation studies prevent over-engineering — you don't need every advanced technique if the simple baseline already works well for your data."
