Advanced RAG · Lesson 13 of 14

Ablation Study: Which Component Helps Most?

What Is an Ablation Study?

An ablation study removes or disables one component at a time, keeps everything else fixed, and re-measures performance. The drop relative to the full system is that component's contribution:

Full system: hybrid retrieval + reranking + contextual compression + parent document
  → RAGAS score: 0.85

Remove reranking → keep everything else:
  → RAGAS score: 0.79  (−0.06 = reranking contributes 0.06)

Remove compression → keep everything else:
  → RAGAS score: 0.83  (−0.02 = compression contributes 0.02)

Remove parent document → use flat chunking:
  → RAGAS score: 0.81  (−0.04 = parent document contributes 0.04)

Conclusion: reranking contributes the most (0.06), so keep it.
Compression contributes the least (0.02), so it is the first candidate to drop.
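The arithmetic above can be sketched in a few lines. The scores are the illustrative numbers from the example, and the function name `component_contributions` is made up for this sketch. Each delta is a marginal contribution measured with every other component still switched on, so the deltas need not add up to the gap between the full system and a bare baseline.

```python
def component_contributions(
    full_score: float, ablated: dict[str, float]
) -> list[tuple[str, float]]:
    """Return (component, contribution) pairs, largest contribution first."""
    # Contribution = score of the full system minus score with the component removed.
    deltas = {name: round(full_score - score, 4) for name, score in ablated.items()}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)


# Illustrative numbers from the example above (not measurements).
full_score = 0.85
ablated_scores = {
    "reranking": 0.79,
    "compression": 0.83,
    "parent_document": 0.81,
}

ranking = component_contributions(full_score, ablated_scores)
for name, delta in ranking:
    print(f"{name:16} contributes {delta:.2f}")
```

Sorting by delta puts reranking first and compression last, which matches the conclusion drawn by hand.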

RAG Component Ablation Matrix

Python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RAGConfig:
    name: str
    use_hybrid: bool = True
    # Retriever used when use_hybrid is False: "dense" or "sparse".
    # Added so that "sparse_only" is a different configuration from "no_hybrid".
    single_retriever: str = "dense"
    use_reranking: bool = True
    use_compression: bool = True
    use_parent_doc: bool = True
    use_query_rewriting: bool = True
    chunk_size: int = 512
    top_k: int = 5

# Define all ablation configurations
configs = [
    # Baseline: every component enabled
    RAGConfig("full_system"),
    # Leave-one-out ablations: exactly one component disabled
    RAGConfig("no_reranking", use_reranking=False),
    RAGConfig("no_hybrid", use_hybrid=False),  # dense retrieval only
    RAGConfig("no_compression", use_compression=False),
    RAGConfig("no_parent_doc", use_parent_doc=False),
    RAGConfig("no_query_rewriting", use_query_rewriting=False),
    RAGConfig("sparse_only", use_hybrid=False, single_retriever="sparse"),
    # Parameter sweeps: all components enabled, one parameter changed
    RAGConfig("small_chunks", chunk_size=128),
    RAGConfig("large_chunks", chunk_size=1024),
    RAGConfig("top_3", top_k=3),
    RAGConfig("top_10", top_k=10),
]
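Writing the leave-one-out list by hand makes it easy to define the same configuration twice under two names. As a rough sketch, the list can be generated from the boolean flags and checked for duplicates. The block uses a trimmed copy of `RAGConfig` with only the on/off flags so that it runs on its own; `leave_one_out` and `find_duplicates` are hypothetical helper names, not part of any library.

```python
from dataclasses import dataclass, fields, replace


@dataclass
class RAGConfig:
    """Trimmed copy of the lesson's config: name plus on/off flags only."""
    name: str
    use_hybrid: bool = True
    use_reranking: bool = True
    use_compression: bool = True
    use_parent_doc: bool = True
    use_query_rewriting: bool = True


def leave_one_out(base: RAGConfig) -> list[RAGConfig]:
    """Build one config per enabled boolean flag, with only that flag switched off."""
    variants = []
    for f in fields(base):
        value = getattr(base, f.name)
        if isinstance(value, bool) and value:
            short = f.name.removeprefix("use_")
            variants.append(replace(base, name=f"no_{short}", **{f.name: False}))
    return variants


def find_duplicates(configs: list[RAGConfig]) -> list[list[str]]:
    """Return groups of config names whose settings (ignoring name) are identical."""
    seen: dict[tuple, list[str]] = {}
    for c in configs:
        key = tuple(getattr(c, f.name) for f in fields(c) if f.name != "name")
        seen.setdefault(key, []).append(c.name)
    return [names for names in seen.values() if len(names) > 1]


base = RAGConfig("full_system")
ablations = [base] + leave_one_out(base)
print([c.name for c in ablations])
print(find_duplicates(ablations))
```

Running `find_duplicates` on a hand-written list before starting a long evaluation avoids paying for the same run twice.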

Running the Ablation

Python
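# RAGConfig and Callable come from the previous block.
# The column names used below (question, answer, contexts, ground_truth) match
# older ragas releases; other versions may expect different names, so check
# the documentation for the version you have installed.
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset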

def run_ablation(
    configs: list[RAGConfig],
    eval_questions: list[dict],
    build_rag_fn: Callable[[RAGConfig], Callable],
) -> dict[str, dict]:
    results = {}

    for config in configs:
        print(f"Running: {config.name}")

        # Build the RAG pipeline for this config
        rag_fn = build_rag_fn(config)

        # Run all eval questions
        samples = []
        for q in eval_questions:
            result = rag_fn(q["question"])
            samples.append({
                "question": q["question"],
                "answer": result["answer"],
                "contexts": result["contexts"],
                "ground_truth": q["ground_truth"]
            })

        # Score
        dataset = Dataset.from_list(samples)
        scores = evaluate(dataset, metrics=[
            faithfulness, answer_relevancy, context_precision, context_recall
        ])
        results[config.name] = dict(scores)

    return results

def print_ablation_table(results: dict[str, dict]) -> None:
    """Print one row per config, with deltas against the full_system baseline."""
    headers = ["Config", "Faithfulness", "Relevancy", "Precision", "Recall"]
    metrics = ["faithfulness", "answer_relevancy", "context_precision", "context_recall"]
    header_line = " | ".join(f"{h:20}" for h in headers)
    print(header_line)
    print("-" * len(header_line))
    baseline = results.get("full_system")
    for config_name, scores in results.items():
        row = [config_name]
        for metric in metrics:
            val = scores.get(metric, 0)
            # Show a delta only for ablated configs, and only if a baseline exists
            if baseline is not None and config_name != "full_system":
                delta = f"({val - baseline.get(metric, 0):+.3f})"
            else:
                delta = ""
            row.append(f"{val:.3f} {delta}".rstrip())
        print(" | ".join(f"{r:20}" for r in row))
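The lesson does not show what `build_rag_fn` returns, so here is a toy sketch of the contract that `run_ablation` relies on: a factory that takes a config and returns a function mapping a question to a dict with `answer` and `contexts`. Everything in it is an assumption made for illustration: `ToyConfig` stands in for `RAGConfig`, the corpus is four made-up sentences, retrieval is plain word overlap, and the "reranker" is just a sort. A real pipeline would call a vector store, a cross-encoder, and an LLM at the marked places.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToyConfig:
    """Stand-in for RAGConfig with only the fields this sketch reads."""
    name: str
    use_reranking: bool = True
    top_k: int = 2


# Made-up corpus standing in for a real document store.
CORPUS = [
    "Contextual compression trims chunks to the relevant passages.",
    "BM25 is a sparse keyword-based retrieval method.",
    "Reranking reorders retrieved chunks with a cross-encoder.",
    "Parent document retrieval returns the larger section around a matching chunk.",
]


def _tokens(text: str) -> set[str]:
    """Lowercase the text, strip basic punctuation, and split into a set of words."""
    return set(text.lower().replace(".", "").replace("?", "").split())


def _overlap(question: str, chunk: str) -> int:
    """Count the words shared by the question and the chunk."""
    return len(_tokens(question) & _tokens(chunk))


def build_toy_rag_fn(config: ToyConfig) -> Callable[[str], dict]:
    """Return a question -> {"answer", "contexts"} function for the given config."""

    def rag_fn(question: str) -> dict:
        # First-stage retrieval: keep corpus order, drop chunks with no shared words.
        candidates = [c for c in CORPUS if _overlap(question, c) > 0]
        if config.use_reranking:
            # Toy reranker: order candidates by overlap, best first.
            candidates = sorted(
                candidates, key=lambda c: _overlap(question, c), reverse=True
            )
        contexts = candidates[: config.top_k]
        # A real pipeline would generate an answer with an LLM here.
        answer = contexts[0] if contexts else "No relevant context found."
        return {"answer": answer, "contexts": contexts}

    return rag_fn


question = "Which step reorders retrieved chunks?"
for cfg in (ToyConfig("full_system", top_k=1), ToyConfig("no_reranking", use_reranking=False, top_k=1)):
    print(cfg.name, "->", build_toy_rag_fn(cfg)(question)["contexts"])
```

The point of the factory shape is that every flag is read in one place, so switching a component off never requires touching the evaluation loop.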

What to Ablate and What It Reveals

Ablation → What it reveals
─────────────────────────────────────────────────────────────────
Remove reranking:        How much of the precision comes from reranking
                         vs. from the initial retrieval?
Remove hybrid retrieval: Do we actually benefit from BM25 in our domain?
                         (for highly semantic queries, BM25 may add noise)
Remove compression:      Does contextual compression actually help, or
                         does it drop important nuance?
Remove parent doc:       Is full-context retrieval worth the storage cost?
Remove query rewriting:  How much do users benefit from query expansion?
                         (medical abbreviations → clinical terms)
Vary chunk size:         What's the optimal granularity for our corpus?
Vary top_k:              Does more context help or hurt (noise vs. coverage)?
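Different ablations tend to show up in different metrics, so it can help to ask, for each metric, which ablation hurts it most. The helper below is a sketch that assumes a `results` dict shaped like the return value of `run_ablation` (config name mapped to a dict of metric scores); the name `largest_drop_per_metric` is hypothetical.

```python
def largest_drop_per_metric(
    results: dict[str, dict[str, float]],
    baseline_name: str = "full_system",
) -> dict[str, tuple[str, float]]:
    """For each metric, return the ablation whose score falls furthest below the baseline."""
    baseline = results[baseline_name]
    worst: dict[str, tuple[str, float]] = {}
    for metric, base_val in baseline.items():
        # Drop = baseline score minus ablated score (positive means the ablation hurt).
        drops = {
            name: base_val - scores[metric]
            for name, scores in results.items()
            if name != baseline_name and metric in scores
        }
        if drops:
            name = max(drops, key=drops.get)
            worst[metric] = (name, drops[name])
    return worst
```

Calling `largest_drop_per_metric(results)` on the output of `run_ablation` gives one `(config_name, drop)` pair per metric, which is a quick way to see whether, say, reranking mainly moves precision while hybrid retrieval mainly moves recall.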

Common Ablation Findings for Clinical RAG

Typical findings in clinical document RAG:

1. Reranking gives the biggest single improvement
   +0.08 to +0.15 on faithfulness and precision
   Worth the latency and cost for clinical use cases

2. Hybrid retrieval helps significantly
   +0.05 to +0.10 recall for medical abbreviations and specific drug names

3. Query rewriting matters a lot for lay user queries
   Less important for clinical staff queries (already use correct terminology)

4. Contextual compression helps for long documents
   Clinical guidelines (50+ pages): helps significantly
   EHR notes (1-5 pages): less important

5. Parent document vs flat chunking
   Structured guidelines: parent document is significantly better
   Unstructured notes: difference is smaller

6. chunk_size = 300-500 tokens is often optimal for clinical text
   Small enough for precision, large enough for context

Interview Answer

"Ablation studies systematically disable one RAG component at a time and measure the RAGAS score delta — this tells you which components contribute most to performance. The methodology: build a full pipeline, run your eval set, then remove one component and re-run. The delta is the contribution of that component. In clinical RAG, typical findings are: reranking gives the largest single improvement (+0.08-0.15 on faithfulness/precision); hybrid retrieval helps significantly for medical abbreviations; query rewriting helps for lay users; chunk size around 300-500 tokens is usually optimal. Ablation studies prevent over-engineering — you don't need every advanced technique if the simple baseline already works well for your data."