RAG Ablation Studies
How to systematically test which RAG components matter most: ablation methodology, what to test, and how to interpret results to guide architectural decisions.
What Is an Ablation Study?
An ablation study removes or disables one component at a time to measure its contribution to overall performance:
Full system: hybrid retrieval + reranking + contextual compression + parent document
  → RAGAS score: 0.85
Remove reranking, keep everything else:
  → RAGAS score: 0.79 (-0.06 = reranking contributes 0.06)
Remove compression, keep everything else:
  → RAGAS score: 0.83 (-0.02 = compression contributes 0.02)
Remove parent document, use flat chunking:
  → RAGAS score: 0.81 (-0.04 = parent document contributes 0.04)
Conclusion: reranking is the most important component. Don't remove it. Compression matters less.

RAG Component Ablation Matrix
from dataclasses import dataclass
from typing import Callable


@dataclass
class RAGConfig:
    """One pipeline variant; every flag defaults to the full system."""
    name: str
    use_hybrid: bool = True
    use_reranking: bool = True
    use_compression: bool = True
    use_parent_doc: bool = True
    use_query_rewriting: bool = True
    chunk_size: int = 512
    top_k: int = 5


# Define all ablation configurations: each one changes a single setting
configs = [
    RAGConfig("full_system"),
    RAGConfig("no_reranking", use_reranking=False),
    RAGConfig("no_hybrid", use_hybrid=False),
    RAGConfig("no_compression", use_compression=False),
    RAGConfig("no_parent_doc", use_parent_doc=False),
    RAGConfig("no_query_rewriting", use_query_rewriting=False),
    # As written, this has the same flags as "no_hybrid". A true sparse-only
    # run needs its own setting, which RAGConfig does not model here.
    RAGConfig("sparse_only", use_hybrid=False),
    RAGConfig("small_chunks", chunk_size=128),
    RAGConfig("large_chunks", chunk_size=1024),
    RAGConfig("top_3", top_k=3),
    RAGConfig("top_10", top_k=10),
]

Running the Ablation
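The runner in this section takes a build_rag_fn that turns a config into a callable pipeline, but the article does not show one. Here is a minimal sketch of that shape, assuming toy stand-ins for the retriever, reranker, and LLM call: CORPUS, retrieve, rerank, and generate are all hypothetical names, and only the use_reranking and top_k settings are wired up. A real build_rag_fn would switch hybrid retrieval, compression, parent-document lookup, and query rewriting the same way.

```python
from types import SimpleNamespace
from typing import Callable

# Toy corpus and components: hypothetical stand-ins for a real retriever,
# reranker, and LLM call. None of these come from a real library.
CORPUS = [
    "Metformin is a first-line treatment for type 2 diabetes.",
    "Warfarin dosing is adjusted using INR measurements.",
    "Type 2 diabetes guidelines recommend lifestyle changes first.",
]


def _overlap(query: str, doc: str) -> int:
    """Count lowercase whitespace-separated tokens shared by query and document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def retrieve(query: str, top_k: int) -> list[str]:
    """Toy first stage: first top_k documents (corpus order) with any overlap."""
    return [d for d in CORPUS if _overlap(query, d) > 0][:top_k]


def rerank(query: str, docs: list[str]) -> list[str]:
    """Toy reranker: order candidates by token overlap, highest first."""
    return sorted(docs, key=lambda d: _overlap(query, d), reverse=True)


def generate(query: str, contexts: list[str]) -> str:
    """Placeholder for the LLM call: echo the top-ranked context."""
    return contexts[0] if contexts else "No relevant context found."


def build_rag_fn(config) -> Callable[[str], dict]:
    """Return a pipeline closure whose stages are switched by the config flags."""
    def rag_fn(question: str) -> dict:
        contexts = retrieve(question, config.top_k)
        if config.use_reranking:
            contexts = rerank(question, contexts)
        return {"answer": generate(question, contexts), "contexts": contexts}
    return rag_fn


# Stand-in objects exposing the same attribute names as RAGConfig.
full = SimpleNamespace(use_reranking=True, top_k=3)
ablated = SimpleNamespace(use_reranking=False, top_k=3)
question = "type 2 diabetes guidelines"
print(build_rag_fn(full)(question)["contexts"][0])
print(build_rag_fn(ablated)(question)["contexts"][0])
```

The returned dictionary uses the "answer" and "contexts" keys that the runner reads. Keeping every stage behind a flag inside one closure means each ablation run differs from the full system in exactly one place.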
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset


def run_ablation(
    configs: list[RAGConfig],
    eval_questions: list[dict],
    build_rag_fn: Callable[[RAGConfig], Callable],
) -> dict[str, dict]:
    """Run every config over the same eval set and collect its RAGAS scores."""
    results = {}
    for config in configs:
        print(f"Running: {config.name}")
        # Build the RAG pipeline for this config
        rag_fn = build_rag_fn(config)

        # Run all eval questions through the pipeline
        samples = []
        for q in eval_questions:
            result = rag_fn(q["question"])
            samples.append({
                "question": q["question"],
                "answer": result["answer"],
                "contexts": result["contexts"],
                "ground_truth": q["ground_truth"],
            })

        # Score (column names and the result type vary between ragas versions)
        dataset = Dataset.from_list(samples)
        scores = evaluate(dataset, metrics=[
            faithfulness, answer_relevancy, context_precision, context_recall
        ])
        # Assumes the result behaves like a mapping of metric name -> score
        results[config.name] = dict(scores)
    return results


def print_ablation_table(results: dict[str, dict]) -> None:
    """Print one row per config, with deltas against the full system."""
    headers = ["Config", "Faithfulness", "Relevance", "Precision", "Recall"]
    print(" | ".join(f"{h:20}" for h in headers))
    # Five 20-character columns plus four 3-character separators
    print("-" * (20 * len(headers) + 3 * (len(headers) - 1)))
    baseline = results.get("full_system", {})
    for config_name, scores in results.items():
        row = [config_name]
        for metric in ["faithfulness", "answer_relevancy", "context_precision", "context_recall"]:
            val = scores.get(metric, 0)
            baseline_val = baseline.get(metric, 0)
            # The baseline row shows raw scores only
            delta = f"({val - baseline_val:+.3f})" if config_name != "full_system" else ""
            row.append(f"{val:.3f} {delta}")
        print(" | ".join(f"{r:20}" for r in row))

What to Ablate and What It Reveals
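One way to read the results is to rank the ablations by how far a metric falls relative to full_system. The sketch below assumes a results dictionary shaped like the one run_ablation returns. The function name rank_by_delta and the metric key "ragas_score" are hypothetical, and the numbers are the illustrative scores from the worked example at the top of this article, not measurements.

```python
def rank_by_delta(
    results: dict[str, dict],
    metric: str,
    baseline: str = "full_system",
) -> list[tuple[str, float]]:
    """Return (config_name, delta) pairs, largest drop from the baseline first."""
    base = results[baseline][metric]
    deltas = [
        (name, round(scores[metric] - base, 3))
        for name, scores in results.items()
        if name != baseline
    ]
    # Most negative delta first: removing that component hurt the most.
    return sorted(deltas, key=lambda item: item[1])


# Illustrative scores from the worked example; "ragas_score" is a made-up key.
example_results = {
    "full_system": {"ragas_score": 0.85},
    "no_reranking": {"ragas_score": 0.79},
    "no_compression": {"ragas_score": 0.83},
    "no_parent_doc": {"ragas_score": 0.81},
}

for name, delta in rank_by_delta(example_results, "ragas_score"):
    print(f"{name:16} {delta:+.3f}")
```

Leave-one-out deltas are not guaranteed to add up, because components interact: a reranker, for instance, can matter more when first-stage retrieval is weaker. Read the ranking as a guide to relative importance rather than an exact decomposition of the score.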
Ablation                  What it reveals
------------------------  --------------------------------------------------
Remove reranking          How much of the precision comes from reranking
                          vs. from the initial retrieval?
Remove hybrid retrieval   Do we actually benefit from BM25 in our domain?
                          (For highly semantic queries, BM25 may add noise.)
Remove compression        Does contextual compression actually help, or
                          does it drop important nuance?
Remove parent doc         Is full-context retrieval worth the storage cost?
Remove query rewriting    How much do users benefit from query expansion?
                          (medical abbreviations → clinical terms)
Vary chunk size           What's the optimal granularity for our corpus?
Vary top_k                Does more context help or hurt (noise vs. coverage)?

Common Ablation Findings for Clinical RAG
Typical findings in clinical document RAG:
1. Reranking gives the biggest single improvement
   +0.08 to +0.15 on faithfulness and precision
   Worth the latency and cost for clinical use cases
2. Hybrid retrieval helps significantly
   +0.05 to +0.10 recall for medical abbreviations and specific drug names
3. Query rewriting matters a lot for lay user queries
   Less important for clinical staff queries (they already use correct terminology)
4. Contextual compression helps for long documents
   Clinical guidelines (50+ pages): helps significantly
   EHR notes (1-5 pages): less important
5. Parent document vs. flat chunking
   Structured guidelines: parent document is significantly better
   Unstructured notes: difference is smaller
6. chunk_size = 300-500 tokens is often optimal for clinical text
   Small enough for precision, large enough for context

Interview Answer
"Ablation studies systematically disable one RAG component at a time and measure the RAGAS score delta, which tells you which components contribute most to performance. The methodology: build a full pipeline, run your eval set, then remove one component and re-run. The delta is the contribution of that component. In clinical RAG, typical findings are: reranking gives the largest single improvement (+0.08 to +0.15 on faithfulness/precision); hybrid retrieval helps significantly for medical abbreviations; query rewriting helps for lay users; chunk size around 300-500 tokens is usually optimal. Ablation studies prevent over-engineering: you don't need every advanced technique if the simple baseline already works well for your data."