Advanced RAG · Lesson 14 of 14
Interview: Advanced RAG Design Scenarios
Q: What are the main failure modes of naive RAG?
1. Wrong documents retrieved:
- Query vocabulary mismatch with document vocabulary
- Rare medical terms not well-represented in embedding space
- Fix: hybrid retrieval (BM25 catches exact terms), query rewriting
2. Redundant documents retrieved:
- Multiple near-duplicate chunks about the same topic
- Wastes context window, model doesn't get diverse information
- Fix: MMR, deduplication, parent document retrieval
3. Context fragmentation:
- Key information split across chunk boundaries
- Retrieved chunk lacks context to interpret the answer
- Fix: parent document retrieval, small-to-big, larger chunks with overlap
4. Hallucination despite retrieval:
- Model adds information from its parametric knowledge beyond the context
- Grounding instructions are too weak, or the model generates too freely
- Fix: stronger grounding instructions, faithfulness-focused fine-tuning
5. Missing global context:
- Query requires synthesising across many documents
- Vector similarity retrieves local matches, misses the pattern
- Fix: GraphRAG, community summaries, summarisation-then-retrieval
Q: How do you choose between hybrid retrieval and dense-only?
Run an ablation study on your specific corpus and query distribution. Dense-only works well when:
- Queries are semantic ("What causes atrial fibrillation?")
- The corpus vocabulary aligns well with the embedding model's training
- No rare technical terms (drug codes, specific gene variants)
Hybrid (add BM25) helps when:
- Queries include specific drug names, ICD codes, gene allele notation
- Users search with abbreviations (INR, AF, CKD)
- Your corpus has highly technical vocabulary
In practice, hybrid almost always matches or beats dense-only at a small additional cost, so default to hybrid and let the ablation confirm it.
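One common way to combine a dense ranking with a BM25 ranking is reciprocal rank fusion (RRF). The following is a minimal sketch, not a specific library's API: the function name, the document ids, and the two toy rankings are all hypothetical, and k=60 is just the commonly used default constant.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results for a query containing an exact term (e.g. "INR"):
# dense search ranks a general overview first, BM25 ranks the exact match first.
dense_ranking = ["afib_overview", "anticoag_guideline", "warfarin_inr_protocol"]
bm25_ranking = ["warfarin_inr_protocol", "anticoag_guideline", "ckd_dosing"]

fused = reciprocal_rank_fusion([dense_ranking, bm25_ranking])
print(fused)
```

Because RRF uses only ranks, it needs no score normalisation between BM25 and cosine similarity. In this toy case the exact-match document that dense search ranked third ends up first after fusion.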
Q: Walk me through your RAG design for a clinical knowledge base.
Documents: clinical guidelines, formulary, protocols (structured)
Chunking: parent document retrieval
- Parent: section level (~1000 tokens)
- Child: paragraph level (~200 tokens)
- Child embeddings indexed; parent chunks returned
Embedding: clinical-domain model
- MedCPT or fine-tuned BioBERT embedding
- Outperforms general-purpose models on medical text
Retrieval: hybrid
- Dense vector search (Azure AI Search)
- BM25 keyword search for drug names, codes
- RRF fusion
Reranking: cross-encoder
- Cohere Rerank or MedCPT-Cross-Encoder
- Top-50 candidates → reranked → top-5 returned
Query transformation:
- Abbreviation expansion (INR → international normalised ratio)
- Lay-to-medical term conversion
- Multi-query for complex questions
Context compression:
- For long guidelines: extract only relevant paragraphs
- Reduces noise and context window usage
Generation: grounded with citations
- "According to [Section 4.2, Clinical Pharmacology]..."
- Output classifier: block dosage recommendations
Evaluation: RAGAS on 100-question test set
- Faithfulness > 0.85
- Context precision > 0.75
Q: How would you reduce RAG latency for a real-time clinical chat application?
Latency budget: 2 seconds total (TTFT SLA)
Breakdown and optimisation:
Query rewriting: 100ms → cache common patterns, skip for simple queries
Embedding: 50ms → use local model (e.g., all-MiniLM), avoid API roundtrip
Vector search: 20-50ms → HNSW index, pre-built (no rebuild)
BM25: 10-30ms → in-memory index
Reranking: 100-300ms → biggest latency contributor
Optimise: reduce candidate pool from 50 to 20
Optimise: use smaller cross-encoder (MiniLM-6 not MiniLM-12)
Optimise: run dense search and BM25 concurrently so the reranker receives its candidates sooner
Compression: 100-200ms → skip for short documents
LLM generation: 500-1500ms → stream tokens, show first token early
With streaming:
TTFT (time to first token) = retrieval + prompt assembly + LLM TTFT
≈ 300ms retrieval + 50ms assembly + 200ms LLM TTFT = ~550ms
User sees first token at ~550ms, which is acceptable for interactive chat
Q: How do you evaluate the retrieval component in isolation?
Use a retrieval-specific evaluation (not the end-to-end RAGAS score):
from typing import NamedTuple


class RetrievalEvalCase(NamedTuple):
    query: str
    relevant_doc_ids: list[str]  # ground truth relevant documents


def evaluate_retrieval(retriever, eval_cases, k=5) -> dict:
    """Return mean Precision@k, Recall@k and MRR over eval_cases.

    retriever(query, top_k=k) must return a ranked list of dicts with an "id" key.
    """
    if not eval_cases:
        raise ValueError("eval_cases must not be empty")

    precision_at_k = []
    recall_at_k = []
    mrr = []
    for case in eval_cases:
        retrieved_ids = [d["id"] for d in retriever(case.query, top_k=k)]
        relevant_set = set(case.relevant_doc_ids)

        # Precision@k: fraction of retrieved that are relevant
        hits = [1 if doc_id in relevant_set else 0 for doc_id in retrieved_ids]
        precision_at_k.append(sum(hits) / k)

        # Recall@k: fraction of relevant that were retrieved
        recall_at_k.append(sum(hits) / max(len(relevant_set), 1))

        # MRR: reciprocal rank of first relevant result (0.0 if none was retrieved)
        for rank, is_hit in enumerate(hits, 1):
            if is_hit:
                mrr.append(1.0 / rank)
                break
        else:
            mrr.append(0.0)

    return {
        f"precision@{k}": sum(precision_at_k) / len(eval_cases),
        f"recall@{k}": sum(recall_at_k) / len(eval_cases),
        "mrr": sum(mrr) / len(eval_cases),
    }
Q: When would GraphRAG be inappropriate for a clinical system?
GraphRAG is inappropriate when:
- The corpus changes frequently (drug updates, new guidelines) — graph rebuild is expensive
- Queries are simple factual lookups — graph overhead not justified
- Real-time latency is required — graph traversal adds latency
- Storage budget is constrained — graph, embeddings, and community summaries require 3× storage
- The clinical team doesn't have expertise to audit LLM-extracted relationships — errors in the knowledge graph can mislead retrieval
Better suited for: static medical reference corpora (drug interactions database, diagnostic guidelines) where multi-hop reasoning between entities is needed and the data changes infrequently.
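To make "multi-hop reasoning between entities" concrete, here is a toy sketch of a breadth-first search over extracted (entity, relation, entity) triples. It is an illustration of the idea only, not how any GraphRAG implementation works internally. The triples, entity names, and helper functions are all hypothetical placeholders.

```python
from collections import deque

# Hypothetical LLM-extracted triples; placeholder names, not clinical facts.
TRIPLES = [
    ("drug_A", "inhibits", "enzyme_X"),
    ("enzyme_X", "metabolises", "drug_B"),
    ("drug_B", "monitored_by", "lab_test_Y"),
    ("drug_C", "treats", "condition_Z"),
]


def build_graph(triples):
    """Build a directed adjacency list: head -> [(relation, tail), ...]."""
    graph = {}
    for head, relation, tail in triples:
        graph.setdefault(head, []).append((relation, tail))
    return graph


def find_path(graph, start, goal, max_hops=3):
    """Breadth-first search; returns the shortest chain of triples or None."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        if len(path) >= max_hops:
            continue  # do not expand beyond the hop limit
        for relation, neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, path + [(node, relation, neighbour)]))
    return None


graph = build_graph(TRIPLES)
path = find_path(graph, "drug_A", "drug_B")
print(path)
```

The link between drug_A and drug_B exists only through enzyme_X, so no single chunk mentioning both needs to exist for the connection to be found. The sketch also shows why auditing matters: one wrongly extracted triple would create or remove a path silently.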
Interview Answer Template
"Advanced RAG builds on naive RAG by addressing specific failure modes: hybrid retrieval (adding BM25 alongside dense embeddings) for exact term matching; reranking (cross-encoder) for precision; contextual compression for context window efficiency; parent document retrieval for chunk context; query rewriting for vocabulary bridging; MMR for diversity. Each adds latency and complexity — ablation studies on your specific domain and corpus determine which are worth it. In clinical RAG, reranking and hybrid retrieval almost always justify their cost; compression depends on document length. RAGAS provides end-to-end evaluation; retrieval-specific metrics (MRR, Recall@k) help diagnose where in the pipeline failures occur."