Learnixo

Interview: Advanced RAG Design Scenarios

Q: What are the main failure modes of naive RAG?

1. Wrong documents retrieved:
   - Query vocabulary mismatch with document vocabulary
   - Rare medical terms are poorly represented in the embedding space
   - Fix: hybrid retrieval (BM25 catches exact terms), query rewriting

2. Redundant documents retrieved:
   - Multiple near-duplicate chunks about the same topic
   - Wastes the context window, so the model sees less diverse information
   - Fix: MMR, deduplication, parent document retrieval

3. Context fragmentation:
   - Key information split across chunk boundaries
   - Retrieved chunk lacks context to interpret the answer
   - Fix: parent document retrieval, small-to-big, larger chunks with overlap

4. Hallucination despite retrieval:
   - Model adds information from its parametric knowledge beyond the context
   - Grounding instructions are not strong enough, or the model generates too freely
   - Fix: stronger grounding instructions, faithfulness-focused fine-tuning

5. Missing global context:
   - Query requires synthesising across many documents
   - Vector similarity retrieves local matches, misses the pattern
   - Fix: GraphRAG, community summaries, summarisation-then-retrieval
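
As an illustration of the MMR fix for failure mode 2, here is a minimal sketch of Maximal Marginal Relevance over plain Python lists. The function names, the toy 2-D vectors, and the default `lambda_mult=0.5` are assumptions made for this example; a real system would run this over embedding vectors from the retriever.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    """Return indices of up to k documents, trading relevance against redundancy."""
    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:
        def mmr_score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy: similarity to the closest already-selected document
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy vectors (hypothetical): docs 0 and 1 are near-duplicates, doc 2 differs.
query = [1.0, 1.0]
docs = [[1.0, 0.9], [1.0, 0.88], [0.6, 1.0]]

print(mmr_select(query, docs, k=2))                   # [0, 2]
print(mmr_select(query, docs, k=2, lambda_mult=1.0))  # [0, 1]
```

With `lambda_mult=1.0` the redundancy term vanishes and the selection falls back to pure relevance ranking, which returns both near-duplicates.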

Q: How do you choose between hybrid retrieval and dense-only?

Run an ablation study on your specific corpus and query distribution. Dense-only works well when:

  • Queries are semantic ("What causes atrial fibrillation?")
  • The corpus vocabulary aligns well with the embedding model's training
  • No rare technical terms (drug codes, specific gene variants)

Hybrid (add BM25) helps when:

  • Queries include specific drug names, ICD codes, gene allele notation
  • Users search with abbreviations (INR, AF, CKD)
  • Your corpus has highly technical vocabulary

In practice, hybrid retrieval almost always matches or beats dense-only at only slightly higher cost, so default to hybrid unless your ablation shows otherwise.
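
To show why the BM25 side of a hybrid setup catches exact terms such as abbreviations, here is a minimal sketch of Okapi BM25 scoring in pure Python. The whitespace tokeniser, the parameter defaults (`k1=1.5`, `b=0.75`), and the three example sentences are assumptions for illustration; a production system would rely on the search engine's own BM25 implementation and analyser.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    if n_docs == 0:
        return []
    avg_len = sum(len(tokens) for tokens in tokenised) / n_docs
    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))

    scores = []
    for tokens in tokenised:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # only exact term matches contribute
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


# Hypothetical mini-corpus: only the first document contains the literal term "inr".
corpus = [
    "warfarin dosing is adjusted using the inr target range",
    "anticoagulant therapy requires regular blood monitoring",
    "atrial fibrillation increases stroke risk",
]
scores = bm25_scores("INR target", corpus)
print(scores)  # only the first score is non-zero
```

A dense model may or may not place "INR" close to the second sentence, whereas BM25 scores only the literal match, which is the complementary signal hybrid retrieval relies on.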


Q: Walk me through your RAG design for a clinical knowledge base.

Documents: clinical guidelines, formulary, protocols (structured)

Chunking: parent document retrieval
  - Parent: section level (~1000 tokens)
  - Child: paragraph level (~200 tokens)
  - Child embeddings indexed; parent chunks returned

Embedding: clinical-domain model
  - MedCPT or fine-tuned BioBERT embedding
  - Typically outperforms general-purpose models on medical text (confirm on your own eval set)

Retrieval: hybrid
  - Dense vector search (Azure AI Search)
  - BM25 keyword search for drug names, codes
  - RRF fusion

Reranking: cross-encoder
  - Cohere Rerank or MedCPT-Cross-Encoder
  - Top-50 candidates → reranked → top-5 returned

Query transformation:
  - Abbreviation expansion (INR → international normalised ratio)
  - Lay-to-medical term conversion
  - Multi-query for complex questions

Context compression:
  - For long guidelines: extract only relevant paragraphs
  - Reduces noise and context window usage

Generation: grounded with citations
  - "According to [Section 4.2, Clinical Pharmacology]..."
  - Output classifier: block dosage recommendations

Evaluation: RAGAS on 100-question test set
  - Faithfulness > 0.85
  - Context precision > 0.75
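
The RRF fusion step in the retrieval stage can be sketched as follows. This is a minimal illustration of reciprocal rank fusion, assuming each retriever returns a ranked list of document IDs; the function name, the document IDs, and the commonly used smoothing constant `k=60` are choices made for this example, not details of any particular search service.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked ID lists: each list contributes 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from the dense and BM25 retrievers
dense_hits = ["g12", "g07", "g33"]
bm25_hits = ["g07", "g45", "g12"]

fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
print(fused)  # ['g07', 'g12', 'g45', 'g33']
```

Because RRF uses only ranks, it needs no calibration between cosine similarities and BM25 scores, which live on different scales.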

Q: How would you reduce RAG latency for a real-time clinical chat application?

Latency budget: 2 seconds total (TTFT SLA)

Breakdown and optimisation:
  Query rewriting: 100ms → cache common patterns, skip for simple queries
  Embedding: 50ms → use local model (e.g., all-MiniLM), avoid API roundtrip
  Vector search: 20-50ms → HNSW index, pre-built (no rebuild)
  BM25: 10-30ms → in-memory index
  Reranking: 100-300ms → biggest latency contributor
    Optimise: reduce candidate pool from 50 to 20
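    Optimise: use a smaller cross-encoder (6-layer MiniLM instead of 12-layer)
    Note: reranking needs the retrieved candidates, so it cannot overlap with embedding; parallelise BM25 with the embedding + vector search path instead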
  Compression: 100-200ms → skip for short documents
  LLM generation: 500-1500ms → stream tokens, show first token early

With streaming:
  TTFT (time to first token) = retrieval + prompt assembly + LLM TTFT
  ≈ 300ms retrieval + 50ms assembly + 200ms LLM TTFT = ~550ms
  User sees first token at 550ms — acceptable for interactive chat
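
The TTFT arithmetic above can be reproduced with a small budget calculation. The per-stage figures below are assumptions picked from within the ranges in the breakdown (with query rewriting and compression skipped), chosen so that retrieval comes to about 300ms; measure your own pipeline before relying on them.

```python
# Illustrative per-stage latencies in milliseconds (assumed, not measured)
EMBEDDING_MS = 50
VECTOR_SEARCH_MS = 50
BM25_MS = 30
RERANK_MS = 200
PROMPT_ASSEMBLY_MS = 50
LLM_TTFT_MS = 200


def retrieval_latency_ms():
    """Dense and BM25 paths run concurrently; reranking waits for both."""
    dense_path = EMBEDDING_MS + VECTOR_SEARCH_MS
    sparse_path = BM25_MS
    return max(dense_path, sparse_path) + RERANK_MS


def time_to_first_token_ms():
    """TTFT = retrieval + prompt assembly + the LLM's own time to first token."""
    return retrieval_latency_ms() + PROMPT_ASSEMBLY_MS + LLM_TTFT_MS


print(retrieval_latency_ms())    # 300
print(time_to_first_token_ms())  # 550
```

Only the slower of the two concurrent paths counts towards the total, so in this sketch BM25 is effectively free and the reranker dominates the retrieval budget.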

Q: How do you evaluate the retrieval component in isolation?

Use a retrieval-specific evaluation (not the end-to-end RAGAS score):

```python
from typing import Callable, NamedTuple


class RetrievalEvalCase(NamedTuple):
    query: str
    relevant_doc_ids: list[str]  # ground-truth relevant documents


def evaluate_retrieval(retriever: Callable, eval_cases: list, k: int = 5) -> dict:
    """Average precision@k, recall@k and MRR over all eval cases.

    `retriever(query, top_k=k)` must return a ranked list of dicts with an "id" key.
    """
    if not eval_cases:
        raise ValueError("eval_cases must not be empty")

    precision_at_k = []
    recall_at_k = []
    mrr = []

    for case in eval_cases:
        retrieved_ids = [d["id"] for d in retriever(case.query, top_k=k)]
        relevant_set = set(case.relevant_doc_ids)

        # Precision@k: fraction of the k retrieved slots that are relevant
        hits = [1 if doc_id in relevant_set else 0 for doc_id in retrieved_ids]
        precision_at_k.append(sum(hits) / k)

        # Recall@k: fraction of relevant documents that were retrieved.
        # Count distinct IDs so duplicate results cannot push recall above 1.
        found = relevant_set.intersection(retrieved_ids)
        recall_at_k.append(len(found) / max(len(relevant_set), 1))

        # MRR: reciprocal rank of the first relevant result (0 if none found)
        for rank, is_hit in enumerate(hits, 1):
            if is_hit:
                mrr.append(1.0 / rank)
                break
        else:
            mrr.append(0.0)

    return {
        f"precision@{k}": sum(precision_at_k) / len(eval_cases),
        f"recall@{k}": sum(recall_at_k) / len(eval_cases),
        "mrr": sum(mrr) / len(eval_cases),
    }
```

Q: When would GraphRAG be inappropriate for a clinical system?

GraphRAG is inappropriate when:

  • The corpus changes frequently (drug updates, new guidelines) — graph rebuild is expensive
  • Queries are simple factual lookups — graph overhead not justified
  • Real-time latency is required — graph traversal adds latency
  • Storage budget is constrained — storing the graph, embeddings, and community summaries can take roughly 3× the space of a vector-only index
  • The clinical team doesn't have expertise to audit LLM-extracted relationships — errors in the knowledge graph can mislead retrieval

Better suited for: static medical reference corpora (drug interactions database, diagnostic guidelines) where multi-hop reasoning between entities is needed and the data changes infrequently.


Interview Answer Template

"Advanced RAG builds on naive RAG by addressing specific failure modes: hybrid retrieval (adding BM25 alongside dense embeddings) for exact term matching; reranking (cross-encoder) for precision; contextual compression for context window efficiency; parent document retrieval for chunk context; query rewriting for vocabulary bridging; MMR for diversity. Each adds latency and complexity — ablation studies on your specific domain and corpus determine which are worth it. In clinical RAG, reranking and hybrid retrieval almost always justify their cost; compression depends on document length. RAGAS provides end-to-end evaluation; retrieval-specific metrics (MRR, Recall@k) help diagnose where in the pipeline failures occur."