Interview: Advanced RAG Design Scenarios — Advanced RAG | Learnixo

Q: What are the main failure modes of naive RAG?

1. Wrong documents retrieved:
   - Query vocabulary mismatch with document vocabulary
   - Rare medical terms not well-represented in embedding space
   - Fix: hybrid retrieval (BM25 catches exact terms), query rewriting

2. Redundant documents retrieved:
   - Multiple near-duplicate chunks about the same topic
   - Wastes context window; the model gets less diverse information
   - Fix: MMR, deduplication, parent document retrieval

3. Context fragmentation:
   - Key information split across chunk boundaries
   - Retrieved chunk lacks context to interpret the answer
   - Fix: parent document retrieval, small-to-big, larger chunks with overlap

4. Hallucination despite retrieval:
   - Model adds information from its parametric knowledge beyond the context
   - Grounding instructions are too weak, or the model generates too freely
   - Fix: stronger grounding instructions, faithfulness-focused fine-tuning

5. Missing global context:
   - Query requires synthesising across many documents
   - Vector similarity retrieves local matches, misses the pattern
   - Fix: GraphRAG, community summaries, summarisation-then-retrieval
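
As a rough illustration of the MMR fix listed under failure mode 2, the sketch below greedily picks chunks that are relevant to the query but dissimilar to chunks already chosen. The function names (`cosine`, `mmr_select`), the toy 2-D vectors, and the `lambda_mult` default are assumptions made for illustration; a real pipeline would use its own embeddings and probably a library implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def mmr_select(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    """Return indices of up to k documents chosen by maximal marginal relevance.

    lambda_mult=1.0 ranks purely by relevance; lower values penalise
    similarity to documents that were already selected.
    """
    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy data (hypothetical): docs 0 and 1 are exact duplicates, doc 2 is distinct.
query = [1.0, 0.0]
docs = [[1.0, 0.1], [1.0, 0.1], [0.8, -0.6]]
diverse = mmr_select(query, docs, k=2, lambda_mult=0.5)
relevance_only = mmr_select(query, docs, k=2, lambda_mult=1.0)
```

With the diversity penalty active, the duplicate chunk is skipped in favour of the distinct one; with `lambda_mult=1.0` the duplicate is returned, which is the redundancy problem described above.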

Q: How do you choose between hybrid retrieval and dense-only?

Run an ablation study on your specific corpus and query distribution. Dense-only works well when:

- Queries are semantic ("What causes atrial fibrillation?")
- The corpus vocabulary aligns well with the embedding model's training
- No rare technical terms (drug codes, specific gene variants)

Hybrid (add BM25) helps when:

- Queries include specific drug names, ICD codes, gene allele notation
- Users search with abbreviations (INR, AF, CKD)
- Your corpus has highly technical vocabulary

In practice, hybrid almost always matches or beats dense-only at only slightly higher cost, so default to hybrid.
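
A minimal sketch of what such an ablation could look like: score each retriever configuration on the same labelled queries and compare Recall@k. The helper names (`recall_at_k`, `run_ablation`), the stub retrievers, and the document IDs are hypothetical placeholders, not measurements from any real corpus.

```python
def recall_at_k(retriever, eval_cases, k=5):
    """Mean fraction of relevant IDs found in the top-k results."""
    total = 0.0
    for query, relevant_ids in eval_cases:
        retrieved = set(retriever(query)[:k])
        total += len(retrieved & set(relevant_ids)) / len(relevant_ids)
    return total / len(eval_cases)

def run_ablation(retrievers, eval_cases, k=5):
    """Score every named retriever configuration on the same eval set."""
    return {name: recall_at_k(fn, eval_cases, k) for name, fn in retrievers.items()}

# Toy eval set: (query, relevant document IDs). Purely illustrative.
eval_cases = [
    ("What causes atrial fibrillation?", ["d1"]),
    ("INR target on warfarin", ["d2"]),
]

# Stub retrievers standing in for real dense-only and hybrid pipelines.
dense_results = {
    "What causes atrial fibrillation?": ["d1", "d3"],
    "INR target on warfarin": ["d3", "d4"],  # misses the abbreviation-heavy query
}
hybrid_results = {
    "What causes atrial fibrillation?": ["d1", "d3"],
    "INR target on warfarin": ["d2", "d3"],
}

scores = run_ablation(
    {"dense_only": dense_results.__getitem__, "hybrid": hybrid_results.__getitem__},
    eval_cases,
    k=2,
)
```

On a real corpus the same harness would be run with the production retrievers and a held-out query set, ideally alongside latency measurements for each configuration.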

Q: Walk me through your RAG design for a clinical knowledge base.

Documents: clinical guidelines, formulary, protocols (structured)

Chunking: parent document retrieval
  - Parent: section level (~1000 tokens)
  - Child: paragraph level (~200 tokens)
  - Child embeddings indexed; parent chunks returned
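
To make the child-to-parent lookup concrete, here is a small sketch. It assumes each child chunk stores the ID of its parent section, and it uses a toy token-overlap score (`token_overlap`) as a stand-in for embedding similarity; both helpers and all sample texts are hypothetical.

```python
def token_overlap(query, text):
    """Toy stand-in for embedding similarity: number of shared lowercase tokens."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def retrieve_parents(query, children, parents, top_k_children=4, max_parents=2):
    """Search over small child chunks, return the larger parent sections.

    children: list of (child_text, parent_id)
    parents:  dict mapping parent_id -> full parent text
    """
    ranked = sorted(children, key=lambda c: token_overlap(query, c[0]), reverse=True)
    seen, results = set(), []
    for child_text, parent_id in ranked[:top_k_children]:
        if token_overlap(query, child_text) == 0:
            continue  # ignore children with no match at all
        if parent_id not in seen:  # several children may share one parent
            seen.add(parent_id)
            results.append(parents[parent_id])
        if len(results) == max_parents:
            break
    return results

parents = {
    "s1": "Section 1 (full text): warfarin dosing and INR monitoring ...",
    "s2": "Section 2 (full text): metformin dosing in renal impairment ...",
}
children = [
    ("warfarin starting dose adults", "s1"),
    ("warfarin inr monitoring schedule", "s1"),
    ("metformin renal dose adjustment", "s2"),
]
hits = retrieve_parents("warfarin inr monitoring", children, parents)
```

Two children of the same section match the query, but the section is returned only once, which also helps with the redundancy problem.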

Embedding: clinical-domain model
  - MedCPT or fine-tuned BioBERT embedding
  - Outperforms general-purpose models on medical text

Retrieval: hybrid
  - Dense vector search (Azure AI Search)
  - BM25 keyword search for drug names, codes
  - RRF fusion
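
For reference, reciprocal rank fusion can be sketched in a few lines: each document scores the sum of 1 / (k + rank) over the ranked lists it appears in. The function name `rrf_fuse` and the sample IDs are illustrative; k=60 is the constant commonly used for RRF, and a managed search service may apply its own variant.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Fuse several ranked ID lists with reciprocal rank fusion.

    rankings: iterable of lists, each ordered best-first.
    Returns IDs ordered by fused score (ties broken by ID for determinism).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

dense_ranking = ["a", "b", "c"]   # hypothetical dense vector results
bm25_ranking = ["c", "a", "d"]    # hypothetical BM25 results
fused = rrf_fuse([dense_ranking, bm25_ranking])
```

Because RRF uses only ranks, it avoids having to normalise BM25 scores against cosine similarities.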

Reranking: cross-encoder
  - Cohere Rerank or MedCPT-Cross-Encoder
  - Top-50 candidates → reranked → top-5 returned

Query transformation:
  - Abbreviation expansion (INR → international normalised ratio)
  - Lay-to-medical term conversion
  - Multi-query for complex questions
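
A minimal sketch of abbreviation expansion, assuming a hand-maintained lookup table. The table contents and the function name `expand_abbreviations` are illustrative; a production system would need a curated, context-aware mapping, since many clinical abbreviations are ambiguous.

```python
import re

# Hypothetical lookup table; real deployments need a reviewed clinical list.
ABBREVIATIONS = {
    "INR": "international normalised ratio",
    "AF": "atrial fibrillation",
    "CKD": "chronic kidney disease",
}

def expand_abbreviations(query, table=ABBREVIATIONS):
    """Append the long form after each known abbreviation (case-sensitive)."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, table)) + r")\b")
    return pattern.sub(lambda m: f"{m.group(0)} ({table[m.group(0)]})", query)

expanded = expand_abbreviations("target INR in AF")
```

The original abbreviation is kept next to its expansion so that BM25 can still match documents that use the short form.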

Context compression:
  - For long guidelines: extract only relevant paragraphs
  - Reduces noise and context window usage

Generation: grounded with citations
  - "According to [Section 4.2, Clinical Pharmacology]..."
  - Output classifier: block dosage recommendations

Evaluation: RAGAS on 100-question test set
  - Faithfulness > 0.85
  - Context precision > 0.75

Q: How would you reduce RAG latency for a real-time clinical chat application?

Latency budget: 2 seconds total (TTFT SLA)

Breakdown and optimisation:
  Query rewriting: 100ms → cache common patterns, skip for simple queries
  Embedding: 50ms → use local model (e.g., all-MiniLM), avoid API roundtrip
  Vector search: 20-50ms → HNSW index, pre-built (no rebuild)
  BM25: 10-30ms → in-memory index
  Reranking: 100-300ms → biggest latency contributor
    Optimise: reduce candidate pool from 50 to 20
    Optimise: use a smaller cross-encoder (e.g., a 6-layer MiniLM instead of a 12-layer one)
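    Optimise: score all candidate pairs in one batched pass; upstream, run BM25 in parallel with the embedding step (reranking itself must wait for the retrieved candidates)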
  Compression: 100-200ms → skip for short documents
  LLM generation: 500-1500ms → stream tokens, show first token early

With streaming:
  TTFT (time to first token) = retrieval + prompt assembly + LLM TTFT
  ≈ 300ms retrieval + 50ms assembly + 200ms LLM TTFT = ~550ms
  User sees first token at 550ms — acceptable for interactive chat
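
One way to keep the retrieval portion of that budget small is to run the dense and BM25 searches concurrently, since neither depends on the other. The sketch below uses `asyncio.gather`; the two search coroutines are stubs with simulated delays, and their names, delays, and return values are assumptions for illustration only.

```python
import asyncio

async def dense_search(query):
    """Stub for a dense vector search call (simulated 50 ms of I/O)."""
    await asyncio.sleep(0.05)
    return ["doc_a", "doc_b"]

async def bm25_search(query):
    """Stub for a BM25 keyword search call (simulated 20 ms of I/O)."""
    await asyncio.sleep(0.02)
    return ["doc_b", "doc_c"]

async def retrieve_parallel(query):
    """Run both searches concurrently; wall time is roughly the slower of the two."""
    dense, sparse = await asyncio.gather(dense_search(query), bm25_search(query))
    return {"dense": dense, "bm25": sparse}

if __name__ == "__main__":
    print(asyncio.run(retrieve_parallel("warfarin INR target")))
```

The reranker still has to wait for both result lists, so this shortens the retrieval stage but does not remove reranking from the critical path.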

Q: How do you evaluate the retrieval component in isolation?

Use a retrieval-specific evaluation (not the end-to-end RAGAS score):

Python

from typing import NamedTuple

class RetrievalEvalCase(NamedTuple):
    query: str
    relevant_doc_ids: list[str]  # ground truth relevant documents

def evaluate_retrieval(retriever, eval_cases, k=5) -> dict:
    """Return mean Precision@k, Recall@k and MRR over eval_cases.

    `retriever(query, top_k=k)` must return ranked dicts that carry an "id" key.
    """
    if not eval_cases:
        raise ValueError("eval_cases must contain at least one case")

    precision_at_k = []
    recall_at_k = []
    mrr = []

    for case in eval_cases:
        # Truncate defensively in case the retriever returns more than k results
        retrieved_ids = [d["id"] for d in retriever(case.query, top_k=k)][:k]
        relevant_set = set(case.relevant_doc_ids)

        # Precision@k: fraction of retrieved that are relevant
        hits = [1 if doc_id in relevant_set else 0 for doc_id in retrieved_ids]
        precision_at_k.append(sum(hits) / k)

        # Recall@k: fraction of relevant that were retrieved
        # (set intersection so a duplicated ID is not counted twice)
        found = relevant_set.intersection(retrieved_ids)
        recall_at_k.append(len(found) / max(len(relevant_set), 1))

        # MRR: reciprocal rank of first relevant result
        for rank, is_hit in enumerate(hits, 1):
            if is_hit:
                mrr.append(1.0 / rank)
                break
        else:
            mrr.append(0.0)

    n = len(eval_cases)
    return {
        f"precision@{k}": sum(precision_at_k) / n,
        f"recall@{k}": sum(recall_at_k) / n,
        "mrr": sum(mrr) / n,
    }

Q: When would GraphRAG be inappropriate for a clinical system?

GraphRAG is inappropriate when:

- The corpus changes frequently (drug updates, new guidelines): graph rebuilds are expensive
- Queries are simple factual lookups: the graph overhead is not justified
- Real-time latency is required: graph traversal adds latency
- Storage budget is constrained: graph, embeddings, and community summaries together require about 3× storage
- The clinical team lacks the expertise to audit LLM-extracted relationships: errors in the knowledge graph can mislead retrieval

Better suited for: static medical reference corpora (drug interactions database, diagnostic guidelines) where multi-hop reasoning between entities is needed and the data changes infrequently.

Interview Answer Template

"Advanced RAG builds on naive RAG by addressing specific failure modes: hybrid retrieval (adding BM25 alongside dense embeddings) for exact term matching; reranking (cross-encoder) for precision; contextual compression for context window efficiency; parent document retrieval for chunk context; query rewriting for vocabulary bridging; MMR for diversity. Each adds latency and complexity, so ablation studies on your specific domain and corpus determine which are worth it. In clinical RAG, reranking and hybrid retrieval almost always justify their cost; compression depends on document length. RAGAS provides end-to-end evaluation; retrieval-specific metrics (MRR, Recall@k) help diagnose where in the pipeline failures occur."