Learnixo

RAG Systems · Lesson 24 of 24

Interview: Design a Production RAG System

RAG in Production — Senior Interview Q&A

These questions go beyond "what is RAG" into the decisions that determine whether a RAG system works reliably in production. They are common in senior AI engineer and ML engineer interviews.


Q1. Design a production RAG system for a clinical guideline assistant.

Structure it in three layers. First, ingestion: guidelines arrive as PDFs, parsed by pdfplumber, converted to markdown, chunked recursively on section headers (512 tokens, 64 overlap), embedded with a biomedical embedding model, and stored in a vector store with metadata covering guideline_id, version, publish_date, and topic_tags. Second, retrieval: hybrid search (dense vector plus BM25 for exact drug names and codes), fused with Reciprocal Rank Fusion, reranked with a cross-encoder (top-5 from initial top-20), similarity threshold at 0.65. Third, generation: a capable model with a hardened system prompt ("answer ONLY from the provided context, cite the source section"), temperature 0, max_tokens 800. Monitoring: log every query, retrieved chunks, and answer; compute faithfulness nightly on a sample; alert if faithfulness drops below 0.90 or retrieval max-similarity drops below 0.60 — the latter signals a knowledge gap. Security: no patient data in this system — pure clinical knowledge, no PHI.


Q2. How do you handle document updates? A clinical guideline is revised monthly.

Differential re-indexing: each document gets a content hash at index time. On update, compare the new document hash to the stored hash. If changed, soft-delete the old chunks by marking them superseded=True with a timestamp rather than hard-deleting — the audit trail is critical in clinical systems. Re-embed and re-index only the changed document. New queries filter for superseded=False. Monthly, run a freshness audit: any document past its declared expiry date triggers an alert to the knowledge base maintainer. For high-update environments such as drug formularies updated daily, use a change-data-capture pipeline where the source system webhooks the indexing pipeline on every document change, making the knowledge base near real-time rather than batch.


Q3. Retrieval is fast but the LLM response is too slow. How do you optimise?

Latency budget breakdown: embedding the query takes roughly 20ms, HNSW search takes roughly 5ms, and LLM generation takes 1 to 3 seconds — the clear bottleneck. Strategies in priority order: First, enable streaming — return partial responses as they generate; users perceive much lower latency. Second, reduce context: contextual compression extracts only the relevant sentences from each chunk before sending to the LLM, shrinking the prompt by 50 to 70% and cutting generation time proportionally. Third, use a smaller model for triage: classify whether the query needs the full knowledge base at all — simple greetings and clearly off-topic queries skip retrieval entirely. Fourth, cache: hash the query plus top-3 chunk IDs and cache responses in Redis with a 1-hour TTL — repeated questions hit cache. Fifth, if sub-200ms is truly required, consider a fine-tuned smaller model for the most common query types.


Q4. How do you prevent the LLM from hallucinating beyond the retrieved context?

Defence in depth. First, the system prompt: "ONLY use information from the CONTEXT section below. If the answer is not in the CONTEXT, say exactly: The provided guidelines do not contain this information." Second, temperature 0 reduces creative generation. Third, output validation: a fast LLM judge checks whether each claim in the answer is present in the retrieved context and flags absent claims. Fourth, faithfulness in CI — any pipeline change that drops faithfulness below 0.90 is blocked from deployment. Fifth, user-visible citations: every answer must cite which guideline section it draws from, making faithfulness violations visible to the clinician who can then verify against the source. No single layer is sufficient — all five are needed for clinical-grade reliability.


Q5. A junior developer asks: "Why not just put all the guidelines in the context window instead of using RAG?"

Two problems. First, cost: a typical hospital guideline set is 10,000 or more pages. At roughly 4 characters per token, that is 2.5 million tokens. At current frontier model pricing, every query costs around $7.50 just for input tokens — completely impractical at scale. Second, quality: LLMs suffer from the lost-in-the-middle problem — they attend much less to information in the middle of very long contexts, so even if you could fit everything, retrieval quality degrades. RAG retrieves the 5 most relevant chunks (roughly 2,000 tokens) and the model attends to all of them effectively. The full-context approach is valid for small document sets under roughly 50 pages, a technique sometimes called full-context RAG, but for production knowledge bases with hundreds of guidelines, vector retrieval is essential.


Q6. What do you monitor in production to catch RAG failures?

Five key signals. One: max retrieval similarity — the cosine similarity of the best retrieved chunk. If below 0.60, log it as a knowledge gap candidate. Two: "I don't have information" response rate — if this rises above 5%, the knowledge base has gaps. Three: faithfulness score, computed on a sampled 1 to 5% of queries asynchronously by the evaluation pipeline. Four: user feedback — explicit thumbs-down and implicit signals such as query rephrasing and follow-up questions that suggest the first answer was wrong. Five: latency percentiles (P50, P95, P99) — spikes indicate vector store or LLM issues. Alerts: faithfulness below 0.90, knowledge gap rate above 10%, P95 latency above 5 seconds. Review the knowledge gap log weekly to identify which topics need more documents indexed.


Q7. How do you handle multi-turn conversations in a RAG system?

Two problems: conversational reference resolution and context accumulation. Resolution: if the user says "What about the dosing for elderly patients?" on turn 3, "elderly patients" refers to the drug discussed in turn 1. Use query rewriting: the LLM receives the last 3 turns of conversation and rewrites the current query to be self-contained before retrieval. Context accumulation: do not re-retrieve on every turn. Maintain a sliding window of retrieved chunks across the conversation, adding new retrieved chunks only when the query shifts topic — detected by a significant drop in embedding similarity versus the current context embedding. The prompt includes both the accumulated context and the current query. This keeps the context window predictable and avoids re-embedding the same chunks repeatedly.


Q8. How do you evaluate a RAG system before deploying a change to production?

Build a labelled eval dataset of 50 to 100 question, context, ground-truth answer triples. Measure three metrics: context recall (did retrieval find the right chunk?), faithfulness (does the answer stay within the retrieved context?), and answer correctness (semantic similarity of the answer to ground truth). Run this as automated tests in CI — fail the pipeline if faithfulness drops below 0.90 or context recall drops below 0.75. Track the metrics over time to detect drift. For a clinical system, also run a domain expert review on a sample of 20 answers before any major system change. Automated metrics catch quantitative regressions; human review catches nuanced failures like technically correct but clinically misleading answers.


Q9. What is Reciprocal Rank Fusion and why is it used in hybrid RAG?

Reciprocal Rank Fusion (RRF) merges ranked result lists from multiple retrieval methods (dense vector search and BM25 keyword search) into a single ranked list. Each document gets a score based on its rank in each individual list: 1 / (k + rank) where k is typically 60. The scores from each retrieval method are summed to produce the final ranking. RRF is used because the scores from dense and sparse retrieval are not directly comparable — dense search returns cosine similarities, BM25 returns TF-IDF scores. RRF treats both as relative orderings rather than absolute scores, which makes them directly combinable. In practice, hybrid search with RRF outperforms either method alone, especially for queries containing rare proper nouns (where BM25 excels) mixed with conceptual questions (where dense search excels).


Q10. How do you handle a query that has no answer in the knowledge base?

At retrieval time, check the similarity score of the best retrieved chunk against a threshold. If the top chunk has similarity below 0.65, skip the LLM call entirely and return a predefined response: "I don't have information about this in the available documents — please consult the source directly." This has two benefits: it avoids a hallucinated answer and it saves the cost of an LLM call. For queries that pass the threshold but the LLM still cannot find a relevant answer in the provided context, the system prompt instructs the model to say so explicitly rather than generating from training knowledge. Track both types of "I don't know" separately in monitoring: threshold-blocked queries are knowledge gaps; LLM-acknowledged-no-answer queries may indicate chunking or retrieval issues where the right content exists but was not retrieved.


Q11. What is contextual compression and how does it improve RAG?

Contextual compression is the step between retrieval and generation where retrieved chunks are filtered and shortened to contain only the sentences relevant to the specific query. Instead of passing 5 full chunks (potentially 2,500 tokens) to the LLM, a compression model extracts only the relevant sentences from each chunk, reducing context to maybe 500 tokens. This improves generation quality because the model sees only the relevant content rather than surrounding noise. It also reduces cost and latency proportionally to the compression ratio. The tradeoff is that compression adds latency (a second LLM call) and can occasionally discard relevant content. It works best when chunks are large and heterogeneous. For small, precise chunks, the benefit diminishes.


Q12. How would you build a RAG system that cites its sources?

At retrieval time, store the source metadata — document title, section heading, page number, URL — with each retrieved chunk. When passing chunks to the LLM, include the source identifier in the context: "The following is from Section 4.2 of the Warfarin Dosing Guidelines (2025):" followed by the chunk text. In the system prompt, instruct the model to include the source in the answer: "Always cite the section name and document title where you found each piece of information." After generation, parse the citations from the response and validate that each cited source is actually in the retrieved context — if the model invents a citation, that is a hallucination. Expose the citations in the UI so users can click through to the original document. This creates accountability and allows users to verify answers independently.


Interview Answer Summary

Production RAG architecture: hybrid retrieval (dense plus BM25 fused with RRF) into a cross-encoder reranker into a similarity threshold filter into a hardened generation prompt into an output faithfulness check. Reliability: differential re-indexing with soft-delete on updates, freshness metadata with expiry alerts. Performance: streaming plus contextual compression plus response caching. Monitoring: retrieval similarity as a knowledge gap signal, faithfulness sampling, user feedback, latency percentiles. Clinical safety: faithfulness above 0.90 as a hard deployment gate; citations for every answer; patient data isolation with metadata filters at the vector store level. Evaluation: automated context recall and faithfulness in CI, human expert review before major changes.