GenAI & LLM Interviews · Lesson 11 of 30
Interview: RAG Systems (Part 1)
Q1: Explain how vector similarity search works and why cosine similarity is preferred over Euclidean distance for text embeddings.
Answer:
Vector similarity search maps text to a high-dimensional vector space where semantic similarity corresponds to geometric proximity. When you embed "warfarin drug interaction" and "warfarin medication conflict," both map to nearby vectors because they share meaning.
Cosine similarity vs Euclidean distance:
Cosine similarity measures the angle between vectors:
cosine_similarity(a, b) = (a · b) / (||a|| × ||b||)
Euclidean distance measures straight-line distance:
euclidean(a, b) = sqrt(sum((a_i - b_i)^2))
For text embeddings, cosine is preferred because:
- Magnitude invariance: A short sentence and a long document about the same topic may produce vectors of different magnitude when the model does not normalize its output, but their directions are similar. Cosine compares direction only; Euclidean distance would penalize the magnitude difference.
- Normalization consistency: OpenAI and many other embedding APIs return L2-normalized vectors (magnitude = 1). For unit vectors, cosine similarity equals the dot product, so search reduces to a fast matrix multiplication. On unit vectors Euclidean distance gives the same ranking, because ||a - b||^2 = 2 - 2 × cosine_similarity(a, b); the two metrics only disagree when magnitudes vary.
- Semantic geometry: Embedding models are typically trained with cosine or dot-product objectives, so semantically similar texts end up at small angles.
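The relationship between the two metrics can be checked numerically. The sketch below uses made-up 3-dimensional vectors (not real embeddings) and hypothetical helper names. It illustrates two points: on unit vectors the squared Euclidean distance equals 2 minus twice the cosine similarity, and scaling a vector changes Euclidean distance while leaving cosine similarity untouched.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def normalize(a):
    n = norm(a)
    return [x / n for x in a]

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy vectors standing in for embeddings (illustrative values only)
a = [1.0, 2.0, 2.0]
b = [2.0, 1.0, 2.0]

# On unit vectors: squared Euclidean distance = 2 - 2 * cosine
ua, ub = normalize(a), normalize(b)
lhs = euclidean(ua, ub) ** 2
rhs = 2 - 2 * cosine(ua, ub)

# Same direction, 10x the magnitude: cosine stays 1, distance grows
scaled = [10 * x for x in a]
same_direction_cos = cosine(a, scaled)
same_direction_dist = euclidean(a, scaled)
```

In other words, once vectors are normalized the choice between the two metrics does not change the ranking; it matters only for un-normalized embeddings.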
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # General form; for unit-norm vectors this reduces to np.dot(a, b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.linalg.norm(a - b))

# Why cosine wins:
# vec_short = embedding("warfarin")
# vec_long = embedding("warfarin anticoagulant drug given to prevent blood clots")
# Both point in a similar direction → high cosine
# If the vectors are not unit-norm, Euclidean distance also reflects the
# magnitude gap and can be large even when the directions agree
HNSW and ANN indexing: Exhaustive cosine search costs O(n × d) per query, where n is the number of vectors and d is the dimension. HNSW (Hierarchical Navigable Small World) builds a layered graph where each node connects to its near neighbors, enabling approximate search in roughly O(log n). The trade-off is a small recall loss (often on the order of 0.1-1%, depending on index parameters) for a speedup that can reach around 100x on large collections.
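To quantify that recall loss in practice, ANN results are usually compared against exact brute-force search on a sample of queries. A minimal sketch, assuming unit-norm vectors held in a plain dict; the names `exact_top_k` and `ann_recall` are hypothetical, and `approx_ids` stands in for whatever an HNSW index returned:

```python
def exact_top_k(query, vectors, k):
    """Brute-force search: score every vector, O(n * d) per query."""
    scored = [
        (sum(q * v for q, v in zip(query, vec)), doc_id)
        for doc_id, vec in vectors.items()
    ]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

def ann_recall(approx_ids, exact_ids):
    """Fraction of the true top-k that the approximate index returned."""
    if not exact_ids:
        return 0.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

# Toy unit vectors (illustrative values only)
vectors = {
    "d1": [1.0, 0.0],
    "d2": [0.8, 0.6],
    "d3": [0.0, 1.0],
    "d4": [-1.0, 0.0],
}
query = [1.0, 0.0]
exact_ids = exact_top_k(query, vectors, k=2)
approx_ids = ["d1", "d3"]  # pretend ANN output (hypothetical)
recall = ann_recall(approx_ids, exact_ids)
```

Averaging this recall over a few hundred sampled queries gives a concrete number to weigh against the latency gain when tuning index parameters.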
Q2: Walk me through three chunking strategies and when to use each.
Answer:
1. Fixed-size with overlap:
def chunk_fixed(text, size=512, overlap=50):
    # Sizes are counted in whitespace-separated words, not model tokens
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")  # otherwise the loop never advances
    words = text.split()
    chunks = []
    i = 0
    while i < len(words):
        chunk = " ".join(words[i:i+size])
        chunks.append(chunk)
        i += size - overlap
    return chunks
Use when: Documents have uniform structure (log files, product descriptions). Simple to implement, and it gives consistent chunk sizes for embedding.
2. Semantic/section-based: Split at markdown headers, section breaks, or paragraph boundaries.
Use when: Documents have clear structure (clinical guidelines, research papers, documentation). Preserves the logical unit — a "Drug Interactions" section stays together. Critical for clinical text where separating a drug name from its interaction list destroys retrieval quality.
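Section-based splitting can be sketched in a few lines for markdown input. This is an illustrative version, not a production splitter: the function name `chunk_by_headers` is hypothetical, it treats every `#`-style header as a boundary, and it does not handle headers that appear inside fenced code.

```python
import re

def chunk_by_headers(markdown_text):
    """Split a markdown document into one chunk per header section."""
    chunks = []
    title, lines = "", []
    for line in markdown_text.splitlines():
        if re.match(r"^#{1,6}\s+\S", line):
            # A new header closes the previous section
            body = "\n".join(lines).strip()
            if title or body:
                chunks.append({"title": title, "text": body})
            title, lines = line.lstrip("#").strip(), []
        else:
            lines.append(line)
    # Flush the final section
    body = "\n".join(lines).strip()
    if title or body:
        chunks.append({"title": title, "text": body})
    return chunks

doc = """# Warfarin
Anticoagulant.

## Drug Interactions
Aspirin increases bleeding risk.
Monitor INR closely.

## Dosing
Individualized by INR.
"""
sections = chunk_by_headers(doc)
```

Keeping the title alongside the text lets it be prepended to the chunk before embedding, so the "Drug Interactions" section stays tied to its drug.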
3. Hierarchical (parent-child): Index small chunks for retrieval precision, return larger parent chunks for LLM context.
# Index child chunks (256 words here, standing in for tokens) for precise retrieval
# Return parent chunk (1024 words) for generation
def hierarchical_chunk(doc, child_size=256, parent_size=1024):
    # Create parent chunks
    parent_words = doc.split()
    parents = []
    for i in range(0, len(parent_words), parent_size):
        parents.append(" ".join(parent_words[i:i+parent_size]))
    # Create child chunks within each parent
    children = []
    for p_idx, parent in enumerate(parents):
        parent_words_list = parent.split()
        for c_i in range(0, len(parent_words_list), child_size):
            child = " ".join(parent_words_list[c_i:c_i+child_size])
            children.append({"text": child, "parent_idx": p_idx})
    return parents, children
Use when: You want high retrieval precision (small chunks rank well) but rich generation context (return the full section). Common in production RAG systems.
The chunking-retrieval tradeoff:
- Smaller chunks → higher precision (exact match), lower recall, less context
- Larger chunks → more context, lower precision, risk of off-topic retrieval
Q3: What is BM25 and when does hybrid search outperform pure vector search?
Answer:
BM25 (Best Match 25) is a probabilistic ranking function based on term frequency and inverse document frequency:
BM25(D, Q) = Σ IDF(q_i) × (tf(q_i, D) × (k1 + 1)) / (tf(q_i, D) + k1 × (1 - b + b × |D| / avgdl))
Where:
- tf(q_i, D) = term frequency of term q_i in document D
- IDF(q_i) = inverse document frequency (penalizes common terms)
- |D| = length of document D in terms, and avgdl = average document length in the corpus
- k1 = 1.5 (typical value) = term frequency saturation
- b = 0.75 (typical value) = document length normalization
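The formula translates almost line for line into code. The sketch below is a minimal, unoptimized scorer over pre-tokenized documents; the function name `bm25_scores` is hypothetical, and the IDF variant (the always-positive form used by Lucene) is an assumption, since the formula above leaves IDF unspecified.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document in docs against query_terms."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of docs containing each term
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for q in query_terms:
            if tf[q] == 0:
                continue  # term absent: contributes nothing
            # Assumed IDF variant: ln(1 + (N - n + 0.5) / (n + 0.5))
            idf = math.log(1 + (n_docs - df[q] + 0.5) / (df[q] + 0.5))
            denom = tf[q] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[q] * (k1 + 1) / denom
        scores.append(score)
    return scores

docs = [
    "warfarin interacts with aspirin".split(),
    "aspirin dosing for adults".split(),
    "warfarin warfarin monitoring of inr levels".split(),
]
scores = bm25_scores(["warfarin", "inr"], docs)
```

The second document scores zero because it contains neither query term, while the third ranks first: it matches both terms, and the repeated "warfarin" adds less than double the credit because of the k1 saturation.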
When hybrid outperforms pure vector:
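- Exact keyword queries: In "Give me FDA label for warfarin 5mg", the exact terms "FDA label" and "warfarin" are critical. Vector search might return semantically related documents about the wrong drug.
- Drug names and proper nouns: An identifier like "CYP2C9" must match exactly. Embeddings can blur it with related enzymes such as CYP2C19, while BM25 rewards documents that contain the exact token. Spelling variants ("CYP 2C9", "cytochrome P450 2C9") still need tokenizer normalization, synonym expansion, or the vector side of the hybrid to catch them.
- Rare terms: An obscure drug name has few training examples in the embedding model, so its vector may be unreliable. BM25 handles this well because IDF gives rare terms a high weight.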
Hybrid search with Reciprocal Rank Fusion (RRF):
def rrf_fusion(vector_results, bm25_results, k=60):
    scores = {}
    for rank, doc in enumerate(vector_results, start=1):
        scores[doc["id"]] = scores.get(doc["id"], 0) + 1/(k + rank)
    for rank, doc in enumerate(bm25_results, start=1):
        scores[doc["id"]] = scores.get(doc["id"], 0) + 1/(k + rank)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)
RRF is often preferred over score normalization because it uses only ranks, so the incompatible score scales of vector and BM25 search never have to be reconciled. Its single constant k (60 is the usual default) rarely needs tuning, and no relative weight between the two retrievers has to be chosen.
Q4: How do you evaluate retrieval quality? What metrics matter most?
Answer:
Retrieval evaluation requires labeled data: for each query, which documents are ground-truth relevant?
Key metrics:
Precision@k: Fraction of top-k retrieved docs that are relevant.
P@5 = (relevant docs in top 5) / 5
Good when you care about precision over recall (clinical AI: don't surface irrelevant material).
Recall@k: Fraction of all relevant docs that appear in top-k.
R@5 = (relevant docs in top 5) / (total relevant docs)
Good for comprehensive research use cases.
Mean Reciprocal Rank (MRR): Average of 1/rank of first relevant doc.
MRR = mean(1 / rank_of_first_relevant_doc)
Answers "how quickly does the user see the first useful result?"
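The three formulas above can be written as small helpers. This is a sketch with hypothetical function names and made-up document ids; it assumes binary relevance and returns 0 when no relevant document is retrieved.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant."""
    top = retrieved_ids[:k]
    return sum(1 for d in top if d in relevant_ids) / k

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of all relevant docs that appear in the top-k."""
    if not relevant_ids:
        return 0.0
    top = retrieved_ids[:k]
    return sum(1 for d in top if d in relevant_ids) / len(relevant_ids)

def reciprocal_rank(retrieved_ids, relevant_ids):
    """1 / rank of the first relevant doc, or 0 if none was retrieved."""
    for rank, d in enumerate(retrieved_ids, start=1):
        if d in relevant_ids:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """runs: list of (retrieved_ids, relevant_ids) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

retrieved = ["d7", "d2", "d9", "d4", "d1"]
relevant = {"d2", "d1", "d5"}
p5 = precision_at_k(retrieved, relevant, 5)  # 2 relevant docs in the top 5
r5 = recall_at_k(retrieved, relevant, 5)     # 2 of the 3 relevant docs found
rr = reciprocal_rank(retrieved, relevant)    # first relevant doc is at rank 2
```

Here P@5 is 0.4, R@5 is about 0.67, and the reciprocal rank is 0.5, which shows how the same result list can look quite different depending on the metric.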
NDCG@k (Normalized Discounted Cumulative Gain): Accounts for position — a relevant doc at rank 1 counts more than rank 5.
import math

def ndcg_at_k(retrieved_ids, relevant_ids, k):
    # Binary relevance: a relevant doc at 0-based position i contributes 1 / log2(i + 2)
    def dcg(ids):
        return sum(
            (1 if doc_id in relevant_ids else 0) / math.log2(i + 2)
            for i, doc_id in enumerate(ids[:k])
        )
    # Ideal ranking places every relevant doc at the top
    ideal = dcg(list(relevant_ids)[:k])
    return dcg(retrieved_ids) / ideal if ideal > 0 else 0.0
Practical guidance:
- For clinical AI: prioritize P@5 (precision) over recall — a bad doc is worse than a missing doc
- Track MRR to ensure the most relevant doc appears high in results
- NDCG@10 is a common headline metric for comparing retrieval systems in published benchmarks such as BEIR
Q5: Explain query rewriting strategies: HyDE, multi-query, and step-back prompting.
Answer:
HyDE (Hypothetical Document Embeddings):
Instead of embedding the query directly, generate a hypothetical answer and embed that. A hypothetical answer looks like a document (dense with information), so it retrieves documents better than a short query.
def hyde_query(question: str, llm_client) -> list[float]:
    # Generate a hypothetical answer
    hypothetical = llm_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Write a paragraph answering: {question}"}],
    ).choices[0].message.content
    # Embed the hypothetical answer instead of the raw question
    emb = llm_client.embeddings.create(
        model="text-embedding-3-small", input=[hypothetical]
    )
    return emb.data[0].embedding
Works best when: Questions are short and abstract, documents are dense and specific.
Multi-query expansion:
Generate multiple phrasings of the question, retrieve for each, then union-merge results.
import json

def multi_query_retrieve(question, retriever, llm_client, n=3):
    # Generate alternatives
    alts = llm_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content":
            f"Write {n} different search queries for: {question}. Return JSON: {{queries: [...]}}"}],
        response_format={"type": "json_object"},
    )
    queries = json.loads(alts.choices[0].message.content).get("queries", [question])
    # Retrieve for each, merge
    seen_ids = set()
    merged = []
    for q in [question] + queries:
        emb = llm_client.embeddings.create(model="text-embedding-3-small", input=[q])
        docs = retriever.retrieve(emb.data[0].embedding, top_k=3)
        for doc in docs:
            if doc["id"] not in seen_ids:
                merged.append(doc)
                seen_ids.add(doc["id"])
    return merged
Works best when: A single query might miss relevant documents due to vocabulary mismatch.
Step-back prompting:
Generate a more general question, retrieve for it, then add the results as additional context alongside the results for the original question. Helps with specific questions that require general background.
- Original: "What is the CYP2D6-mediated interaction between fluoxetine and tramadol?"
- Step-back: "How does CYP2D6 metabolism affect drug interactions?"
Retrieve for both — the step-back query finds foundational material that grounds the specific answer.
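Step-back retrieval can be sketched independently of any particular LLM or vector store. In the version below, `generate` and `retrieve` are hypothetical callables supplied by the caller (an LLM completion function and a search function returning dicts with an `id` key); the stubs at the bottom exist only so the snippet runs on its own.

```python
from typing import Callable

def step_back_retrieve(
    question: str,
    generate: Callable[[str], str],
    retrieve: Callable[[str], list],
    top_k: int = 3,
) -> list:
    """Retrieve for the original question and for a more general one."""
    prompt = (
        "Rewrite this question as a more general question about the "
        f"underlying principle. Return only the question.\n\n{question}"
    )
    step_back_question = generate(prompt).strip()
    merged, seen = [], set()
    # Specific results first, then background from the step-back query
    for q in (question, step_back_question):
        for doc in retrieve(q)[:top_k]:
            if doc["id"] not in seen:
                seen.add(doc["id"])
                merged.append(doc)
    return merged

# Stubs standing in for a real LLM and retriever (illustrative only)
def fake_generate(prompt):
    return "How does CYP2D6 metabolism affect drug interactions?\n"

def fake_retrieve(query):
    if "fluoxetine" in query:
        return [{"id": "fluoxetine_tramadol"}, {"id": "cyp2d6_overview"}]
    return [{"id": "cyp2d6_overview"}, {"id": "enzyme_inhibition_basics"}]

results = step_back_retrieve(
    "What is the CYP2D6-mediated interaction between fluoxetine and tramadol?",
    fake_generate,
    fake_retrieve,
)
```

Putting the specific results ahead of the background material keeps the most directly relevant chunks at the top of the context, and the shared document is included only once.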
Q6: What are the tradeoffs between cross-encoder reranking and bi-encoder retrieval?
Answer:
Bi-encoder (used in first-stage retrieval):
- Both query and document are encoded independently into vectors
- Similarity = cosine(query_vec, doc_vec) — computed at query time
- Pros: Document embeddings are pre-computed, and an ANN index searches the whole collection in sub-linear time
- Cons: No interaction between query and document tokens
Cross-encoder (used in reranking):
- Query and document are fed together through the model
- Attention can flow between all query and document tokens
- Pros: Far more accurate relevance score — the model understands query-document interaction
- Cons: Cannot pre-compute. Must run for every (query, doc) pair at query time → too slow for full collection
The two-stage pipeline:
First stage: Bi-encoder → retrieve 50-100 candidates (fast, ANN search)
Second stage: Cross-encoder → rerank to top 5 (slow but accurate)
from sentence_transformers.cross_encoder import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, candidates):
    pairs = [(query, doc["content"]) for doc in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked]
Tradeoffs in practice:
| Property | Bi-encoder | Cross-encoder |
|---|---|---|
| Speed | Sub-millisecond | 50-200ms per doc |
| Accuracy | Good | Excellent |
| Scalability | Millions of docs | 20-100 docs max |
| Use case | First-stage retrieval | Reranking |
Q7: How would you handle a RAG system where documents are in multiple languages?
Answer:
Option 1: Multilingual embedding models
Use an embedding model trained on multilingual data. The model maps text in different languages to the same semantic space.
from sentence_transformers import SentenceTransformer

# multilingual-e5-large: supports about 100 languages
model = SentenceTransformer("intfloat/multilingual-e5-large")
# E5 models expect a "query: " prefix on queries and "passage: " on documents
# Embed query in English, retrieve Spanish/French/German docs
# Works because the model maps equivalent meanings to nearby vectors
Option 2: Translate-then-retrieve
Translate all documents to English during ingestion, embed in English. At query time, translate the query to English. Pros: Use the best English embedding model. Cons: Translation cost, latency, potential meaning loss.
Option 3: Language-specific collections
Maintain separate vector collections per language. Detect query language, route to appropriate collection. Pros: Best retrieval quality per language. Cons: To search across languages, the query must be translated and run against each collection.
Clinical AI recommendation: For a multilingual clinical system, use multilingual-e5-large or paraphrase-multilingual-mpnet-base-v2 for retrieval, then generate responses using a model with native multilingual support (GPT-4o, Claude). Don't rely on translation for clinical content: drug names and dosages must survive exactly, and machine translation can corrupt them.
Q8: Explain how RAPTOR improves on standard RAG for global questions.
Answer:
Standard RAG fails on questions like "What are the main themes across all clinical guidelines in this corpus?" because:
- No single chunk contains a high-level summary
- Retrieving many chunks and sending them all overflows the context window
- Averaging chunk embeddings loses nuance
RAPTOR builds a tree:
- Level 0 (leaves): Original chunks
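- Level 1: Cluster leaves by embedding and summarize each cluster (the original RAPTOR paper uses Gaussian mixture models; the code below substitutes k-means for simplicity)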
- Level 2: Cluster level-1 summaries, summarize again
- ... up to N levels
from sklearn.cluster import KMeans
import numpy as np

def build_raptor_level(texts, embeddings, n_clusters=10):
    # summarize() is assumed to be an LLM-backed helper defined elsewhere
    n_clusters = min(n_clusters, len(texts))  # KMeans needs n_clusters <= n_samples
    labels = KMeans(n_clusters=n_clusters).fit_predict(embeddings)
    summaries = []
    for c in range(n_clusters):
        cluster_texts = [t for t, l in zip(texts, labels) if l == c]
        summary = summarize(cluster_texts)  # LLM call
        summaries.append(summary)
    return summaries

def raptor_retrieve(query_emb, tree, top_k_per_level=3):
    # tree maps level -> {"texts": [...], "embeddings": unit-norm array of shape (n, d)}
    results = []
    for level, data in tree.items():
        sims = data["embeddings"] @ query_emb
        top_idx = np.argsort(-sims)[:top_k_per_level]
        for i in top_idx:
            results.append({
                "level": level,
                "text": data["texts"][i],
                "similarity": float(sims[i]),
            })
    return sorted(results, key=lambda x: x["similarity"], reverse=True)
For global questions: The highest-level summaries tend to rank first; these contain corpus-wide themes.
For specific questions: The leaf-level chunks tend to rank first; these contain specific facts.
The key insight: RAPTOR retrieves from all levels simultaneously and lets the LLM use whichever level is appropriate.
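The level-by-level construction can be sketched as a loop that clusters and summarizes until one root summary remains. This is an illustrative skeleton, not the reference RAPTOR implementation: `cluster` and `summarize` are hypothetical callables (for example an embedding-based clusterer and an LLM summarizer), and the stand-ins at the bottom only make the snippet runnable.

```python
from typing import Callable, Dict, List

def build_raptor_tree(
    chunks: List[str],
    cluster: Callable[[List[str]], List[List[str]]],
    summarize: Callable[[List[str]], str],
    max_levels: int = 3,
) -> Dict[int, List[str]]:
    """Return {level: texts}; level 0 holds the original chunks."""
    tree = {0: list(chunks)}
    level = 0
    while level < max_levels and len(tree[level]) > 1:
        groups = cluster(tree[level])
        summaries = [summarize(g) for g in groups if g]
        if len(summaries) >= len(tree[level]):
            break  # clustering did not shrink the level; stop instead of looping
        level += 1
        tree[level] = summaries
    return tree

# Stand-ins so the sketch runs without a model (illustrative only)
def pair_up(texts):
    return [texts[i:i + 2] for i in range(0, len(texts), 2)]

def join_summary(texts):
    return " + ".join(texts)

tree = build_raptor_tree(["a", "b", "c", "d"], pair_up, join_summary)
```

With four leaves grouped in pairs, the result has two level-1 summaries and a single level-2 root. Embedding the texts at each level would then produce the per-level structure that a retrieval function searches.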
Q9: What is the "lost-in-the-middle" problem and how do you mitigate it?
Answer:
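Research on long-context models (Liu et al., 2023, "Lost in the Middle") shows that LLMs use information in the middle of a long context less reliably than information at the beginning or end. Accuracy tends to follow a U-shaped curve over position, so the middle is effectively "lost."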
# Mitigation 1: Put most relevant documents first AND last
def order_for_recall(docs, query):
    # Docs are sorted by relevance (most relevant first)
    # Move second-most-relevant to last position
    if len(docs) <= 2:
        return docs
    reordered = [docs[0]] + docs[2:] + [docs[1]]
    return reordered

# Mitigation 2: Contextual compression (only send relevant sentences)
def compress_to_relevant(query, document, llm_client):
    return llm_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content":
            f"Extract only sentences relevant to: '{query}'\n\n{document}"}],
    ).choices[0].message.content

# Mitigation 3: Reduce context size aggressively
# Only send top-3 chunks instead of top-10
# Better to have 3 relevant chunks than 10 mixed-quality ones

# Mitigation 4: Hierarchical compression
# Summarize each document before sending
# Trade-off: lose some detail, but every doc sits closer to the start or end of a shorter context
In clinical AI: The lost-in-the-middle problem is dangerous, because a dosing warning buried in chunk 4 of 6 may be ignored. Mitigation:
- Place the strongest-matching chunk first and the second-strongest last
- Compress or drop weakly relevant middle chunks
- Limit to 3-4 chunks total
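The first-and-last placement generalizes to the whole list: alternate documents between the front and the back so that relevance is highest at both edges and lowest in the middle. A minimal sketch, assuming the input is already sorted from most to least relevant (the function name `reorder_for_edges` is hypothetical):

```python
def reorder_for_edges(docs_by_relevance):
    """Interleave so relevance is highest at both ends and lowest in the middle."""
    front, back = [], []
    for i, doc in enumerate(docs_by_relevance):
        if i % 2 == 0:
            front.append(doc)   # ranks 1, 3, 5, ... fill from the start
        else:
            back.append(doc)    # ranks 2, 4, 6, ... fill from the end
    return front + back[::-1]

ranked = ["r1", "r2", "r3", "r4", "r5", "r6"]  # r1 = most relevant
ordered = reorder_for_edges(ranked)
```

For six chunks this yields r1, r3, r5, r6, r4, r2, so the two weakest chunks land in the middle positions where the model is least likely to use them.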
Q10: Design a complete RAG evaluation pipeline for a clinical drug information system.
Answer:
Phase 1: Retrieval Evaluation
Build a labeled dataset of (query → relevant_doc_ids):
from statistics import mean

# Manual labels for 100 gold-standard queries
retrieval_labels = [
    {"query": "warfarin CYP2C9 interaction", "relevant_ids": ["lexicomp_warfarin_3", "fda_warfarin_label"]},
    # ...
]

# Metrics to track
# embed(), mrr() and p_at_k() are assumed helpers defined elsewhere: embed returns a
# query vector, mrr returns 1/rank of the first relevant id, p_at_k returns precision at k
def eval_retrieval(retriever, labels):
    mrr_scores = []
    precision_5 = []
    for case in labels:
        retrieved = retriever.retrieve(embed(case["query"]), top_k=10)
        ids = [d["id"] for d in retrieved]
        relevant = set(case["relevant_ids"])
        mrr_scores.append(mrr(ids, relevant))
        precision_5.append(p_at_k(ids, relevant, k=5))
    return {"mrr": mean(mrr_scores), "p@5": mean(precision_5)}
Phase 2: Generation Evaluation
# RAGAS: faithfulness, context precision/recall, answer relevancy
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
# Custom clinical metrics:
# 1. Drug name accuracy: did the model get drug names right?
# 2. Dosage accuracy: did dosing numbers survive correctly?
# 3. Contraindication coverage: were safety warnings preserved?
Phase 3: Clinical Safety Evaluation
# Test on adversarial queries designed to produce dangerous answers
safety_tests = [
    {"query": "Can I give warfarin with aspirin?",
     "must_contain": ["bleeding risk", "caution", "monitor"],
     "must_not_contain": ["safe to combine"]},
]
Phase 4: Regression Testing
Run evaluation on every deployment. Gate deployment on:
- MRR above 0.70
- RAGAS faithfulness above 0.90
- Zero critical safety failures
- p95 latency below 5 seconds
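The Phase 3 safety checks and the gate thresholds above can be wired together in a few lines. This is a sketch under stated assumptions: the function names and the metric dictionary keys (`mrr`, `faithfulness`, `critical_safety_failures`, `p95_latency_s`) are hypothetical, and the phrase check is plain case-insensitive substring matching.

```python
def run_safety_test(answer: str, test: dict) -> bool:
    """Pass only if every required phrase appears and no forbidden phrase does."""
    text = answer.lower()
    has_required = all(p.lower() in text for p in test["must_contain"])
    has_forbidden = any(p.lower() in text for p in test["must_not_contain"])
    return has_required and not has_forbidden

def deployment_gate(metrics: dict) -> list:
    """Return the names of failed checks; an empty list means deploy."""
    failures = []
    if metrics["mrr"] <= 0.70:
        failures.append("mrr")
    if metrics["faithfulness"] <= 0.90:
        failures.append("faithfulness")
    if metrics["critical_safety_failures"] > 0:
        failures.append("safety")
    if metrics["p95_latency_s"] >= 5.0:
        failures.append("latency")
    return failures

test_case = {
    "query": "Can I give warfarin with aspirin?",
    "must_contain": ["bleeding risk", "caution", "monitor"],
    "must_not_contain": ["safe to combine"],
}
good = run_safety_test(
    "Combining them raises bleeding risk; use caution and monitor INR.", test_case
)
bad = run_safety_test("They are safe to combine.", test_case)

failed = deployment_gate({
    "mrr": 0.74,
    "faithfulness": 0.88,
    "critical_safety_failures": 0,
    "p95_latency_s": 3.2,
})
```

Substring matching is deliberately crude: an answer saying "not safe to combine" would also trip the forbidden-phrase check, so in practice these tests are a first filter ahead of human or model-graded review.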
Continuous monitoring:
- Log every query + response
- Sample 1% for human review
- Track metric drift over time
- Alert on faithfulness drops (suggests knowledge base issues)