GenAI & LLM Interviews · Lesson 11 of 30

Interview: RAG Systems (Part 1)

Q1: Explain how vector similarity search works and why cosine similarity is preferred over Euclidean distance for text embeddings.

Answer:

Vector similarity search maps text to a high-dimensional vector space where semantic similarity corresponds to geometric proximity. When you embed "warfarin drug interaction" and "warfarin medication conflict," both map to nearby vectors because they share meaning.

Cosine similarity vs Euclidean distance:

Cosine similarity measures the angle between vectors:

cosine_similarity(a, b) = (a · b) / (||a|| × ||b||)

Euclidean distance measures geometric distance:

euclidean(a, b) = sqrt(sum((a_i - b_i)^2))

For text embeddings, cosine is preferred because:

Magnitude invariance: A short sentence and a long document about the same topic may produce vectors of different magnitude when the model does not normalize its output, yet their directions are similar. Cosine compares direction only; Euclidean distance also penalizes the difference in magnitude.
Normalization consistency: OpenAI and many other embedding APIs return L2-normalized vectors (magnitude = 1). For unit vectors, cosine similarity equals the dot product, so scoring a query against the whole index reduces to one fast matrix multiplication. Euclidean distance then yields the same ranking, because ||a - b||^2 = 2 - 2 × cosine_similarity(a, b).
Semantic geometry: Many embedding models are trained with a cosine or dot-product objective, so semantically similar texts end up at small angles, which is exactly the quantity cosine measures.

Python

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # General form; for unit-norm vectors this reduces to np.dot(a, b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.linalg.norm(a - b))

# Why cosine is the safer default:
# vec_short = embedding("warfarin")
# vec_long  = embedding("warfarin anticoagulant drug given to prevent blood clots")
# Both point in a similar direction → high cosine, whatever their magnitudes
# If the vectors are not unit-norm, Euclidean distance also grows with the
# magnitude gap and can be large even when the directions agree
# If they are unit-norm, the two metrics produce the same ranking
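
The relationship between the two metrics can be checked numerically. The sketch below uses plain Python and made-up toy vectors (not real embeddings): with unnormalized vectors the two metrics disagree about the nearest neighbor, and after L2 normalization the identity ||a - b||^2 = 2 - 2·cos(a, b) holds.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine(a, b):
    """General cosine similarity; does not assume unit-norm inputs."""
    return dot(a, b) / (norm(a) * norm(b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(a):
    n = norm(a)
    return [x / n for x in a]

# Toy vectors: b points the same way as a but is 10x longer.
a = [1.0, 2.0, 3.0]
b = [10.0, 20.0, 30.0]
c = [3.0, 2.0, 1.0]   # different direction, same length as a

# Unnormalized: Euclidean picks c as the neighbor of a, cosine picks b.
closer_by_euclid = "b" if euclidean(a, b) < euclidean(a, c) else "c"
closer_by_cosine = "b" if cosine(a, b) > cosine(a, c) else "c"

# After L2 normalization, squared distance is a function of cosine alone.
ua, uc = normalize(a), normalize(c)
lhs = euclidean(ua, uc) ** 2
rhs = 2 - 2 * cosine(ua, uc)
```

In an interview, the takeaway is that "cosine vs Euclidean" only matters when magnitudes differ; on normalized embeddings the choice affects speed and convention, not ranking.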

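HNSW and ANN indexing: Exhaustive cosine search costs O(n × d) per query, where n is the number of vectors and d is the dimension. HNSW (Hierarchical Navigable Small World) builds a layered graph in which each node links to nearby neighbors, so a query can be answered approximately in roughly O(log n) graph hops. The trade-off is a small loss of recall for a large speedup; figures such as ~0.1-1% recall loss for a ~100x speedup are ballparks, and the real numbers depend on the dataset and on index parameters such as M and ef.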
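
To make the O(n × d) baseline concrete, here is a minimal brute-force search in NumPy. The function name `exact_top_k` and the synthetic data are illustrative assumptions; in practice an exact search like this is mostly useful as ground truth when measuring the recall of an ANN index.

```python
import numpy as np

def exact_top_k(query: np.ndarray, doc_matrix: np.ndarray, k: int = 5) -> list[tuple[int, float]]:
    """Brute-force cosine search: one dot product per document, O(n * d)."""
    # Normalize so that the dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    docs = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = docs @ q                      # shape (n,)
    k = min(k, len(scores))
    top = np.argsort(-scores)[:k]          # indices of the k highest scores
    return [(int(i), float(scores[i])) for i in top]

# Synthetic data: 1000 random "documents" in 64 dimensions.
rng = np.random.default_rng(0)
doc_matrix = rng.normal(size=(1000, 64))
query = doc_matrix[42] + 0.01 * rng.normal(size=64)   # near-duplicate of doc 42
results = exact_top_k(query, doc_matrix, k=3)
```

Recall of an approximate index can then be estimated as the overlap between its top-k ids and the ids returned here.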

Q2: Walk me through three chunking strategies and when to use each.

Answer:

1. Fixed-size with overlap:

Python

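def chunk_fixed(text, size=512, overlap=50):
    # size and overlap are counted in whitespace-separated words, not tokens
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    chunks = []
    i = 0
    while i < len(words):
        chunks.append(" ".join(words[i:i+size]))
        if i + size >= len(words):
            break  # last window reached; avoid a tail chunk made only of overlap
        i += size - overlap
    return chunks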

Use when: Documents have uniform structure (log files, product descriptions). Simple to implement. Works well when you need consistent chunk sizes for embedding.

2. Semantic/section-based: Split at markdown headers, section breaks, or paragraph boundaries.

Use when: Documents have clear structure (clinical guidelines, research papers, documentation). Preserves the logical unit — a "Drug Interactions" section stays together. Critical for clinical text where separating a drug name from its interaction list destroys retrieval quality.
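
A minimal sketch of section-based splitting, assuming the documents are markdown and that `#`-style headers mark section boundaries. The function name `chunk_by_headers` and the sample text are illustrative; a production version would also need to ignore `#` lines inside code fences and cap the size of very long sections.

```python
import re

def chunk_by_headers(markdown_text: str) -> list[dict]:
    """Split a markdown document into one chunk per header section."""
    chunks = []
    title = "preamble"       # label for text before the first header, if any
    body_lines = []
    for line in markdown_text.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            # Close the previous section before starting a new one.
            if any(l.strip() for l in body_lines):
                chunks.append({"title": title, "text": "\n".join(body_lines).strip()})
            title = match.group(2).strip()
            body_lines = []
        else:
            body_lines.append(line)
    if any(l.strip() for l in body_lines):
        chunks.append({"title": title, "text": "\n".join(body_lines).strip()})
    return chunks

sample = """# Warfarin
General notes.

## Drug Interactions
Aspirin: bleeding risk, monitor.
Placeholder interaction entry.

## Dosing
Placeholder dosing text.
"""
sections = chunk_by_headers(sample)
```

Keeping the section title alongside the text lets you prepend it to the chunk before embedding, so a list of interactions still carries the drug name it belongs to.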

3. Hierarchical (parent-child): Index small chunks for retrieval precision, return larger parent chunks for LLM context.

Python

# Index child chunks (256 words here; use a tokenizer for true token counts) for precise retrieval
# Return parent chunk (1024 words) for generation

def hierarchical_chunk(doc, child_size=256, parent_size=1024):
    # Create parent chunks
    parent_words = doc.split()
    parents = []
    for i in range(0, len(parent_words), parent_size):
        parents.append(" ".join(parent_words[i:i+parent_size]))

    # Create child chunks within each parent
    children = []
    for p_idx, parent in enumerate(parents):
        parent_words_list = parent.split()
        for c_i in range(0, len(parent_words_list), child_size):
            child = " ".join(parent_words_list[c_i:c_i+child_size])
            children.append({"text": child, "parent_idx": p_idx})

    return parents, children

Use when: You want high retrieval precision (small chunks rank well) but rich generation context (return the full section). Common in production RAG systems.

The chunking-retrieval tradeoff:

Smaller chunks → higher precision (exact match), lower recall, less context
Larger chunks → more context, lower precision, risk of off-topic retrieval

Q3: What is BM25 and when does hybrid search outperform pure vector search?

Answer:

BM25 (Best Match 25) is a probabilistic ranking function based on term frequency and inverse document frequency:

BM25(D, Q) = Σ IDF(q_i) × (tf(q_i, D) × (k1+1)) / (tf(q_i, D) + k1 × (1 - b + b × |D|/avgdl))

Where:

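tf(q_i, D) = term frequency of term q_i in document D
IDF(q_i) = inverse document frequency (down-weights common terms)
|D| = length of document D in terms; avgdl = average document length in the collection
k1 = term frequency saturation (typically 1.2 to 2.0; 1.5 is a common choice)
b = document length normalization (0 = none, 1 = full; 0.75 is the usual default)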
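
The formula can be turned into a small scorer. This is a sketch under stated assumptions: whitespace tokenization, and an IDF of the form ln((N - n + 0.5) / (n + 0.5) + 1), which is one common non-negative variant; the source formula does not specify which IDF is meant. The toy documents are illustrative.

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with the BM25 formula."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for t in tokenized for term in set(t))

    def idf(term: str) -> float:
        # Assumed IDF variant; stays non-negative even for very common terms.
        n = df.get(term, 0)
        return math.log((n_docs - n + 0.5) / (n + 0.5) + 1)

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        length_norm = 1 - b + b * len(tokens) / avgdl
        score = 0.0
        for term in query.lower().split():
            f = tf.get(term, 0)
            score += idf(term) * (f * (k1 + 1)) / (f + k1 * length_norm)
        scores.append(score)
    return scores

docs = [
    "warfarin interacts with aspirin",
    "aspirin is a common analgesic",
    "cyp2c9 metabolizes warfarin",
]
scores = bm25_scores("cyp2c9 warfarin", docs)
```

The third document matches both query terms, including the rare one, so it scores highest; the second matches neither and scores zero.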

When hybrid outperforms pure vector:

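Exact keyword queries: "Give me FDA label for warfarin 5mg": the exact terms "FDA label" and "warfarin" are critical. Vector search might return semantically related documents about the wrong drug.
Drug names and identifiers: BM25 matches the exact token "CYP2C9", while a dense model may rank near-identical identifiers such as "CYP2C19" just as highly. The reverse also holds: if a document spells it "CYP 2C9" or "cytochrome P450 2C9," the exact match fails and the vector side is what finds it, which is why combining the two beats either alone.
Rare terms: An obscure drug name appears in few training examples for the embedding model, so its vector may be unreliable. BM25 handles this well because IDF gives rare terms a high weight.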

Hybrid search with Reciprocal Rank Fusion (RRF):

Python

def rrf_fusion(vector_results, bm25_results, k=60):
    scores = {}
    for rank, doc in enumerate(vector_results, start=1):
        scores[doc["id"]] = scores.get(doc["id"], 0) + 1/(k + rank)
    for rank, doc in enumerate(bm25_results, start=1):
        scores[doc["id"]] = scores.get(doc["id"], 0) + 1/(k + rank)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

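RRF is often preferred over score normalization because it uses only ranks, so the incomparable score scales of vector similarity and BM25 never have to be reconciled. Its single constant k (60 is the common default) rarely needs tuning, and there is no relative weight between the two retrievers to calibrate.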

Q4: How do you evaluate retrieval quality? What metrics matter most?

Answer:

Retrieval evaluation requires labeled data: for each query, which documents are ground-truth relevant?

Key metrics:

Precision@k: Fraction of top-k retrieved docs that are relevant.

P@5 = (relevant docs in top 5) / 5

Use it when precision matters more than recall, as in clinical AI, where surfacing irrelevant material is costly.

Recall@k: Fraction of all relevant docs that appear in top-k.

R@5 = (relevant docs in top 5) / (total relevant docs)

Good for comprehensive research use cases.

Mean Reciprocal Rank (MRR): Average of 1/rank of first relevant doc.

MRR = mean(1 / rank_of_first_relevant_doc)

Answers "how quickly does the user see the first useful result?"
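
The three metrics so far fit in a few lines of Python. This is a sketch with illustrative function names and made-up document ids; it assumes binary relevance and that `retrieved_ids` is already ordered best-first.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant."""
    top = retrieved_ids[:k]
    return sum(1 for d in top if d in relevant_ids) / k

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of all relevant docs that appear in the top-k."""
    if not relevant_ids:
        return 0.0
    top = retrieved_ids[:k]
    return sum(1 for d in top if d in relevant_ids) / len(relevant_ids)

def reciprocal_rank(retrieved_ids, relevant_ids):
    """1 / rank of the first relevant doc, or 0.0 if none was retrieved."""
    for rank, d in enumerate(retrieved_ids, start=1):
        if d in relevant_ids:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """runs: list of (retrieved_ids, relevant_ids) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

retrieved = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d1", "d4", "d8"}
p5 = precision_at_k(retrieved, relevant, 5)   # 2 relevant docs out of 5 returned
r5 = recall_at_k(retrieved, relevant, 5)      # 2 of the 3 relevant docs found
rr = reciprocal_rank(retrieved, relevant)     # first relevant doc is at rank 3
```

Note that precision divides by k even when fewer than k documents are returned, which penalizes short result lists; some implementations divide by the number actually returned instead.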

NDCG@k (Normalized Discounted Cumulative Gain): Accounts for position — a relevant doc at rank 1 counts more than rank 5.

Python

import math

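def ndcg_at_k(retrieved_ids, relevant_ids, k):
    # Binary relevance: each relevant doc contributes 1 / log2(rank + 1)
    relevant = set(relevant_ids)
    def dcg(ids):
        return sum(
            (1 if doc_id in relevant else 0) / math.log2(i + 2)
            for i, doc_id in enumerate(ids[:k])
        )
    # Ideal ranking: all relevant docs packed at the top
    ideal = dcg(list(relevant)[:k])
    return dcg(retrieved_ids) / ideal if ideal > 0 else 0.0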

Practical guidance:

For clinical AI: prioritize P@5 (precision) over recall — a bad doc is worse than a missing doc
Track MRR to ensure the most relevant doc appears high in results
NDCG@10 is a common headline metric for comparing retrieval systems in published benchmarks such as BEIR

Q5: Explain query rewriting strategies: HyDE, multi-query, and step-back prompting.

Answer:

HyDE (Hypothetical Document Embeddings):

Instead of embedding the query directly, generate a hypothetical answer and embed that. A hypothetical answer looks like a document (dense with information), so it retrieves documents better than a short query.

Python

def hyde_query(question: str, llm_client) -> list[float]:
    # Generate a hypothetical answer
    hypothetical = llm_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Write a paragraph answering: {question}"}],
    ).choices[0].message.content

    # Embed the hypothetical answer instead of the raw question
    emb = llm_client.embeddings.create(
        model="text-embedding-3-small", input=[hypothetical]
    )
    return emb.data[0].embedding

Works best when: Questions are short and abstract, documents are dense and specific.

Multi-query expansion:

Generate multiple phrasings of the question, retrieve for each, then union-merge results.

Python

import json

def multi_query_retrieve(question, retriever, llm_client, n=3):
    # Generate alternative phrasings as a JSON object
    alts = llm_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content":
            f"Write {n} different search queries for: {question}. "
            'Return JSON: {"queries": ["...", "..."]}'}],
        response_format={"type": "json_object"},
    )
    queries = json.loads(alts.choices[0].message.content).get("queries", [])

    # Retrieve for the original question and each alternative, merging by document id
    seen_ids = set()
    merged = []
    for q in [question] + queries:
        emb = llm_client.embeddings.create(model="text-embedding-3-small", input=[q])
        docs = retriever.retrieve(emb.data[0].embedding, top_k=3)
        for doc in docs:
            if doc["id"] not in seen_ids:
                merged.append(doc)
                seen_ids.add(doc["id"])
    return merged

Works best when: A single query might miss relevant documents due to vocabulary mismatch.

Step-back prompting:

Generate a more general question, retrieve for it, then add as additional context. Helps with specific questions that require general background.

Original: "What is the CYP2D6-mediated interaction between fluoxetine and tramadol?"
Step-back: "How does CYP2D6 metabolism affect drug interactions?"

Retrieve for both — the step-back query finds foundational material that grounds the specific answer.
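
A minimal sketch of the step-back flow. The `generate` and `retrieve` callables are hypothetical stand-ins for an LLM call and a retriever, and the stubs below exist only so the example runs without external services.

```python
from typing import Callable

def step_back_retrieve(
    question: str,
    generate: Callable[[str], str],
    retrieve: Callable[[str], list[dict]],
) -> dict:
    """Retrieve for the original question and for a generalized version of it."""
    prompt = (
        "Rewrite the question as a more general question about the underlying "
        f"principle. Return only the rewritten question.\n\nQuestion: {question}"
    )
    step_back_question = generate(prompt).strip()

    specific_docs = retrieve(question)
    background_docs = retrieve(step_back_question)

    # Deduplicate by id, keeping specific hits ahead of background material.
    seen, merged = set(), []
    for doc in specific_docs + background_docs:
        if doc["id"] not in seen:
            seen.add(doc["id"])
            merged.append(doc)
    return {"step_back_question": step_back_question, "docs": merged}

# Stub callables so the sketch runs without an LLM or a vector store.
def fake_generate(prompt: str) -> str:
    return "How does CYP2D6 metabolism affect drug interactions?"

def fake_retrieve(query: str) -> list[dict]:
    if "fluoxetine" in query:
        return [{"id": "specific_1"}, {"id": "shared"}]
    return [{"id": "shared"}, {"id": "background_1"}]

out = step_back_retrieve(
    "What is the CYP2D6-mediated interaction between fluoxetine and tramadol?",
    fake_generate,
    fake_retrieve,
)
```

Ordering specific hits before background hits is a design choice here, not part of the technique; a reranker could make that decision instead.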

Q6: What are the tradeoffs between cross-encoder reranking and bi-encoder retrieval?

Answer:

Bi-encoder (used in first-stage retrieval):

Both query and document are encoded independently into vectors
Similarity = cosine(query_vec, doc_vec) — computed at query time
Pros: Document embeddings are pre-computed, so a query costs one encoder pass plus a sub-linear ANN lookup
Cons: No interaction between query and document tokens

Cross-encoder (used in reranking):

Query and document are fed together through the model
Attention can flow between all query and document tokens
Pros: Far more accurate relevance score — the model understands query-document interaction
Cons: Cannot pre-compute. Must run for every (query, doc) pair at query time → too slow for full collection

The two-stage pipeline:

First stage:  Bi-encoder → retrieve 50-100 candidates  (fast, ANN search)
Second stage: Cross-encoder → rerank to top 5           (slow but accurate)

Python

from sentence_transformers.cross_encoder import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, candidates):
    pairs = [(query, doc["content"]) for doc in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked]

Tradeoffs in practice:

| Property | Bi-encoder | Cross-encoder |
|---|---|---|
| Speed | Sub-millisecond | 50-200ms per doc |
| Accuracy | Good | Excellent |
| Scalability | Millions of docs | 20-100 docs max |
| Use case | First-stage retrieval | Reranking |

Q7: How would you handle a RAG system where documents are in multiple languages?

Answer:

Option 1: Multilingual embedding models

Use an embedding model trained on multilingual data. The model maps text in different languages to the same semantic space.

Python

from sentence_transformers import SentenceTransformer

# multilingual-e5-large: supports about 100 languages
model = SentenceTransformer("intfloat/multilingual-e5-large")

# E5 models expect a "query: " prefix on queries and a "passage: " prefix on documents
# Embed query in English, retrieve Spanish/French/German docs
# Works because the model maps equivalent meanings to nearby vectors

Option 2: Translate-then-retrieve

Translate all documents to English during ingestion, embed in English. At query time, translate the query to English. Pros: Use the best English embedding model. Cons: Translation cost, latency, potential meaning loss.

Option 3: Language-specific collections

Maintain separate vector collections per language. Detect query language, route to appropriate collection. Pros: Best retrieval quality per language. Cons: Must duplicate queries in each language if searching across languages.
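
Routing for Option 3 can be sketched as follows. Everything here is hypothetical scaffolding: `detect_language` stands in for a real language-identification library, and each collection is modeled as a callable that takes a query and returns document ids.

```python
from typing import Callable

def route_query(
    query: str,
    collections: dict[str, Callable[[str], list[str]]],
    detect_language: Callable[[str], str],
    default_lang: str = "en",
) -> tuple[str, list[str]]:
    """Send the query to the collection that matches its detected language."""
    lang = detect_language(query)
    if lang not in collections:
        lang = default_lang          # fall back rather than fail
    return lang, collections[lang](query)

# Stubs standing in for a real language detector and per-language indexes.
def toy_detect(text: str) -> str:
    return "es" if "¿" in text or "qué" in text.lower() else "en"

collections = {
    "en": lambda q: ["en_doc_1"],
    "es": lambda q: ["es_doc_1"],
}

lang, hits = route_query("¿Qué es la warfarina?", collections, toy_detect)
```

Short queries are where language detection is least reliable, so the fallback path deserves as much testing as the happy path.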

Clinical AI recommendation: For a multilingual clinical system, use multilingual-e5-large or paraphrase-multilingual-mpnet-base-v2 for retrieval, then generate responses with a model that has native multilingual support (GPT-4o, Claude). Don't rely on translation for clinical content: drug names and dosages must survive translation exactly.

Q8: Explain how RAPTOR improves on standard RAG for global questions.

Answer:

Standard RAG fails on questions like "What are the main themes across all clinical guidelines in this corpus?" because:

No single chunk contains a high-level summary
Retrieving many chunks and sending them all overflows the context window
Averaging chunk embeddings loses nuance

RAPTOR builds a tree:

Level 0 (leaves): Original chunks
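Level 1: Cluster leaves by embedding (k-means in the sketch below; the RAPTOR paper uses Gaussian mixture models, which let a chunk belong to several clusters), summarize each cluster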
Level 2: Cluster level-1 summaries, summarize again
... up to N levels

Python

from sklearn.cluster import KMeans
import numpy as np

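def build_raptor_level(texts, embeddings, summarize, n_clusters=10):
    # summarize: callable taking a list of texts and returning one summary (LLM call)
    n_clusters = min(n_clusters, len(texts))  # KMeans needs n_clusters <= n_samples
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(embeddings)
    summaries = []
    for c in range(n_clusters):
        cluster_texts = [t for t, l in zip(texts, labels) if l == c]
        if cluster_texts:  # skip clusters that ended up empty
            summaries.append(summarize(cluster_texts))
    return summaries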

def raptor_retrieve(query_emb, tree, top_k_per_level=3):
    results = []
    for level, data in tree.items():
        sims = data["embeddings"] @ query_emb  # cosine similarity, assuming unit-norm rows and query
        top_idx = np.argsort(-sims)[:top_k_per_level]
        for i in top_idx:
            results.append({
                "level": level,
                "text": data["texts"][i],
                "similarity": float(sims[i]),
            })
    return sorted(results, key=lambda x: x["similarity"], reverse=True)

For global questions: The highest-level summaries are retrieved. These contain corpus-wide themes.

For specific questions: The leaf-level chunks are retrieved. These contain specific facts.

The key insight: RAPTOR retrieves from all levels simultaneously and lets the LLM use whichever level is appropriate.
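
The level-by-level construction can be sketched as a loop that keeps clustering and summarizing until one node remains. The `cluster` and `summarize` callables are assumptions standing in for an embedding-based clusterer and an LLM summarizer, and this sketch stores only texts per level, leaving out the embeddings that retrieval would need.

```python
from typing import Callable

def build_raptor_tree(
    chunks: list[str],
    cluster: Callable[[list[str]], list[list[str]]],
    summarize: Callable[[list[str]], str],
    max_levels: int = 3,
) -> dict[int, list[str]]:
    """Build {level: texts}, where level 0 holds the original chunks."""
    tree = {0: list(chunks)}
    level = 0
    # Stop when a level has a single node or the level cap is reached.
    while len(tree[level]) > 1 and level < max_levels:
        groups = [g for g in cluster(tree[level]) if g]
        summaries = [summarize(g) for g in groups]
        if len(summaries) >= len(tree[level]):
            break  # clustering did not reduce the node count; avoid looping
        level += 1
        tree[level] = summaries
    return tree

# Stubs: pair up neighbours instead of real clustering, and join text
# instead of calling an LLM.
def pair_cluster(texts: list[str]) -> list[list[str]]:
    return [texts[i:i + 2] for i in range(0, len(texts), 2)]

def join_summary(texts: list[str]) -> str:
    return " + ".join(texts)

tree = build_raptor_tree(["a", "b", "c", "d"], pair_cluster, join_summary)
```

With four leaves and pairwise grouping, the tree has three levels: four chunks, two summaries, and one root summary.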

Q9: What is the "lost-in-the-middle" problem and how do you mitigate it?

Answer:

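Research on long-context models (Liu et al., 2023, "Lost in the Middle") shows that LLMs use information in the middle of a long context less reliably than information at the beginning or end. Accuracy tends to follow a U-shaped curve over position: facts near the edges are recalled well, while facts in the middle are often "lost."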

Python

# Mitigation 1: Put most relevant documents first AND last
def order_for_recall(docs, query):
    # Docs are sorted by relevance (most relevant first)
    # Move second-most-relevant to last position
    if len(docs) <= 2:
        return docs
    reordered = [docs[0]] + docs[2:] + [docs[1]]
    return reordered

# Mitigation 2: Contextual compression — only send relevant sentences
def compress_to_relevant(query, document, llm_client):
    return llm_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content":
            f"Extract only sentences relevant to: '{query}'\n\n{document}"}],
    ).choices[0].message.content

# Mitigation 3: Reduce context size aggressively
# Only send top-3 chunks instead of top-10
# Better to have 3 relevant chunks than 10 mixed-quality ones

# Mitigation 4: Hierarchical compression
# Summarize each document before sending
# Trade-off: some detail is lost, but every document sits closer to the start or end of the context

In clinical AI: The lost-in-the-middle problem is dangerous because a dosing warning buried in chunk 4 of 6 may be ignored. Mitigation:

Place the strongest-matching chunk first and the second-strongest last
Compress or drop weakly relevant middle chunks
Limit to 3-4 chunks total

Q10: Design a complete RAG evaluation pipeline for a clinical drug information system.

Answer:

Phase 1: Retrieval Evaluation

Build a labeled dataset of (query → relevant_doc_ids):

Python

# Manual labels for 100 gold-standard queries
retrieval_labels = [
    {"query": "warfarin CYP2C9 interaction", "relevant_ids": ["lexicomp_warfarin_3", "fda_warfarin_label"]},
    # ...
]

# Metrics to track (embed, mrr, and p_at_k are assumed helpers defined elsewhere)
from statistics import mean

def eval_retrieval(retriever, labels):
    mrr_scores = []
    precision_5 = []
    for case in labels:
        retrieved = retriever.retrieve(embed(case["query"]), top_k=10)
        ids = [d["id"] for d in retrieved]
        relevant = set(case["relevant_ids"])
        mrr_scores.append(mrr(ids, relevant))
        precision_5.append(p_at_k(ids, relevant, k=5))
    return {"mrr": mean(mrr_scores), "p@5": mean(precision_5)}

Phase 2: Generation Evaluation

Python

# RAGAS: faithfulness, context precision/recall, answer relevancy
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# Custom clinical metrics:
# 1. Drug name accuracy — did the model get drug names right?
# 2. Dosage accuracy — did dosing numbers survive correctly?
# 3. Contraindication coverage — were safety warnings preserved?

Phase 3: Clinical Safety Evaluation

Python

# Test on adversarial queries designed to produce dangerous answers
safety_tests = [
    {"query": "Can I give warfarin with aspirin?",
     "must_contain": ["bleeding risk", "caution", "monitor"],
     "must_not_contain": ["safe to combine"]},
]
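
A test list like this needs a runner. The sketch below is one possible shape, assuming `answer_fn` is whatever callable wraps the RAG pipeline (a stub is used here). It relies on case-insensitive substring matching, which is crude: an answer saying "not safe to combine" would still trip the `must_not_contain` check, so treat failures as items for review rather than final verdicts.

```python
def run_safety_tests(safety_tests, answer_fn):
    """Return a list of failures; an empty list means every test passed."""
    failures = []
    for test in safety_tests:
        answer = answer_fn(test["query"]).lower()
        missing = [p for p in test.get("must_contain", []) if p.lower() not in answer]
        forbidden = [p for p in test.get("must_not_contain", []) if p.lower() in answer]
        if missing or forbidden:
            failures.append({"query": test["query"], "missing": missing, "forbidden": forbidden})
    return failures

tests = [
    {"query": "Can I give warfarin with aspirin?",
     "must_contain": ["bleeding risk", "caution", "monitor"],
     "must_not_contain": ["safe to combine"]},
]

def stub_answer(query):
    # Stands in for the real RAG pipeline.
    return "Use caution: combined use raises bleeding risk; monitor INR closely."

failures = run_safety_tests(tests, stub_answer)
```

Returning the missing and forbidden phrases per query makes a failing run actionable without re-reading every answer.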

Phase 4: Regression Testing

Run evaluation on every deployment. Gate deployment on:

MRR above 0.70
RAGAS faithfulness above 0.90
Zero critical safety failures
p95 latency below 5 seconds
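
The gate list translates directly into a check that CI can run. The metric key names (`mrr`, `faithfulness`, and so on) and the example metric values are assumptions for illustration; the thresholds come from the list above.

```python
GATES = {
    "mrr": ("above", 0.70),
    "faithfulness": ("above", 0.90),
    "critical_safety_failures": ("at_most", 0),
    "p95_latency_s": ("below", 5.0),
}

CHECKS = {
    "above": lambda value, threshold: value > threshold,
    "below": lambda value, threshold: value < threshold,
    "at_most": lambda value, threshold: value <= threshold,
}

def check_gates(metrics: dict, gates: dict = GATES) -> list[str]:
    """Return human-readable reasons the deployment should be blocked."""
    violations = []
    for name, (kind, threshold) in gates.items():
        if name not in metrics:
            # A missing metric blocks the deploy instead of passing silently.
            violations.append(f"{name}: metric missing")
        elif not CHECKS[kind](metrics[name], threshold):
            violations.append(f"{name}: {metrics[name]} is not {kind} {threshold}")
    return violations

# Made-up metric values for two hypothetical evaluation runs.
passing = check_gates({"mrr": 0.78, "faithfulness": 0.93,
                       "critical_safety_failures": 0, "p95_latency_s": 3.2})
blocked = check_gates({"mrr": 0.65, "faithfulness": 0.93,
                       "critical_safety_failures": 1, "p95_latency_s": 3.2})
```

An empty list means the deployment may proceed; anything else can be printed as the reason for the block.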

Continuous monitoring:

Log every query + response
Sample 1% for human review
Track metric drift over time
Alert on faithfulness drops (suggests knowledge base issues)
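
For the 1% human-review sample, hashing the query id gives a selection that is stable across processes and reruns, unlike calling a random number generator per request. A minimal sketch, assuming each logged query has a string id (the `q{i}` ids below are synthetic):

```python
import hashlib

def should_sample(query_id: str, rate: float = 0.01) -> bool:
    """Deterministically select roughly `rate` of ids for human review."""
    digest = hashlib.sha256(query_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate

sampled = [f"q{i}" for i in range(100_000) if should_sample(f"q{i}")]
```

Because the decision depends only on the id, the same query is always either in or out of the sample, which makes audits reproducible.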
