Advanced RAG · Lesson 6 of 14
Multi-Query Retrieval: Generate Multiple Questions
The Single Query Problem
A single query searches from one angle. Different phrasings may retrieve different relevant documents:
User query: "Is Warfarin safe during pregnancy?"
Single query retrieval might miss:
- Documents about "anticoagulation in pregnant patients"
- Documents about "teratogenicity of vitamin K antagonists"
- Documents about "pregnancy outcomes with Warfarin exposure"
- Guidelines framed as "Warfarin is contraindicated in..."
These are all relevant but use different terminology.
A single query won't match all of them.Multi-Query Approach
Generate multiple rephrased/reframed versions of the original query. Retrieve for each. Merge and deduplicate:
Original: "Is Warfarin safe during pregnancy?"
Generated variants:
1. "Warfarin safety in pregnant patients"
2. "anticoagulation therapy pregnancy risks"
3. "vitamin K antagonist teratogenicity"
4. "Warfarin contraindications pregnancy"
5. "VKA fetal outcomes maternal anticoagulation"
Retrieve top-5 for each → 5 × 5 = 25 candidates (with duplicates)
Deduplicate by document ID
Rerank the merged set
Return top-5 after rerankingImplementation
from anthropic import Anthropic
from sentence_transformers import SentenceTransformer
import numpy as np
client = Anthropic()
embedder = SentenceTransformer("all-MiniLM-L6-v2")
def generate_query_variants(query: str, n_variants: int = 4) -> list[str]:
"""Generate multiple alternative phrasings of the query."""
response = client.messages.create(
model="claude-haiku-4-5-20251001",
max_tokens=300,
messages=[{"role": "user", "content":
f"""Generate {n_variants} alternative phrasings of this question for database retrieval.
Each phrasing should use different terminology while preserving the meaning.
Return ONLY the phrasings, one per line, no numbering.
Question: {query}"""}]
)
variants = [line.strip() for line in response.content[0].text.strip().split("\n")
if line.strip()]
return [query] + variants[:n_variants] # include original
def multi_query_retrieve(
query: str,
retriever, # function(query: str) -> list[dict]
top_k_per_query: int = 5,
final_top_k: int = 5
) -> list[dict]:
"""Retrieve for multiple query variants and merge results."""
variants = generate_query_variants(query)
# Retrieve for each variant
all_docs: dict[str, dict] = {}
doc_counts: dict[str, int] = {}
for variant in variants:
results = retriever(variant, top_k=top_k_per_query)
for doc in results:
doc_id = doc["id"]
if doc_id not in all_docs:
all_docs[doc_id] = doc
doc_counts[doc_id] = 1
else:
doc_counts[doc_id] += 1 # appeared in multiple queries
# Sort by: number of query variants that retrieved this doc (consensus),
# then by average similarity score
sorted_docs = sorted(
all_docs.values(),
key=lambda d: doc_counts[d["id"]],
reverse=True
)
return sorted_docs[:final_top_k]Consensus Scoring
Documents retrieved by multiple query variants are more likely to be relevant:
Variant 1 retrieves: [A, B, C, D, E]
Variant 2 retrieves: [A, F, G, B, H]
Variant 3 retrieves: [I, A, J, C, K]
Document A: retrieved by all 3 → high confidence in relevance
Document B: retrieved by 2 → moderate confidence
Document I: retrieved by 1 → lower confidence
Multi-query consensus ≈ implicit voting on relevanceThis is equivalent to ensemble retrieval — using multiple retrieval systems to increase robustness.
Performance vs Cost
Single query:
1 LLM embedding call + 1 vector search
Latency: 20-50ms
Cost: minimal
Multi-query (4 variants):
1 LLM call (variant generation) + 4 embedding calls + 4 vector searches + deduplication
Latency: 150-400ms
Cost: 1 extra LLM call + 3 extra embedding/search calls
Recall improvement:
Typical improvement: 10-25% on MRR@5 vs single query
Most impactful for: ambiguous queries, technical queries with vocabulary variation
When it's worth it:
Complex clinical queries where missing a key guideline is costly
Research-style queries where comprehensive coverage matters
When reranking is also applied (multi-query provides better input to reranker)
When it's not worth it:
Simple factual lookups ("What is Warfarin?")
High-throughput real-time systems where latency mattersLangChain Multi-Query Retriever
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
vectorstore = FAISS.from_texts(documents, OpenAIEmbeddings())
base_retriever = vectorstore.as_retriever()
multi_query_retriever = MultiQueryRetriever.from_llm(
retriever=base_retriever,
llm=llm
)
results = multi_query_retriever.get_relevant_documents(
"Is Warfarin safe during pregnancy?"
)Interview Answer
"Multi-query retrieval generates multiple phrasings of the user's question using a small LLM, retrieves candidates for each, then deduplicates and merges the results. Documents retrieved by multiple query variants receive higher consensus confidence. This improves recall by 10-25% for technical or ambiguous queries — capturing documents that use different terminology than the original query. For 'Warfarin safety in pregnancy,' variants capture 'anticoagulation in pregnant patients,' 'VKA teratogenicity,' and 'Warfarin contraindications' — all relevant but using different vocabulary. The trade-off is 3-4× higher retrieval latency and cost, making it best suited for complex queries where missing a key document is costly."