Advanced RAG · Lesson 6 of 14

Multi-Query Retrieval: Generate Multiple Questions

The Single Query Problem

A single query searches from one angle. Different phrasings may retrieve different relevant documents:

User query: "Is Warfarin safe during pregnancy?"

Single query retrieval might miss:
  - Documents about "anticoagulation in pregnant patients"
  - Documents about "teratogenicity of vitamin K antagonists"
  - Documents about "pregnancy outcomes with Warfarin exposure"
  - Guidelines framed as "Warfarin is contraindicated in..."

These are all relevant but use different terminology.
A single query won't match all of them.

Multi-Query Approach

Generate several rephrased or reframed versions of the original query, retrieve for each one, then merge and deduplicate the results:

Original: "Is Warfarin safe during pregnancy?"

Generated variants:
  1. "Warfarin safety in pregnant patients"
  2. "anticoagulation therapy pregnancy risks"
  3. "vitamin K antagonist teratogenicity"
  4. "Warfarin contraindications pregnancy"
  5. "VKA fetal outcomes maternal anticoagulation"

Retrieve top-5 for each → 5 × 5 = 25 candidates (with duplicates)
Deduplicate by document ID
Rerank the merged set
Return top-5 after reranking
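
The four steps above can be sketched end to end without any external services. This is a minimal illustration, not a production pipeline: `merge_and_rerank`, `toy_retriever`, `token_overlap`, and `CORPUS` are hypothetical names introduced here, the variants are hand-written rather than LLM-generated, and the token-overlap scorer merely stands in for a real reranker such as a cross-encoder.

```python
from typing import Callable

Doc = dict  # each doc is assumed to look like {"id": str, "text": str}

def merge_and_rerank(
    original_query: str,
    variants: list[str],
    retriever: Callable[[str, int], list[Doc]],
    score_fn: Callable[[str, str], float],
    top_k_per_query: int = 5,
    final_top_k: int = 5,
) -> list[Doc]:
    # Steps 1-2: retrieve for each variant and deduplicate by document ID
    merged: dict[str, Doc] = {}
    for variant in variants:
        for doc in retriever(variant, top_k_per_query):
            merged.setdefault(doc["id"], doc)

    # Steps 3-4: rerank the merged set against the ORIGINAL query, keep the best
    ranked = sorted(
        merged.values(),
        key=lambda d: score_fn(original_query, d["text"]),
        reverse=True,
    )
    return ranked[:final_top_k]


# Toy stand-ins so the sketch runs without a vector store or reranker model
CORPUS = [
    {"id": "d1", "text": "warfarin is contraindicated in pregnancy"},
    {"id": "d2", "text": "vitamin k antagonist teratogenicity"},
    {"id": "d3", "text": "anticoagulation in pregnant patients"},
    {"id": "d4", "text": "aspirin dosing in adults"},
]

def token_overlap(query: str, text: str) -> float:
    """Placeholder scorer: number of shared lowercase tokens."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))

def toy_retriever(query: str, top_k: int) -> list[Doc]:
    scored = [(token_overlap(query, d["text"]), d) for d in CORPUS]
    hits = [d for s, d in sorted(scored, key=lambda p: p[0], reverse=True) if s > 0]
    return hits[:top_k]

results = merge_and_rerank(
    "Is Warfarin safe during pregnancy?",
    [
        "warfarin contraindicated pregnancy",
        "vitamin k antagonist teratogenicity",
        "anticoagulation pregnant patients",
    ],
    toy_retriever,
    token_overlap,
)
print([d["id"] for d in results])
```

In this toy corpus the original query on its own only overlaps with `d1`; the two other relevant documents enter the candidate set solely through the variants, which is the effect multi-query retrieval aims for.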

Implementation

Python
from anthropic import Anthropic
from sentence_transformers import SentenceTransformer
import numpy as np

client = Anthropic()
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def generate_query_variants(query: str, n_variants: int = 4) -> list[str]:
    """Generate multiple alternative phrasings of the query."""
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=300,
        messages=[{"role": "user", "content":
            f"""Generate {n_variants} alternative phrasings of this question for database retrieval.
Each phrasing should use different terminology while preserving the meaning.
Return ONLY the phrasings, one per line, no numbering.

Question: {query}"""}]
    )
    variants = [line.strip() for line in response.content[0].text.strip().split("\n")
                if line.strip()]
    return [query] + variants[:n_variants]  # include original

def multi_query_retrieve(
    query: str,
    retriever,  # callable(query: str, top_k: int) -> list[dict]; each dict needs an "id" key
    top_k_per_query: int = 5,
    final_top_k: int = 5
) -> list[dict]:
    """Retrieve for multiple query variants and merge results."""
    variants = generate_query_variants(query)

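    # Retrieve for each variant, tracking how often each doc appears
    all_docs: dict[str, dict] = {}
    doc_counts: dict[str, int] = {}
    doc_scores: dict[str, list[float]] = {}

    for variant in variants:
        results = retriever(variant, top_k=top_k_per_query)
        for doc in results:
            doc_id = doc["id"]
            if doc_id not in all_docs:
                all_docs[doc_id] = doc
            doc_counts[doc_id] = doc_counts.get(doc_id, 0) + 1  # one vote per retrieving variant
            # Assumption: results may carry a "score" key (higher = more similar);
            # docs without one fall back to 0.0
            doc_scores.setdefault(doc_id, []).append(doc.get("score", 0.0))

    def sort_key(doc: dict) -> tuple[int, float]:
        scores = doc_scores[doc["id"]]
        return doc_counts[doc["id"]], sum(scores) / len(scores)

    # Sort by: number of query variants that retrieved this doc (consensus),
    # then by average similarity score
    sorted_docs = sorted(all_docs.values(), key=sort_key, reverse=True)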

    return sorted_docs[:final_top_k]
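
Even with a "no numbering" instruction in the prompt, a model may occasionally return numbered, bulleted, quoted, or duplicated lines. The following is a small sketch of a defensive parser that could sit between the LLM response and retrieval. `clean_variants` is a hypothetical helper, not part of any library, and the sample `raw` string is made up for illustration.

```python
import re

def clean_variants(raw_text: str, original: str, n_variants: int = 4) -> list[str]:
    """Hypothetical helper: normalise LLM output into a list of unique queries."""
    seen = {original.strip().lower()}
    queries = [original]  # keep the original query first
    for line in raw_text.splitlines():
        # Drop leading numbering or bullets such as "1.", "2)", "-", "*"
        text = re.sub(r"^\s*(?:\d+[.)]|[-*•])\s*", "", line).strip().strip('"')
        key = text.lower()
        if text and key not in seen:
            seen.add(key)
            queries.append(text)
        if len(queries) == n_variants + 1:
            break
    return queries

raw = (
    "1. Warfarin safety in pregnant patients\n"
    '- "vitamin K antagonist teratogenicity"\n'
    "\n"
    "Is Warfarin safe during pregnancy?\n"
    "2) Warfarin safety in pregnant patients"
)
cleaned = clean_variants(raw, "Is Warfarin safe during pregnancy?")
print(cleaned)
```

Dropping duplicates before retrieval also avoids paying for an embedding call and a vector search that cannot add new candidates.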

Consensus Scoring

Documents retrieved by multiple query variants are more likely to be relevant:

Variant 1 retrieves: [A, B, C, D, E]
Variant 2 retrieves: [A, F, G, B, H]
Variant 3 retrieves: [I, A, J, C, K]

Document A: retrieved by all 3 → high confidence in relevance
Document B: retrieved by 2 → moderate confidence
Document I: retrieved by 1 → lower confidence

Multi-query consensus ≈ implicit voting on relevance

This is analogous to ensemble retrieval, which combines multiple retrieval systems to increase robustness. Here the ensemble is over query phrasings rather than over retrievers.
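
The vote counts in the example above can be reproduced with a few lines of Python. This is only an illustrative sketch: the `retrieved` mapping hard-codes the three result lists from the example, and the alphabetical tie-break is an arbitrary choice made here to keep the order deterministic.

```python
from collections import Counter

retrieved = {
    "variant_1": ["A", "B", "C", "D", "E"],
    "variant_2": ["A", "F", "G", "B", "H"],
    "variant_3": ["I", "A", "J", "C", "K"],
}

# set() guards against a document appearing twice in one result list
votes = Counter(doc_id for ids in retrieved.values() for doc_id in set(ids))

# Highest vote count first; ties broken alphabetically for a stable order
ranking = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
print(ranking[:3])
```

Note that Document C also collects two votes, so it sits in the same moderate-confidence tier as Document B.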


Performance vs Cost

Single query:
  1 embedding call + 1 vector search
  Latency: 20-50ms
  Cost: minimal

Multi-query (4 queries in total: the original plus 3 variants):
  1 LLM call (variant generation) + 4 embedding calls + 4 vector searches + deduplication
  Latency: 150-400ms
  Cost: 1 extra LLM call + 3 extra embedding/search calls

Quality improvement:
  Typical improvement: 10-25% on MRR@5 vs single query (varies with corpus and query mix, so measure on your own evaluation set)
  Most impactful for: ambiguous queries, technical queries with vocabulary variation

When it's worth it:
  Complex clinical queries where missing a key guideline is costly
  Research-style queries where comprehensive coverage matters
  When reranking is also applied (multi-query provides better input to reranker)

When it's not worth it:
  Simple factual lookups ("What is Warfarin?")
  High-throughput real-time systems where latency matters
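
Part of the extra latency comes from running the per-variant retrievals one after another. When the retriever is I/O-bound (for example, a network call to a vector database), the searches can be issued concurrently. The sketch below assumes a retriever with the signature `retriever(query, top_k)`. `retrieve_all` and `slow_retriever` are hypothetical names, and the 50 ms sleep only simulates a search. The variant-generation LLM call still has to finish before any retrieval can start.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def retrieve_all(variants, retriever, top_k=5, max_workers=4):
    """Run one retrieval per variant in parallel; results keep variant order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as `variants`
        return list(pool.map(lambda q: retriever(q, top_k), variants))

def slow_retriever(query, top_k):
    """Stand-in for a remote vector search (simulated with a short sleep)."""
    time.sleep(0.05)
    return [{"id": f"{query}-{i}"} for i in range(top_k)]

variants = ["q1", "q2", "q3", "q4"]
per_variant = retrieve_all(variants, slow_retriever, top_k=2)
print([[d["id"] for d in docs] for docs in per_variant])
```

Threads help here because the work is mostly waiting. A CPU-bound local retriever would likely see little benefit from this approach.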

LangChain Multi-Query Retriever

Python
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
# `documents` is assumed to be a list[str] of text chunks prepared earlier
vectorstore = FAISS.from_texts(documents, OpenAIEmbeddings())
base_retriever = vectorstore.as_retriever()

multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=base_retriever,
    llm=llm
)

# invoke() is the current retriever entry point; get_relevant_documents() is deprecated
results = multi_query_retriever.invoke(
    "Is Warfarin safe during pregnancy?"
)

Interview Answer

"Multi-query retrieval generates multiple phrasings of the user's question using a small LLM, retrieves candidates for each, then deduplicates and merges the results. Documents retrieved by multiple query variants receive higher consensus confidence. This typically improves MRR@5 by 10-25% for technical or ambiguous queries, because it captures documents that use different terminology than the original query. For 'Warfarin safety in pregnancy,' variants capture 'anticoagulation in pregnant patients,' 'VKA teratogenicity,' and 'Warfarin contraindications': all relevant, but using different vocabulary. The trade-off is one extra LLM call plus several times the retrieval work (roughly 150-400ms versus 20-50ms for a single query), making it best suited for complex queries where missing a key document is costly."