Advanced RAG · Lesson 7 of 14
HyDE: Hypothetical Document Embeddings
The Query-Document Gap
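Queries and documents are very different kinds of text (queries are short and conversational, documents are long and informational), so their embeddings can land in different regions of the vector space even when the same model encodes both: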
Query embedding:    embed("Is Warfarin safe during pregnancy?")
                    → embedding in "question space"
Document embedding: embed("Warfarin (Coumadin) is classified as
                    FDA Pregnancy Category X; it is contraindicated...")
                    → embedding in "document space"
These may not be as similar as we'd like, even though the document
directly answers the question.
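One way to check how large this gap is on your own data is to compute the cosine similarity between a query embedding and the embedding of a document known to answer it. The sketch below assumes cosine similarity is the retrieval metric and takes the embedding function as a parameter, so it is not tied to a particular model. The names `embed_fn` and `query_document_similarity` are illustrative, not part of any library.

```python
import math
from typing import Callable, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for zero vectors (an assumption of this sketch)
    return dot / (norm_a * norm_b)


def query_document_similarity(
    embed_fn: Callable[[str], Sequence[float]],  # hypothetical: text -> vector
    query: str,
    document: str,
) -> float:
    """Similarity between a query and a document that is known to answer it."""
    return cosine_similarity(embed_fn(query), embed_fn(document))
```

With a sentence-transformers model such as the one used later in this lesson, a wrapper like `lambda text: embedder.encode(text).tolist()` should work as `embed_fn`. Comparing this score across a handful of known question/answer pairs gives a rough sense of whether the gap matters for a given corpus.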
The gap is especially wide for:
Questions vs encyclopaedic documents
Short queries vs long technical passages
Conversational phrasing vs formal medical writing
HyDE: The Idea
Instead of embedding the query, generate a hypothetical document that would answer the query, and embed that:
Step 1: Generate a hypothetical answer
  Query: "Is Warfarin safe during pregnancy?"
  LLM generates: "Warfarin is contraindicated during pregnancy, particularly
                  in the first trimester and near term. It crosses the placenta
                  and can cause Warfarin embryopathy..."
Step 2: Embed the hypothetical answer
  embed(hypothetical_answer)
  → embedding in "document space": matches the vocabulary and style
    of actual documents in the knowledge base
Step 3: Use the hypothetical embedding for retrieval
  Search the knowledge base using this embedding
The hypothetical answer uses the vocabulary, style, and structure of medical documents, making it a better anchor for retrieval than the question embedding. It serves only as a search key: the final answer is still generated from the real documents that are retrieved.
Implementation
from anthropic import Anthropic
from sentence_transformers import SentenceTransformer

client = Anthropic()
embedder = SentenceTransformer("all-MiniLM-L6-v2")


def generate_hypothetical_document(query: str, domain: str = "clinical medicine") -> str:
    """Generate a hypothetical document that would answer the query."""
    # NOTE: apart from {domain}, the prompt wording is specific to medical text;
    # adjust it if you pass a different domain.
    prompt = (
        f"Write a short, factual paragraph from a {domain} reference document\n"
        "that would directly answer this question. Use formal medical language.\n"
        "Write as if excerpting from a clinical guideline or medical reference.\n"
        "Do NOT say 'This document answers...'; just write the content directly.\n"
        f"Question: {query}"
    )
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # fast, cheap model
        max_tokens=300,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text.strip()


def hyde_retrieve(
    query: str,
    vector_search_fn,  # function(embedding, top_k) -> list[dict]
    top_k: int = 5,
) -> list[dict]:
    """Retrieve using a hypothetical document embedding."""
    hypothetical = generate_hypothetical_document(query)
    hyp_embedding = embedder.encode(hypothetical)
    return vector_search_fn(hyp_embedding, top_k=top_k)


# Example:
query = "Is Warfarin safe during pregnancy?"
hyp_doc = generate_hypothetical_document(query)
print(hyp_doc)
# Example output (the exact text will vary between runs):
# "Warfarin (Coumadin) is classified as FDA Pregnancy Category X and is
#  contraindicated during pregnancy. First trimester exposure may cause
#  Warfarin embryopathy (nasal hypoplasia, stippled epiphyses). Near-term
#  exposure carries risk of neonatal hemorrhage..."

HyDE vs Direct Query Retrieval
Direct query embedding:
  Query: "Is Warfarin safe during pregnancy?"
  embedding of short conversational question
  → may not match clinical reference document embeddings closely
HyDE:
  Hypothetical: "Warfarin is FDA Category X, contraindicated in pregnancy..."
  embedding of clinical-style text
  → typically much closer to actual clinical document embeddings
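To see whether the two approaches actually retrieve different documents, it can help to run both against the same index and compare the results. The sketch below is a minimal illustration under stated assumptions. `make_in_memory_search` is a hypothetical brute-force stand-in for a real vector store, written to match the `function(embedding, top_k) -> list[dict]` signature used in the implementation above. `compare_direct_vs_hyde` is also an illustrative name, and it takes the embedding and generation functions as parameters so the comparison does not depend on a specific model.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def _cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def make_in_memory_search(docs: list[dict]) -> Callable[..., list[dict]]:
    """Build a brute-force search function over docs.

    Each doc is assumed to be a dict with "id" and "embedding" keys.
    """
    def vector_search_fn(embedding: Vector, top_k: int = 5) -> list[dict]:
        ranked = sorted(
            docs,
            key=lambda d: _cosine(embedding, d["embedding"]),
            reverse=True,
        )
        return ranked[:top_k]

    return vector_search_fn


def compare_direct_vs_hyde(
    query: str,
    embed_fn: Callable[[str], Vector],   # text -> embedding
    generate_fn: Callable[[str], str],   # query -> hypothetical document
    vector_search_fn: Callable[..., list[dict]],
    top_k: int = 5,
) -> dict:
    """Run both retrieval styles and report which ids each one returns."""
    direct = vector_search_fn(embed_fn(query), top_k=top_k)
    hyde = vector_search_fn(embed_fn(generate_fn(query)), top_k=top_k)
    direct_ids = [d["id"] for d in direct]
    hyde_ids = [d["id"] for d in hyde]
    return {
        "direct": direct_ids,
        "hyde": hyde_ids,
        "only_in_hyde": [i for i in hyde_ids if i not in direct_ids],
    }
```

With the code from this lesson, `embed_fn` would be `embedder.encode` and `generate_fn` would be `generate_hypothetical_document`. The ids under `only_in_hyde` are the documents HyDE contributes beyond direct retrieval, which are the ones worth inspecting by hand.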
Empirical results (Gao et al., 2022):
In a zero-shot setting, HyDE outperforms direct query embedding with the same unsupervised encoder (Contriever) on most BEIR benchmark tasks evaluated
Gains are largest for: fact retrieval, medical/scientific queries
Gains are smaller for: simpler factual lookups
When HyDE Helps and Hurts
HyDE helps:
Complex technical questions where query vocabulary differs from document vocabulary
Medical and scientific queries (formal vs conversational gap is large)
Queries about specific clinical scenarios or guidelines
Low-resource languages (generate hypothetical in the target language)
HyDE hurts or adds little:
Queries where the model doesn't know the answer (hallucinated hypothesis)
Simple keyword lookups ("What is Warfarin?") — gap is small already
Domains where the model has limited knowledge (may generate wrong hypothesis)
Clinical risk:
If the hypothetical answer is factually wrong, it may retrieve irrelevant docs
The LLM's hallucinations could misdirect retrieval
Mitigation: use HyDE only when you trust the model's domain knowledge,
or combine it with standard retrieval and merge the two result lists with reciprocal rank fusion (RRF)
Ensemble: HyDE + Standard Retrieval
More robust than either alone:
from collections import defaultdict


def ensemble_retrieve(query: str, vector_search_fn, top_k: int = 5) -> list[dict]:
    """Combine standard query embedding and HyDE embeddings via RRF."""
    # Standard retrieval (over-fetch so fusion has more candidates)
    query_emb = embedder.encode(query)
    standard_results = vector_search_fn(query_emb, top_k=top_k * 2)

    # HyDE retrieval
    hyp_doc = generate_hypothetical_document(query)
    hyp_emb = embedder.encode(hyp_doc)
    hyde_results = vector_search_fn(hyp_emb, top_k=top_k * 2)

    # RRF fusion: each list contributes 1 / (k + rank) per document
    k = 60
    scores = defaultdict(float)
    for rank, doc in enumerate(standard_results, 1):
        scores[doc["id"]] += 1.0 / (k + rank)
    for rank, doc in enumerate(hyde_results, 1):
        scores[doc["id"]] += 1.0 / (k + rank)

    # Build final ranked list
    all_docs = {d["id"]: d for d in standard_results + hyde_results}
    ranked_ids = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return [all_docs[doc_id] for doc_id in ranked_ids]

Interview Answer
"HyDE (Hypothetical Document Embeddings) addresses the query-document embedding gap: a short conversational question lives in a different embedding subspace than long formal reference documents. HyDE generates a hypothetical paragraph-length answer using a small LLM, then embeds that hypothetical document for retrieval. Since the hypothetical uses clinical vocabulary and formal structure, it's closer to real documents in embedding space. Gao et al. showed this outperforms direct query embedding on most information retrieval benchmarks. The risk is hallucination: if the model generates a wrong hypothesis, retrieval is misdirected. Mitigation: combine HyDE and standard retrieval via RRF for robustness."