HyDE: Hypothetical Document Embeddings — Advanced RAG | Learnixo

The Query-Document Gap

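Queries and documents are different kinds of text: queries are short and conversational, documents are long and informational. Even when the same model embeds both, their vectors can land in different regions of the embedding space: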

Query embedding:    embed("Is Warfarin safe during pregnancy?")
                    → embedding in "question space"

Document embedding: embed("Warfarin (Coumadin) is classified as
                    FDA Pregnancy Category X — it is contraindicated...")
                    → embedding in "document space"

The two embeddings can have a lower cosine similarity than expected,
even though the document directly answers the question.

The gap is especially wide for:
  Questions vs encyclopaedic documents
  Short queries vs long technical passages
  Conversational phrasing vs formal medical writing

HyDE: The Idea

Instead of embedding the query, generate a hypothetical document that would answer the query, and embed that:

Step 1: Generate a hypothetical answer
  Query: "Is Warfarin safe during pregnancy?"
  LLM generates: "Warfarin is contraindicated during pregnancy, particularly
                  in the first trimester and near term. It crosses the placenta
                  and can cause Warfarin embryopathy..."

Step 2: Embed the hypothetical answer
  embed(hypothetical_answer)
  → embedding in "document space" — matches the vocabulary and style
    of actual documents in the knowledge base

Step 3: Use the hypothetical embedding for retrieval
  Search the knowledge base using this embedding

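The hypothetical answer uses the vocabulary, style, and structure of medical documents, which makes it a better anchor for retrieval than the question embedding. It serves only as a search key: the final answer is still generated from the real documents that are retrieved.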

Implementation

Python

from anthropic import Anthropic
from sentence_transformers import SentenceTransformer

client = Anthropic()
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def generate_hypothetical_document(query: str, domain: str = "clinical medicine") -> str:
    """Generate a hypothetical document that would answer the query."""
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # fast, cheap model
        max_tokens=300,
        messages=[{"role": "user", "content":
            f"""Write a short, factual paragraph from a {domain} reference document
that would directly answer this question. Use the formal language of that field.
Write as if excerpting from an authoritative guideline or reference text.
Do NOT say 'This document answers...'. Just write the content directly.

Question: {query}"""}]
    )
    return response.content[0].text.strip()

def hyde_retrieve(
    query: str,
    vector_search_fn,  # function(embedding, top_k) -> list[dict]
    top_k: int = 5
) -> list[dict]:
    """Retrieve using a hypothetical document embedding."""
    hypothetical = generate_hypothetical_document(query)
    hyp_embedding = embedder.encode(hypothetical)
    return vector_search_fn(hyp_embedding, top_k=top_k)

# Example:
query = "Is Warfarin safe during pregnancy?"
hyp_doc = generate_hypothetical_document(query)
print(hyp_doc)
# → "Warfarin (Coumadin) is classified as FDA Pregnancy Category X and is
#    contraindicated during pregnancy. First trimester exposure may cause
#    Warfarin embryopathy (nasal hypoplasia, stippled epiphyses). Near-term
#    exposure carries risk of neonatal hemorrhage..."
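
`hyde_retrieve` leaves `vector_search_fn` abstract. For local experiments, a brute-force cosine search over a small in-memory corpus is enough to satisfy the `function(embedding, top_k) -> list[dict]` contract. The sketch below is one possible stand-in, not part of the original pipeline: `make_in_memory_search` and the added `score` field are hypothetical names introduced here, and a production system would normally query a vector database instead.

```python
import math
from typing import Callable, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def make_in_memory_search(
    docs: list[dict],
    doc_embeddings: Sequence[Sequence[float]],
) -> Callable[..., list[dict]]:
    """Build a search function over pre-computed document embeddings.

    docs[i] must be a dict with at least an "id" key, and doc_embeddings[i]
    must be the embedding of docs[i].
    """
    if len(docs) != len(doc_embeddings):
        raise ValueError("docs and doc_embeddings must have the same length")

    def search(embedding: Sequence[float], top_k: int = 5) -> list[dict]:
        # Score every document against the query vector, then keep the best top_k.
        scored = [
            {**doc, "score": cosine_similarity(embedding, vec)}
            for doc, vec in zip(docs, doc_embeddings)
        ]
        scored.sort(key=lambda d: d["score"], reverse=True)
        return scored[:top_k]

    return search
```

Assuming each document dict also carries a "text" field, `make_in_memory_search(docs, embedder.encode([d["text"] for d in docs]))` yields a function that can be passed straight to `hyde_retrieve`.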

HyDE vs Direct Query Retrieval

Direct query embedding:
  Query: "Is Warfarin safe during pregnancy?"
  embedding of short conversational question
  → may not match clinical reference document embeddings closely

HyDE:
  Hypothetical: "Warfarin is FDA Category X, contraindicated in pregnancy..."
  embedding of clinical-style text
  → typically closer to actual clinical document embeddings

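Empirical results (Gao et al., 2022):
  HyDE outperforms the unsupervised dense retriever it builds on (Contriever)
  on most of the TREC DL and BEIR tasks evaluated
  Gains tend to be larger where query and document style diverge,
  e.g. medical/scientific queries
  Gains tend to be smaller for simple lookups where the query already
  shares vocabulary with the documents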
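
Whether HyDE closes the gap depends on the embedding model and the corpus, so it is worth measuring rather than assuming. The following is a minimal sketch of such a check: for a query with a known relevant passage, compare how similar the passage is to the raw query and to the hypothetical document. `embedding_gap` and its `embed_fn` parameter are hypothetical names; `embed_fn` is assumed to be any callable that maps a string to a vector, such as `embedder.encode`.

```python
import math
from typing import Callable, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def embedding_gap(
    embed_fn: Callable[[str], Sequence[float]],
    query: str,
    hypothetical: str,
    relevant_passage: str,
) -> dict:
    """Compare query-vs-passage and hypothetical-vs-passage similarity."""
    passage_vec = embed_fn(relevant_passage)
    query_sim = cosine_similarity(embed_fn(query), passage_vec)
    hyde_sim = cosine_similarity(embed_fn(hypothetical), passage_vec)
    return {
        "query_sim": query_sim,
        "hyde_sim": hyde_sim,
        "hyde_gain": hyde_sim - query_sim,  # positive means HyDE moved closer
    }
```

A positive `hyde_gain` averaged over a sample of labelled queries suggests HyDE is helping on that corpus; a value near zero suggests the extra LLM call is not buying much.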

When HyDE Helps and Hurts

HyDE helps:
  Complex technical questions where query vocabulary differs from document vocabulary
  Medical and scientific queries (formal vs conversational gap is large)
  Queries about specific clinical scenarios or guidelines
  Low-resource languages (generate hypothetical in the target language)

HyDE hurts or adds little:
  Queries or domains where the model lacks knowledge (it may hallucinate a wrong hypothesis)
  Simple keyword lookups ("What is Warfarin?"), where the gap is already small

Clinical risk:
  If the hypothetical answer is factually wrong, it may retrieve irrelevant docs
  The LLM's hallucinations could misdirect retrieval
  Mitigation: use HyDE only when you trust the model's domain knowledge,
  or combine with standard retrieval and use RRF to merge

Ensemble: HyDE + Standard Retrieval

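Fusing both result lists is more robust than relying on either alone: documents found by the plain query still score even when the hypothesis is off-target.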

Python

from collections import defaultdict

def ensemble_retrieve(query: str, vector_search_fn, top_k: int = 5) -> list[dict]:
    """Combine standard query embedding and HyDE embeddings via RRF."""

    # Standard retrieval
    query_emb = embedder.encode(query)
    standard_results = vector_search_fn(query_emb, top_k=top_k * 2)

    # HyDE retrieval
    hyp_doc = generate_hypothetical_document(query)
    hyp_emb = embedder.encode(hyp_doc)
    hyde_results = vector_search_fn(hyp_emb, top_k=top_k * 2)

    # RRF fusion
    k = 60  # RRF smoothing constant; 60 is the commonly used default
    scores = defaultdict(float)
    for rank, doc in enumerate(standard_results, 1):
        scores[doc["id"]] += 1.0 / (k + rank)
    for rank, doc in enumerate(hyde_results, 1):
        scores[doc["id"]] += 1.0 / (k + rank)

    # Build final ranked list
    all_docs = {d["id"]: d for d in standard_results + hyde_results}
    ranked_ids = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return [all_docs[doc_id] for doc_id in ranked_ids]
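
To make the fusion step concrete, here is a standalone sketch of reciprocal rank fusion over plain lists of document IDs, using the same formula as above: each list contributes 1 / (k + rank) for every document it contains. `rrf_fuse` is a hypothetical helper name and the IDs are made up for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked ID lists with reciprocal rank fusion; best document first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


standard = ["doc_a", "doc_b", "doc_c"]  # ranking from the plain query embedding
hyde = ["doc_b", "doc_d", "doc_a"]      # ranking from the hypothetical document
fused = rrf_fuse([standard, hyde])
print([doc_id for doc_id, _ in fused])
# → ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Here doc_b comes first because both retrievers rank it highly (1/62 + 1/61), while doc_d and doc_c, each found by only one retriever, fall to the bottom.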

Interview Answer

"HyDE (Hypothetical Document Embeddings) addresses the query-document embedding gap: a short conversational question tends to land in a different region of embedding space than long formal reference documents. HyDE generates a hypothetical paragraph-length answer using a small LLM, then embeds that hypothetical document for retrieval. Since the hypothetical uses clinical vocabulary and formal structure, it's closer to real documents in embedding space. Gao et al. (2022) reported that this outperforms direct query embedding on most of the benchmark tasks they evaluated. The risk is hallucination: if the model generates a wrong hypothesis, retrieval is misdirected. Mitigation: combine HyDE and standard retrieval via RRF for robustness."