HyDE: Hypothetical Document Embeddings

The Query-Document Gap

Queries and documents differ in form: queries are short and conversational, while documents are long and informational. Even when the same model embeds both, the two can land in different regions of the vector space:

Query embedding:    embed("Is Warfarin safe during pregnancy?")
                    → embedding in "question space"

Document embedding: embed("Warfarin (Coumadin) is classified as
                    FDA Pregnancy Category X; it is contraindicated...")
                    → embedding in "document space"

The cosine similarity between these two embeddings can be lower than
expected, even though the document directly answers the question.

The gap is especially wide for:
  Questions vs encyclopaedic documents
  Short queries vs long technical passages
  Conversational phrasing vs formal medical writing
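
Whether the gap matters for a particular encoder and corpus can be checked by measuring it. The sketch below is one possible way to do that, not part of the original method: `cosine_similarity` and `query_document_gap` are hypothetical helper names, and `embed_fn` stands for any function that maps a string to a vector.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]

def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two dense vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def query_document_gap(
    embed_fn: Callable[[str], Vector],  # assumption: any text -> vector function
    query: str,
    answer_style_text: str,
    document: str,
) -> dict:
    """Compare how close a question and an answer-style passage sit to a document."""
    doc_vec = embed_fn(document)
    return {
        "query_vs_document": cosine_similarity(embed_fn(query), doc_vec),
        "answer_style_vs_document": cosine_similarity(
            embed_fn(answer_style_text), doc_vec
        ),
    }
```

With sentence-transformers, `embed_fn` could be the `encode` method of a loaded `SentenceTransformer` model. If the second score is not clearly higher than the first on a sample of your own queries, the gap is probably small for that setup.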

HyDE: The Idea

Instead of embedding the query, generate a hypothetical document that would answer the query, and embed that:

Step 1: Generate a hypothetical answer
  Query: "Is Warfarin safe during pregnancy?"
  LLM generates: "Warfarin is contraindicated during pregnancy, particularly
                  in the first trimester and near term. It crosses the placenta
                  and can cause Warfarin embryopathy..."

Step 2: Embed the hypothetical answer
  embed(hypothetical_answer)
  → embedding in "document space" — matches the vocabulary and style
    of actual documents in the knowledge base

Step 3: Use the hypothetical embedding for retrieval
  Search the knowledge base using this embedding

The hypothetical answer uses the vocabulary, style, and structure of medical documents, so its embedding is usually a better retrieval anchor than the embedding of the question itself.

Implementation

Python

from anthropic import Anthropic
from sentence_transformers import SentenceTransformer
import numpy as np

client = Anthropic()
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def generate_hypothetical_document(query: str, domain: str = "clinical medicine") -> str:
    """Generate a hypothetical document that would answer the query."""
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # fast, cheap model
        max_tokens=300,
        messages=[{"role": "user", "content":
            f"""Write a short, factual paragraph from a {domain} reference document
that would directly answer this question. Use formal medical language.
Write as if excerpting from a clinical guideline or medical reference.
Do NOT say 'This document answers...' — just write the content directly.

Question: {query}"""}]
    )
    return response.content[0].text.strip()

def hyde_retrieve(
    query: str,
    vector_search_fn,  # function(embedding, top_k) -> list[dict]
    top_k: int = 5
) -> list[dict]:
    """Retrieve using a hypothetical document embedding."""
    hypothetical = generate_hypothetical_document(query)
    hyp_embedding = embedder.encode(hypothetical)
    return vector_search_fn(hyp_embedding, top_k=top_k)

# Example:
query = "Is Warfarin safe during pregnancy?"
hyp_doc = generate_hypothetical_document(query)
print(hyp_doc)
# Example output (LLM output varies between runs):
# "Warfarin (Coumadin) is classified as FDA Pregnancy Category X and is
#  contraindicated during pregnancy. First trimester exposure may cause
#  warfarin embryopathy (nasal hypoplasia, stippled epiphyses). Near-term
#  exposure carries risk of neonatal hemorrhage..."
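
The code above leaves `vector_search_fn` undefined. As a minimal sketch for experimenting on a small corpus, the hypothetical helper below builds a brute-force cosine-similarity search with the expected `(embedding, top_k)` signature. The name `make_in_memory_search` and the record layout are assumptions for illustration; a production system would normally use a vector database instead.

```python
import math
from typing import Callable, Sequence

def make_in_memory_search(
    documents: list[dict],
    embeddings: Sequence[Sequence[float]],
) -> Callable[..., list[dict]]:
    """Return search(embedding, top_k) over a small in-memory corpus.

    documents[i] is the record whose embedding is embeddings[i].
    """
    if len(documents) != len(embeddings):
        raise ValueError("documents and embeddings must have the same length")

    def _normalize(vec) -> list[float]:
        values = [float(x) for x in vec]
        norm = math.sqrt(sum(x * x for x in values))
        return [x / norm for x in values] if norm else values

    # Pre-normalize once so each query only needs dot products.
    unit_embeddings = [_normalize(e) for e in embeddings]

    def search(embedding, top_k: int = 5) -> list[dict]:
        query_vec = _normalize(embedding)
        scored = []
        for doc, doc_vec in zip(documents, unit_embeddings):
            score = sum(q * d for q, d in zip(query_vec, doc_vec))
            scored.append({**doc, "score": score})
        scored.sort(key=lambda item: item["score"], reverse=True)
        return scored[:top_k]

    return search

# Possible usage with the functions defined above (not executed here):
# docs = [{"id": "warfarin-pregnancy", "text": "..."}]
# search_fn = make_in_memory_search(docs, embedder.encode([d["text"] for d in docs]))
# results = hyde_retrieve("Is Warfarin safe during pregnancy?", search_fn, top_k=3)
```

Each returned record is a copy of the original document dict with an added `score` field, so the caller can inspect how close each hit was.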

HyDE vs Direct Query Retrieval

Direct query embedding:
  Query: "Is Warfarin safe during pregnancy?"
  embedding of short conversational question
  → may not match clinical reference document embeddings closely

HyDE:
  Hypothetical: "Warfarin is FDA Category X, contraindicated in pregnancy..."
  embedding of clinical-style text
  → typically closer to actual clinical document embeddings

Empirical results (Gao et al., 2022):
  Zero-shot HyDE on an unsupervised encoder (Contriever) outperforms direct query embedding with the same encoder on most BEIR tasks evaluated
  Gains are largest for: fact retrieval, medical/scientific queries
  Gains are smaller for: simpler factual lookups

When HyDE Helps and Hurts

HyDE helps:
  Complex technical questions where query vocabulary differs from document vocabulary
  Medical and scientific queries (formal vs conversational gap is large)
  Queries about specific clinical scenarios or guidelines
  Low-resource languages (generate hypothetical in the target language)

HyDE hurts or adds little:
  Queries where the model doesn't know the answer (hallucinated hypothesis)
  Simple keyword lookups ("What is Warfarin?") — gap is small already
  Domains where the model has limited knowledge (may generate wrong hypothesis)

Clinical risk:
  If the hypothetical answer is factually wrong, it may retrieve irrelevant docs
  The LLM's hallucinations could misdirect retrieval
  Mitigation: use HyDE only when you trust the model's domain knowledge,
  or combine it with standard retrieval and merge the results with Reciprocal Rank Fusion (RRF)
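
The merge step in that mitigation can be sketched with Reciprocal Rank Fusion (RRF), which scores each document by summing 1 / (k + rank) over every result list it appears in. This is a minimal illustration under stated assumptions: the function name, the use of bare document IDs, and the sample lists are hypothetical, and k = 60 is a commonly used default, not a tuned value.

```python
from collections import defaultdict

def reciprocal_rank_fusion(
    ranked_lists: list[list[str]],
    k: int = 60,
) -> list[tuple[str, float]]:
    """Merge ranked lists of document IDs; ranks start at 1, best first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical result lists (document IDs only), best match first.
standard_results = ["doc_a", "doc_b", "doc_c"]
hyde_results = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([standard_results, hyde_results])
print([doc_id for doc_id, _ in fused])
# → ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Documents that both retrievers rank highly rise to the top, so a hallucinated hypothesis that pulls in off-topic documents is outvoted wherever the direct query embedding disagrees.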

Ensemble: HyDE + Standard Retrieval

More robust than either alone: