Contextual Compression

The Problem

Retrieved documents are often long, and only a small portion of each is relevant to the query:

Query: "What is the recommended INR target range for AF patients on Warfarin?"

Retrieved document (2000 words):
  "Warfarin (also known as Coumadin) is an anticoagulant medication widely
   used in clinical practice. It was first developed in the 1950s...
   [long history section]
   ...Pharmacokinetics: Warfarin is metabolised by CYP2C9...
   [long pharmacology section]
   ...Dosing: The therapeutic INR range for atrial fibrillation is 2.0-3.0.
   Higher ranges (2.5-3.5) may be used for mechanical heart valves...
   [rest of document]"

Only the two sentences about INR targets are relevant.
Injecting the full 2000-word document wastes context-window space and adds noise.

Contextual Compression Approaches

Approach 1: LLM Extractor
  Ask an LLM to extract only the relevant portion from each document

Approach 2: Embeddings Filter
  Filter out sentences/passages with low similarity to the query

Approach 3: LLM Filter
  Ask an LLM to judge whether each document is relevant at all
  (before extraction — discard wholly irrelevant documents)

Most effective: LLM Filter → LLM Extractor pipeline
  Discard irrelevant documents first (cheaper: a yes/no judgement needs only a few output tokens)
  Then extract relevant passages from the documents that remain
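
The two-stage pipeline can be sketched independently of any particular model. In the minimal sketch below, `filter_then_extract`, `keyword_filter`, and `keyword_extract` are hypothetical names introduced for illustration; the keyword functions are stand-ins for the LLM calls so the example runs without an API key.

```python
from typing import Callable, Optional


def filter_then_extract(
    query: str,
    documents: list[str],
    is_relevant: Callable[[str, str], bool],
    extract: Callable[[str, str], Optional[str]],
) -> list[str]:
    """Stage 1 drops irrelevant documents; stage 2 extracts passages from the rest."""
    passages = []
    for doc in documents:
        if not is_relevant(query, doc):
            continue  # discarded before the more expensive extraction step
        passage = extract(query, doc)
        if passage:
            passages.append(passage)
    return passages


# Stand-in judges (simple keyword matching), NOT real LLM calls.
def keyword_filter(query: str, doc: str) -> bool:
    return "inr" in doc.lower()


def keyword_extract(query: str, doc: str) -> Optional[str]:
    hits = [line for line in doc.splitlines() if "INR" in line]
    return "\n".join(hits) or None


docs = [
    "Warfarin history.\nThe therapeutic INR range for atrial fibrillation is 2.0-3.0.",
    "Aspirin is an antiplatelet agent.",
]
result = filter_then_extract("INR target for AF?", docs, keyword_filter, keyword_extract)
print(result)
```

In a real system both callables would wrap LLM calls; passing them in as parameters keeps the pipeline logic testable on its own.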

LLM Extractor Implementation

Python

from anthropic import Anthropic

client = Anthropic()

def extract_relevant_passages(
    query: str,
    document: str,
    max_extracted_length: int = 500
) -> str | None:
    """
    Extract only the portions of the document relevant to the query.
    Returns None if no relevant content found.
    """
    prompt = f"""Extract the portions of the document that are directly relevant
to answering the query. Return only the relevant text, verbatim from the document.
If no relevant information exists, respond with exactly: "NOT RELEVANT"

Query: {query}

Document:
<document>
{document}
</document>

Relevant portions:"""

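    response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # use small/fast model for extraction
        max_tokens=max_extracted_length,  # note: this limit is in tokens, not characters
        messages=[{"role": "user", "content": prompt}]
    )
    result = response.content[0].text.strip()
    # The model may echo the quotes from the prompt, so strip them before comparing.
    return None if result.strip('"').upper() == "NOT RELEVANT" else result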

def compress_retrieved_documents(
    query: str,
    documents: list[dict],
    max_docs: int = 5
) -> list[dict]:
    """Extract relevant passages from all retrieved documents."""
    compressed = []
    for doc in documents:
        # Stop once enough documents are kept, so no extractor calls are wasted.
        if len(compressed) >= max_docs:
            break
        extracted = extract_relevant_passages(query, doc["content"])
        if extracted:
            compressed.append({
                "id": doc["id"],
                "content": extracted,
                "original_length": len(doc["content"]),  # characters, not tokens
                "compressed_length": len(extracted)
            })

    return compressed
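
The prompt asks for verbatim text, but a model can still paraphrase or add wording of its own. A minimal sketch of a post-check follows, assuming the extractor returns one passage per line; `split_verbatim` and `_normalize` are hypothetical helpers, not part of any library.

```python
import re


def _normalize(text: str) -> str:
    """Collapse whitespace and lowercase, so line wrapping does not cause false mismatches."""
    return re.sub(r"\s+", " ", text).strip().lower()


def split_verbatim(extracted: str, document: str) -> tuple[list[str], list[str]]:
    """Split extracted text into lines found in the document and lines that are not."""
    haystack = _normalize(document)
    found, missing = [], []
    for line in extracted.splitlines():
        if not line.strip():
            continue
        target = found if _normalize(line) in haystack else missing
        target.append(line.strip())
    return found, missing


document = "Dosing: The therapeutic INR range for atrial\nfibrillation is 2.0-3.0."
extracted = (
    "The therapeutic INR range for atrial fibrillation is 2.0-3.0.\n"
    "The INR target is 2-3."
)
found, missing = split_verbatim(extracted, document)
print(found)    # lines that appear in the source document
print(missing)  # lines the model paraphrased or invented
```

Lines that land in `missing` can be dropped or logged for review, which matters most in settings such as clinical text where a paraphrased number is risky.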

Embedding Filter (No LLM Call)

A cheaper approach is to filter sentences by their embedding similarity to the query:

Python

import re

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def embedding_filter_compression(
    query: str,
    document: str,
    similarity_threshold: float = 0.5
) -> str:
    """Keep only sentences with high similarity to the query."""
    # Split after sentence-ending punctuation followed by whitespace, so that
    # decimals such as "2.0-3.0" are not cut in half (a plain split(".") would).
    # Abbreviations like "e.g." can still cause extra splits.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    if not sentences:
        return ""

    query_emb = model.encode([query])[0]
    sent_embs = model.encode(sentences)

    # Cosine similarity between each sentence and the query
    similarities = sent_embs @ query_emb / (
        np.linalg.norm(sent_embs, axis=1) * np.linalg.norm(query_emb) + 1e-9
    )

    relevant = [s for s, sim in zip(sentences, similarities) if sim >= similarity_threshold]
    return " ".join(relevant)
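
A fixed threshold can remove every sentence when the query and document use different wording. One possible safeguard, sketched below under the assumption that similarity scores are already computed, is to fall back to the top-scoring sentences; `select_sentence_indices` and `min_keep` are hypothetical names for illustration.

```python
def select_sentence_indices(
    similarities: list[float],
    threshold: float = 0.5,
    min_keep: int = 1,
) -> list[int]:
    """Return indices of sentences to keep, in document order.

    Sentences at or above the threshold are kept. If fewer than `min_keep`
    pass, the `min_keep` highest-scoring sentences are kept instead.
    """
    passing = [i for i, sim in enumerate(similarities) if sim >= threshold]
    if len(passing) >= min_keep:
        return passing
    ranked = sorted(range(len(similarities)), key=lambda i: similarities[i], reverse=True)
    return sorted(ranked[:min_keep])  # restore document order


above = select_sentence_indices([0.1, 0.7, 0.6])      # two sentences pass the threshold
fallback = select_sentence_indices([0.1, 0.42, 0.3])  # none pass, best one is kept
print(above, fallback)
```

The returned indices can be used to pick sentences from the list before joining them, so the compressed text preserves the original sentence order.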

LangChain Contextual Compression

Python

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# Build base retriever (`documents` is assumed to be a list[str] of raw texts defined elsewhere)
vectorstore = FAISS.from_texts(documents, OpenAIEmbeddings())
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

# Wrap with contextual compression
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # fast/cheap model
compressor = LLMChainExtractor.from_llm(llm)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=base_retriever
)

# invoke() replaces the deprecated get_relevant_documents()
results = compression_retriever.invoke(
    "What is the INR target range for AF patients on Warfarin?"
)
# Each result contains only the extracted relevant portions

Cost and Latency

Without compression:
  Retrieve 5 docs × 2000 tokens each = 10,000 tokens injected into LLM
  LLM processes 10K tokens of context

With compression:
  Retrieve 5 docs × 2000 tokens → extract 100-300 tokens per doc
  LLM processes 500-1500 tokens of context
  
Savings:
  Roughly 7-20× less context (10,000 / 1,500 ≈ 6.7; 10,000 / 500 = 20) → lower cost and latency for the final LLM call
  Higher quality: less noise in the context tends to improve answer accuracy
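
The reduction factor follows directly from the token counts above. A small sketch of the arithmetic, with `context_reduction` as a hypothetical helper name:

```python
def context_reduction(
    docs: int, tokens_per_doc: int, extracted_tokens_per_doc: int
) -> tuple[float, float]:
    """Return (reduction factor, percent of context saved)."""
    before = docs * tokens_per_doc
    after = docs * extracted_tokens_per_doc
    return before / after, 100 * (before - after) / before


best_factor, best_pct = context_reduction(5, 2000, 100)    # 10,000 -> 500 tokens
worst_factor, worst_pct = context_reduction(5, 2000, 300)  # 10,000 -> 1,500 tokens
print(f"best:  {best_factor:.1f}x smaller, {best_pct:.0f}% saved")
print(f"worst: {worst_factor:.1f}x smaller, {worst_pct:.0f}% saved")
```

With 100 to 300 extracted tokens per document, the final context shrinks by a factor of about 6.7 to 20, which is an 85% to 95% saving.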

Additional cost:
  5 extractor LLM calls (cheap model) per request
  Net: usually positive ROI, especially for long documents or expensive LLMs
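
Because each extractor call depends only on its own document, the calls can run concurrently, so the added latency can approach that of a single call rather than the sum of all of them. The sketch below uses a thread pool from the standard library; `extract_concurrently` and `slow_extract` are hypothetical, and the sleep simulates API latency. Whether a given API client can be shared safely across threads, and what rate limits apply, are assumptions to verify for the client in use.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional


def extract_concurrently(
    query: str,
    documents: list[str],
    extract: Callable[[str, str], Optional[str]],
    max_workers: int = 5,
) -> list[Optional[str]]:
    """Run one extractor call per document in parallel; results keep document order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda doc: extract(query, doc), documents))


# Hypothetical stand-in for a network-bound extractor call.
def slow_extract(query: str, doc: str) -> Optional[str]:
    time.sleep(0.1)  # simulated API latency
    return doc.upper() if "inr" in doc.lower() else None


docs = ["inr range 2.0-3.0", "history of warfarin", "inr monitoring"]
results = extract_concurrently("INR target?", docs, slow_extract)
print(results)
```

Concurrency reduces wall-clock time only; the total token cost of the extractor calls is unchanged.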

For clinical notes:
  Notes can be 1000-5000 tokens
  Query is about one specific clinical question
  Compression reduces context by 80-95% typically

Interview Answer

"Contextual compression extracts only the query-relevant portions of retrieved documents before passing them to the LLM. The standard approach: retrieve 10-20 candidates, then for each document call a small fast LLM (Claude Haiku, GPT-4o mini) asking it to extract only the relevant sentences verbatim. If no relevant content exists, discard the document. The compressed context is 80-95% smaller — reducing final LLM cost and latency while improving answer quality by removing noise. The extraction step adds some latency and cost, but the net effect is usually positive. An alternative is embedding-based sentence filtering — cheaper but less accurate than LLM extraction."
