Contextual Compression
How contextual compression extracts only the relevant portions of retrieved documents before passing them to the LLM — reducing noise and saving context window space.
The Problem
Retrieved documents are often long with only a small portion relevant to the query:
Query: "What is the recommended INR target range for AF patients on Warfarin?"
Retrieved document (2000 words):
"Warfarin (also known as Coumadin) is an anticoagulant medication widely
used in clinical practice. It was first developed in the 1950s...
[long history section]
...Pharmacokinetics: Warfarin is metabolised by CYP2C9...
[long pharmacology section]
...Dosing: The therapeutic INR range for atrial fibrillation is 2.0-3.0.
Higher ranges (2.5-3.5) may be used for mechanical heart valves...
[rest of document]"
Only the 2 sentences about INR targets are relevant.
Injecting the full 2000-word document wastes context window and adds noise.
Contextual Compression Approaches
Approach 1: LLM Extractor
  Ask an LLM to extract only the relevant portion from each document
Approach 2: Embeddings Filter
  Filter out sentences/passages with low similarity to the query
Approach 3: LLM Filter
  Ask an LLM to judge whether each document is relevant at all
  (run before extraction to discard wholly irrelevant documents)
Usually most effective: LLM Filter → LLM Extractor pipeline
  Discard irrelevant documents first (cheaper)
  Then extract relevant passages from the documents that remain
LLM Extractor Implementation
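The filter-then-extract pipeline described above can be sketched with the two LLM steps passed in as plain callables. This is a minimal sketch, not a definitive implementation: the names `filter_then_extract`, `keyword_filter`, and `keyword_extract` are hypothetical, and the keyword functions are toy stand-ins for real LLM calls so that the example runs offline.

```python
from typing import Callable, Optional


def filter_then_extract(
    query: str,
    documents: list[dict],
    is_relevant: Callable[[str, str], bool],
    extract: Callable[[str, str], Optional[str]],
) -> list[dict]:
    """Drop irrelevant documents first, then extract passages from the rest."""
    compressed = []
    for doc in documents:
        # Stage 1: yes/no relevance check on the whole document.
        if not is_relevant(query, doc["content"]):
            continue
        # Stage 2: extraction only runs on documents that passed the filter.
        passage = extract(query, doc["content"])
        if passage:
            compressed.append({"id": doc["id"], "content": passage})
    return compressed


# Toy stand-ins for the two LLM calls (assumptions for illustration only).
def keyword_filter(query: str, text: str) -> bool:
    return "INR" in text


def keyword_extract(query: str, text: str) -> Optional[str]:
    hits = [line for line in text.splitlines() if "INR" in line]
    return "\n".join(hits) or None


docs = [
    {
        "id": "a",
        "content": "History of Warfarin.\n"
                   "The therapeutic INR range for atrial fibrillation is 2.0-3.0.",
    },
    {"id": "b", "content": "Aspirin dosing guidance."},
]
result = filter_then_extract(
    "INR target for AF?", docs, keyword_filter, keyword_extract
)
print(result)
```

In a real system, `is_relevant` would wrap a short yes/no LLM prompt and `extract` would wrap an extractor such as the one implemented in this section. Passing them in as arguments keeps the pipeline logic testable without network calls.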
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def extract_relevant_passages(
    query: str,
    document: str,
    max_extracted_length: int = 500
) -> str | None:
    """
    Extract only the portions of the document relevant to the query.
    Returns None if no relevant content found.
    max_extracted_length caps the extractor's output in tokens.
    """
    prompt = f"""Extract the portions of the document that are directly relevant
to answering the query. Return only the relevant text, verbatim from the document.
If no relevant information exists, respond with exactly: "NOT RELEVANT"

Query: {query}

Document:
<document>
{document}
</document>

Relevant portions:"""
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # use small/fast model for extraction
        max_tokens=max_extracted_length,
        messages=[{"role": "user", "content": prompt}]
    )
    result = response.content[0].text.strip()
    # Tolerate quotes or trailing punctuation around the sentinel.
    if result.strip('"\'. ').upper() == "NOT RELEVANT":
        return None
    return result

def compress_retrieved_documents(
    query: str,
    documents: list[dict],
    max_docs: int = 5
) -> list[dict]:
    """Extract relevant passages from retrieved documents, keeping at most max_docs."""
    compressed = []
    for doc in documents:
        # Stop early so no extractor calls are spent on results that would be dropped.
        if len(compressed) >= max_docs:
            break
        extracted = extract_relevant_passages(query, doc["content"])
        if extracted:
            compressed.append({
                "id": doc["id"],
                "content": extracted,
                "original_length": len(doc["content"]),
                "compressed_length": len(extracted)
            })
    return compressed
Embedding Filter (No LLM Call)
Cheaper approach: filter sentences by embedding similarity to the query:
import re

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def embedding_filter_compression(
    query: str,
    document: str,
    similarity_threshold: float = 0.5
) -> str:
    """Keep only sentences with high similarity to the query."""
    # Split after sentence-ending punctuation followed by whitespace, so
    # decimals such as "2.0-3.0" stay inside their sentence.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    if not sentences:
        return ""
    query_emb = model.encode([query])[0]
    sent_embs = model.encode(sentences)
    # Cosine similarity between each sentence and the query
    similarities = sent_embs @ query_emb / (
        np.linalg.norm(sent_embs, axis=1) * np.linalg.norm(query_emb) + 1e-9
    )
    relevant = [s for s, sim in zip(sentences, similarities) if sim >= similarity_threshold]
    return " ".join(relevant)
LangChain Contextual Compression
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# Build base retriever
# `documents` is assumed to be a list[str] of raw texts loaded elsewhere
vectorstore = FAISS.from_texts(documents, OpenAIEmbeddings())
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

# Wrap with contextual compression
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # fast/cheap model
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=base_retriever
)

# invoke() replaces the deprecated get_relevant_documents()
results = compression_retriever.invoke(
    "What is the INR target range for AF patients on Warfarin?"
)
# Each result contains only the extracted relevant portions
Cost and Latency
Without compression:
  Retrieve 5 docs × 2000 tokens each = 10,000 tokens injected into LLM
  LLM processes 10K tokens of context
With compression:
  Retrieve 5 docs × 2000 tokens → extract 100-300 tokens per doc
  LLM processes 500-1500 tokens of context
Savings:
  About 7-20× less context (10,000 / 1,500 ≈ 6.7; 10,000 / 500 = 20) → lower cost and latency for the final LLM call
  Higher quality: less noise tends to improve answer accuracy
Additional cost:
  5 extractor LLM calls (cheap model) per request, each reading one full document
Net: usually positive ROI, especially for long documents or expensive LLMs
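The arithmetic above can be checked with a small calculator. This is a rough sketch under stated assumptions: the function name `compression_savings` is hypothetical, the per-million-token prices are placeholder values rather than real vendor pricing, and only input tokens are counted (the extractor's output tokens and the final answer's output tokens are ignored).

```python
def compression_savings(
    num_docs: int,
    tokens_per_doc: int,
    extracted_tokens_per_doc: int,
    main_price_per_mtok: float,
    extractor_price_per_mtok: float,
) -> dict:
    """Compare input-token cost with and without compression."""
    raw_tokens = num_docs * tokens_per_doc
    compressed_tokens = num_docs * extracted_tokens_per_doc

    # Without compression, the main model reads every full document.
    cost_without = raw_tokens * main_price_per_mtok / 1e6

    # With compression, the cheap extractor still reads every full document,
    # and the main model reads only the extracted passages.
    cost_with = (
        raw_tokens * extractor_price_per_mtok / 1e6
        + compressed_tokens * main_price_per_mtok / 1e6
    )
    return {
        "raw_tokens": raw_tokens,
        "compressed_tokens": compressed_tokens,
        "context_ratio": raw_tokens / compressed_tokens,
        "cost_without": cost_without,
        "cost_with": cost_with,
    }


# Placeholder prices: main model 10.0 and extractor 1.0 per million input tokens.
report = compression_savings(
    num_docs=5,
    tokens_per_doc=2000,
    extracted_tokens_per_doc=300,
    main_price_per_mtok=10.0,
    extractor_price_per_mtok=1.0,
)
print(report)
```

The closer the extractor's price is to the main model's price, the smaller the saving, because the extractor always reads the full documents. Compression pays off most when the final model is much more expensive than the extractor.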
For clinical notes:
  Notes can be 1000-5000 tokens
  Query is about one specific clinical question
  Compression reduces context by 80-95% typically
Interview Answer
"Contextual compression extracts only the query-relevant portions of retrieved documents before passing them to the LLM. The standard approach: retrieve 10-20 candidates, then for each document call a small fast LLM (Claude Haiku, GPT-4o mini) asking it to extract only the relevant sentences verbatim. If no relevant content exists, discard the document. The compressed context is 80-95% smaller — reducing final LLM cost and latency while improving answer quality by removing noise. The extraction step adds some latency and cost, but the net effect is usually positive. An alternative is embedding-based sentence filtering — cheaper but less accurate than LLM extraction."