Advanced RAG · Lesson 3 of 14
MMR: Maximal Marginal Relevance
The Redundancy Problem
Standard similarity search retrieves the k most relevant documents — but they can be near-duplicates:
Query: "Warfarin dose adjustment guidelines"
Top 5 by cosine similarity:
1. "Warfarin dosing: standard protocols" (score: 0.91)
2. "Warfarin dosing guidelines for adults" (score: 0.90) ← near-duplicate of 1
3. "Warfarin dose adjustment protocol" (score: 0.89) ← near-duplicate of 1
4. "Warfarin and CYP2C9 dosing" (score: 0.87) ← different angle!
5. "Warfarin monitoring INR targets" (score: 0.85) ← different angle!
Documents 1-3 say essentially the same thing.
The LLM sees redundant context, which wastes the context window.
Documents 4-5 are more informative, but with a smaller k (say, k = 3) they would not make the cut.
MMR solves this by penalising documents that are similar to already-selected ones.
The MMR Algorithm
MMR selects documents one at a time.
At each step, select the document that maximises:
MMR(dᵢ) = λ · Sim(dᵢ, query) - (1-λ) · max_{dⱼ ∈ S} Sim(dᵢ, dⱼ)
where:
S = set of already-selected documents
λ = value between 0 and 1 that sets the trade-off between relevance and diversity
λ = 1 → standard relevance ranking (no diversity)
λ = 0 → maximum diversity (ignore query relevance)
λ = 0.5 → balanced (typical default)
Implementation
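One selection step of the scoring rule above can be checked with plain arithmetic before reading a full function. The sketch below assumes one document is already selected and two candidates remain; the document ids and every similarity value are made up for illustration and do not come from a real embedding model.

```python
# Hypothetical numbers for a single MMR selection step (illustrative only).
lambda_mult = 0.5

# Cosine similarity of each remaining candidate to the query.
query_sim = {"doc_2": 0.90, "doc_4": 0.87}

# Highest cosine similarity of each candidate to any already-selected document.
max_sim_to_selected = {"doc_2": 0.95, "doc_4": 0.40}

# MMR score: relevance term minus redundancy term.
scores = {
    doc_id: lambda_mult * query_sim[doc_id]
    - (1 - lambda_mult) * max_sim_to_selected[doc_id]
    for doc_id in query_sim
}

# The candidate with the highest MMR score is selected next.
best = max(scores, key=scores.get)
print(best, scores)
```

Under these assumed values, doc_2 is slightly more relevant but almost identical to what is already selected, so doc_4 scores higher (about 0.235 versus about -0.025) and is picked next.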
import numpy as np
from typing import NamedTuple


class Document(NamedTuple):
    id: str
    content: str
    embedding: np.ndarray


def mmr(
    query_embedding: np.ndarray,
    documents: list[Document],
    k: int = 5,
    lambda_mult: float = 0.5,
) -> list[Document]:
    """
    Select k documents from candidates using Maximal Marginal Relevance.
    """
    if not documents:
        return []

    # Precompute cosine similarities to query (1e-9 guards against zero vectors)
    doc_embeddings = np.stack([d.embedding for d in documents])
    query_norm = query_embedding / (np.linalg.norm(query_embedding) + 1e-9)
    doc_norms = doc_embeddings / (np.linalg.norm(doc_embeddings, axis=1, keepdims=True) + 1e-9)
    query_sims = doc_norms @ query_norm  # (n_docs,)

    selected_indices: list[int] = []
    remaining_indices = list(range(len(documents)))

    for _ in range(min(k, len(documents))):
        if not selected_indices:
            # First selection: pure relevance
            best_idx = int(np.argmax(query_sims))
        else:
            # Subsequent selections: MMR score
            selected_embeddings = doc_norms[selected_indices]  # (n_selected, dim)
            best_score = float("-inf")
            best_idx = -1
            for idx in remaining_indices:
                relevance = query_sims[idx]
                # Redundancy: max similarity to any already-selected document
                redundancy = float(np.max(doc_norms[idx] @ selected_embeddings.T))
                score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
                if score > best_score:
                    best_score = score
                    best_idx = idx
        selected_indices.append(best_idx)
        remaining_indices.remove(best_idx)

    return [documents[i] for i in selected_indices]
Lambda Trade-off Examples
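The effect of λ can be reproduced on toy numbers. The sketch below is a compact, vectorised variant of greedy MMR that returns indices; the name `mmr_indices`, the 3-dimensional vectors, and the labels are all assumptions chosen for illustration, not output from a real embedding model.

```python
import numpy as np


def mmr_indices(query: np.ndarray, docs: np.ndarray, k: int, lambda_mult: float) -> list[int]:
    """Return indices of up to k rows of `docs`, chosen by greedy MMR."""
    if k <= 0 or len(docs) == 0:
        return []
    q = query / np.linalg.norm(query)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    query_sims = d @ q  # relevance of every document to the query
    doc_sims = d @ d.T  # pairwise cosine similarities between documents
    selected = [int(np.argmax(query_sims))]  # first pick: pure relevance
    while len(selected) < min(k, len(docs)):
        # Redundancy: highest similarity to anything already selected
        redundancy = doc_sims[:, selected].max(axis=1)
        scores = lambda_mult * query_sims - (1 - lambda_mult) * redundancy
        scores[selected] = -np.inf  # never re-pick a selected document
        selected.append(int(np.argmax(scores)))
    return selected


# Hypothetical toy embeddings: rows 0-2 point in almost the same direction
# (near-duplicates); rows 3 and 4 share the query direction but differ elsewhere.
labels = ["dosing A", "dosing B", "dosing C", "CYP2C9", "INR monitoring"]
query = np.array([1.0, 0.0, 0.0])
docs = np.array([
    [1.0, 0.20, 0.0],
    [1.0, 0.25, 0.0],
    [1.0, 0.30, 0.0],
    [1.0, -0.70, 0.0],
    [1.0, 0.00, 0.8],
])

pure_relevance = mmr_indices(query, docs, k=3, lambda_mult=1.0)
balanced = mmr_indices(query, docs, k=3, lambda_mult=0.5)
print([labels[i] for i in pure_relevance])
print([labels[i] for i in balanced])
```

With these toy vectors, λ = 1.0 selects the three near-duplicates (indices 0, 1, 2), while λ = 0.5 keeps the top hit and then picks the two documents that point in other directions (indices 0, 3, 4).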
Query: "Warfarin dose adjustment guidelines"
λ = 1.0 (pure relevance — no MMR):
Returns: [dosing protocol, dosing protocol v2, dosing guidelines, dosing protocol v3, CYP2C9 dosing]
→ First 4 are near-duplicates
λ = 0.5 (balanced):
Returns: [dosing protocol, CYP2C9 dosing, INR monitoring, drug interactions, pregnancy dosing]
→ Each document adds new information
λ = 0.0 (pure diversity):
Returns: [dosing protocol, paediatric warfarin, anticoagulant history, INR testing, stroke risk]
→ Maximum diversity, may miss highly relevant documents
When to Use MMR
Use MMR:
Long-form Q&A where context diversity improves answer quality
Report generation requiring coverage across multiple aspects
Document collections with many near-duplicates
User-facing search where repeated similar results look broken
Don't use MMR:
Simple factual lookup (want the most relevant single document)
Legal or medical citations where the most authoritative source matters
When retrieval is already diverse by design (e.g., semantic chunking)
Clinical example:
Query: "AF treatment options for elderly patients with CKD"
MMR retrieves: anticoagulation options, rate control, rhythm control, renal dosing
Standard retrieval: mostly anticoagulation documents, misses renal context
LangChain Integration
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

# `texts` is assumed to be a list[str] of document chunks prepared elsewhere
vectorstore = FAISS.from_texts(texts, OpenAIEmbeddings())

# MMR retrieval
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={
        "k": 5,  # number to return
        "fetch_k": 20,  # initial candidate pool size
        "lambda_mult": 0.5,  # relevance/diversity trade-off
    },
)
# On older LangChain releases, call retriever.get_relevant_documents(...) instead
results = retriever.invoke("Warfarin dose adjustment")
Interview Answer
"MMR (Maximal Marginal Relevance) balances relevance and diversity in retrieved results. At each step, it selects the document maximising λ·Sim(d, query) - (1-λ)·max_j Sim(d, dⱼ), where dⱼ are already-selected documents. λ=1 is pure relevance; λ=0.5 balances coverage. It prevents the context window from being filled with near-duplicate documents — common in corpora with many similar passages. Use MMR when you need comprehensive coverage of a topic (e.g., 'AF treatment in elderly with CKD' should retrieve anticoagulation, rate control, and renal dosing documents, not 5 anticoagulation variants). Trade-off: slightly lower top-1 precision in exchange for better context diversity."