Maximal Marginal Relevance (MMR)
How MMR balances relevance and diversity in RAG retrieval, the algorithm, when to use it, and implementation with embeddings.
The Redundancy Problem
Standard similarity search retrieves the k most relevant documents — but they can be near-duplicates:
Query: "Warfarin dose adjustment guidelines"
Top 5 by cosine similarity:
1. "Warfarin dosing: standard protocols" (score: 0.91)
2. "Warfarin dosing guidelines for adults" (score: 0.90) ← near-duplicate of 1
3. "Warfarin dose adjustment protocol" (score: 0.89) ← near-duplicate of 1
4. "Warfarin and CYP2C9 dosing" (score: 0.87) ← different angle!
5. "Warfarin monitoring INR targets" (score: 0.85) ← different angle!
Documents 1-3 say essentially the same thing.
The LLM sees redundant context, which wastes the context window.
With a smaller k, the more informative documents 4-5 might not make the final top-k at all.
MMR solves this by penalising candidates that are similar to already-selected documents.
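The redundancy in a result set like the one above can be measured directly from the embeddings. The following is a minimal sketch, assuming the retrieved documents' embeddings are available as rows of a NumPy array. The helper names (`pairwise_cosine`, `near_duplicate_pairs`), the 0.95 threshold, and the toy 3-d vectors are hypothetical illustration choices, not part of any library.

```python
import numpy as np


def pairwise_cosine(embeddings: np.ndarray) -> np.ndarray:
    """Return the (n, n) cosine-similarity matrix for row-vector embeddings."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.maximum(norms, 1e-9)
    return unit @ unit.T


def near_duplicate_pairs(
    embeddings: np.ndarray, threshold: float = 0.95
) -> list[tuple[int, int]]:
    """List index pairs (i, j), i < j, with cosine similarity >= threshold."""
    sims = pairwise_cosine(embeddings)
    n = sims.shape[0]
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if sims[i, j] >= threshold
    ]


# Toy 3-d vectors standing in for real embeddings (hypothetical values).
toy = np.array([
    [1.00, 0.00, 0.00],  # doc 0
    [0.99, 0.05, 0.00],  # doc 1: nearly the same direction as doc 0
    [0.98, 0.00, 0.10],  # doc 2: also close to doc 0
    [0.60, 0.80, 0.00],  # doc 3: different angle
    [0.60, 0.00, 0.80],  # doc 4: different angle
])
pairs = near_duplicate_pairs(toy, threshold=0.95)
print(pairs)
```

In this toy set only the first three vectors are flagged as near-duplicates of one another, which mirrors documents 1-3 in the warfarin example.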
The MMR Algorithm
MMR selects documents one at a time.
At each step, select the document that maximises:
MMR(dᵢ) = λ · Sim(dᵢ, query) - (1-λ) · max_{dⱼ ∈ S} Sim(dᵢ, dⱼ)
where:
S = set of already-selected documents
λ = 0 to 1 trade-off between relevance and diversity
λ = 1 → standard relevance ranking (no diversity)
λ = 0 → maximum diversity (ignore query relevance)
λ = 0.5 → balanced (typical default)
Implementation
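As a warm-up before the full function, the scoring rule can be checked by hand for a single selection step. This is a small sketch with made-up numbers: the relevance values 0.90 and 0.87 are borrowed from the warfarin example, while the similarities to the already-selected document (0.95 and 0.40) and the helper name `mmr_score` are assumptions for illustration.

```python
def mmr_score(relevance: float, max_sim_to_selected: float, lam: float) -> float:
    """MMR(d) = lam * Sim(d, query) - (1 - lam) * max Sim(d, selected)."""
    return lam * relevance - (1.0 - lam) * max_sim_to_selected


# Document 1 is assumed to be selected already; two candidates remain.
# Similarities to document 1 (0.95 and 0.40) are hypothetical values.
near_duplicate = mmr_score(relevance=0.90, max_sim_to_selected=0.95, lam=0.5)
different_angle = mmr_score(relevance=0.87, max_sim_to_selected=0.40, lam=0.5)

print(f"near-duplicate:  {near_duplicate:.3f}")
print(f"different angle: {different_angle:.3f}")
```

With λ = 0.5 the near-duplicate's high relevance is cancelled by its similarity to the selected document, so the slightly less relevant but novel candidate scores higher. With λ = 1 the penalty term vanishes and the near-duplicate would win on relevance alone.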
import numpy as np
from typing import NamedTuple


class Document(NamedTuple):
    id: str
    content: str
    embedding: np.ndarray


def mmr(
    query_embedding: np.ndarray,
    documents: list[Document],
    k: int = 5,
    lambda_mult: float = 0.5,
) -> list[Document]:
    """
    Select k documents from candidates using Maximal Marginal Relevance.
    """
    if not documents:
        return []

    # Precompute cosine similarities to query
    doc_embeddings = np.stack([d.embedding for d in documents])
    query_norm = query_embedding / (np.linalg.norm(query_embedding) + 1e-9)
    doc_norms = doc_embeddings / (np.linalg.norm(doc_embeddings, axis=1, keepdims=True) + 1e-9)
    query_sims = doc_norms @ query_norm  # (n_docs,)

    selected_indices: list[int] = []
    remaining_indices = list(range(len(documents)))

    for _ in range(min(k, len(documents))):
        if not selected_indices:
            # First selection: pure relevance
            best_idx = int(np.argmax(query_sims))
        else:
            # Subsequent selections: MMR score
            selected_embeddings = doc_norms[selected_indices]  # (n_selected, dim)
            best_score = float("-inf")
            best_idx = -1
            for idx in remaining_indices:
                relevance = query_sims[idx]
                # Max similarity to any already-selected document
                diversity = float(np.max(doc_norms[idx] @ selected_embeddings.T))
                score = lambda_mult * relevance - (1 - lambda_mult) * diversity
                if score > best_score:
                    best_score = score
                    best_idx = idx
        selected_indices.append(best_idx)
        remaining_indices.remove(best_idx)

    return [documents[i] for i in selected_indices]
Lambda Trade-off Examples
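The effect of λ can be reproduced on toy data. The sketch below is a compact, vectorised variant of the same greedy selection that works on raw arrays and returns indices. The function name `mmr_indices`, the query, and the 3-d document vectors are hypothetical stand-ins for real embeddings: rows 0-2 point in almost the same direction, and rows 3-4 cover different angles.

```python
import numpy as np


def mmr_indices(query: np.ndarray, docs: np.ndarray, k: int, lam: float) -> list[int]:
    """Return indices of up to k rows of `docs`, chosen greedily by MMR."""
    if k <= 0 or len(docs) == 0:
        return []
    q = query / max(float(np.linalg.norm(query)), 1e-9)
    d = docs / np.maximum(np.linalg.norm(docs, axis=1, keepdims=True), 1e-9)
    relevance = d @ q   # cosine similarity of each document to the query
    doc_sims = d @ d.T  # pairwise cosine similarity between documents

    selected = [int(np.argmax(relevance))]  # first pick: pure relevance
    while len(selected) < min(k, len(docs)):
        # Highest similarity of each candidate to anything already selected
        redundancy = doc_sims[:, selected].max(axis=1)
        scores = lam * relevance - (1.0 - lam) * redundancy
        scores[selected] = -np.inf  # never pick the same document twice
        selected.append(int(np.argmax(scores)))
    return selected


# Hypothetical toy vectors, not real embeddings.
query = np.array([1.0, 1.0, 1.0])
docs = np.array([
    [1.0, 0.7, 0.7],  # 0: most relevant
    [1.0, 0.6, 0.7],  # 1: near-duplicate of 0
    [1.0, 0.7, 0.5],  # 2: near-duplicate of 0
    [0.2, 1.0, 0.3],  # 3: different angle
    [0.1, 0.3, 1.0],  # 4: another different angle
])

pure_relevance = mmr_indices(query, docs, k=3, lam=1.0)
balanced = mmr_indices(query, docs, k=3, lam=0.5)
print("lambda=1.0:", pure_relevance)
print("lambda=0.5:", balanced)
```

On these vectors, λ = 1.0 returns the three near-duplicates (rows 0, 1, 2), while λ = 0.5 keeps the top hit and then fills the remaining slots with the two off-angle rows. The illustrative retrieval results below show the same pattern on the warfarin query.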
Query: "Warfarin dose adjustment guidelines"
λ = 1.0 (pure relevance — no MMR):
Returns: [dosing protocol, dosing protocol v2, dosing guidelines, dosing protocol v3, CYP2C9 dosing]
→ First 4 are near-duplicates
λ = 0.5 (balanced):
Returns: [dosing protocol, CYP2C9 dosing, INR monitoring, drug interactions, pregnancy dosing]
→ Each document adds new information
λ = 0.0 (pure diversity):
Returns: [dosing protocol, paediatric warfarin, anticoagulant history, INR testing, stroke risk]
→ Maximum diversity, may miss highly relevant documents
When to Use MMR
Use MMR:
Long-form Q&A where context diversity improves answer quality
Report generation requiring coverage across multiple aspects
Document collections with many near-duplicates
User-facing search where repeated similar results look broken
Don't use MMR:
Simple factual lookup (want the most relevant single document)
Legal or medical citations where the most authoritative source matters
When retrieval is already diverse by design (semantic chunking)
Clinical example:
Query: "AF treatment options for elderly patients with CKD"
MMR retrieves: anticoagulation options, rate control, rhythm control, renal dosing
Standard retrieval: mostly anticoagulation documents, misses renal context
LangChain Integration
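from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

# `texts` is the list of passage strings to index; it is assumed to be defined elsewhere.
vectorstore = FAISS.from_texts(texts, OpenAIEmbeddings())

# MMR retrieval
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={
        "k": 5,              # number to return
        "fetch_k": 20,       # initial candidate pool size
        "lambda_mult": 0.5,  # 1.0 = pure relevance, 0.0 = maximum diversity
    },
)
# invoke() replaces the deprecated get_relevant_documents() in recent LangChain releases
results = retriever.invoke("Warfarin dose adjustment")
Interview Answer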
"MMR (Maximal Marginal Relevance) balances relevance and diversity in retrieved results. At each step, it selects the document maximising λ·Sim(d, query) - (1-λ)·max_j Sim(d, dⱼ), where dⱼ are the already-selected documents. λ=1 is pure relevance, λ=0 is pure diversity, and λ=0.5 balances the two. It prevents the context window from being filled with near-duplicate documents, which is common in corpora with many similar passages. Use MMR when you need comprehensive coverage of a topic (e.g., 'AF treatment in elderly with CKD' should retrieve anticoagulation, rate control, and renal dosing documents, not 5 anticoagulation variants). Trade-off: lower average relevance across the top-k (the first pick is still the most relevant document) in exchange for better context diversity."