Maximal Marginal Relevance (MMR)

The Redundancy Problem

Standard similarity search returns the k documents most similar to the query, but nothing stops those k from being near-duplicates of one another:

Query: "Warfarin dose adjustment guidelines"

Top 5 by cosine similarity:
  1. "Warfarin dosing: standard protocols" (score: 0.91)
  2. "Warfarin dosing guidelines for adults" (score: 0.90)  ← near-duplicate of 1
  3. "Warfarin dose adjustment protocol" (score: 0.89)      ← near-duplicate of 1
  4. "Warfarin and CYP2C9 dosing" (score: 0.87)            ← different angle!
  5. "Warfarin monitoring INR targets" (score: 0.85)        ← different angle!

Documents 1-3 say essentially the same thing, so the LLM receives redundant context and part of the context window is wasted.
Documents 4-5 add more new information, yet they rank lowest and might not make the final top-k.

MMR solves this by penalising documents similar to already-selected ones.

The MMR Algorithm

MMR selects documents one at a time.
At each step, select the document that maximises:

MMR(dᵢ) = λ · Sim(dᵢ, query) - (1-λ) · max_{dⱼ ∈ S} Sim(dᵢ, dⱼ)

where:
  S      = set of already-selected documents (empty at the first step, so the first pick is simply the most relevant document)
  λ      = trade-off weight in [0, 1] between relevance and diversity
  λ = 1  → standard relevance ranking (no diversity)
  λ = 0  → maximum diversity (ignore query relevance)
  λ = 0.5 → balanced (typical default)
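
To make the formula concrete, here is a minimal sketch of a single MMR step with λ = 0.5, after document 1 from the ranking above has been selected. The query similarities are taken from that ranking; the document-to-document similarities (0.95 and 0.60) are assumed values chosen for illustration, and `mmr_score` is a hypothetical helper name.

```python
# One MMR step with lambda = 0.5, after doc 1 has already been selected.
lam = 0.5

# Similarity of each candidate to the query (from the example ranking).
query_sims = {"doc2": 0.90, "doc4": 0.87}

# Similarity of each candidate to doc 1. These values are ASSUMED for illustration.
sims_to_selected = {
    "doc2": 0.95,  # near-duplicate of doc 1
    "doc4": 0.60,  # covers a different angle
}

def mmr_score(relevance: float, max_sim_to_selected: float, lam: float) -> float:
    """lam * relevance minus (1 - lam) * redundancy, as in the formula above."""
    return lam * relevance - (1 - lam) * max_sim_to_selected

scores = {
    name: mmr_score(query_sims[name], sims_to_selected[name], lam)
    for name in query_sims
}
best = max(scores, key=scores.get)

print(scores)  # doc2 is about -0.025, doc4 is about 0.135
print(best)    # doc4
```

Under these assumed similarities, document 4 overtakes document 2 even though it is slightly less relevant to the query, because it overlaps far less with what has already been selected.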

Implementation

Python

import numpy as np
from typing import NamedTuple

class Document(NamedTuple):
    id: str
    content: str
    embedding: np.ndarray

def mmr(
    query_embedding: np.ndarray,
    documents: list[Document],
    k: int = 5,
    lambda_mult: float = 0.5,
) -> list[Document]:
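    """
    Select k documents from candidates using Maximal Marginal Relevance.

    lambda_mult weights relevance against diversity:
    1.0 ranks by query similarity alone, 0.0 by dissimilarity to earlier picks.
    Returns the selected documents in the order they were picked.
    """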
    if not documents:
        return []

    # Precompute cosine similarities to query
    doc_embeddings = np.stack([d.embedding for d in documents])
    query_norm = query_embedding / (np.linalg.norm(query_embedding) + 1e-9)
    doc_norms = doc_embeddings / (np.linalg.norm(doc_embeddings, axis=1, keepdims=True) + 1e-9)
    query_sims = doc_norms @ query_norm  # (n_docs,)

    selected_indices: list[int] = []
    remaining_indices = list(range(len(documents)))

    for _ in range(min(k, len(documents))):
        if not selected_indices:
            # First selection: pure relevance
            best_idx = int(np.argmax(query_sims))
        else:
            # Subsequent selections: MMR score
            selected_embeddings = doc_norms[selected_indices]  # (n_selected, dim)

            best_score = float("-inf")
            best_idx = -1
            for idx in remaining_indices:
                relevance = query_sims[idx]
                # Redundancy: max similarity to any already-selected document
                redundancy = float(np.max(doc_norms[idx] @ selected_embeddings.T))
                score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
                if score > best_score:
                    best_score = score
                    best_idx = idx

        selected_indices.append(best_idx)
        remaining_indices.remove(best_idx)

    return [documents[i] for i in selected_indices]

Lambda Trade-off Examples

Query: "Warfarin dose adjustment guidelines"

λ = 1.0 (pure relevance — no MMR):
  Returns: [dosing protocol, dosing protocol v2, dosing guidelines, dosing protocol v3, CYP2C9 dosing]
  → First 4 are near-duplicates

λ = 0.5 (balanced):
  Returns: [dosing protocol, CYP2C9 dosing, INR monitoring, drug interactions, pregnancy dosing]
  → Each document adds new information

λ = 0.0 (pure diversity):
  Returns: [dosing protocol, paediatric warfarin, anticoagulant history, INR testing, stroke risk]
  → Maximum diversity, may miss highly relevant documents
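
The effect of λ can be reproduced on toy data. The sketch below is an illustration under stated assumptions, not output from a real corpus: it uses 2-D unit vectors as stand-in embeddings (three near-duplicates A, B, C, one different angle D, one barely relevant E), and `unit` and `mmr_order` are hypothetical helper names. Like the implementation above, the first pick is pure relevance.

```python
import numpy as np

def unit(deg: float) -> np.ndarray:
    """2-D unit vector at the given angle (toy stand-in for an embedding)."""
    rad = np.deg2rad(deg)
    return np.array([np.cos(rad), np.sin(rad)])

def mmr_order(query: np.ndarray, docs: np.ndarray, k: int, lam: float) -> list[int]:
    """Greedy MMR over unit-norm rows of `docs`; the first pick is pure relevance."""
    if k <= 0 or len(docs) == 0:
        return []
    query_sims = docs @ query
    selected = [int(np.argmax(query_sims))]
    while len(selected) < min(k, len(docs)):
        # Max similarity of every document to any already-selected document
        redundancy = (docs @ docs[selected].T).max(axis=1)
        scores = lam * query_sims - (1 - lam) * redundancy
        scores[selected] = -np.inf  # never re-pick a selected document
        selected.append(int(np.argmax(scores)))
    return selected

names = ["A", "B", "C", "D", "E"]
angles = [10, 12, 14, -40, 80]  # A, B, C are near-duplicates; the query sits at 0 degrees
docs = np.stack([unit(a) for a in angles])
query = unit(0)

results = {
    lam: [names[i] for i in mmr_order(query, docs, k=3, lam=lam)]
    for lam in (1.0, 0.5, 0.0)
}
for lam, picked in results.items():
    print(lam, picked)
# 1.0 ['A', 'B', 'C']
# 0.5 ['A', 'D', 'B']
# 0.0 ['A', 'E', 'D']
```

With λ = 1.0 the three near-duplicates fill the result; with λ = 0.5 the different-angle document D displaces one of them; with λ = 0.0 the barely relevant E is picked second purely because it is far from A.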

When to Use MMR

Use MMR:
  Long-form Q&A where context diversity improves answer quality
  Report generation requiring coverage across multiple aspects
  Document collections with many near-duplicates
  User-facing search where repeated similar results look broken

Don't use MMR:
  Simple factual lookup (want the most relevant single document)
  Legal or medical citations where the most authoritative source matters
  When retrieval is already diverse by design (semantic chunking)
  
Clinical example:
  Query: "AF treatment options for elderly patients with CKD"
  MMR retrieves: anticoagulation options, rate control, rhythm control, renal dosing
  Standard retrieval: mostly anticoagulation documents, misses renal context

LangChain Integration

Python

from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

# `texts` is assumed to be a list[str] of document chunks prepared earlier
vectorstore = FAISS.from_texts(texts, OpenAIEmbeddings())

# MMR retrieval
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={
        "k": 5,         # number to return
        "fetch_k": 20,  # initial candidate pool size
        "lambda_mult": 0.5
    }
)

results = retriever.invoke("Warfarin dose adjustment")  # returns a list of Document objects
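
The `fetch_k` setting describes a two-stage process: the store first fetches the `fetch_k` candidates most similar to the query, and MMR then picks `k` of them. The first stage can be sketched in plain NumPy as below. This is an illustrative sketch, not LangChain's internal code; `top_fetch_k` is a hypothetical helper and the random matrix stands in for real embeddings.

```python
import numpy as np

def top_fetch_k(query: np.ndarray, doc_matrix: np.ndarray, fetch_k: int = 20) -> np.ndarray:
    """Return indices of the fetch_k rows most cosine-similar to the query, best first."""
    q = query / (np.linalg.norm(query) + 1e-9)
    d = doc_matrix / (np.linalg.norm(doc_matrix, axis=1, keepdims=True) + 1e-9)
    sims = d @ q
    return np.argsort(-sims)[:fetch_k]

# Random vectors stand in for real document and query embeddings.
rng = np.random.default_rng(0)
doc_matrix = rng.normal(size=(100, 8))
query = rng.normal(size=8)

pool = top_fetch_k(query, doc_matrix, fetch_k=20)
print(pool.shape)  # (20,)
# MMR then selects k documents from doc_matrix[pool] only.
```

A larger `fetch_k` gives MMR more room to find diverse documents, at the cost of more pairwise similarity computations and a greater chance of admitting weakly relevant candidates.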

Interview Answer

"MMR (Maximal Marginal Relevance) balances relevance and diversity in retrieved results. At each step, it selects the document maximising λ·Sim(d, query) - (1-λ)·max_j Sim(d, dⱼ), where dⱼ are already-selected documents. λ=1 is pure relevance, λ=0 is pure diversity, and λ=0.5 balances the two. It prevents the context window from being filled with near-duplicate documents, which is common in corpora with many similar passages. Use MMR when you need comprehensive coverage of a topic (e.g., 'AF treatment in elderly with CKD' should retrieve anticoagulation, rate control, and renal dosing documents, not five anticoagulation variants). Trade-off: slightly lower top-1 precision in exchange for better context diversity."
