Learnixo
AI Systems · Intermediate

Maximal Marginal Relevance (MMR)

How MMR balances relevance and diversity in RAG retrieval, the algorithm, when to use it, and implementation with embeddings.

Asma Hafeez Khan · May 16, 2026 · 4 min read
Tags: RAG, MMR, Diversity, Retrieval, Interview

The Redundancy Problem

Standard similarity search retrieves the k documents most similar to the query, but those documents can be near-duplicates of each other:

Query: "Warfarin dose adjustment guidelines"

Top 5 by cosine similarity:
  1. "Warfarin dosing: standard protocols" (score: 0.91)
  2. "Warfarin dosing guidelines for adults" (score: 0.90)  ← near-duplicate of 1
  3. "Warfarin dose adjustment protocol" (score: 0.89)      ← near-duplicate of 1
  4. "Warfarin and CYP2C9 dosing" (score: 0.87)            ← different angle!
  5. "Warfarin monitoring INR targets" (score: 0.85)        ← different angle!

Documents 1-3 say essentially the same thing.
The LLM sees redundant context, which wastes the context window.
With a smaller k (say k = 3), documents 4-5 would not be retrieved at all, even though they add new information.

MMR solves this by penalising documents similar to already-selected ones.
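The redundancy in a plain top-k result can be made visible by checking how similar the retrieved documents are to each other. The sketch below is only an illustration: the 2-D embeddings are made up, and the helper name top_k_with_redundancy is hypothetical rather than part of any library.

```python
import numpy as np

def top_k_with_redundancy(query, embeddings, k=3):
    """Return the top-k indices by cosine similarity, their scores,
    and the pairwise cosine similarities among those k documents."""
    q = query / np.linalg.norm(query)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = e @ q                      # cosine similarity of each document to the query
    top = np.argsort(-sims)[:k]       # indices of the k highest scores
    pairwise = e[top] @ e[top].T      # (k, k) similarity among the retrieved documents
    return top, sims[top], pairwise

# Toy 2-D embeddings, invented for illustration only
embeddings = np.array([
    [1.00, 0.10],    # 0: dosing protocol
    [1.00, 0.12],    # 1: near-duplicate of 0
    [1.00, 0.08],    # 2: near-duplicate of 0
    [0.70, 0.70],    # 3: different angle
    [0.60, -0.80],   # 4: another angle
])
query = np.array([1.0, 0.1])

top, scores, pairwise = top_k_with_redundancy(query, embeddings, k=3)

# Off-diagonal entries close to 1.0 mean the retrieved documents repeat each other
off_diag = pairwise[~np.eye(len(top), dtype=bool)]
print(sorted(top.tolist()), float(off_diag.min()))
```

On this toy data the three retrieved documents are the near-duplicates, and every pairwise similarity among them is above 0.99, which is the situation MMR is meant to avoid.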


The MMR Algorithm

MMR selects documents one at a time.
At each step, it selects the remaining (not yet selected) document dᵢ that maximises:

MMR(dᵢ) = λ · Sim(dᵢ, query) - (1-λ) · max_{dⱼ ∈ S} Sim(dᵢ, dⱼ)

where:
  S      = set of already-selected documents (empty on the first pick, so the penalty term is 0)
  λ      = trade-off weight between 0 and 1: higher favours relevance, lower favours diversity
  λ = 1  → standard relevance ranking (no diversity)
  λ = 0  → maximum diversity (ignore query relevance)
  λ = 0.5 → balanced (typical default)
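To see how the formula plays out for a single step, here is a small worked example. The relevance values come from the Warfarin example above; the similarities to the already-selected document (0.95 and 0.40) are assumed numbers chosen for illustration, and mmr_score is a hypothetical helper name.

```python
def mmr_score(relevance, max_sim_to_selected, lam=0.5):
    """MMR score of one candidate: relevance reward minus redundancy penalty."""
    return lam * relevance - (1 - lam) * max_sim_to_selected

# Document 1 is already selected. Compare two remaining candidates at λ = 0.5.
# The 0.95 and 0.40 similarities to document 1 are assumptions for this example.
near_duplicate = mmr_score(relevance=0.90, max_sim_to_selected=0.95)
different_angle = mmr_score(relevance=0.87, max_sim_to_selected=0.40)

print(round(near_duplicate, 3), round(different_angle, 3))
```

Under these assumed numbers the near-duplicate scores about -0.025 and the different-angle document about 0.235, so the slightly less relevant but much less redundant document is picked next.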

Implementation

Python
import numpy as np
from typing import NamedTuple

class Document(NamedTuple):
    id: str
    content: str
    embedding: np.ndarray

def mmr(
    query_embedding: np.ndarray,
    documents: list[Document],
    k: int = 5,
    lambda_mult: float = 0.5,
) -> list[Document]:
    """
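    Select up to k documents from `documents` using Maximal Marginal Relevance.

    lambda_mult=1.0 ranks purely by relevance to the query; lower values
    increasingly penalise similarity to documents that are already selected.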
    """
    if not documents:
        return []

    # Precompute cosine similarities to query
    doc_embeddings = np.stack([d.embedding for d in documents])
    query_norm = query_embedding / (np.linalg.norm(query_embedding) + 1e-9)
    doc_norms = doc_embeddings / (np.linalg.norm(doc_embeddings, axis=1, keepdims=True) + 1e-9)
    query_sims = doc_norms @ query_norm  # (n_docs,)

    selected_indices: list[int] = []
    remaining_indices = list(range(len(documents)))

    for _ in range(min(k, len(documents))):
        if not selected_indices:
            # First selection: pure relevance
            best_idx = int(np.argmax(query_sims))
        else:
            # Subsequent selections: MMR score
            selected_embeddings = doc_norms[selected_indices]  # (n_selected, dim)

            best_score = float("-inf")
            best_idx = -1
            for idx in remaining_indices:
                relevance = query_sims[idx]
                # Redundancy penalty: max similarity to any already-selected document
                redundancy = float(np.max(doc_norms[idx] @ selected_embeddings.T))
                score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
                if score > best_score:
                    best_score = score
                    best_idx = idx

        selected_indices.append(best_idx)
        remaining_indices.remove(best_idx)

    return [documents[i] for i in selected_indices]

Lambda Trade-off Examples

Query: "Warfarin dose adjustment guidelines"

λ = 1.0 (pure relevance — no MMR):
  Returns: [dosing protocol, dosing protocol v2, dosing guidelines, dosing protocol v3, CYP2C9 dosing]
  → First 4 are near-duplicates

λ = 0.5 (balanced):
  Returns: [dosing protocol, CYP2C9 dosing, INR monitoring, drug interactions, pregnancy dosing]
  → Each document adds new information

λ = 0.0 (pure diversity):
  Returns: [dosing protocol, paediatric warfarin, anticoagulant history, INR testing, stroke risk]
  → Maximum diversity, may miss highly relevant documents
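The effect of λ can also be checked numerically. The following is a minimal sketch, assuming invented 2-D embeddings and a compact index-based helper (mmr_indices, a hypothetical name) that follows the same greedy procedure as the implementation above.

```python
import numpy as np

def mmr_indices(query, embeddings, k, lam):
    """Greedy MMR over row vectors; returns selected row indices in pick order."""
    if k <= 0 or len(embeddings) == 0:
        return []
    q = query / np.linalg.norm(query)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    relevance = e @ q
    selected = [int(np.argmax(relevance))]           # first pick: pure relevance
    while len(selected) < min(k, len(e)):
        # Max similarity of every document to anything already selected
        redundancy = (e @ e[selected].T).max(axis=1)
        scores = lam * relevance - (1 - lam) * redundancy
        scores[selected] = -np.inf                   # never re-pick a document
        selected.append(int(np.argmax(scores)))
    return selected

# Toy data, invented for illustration: rows 0-2 are near-duplicates
embeddings = np.array([
    [1.00, 0.00],    # 0: dosing protocol
    [0.99, 0.05],    # 1: near-duplicate of 0
    [0.98, -0.06],   # 2: near-duplicate of 0
    [0.80, 0.60],    # 3: CYP2C9 dosing
    [0.70, -0.71],   # 4: INR monitoring
])
query = np.array([1.0, 0.2])

for lam in (1.0, 0.5, 0.3):
    print(lam, mmr_indices(query, embeddings, k=3, lam=lam))
```

On this toy data, λ = 1.0 returns only the three near-duplicates, while lower values pull the off-axis documents (3 and 4) ahead of them. Real embeddings will behave less cleanly, so λ is usually tuned against the corpus at hand.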

When to Use MMR

Use MMR:
  Long-form Q&A where context diversity improves answer quality
  Report generation requiring coverage across multiple aspects
  Document collections with many near-duplicates
  User-facing search where repeated similar results look broken

Don't use MMR:
  Simple factual lookup (want the most relevant single document)
  Legal or medical citations where the most authoritative source matters
  When retrieval is already diverse by design (semantic chunking)
  
Clinical example:
  Query: "AF treatment options for elderly patients with CKD"
  MMR retrieves: anticoagulation options, rate control, rhythm control, renal dosing
  Standard retrieval: mostly anticoagulation documents, misses renal context

LangChain Integration

Python
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

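# Placeholder corpus for illustration; replace with your own document texts
texts = [
    "Warfarin dosing: standard protocols",
    "Warfarin dosing guidelines for adults",
    "Warfarin and CYP2C9 dosing",
    "Warfarin monitoring INR targets",
]

# OpenAIEmbeddings expects OPENAI_API_KEY to be set in the environment
vectorstore = FAISS.from_texts(texts, OpenAIEmbeddings())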

# MMR retrieval
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={
        "k": 5,         # number to return
        "fetch_k": 20,  # initial candidate pool size
        "lambda_mult": 0.5
    }
)

# invoke() replaces the deprecated get_relevant_documents()
results = retriever.invoke("Warfarin dose adjustment")
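The fetch_k and k settings describe a two-stage process: first fetch a candidate pool by plain similarity, then re-rank that pool with MMR. The sketch below illustrates the idea in plain NumPy. It is not LangChain's code; retrieve_mmr is a hypothetical function name and the embeddings are invented.

```python
import numpy as np

def retrieve_mmr(query, embeddings, k=5, fetch_k=20, lam=0.5):
    """Two-stage retrieval: top fetch_k by cosine similarity, then greedy MMR down to k."""
    if len(embeddings) == 0 or k <= 0:
        return []
    fetch_k = max(fetch_k, k)   # assumption of this sketch: the pool is never smaller than k
    q = query / np.linalg.norm(query)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    relevance = e @ q

    # Stage 1: candidate pool of the fetch_k most similar documents
    pool = np.argsort(-relevance)[:fetch_k]

    # Stage 2: greedy MMR restricted to the pool
    selected = [int(pool[0])]   # most relevant document goes first
    while len(selected) < min(k, len(pool)):
        best, best_score = -1, -np.inf
        for idx in pool:
            if idx in selected:
                continue
            redundancy = float(np.max(e[selected] @ e[idx]))
            score = lam * relevance[idx] - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = int(idx), score
        selected.append(best)
    return selected

# Toy data, invented for illustration: rows 0-2 are near-duplicates
embeddings = np.array([
    [1.00, 0.00], [0.99, 0.05], [0.98, -0.06], [0.80, 0.60], [0.70, -0.71],
])
query = np.array([1.0, 0.2])

print(retrieve_mmr(query, embeddings, k=3, fetch_k=3))   # pool holds only near-duplicates
print(retrieve_mmr(query, embeddings, k=3, fetch_k=5))   # larger pool gives MMR something to diversify with
```

The design point is that MMR can only diversify among the candidates it is given. If fetch_k is close to k, the pool may contain nothing but near-duplicates and the re-ranking has little effect, which is why fetch_k is typically set several times larger than k.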

Interview Answer

"MMR (Maximal Marginal Relevance) balances relevance and diversity in retrieved results. At each step, it selects the remaining document that maximises λ·Sim(d, query) - (1-λ)·max_j Sim(d, dⱼ), where dⱼ are the already-selected documents. λ=1 is pure relevance ranking; λ=0.5 weights relevance and diversity equally. It prevents the context window from being filled with near-duplicate documents, which is common in corpora with many similar passages. Use MMR when you need comprehensive coverage of a topic (e.g., 'AF treatment in elderly with CKD' should retrieve anticoagulation, rate control, and renal dosing documents, not 5 anticoagulation variants). Trade-off: the top result is unchanged, because the first pick is made purely on relevance, but the average relevance of the remaining results is somewhat lower, in exchange for better context diversity."
