Learnixo

RAG Systems · Lesson 16 of 24

BM25: Keyword-Based Retrieval

What BM25 Is

BM25 (Best Matching 25) is the standard keyword retrieval algorithm, successor to TF-IDF. It ranks documents by how well their terms match the query terms, with saturation and length normalisation:

BM25(D, Q) = Σ IDF(qᵢ) × [f(qᵢ, D) × (k1 + 1)] / [f(qᵢ, D) + k1 × (1 - b + b × |D|/avgdl)]

Where:
  f(qᵢ, D) = term frequency of query term qᵢ in document D
  |D|       = document length
  avgdl     = average document length in corpus
  k1        = term frequency saturation (default 1.2–2.0)
  b         = length normalisation (default 0.75)
  IDF       = log((N - n(qᵢ) + 0.5) / (n(qᵢ) + 0.5))
  N         = total documents, n(qᵢ) = documents containing qᵢ

TF saturates: mentioning a term 10× is not 10× more relevant than mentioning it once.


Why BM25 Complements Embeddings

Semantic (dense) retrieval is good at:
  Synonym matching: "heart attack" ↔ "myocardial infarction"
  Paraphrase matching: "reduce clotting" ↔ "anticoagulation"
  General domain understanding

BM25 (sparse) retrieval is good at:
  Exact term matching: drug names, ICD codes, lab values
  Rare terms: "CYP2C9*3 allele" — may not be well-represented in embeddings
  Numeric values: "INR 2.5" — embeddings don't preserve exact numbers
  Acronyms: "NOAC", "VTE", "PE" — embeddings may conflate with other meanings

Hybrid (BM25 + dense) wins in most benchmarks:
  BEIR benchmark: hybrid consistently outperforms either alone
  Clinical retrieval: exact drug names + semantic context = best of both

Implementation with rank_bm25

Python
from rank_bm25 import BM25Okapi
import nltk
from nltk.tokenize import word_tokenize
import numpy as np

nltk.download("punkt", quiet=True)

class BM25Retriever:
    def __init__(self, documents: list[str], metadatas: list[dict]):
        self.documents = documents
        self.metadatas = metadatas
        tokenized = [word_tokenize(doc.lower()) for doc in documents]
        self.bm25 = BM25Okapi(tokenized, k1=1.5, b=0.75)
    
    def retrieve(
        self,
        query: str,
        top_k: int = 10,
    ) -> list[dict]:
        query_tokens = word_tokenize(query.lower())
        scores = self.bm25.get_scores(query_tokens)
        
        # Normalise scores to [0, 1]
        max_score = scores.max()
        if max_score > 0:
            scores = scores / max_score
        
        top_indices = np.argsort(scores)[::-1][:top_k]
        
        return [
            {
                "content": self.documents[i],
                "metadata": self.metadatas[i],
                "bm25_score": float(scores[i]),
            }
            for i in top_indices
            if scores[i] > 0  # exclude zero-score documents
        ]

BM25 with Elasticsearch / OpenSearch

Python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=["localhost:9200"])

def index_to_opensearch(
    documents: list[dict],
    index_name: str = "clinical_rag",
) -> None:
    # Create index with BM25 settings
    client.indices.create(
        index=index_name,
        body={
            "settings": {
                "similarity": {
                    "custom_bm25": {
                        "type": "BM25",
                        "k1": 1.5,
                        "b": 0.75,
                    }
                }
            },
            "mappings": {
                "properties": {
                    "content": {"type": "text", "similarity": "custom_bm25"},
                    "source": {"type": "keyword"},
                    "topic": {"type": "keyword"},
                }
            }
        },
        ignore=400,
    )
    
    for i, doc in enumerate(documents):
        client.index(index=index_name, id=i, body=doc)

def bm25_search_opensearch(
    query: str,
    top_k: int = 10,
    filter_topic: str | None = None,
) -> list[dict]:
    must_clause = [{"match": {"content": query}}]
    filter_clause = []
    if filter_topic:
        filter_clause.append({"term": {"topic": filter_topic}})
    
    response = client.search(
        index="clinical_rag",
        body={
            "query": {
                "bool": {
                    "must": must_clause,
                    "filter": filter_clause,
                }
            },
            "size": top_k,
        }
    )
    
    return [
        {
            "content": hit["_source"]["content"],
            "metadata": {k: v for k, v in hit["_source"].items() if k != "content"},
            "bm25_score": hit["_score"],
        }
        for hit in response["hits"]["hits"]
    ]

When BM25 Outperforms Dense Retrieval

Query: "CYP2C9 poor metaboliser warfarin dose adjustment"
  BM25: retrieves documents containing "CYP2C9" and "poor metaboliser" exactly
  Dense: may retrieve general warfarin dosing docs that don't mention CYP2C9

Query: "INR 4.5 management"
  BM25: retrieves "When INR is 4.0–5.0, withhold warfarin..." exactly
  Dense: retrieves semantically similar content, may miss exact INR range

Query: "NICE NG196"
  BM25: retrieves documents tagged with "NG196" precisely
  Dense: may not have learned this document identifier well

Clinical bottom line:
  Drug names, lab values, ICD codes, document identifiers → BM25 wins
  Symptom descriptions, lay language, paraphrase → dense wins
  Hybrid → always at least as good as either alone

Interview Answer

"BM25 is the standard keyword retrieval algorithm that ranks documents by term frequency with saturation and document length normalisation. It excels at exact term matching — drug names, ICD codes, numeric lab values, document identifiers — which are poorly handled by dense embeddings because they lack paraphrase structure. In clinical RAG, hybrid retrieval (BM25 + dense vector search, scores fused via Reciprocal Rank Fusion) consistently outperforms either alone. I implement BM25 with rank_bm25 for local indexes or Elasticsearch/OpenSearch for production, and combine it with the dense vector search via RRF to get both keyword precision and semantic understanding."