RAG Systems · Lesson 16 of 24
BM25: Keyword-Based Retrieval
What BM25 Is
BM25 (Best Matching 25) is the standard keyword retrieval algorithm, successor to TF-IDF. It ranks documents by how well their terms match the query terms, with saturation and length normalisation:
BM25(D, Q) = Σ IDF(qᵢ) × [f(qᵢ, D) × (k1 + 1)] / [f(qᵢ, D) + k1 × (1 - b + b × |D|/avgdl)]
Where:
f(qᵢ, D) = term frequency of query term qᵢ in document D
|D| = document length
avgdl = average document length in corpus
k1 = term frequency saturation (default 1.2–2.0)
b = length normalisation (default 0.75)
IDF = log((N - n(qᵢ) + 0.5) / (n(qᵢ) + 0.5))
N = total documents, n(qᵢ) = documents containing qᵢTF saturates: mentioning a term 10× is not 10× more relevant than mentioning it once.
Why BM25 Complements Embeddings
Semantic (dense) retrieval is good at:
Synonym matching: "heart attack" ↔ "myocardial infarction"
Paraphrase matching: "reduce clotting" ↔ "anticoagulation"
General domain understanding
BM25 (sparse) retrieval is good at:
Exact term matching: drug names, ICD codes, lab values
Rare terms: "CYP2C9*3 allele" — may not be well-represented in embeddings
Numeric values: "INR 2.5" — embeddings don't preserve exact numbers
Acronyms: "NOAC", "VTE", "PE" — embeddings may conflate with other meanings
Hybrid (BM25 + dense) wins in most benchmarks:
BEIR benchmark: hybrid consistently outperforms either alone
Clinical retrieval: exact drug names + semantic context = best of bothImplementation with rank_bm25
from rank_bm25 import BM25Okapi
import nltk
from nltk.tokenize import word_tokenize
import numpy as np
nltk.download("punkt", quiet=True)
class BM25Retriever:
def __init__(self, documents: list[str], metadatas: list[dict]):
self.documents = documents
self.metadatas = metadatas
tokenized = [word_tokenize(doc.lower()) for doc in documents]
self.bm25 = BM25Okapi(tokenized, k1=1.5, b=0.75)
def retrieve(
self,
query: str,
top_k: int = 10,
) -> list[dict]:
query_tokens = word_tokenize(query.lower())
scores = self.bm25.get_scores(query_tokens)
# Normalise scores to [0, 1]
max_score = scores.max()
if max_score > 0:
scores = scores / max_score
top_indices = np.argsort(scores)[::-1][:top_k]
return [
{
"content": self.documents[i],
"metadata": self.metadatas[i],
"bm25_score": float(scores[i]),
}
for i in top_indices
if scores[i] > 0 # exclude zero-score documents
]BM25 with Elasticsearch / OpenSearch
from opensearchpy import OpenSearch
client = OpenSearch(hosts=["localhost:9200"])
def index_to_opensearch(
documents: list[dict],
index_name: str = "clinical_rag",
) -> None:
# Create index with BM25 settings
client.indices.create(
index=index_name,
body={
"settings": {
"similarity": {
"custom_bm25": {
"type": "BM25",
"k1": 1.5,
"b": 0.75,
}
}
},
"mappings": {
"properties": {
"content": {"type": "text", "similarity": "custom_bm25"},
"source": {"type": "keyword"},
"topic": {"type": "keyword"},
}
}
},
ignore=400,
)
for i, doc in enumerate(documents):
client.index(index=index_name, id=i, body=doc)
def bm25_search_opensearch(
query: str,
top_k: int = 10,
filter_topic: str | None = None,
) -> list[dict]:
must_clause = [{"match": {"content": query}}]
filter_clause = []
if filter_topic:
filter_clause.append({"term": {"topic": filter_topic}})
response = client.search(
index="clinical_rag",
body={
"query": {
"bool": {
"must": must_clause,
"filter": filter_clause,
}
},
"size": top_k,
}
)
return [
{
"content": hit["_source"]["content"],
"metadata": {k: v for k, v in hit["_source"].items() if k != "content"},
"bm25_score": hit["_score"],
}
for hit in response["hits"]["hits"]
]When BM25 Outperforms Dense Retrieval
Query: "CYP2C9 poor metaboliser warfarin dose adjustment"
BM25: retrieves documents containing "CYP2C9" and "poor metaboliser" exactly
Dense: may retrieve general warfarin dosing docs that don't mention CYP2C9
Query: "INR 4.5 management"
BM25: retrieves "When INR is 4.0–5.0, withhold warfarin..." exactly
Dense: retrieves semantically similar content, may miss exact INR range
Query: "NICE NG196"
BM25: retrieves documents tagged with "NG196" precisely
Dense: may not have learned this document identifier well
Clinical bottom line:
Drug names, lab values, ICD codes, document identifiers → BM25 wins
Symptom descriptions, lay language, paraphrase → dense wins
Hybrid → always at least as good as either aloneInterview Answer
"BM25 is the standard keyword retrieval algorithm that ranks documents by term frequency with saturation and document length normalisation. It excels at exact term matching — drug names, ICD codes, numeric lab values, document identifiers — which are poorly handled by dense embeddings because they lack paraphrase structure. In clinical RAG, hybrid retrieval (BM25 + dense vector search, scores fused via Reciprocal Rank Fusion) consistently outperforms either alone. I implement BM25 with rank_bm25 for local indexes or Elasticsearch/OpenSearch for production, and combine it with the dense vector search via RRF to get both keyword precision and semantic understanding."