Live Coding Interview Prep · Lesson 7 of 16
Implement TF-IDF from Scratch
The Problem
TF-IDF is a classic technique for scoring how relevant a word is to a document in a collection. It's the backbone of keyword search and appears frequently in AI interviews.
Task: Given a list of documents, implement TF-IDF to find the most relevant document for a query.
The Math
Term Frequency (TF): How often a term appears in a document, normalized by document length.
TF(t, d) = count(t in d) / len(d)Inverse Document Frequency (IDF): How rare a term is across all documents. Common words (the, and) get low IDF; rare domain words get high IDF.
IDF(t, D) = log(N / (1 + df(t)))
where N = total documents, df(t) = documents containing tTF-IDF score:
TF-IDF(t, d, D) = TF(t, d) × IDF(t, D)Step-by-Step Implementation
import math
from collections import Counter
class TFIDF:
def __init__(self):
self.idf: dict[str, float] = {}
self.documents: list[list[str]] = []
def tokenize(self, text: str) -> list[str]:
"""Simple whitespace tokenizer with lowercasing."""
return text.lower().split()
def fit(self, documents: list[str]):
"""Compute IDF from a corpus of documents."""
self.documents = [self.tokenize(doc) for doc in documents]
N = len(self.documents)
# Count how many documents contain each term
df: dict[str, int] = Counter()
for doc_tokens in self.documents:
for term in set(doc_tokens): # set() to count each term once per doc
df[term] += 1
# Compute IDF for each term
# +1 in denominator prevents division by zero
self.idf = {term: math.log(N / (1 + count)) for term, count in df.items()}
def tf(self, term: str, doc_tokens: list[str]) -> float:
"""Compute TF for a term in a document."""
count = doc_tokens.count(term)
return count / len(doc_tokens) if doc_tokens else 0.0
def tfidf_vector(self, doc_tokens: list[str]) -> dict[str, float]:
"""Compute TF-IDF vector for a tokenized document."""
return {
term: self.tf(term, doc_tokens) * self.idf.get(term, 0.0)
for term in set(doc_tokens)
}
def query(self, query_text: str, documents: list[str]) -> list[tuple[int, float]]:
"""Return (doc_index, score) pairs sorted by relevance."""
self.fit(documents)
query_terms = self.tokenize(query_text)
scores = []
for idx, doc_tokens in enumerate(self.documents):
# Score = sum of TF-IDF for each query term in this document
score = sum(
self.tf(term, doc_tokens) * self.idf.get(term, 0.0)
for term in query_terms
)
scores.append((idx, score))
return sorted(scores, key=lambda x: -x[1])Testing It
documents = [
"Metformin inhibits hepatic glucose production via AMPK activation",
"Warfarin inhibits vitamin K epoxide reductase reducing clotting factors",
"Ibuprofen inhibits COX-1 and COX-2 reducing prostaglandin synthesis",
"Metformin is used for type 2 diabetes with low hypoglycemia risk",
"Aspirin inhibits COX-1 reducing thromboxane and platelet aggregation",
]
tfidf = TFIDF()
results = tfidf.query("metformin diabetes glucose", documents)
print("Query: metformin diabetes glucose")
for idx, score in results:
print(f" [{score:.4f}] {documents[idx][:60]}")
# Expected: documents 0 and 3 (both mention metformin) should rank highestOutput:
Query: metformin diabetes glucose
[0.0821] Metformin inhibits hepatic glucose production via AMPK...
[0.0614] Metformin is used for type 2 diabetes with low hypoglycemia...
[0.0000] Warfarin inhibits vitamin K epoxide reductase reducing...Common Interview Follow-Ups
Q: Why does TF-IDF work better than pure term frequency?
TF alone favors documents where a term appears many times — but if the term appears in every document, it carries no discriminating information. IDF penalizes common terms, so "the" doesn't score highly just because it appears often.
Q: What are the limitations of TF-IDF?
- No semantic understanding: "ibuprofen" and "NSAID" are unrelated in TF-IDF vocabulary
- No context: can't distinguish "drug does NOT cause X" from "drug causes X"
- Vocabulary mismatch: query uses different words than documents
These limitations are exactly why dense vector embeddings (SBERT, OpenAI text-embedding) replaced TF-IDF for semantic search.
Q: How would you handle document length normalization?
Cosine similarity between document TF-IDF vectors naturally normalizes for length. Or use sublinear TF scaling: 1 + log(count) instead of raw count.
Using sklearn's Implementation
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
vectorizer = TfidfVectorizer(
lowercase=True,
stop_words="english", # Remove common words
ngram_range=(1, 2), # Include bigrams
max_features=10_000,
)
doc_matrix = vectorizer.fit_transform(documents)
query_vec = vectorizer.transform(["metformin diabetes glucose"])
similarities = cosine_similarity(query_vec, doc_matrix)[0]
ranked = sorted(enumerate(similarities), key=lambda x: -x[1])
for idx, score in ranked[:3]:
print(f"[{score:.4f}] {documents[idx][:60]}")Cosine similarity is preferred over raw TF-IDF sum — it normalizes for document length so long documents don't get an unfair advantage.