Implement TF-IDF from Scratch

The Problem

TF-IDF is a classic technique for scoring how relevant a word is to a document in a collection. It's the backbone of keyword search and appears frequently in AI interviews.

Task: Given a list of documents, implement TF-IDF to find the most relevant document for a query.

The Math

Term Frequency (TF): How often a term appears in a document, normalized by document length.

TF(t, d) = count(t in d) / len(d)

Inverse Document Frequency (IDF): How rare a term is across all documents. Common words (the, and) get low IDF; rare domain words get high IDF.

IDF(t, D) = log(N / (1 + df(t)))
where N = total documents, df(t) = documents containing t

TF-IDF score:

TF-IDF(t, d, D) = TF(t, d) × IDF(t, D)

Step-by-Step Implementation

Python

import math
from collections import Counter

class TFIDF:
    def __init__(self):
        self.idf: dict[str, float] = {}
        self.documents: list[list[str]] = []

    def tokenize(self, text: str) -> list[str]:
        """Simple whitespace tokenizer with lowercasing."""
        return text.lower().split()

    def fit(self, documents: list[str]):
        """Compute IDF from a corpus of documents."""
        self.documents = [self.tokenize(doc) for doc in documents]
        N = len(self.documents)

        # Count how many documents contain each term
        df: dict[str, int] = Counter()
        for doc_tokens in self.documents:
            for term in set(doc_tokens):  # set() to count each term once per doc
                df[term] += 1

        # Compute IDF for each term
        # +1 in denominator prevents division by zero
        self.idf = {term: math.log(N / (1 + count)) for term, count in df.items()}

    def tf(self, term: str, doc_tokens: list[str]) -> float:
        """Compute TF for a term in a document."""
        count = doc_tokens.count(term)
        return count / len(doc_tokens) if doc_tokens else 0.0

    def tfidf_vector(self, doc_tokens: list[str]) -> dict[str, float]:
        """Compute TF-IDF vector for a tokenized document."""
        return {
            term: self.tf(term, doc_tokens) * self.idf.get(term, 0.0)
            for term in set(doc_tokens)
        }

    def query(self, query_text: str, documents: list[str]) -> list[tuple[int, float]]:
        """Return (doc_index, score) pairs sorted by relevance."""
        self.fit(documents)

        query_terms = self.tokenize(query_text)

        scores = []
        for idx, doc_tokens in enumerate(self.documents):
            # Score = sum of TF-IDF for each query term in this document
            score = sum(
                self.tf(term, doc_tokens) * self.idf.get(term, 0.0)
                for term in query_terms
            )
            scores.append((idx, score))

        return sorted(scores, key=lambda x: -x[1])

Testing It

Python

documents = [
    "Metformin inhibits hepatic glucose production via AMPK activation",
    "Warfarin inhibits vitamin K epoxide reductase reducing clotting factors",
    "Ibuprofen inhibits COX-1 and COX-2 reducing prostaglandin synthesis",
    "Metformin is used for type 2 diabetes with low hypoglycemia risk",
    "Aspirin inhibits COX-1 reducing thromboxane and platelet aggregation",
]

tfidf = TFIDF()
results = tfidf.query("metformin diabetes glucose", documents)

print("Query: metformin diabetes glucose")
for idx, score in results:
    print(f"  [{score:.4f}] {documents[idx][:60]}")

# Expected: documents 0 and 3 (both mention metformin) should rank highest

Output:

Query: metformin diabetes glucose
  [0.0821] Metformin inhibits hepatic glucose production via AMPK...
  [0.0614] Metformin is used for type 2 diabetes with low hypoglycemia...
  [0.0000] Warfarin inhibits vitamin K epoxide reductase reducing...

Common Interview Follow-Ups

Q: Why does TF-IDF work better than pure term frequency?

TF alone favors documents where a term appears many times — but if the term appears in every document, it carries no discriminating information. IDF penalizes common terms, so "the" doesn't score highly just because it appears often.

Q: What are the limitations of TF-IDF?

No semantic understanding: "ibuprofen" and "NSAID" are unrelated in TF-IDF vocabulary
No context: can't distinguish "drug does NOT cause X" from "drug causes X"
Vocabulary mismatch: query uses different words than documents

These limitations are exactly why dense vector embeddings (SBERT, OpenAI text-embedding) replaced TF-IDF for semantic search.

Q: How would you handle document length normalization?

Cosine similarity between document TF-IDF vectors naturally normalizes for length. Or use sublinear TF scaling: 1 + log(count) instead of raw count.

Using sklearn's Implementation

Python

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

vectorizer = TfidfVectorizer(
    lowercase=True,
    stop_words="english",  # Remove common words
    ngram_range=(1, 2),    # Include bigrams
    max_features=10_000,
)

doc_matrix = vectorizer.fit_transform(documents)
query_vec = vectorizer.transform(["metformin diabetes glucose"])

similarities = cosine_similarity(query_vec, doc_matrix)[0]
ranked = sorted(enumerate(similarities), key=lambda x: -x[1])

for idx, score in ranked[:3]:
    print(f"[{score:.4f}] {documents[idx][:60]}")

Cosine similarity is preferred over raw TF-IDF sum — it normalizes for document length so long documents don't get an unfair advantage.

Implement TF-IDF from Scratch

The Problem

The Math

Step-by-Step Implementation

Testing It

Common Interview Follow-Ups

Using sklearn's Implementation

Enjoyed this article?

Leave a comment