Learnixo
Back to blog
AI Systemsintermediate

Implement TF-IDF from Scratch

Implement TF-IDF (Term Frequency-Inverse Document Frequency) in Python from scratch. Understand the math, code it step by step, and see how it powers keyword search.

Asma Hafeez KhanMay 16, 20264 min read
Live CodingTF-IDFInformation RetrievalPython
Share:𝕏

The Problem

TF-IDF is a classic technique for scoring how relevant a word is to a document in a collection. It's the backbone of keyword search and appears frequently in AI interviews.

Task: Given a list of documents, implement TF-IDF to find the most relevant document for a query.


The Math

Term Frequency (TF): How often a term appears in a document, normalized by document length.

TF(t, d) = count(t in d) / len(d)

Inverse Document Frequency (IDF): How rare a term is across all documents. Common words (the, and) get low IDF; rare domain words get high IDF.

IDF(t, D) = log(N / (1 + df(t)))
where N = total documents, df(t) = documents containing t

TF-IDF score:

TF-IDF(t, d, D) = TF(t, d) × IDF(t, D)

Step-by-Step Implementation

Python
import math
from collections import Counter

class TFIDF:
    def __init__(self):
        self.idf: dict[str, float] = {}
        self.documents: list[list[str]] = []

    def tokenize(self, text: str) -> list[str]:
        """Simple whitespace tokenizer with lowercasing."""
        return text.lower().split()

    def fit(self, documents: list[str]):
        """Compute IDF from a corpus of documents."""
        self.documents = [self.tokenize(doc) for doc in documents]
        N = len(self.documents)

        # Count how many documents contain each term
        df: dict[str, int] = Counter()
        for doc_tokens in self.documents:
            for term in set(doc_tokens):  # set() to count each term once per doc
                df[term] += 1

        # Compute IDF for each term
        # +1 in denominator prevents division by zero
        self.idf = {term: math.log(N / (1 + count)) for term, count in df.items()}

    def tf(self, term: str, doc_tokens: list[str]) -> float:
        """Compute TF for a term in a document."""
        count = doc_tokens.count(term)
        return count / len(doc_tokens) if doc_tokens else 0.0

    def tfidf_vector(self, doc_tokens: list[str]) -> dict[str, float]:
        """Compute TF-IDF vector for a tokenized document."""
        return {
            term: self.tf(term, doc_tokens) * self.idf.get(term, 0.0)
            for term in set(doc_tokens)
        }

    def query(self, query_text: str, documents: list[str]) -> list[tuple[int, float]]:
        """Return (doc_index, score) pairs sorted by relevance."""
        self.fit(documents)

        query_terms = self.tokenize(query_text)

        scores = []
        for idx, doc_tokens in enumerate(self.documents):
            # Score = sum of TF-IDF for each query term in this document
            score = sum(
                self.tf(term, doc_tokens) * self.idf.get(term, 0.0)
                for term in query_terms
            )
            scores.append((idx, score))

        return sorted(scores, key=lambda x: -x[1])

Testing It

Python
documents = [
    "Metformin inhibits hepatic glucose production via AMPK activation",
    "Warfarin inhibits vitamin K epoxide reductase reducing clotting factors",
    "Ibuprofen inhibits COX-1 and COX-2 reducing prostaglandin synthesis",
    "Metformin is used for type 2 diabetes with low hypoglycemia risk",
    "Aspirin inhibits COX-1 reducing thromboxane and platelet aggregation",
]

tfidf = TFIDF()
results = tfidf.query("metformin diabetes glucose", documents)

print("Query: metformin diabetes glucose")
for idx, score in results:
    print(f"  [{score:.4f}] {documents[idx][:60]}")

# Expected: documents 0 and 3 (both mention metformin) should rank highest

Output:

Query: metformin diabetes glucose
  [0.0821] Metformin inhibits hepatic glucose production via AMPK...
  [0.0614] Metformin is used for type 2 diabetes with low hypoglycemia...
  [0.0000] Warfarin inhibits vitamin K epoxide reductase reducing...

Common Interview Follow-Ups

Q: Why does TF-IDF work better than pure term frequency?

TF alone favors documents where a term appears many times — but if the term appears in every document, it carries no discriminating information. IDF penalizes common terms, so "the" doesn't score highly just because it appears often.

Q: What are the limitations of TF-IDF?

  • No semantic understanding: "ibuprofen" and "NSAID" are unrelated in TF-IDF vocabulary
  • No context: can't distinguish "drug does NOT cause X" from "drug causes X"
  • Vocabulary mismatch: query uses different words than documents

These limitations are exactly why dense vector embeddings (SBERT, OpenAI text-embedding) replaced TF-IDF for semantic search.

Q: How would you handle document length normalization?

Cosine similarity between document TF-IDF vectors naturally normalizes for length. Or use sublinear TF scaling: 1 + log(count) instead of raw count.


Using sklearn's Implementation

Python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

vectorizer = TfidfVectorizer(
    lowercase=True,
    stop_words="english",  # Remove common words
    ngram_range=(1, 2),    # Include bigrams
    max_features=10_000,
)

doc_matrix = vectorizer.fit_transform(documents)
query_vec = vectorizer.transform(["metformin diabetes glucose"])

similarities = cosine_similarity(query_vec, doc_matrix)[0]
ranked = sorted(enumerate(similarities), key=lambda x: -x[1])

for idx, score in ranked[:3]:
    print(f"[{score:.4f}] {documents[idx][:60]}")

Cosine similarity is preferred over raw TF-IDF sum — it normalizes for document length so long documents don't get an unfair advantage.

Enjoyed this article?

Explore the AI Systems learning path for more.

Found this helpful?

Share:𝕏

Leave a comment

Have a question, correction, or just found this helpful? Leave a note below.