Probability in Action: Spam Filter

The Problem

Build a spam filter. Given the words in an email, classify it as spam or not.

Training data:
  5 spam emails
  5 legitimate (ham) emails
  
  Word counts observed:
  
  Word      | In spam | In ham
  ----------|---------|-------
  "buy"     |    4    |   1
  "click"   |    3    |   0
  "meeting" |    1    |   4
  "report"  |    0    |   3
  
  Total words in spam:  20
  Total words in ham:   20
  P(spam) = 5/10 = 0.50 (prior)

Naive Bayes: The Model

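The Naive Bayes classifier applies Bayes' theorem under the "naive" assumption that words are conditionally independent given the class:

P(spam | words) ∝ P(spam) × Π P(wordᵢ | spam)

The ∝ sign drops the denominator P(words). It is the same for both classes, so it does not affect which class scores higher.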

For a new email containing "buy" and "click":
  P(spam | buy, click) ∝ P(spam) × P(buy | spam) × P(click | spam)
  P(ham  | buy, click) ∝ P(ham)  × P(buy | ham)  × P(click | ham)

Step-by-Step Calculation

Python

# Training counts
spam_word_counts = {"buy": 4, "click": 3, "meeting": 1, "report": 0}
ham_word_counts  = {"buy": 1, "click": 0, "meeting": 4, "report": 3}

total_spam_words = 20
total_ham_words  = 20
vocab_size = 4     # distinct words, used in the Laplace smoothing denominator
                   # (simplification: only the four tabulated words count as vocabulary)

# Laplace (add-1) smoothing to handle zero counts
# P(word | class) = (count + 1) / (total + vocab_size)
def word_probability(word: str, counts: dict, total: int, vocab: int) -> float:
    return (counts.get(word, 0) + 1) / (total + vocab)

# Email: "buy click"
email_words = ["buy", "click"]

# Class priors: 5 of the 10 training emails are spam
p_spam = 0.50
p_ham  = 0.50

# P(buy | spam) = (4+1)/(20+4) = 5/24 ≈ 0.208
# P(click | spam) = (3+1)/(20+4) = 4/24 ≈ 0.167
# Unnormalised posterior score: prior × likelihood of each word
score_spam = (
    p_spam
    * word_probability("buy", spam_word_counts, total_spam_words, vocab_size)
    * word_probability("click", spam_word_counts, total_spam_words, vocab_size)
)
print(f"P(spam) × P(buy|spam) × P(click|spam) = {score_spam:.6f}")
# 0.50 × 5/24 × 4/24 = 10/576 ≈ 0.017361

# P(buy | ham) = (1+1)/(20+4) = 2/24 ≈ 0.083
# P(click | ham) = (0+1)/(20+4) = 1/24 ≈ 0.042
score_ham = (
    p_ham
    * word_probability("buy", ham_word_counts, total_ham_words, vocab_size)
    * word_probability("click", ham_word_counts, total_ham_words, vocab_size)
)
print(f"P(ham) × P(buy|ham) × P(click|ham) = {score_ham:.6f}")
# 0.50 × 2/24 × 1/24 = 1/576 ≈ 0.001736

# Normalise so the two scores sum to 1
total_score = score_spam + score_ham
p_spam_given_email = score_spam / total_score
p_ham_given_email  = score_ham  / total_score

print(f"\nP(spam | 'buy click') = {p_spam_given_email:.4f}")  # 10/11 ≈ 0.9091
print(f"P(ham  | 'buy click') = {p_ham_given_email:.4f}")   # 1/11 ≈ 0.0909
label = "SPAM" if p_spam_given_email > p_ham_given_email else "HAM"
print(f"Classification: {label}")

Using Log Probabilities

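Multiplying many small probabilities can underflow to zero in floating-point arithmetic. Summing log probabilities avoids this, and because log is monotonic it preserves which class scores higher: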

Python

import math

def naive_bayes_log_predict(
    email_words: list[str],
    spam_counts: dict,
    ham_counts: dict,
    total_spam: int,
    total_ham: int,
    p_spam: float = 0.5,
    p_ham: float = 0.5,
) -> str:
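    """Classify an email by comparing summed log probabilities.

    Relies on word_probability() defined in the step-by-step calculation.
    """
    # Vocabulary = every word seen in either class
    vocab = set(spam_counts) | set(ham_counts)
    vocab_size = len(vocab)

    # Start from the log priors, then add one log-likelihood term per word
    log_p_spam = math.log(p_spam)
    log_p_ham  = math.log(p_ham)

    for word in email_words:
        log_p_spam += math.log(word_probability(word, spam_counts, total_spam, vocab_size))
        log_p_ham  += math.log(word_probability(word, ham_counts, total_ham, vocab_size))

    # A tie is resolved in favour of ham
    return "spam" if log_p_spam > log_p_ham else "ham"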

result = naive_bayes_log_predict(
    email_words=["buy", "click"],
    spam_counts=spam_word_counts,
    ham_counts=ham_word_counts,
    total_spam=total_spam_words,
    total_ham=total_ham_words,
)
print(f"Classification: {result}")  # spam
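
To make the underflow concern concrete, here is a minimal sketch that multiplies the smoothed probability 1/24 from the worked example by itself many times. The email length of 300 words is an arbitrary assumption chosen for illustration. The direct product falls below what a 64-bit float can represent, while the sum of logs stays finite.

```python
import math

p_word = 1 / 24      # smoothed P(click | ham) from the worked example
n_words = 300        # hypothetical email length, chosen only for illustration

# Direct product: repeated multiplication of small numbers
product = 1.0
for _ in range(n_words):
    product *= p_word

# Log space: the same quantity expressed as a sum of logs
log_sum = n_words * math.log(p_word)

print(product)   # underflows to 0.0
print(log_sum)   # finite, roughly -953.4
```

Once both class products have underflowed to 0.0 they can no longer be compared, whereas the two log sums still can.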

Using scikit-learn

Python

from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

# Training emails: a small toy set (3 spam, 2 ham), separate from the count table above
emails = [
    "buy now click here",
    "buy click get rich",
    "click here buy cheap",
    "meeting agenda report",
    "quarterly report meeting",
]
labels = [1, 1, 1, 0, 0]  # 1=spam, 0=ham

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

clf = MultinomialNB(alpha=1.0)  # alpha=1 is Laplace smoothing
clf.fit(X, labels)

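# "promotion" never appeared in training, so the fitted vectorizer ignores it
new_email = vectorizer.transform(["buy click promotion"])
proba = clf.predict_proba(new_email)[0]  # columns follow clf.classes_, here [0, 1]
print(f"P(ham) = {proba[0]:.4f}, P(spam) = {proba[1]:.4f}")
print(f"Prediction: {'spam' if proba[1] > 0.5 else 'ham'}")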
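
As a cross-check, the same prediction can be reproduced by hand from the five training emails above. This is a sketch under two assumptions: that `CountVectorizer` with default settings splits these emails into the lowercase words shown, and that `MultinomialNB` estimates the class priors from label frequencies (its default `fit_prior=True`). The helper name `smoothed` and the variable `p_spam_manual` are made up for this example.

```python
from collections import Counter

emails = [
    "buy now click here",
    "buy click get rich",
    "click here buy cheap",
    "meeting agenda report",
    "quarterly report meeting",
]
labels = [1, 1, 1, 0, 0]  # 1=spam, 0=ham

# Per-class word counts
counts = {0: Counter(), 1: Counter()}
for text, label in zip(emails, labels):
    counts[label].update(text.split())

vocab = set(counts[0]) | set(counts[1])

def smoothed(word, label, alpha=1.0):
    """Add-alpha estimate of P(word | class)."""
    total = sum(counts[label].values())
    return (counts[label][word] + alpha) / (total + alpha * len(vocab))

# Priors from label frequencies: 2/5 ham, 3/5 spam
prior = {c: labels.count(c) / len(labels) for c in (0, 1)}

# Words outside the training vocabulary ("promotion") are skipped
new_words = [w for w in "buy click promotion".split() if w in vocab]

# Unnormalised score per class: prior × product of word likelihoods
score = dict(prior)
for c in (0, 1):
    for w in new_words:
        score[c] *= smoothed(w, c)

p_spam_manual = score[1] / (score[0] + score[1])
print(f"P(spam) = {p_spam_manual:.4f}")
```

With these counts the hand calculation gives P(spam) ≈ 0.9291. If the assumptions above hold, this should agree with `predict_proba` up to floating-point rounding.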

Why Laplace Smoothing?

Without smoothing: P("report" | spam) = 0/20 = 0
  → P(email with "report" | spam) = 0 × anything = 0
  → Any email containing "report" can never be classified as spam, whatever its other words
  → This is wrong: we just haven't seen this word in a spam email during training

With Laplace (add-1) smoothing:
  P("report" | spam) = (0 + 1) / (20 + 4) = 1/24 ≈ 0.042
  → Small but non-zero probability
  → "report" in an email provides evidence against spam (1/24 vs 4/24 for ham), not certainty
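
The effect is easy to see on an email that mixes words from both classes. The sketch below scores the hypothetical email "buy click report" with the counts from the training table, once without smoothing and once with add-1 smoothing. The function name `class_score`, its `alpha` parameter, and the `scores` dictionary are introduced here for illustration only.

```python
spam_word_counts = {"buy": 4, "click": 3, "meeting": 1, "report": 0}
ham_word_counts  = {"buy": 1, "click": 0, "meeting": 4, "report": 3}
TOTAL_WORDS = 20   # words per class, from the training data
VOCAB_SIZE = 4     # the four tabulated words

def class_score(words, counts, prior, alpha):
    """Prior × product of add-alpha word likelihoods (alpha=0 means no smoothing)."""
    score = prior
    for w in words:
        score *= (counts.get(w, 0) + alpha) / (TOTAL_WORDS + alpha * VOCAB_SIZE)
    return score

email = ["buy", "click", "report"]  # hypothetical email

scores = {}
for alpha in (0, 1):
    spam = class_score(email, spam_word_counts, 0.5, alpha)
    ham = class_score(email, ham_word_counts, 0.5, alpha)
    scores[alpha] = (spam, ham)
    print(f"alpha={alpha}: spam score={spam:.6f}, ham score={ham:.6f}")
```

Without smoothing both scores are exactly zero ("report" never appears in spam, "click" never appears in ham), so the classifier has nothing to compare. With add-1 smoothing the scores are 10/13824 and 4/13824, giving P(spam) = 10/14 ≈ 0.71.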

Interview Answer

"A spam filter is the canonical Naive Bayes application. We apply Bayes' theorem: P(spam | words) ∝ P(spam) × Π P(wordᵢ | spam), where the conditional independence assumption lets us multiply individual word probabilities. We estimate P(word | class) from training data using Laplace smoothing (add 1 to every count) to handle words never seen with a class in training; without smoothing, a single such word drives that class's joint probability to zero. In log space (to avoid underflow), this becomes: log P(spam | email) = log P(spam) + Σ log P(wordᵢ | spam) + constant, where the constant is the same for both classes. The class with the higher log score wins."
