Probability in Action: Spam Filter
A complete worked example applying joint, conditional, and Bayesian probability to build a spam classifier, showing all the calculations step by step.
The Problem
Build a spam filter. Given the words in an email, classify it as spam or not.
Training data:
5 spam emails
5 legitimate (ham) emails
Word counts observed:
Word | In spam | In ham
----------|---------|-------
"buy" | 4 | 1
"click" | 3 | 0
"meeting" | 1 | 4
"report" | 0 | 3
Total words in spam: 20
Total words in ham: 20
P(spam) = 5/10 = 0.50 (prior)
Naive Bayes: The Model
The Naive Bayes classifier applies Bayes' theorem under a conditional independence assumption: given the class, each word is treated as independent of the others.
P(spam | words) ∝ P(spam) × Π P(wordᵢ | spam)
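The proportionality sign hides a normalising constant. Written out in full, with w_1, ..., w_n as shorthand for the email's words (a notation introduced here for illustration), Bayes' theorem followed by the independence assumption gives:

```latex
\begin{aligned}
P(\text{spam} \mid w_1, \dots, w_n)
  &= \frac{P(\text{spam}) \, P(w_1, \dots, w_n \mid \text{spam})}{P(w_1, \dots, w_n)} \\
  &= \frac{P(\text{spam}) \prod_{i=1}^{n} P(w_i \mid \text{spam})}{P(w_1, \dots, w_n)}
\end{aligned}
```

The denominator is identical for spam and ham, so it can be dropped when comparing the two classes and recovered afterwards by dividing each score by the sum of both scores.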
For a new email containing "buy" and "click":
P(spam | buy, click) ∝ P(spam) × P(buy | spam) × P(click | spam)
P(ham | buy, click) ∝ P(ham) × P(buy | ham) × P(click | ham)
Step-by-Step Calculation
# Training counts
spam_word_counts = {"buy": 4, "click": 3, "meeting": 1, "report": 0}
ham_word_counts = {"buy": 1, "click": 0, "meeting": 4, "report": 3}
total_spam_words = 20
total_ham_words = 20
vocab_size = 4  # Laplace smoothing denominator; only the four tabulated words count as vocabulary here
# Laplace (add-1) smoothing to handle zero counts
# P(word | class) = (count + 1) / (total + vocab_size)
def word_probability(word: str, counts: dict, total: int, vocab: int) -> float:
    """Return the add-1 smoothed estimate of P(word | class)."""
    return (counts.get(word, 0) + 1) / (total + vocab)
# Email: "buy click"
email_words = ["buy", "click"]
# P(spam)
p_spam = 0.50
p_ham = 0.50
# P(buy | spam) = (4+1)/(20+4) = 5/24 ≈ 0.208
# P(click | spam) = (3+1)/(20+4) = 4/24 ≈ 0.167
p_email_given_spam = (
    p_spam
    * word_probability("buy", spam_word_counts, total_spam_words, vocab_size)
    * word_probability("click", spam_word_counts, total_spam_words, vocab_size)
)
print(f"P(spam) × P(buy|spam) × P(click|spam) = {p_email_given_spam:.6f}")
# 0.50 × (5/24) × (4/24) = 10/576 ≈ 0.01736
# P(buy | ham) = (1+1)/(20+4) = 2/24 ≈ 0.083
# P(click | ham) = (0+1)/(20+4) = 1/24 ≈ 0.042
p_email_given_ham = (
    p_ham
    * word_probability("buy", ham_word_counts, total_ham_words, vocab_size)
    * word_probability("click", ham_word_counts, total_ham_words, vocab_size)
)
print(f"P(ham) × P(buy|ham) × P(click|ham) = {p_email_given_ham:.6f}")
# 0.50 × (2/24) × (1/24) = 1/576 ≈ 0.00174
# Normalise to get proper probabilities
p_total = p_email_given_spam + p_email_given_ham
p_spam_given_email = p_email_given_spam / p_total
p_ham_given_email = p_email_given_ham / p_total
print(f"\nP(spam | 'buy click') = {p_spam_given_email:.4f}") # ~0.909
print(f"P(ham | 'buy click') = {p_ham_given_email:.4f}") # ~0.091
print("Classification: SPAM ✓")
Using Log Probabilities
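Multiplying many small probabilities causes numerical underflow: with word probabilities around 0.04, a few hundred factors are enough to round the product to 0.0 in double precision. Summing log probabilities avoids the problem.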
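A minimal sketch of the underflow itself, assuming a hypothetical 400-word email in which every word has the smoothed probability 1/24 (the numbers are chosen only for illustration):

```python
import math

# Hypothetical setup: 400 words, each with smoothed probability 1/24.
word_probs = [1 / 24] * 400

# Direct multiplication: the true value (about 1e-552) is below the smallest
# positive double, so the running product rounds to exactly 0.0.
product = math.prod(word_probs)

# Summing logs carries the same information in a well-scaled number.
log_sum = sum(math.log(p) for p in word_probs)

print(product)   # 0.0
print(log_sum)   # about -1271.2
```

Once both class scores have collapsed to 0.0 they can no longer be compared, whereas the log sums stay distinct. The function that follows applies the same idea to the spam filter.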
import math
def naive_bayes_log_predict(
    email_words: list[str],
    spam_counts: dict,
    ham_counts: dict,
    total_spam: int,
    total_ham: int,
    p_spam: float = 0.5,
    p_ham: float = 0.5,
) -> str:
    # Vocabulary = every word seen in either class
    vocab = set(spam_counts.keys()) | set(ham_counts.keys())
    vocab_size = len(vocab)
    # Start from the log priors, then add one log-likelihood per word
    log_p_spam = math.log(p_spam)
    log_p_ham = math.log(p_ham)
    for word in email_words:
        log_p_spam += math.log(word_probability(word, spam_counts, total_spam, vocab_size))
        log_p_ham += math.log(word_probability(word, ham_counts, total_ham, vocab_size))
    return "spam" if log_p_spam > log_p_ham else "ham"
result = naive_bayes_log_predict(
    email_words=["buy", "click"],
    spam_counts=spam_word_counts,
    ham_counts=ham_word_counts,
    total_spam=total_spam_words,
    total_ham=total_ham_words,
)
print(f"Classification: {result}")  # spam
Using scikit-learn
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
# Training emails
emails = [
    "buy now click here",
    "buy click get rich",
    "click here buy cheap",
    "meeting agenda report",
    "quarterly report meeting",
]
labels = [1, 1, 1, 0, 0] # 1=spam, 0=ham
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)
clf = MultinomialNB(alpha=1.0) # alpha=1 is Laplace smoothing
clf.fit(X, labels)
# "promotion" never appears in the training emails, so CountVectorizer ignores it
new_email = vectorizer.transform(["buy click promotion"])
proba = clf.predict_proba(new_email)[0]
print(f"P(ham) = {proba[0]:.4f}, P(spam) = {proba[1]:.4f}")
print(f"Prediction: {'spam' if proba[1] > 0.5 else 'ham'}")
Why Laplace Smoothing?
Without smoothing: P("report" | spam) = 0/20 = 0
→ P(email with "report" | spam) = 0 × anything = 0
→ Any email containing "report" can NEVER be spam (mathematically impossible)
→ This is wrong: we just haven't seen this word in a spam email during training
With Laplace (add-1) smoothing:
P("report" | spam) = (0 + 1) / (20 + 4) = 1/24 ≈ 0.042
→ Small but non-zero probability
→ "report" in an email provides weak evidence against spam, not certainty
Interview Answer
"A spam filter is the canonical Naive Bayes application. We apply Bayes' theorem: P(spam | words) ∝ P(spam) × Π P(wordᵢ | spam), where the conditional independence assumption lets us multiply individual word probabilities. We estimate P(word | class) from training data using Laplace smoothing (add 1 to all counts) to handle words never seen with a class in training; without smoothing, any such word drives the joint probability to zero. In log space (to avoid underflow), this becomes: log P(spam | email) = log P(spam) + Σ log P(wordᵢ | spam) + constant, where the constant is the same for both classes. The class with the higher log score wins."
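One way to see the log-space rule at work is to take the difference of the two class scores, which gives the log-odds of spam: the log prior odds plus one log-likelihood ratio per word. The sketch below assumes the counts from the worked example; the function names `log_likelihood_ratio` and `spam_log_odds` are hypothetical.

```python
import math

# Counts and totals copied from the worked example.
spam_counts = {"buy": 4, "click": 3, "meeting": 1, "report": 0}
ham_counts = {"buy": 1, "click": 0, "meeting": 4, "report": 3}
total_spam_words = total_ham_words = 20
vocab_size = 4


def log_likelihood_ratio(word):
    """log P(word | spam) - log P(word | ham), with add-1 smoothing."""
    p_spam_word = (spam_counts.get(word, 0) + 1) / (total_spam_words + vocab_size)
    p_ham_word = (ham_counts.get(word, 0) + 1) / (total_ham_words + vocab_size)
    return math.log(p_spam_word) - math.log(p_ham_word)


def spam_log_odds(words, p_spam=0.5):
    """Log prior odds plus one log-likelihood ratio per word."""
    log_odds = math.log(p_spam) - math.log(1 - p_spam)
    return log_odds + sum(log_likelihood_ratio(w) for w in words)


log_odds = spam_log_odds(["buy", "click"])
# The logistic function turns log-odds back into P(spam | email).
posterior = 1 / (1 + math.exp(-log_odds))
print(f"log-odds = {log_odds:.3f}, P(spam | email) = {posterior:.3f}")
# log-odds = 2.303, P(spam | email) = 0.909
```

A positive log-odds means spam wins. Here "buy" contributes log 2.5 and "click" contributes log 4, which sum to log 10 and match the 10:1 odds behind the 0.909 posterior computed earlier.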