AI/ML/NLP Research Track · Lesson 9 of 16
Project 1: Spam Detection
Project: Spam Detection with scikit-learn
Spam detection is a classic beginner ML project. It teaches the full supervised-learning loop: data loading, preprocessing, feature engineering, model training, evaluation, and iteration. By the end you will have a reusable pipeline and an understanding of the trade-offs that appear in real production classifiers.
What you will build:
- Binary text classifier (spam vs ham)
- Reusable preprocessing and model pipeline
- Evaluation report with precision, recall, and F1
- Error analysis to understand failures
- Threshold tuning for business constraints
Setup
pip install scikit-learn pandas matplotlib seaborn

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import (
    classification_report, confusion_matrix,
    roc_auc_score, precision_recall_curve
)

Step 1: Load and Explore the Data
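Use the UCI SMS Spam Collection dataset, documented as 5,574 messages labelled spam or ham (the class counts printed below sum to 5,572 for the CSV copy loaded here).
# Download from: https://archive.ics.uci.edu/ml/datasets/sms+spam+collection
# Assumes a CSV copy saved as spam.csv with a header row, labels in column 0 and message text in column 1
df = pd.read_csv("spam.csv", encoding="latin-1", usecols=[0, 1])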
df.columns = ["label", "text"]
print(df["label"].value_counts())
# ham 4825
# spam 747
# Imbalance: spam is ~13% of the dataset — important for evaluation
# Compare message lengths per class rather than over the whole dataset
print(df["text"].str.len().groupby(df["label"]).describe())
# Spam messages are typically longer (URLs, prize claims, etc.)

# Quick look at examples
print("SPAM examples:")
print(df[df["label"] == "spam"]["text"].head(3).values)
print("\nHAM examples:")
print(df[df["label"] == "ham"]["text"].head(3).values)

Step 2: Preprocessing
import re

def clean_text(text: str) -> str:
    text = text.lower()
    text = re.sub(r"http\S+|www\S+", " URL ", text)  # normalise URLs
    text = re.sub(r"\d+", " NUM ", text)             # normalise numbers
    text = re.sub(r"[^\w\s]", " ", text)             # remove punctuation
    text = re.sub(r"\s+", " ", text).strip()
    return text

df["clean_text"] = df["text"].apply(clean_text)
df["label_bin"] = (df["label"] == "spam").astype(int)  # 1=spam, 0=ham
print(df[["text", "clean_text"]].head(3))

Step 3: Train/Test Split
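The split in this step passes stratify=y so that the train and test sets keep roughly the same spam share. As a minimal sketch of what that argument does, the example below splits a made-up label array (100 toy labels invented for illustration, not the SMS data) with and without stratification:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 13 spam (1) and 87 ham (0), loosely mirroring a ~13% spam rate
y_toy = np.array([1] * 13 + [0] * 87)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder features

# Stratified: class proportions are preserved in both halves
_, _, y_tr_strat, y_te_strat = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
# Unstratified: the test spam share depends on the random draw
_, _, y_tr_plain, y_te_plain = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

print(f"Stratified test spam share:   {y_te_strat.mean():.2f}")
print(f"Unstratified test spam share: {y_te_plain.mean():.2f}")
```

With a minority class this small, an unlucky unstratified draw can leave the test set with very few spam messages, which makes recall estimates noisy.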
X = df["clean_text"]
y = df["label_bin"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y  # preserve class balance in both splits
)
print(f"Train: {len(X_train)} samples ({y_train.sum()} spam)")
print(f"Test: {len(X_test)} samples ({y_test.sum()} spam)")

Step 4: Build the Pipeline
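The TfidfVectorizer options used in this step decide which terms ever reach the model. As a rough illustration, the sketch below fits three vectorizers on a three-message toy corpus (invented here, not taken from the dataset) and compares the resulting vocabularies:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, made up for illustration
toy_corpus = [
    "free prize call now",
    "free prize claim now",
    "see you at lunch",
]

unigrams = TfidfVectorizer(ngram_range=(1, 1)).fit(toy_corpus)
with_bigrams = TfidfVectorizer(ngram_range=(1, 2)).fit(toy_corpus)
# min_df=2 keeps only terms that occur in at least two messages
frequent_only = TfidfVectorizer(ngram_range=(1, 2), min_df=2).fit(toy_corpus)

print(len(unigrams.vocabulary_), "unigram terms")
print(len(with_bigrams.vocabulary_), "unigram + bigram terms")
print(sorted(frequent_only.vocabulary_))
```

Bigrams roughly double the vocabulary here, while min_df=2 prunes it back to the terms shared across messages. On a real corpus the same two settings trade vocabulary size against the risk of fitting one-off phrases.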
# Logistic Regression with TF-IDF: a strong baseline for text classification
lr_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(
        ngram_range=(1, 2),   # unigrams + bigrams
        min_df=2,             # ignore terms appearing in fewer than 2 docs
        max_df=0.95,          # ignore terms appearing in >95% of docs (too common)
        sublinear_tf=True,    # apply 1 + log(tf) scaling
    )),
    ("model", LogisticRegression(
        C=1.0,
        max_iter=1000,
        class_weight="balanced",  # upweight the minority spam class
    ))
])
lr_pipeline.fit(X_train, y_train)

Step 5: Evaluate
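Precision, recall, and F1 in the classification report all derive from four confusion-matrix counts. A worked sketch with hypothetical counts (chosen for easy arithmetic, not the results of this lesson's model), cross-checked against scikit-learn's own metric functions:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical counts for the spam class
tp, fp, fn, tn = 8, 2, 4, 86

p = tp / (tp + fp)        # of everything flagged as spam, how much was spam
r = tp / (tp + fn)        # of all real spam, how much was caught
f1 = 2 * p * r / (p + r)  # harmonic mean of precision and recall

# Rebuild label arrays with the same counts to cross-check against scikit-learn
y_true_toy = [1] * tp + [0] * fp + [1] * fn + [0] * tn
y_pred_toy = [1] * tp + [1] * fp + [0] * fn + [0] * tn

print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
print(f"sklearn:   {precision_score(y_true_toy, y_pred_toy):.3f} "
      f"{recall_score(y_true_toy, y_pred_toy):.3f} "
      f"{f1_score(y_true_toy, y_pred_toy):.3f}")
```

Note that the true-negative count never enters any of the three formulas, which is why these metrics stay informative when ham dominates the data.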
y_pred = lr_pipeline.predict(X_test)
y_prob = lr_pipeline.predict_proba(X_test)[:, 1] # probability of spam
print(classification_report(y_test, y_pred, target_names=["ham", "spam"]))

              precision    recall  f1-score   support
         ham       0.99      0.99      0.99       966
        spam       0.97      0.94      0.96       149
    accuracy                           0.99      1115
   macro avg       0.98      0.97      0.97      1115
weighted avg       0.99      0.99      0.99      1115

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt="d",
            xticklabels=["ham", "spam"],
            yticklabels=["ham", "spam"])
plt.title("Confusion Matrix")
plt.ylabel("Actual")
plt.xlabel("Predicted")
plt.show()
# ROC-AUC
print(f"ROC-AUC: {roc_auc_score(y_test, y_prob):.4f}")

Step 6: Compare Multiple Models
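One practical difference among the models compared in this step: LinearSVC does not implement predict_proba, so anything that relies on spam probabilities needs either its decision_function scores or a calibrated wrapper. The sketch below shows both on a handful of invented toy messages; CalibratedClassifierCV is one possible choice assumed here, not something the lesson prescribes:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy messages, made up for illustration (1 = spam, 0 = ham)
texts = [
    "free prize claim now", "win cash prize today",
    "free entry win a phone", "claim your free reward now",
    "see you at lunch", "are we meeting today",
    "thanks for the lift home", "call me when you arrive",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

svm = Pipeline([("tfidf", TfidfVectorizer()), ("model", LinearSVC())])
svm.fit(texts, labels)
print(hasattr(svm, "predict_proba"))     # LinearSVC exposes no probabilities
scores = svm.decision_function(texts)    # signed distances to the hyperplane

# Wrapping the SVM in a calibrator adds predict_proba
calibrated = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("model", CalibratedClassifierCV(LinearSVC(), cv=2)),
])
calibrated.fit(texts, labels)
probs = calibrated.predict_proba(texts)[:, 1]
print(probs.round(2))
```

Raw decision_function scores are enough for ranking metrics such as ROC-AUC or for a precision-recall curve. Calibration only becomes necessary when a threshold has to be read as a probability.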
models = {
    "Logistic Regression": Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2, sublinear_tf=True)),
        ("model", LogisticRegression(C=1.0, max_iter=1000, class_weight="balanced")),
    ]),
    "Naive Bayes": Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2)),
        ("model", MultinomialNB(alpha=0.1)),
    ]),
    "Linear SVM": Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2, sublinear_tf=True)),
        ("model", LinearSVC(C=1.0, class_weight="balanced", max_iter=2000)),
    ]),
}
results = {}
for name, pipeline in models.items():
    # 5-fold cross-validation on training set
    scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring="f1")
    results[name] = {"mean_f1": scores.mean(), "std_f1": scores.std()}
    print(f"{name}: F1 = {scores.mean():.4f} ± {scores.std():.4f}")

Logistic Regression: F1 = 0.9623 ± 0.0089
Naive Bayes: F1 = 0.9451 ± 0.0112
Linear SVM: F1 = 0.9681 ± 0.0076

Step 7: Error Analysis
# Find misclassified messages
mask_fp = (y_pred == 1) & (y_test == 0) # false positives (ham classified as spam)
mask_fn = (y_pred == 0) & (y_test == 1) # false negatives (spam classified as ham)
fp_examples = X_test[mask_fp].values
fn_examples = X_test[mask_fn].values
print(f"False positives (ham → spam): {mask_fp.sum()}")
for ex in fp_examples[:5]:
    print(f" '{ex[:80]}'")
print(f"\nFalse negatives (spam → ham): {mask_fn.sum()}")
for ex in fn_examples[:5]:
    print(f" '{ex[:80]}'")

# Common false positive patterns:
# - Messages with many numbers (phone numbers, account numbers)
# - Marketing messages from legitimate companies
# - Messages containing words like "free" or "win" in ham context
# Common false negative patterns:
# - Sophisticated spam using proper grammar
# - Spam that avoids common trigger words
# - Very short spam messages with only a URL

Step 8: Threshold Tuning
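By default, predict() labels a message as spam when its predicted spam probability is at least 0.5, which implicitly treats both kinds of error as equally costly. For spam detection the business trade-off matters: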
- False positive (ham → spam): user misses a legitimate message. High cost.
- False negative (spam → ham): user sees spam. Lower cost.
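To make this trade-off concrete, the sketch below sweeps three thresholds over made-up spam probabilities (the labels and scores are invented for illustration) and records precision and recall at each:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Invented labels (1 = spam) and predicted spam probabilities
y_true_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_score_toy = np.array([0.05, 0.10, 0.20, 0.35, 0.55, 0.70, 0.40, 0.60, 0.80, 0.95])

sweep = []
for threshold in (0.3, 0.5, 0.75):
    y_hat = (y_score_toy >= threshold).astype(int)
    p = precision_score(y_true_toy, y_hat)
    r = recall_score(y_true_toy, y_hat)
    sweep.append((threshold, p, r))
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, raising the threshold from 0.3 to 0.75 lifts precision from 0.57 to 1.00 while recall falls from 1.00 to 0.50: fewer legitimate messages are flagged, at the price of letting more spam through.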
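# Find the lowest threshold that achieves 99% precision on spam
# (we'd rather miss some spam than flag legitimate messages)
# Note: picking the threshold on the test set makes the reported metrics optimistic;
# a separate validation split is the safer choice.
precision, recall, thresholds = precision_recall_curve(y_test, y_prob)
target_precision = 0.99
# precision and recall have one more entry than thresholds, so drop the final point
meets_target = precision[:-1] >= target_precision
if not meets_target.any():
    raise ValueError("No threshold reaches the target precision")
idx = np.argmax(meets_target)  # index of the first threshold that meets the target
optimal_threshold = thresholds[idx]
print(f"At threshold {optimal_threshold:.3f}:")
print(f" Precision: {precision[idx]:.4f}")
print(f" Recall: {recall[idx]:.4f}")
# Apply custom threshold
y_pred_strict = (y_prob >= optimal_threshold).astype(int)
print(classification_report(y_test, y_pred_strict, target_names=["ham", "spam"]))

Step 9: Inspect Feature Importance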
# Which words are most predictive of spam?
tfidf = lr_pipeline.named_steps["tfidf"]
model = lr_pipeline.named_steps["model"]
feature_names = tfidf.get_feature_names_out()
coef = model.coef_[0]
top_spam = np.argsort(coef)[-20:][::-1]
top_ham = np.argsort(coef)[:20]
print("Top SPAM indicators:")
for i in top_spam:
    print(f" {feature_names[i]:<25} {coef[i]:+.3f}")
print("\nTop HAM indicators:")
for i in top_ham:
    print(f" {feature_names[i]:<25} {coef[i]:+.3f}")

Top SPAM indicators:
 free                      +3.421
 txt                       +2.987
 won                       +2.876
 prize                     +2.654
 claim                     +2.543
 ...
Top HAM indicators:
 gt                        -2.341 (common in chat transcripts)
 ok                        -2.109
 i ll                      -1.987
 ...

Step 10: Save and Load the Pipeline
import joblib
# Save the fitted pipeline
joblib.dump(lr_pipeline, "spam_classifier.pkl")
# Load and use
loaded = joblib.load("spam_classifier.pkl")
test_messages = [
    "Congratulations! You've won a free iPhone. Call now.",
    "Hey, are we still meeting at 3pm today?",
]
predictions = loaded.predict(test_messages)
probabilities = loaded.predict_proba(test_messages)[:, 1]
for msg, pred, prob in zip(test_messages, predictions, probabilities):
    label = "SPAM" if pred == 1 else "HAM"
    print(f"[{label} {prob:.2f}] {msg[:60]}")

Real-World Improvements
Class imbalance: At 13% spam, standard accuracy is misleading. Always evaluate with F1 and ROC-AUC. Use class_weight="balanced" or oversample with SMOTE on the TF-IDF vectors.
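As a quick check of why accuracy misleads here, the sketch below scores a trivial always-ham predictor using the class counts printed in Step 1 (4,825 ham, 747 spam). DummyClassifier is scikit-learn's baseline estimator; the zero-filled feature matrix is a placeholder, since this baseline ignores its inputs:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Labels rebuilt from the class counts in Step 1 (0 = ham, 1 = spam)
y_all = np.array([0] * 4825 + [1] * 747)
X_placeholder = np.zeros((len(y_all), 1))  # ignored by the baseline

baseline = DummyClassifier(strategy="most_frequent").fit(X_placeholder, y_all)
y_hat = baseline.predict(X_placeholder)

acc = accuracy_score(y_all, y_hat)
spam_f1 = f1_score(y_all, y_hat, zero_division=0)
print(f"Accuracy: {acc:.3f}")     # Accuracy: 0.866
print(f"Spam F1:  {spam_f1:.3f}")  # Spam F1:  0.000
```

A model that never catches a single spam message still scores about 87% accuracy, so any real classifier should be judged against that floor rather than against 50%.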
Threshold for business risk: False positives (ham → spam) cost more than false negatives in most contexts. Tune the threshold to the business constraint: at 99% precision, only about 1 in 100 messages flagged as spam is actually legitimate.
Feature engineering: Add message-level features alongside TF-IDF: character count, URL count, digit count, uppercase ratio. These capture spam signals that bag-of-words misses.
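One way to combine such features with TF-IDF is a FeatureUnion, sketched below. The helper name message_stats, the toy messages, and the choice of StandardScaler are assumptions made for illustration. The sketch works on raw text rather than the cleaned column, because lowercasing and number normalisation would erase the uppercase and digit signals:

```python
import re
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

def message_stats(texts):
    """Hypothetical helper: [char count, URL count, digit count, uppercase ratio] per message."""
    rows = []
    for t in texts:
        n_chars = len(t)
        n_urls = len(re.findall(r"http\S+|www\S+", t, flags=re.IGNORECASE))
        n_digits = sum(ch.isdigit() for ch in t)
        upper_ratio = sum(ch.isupper() for ch in t) / max(n_chars, 1)
        rows.append([n_chars, n_urls, n_digits, upper_ratio])
    return np.array(rows, dtype=float)

# TF-IDF columns and scaled message-level columns are stacked side by side
features = FeatureUnion([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)),
    ("stats", Pipeline([
        ("extract", FunctionTransformer(message_stats)),
        ("scale", StandardScaler()),
    ])),
])

# Toy messages, made up for illustration (1 = spam, 0 = ham)
texts = [
    "WIN a FREE prize now! Visit www.example.com or call 0800123456",
    "URGENT: claim your 1000 cash reward at http://example.com",
    "Are we still meeting for lunch today?",
    "Thanks, see you at home later",
]
labels = [1, 1, 0, 0]

model = Pipeline([("features", features), ("clf", LogisticRegression(max_iter=1000))])
model.fit(texts, labels)
print(model.predict(["FREE cash prize, call 0800123456 now"]))
```

Scaling the count features keeps them on a range comparable to the TF-IDF values, so a single long message does not dominate the linear model.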
Temporal drift: Spammers adapt. Monitor your classifier's precision/recall on new messages weekly. Retrain quarterly or when precision drops below a threshold.
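A monitoring check along these lines can be very small. The sketch below assumes each week yields a batch of messages with human-verified labels. The helper name check_batch, the 0.95 precision floor, and the toy batches are all hypothetical:

```python
from sklearn.metrics import precision_score, recall_score

def check_batch(y_true, y_pred, min_precision=0.95):
    """Hypothetical helper: score one labelled batch and flag it if precision is too low."""
    p = precision_score(y_true, y_pred, zero_division=0)
    r = recall_score(y_true, y_pred, zero_division=0)
    return {"precision": p, "recall": r, "retrain": bool(p < min_precision)}

# Toy weekly batches of (true labels, predicted labels), made up for illustration
weekly_batches = {
    "week 1": ([1, 1, 0, 0, 0, 1], [1, 1, 0, 0, 0, 1]),
    "week 2": ([1, 1, 0, 0, 0, 1], [1, 1, 1, 0, 0, 0]),
}

reports = {week: check_batch(y_true, y_pred)
           for week, (y_true, y_pred) in weekly_batches.items()}
for week, report in reports.items():
    print(week, report)
```

In the second toy batch one ham message is flagged and one spam message slips through, so precision drops to two thirds and the retrain flag is raised.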
Deliverables Checklist
[ ] Notebook with data exploration (class distribution, message length analysis)
[ ] Preprocessing pipeline (cleaning function + rationale for each step)
[ ] Baseline model (Logistic Regression) with classification report
[ ] Model comparison table (at least 3 models with cross-validated F1)
[ ] Confusion matrix visualisation
[ ] Error analysis (10 false positives + 10 false negatives with explanations)
[ ] Threshold tuning section with precision-recall trade-off plot
[ ] Feature importance chart (top 20 spam and ham words)
[ ] Saved pipeline with load-and-predict demonstration
[ ] README with setup instructions and deployment considerations