AI/ML/NLP Research Track · Lesson 9 of 16

Project: Spam Detection with scikit-learn

Spam detection is a classic first ML project because it exercises the full supervised-learning loop: data loading, preprocessing, feature engineering, model training, evaluation, and iteration. By the end you will have a reusable pipeline and an understanding of the trade-offs that appear in real production classifiers.

What you will build:

  • Binary text classifier (spam vs ham)
  • Reusable preprocessing and model pipeline
  • Evaluation report with precision, recall, and F1
  • Error analysis to understand failures
  • Threshold tuning for business constraints

Setup

Bash
pip install scikit-learn pandas matplotlib seaborn
Python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import (
    classification_report, confusion_matrix,
    roc_auc_score, precision_recall_curve
)

Step 1: Load and Explore the Data

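Use the UCI SMS Spam Collection dataset: 5,574 SMS messages labelled spam or ham. The spam.csv copy loaded below yields 5,572 rows (4,825 ham + 747 spam), slightly fewer than the official count.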

Python
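# Download from: https://archive.ics.uci.edu/ml/datasets/sms+spam+collection
# The call below assumes a comma-separated spam.csv with a header row, whose first
# two columns are the label and the message text.
# The raw UCI file (SMSSpamCollection) is tab-separated with no header; for that, use:
#   pd.read_csv("SMSSpamCollection", sep="\t", header=None, names=["label", "text"])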
df = pd.read_csv("spam.csv", encoding="latin-1", usecols=[0, 1])
df.columns = ["label", "text"]

print(df["label"].value_counts())
# ham     4825
# spam     747
# Imbalance: spam is ~13% of the dataset, which matters for evaluation

# Compare message length by class
print(df["text"].str.len().groupby(df["label"]).describe())
# Spam messages are typically longer (URLs, prize claims, etc.)
Python
# Quick look at examples
print("SPAM examples:")
print(df[df["label"] == "spam"]["text"].head(3).values)

print("\nHAM examples:")
print(df[df["label"] == "ham"]["text"].head(3).values)

Step 2: Preprocessing

Python
import re

def clean_text(text: str) -> str:
    text = text.lower()
    text = re.sub(r"http\S+|www\S+", " URL ", text)   # normalise URLs
    text = re.sub(r"\d+", " NUM ", text)               # normalise numbers
    text = re.sub(r"[^\w\s]", " ", text)               # remove punctuation
    text = re.sub(r"\s+", " ", text).strip()
    return text

df["clean_text"] = df["text"].apply(clean_text)
df["label_bin"] = (df["label"] == "spam").astype(int)  # 1=spam, 0=ham

print(df[["text", "clean_text"]].head(3))

Step 3: Train/Test Split

Python
X = df["clean_text"]
y = df["label_bin"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y    # preserve class balance in both splits
)

print(f"Train: {len(X_train)} samples ({y_train.sum()} spam)")
print(f"Test:  {len(X_test)} samples ({y_test.sum()} spam)")

Step 4: Build the Pipeline

Python
# Logistic Regression with TF-IDF: a strong baseline for text classification
lr_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(
        ngram_range=(1, 2),   # unigrams + bigrams
        min_df=2,             # ignore terms appearing in fewer than 2 docs
        max_df=0.95,          # ignore terms appearing in >95% of docs (too common)
        sublinear_tf=True,    # replace tf with 1 + log(tf)
    )),
    ("model", LogisticRegression(
        C=1.0,
        max_iter=1000,
        class_weight="balanced",   # upweight the minority spam class
    ))
])

lr_pipeline.fit(X_train, y_train)
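
To make the sublinear_tf option concrete: scikit-learn documents it as replacing the raw term count tf with 1 + log(tf), which stops a word repeated many times from dominating a message's vector. The sketch below is plain Python for illustration only, not scikit-learn's implementation, and the helper name sublinear is made up.

```python
import math

def sublinear(tf: int) -> float:
    """Dampened term frequency: 1 + ln(tf) for tf > 0, else 0."""
    return 1.0 + math.log(tf) if tf > 0 else 0.0

# Compare raw counts with their dampened values
for tf in (1, 2, 10, 100):
    print(f"tf={tf:>3}  ->  {sublinear(tf):.3f}")
```

A word that appears 100 times ends up weighted roughly 5.6 times a single occurrence rather than 100 times.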

Step 5: Evaluate

Python
y_pred = lr_pipeline.predict(X_test)
y_prob = lr_pipeline.predict_proba(X_test)[:, 1]   # probability of spam

print(classification_report(y_test, y_pred, target_names=["ham", "spam"]))
              precision    recall  f1-score   support

         ham       0.99      0.99      0.99       966
        spam       0.97      0.94      0.96       149

    accuracy                           0.99      1115
   macro avg       0.98      0.97      0.97      1115
weighted avg       0.99      0.99      0.99      1115
Python
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt="d",
            xticklabels=["ham", "spam"],
            yticklabels=["ham", "spam"])
plt.title("Confusion Matrix")
plt.ylabel("Actual")
plt.xlabel("Predicted")
plt.show()

# ROC-AUC
print(f"ROC-AUC: {roc_auc_score(y_test, y_prob):.4f}")
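
The precision, recall, and F1 figures in the report all derive from the four cells of the confusion matrix. A minimal sketch of that arithmetic follows. The counts are hypothetical, chosen only so that they sum to the 966 ham and 149 spam test messages; they are not the actual output of the model above. The helper name spam_metrics is also an assumption.

```python
def spam_metrics(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Compute metrics for the positive (spam) class from confusion-matrix cells."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # of flagged messages, how many are spam
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # of actual spam, how much was caught
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Hypothetical counts: tn + fp = 966 ham, fn + tp = 149 spam
# With scikit-learn (binary case): tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
m = spam_metrics(tn=962, fp=4, fn=9, tp=140)
for name, value in m.items():
    print(f"{name:<10} {value:.4f}")
```

Note how accuracy stays high even when several spam messages slip through, because ham dominates the denominator.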

Step 6: Compare Multiple Models

Python
models = {
    "Logistic Regression": Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2, sublinear_tf=True)),
        ("model", LogisticRegression(C=1.0, max_iter=1000, class_weight="balanced")),
    ]),
    "Naive Bayes": Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2)),
        ("model", MultinomialNB(alpha=0.1)),
    ]),
    "Linear SVM": Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2, sublinear_tf=True)),
        ("model", LinearSVC(C=1.0, class_weight="balanced", max_iter=2000)),
    ]),
}

results = {}
for name, pipeline in models.items():
    # 5-fold cross-validation on training set
    scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring="f1")
    results[name] = {"mean_f1": scores.mean(), "std_f1": scores.std()}
    print(f"{name}: F1 = {scores.mean():.4f} ± {scores.std():.4f}")
Logistic Regression: F1 = 0.9623 ± 0.0089
Naive Bayes:         F1 = 0.9451 ± 0.0112
Linear SVM:          F1 = 0.9681 ± 0.0076

Step 7: Error Analysis

Python
# Find misclassified messages
mask_fp = (y_pred == 1) & (y_test == 0)   # false positives (ham classified as spam)
mask_fn = (y_pred == 0) & (y_test == 1)   # false negatives (spam classified as ham)

fp_examples = X_test[mask_fp].values
fn_examples = X_test[mask_fn].values

print(f"False positives (ham → spam): {mask_fp.sum()}")
for ex in fp_examples[:5]:
    print(f"  '{ex[:80]}'")

print(f"\nFalse negatives (spam → ham): {mask_fn.sum()}")
for ex in fn_examples[:5]:
    print(f"  '{ex[:80]}'")
# Common false positive patterns:
#   Messages with many numbers (phone numbers, account numbers)
#   Marketing messages from legitimate companies
#   Messages containing words like "free" or "win" in ham context

# Common false negative patterns:
#   Sophisticated spam using proper grammar
#   Spam that avoids common trigger words
#   Very short spam messages with only a URL

Step 8: Threshold Tuning

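The default threshold of 0.5 is not tuned to any cost trade-off: it simply predicts spam whenever the model's estimated spam probability exceeds its ham probability. For spam detection the business trade-off matters: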

  • False positive (ham → spam): user misses a legitimate message. High cost.
  • False negative (spam → ham): user sees spam. Lower cost.
Python
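# Find threshold that achieves 99% precision on spam
# (we'd rather miss some spam than flag legitimate messages)
# Note: in a real project, choose the threshold on a validation split, not the test set
precision, recall, thresholds = precision_recall_curve(y_test, y_prob)

target_precision = 0.99
# precision and recall have one more entry than thresholds, so drop the final point
meets_target = precision[:-1] >= target_precision
if not meets_target.any():
    raise ValueError(f"No threshold reaches {target_precision:.0%} precision")
idx = np.argmax(meets_target)   # lowest threshold that meets the target
optimal_threshold = thresholds[idx]
print(f"At threshold {optimal_threshold:.3f}:")
print(f"  Precision: {precision[idx]:.4f}")
print(f"  Recall:    {recall[idx]:.4f}")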

# Apply custom threshold
y_pred_strict = (y_prob >= optimal_threshold).astype(int)
print(classification_report(y_test, y_pred_strict, target_names=["ham", "spam"]))
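
Choosing the threshold on the test set lets test labels influence the final classifier, so the reported precision tends to be optimistic. A hedged sketch of a reusable helper is below. pick_threshold is a hypothetical name, and the toy arrays stand in for labels and spam probabilities from a held-out validation split, which the lesson does not create.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def pick_threshold(y_true, y_score, target_precision):
    """Return (threshold, precision, recall) for the lowest threshold whose
    precision meets the target, or None if no threshold does."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # precision/recall carry one extra trailing point with no matching threshold
    meets_target = precision[:-1] >= target_precision
    if not meets_target.any():
        return None
    idx = int(np.argmax(meets_target))   # first True = lowest qualifying threshold
    return float(thresholds[idx]), float(precision[idx]), float(recall[idx])

# Toy validation data (made up): 1 = spam, 0 = ham
y_val = np.array([0, 0, 1, 0, 1, 1, 0, 1])
p_val = np.array([0.05, 0.20, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90])

print(pick_threshold(y_val, p_val, target_precision=0.75))
print(pick_threshold(y_val, p_val, target_precision=1.0))
```

The threshold found this way can then be applied once to the test set to get an unbiased estimate of precision and recall.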

Step 9: Inspect Feature Importance

Python
# Which words are most predictive of spam?
tfidf = lr_pipeline.named_steps["tfidf"]
model = lr_pipeline.named_steps["model"]

feature_names = tfidf.get_feature_names_out()
coef = model.coef_[0]

top_spam = np.argsort(coef)[-20:][::-1]
top_ham  = np.argsort(coef)[:20]

print("Top SPAM indicators:")
for i in top_spam:
    print(f"  {feature_names[i]:<25} {coef[i]:+.3f}")

print("\nTop HAM indicators:")
for i in top_ham:
    print(f"  {feature_names[i]:<25} {coef[i]:+.3f}")
Top SPAM indicators:
  free                       +3.421
  txt                        +2.987
  won                        +2.876
  prize                      +2.654
  claim                      +2.543
  ...

Top HAM indicators:
  gt                         -2.341   (common in chat transcripts)
  ok                         -2.109
  i ll                       -1.987
  ...

Step 10: Save and Load the Pipeline

Python
import joblib

# Save the fitted pipeline
joblib.dump(lr_pipeline, "spam_classifier.pkl")

# Load and use
loaded = joblib.load("spam_classifier.pkl")
test_messages = [
    "Congratulations! You've won a free iPhone. Call now.",
    "Hey, are we still meeting at 3pm today?",
]
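# The pipeline was fitted on cleaned text, so clean new messages the same way
cleaned_messages = [clean_text(m) for m in test_messages]
predictions = loaded.predict(cleaned_messages)
probabilities = loaded.predict_proba(cleaned_messages)[:, 1]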

for msg, pred, prob in zip(test_messages, predictions, probabilities):
    label = "SPAM" if pred == 1 else "HAM"
    print(f"[{label} {prob:.2f}] {msg[:60]}")
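
The saved pipeline was fitted on the output of clean_text, so it expects cleaned input at prediction time. One way to make that hard to forget is to pass the cleaning function to TfidfVectorizer through its preprocessor argument, so the pipeline itself accepts raw strings. The sketch below uses a tiny made-up corpus and a cleaner named normalise (both hypothetical). Two caveats: a custom preprocessor replaces the vectorizer's built-in lowercasing, so the function must lowercase on its own, and a pickled pipeline stores only a reference to the function, so it must be importable wherever the pipeline is loaded.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def normalise(text: str) -> str:
    """Same cleaning steps as the lesson, with lowercase placeholder tokens."""
    text = text.lower()
    text = re.sub(r"http\S+|www\S+", " url ", text)   # normalise URLs
    text = re.sub(r"\d+", " num ", text)              # normalise numbers
    text = re.sub(r"[^\w\s]", " ", text)              # remove punctuation
    return re.sub(r"\s+", " ", text).strip()

# Cleaning now happens inside the vectorizer, on every fit and predict call
raw_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(preprocessor=normalise)),
    ("model", LogisticRegression(max_iter=1000)),
])

# Made-up toy data, only to show the mechanics
toy_texts = [
    "WIN a FREE prize!!! Visit http://example.com now",
    "Free entry: text 80082 to claim your prize",
    "Are we still meeting at 3pm today?",
    "Ok, see you at home later",
]
toy_labels = [1, 1, 0, 0]

raw_pipeline.fit(toy_texts, toy_labels)
vocab = raw_pipeline.named_steps["tfidf"].vocabulary_
print("url" in vocab, "num" in vocab)
print(raw_pipeline.predict(["FREE prize, visit www.example.org"]))
```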

Real-World Improvements

Class imbalance: At roughly 13% spam, accuracy on its own is misleading. Always evaluate with F1 and ROC-AUC. Use class_weight="balanced", or oversample the training split only (for example with SMOTE on the TF-IDF vectors) so that no synthetic samples leak into evaluation.
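
To see why accuracy misleads here, consider a degenerate classifier that labels every message ham. Using the class counts printed in Step 1, a quick calculation (plain Python, no model involved) shows how good it looks on accuracy alone:

```python
ham, spam = 4825, 747          # counts from df["label"].value_counts()
total = ham + spam

# "Always ham" baseline: every ham is right, every spam is missed
accuracy = ham / total
spam_recall = 0 / spam

print(f"Accuracy:    {accuracy:.3f}")     # 0.866
print(f"Spam recall: {spam_recall:.3f}")  # 0.000
```

Any real model therefore needs to be judged against roughly 87% accuracy as the floor, and on spam-class metrics rather than accuracy.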

Threshold for business risk: False positives (ham → spam) cost more than false negatives in most contexts. Tune the threshold to the business constraint: at 99% precision, only about 1 in 100 flagged messages is legitimate.

Feature engineering: Add message-level features alongside TF-IDF: character count, URL count, digit count, uppercase ratio. These capture spam signals that bag-of-words misses.
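
A minimal sketch of such message-level features, using only the standard library. The function name message_features and the exact definitions (for instance, uppercase ratio measured over letters only) are illustrative assumptions, not part of the lesson's pipeline. These features should be computed on the raw text, before clean_text lowercases it and replaces URLs and digits.

```python
import re

URL_RE = re.compile(r"http\S+|www\S+", re.IGNORECASE)

def message_features(text: str) -> dict:
    """Hand-crafted features computed from the raw (uncleaned) message."""
    letters = [c for c in text if c.isalpha()]
    upper = sum(c.isupper() for c in letters)
    return {
        "char_count": len(text),
        "url_count": len(URL_RE.findall(text)),
        "digit_count": sum(c.isdigit() for c in text),
        "upper_ratio": upper / len(letters) if letters else 0.0,
    }

print(message_features("URGENT! Call 0800 123 456 or visit www.example.com"))
print(message_features("ok see you later"))
```

To feed these into a model alongside TF-IDF, one option is to stack them next to the TF-IDF matrix, for example with scikit-learn's ColumnTransformer; scaling the numeric columns first is usually worthwhile for linear models.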

Temporal drift: Spammers adapt. Monitor your classifier's precision and recall on newly labelled messages weekly. Retrain quarterly, or sooner if precision drops below an agreed floor.
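
One possible shape for that monitoring check, assuming you collect a labelled sample of recent messages each week. The function name needs_retraining and the 0.97 precision floor are hypothetical placeholders; choose a floor that matches your own constraint.

```python
def needs_retraining(y_true, y_pred, precision_floor=0.97):
    """Compute spam precision/recall on a recent labelled batch (1 = spam, 0 = ham)
    and flag the model when precision falls below the floor."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    # Precision is undefined when nothing was flagged; report None in that case
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    retrain = precision is not None and precision < precision_floor
    return {"precision": precision, "recall": recall, "retrain": retrain}

# Made-up weekly batch: one ham message was wrongly flagged, one spam was missed
report = needs_retraining(y_true=[1, 1, 0, 0, 1], y_pred=[1, 1, 1, 0, 0])
print(report)
```

Logging the weekly precision and recall alongside the flag makes gradual drift visible before the floor is actually breached.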


Deliverables Checklist

[ ] Notebook with data exploration (class distribution, message length analysis)
[ ] Preprocessing pipeline (cleaning function + rationale for each step)
[ ] Baseline model (Logistic Regression) with classification report
[ ] Model comparison table (at least 3 models with cross-validated F1)
[ ] Confusion matrix visualisation
[ ] Error analysis (10 false positives + 10 false negatives with explanations)
[ ] Threshold tuning section with precision-recall trade-off plot
[ ] Feature importance chart (top 20 spam and ham words)
[ ] Saved pipeline with load-and-predict demonstration
[ ] README with setup instructions and deployment considerations