Project: Spam Detection with scikit-learn

Spam detection is a classic beginner ML project because it exercises the full supervised-learning loop: data loading, preprocessing, feature engineering, model training, evaluation, and iteration. By the end you will have a reusable pipeline and a working understanding of the trade-offs that appear in production classifiers.

What you will build:

Binary text classifier (spam vs ham)
Reusable preprocessing and model pipeline
Evaluation report with precision, recall, and F1
Error analysis to understand failures
Threshold tuning for business constraints

Setup

Bash

pip install scikit-learn pandas matplotlib seaborn

Python

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import (
    classification_report, confusion_matrix,
    roc_auc_score, precision_recall_curve
)

Step 1: Load and Explore the Data

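Use the UCI SMS Spam Collection dataset: SMS messages labelled spam or ham. The dataset description lists 5,574 messages; the CSV loaded below yields 5,572 rows (4,825 ham + 747 spam).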

Python

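# Source: https://archive.ics.uci.edu/ml/datasets/sms+spam+collection
# This code expects the data as a CSV named spam.csv with the label in
# column 0 and the message text in column 1; adjust the path, separator,
# and columns if your copy is laid out differently.
df = pd.read_csv("spam.csv", encoding="latin-1", usecols=[0, 1])
df.columns = ["label", "text"]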

print(df["label"].value_counts())
# ham     4825
# spam     747
# Imbalance: spam is ~13% of the dataset — important for evaluation

# Message length per class, so the two groups can be compared directly
print(df["text"].str.len().groupby(df["label"]).describe())
# Spam messages are typically longer (URLs, prize claims, etc.)

Python

# Quick look at examples
print("SPAM examples:")
print(df[df["label"] == "spam"]["text"].head(3).values)

print("\nHAM examples:")
print(df[df["label"] == "ham"]["text"].head(3).values)

Step 2: Preprocessing

Python

import re

def clean_text(text: str) -> str:
    text = text.lower()
    text = re.sub(r"http\S+|www\S+", " URL ", text)   # normalise URLs
    text = re.sub(r"\d+", " NUM ", text)               # normalise numbers
    text = re.sub(r"[^\w\s]", " ", text)               # remove punctuation
    text = re.sub(r"\s+", " ", text).strip()
    return text

df["clean_text"] = df["text"].apply(clean_text)
df["label_bin"] = (df["label"] == "spam").astype(int)  # 1=spam, 0=ham

print(df[["text", "clean_text"]].head(3))
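
To see what each cleaning step does, it can help to run the function on a couple of hand-written messages. The sketch below repeats the clean_text function from this step so it runs on its own; the two sample messages are hypothetical and not taken from the dataset.

```python
import re

def clean_text(text: str) -> str:
    # Same steps, in the same order, as the clean_text function above
    text = text.lower()
    text = re.sub(r"http\S+|www\S+", " URL ", text)   # URLs first, before digits/punctuation
    text = re.sub(r"\d+", " NUM ", text)
    text = re.sub(r"[^\w\s]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

# Hypothetical messages, written for illustration only
samples = [
    "WINNER!! Call 09061701461 now, visit http://example.com/claim",
    "Ok, see you at 3pm :)",
]
cleaned = [clean_text(s) for s in samples]
for raw, out in zip(samples, cleaned):
    print(f"{raw!r} -> {out!r}")
# 'WINNER!! Call 09061701461 now, visit http://example.com/claim' -> 'winner call NUM now visit URL'
# 'Ok, see you at 3pm :)' -> 'ok see you at NUM pm'
```

The order of the substitutions matters: URLs are replaced before digits and punctuation, otherwise a URL would be broken into fragments before the URL pattern could match it.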

Step 3: Train/Test Split

Python

X = df["clean_text"]
y = df["label_bin"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y    # preserve class balance in both splits
)

print(f"Train: {len(X_train)} samples ({y_train.sum()} spam)")
print(f"Test:  {len(X_test)} samples ({y_test.sum()} spam)")
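
The effect of stratify=y can be checked on synthetic labels. This is a minimal sketch, assuming a made-up label vector with the same 13% spam rate as the SMS data; the placeholder features and the helper name spam_rates are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same imbalance as the SMS data (13% spam)
y = np.array([1] * 130 + [0] * 870)
X = np.arange(len(y)).reshape(-1, 1)   # placeholder features, not real text

def spam_rates(stratify):
    """Split once and return the spam fraction in the train and test parts."""
    _, _, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=0, stratify=stratify
    )
    return y_tr.mean(), y_te.mean()

strat_train, strat_test = spam_rates(stratify=y)
plain_train, plain_test = spam_rates(stratify=None)

print(f"stratified:   train {strat_train:.3f}, test {strat_test:.3f}")
print(f"unstratified: train {plain_train:.3f}, test {plain_test:.3f}")
```

With stratification both splits keep the 13% rate; without it the test-set rate drifts with the random seed, which makes metrics on the minority class noisier.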

Step 4: Build the Pipeline

Python

# Logistic Regression with TF-IDF — strong baseline for text classification
lr_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(
        ngram_range=(1, 2),   # unigrams + bigrams
        min_df=2,             # ignore terms appearing in fewer than 2 docs
        max_df=0.95,          # ignore terms appearing in >95% of docs (too common)
        sublinear_tf=True,    # apply log(1 + tf) scaling
    )),
    ("model", LogisticRegression(
        C=1.0,
        max_iter=1000,
        class_weight="balanced",   # upweight the minority spam class
    ))
])

lr_pipeline.fit(X_train, y_train)
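
To make class_weight="balanced" concrete, the sketch below works out the weights by hand using the class counts printed in Step 1. It uses the full-dataset counts for simplicity; the pipeline above computes the weights from y_train, which has the same class ratio because of the stratified split.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Class counts from the value_counts() output in Step 1
n_ham, n_spam = 4825, 747
y = np.array([0] * n_ham + [1] * n_spam)

# "balanced" weight for class c: n_samples / (n_classes * count_c)
manual = len(y) / (2 * np.bincount(y))

# scikit-learn's helper applies the same formula
sk = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)

print(f"ham weight:  {manual[0]:.3f}")   # 0.577
print(f"spam weight: {manual[1]:.3f}")   # 3.730
```

Each spam message therefore counts roughly 6.5 times as much as a ham message in the loss, which pushes the model towards higher spam recall at some cost in precision.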

Step 5: Evaluate

Python

y_pred = lr_pipeline.predict(X_test)
y_prob = lr_pipeline.predict_proba(X_test)[:, 1]   # probability of spam

print(classification_report(y_test, y_pred, target_names=["ham", "spam"]))

              precision    recall  f1-score   support

         ham       0.99      0.99      0.99       966
        spam       0.97      0.94      0.96       149

    accuracy                           0.99      1115
   macro avg       0.98      0.97      0.97      1115
weighted avg       0.99      0.99      0.99      1115

Python

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt="d",
            xticklabels=["ham", "spam"],
            yticklabels=["ham", "spam"])
plt.title("Confusion Matrix")
plt.ylabel("Actual")
plt.xlabel("Predicted")
plt.show()

# ROC-AUC
print(f"ROC-AUC: {roc_auc_score(y_test, y_prob):.4f}")
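
The numbers in the classification report follow directly from the confusion matrix. The sketch below uses hypothetical counts, chosen only so that they round to the spam row of the report above; they are not actual model output.

```python
import numpy as np

# Hypothetical confusion matrix: rows = actual, columns = predicted,
# class order [ham, spam]. Counts are illustrative, not real results.
cm = np.array([[962,   4],
               [  9, 140]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # of messages flagged as spam, how many were spam
recall = tp / (tp + fn)      # of actual spam, how much was caught
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.97 recall=0.94 f1=0.96
```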

Step 6: Compare Multiple Models

Python

models = {
    "Logistic Regression": Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2, sublinear_tf=True)),
        ("model", LogisticRegression(C=1.0, max_iter=1000, class_weight="balanced")),
    ]),
    "Naive Bayes": Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2)),
        ("model", MultinomialNB(alpha=0.1)),
    ]),
    "Linear SVM": Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2, sublinear_tf=True)),
        ("model", LinearSVC(C=1.0, class_weight="balanced", max_iter=2000)),
    ]),
}

results = {}
for name, pipeline in models.items():
    # 5-fold cross-validation on training set
    scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring="f1")
    results[name] = {"mean_f1": scores.mean(), "std_f1": scores.std()}
    print(f"{name}: F1 = {scores.mean():.4f} ± {scores.std():.4f}")

Logistic Regression: F1 = 0.9623 ± 0.0089
Naive Bayes:         F1 = 0.9451 ± 0.0112
Linear SVM:          F1 = 0.9681 ± 0.0076
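
The results dictionary can be turned into a comparison table with pandas. A minimal sketch, with the scores copied from the cross-validation output above:

```python
import pandas as pd

# Scores copied from the cross-validation output above
results = {
    "Logistic Regression": {"mean_f1": 0.9623, "std_f1": 0.0089},
    "Naive Bayes":         {"mean_f1": 0.9451, "std_f1": 0.0112},
    "Linear SVM":          {"mean_f1": 0.9681, "std_f1": 0.0076},
}

table = (
    pd.DataFrame.from_dict(results, orient="index")
      .sort_values("mean_f1", ascending=False)
)
print(table)

# Gap between the top two models, to compare against the fold-to-fold spread
gap = table.loc["Linear SVM", "mean_f1"] - table.loc["Logistic Regression", "mean_f1"]
print(f"SVM - LR gap: {gap:.4f}")
```

The gap between Linear SVM and Logistic Regression (0.0058) is smaller than either model's standard deviation across folds, so the ranking of the top two is not conclusive. Note also that LinearSVC does not provide predict_proba; threshold tuning on an SVM would need decision_function scores or a calibration wrapper such as CalibratedClassifierCV.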

Step 7: Error Analysis

Python

# Find misclassified messages
mask_fp = (y_pred == 1) & (y_test == 0)   # false positives (ham classified as spam)
mask_fn = (y_pred == 0) & (y_test == 1)   # false negatives (spam classified as ham)

fp_examples = X_test[mask_fp].values
fn_examples = X_test[mask_fn].values

print(f"False positives (ham → spam): {mask_fp.sum()}")
for ex in fp_examples[:5]:
    print(f"  '{ex[:80]}'")

print(f"\nFalse negatives (spam → ham): {mask_fn.sum()}")
for ex in fn_examples[:5]:
    print(f"  '{ex[:80]}'")

# Common false positive patterns:
#   Messages with many numbers (phone numbers, account numbers)
#   Marketing messages from legitimate companies
#   Messages containing words like "free" or "win" in ham context

# Common false negative patterns:
#   Sophisticated spam using proper grammar
#   Spam that avoids common trigger words
#   Very short spam messages with only a URL
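
Because X_test holds the cleaned text, the examples printed above have already lost punctuation, casing, and digits. For error analysis it is often easier to read the original message, which can be recovered through the shared index. The sketch below uses a tiny stand-in DataFrame with hypothetical rows and predictions.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the tutorial's objects (hypothetical rows and predictions)
df = pd.DataFrame(
    {
        "text": [
            "Free entry!! Txt WIN to 80086",
            "Call me when you're free",
            "ok lol",
            "URGENT: claim prize",
        ],
        "label_bin": [1, 0, 0, 1],
    },
    index=[101, 205, 317, 422],   # non-contiguous, like a shuffled test split
)
y_test = df["label_bin"]
y_pred = np.array([1, 1, 0, 0])   # positional, in the same row order as y_test

# Compare position by position, then map back to index labels
actual = y_test.to_numpy()
fp_idx = y_test.index[(y_pred == 1) & (actual == 0)]
fn_idx = y_test.index[(y_pred == 0) & (actual == 1)]

fp_raw = df.loc[fp_idx, "text"].tolist()
fn_raw = df.loc[fn_idx, "text"].tolist()
print("False positives:", fp_raw)
print("False negatives:", fn_raw)
```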

Step 8: Threshold Tuning

The default threshold of 0.5 treats both kinds of error as equally costly. For spam detection they are not, so the business trade-off matters:

False positive (ham → spam): user misses a legitimate message. High cost.
False negative (spam → ham): user sees spam. Lower cost.

Python

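# Find the lowest threshold that achieves 99% precision on spam
# (we'd rather miss some spam than flag legitimate messages).
# Caveat: selecting the threshold on the test set makes the numbers below
# optimistic; in a real project, select it on a separate validation split.
precision, recall, thresholds = precision_recall_curve(y_test, y_prob)

target_precision = 0.99
# precision and recall have one more entry than thresholds (a final
# precision=1, recall=0 point with no threshold), so exclude it here.
meets_target = precision[:-1] >= target_precision
if not meets_target.any():
    raise ValueError(f"No threshold reaches precision {target_precision}")
idx = np.argmax(meets_target)
optimal_threshold = thresholds[idx]
print(f"At threshold {optimal_threshold:.3f}:")
print(f"  Precision: {precision[idx]:.4f}")
print(f"  Recall:    {recall[idx]:.4f}")

# Apply custom threshold
y_pred_strict = (y_prob >= optimal_threshold).astype(int)
print(classification_report(y_test, y_pred_strict, target_names=["ham", "spam"]))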
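
The threshold search can be wrapped in a small helper that also handles the case where the target precision is never reached. This is a sketch under stated assumptions: the function name threshold_for_precision and the validation arrays are hypothetical, and in practice the labels and scores would come from a validation split rather than the test set.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def threshold_for_precision(y_true, y_score, target):
    """Return (threshold, precision, recall) for the lowest threshold whose
    precision meets `target`, or None if no threshold does."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # The final precision/recall pair (1.0, 0.0) has no threshold: drop it.
    ok = precision[:-1] >= target
    if not ok.any():
        return None
    i = int(np.argmax(ok))   # first True = lowest qualifying threshold
    return float(thresholds[i]), float(precision[i]), float(recall[i])

# Hypothetical validation labels and spam probabilities
y_val = np.array([0, 0, 1, 0, 1, 1, 0, 1])
p_val = np.array([0.05, 0.20, 0.30, 0.45, 0.60, 0.70, 0.10, 0.90])

result = threshold_for_precision(y_val, p_val, target=0.99)
print(result)   # (0.6, 1.0, 0.75)
```

In this toy data the ham message scored 0.45 forces the threshold up to 0.60, which costs one spam message (the one scored 0.30) and leaves recall at 0.75.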

Step 9: Inspect Feature Importance

Python

# Which words are most predictive of spam?
tfidf = lr_pipeline.named_steps["tfidf"]
model = lr_pipeline.named_steps["model"]

feature_names = tfidf.get_feature_names_out()
coef = model.coef_[0]

top_spam = np.argsort(coef)[-20:][::-1]
top_ham  = np.argsort(coef)[:20]

print("Top SPAM indicators:")
for i in top_spam:
    print(f"  {feature_names[i]:<25} {coef[i]:+.3f}")

print("\nTop HAM indicators:")
for i in top_ham:
    print(f"  {feature_names[i]:<25} {coef[i]:+.3f}")

Top SPAM indicators:
  free                       +3.421
  txt                        +2.987
  won                        +2.876
  prize                      +2.654
  claim                      +2.543
  ...

Top HAM indicators:
  gt                         -2.341   (residue of "&gt;" HTML entities in the raw ham messages)
  ok                         -2.109
  i ll                       -1.987
  ...
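
A coefficient is easier to interpret once it is traced through to a probability. In logistic regression each term adds coefficient × TF-IDF value to the log-odds of spam. The sketch below uses three coefficients from the output above; the intercept and the TF-IDF values are hypothetical, chosen only to illustrate the arithmetic.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

# Coefficients from the output above
coef = {"free": 3.421, "prize": 2.654, "ok": -2.109}

# Hypothetical values for one message that contains "free" and "prize"
intercept = -1.0
tfidf_values = {"free": 0.5, "prize": 0.4, "ok": 0.0}

# Each term contributes coef * tfidf to the log-odds of spam
contributions = {w: coef[w] * tfidf_values[w] for w in coef}
log_odds = intercept + sum(contributions.values())
p_spam = sigmoid(log_odds)

# The same message without the two spam terms
p_without = sigmoid(intercept)

for word, value in contributions.items():
    print(f"{word:<6} {value:+.4f}")
print(f"log-odds = {log_odds:.4f}, P(spam) = {p_spam:.3f}")
print(f"without 'free' and 'prize': P(spam) = {p_without:.3f}")
```

Because TF-IDF values are normalised per message, a large coefficient only translates into a large log-odds shift when the term carries a substantial share of the message's weight.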

Step 10: Save and Load the Pipeline

Python

import joblib

# Save the fitted pipeline
joblib.dump(lr_pipeline, "spam_classifier.pkl")

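# Load and use. The pipeline was fitted on cleaned text, so apply the same
# clean_text function to new messages before predicting.
loaded = joblib.load("spam_classifier.pkl")
test_messages = [
    "Congratulations! You've won a free iPhone. Call now.",
    "Hey, are we still meeting at 3pm today?",
]
cleaned_messages = [clean_text(m) for m in test_messages]
predictions = loaded.predict(cleaned_messages)
probabilities = loaded.predict_proba(cleaned_messages)[:, 1]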

for msg, pred, prob in zip(test_messages, predictions, probabilities):
    label = "SPAM" if pred == 1 else "HAM"
    print(f"[{label} {prob:.2f}] {msg[:60]}")
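
Two things are easy to lose at prediction time: the cleaning step the model was trained with, and the tuned threshold from Step 8 (predict always uses 0.5). One possible way to keep them together is a small wrapper. The sketch below is self-contained: the helper name classify, the toy training set, and the threshold values are hypothetical, and the toy pipeline stands in for the loaded one.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def clean_text(text: str) -> str:
    # Same cleaning as in Step 2
    text = text.lower()
    text = re.sub(r"http\S+|www\S+", " URL ", text)
    text = re.sub(r"\d+", " NUM ", text)
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def classify(pipeline, messages, threshold=0.5):
    """Clean raw messages, score them, and label them with a custom threshold."""
    cleaned = [clean_text(m) for m in messages]
    probs = pipeline.predict_proba(cleaned)[:, 1]
    return [("SPAM" if p >= threshold else "HAM", float(p)) for p in probs]

# Tiny hypothetical training set so the sketch runs on its own
texts = [
    "Win a free prize now", "Claim your free cash prize", "Free entry, txt WIN",
    "See you at lunch", "Are we still meeting today?", "Ok, call me later",
]
labels = [1, 1, 1, 0, 0, 0]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("model", LogisticRegression(max_iter=1000)),
])
pipe.fit([clean_text(t) for t in texts], labels)

new_messages = ["Claim your FREE prize now!!", "are we still meeting at 3pm?"]
results = classify(pipe, new_messages, threshold=0.5)
for msg, (label, prob) in zip(new_messages, results):
    print(f"[{label} {prob:.2f}] {msg}")
```

Storing the chosen threshold next to the saved pipeline (for example in a small JSON file) keeps the two from drifting apart between retraining runs.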

Real-World Improvements

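Class imbalance: At about 13% spam, standard accuracy is misleading. Evaluate with precision, recall, F1, and ROC-AUC. Use class_weight="balanced", or oversample the minority class on the TF-IDF vectors with SMOTE (from the separate imbalanced-learn package), applied to the training split only so that synthetic samples never leak into evaluation.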
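
The arithmetic behind the accuracy warning is short. Using the class counts from Step 1, a classifier that labels every message as ham scores as follows:

```python
# Class counts from the value_counts() output in Step 1
n_ham, n_spam = 4825, 747
total = n_ham + n_spam

# "Always ham" baseline: every ham message is right, every spam message is wrong
accuracy = n_ham / total
spam_recall = 0 / n_spam

print(f"accuracy:    {accuracy:.3f}")     # 0.866
print(f"spam recall: {spam_recall:.3f}")  # 0.000
```

An accuracy of about 87% with zero spam caught is why the per-class precision and recall figures are the ones to watch.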

Threshold for business risk: False positives (ham → spam) cost more than false negatives in most contexts. Tune the threshold to the business constraint: at 99% precision, only about 1 in 100 messages flagged as spam is actually legitimate.

Feature engineering: Add message-level features alongside TF-IDF: character count, URL count, digit count, uppercase ratio. These capture spam signals that bag-of-words misses.

Temporal drift: Spammers adapt. Monitor your classifier's precision/recall on new messages weekly. Retrain quarterly or when precision drops below a threshold.
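
A monitoring check of this kind can be very small. The sketch below assumes that a labelled sample of each week's traffic is available; the function names, the week labels, the data, and the 0.95 precision floor are all hypothetical.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for the spam class (label 1).
    Either value is None when it is undefined for the batch."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return precision, recall

def weeks_below_floor(weekly_batches, min_precision=0.95):
    """Return the weeks whose spam precision fell below min_precision."""
    flagged = []
    for week, (y_true, y_pred) in weekly_batches.items():
        precision, _ = precision_recall(y_true, y_pred)
        if precision is not None and precision < min_precision:
            flagged.append(week)
    return flagged

# Hypothetical labelled samples: (true labels, model predictions) per week
weekly_batches = {
    "week-01": ([1, 1, 0, 0, 1], [1, 1, 0, 0, 1]),   # precision 1.0
    "week-02": ([1, 0, 0, 0, 1], [1, 1, 0, 0, 0]),   # precision 0.5
}

flagged = weeks_below_floor(weekly_batches)
print("Retraining candidates:", flagged)   # ['week-02']
```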

Deliverables Checklist

[ ] Notebook with data exploration (class distribution, message length analysis)
[ ] Preprocessing pipeline (cleaning function + rationale for each step)
[ ] Baseline model (Logistic Regression) with classification report
[ ] Model comparison table (at least 3 models with cross-validated F1)
[ ] Confusion matrix visualisation
[ ] Error analysis (10 false positives + 10 false negatives with explanations)
[ ] Threshold tuning section with precision-recall trade-off plot
[ ] Feature importance chart (top 20 spam and ham words)
[ ] Saved pipeline with load-and-predict demonstration
[ ] README with setup instructions and deployment considerations
