Project: Movie Sentiment Analysis with scikit-learn

This project moves beyond toy classification into the messy reality of user-generated text. Movie reviews contain sarcasm, slang, mixed opinions, and non-standard language — exactly the kind of noise you encounter in production sentiment analysis. By the end you will understand not just how to train a classifier but how to interpret what it got wrong and why.

What you will build:

Positive/negative sentiment classifier on IMDB reviews
Preprocessing pipeline for noisy text
Comparison of multiple models with cross-validation
Error analysis on misclassified reviews
Feature importance interpretation

Setup

Bash

pip install scikit-learn pandas numpy matplotlib seaborn

Python

import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (
    classification_report, confusion_matrix,
    roc_auc_score, roc_curve
)

Step 1: Load the IMDB Dataset

Python

# IMDB Large Movie Review Dataset — 50,000 reviews, balanced 50/50
# Download from: https://ai.stanford.edu/~amaas/data/sentiment/
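# Or use the Hugging Face datasets version. Its columns are "text" and "label"
# (0 = negative, 1 = positive), so rename and map them to the schema below:
# from datasets import load_dataset
# dataset = load_dataset("imdb")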

df = pd.read_csv("imdb_reviews.csv")   # expects columns: review, sentiment
# sentiment: "positive" or "negative"

print(df["sentiment"].value_counts())
# positive    25000
# negative    25000

print(f"Average review length: {df['review'].str.split().str.len().mean():.0f} words")
# Average review length: 233 words — much longer than SMS, needs different strategy
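The read_csv call above expects a single CSV file, while the Stanford archive unpacks into one text file per review. A minimal sketch of the conversion, assuming the archive's usual aclImdb/{train,test}/{pos,neg}/*.txt layout. The function name build_imdb_csv and the default output path are illustrative choices, not part of the dataset:

```python
import csv
from pathlib import Path

def build_imdb_csv(root: str, out_path: str = "imdb_reviews.csv") -> int:
    """Collect labelled reviews under `root` into one CSV; return the row count."""
    label_dirs = {"pos": "positive", "neg": "negative"}  # assumed folder names
    rows = 0
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["review", "sentiment"])  # schema expected by read_csv
        for split in ("train", "test"):
            for folder, sentiment in label_dirs.items():
                # sorted() keeps the row order reproducible across runs
                for path in sorted((Path(root) / split / folder).glob("*.txt")):
                    writer.writerow([path.read_text(encoding="utf-8"), sentiment])
                    rows += 1
    return rows

# Typical use, assuming the archive was extracted into ./aclImdb:
# build_imdb_csv("aclImdb")
```

The csv module quotes commas and line breaks inside reviews, so the resulting file can be read back with pd.read_csv without extra options.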

Step 2: Text Preprocessing

Python

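def clean_review(text: str) -> str:
    # Remove HTML tags (IMDB reviews often contain <br /> tags)
    text = re.sub(r"<[^>]+>", " ", text)
    # Normalise URLs (the placeholder ends up as "url" after lowercasing)
    text = re.sub(r"http\S+|www\S+", " URL ", text)
    # Lowercase first so that "DON'T" and "don't" expand the same way
    text = text.lower()
    # Expand common contraction suffixes. The lookbehind requires a preceding
    # letter, so quoted words such as 'dull' are not mistaken for "'d".
    # Irregular forms are only approximated ("won't" becomes "wo not").
    contractions = {
        "n't": " not", "'re": " are", "'ve": " have",
        "'ll": " will", "'d": " would", "'m": " am",
    }
    for pattern, replacement in contractions.items():
        text = re.sub(rf"(?<=\w){re.escape(pattern)}\b", replacement, text)
    # Remove remaining punctuation, including apostrophes
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse whitespace
    text = re.sub(r"\s+", " ", text).strip()
    return text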

df["clean_review"] = df["review"].apply(clean_review)
df["label"] = (df["sentiment"] == "positive").astype(int)

# Sanity check
print("Original:", df["review"].iloc[0][:100])
print("Cleaned: ", df["clean_review"].iloc[0][:100])
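A suffix table like the one in clean_review splits irregular contractions oddly: "won't" becomes "wo not" and "can't" becomes "ca not". One possible refinement is sketched here with a small hand-picked list of irregular forms. The list and the function name expand_contractions are illustrative assumptions, not part of the pipeline above:

```python
import re

# Irregular forms that a plain suffix table would split into "wo not" / "ca not".
IRREGULAR = {"won't": "will not", "can't": "can not", "shan't": "shall not"}
SUFFIXES = {
    "n't": " not", "'re": " are", "'ve": " have",
    "'ll": " will", "'d": " would", "'m": " am",
}

def expand_contractions(text: str) -> str:
    """Lowercase, then expand irregular forms before the generic suffixes."""
    text = text.lower().replace("\u2019", "'")  # treat curly apostrophes as straight
    for word, expansion in IRREGULAR.items():
        text = re.sub(rf"\b{re.escape(word)}\b", expansion, text)
    for suffix, expansion in SUFFIXES.items():
        # Require a preceding letter so quoted words like 'dull' are left alone
        text = re.sub(rf"(?<=\w){re.escape(suffix)}\b", expansion, text)
    return text

print(expand_contractions("I WON'T lie, it wasn't great and they're bored"))
# i will not lie, it was not great and they are bored
```

Writing "can not" as two words keeps "not" as its own token, which matters if negation features are added later.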

Step 3: Train/Test Split

Python

X = df["clean_review"]
y = df["label"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

print(f"Train: {len(X_train)} reviews")
print(f"Test:  {len(X_test)} reviews")

Step 4: Baseline — Logistic Regression with TF-IDF

Python

baseline = Pipeline([
    ("tfidf", TfidfVectorizer(
        max_features=50_000,
        ngram_range=(1, 2),
        sublinear_tf=True,
        stop_words="english",   # note: this list also drops negators such as "not" and "never"
    )),
    ("model", LogisticRegression(C=1.0, max_iter=1000, solver="lbfgs"))
])

baseline.fit(X_train, y_train)
y_pred_base = baseline.predict(X_test)
y_prob_base = baseline.predict_proba(X_test)[:, 1]

print("Baseline (LR + TF-IDF):")
print(classification_report(y_test, y_pred_base, target_names=["negative", "positive"]))
print(f"ROC-AUC: {roc_auc_score(y_test, y_prob_base):.4f}")

              precision    recall  f1-score   support

    negative       0.90      0.89      0.89      5000
    positive       0.89      0.90      0.89      5000

    accuracy                           0.89     10000
   macro avg       0.89      0.89      0.89     10000

ROC-AUC: 0.9621
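The baseline sets sublinear_tf=True without showing what it changes. As a rough illustration of scikit-learn's documented formulas (term frequency replaced by 1 + ln(tf), and, with the default smooth_idf=True, idf = ln((1 + n) / (1 + df)) + 1), the helper names and the counts below are made up for the example:

```python
import math

def sublinear_tf(count: int) -> float:
    """Dampened term frequency: 1 + ln(count) for count > 0, else 0."""
    return 1.0 + math.log(count) if count > 0 else 0.0

def smooth_idf(n_docs: int, doc_freq: int) -> float:
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1.0

# A word repeated 10 times counts about 3.3x a single mention, not 10x.
for count in (1, 2, 10):
    print(count, round(sublinear_tf(count), 3))

# Hypothetical corpus of 40,000 reviews: a rare word outweighs a common one.
print(round(smooth_idf(40_000, 50), 3), round(smooth_idf(40_000, 30_000), 3))
```

This matters for long reviews, where a single word repeated many times would otherwise dominate the vector. With the default norm="l2", each review's vector is also scaled to unit length after these weights are applied.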

Step 5: Compare Models

Python

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    "Logistic Regression": Pipeline([
        ("tfidf", TfidfVectorizer(max_features=50_000, ngram_range=(1, 2), sublinear_tf=True)),
        ("model", LogisticRegression(C=1.0, max_iter=1000)),
    ]),
    "Linear SVM": Pipeline([
        ("tfidf", TfidfVectorizer(max_features=50_000, ngram_range=(1, 2), sublinear_tf=True)),
        ("model", LinearSVC(C=1.0, max_iter=2000)),
    ]),
    "Naive Bayes": Pipeline([
        ("tfidf", TfidfVectorizer(max_features=50_000, ngram_range=(1, 2))),
        ("model", MultinomialNB(alpha=0.05)),
    ]),
    "LR + Character N-grams": Pipeline([
        ("tfidf", TfidfVectorizer(
            max_features=50_000,
            analyzer="char_wb",   # character n-grams within word boundaries
            ngram_range=(3, 5),
            sublinear_tf=True,
        )),
        ("model", LogisticRegression(C=1.0, max_iter=1000)),
    ]),
}

print(f"{'Model':<30} {'Mean F1':>10} {'Std F1':>10}")
print("-" * 52)
for name, pipeline in models.items():
    scores = cross_val_score(pipeline, X_train, y_train, cv=cv, scoring="f1_macro")
    print(f"{name:<30} {scores.mean():>10.4f} {scores.std():>10.4f}")

Model                          Mean F1     Std F1
----------------------------------------------------
Logistic Regression             0.8921     0.0031
Linear SVM                      0.8972     0.0028
Naive Bayes                     0.8612     0.0041
LR + Character N-grams          0.8843     0.0035

Step 6: Error Analysis

Python

best_model = models["Linear SVM"]
best_model.fit(X_train, y_train)
y_pred = best_model.predict(X_test)

mask_fp = (y_pred == 1) & (y_test == 0)   # predicted positive, actually negative
mask_fn = (y_pred == 0) & (y_test == 1)   # predicted negative, actually positive

print(f"False positives (neg→pos): {mask_fp.sum()}")
print(f"False negatives (pos→neg): {mask_fn.sum()}")

Python

# Inspect false positives — reviews the model thought were positive but were negative
fp_reviews = X_test[mask_fp].values
print("\nFalse Positive Examples (predicted positive, actually negative):")
for r in fp_reviews[:5]:
    print(f"\n  '{r[:150]}'")

False Positive Examples:
  'the acting was magnificent and the cinematography was stunning but the plot made absolutely no sense and left me deeply unsatisfied'
  # Problem: many positive words + negative conclusion — model weights individual words, not structure

  'great movie if you enjoy watching paint dry for two hours'
  # Problem: sarcasm — "great" is positive but the phrase is negative

Python

# Inspect false negatives — reviews the model thought were negative but were positive
fn_reviews = X_test[mask_fn].values
print("\nFalse Negative Examples (predicted negative, actually positive):")
for r in fn_reviews[:5]:
    print(f"\n  '{r[:150]}'")

False Negative Examples:
  'not the worst film i have seen this year it actually exceeded my low expectations'
  # Problem: "not the worst" is positive but loaded with negative words

  'i went in expecting garbage and came out pleasantly surprised'
  # Problem: negative framing of a positive sentiment
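The slices above show the first five errors in test-set order. For the write-up it is often more informative to look at the errors the model was most sure about. A possible helper, assuming signed scores such as those returned by best_model.decision_function(X_test), where a positive score means "predicted positive". The name most_confident_errors is hypothetical:

```python
def most_confident_errors(texts, y_true, scores, k=10):
    """Return (false_positives, false_negatives), each sorted by |score| descending.

    `scores` are signed margins: > 0 means the model predicts positive (label 1).
    """
    fps, fns = [], []
    for text, label, score in zip(texts, y_true, scores):
        predicted = 1 if score > 0 else 0
        if predicted == 1 and label == 0:
            fps.append((float(score), text))
        elif predicted == 0 and label == 1:
            fns.append((float(score), text))

    def confidence(item):
        return abs(item[0])

    return (sorted(fps, key=confidence, reverse=True)[:k],
            sorted(fns, key=confidence, reverse=True)[:k])

# Toy data (made up) to show the ordering
texts = ["a", "b", "c", "d"]
y_true = [0, 0, 1, 1]
scores = [0.2, 1.7, -0.9, 0.4]
fps, fns = most_confident_errors(texts, y_true, scores)
print(fps)  # [(1.7, 'b'), (0.2, 'a')]
print(fns)  # [(-0.9, 'c')]
```

Errors with a score close to zero are borderline cases. Errors with a large margin usually point to a systematic blind spot such as sarcasm or negation.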

Step 7: Feature Importance

Python

# Refit Logistic Regression for interpretability
lr = models["Logistic Regression"]
lr.fit(X_train, y_train)

tfidf    = lr.named_steps["tfidf"]
clf      = lr.named_steps["model"]
features = tfidf.get_feature_names_out()
coef     = clf.coef_[0]

# Most positive and negative indicators
top_pos = np.argsort(coef)[-20:][::-1]
top_neg = np.argsort(coef)[:20]

print("Top POSITIVE indicators:")
for i in top_pos:
    print(f"  {features[i]:<30} {coef[i]:+.3f}")

print("\nTop NEGATIVE indicators:")
for i in top_neg:
    print(f"  {features[i]:<30} {coef[i]:+.3f}")

Top POSITIVE indicators:
  excellent                        +4.231
  wonderful                        +3.987
  masterpiece                      +3.654
  beautifully                      +3.421
  brilliant                        +3.312
  ...

Top NEGATIVE indicators:
  worst                            -4.876
  awful                            -4.321
  terrible                         -4.109
  boring                           -3.987
  waste                            -3.654
  ...

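# Note: "not bad" contains both "not" (-3.1) and "bad" (-3.8), and a linear model simply adds the two weights.
# With ngram_range=(1, 2) the bigram "not bad" can get its own weight if it makes the 50,000-feature cut,
# but negation that spans more than two words ("not at all bad") is still missed.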
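To make the additive behaviour concrete, here is a toy calculation of a linear model's score. The two unigram weights are the ones quoted in the note. The bigram weight and the feature values of 1.0 are invented for illustration (real inputs would be TF-IDF values):

```python
# Hypothetical weights: unigram values from the note, bigram value invented.
weights = {"not": -3.1, "bad": -3.8, "not bad": +2.5}

def linear_score(features: dict, weights: dict, bias: float = 0.0) -> float:
    """Score = bias + sum(weight * feature value); unknown features contribute 0."""
    return bias + sum(weights.get(name, 0.0) * value for name, value in features.items())

unigrams_only = {"not": 1.0, "bad": 1.0}
with_bigram = {"not": 1.0, "bad": 1.0, "not bad": 1.0}

print(round(linear_score(unigrams_only, weights), 2))  # -6.9
print(round(linear_score(with_bigram, weights), 2))    # -4.4
```

Under these assumed numbers the bigram softens the score but does not flip its sign. It would have to outweigh both unigram weights combined to do so.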

Step 8: Confusion Matrix Visualisation

Python

cm = confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots(figsize=(6, 5))

sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", ax=ax,
            xticklabels=["Predicted Negative", "Predicted Positive"],
            yticklabels=["Actual Negative", "Actual Positive"])

ax.set_title("Linear SVM — Sentiment Classification")
plt.tight_layout()
plt.savefig("confusion_matrix.png", dpi=150)
plt.show()

# ROC curve (LinearSVC has no predict_proba, so use decision_function scores)
svm_scores = best_model.decision_function(X_test)
fpr, tpr, _ = roc_curve(y_test, svm_scores)

plt.figure(figsize=(6, 5))
plt.plot(fpr, tpr, label=f"Linear SVM (AUC = {roc_auc_score(y_test, svm_scores):.3f})")
plt.plot([0, 1], [0, 1], "k--")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.tight_layout()
plt.savefig("roc_curve.png", dpi=150)
plt.show()
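To connect the heatmap to concrete rates, the four cells can be turned into metrics by hand. The sketch below relies on the layout confusion_matrix uses for labels 0/1 (rows are actual classes, columns are predicted classes). The counts are made up and binary_metrics is a hypothetical helper name:

```python
def binary_metrics(cm):
    """cm is [[TN, FP], [FN, TP]], the layout confusion_matrix uses for labels 0/1."""
    (tn, fp), (fn, tp) = cm
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),            # of reviews flagged positive, share that were
        "recall": tp / (tp + fn),               # of truly positive reviews, share that were found
        "false_positive_rate": fp / (fp + tn),  # negative reviews mistaken for positive
    }

# Made-up counts for a 10,000-review test set
example = [[4450, 550], [500, 4500]]
for name, value in binary_metrics(example).items():
    print(f"{name:<20} {value:.3f}")
```

Which cell matters most depends on the application. If negative reviews trigger a customer-support follow-up, for instance, false positives (missed complaints) are likely to cost more than false negatives.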

Step 9: Improvements to Try

Negation handling:

Python

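def add_negation_scope(text: str) -> str:
    # Tag words following a negator with a _NEG suffix until the scope ends:
    # "not good" → "not good_NEG"
    # The scope ends at "but"/"however" or at clause punctuation. Apply this
    # BEFORE punctuation is stripped (clean_review removes it); on fully
    # cleaned text only "but"/"however" can close the scope.
    negators = {"not", "no", "never", "neither", "nor", "without"}
    scope_breakers = {"but", "however"}
    result = []
    negate = False
    for token in text.split():
        word = token.rstrip(".,!?;:")
        ends_clause = word != token        # token carried trailing punctuation
        if word in negators:
            negate = True
            result.append(token)
        elif word in scope_breakers or not word:
            negate = False
            result.append(token)
        elif negate:
            result.append(word + "_NEG")
        else:
            result.append(token)
        if ends_clause:
            negate = False
    return " ".join(result)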

# "not good" → "not good_NEG"
# "not good but interesting" → "not good_NEG but interesting"
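One detail to check before feeding tagged text to TfidfVectorizer: by default it lowercases the input and tokenises with the pattern (?u)\b\w\w+\b. Since \w includes the underscore, a tagged token should survive as a single lowercase feature. A quick standalone check with the re module, mimicking those two defaults rather than calling scikit-learn itself (default_tokens is a hypothetical helper):

```python
import re

DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"   # TfidfVectorizer's default token_pattern

def default_tokens(text: str) -> list:
    """Approximate the vectoriser defaults: lowercase, then regex tokenisation."""
    return re.findall(DEFAULT_TOKEN_PATTERN, text.lower())

print(default_tokens("not good_NEG but interesting"))
# ['not', 'good_neg', 'but', 'interesting']
```

The same pattern drops single-character tokens such as "i" and "a", so a tag attached to a one-letter word would still be kept only because the suffix makes it longer.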

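Aspect-level features: Score smaller units than the whole review. A review saying "the acting was brilliant but the plot was terrible" has mixed sentiment about two aspects (acting and plot). Splitting reviews into sentences, and into clauses at contrastive words such as "but", lets each part be scored separately and gives more signal than one review-level label.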

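Character n-grams for robustness: A misspelled word still shares some character n-grams with the intended word, so it is not treated as entirely unseen. How much survives depends on the typo: a doubled letter ("greatt") keeps most of the n-grams of "great", while a transposition ("graet") keeps very few.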
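The amount of overlap can be checked directly. The sketch below imitates the padding used by analyzer="char_wb" (one space on each side of the word) for n-grams of length 3 to 5. It is a simplified stand-in for the vectoriser, and char_wb_ngrams is a hypothetical helper:

```python
def char_wb_ngrams(word: str, n_min: int = 3, n_max: int = 5) -> set:
    """Character n-grams of one word padded with a space on each side."""
    padded = f" {word} "
    return {
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    }

target = char_wb_ngrams("great")
for typo in ("greatt", "graet"):
    shared = target & char_wb_ngrams(typo)
    print(f"{typo}: {len(shared)} of {len(target)} n-grams shared")
# greatt: 9 of 12 n-grams shared
# graet: 1 of 12 n-grams shared
```

Insertions and deletions near the end of a word leave most n-grams intact. Transpositions in the middle break nearly all of them, so character n-grams help with some typos rather than all of them.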

Deliverables Checklist

[ ] Notebook with data exploration (review length distribution, class balance)
[ ] Preprocessing pipeline with rationale for each cleaning step
[ ] Baseline Logistic Regression with classification report
[ ] Model comparison table (minimum 3 models with 5-fold cross-validation)
[ ] ROC curve plot for all models
[ ] Confusion matrix for the best model
[ ] Error analysis: 10 false positives + 10 false negatives with explanations
[ ] Feature importance chart (top 20 positive and negative words)
[ ] One improvement implemented (negation handling or character n-grams)
[ ] README explaining the problem, approach, results, and limitations

What You Learned

After completing this project you understand:

Why accuracy alone can be a misleading metric for sentiment: the classes are balanced here, but in real applications they often are not
That bag-of-words models are brittle to sarcasm and negation — a fundamental limitation that motivates transformer models
How to read a confusion matrix and connect it to business impact
That feature importance reveals what the model actually learned, and sometimes that is surprising
The gap between a model that works on clean test sets and one that handles the full range of real-world language
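The first point can be checked with a few lines of arithmetic. The sketch below uses a made-up imbalanced test set and a degenerate "model" that predicts negative for every review. Accuracy looks strong while macro F1 exposes the problem:

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 for one class; taken as 0 when the class has no true positives."""
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)

# Made-up test set: 950 negative reviews, 50 positive.
# The "model" predicts negative for everything.
n_neg, n_pos = 950, 50
accuracy = n_neg / (n_neg + n_pos)

f1_negative = f1(tp=n_neg, fp=n_pos, fn=0)   # "negative" treated as the target class
f1_positive = f1(tp=0, fp=0, fn=n_pos)       # no positive review is ever found
macro_f1 = (f1_negative + f1_positive) / 2

print(f"accuracy = {accuracy:.3f}")   # accuracy = 0.950
print(f"macro F1 = {macro_f1:.3f}")   # macro F1 = 0.487
```

This is why the model comparison in Step 5 scores with f1_macro even though the IMDB classes happen to be balanced.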
