AI/ML/NLP Research Track · Lesson 10 of 16
Project 2: Movie Sentiment Analysis
Project: Movie Sentiment Analysis with scikit-learn
This project moves beyond toy classification into the messy reality of user-generated text. Movie reviews contain sarcasm, slang, mixed opinions, and non-standard language — exactly the kind of noise you encounter in production sentiment analysis. By the end you will understand not just how to train a classifier but how to interpret what it got wrong and why.
What you will build:
- Positive/negative sentiment classifier on IMDB reviews
- Preprocessing pipeline for noisy text
- Comparison of multiple models with cross-validation
- Error analysis on misclassified reviews
- Feature importance interpretation
Setup
pip install scikit-learn pandas numpy matplotlib seaborn
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (
    classification_report, confusion_matrix,
    roc_auc_score, roc_curve
)
Step 1: Load the IMDB Dataset
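The loading code in this step reads a single file, imdb_reviews.csv, with review and sentiment columns. The Stanford archive does not come in that form. It is assumed here to unpack to an aclImdb folder with train/ and test/ splits, each holding pos/ and neg/ folders that contain one review per .txt file. A minimal sketch of a converter under that layout assumption follows; build_imdb_csv and its arguments are hypothetical names, not part of any library.

```python
import csv
from pathlib import Path


def build_imdb_csv(root: Path, out_path: Path) -> int:
    """Write one CSV row per labelled review and return the number of rows.

    Assumes the layout root/{train,test}/{pos,neg}/*.txt (an assumption
    about the unpacked archive; adjust if your copy differs).
    """
    label_dirs = {"pos": "positive", "neg": "negative"}
    n_rows = 0
    with out_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["review", "sentiment"])
        for split in ("train", "test"):
            for dirname, sentiment in label_dirs.items():
                # sorted() keeps the row order reproducible across runs
                for txt in sorted((root / split / dirname).glob("*.txt")):
                    writer.writerow([txt.read_text(encoding="utf-8"), sentiment])
                    n_rows += 1
    return n_rows


# Example call (paths are placeholders):
# build_imdb_csv(Path("aclImdb"), Path("imdb_reviews.csv"))
```

The original train/test folders are merged on purpose, because Step 3 draws its own stratified split from the combined data.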
# IMDB Large Movie Review Dataset — 50,000 reviews, balanced 50/50
# Download from: https://ai.stanford.edu/~amaas/data/sentiment/
# Or use the Hugging Face datasets version:
# from datasets import load_dataset
# dataset = load_dataset("imdb")
df = pd.read_csv("imdb_reviews.csv") # expects columns: review, sentiment
# sentiment: "positive" or "negative"
print(df["sentiment"].value_counts())
# positive 25000
# negative 25000
print(f"Average review length: {df['review'].str.split().str.len().mean():.0f} words")
# Average review length: 233 words; much longer than SMS, needs different strategy
Step 2: Text Preprocessing
def clean_review(text: str) -> str:
    # Remove HTML tags (IMDB reviews often contain <br /> tags)
    text = re.sub(r"<[^>]+>", " ", text)
    # Normalise URLs to a placeholder token
    text = re.sub(r"http\S+|www\S+", " URL ", text)
    # Lowercase first so that upper-case contractions such as "DON'T" are matched below
    text = text.lower()
    # Expand common contractions ("'d" can also mean "had"; "would" is a simplification)
    contractions = {
        "n't": " not", "'re": " are", "'ve": " have",
        "'ll": " will", "'d": " would", "'m": " am",
    }
    for pattern, replacement in contractions.items():
        text = text.replace(pattern, replacement)
    # Remove the remaining punctuation, including any leftover apostrophes
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse whitespace
    text = re.sub(r"\s+", " ", text).strip()
    return text
df["clean_review"] = df["review"].apply(clean_review)
df["label"] = (df["sentiment"] == "positive").astype(int)
# Sanity check
print("Original:", df["review"].iloc[0][:100])
print("Cleaned: ", df["clean_review"].iloc[0][:100])
Step 3: Train/Test Split
X = df["clean_review"]
y = df["label"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)
print(f"Train: {len(X_train)} reviews")
print(f"Test: {len(X_test)} reviews")
Step 4: Baseline Logistic Regression with TF-IDF
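The baseline in this step sets sublinear_tf=True. In scikit-learn this replaces a raw term count tf with 1 + log(tf), so a word repeated 25 times in a long review counts roughly four times as much as a single mention rather than 25 times as much. The sketch below shows only that scaling; the helper name is illustrative, and the real vectorizer additionally applies IDF weighting and L2 normalisation.

```python
import math


def sublinear_tf(count: int) -> float:
    """Dampened term frequency: 1 + ln(count) for count > 0, else 0."""
    return 1.0 + math.log(count) if count > 0 else 0.0


for count in (1, 5, 25):
    print(f"raw tf = {count:>2}  ->  sublinear tf = {sublinear_tf(count):.3f}")
```

This matters more for IMDB than for short messages, because long reviews tend to repeat the same sentiment words.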
baseline = Pipeline([
    ("tfidf", TfidfVectorizer(
        max_features=50_000,
        ngram_range=(1, 2),
        sublinear_tf=True,
        stop_words="english",  # note: this list also drops negators such as "not" and "never"
    )),
    ("model", LogisticRegression(C=1.0, max_iter=1000, solver="lbfgs"))
])
baseline.fit(X_train, y_train)
y_pred_base = baseline.predict(X_test)
y_prob_base = baseline.predict_proba(X_test)[:, 1]
print("Baseline (LR + TF-IDF):")
print(classification_report(y_test, y_pred_base, target_names=["negative", "positive"]))
print(f"ROC-AUC: {roc_auc_score(y_test, y_prob_base):.4f}")

              precision    recall  f1-score   support
    negative       0.90      0.89      0.89      5000
    positive       0.89      0.90      0.89      5000
    accuracy                           0.89     10000
   macro avg       0.89      0.89      0.89     10000
ROC-AUC: 0.9621
Step 5: Compare Models
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
models = {
    "Logistic Regression": Pipeline([
        ("tfidf", TfidfVectorizer(max_features=50_000, ngram_range=(1, 2), sublinear_tf=True)),
        ("model", LogisticRegression(C=1.0, max_iter=1000)),
    ]),
    "Linear SVM": Pipeline([
        ("tfidf", TfidfVectorizer(max_features=50_000, ngram_range=(1, 2), sublinear_tf=True)),
        ("model", LinearSVC(C=1.0, max_iter=2000)),
    ]),
    "Naive Bayes": Pipeline([
        ("tfidf", TfidfVectorizer(max_features=50_000, ngram_range=(1, 2))),
        ("model", MultinomialNB(alpha=0.05)),
    ]),
    "LR + Character N-grams": Pipeline([
        ("tfidf", TfidfVectorizer(
            max_features=50_000,
            analyzer="char_wb",  # character n-grams within word boundaries
            ngram_range=(3, 5),
            sublinear_tf=True,
        )),
        ("model", LogisticRegression(C=1.0, max_iter=1000)),
    ]),
}
print(f"{'Model':<30} {'Mean F1':>10} {'Std F1':>10}")
print("-" * 52)
for name, pipeline in models.items():
    scores = cross_val_score(pipeline, X_train, y_train, cv=cv, scoring="f1_macro")
    print(f"{name:<30} {scores.mean():>10.4f} {scores.std():>10.4f}")

Model                             Mean F1     Std F1
----------------------------------------------------
Logistic Regression                0.8921     0.0031
Linear SVM                         0.8972     0.0028
Naive Bayes                        0.8612     0.0041
LR + Character N-grams             0.8843     0.0035
Step 6: Error Analysis
best_model = models["Linear SVM"]
best_model.fit(X_train, y_train)
y_pred = best_model.predict(X_test)
mask_fp = (y_pred == 1) & (y_test == 0) # predicted positive, actually negative
mask_fn = (y_pred == 0) & (y_test == 1) # predicted negative, actually positive
print(f"False positives (neg→pos): {mask_fp.sum()}")
print(f"False negatives (pos→neg): {mask_fn.sum()}")
# Inspect false positives: reviews the model predicted positive that are actually negative
fp_reviews = X_test[mask_fp].values
print("\nFalse Positive Examples (predicted positive, actually negative):")
for r in fp_reviews[:5]:
    print(f"\n '{r[:150]}'")

False Positive Examples:
'the acting was magnificent and the cinematography was stunning but the plot made absolutely no sense and left me deeply unsatisfied'
# Problem: many positive words + negative conclusion; the model weights individual words, not structure
'great movie if you enjoy watching paint dry for two hours'
# Problem: sarcasm; "great" is positive but the phrase is negative

# Inspect false negatives: reviews the model predicted negative that are actually positive
fn_reviews = X_test[mask_fn].values
print("\nFalse Negative Examples (predicted negative, actually positive):")
for r in fn_reviews[:5]:
    print(f"\n '{r[:150]}'")

False Negative Examples:
'not the worst film i ve seen this year it actually exceeded my low expectations'
# Problem: "not the worst" is positive but loaded with negative words
'i went in expecting garbage and came out pleasantly surprised'
# Problem: negative framing of a positive sentiment
Step 7: Feature Importance
# Refit Logistic Regression for interpretability
lr = models["Logistic Regression"]
lr.fit(X_train, y_train)
tfidf = lr.named_steps["tfidf"]
clf = lr.named_steps["model"]
features = tfidf.get_feature_names_out()
coef = clf.coef_[0]
# Most positive and negative indicators
top_pos = np.argsort(coef)[-20:][::-1]
top_neg = np.argsort(coef)[:20]
print("Top POSITIVE indicators:")
for i in top_pos:
    print(f" {features[i]:<30} {coef[i]:+.3f}")
print("\nTop NEGATIVE indicators:")
for i in top_neg:
    print(f" {features[i]:<30} {coef[i]:+.3f}")

Top POSITIVE indicators:
 excellent                      +4.231
 wonderful                      +3.987
 masterpiece                    +3.654
 beautifully                    +3.421
 brilliant                      +3.312
 ...
Top NEGATIVE indicators:
 worst                          -4.876
 awful                          -4.321
 terrible                       -4.109
 boring                         -3.987
 waste                          -3.654
 ...
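# Note: "not bad" contributes the unigram weights for both "not" (-3.1) and "bad" (-3.8).
# With ngram_range=(1, 2) the bigram "not bad" can earn its own weight, but negation that
# spans more than two tokens is still scored word by word, so the structure is lost.
Step 8: Confusion Matrix Visualisation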
cm = confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", ax=ax,
            xticklabels=["Predicted Negative", "Predicted Positive"],
            yticklabels=["Actual Negative", "Actual Positive"])
ax.set_title("Linear SVM — Sentiment Classification")
plt.tight_layout()
plt.savefig("confusion_matrix.png", dpi=150)
plt.show()
# ROC curve (LinearSVC has no predict_proba, so rank by decision_function scores)
svm_scores = best_model.decision_function(X_test)
fpr, tpr, _ = roc_curve(y_test, svm_scores)
plt.figure(figsize=(6, 5))
plt.plot(fpr, tpr, label=f"Linear SVM (AUC = {roc_auc_score(y_test, svm_scores):.3f})")
plt.plot([0, 1], [0, 1], "k--")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.savefig("roc_curve.png", dpi=150)
plt.show()
Step 9: Improvements to Try
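With ngram_range=(1, 2), as used in Steps 4 and 5, an adjacent pair such as "not bad" becomes a feature of its own, but a negator separated from its target by even one word never shares a feature with it. The dependency-free sketch below approximates the unigram and bigram features using plain whitespace tokenisation (TfidfVectorizer's default tokeniser additionally drops punctuation and one-character tokens). word_ngrams is an illustrative helper, not a scikit-learn function.

```python
def word_ngrams(text: str, min_n: int = 1, max_n: int = 2) -> list[str]:
    """Return all word n-grams of length min_n..max_n, shortest first."""
    tokens = text.lower().split()
    return [
        " ".join(tokens[i:i + n])
        for n in range(min_n, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


adjacent = word_ngrams("not bad")
distant = word_ngrams("not really bad")

print(adjacent)  # includes the bigram "not bad"
print(distant)   # "not" and "bad" never land in the same feature
```

Scope tagging of the kind shown under "Negation handling" targets exactly this gap, since it marks every word inside the negation scope rather than only the adjacent one.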
Negation handling:
def add_negation_scope(text: str) -> str:
    # Tag each word that follows a negator with a _NEG suffix until the scope is reset
    tokens = text.split()
    result = []
    negate = False
    negators = {"not", "no", "never", "neither", "nor", "without"}
    # clean_review strips punctuation, so on cleaned text only "but"/"however" reset the scope
    scope_breakers = {".", ",", "!", "?", "but", "however"}
    for token in tokens:
        if token in negators:
            negate = True
            result.append(token)
        elif token in scope_breakers:
            negate = False
            result.append(token)
        elif negate:
            result.append(token + "_NEG")
        else:
            result.append(token)
    return " ".join(result)
# "not good" → "not good_NEG"
# "not good but interesting" → "not good_NEG but interesting"
Aspect-level features: Extract sentence-level sentiment rather than review-level. A review saying "the acting was brilliant but the plot was terrible" has mixed sentiment, and splitting it into sentences or clauses gives the model more signal than a single review-level vector.
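One way to get sentence-level units is to split the raw review before clean_review strips its punctuation. The sketch below is a rough regex-based splitter, not a real sentence tokeniser: it breaks on sentence-ending punctuation and on the contrast words "but" and "however", and it will mis-split abbreviations and decimals such as "7.5". split_clauses is a hypothetical helper.

```python
import re

# Sentence-ending punctuation, or a contrast word surrounded by whitespace
_CLAUSE_BOUNDARY = re.compile(
    r"[.!?]+\s*|,?\s+\b(?:but|however)\b,?\s+",
    flags=re.IGNORECASE,
)


def split_clauses(review: str) -> list[str]:
    """Split a raw review into rough clause-level units."""
    parts = _CLAUSE_BOUNDARY.split(review)
    return [p.strip() for p in parts if p.strip()]


print(split_clauses("The acting was brilliant but the plot was terrible."))
# ['The acting was brilliant', 'the plot was terrible']
```

Each clause can then be passed through clean_review and scored separately, for example with baseline.predict_proba, to see whether a review mixes positive and negative parts.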
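Character n-grams for robustness: A misspelling such as "graet" is an unseen token at the word level, but it still shares some character n-grams with "great" (for example the word-initial " gr"), so part of the signal survives. Step 5 already includes this variant as "LR + Character N-grams".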
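How much two spellings overlap depends on the kind of typo. The sketch below mimics, in plain Python, the word-boundary character n-grams that analyzer="char_wb" with ngram_range=(3, 5) produces for a single word, which is the word padded with one space on each side. char_wb_ngrams and shared are illustrative helpers; check the real vectorizer for edge cases.

```python
def char_wb_ngrams(word: str, min_n: int = 3, max_n: int = 5) -> set[str]:
    """Character n-grams of a single word padded with one space per side."""
    padded = f" {word} "
    return {
        padded[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(padded) - n + 1)
    }


def shared(a: str, b: str) -> list[str]:
    """Sorted n-grams that both spellings have in common."""
    return sorted(char_wb_ngrams(a) & char_wb_ngrams(b))


print(shared("great", "graet"))     # transposed letters: [' gr']
print(len(shared("amazing", "amazng")))  # dropped letter: 7 shared n-grams
```

A transposition in a short word leaves very little in common, while a dropped letter in a longer word keeps most of the prefix, so the robustness gain varies with the kind of typo.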
Deliverables Checklist
[ ] Notebook with data exploration (review length distribution, class balance)
[ ] Preprocessing pipeline with rationale for each cleaning step
[ ] Baseline Logistic Regression with classification report
[ ] Model comparison table (minimum 3 models with 5-fold cross-validation)
[ ] ROC curve plot for all models
[ ] Confusion matrix for the best model
[ ] Error analysis: 10 false positives + 10 false negatives with explanations
[ ] Feature importance chart (top 20 positive and negative words)
[ ] One improvement implemented (negation handling or character n-grams)
[ ] README explaining the problem, approach, results, and limitations
What You Learned
After completing this project you understand:
- Why accuracy alone is not a sufficient metric for sentiment: the classes are balanced here, but in real applications they often are not
- That bag-of-words models are brittle to sarcasm and negation — a fundamental limitation that motivates transformer models
- How to read a confusion matrix and connect it to business impact
- That feature importance reveals what the model actually learned, and sometimes that is surprising
- The gap between a model that works on clean test sets and one that handles the full range of real-world language
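The first point can be made concrete with a toy, made-up label set in which 90 of 100 reviews are negative. A classifier that always predicts "negative" reaches 90% accuracy while never finding a positive review. Macro F1, the unweighted mean of the per-class F1 scores and the scoring metric used in Step 5, exposes the failure. The helpers below are a dependency-free sketch with illustrative names.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score treating `cls` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Made-up labels: 90 negative (0), 10 positive (1); the model always says 0
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy = {accuracy:.2f}")                  # accuracy = 0.90
print(f"macro F1 = {macro_f1(y_true, y_pred):.2f}")  # macro F1 = 0.47
```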