Naive Bayes Classifier

The Core Idea

Bayes' theorem for classification:
  P(class | features) ∝ P(features | class) × P(class)

Problem: P(features | class) = P(x₁, x₂, ..., xₙ | class)
  With n features, this requires estimating an n-dimensional joint distribution for every class
  With binary features and n = 20: 2²⁰ ≈ 1M probabilities to estimate per class

Naive Bayes assumption: features are conditionally independent given the class
  P(x₁, x₂, ..., xₙ | class) = Π P(xᵢ | class)
  
  Now: only n one-dimensional distributions to estimate per class, one for each feature
  Tractable even with thousands of features
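
To make the saving concrete, here is a minimal sketch that counts free parameters for binary features with and without the independence assumption. The function names are illustrative and not part of any library.

```python
def joint_param_count(n_features: int, n_classes: int) -> int:
    """Full joint over binary features: 2^n outcomes, so 2^n - 1 free parameters per class."""
    return n_classes * (2 ** n_features - 1)


def naive_param_count(n_features: int, n_classes: int) -> int:
    """Conditional independence: one Bernoulli parameter per feature per class."""
    return n_classes * n_features


print(joint_param_count(20, 2))  # 2097150
print(naive_param_count(20, 2))  # 40
```

With 20 binary features and 2 classes, the naive model needs 40 numbers where the full joint needs about two million.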

Three Variants

Gaussian Naive Bayes:
  Continuous features
  P(xᵢ | class = k) ~ Normal(μᵢₖ, σᵢₖ²)
  Estimates: mean and variance of each feature within each class
  Use for: continuous measurement data (lab values, sensor readings)
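
As a rough sketch of what the Gaussian variant computes, the code below estimates a per-class mean and variance for each feature and scores new points with summed log densities. The helper names (fit_gaussian_nb, predict_gaussian_nb), the toy data, and the 1e-9 variance floor are assumptions made for illustration.

```python
import numpy as np


def fit_gaussian_nb(X, y):
    """Estimate class priors plus per-class, per-feature means and variances."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # Small floor (an assumption) so a constant feature never yields zero variance
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + 1e-9
    return classes, priors, means, variances


def predict_gaussian_nb(X, classes, priors, means, variances):
    """Pick the class maximising log P(class) + sum_i log Normal(x_i | mu_ik, var_ik)."""
    diff = X[:, None, :] - means[None, :, :]  # (n_samples, n_classes, n_features)
    log_density = -0.5 * (np.log(2 * np.pi * variances)[None] + diff ** 2 / variances[None])
    scores = np.log(priors)[None, :] + log_density.sum(axis=2)
    return classes[np.argmax(scores, axis=1)]


# Toy data: two well-separated clusters
X = np.array([[1.0, 2.1], [1.2, 1.9], [0.9, 2.0], [3.0, 0.5], [3.2, 0.4], [2.9, 0.6]])
y = np.array([0, 0, 0, 1, 1, 1])
params = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(np.array([[1.1, 2.0], [3.1, 0.5]]), *params))  # [0 1]
```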

Multinomial Naive Bayes:
  Count features (e.g., word counts in a document)
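  P(word | class) = (count of word in class documents + α) / (total word count in class + α × |V|), where |V| is the vocabulary size
  Laplace smoothing: α = 1 prevents zero probabilities for words never seen in a class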
  Use for: text classification, document classification
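
A small worked example of the smoothing step, under the assumption of a three-word vocabulary and a single class. The function name smoothed_word_probs and the toy tokens are invented for illustration.

```python
from collections import Counter


def smoothed_word_probs(class_tokens, vocabulary, alpha=1.0):
    """P(word | class) with additive smoothing over a fixed vocabulary."""
    counts = Counter(class_tokens)
    total = sum(counts[w] for w in vocabulary) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / total for w in vocabulary}


vocab = ["dose", "bleed", "refund"]
tokens = ["dose", "dose", "bleed"]  # "refund" never occurs in this class
probs = smoothed_word_probs(tokens, vocab)
print(probs["refund"])  # 1/6 instead of 0
```

Without smoothing, a single unseen word would drive the whole document likelihood to zero for that class, regardless of the other words.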

Bernoulli Naive Bayes:
  Binary features (e.g., word present/absent)
  P(xᵢ = 1 | class) = pᵢₖ (learned from data)
  Explicitly penalises the absence of a word that is indicative of the class: an absent feature contributes a factor of (1 − pᵢₖ)
  Use for: short text, binary feature matrices
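
The absence penalty can be seen in a short sketch of the Bernoulli log-likelihood. The probabilities below are made up for illustration, and bernoulli_log_likelihood is a hypothetical helper name.

```python
import math


def bernoulli_log_likelihood(x, p):
    """log P(x | class) = sum_i [x_i * log p_i + (1 - x_i) * log(1 - p_i)]."""
    return sum(
        math.log(p_i) if x_i else math.log(1.0 - p_i)
        for x_i, p_i in zip(x, p)
    )


# Feature 0 is present in 90% of this class's documents, feature 1 in 20%
p_class = [0.9, 0.2]
present = bernoulli_log_likelihood([1, 0], p_class)
absent = bernoulli_log_likelihood([0, 0], p_class)
print(present > absent)  # True: a missing feature 0 costs log(0.1) instead of log(0.9)
```

A multinomial model would simply skip a word with count zero, while the Bernoulli model charges for it.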

Implementation from Scratch

Python

import numpy as np
from collections import defaultdict

class NaiveBayesClassifier:
    def __init__(self, smoothing: float = 1.0):
        self.smoothing = smoothing
        self.class_log_priors: dict[int, float] = {}
        self.feature_log_probs: dict[int, np.ndarray] = {}  # class → per-feature log probabilities
    
    def fit(self, X: np.ndarray, y: np.ndarray):
        """Multinomial Naive Bayes for count features."""
        classes = np.unique(y)
        n_samples, n_features = X.shape
        
        for c in classes:
            mask = (y == c)
            n_class = mask.sum()
            
            # Log prior: log P(class)
            self.class_log_priors[c] = np.log(n_class / n_samples)
            
            # Feature counts in this class (sum over documents)
            feature_counts = X[mask].sum(axis=0)  # shape: (n_features,)
            
            # Laplace smoothing
            smoothed_counts = feature_counts + self.smoothing
            total_count = smoothed_counts.sum()
            
            # Log likelihood: log P(xᵢ | class)
            self.feature_log_probs[c] = np.log(smoothed_counts / total_count)
        
        return self
    
    def predict_log_proba(self, X: np.ndarray) -> np.ndarray:
        """Log P(class | x) ∝ log P(class) + Σᵢ xᵢ × log P(featureᵢ | class)."""
        classes = sorted(self.class_log_priors.keys())
        log_proba = np.zeros((X.shape[0], len(classes)))
        
        for i, c in enumerate(classes):
            # For multinomial: log P(doc | class) = Σ count(word) × log P(word | class)
            log_proba[:, i] = self.class_log_priors[c] + X.dot(self.feature_log_probs[c])
        
        return log_proba
    
    def predict(self, X: np.ndarray) -> np.ndarray:
        log_proba = self.predict_log_proba(X)
        return np.array(sorted(self.class_log_priors.keys()))[np.argmax(log_proba, axis=1)]
    
    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        log_proba = self.predict_log_proba(X)
        # Convert log scores to probabilities via softmax
        log_proba -= log_proba.max(axis=1, keepdims=True)  # numerical stability
        proba = np.exp(log_proba)
        return proba / proba.sum(axis=1, keepdims=True)
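
The classifier above works in log space throughout. The sketch below, using made-up probabilities, shows why: multiplying many small likelihoods underflows to zero in floating point, while summing their logs stays well-behaved.

```python
import math

# 100 features, each with likelihood 1e-5 (illustrative values)
probs = [1e-5] * 100

product = 1.0
for p in probs:
    product *= p  # true value is 1e-500, below the smallest representable double

log_sum = sum(math.log(p) for p in probs)

print(product)  # 0.0
print(log_sum)  # roughly -1151.29
```

Once every class score has underflowed to 0.0, argmax can no longer distinguish the classes, which is the situation the log-domain sum avoids.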

Using scikit-learn

Python

from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

# Text classification pipeline
text_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB(alpha=1.0)),
])

text_clf.fit(X_train_texts, y_train)
y_pred = text_clf.predict(X_test_texts)
print(classification_report(y_test, y_pred))

# Gaussian NB for continuous features
gnb = GaussianNB(var_smoothing=1e-9)  # adds this fraction of the largest feature variance to all variances, for numerical stability
gnb.fit(X_train, y_train)
print(f"Accuracy: {gnb.score(X_test, y_test):.4f}")

# Access learned parameters
print(f"Class means (theta): {gnb.theta_}")      # (n_classes, n_features)
print(f"Class variances (var): {gnb.var_}")       # (n_classes, n_features)
print(f"Class priors: {gnb.class_prior_}")        # GaussianNB exposes class_prior_, not class_log_prior_

When Naive Bayes Works Despite the Wrong Assumption

The naive assumption (feature independence given class) is almost always wrong.
Yet Naive Bayes often performs well. Why?

1. Decision boundary, not probabilities:
   For classification, we only care about which class has highest P(class|x)
   Even if absolute probabilities are wrong, the RANKING may be correct
   → Correct classification despite miscalibrated probabilities

2. Many weak features:
   Text classification: "warfarin" and "anticoagulant" are correlated
   But with 10,000 features, any single correlated pair has little influence on the total score
   → Modelling correlations brings only marginal benefit

3. Small data:
   Correctly modelling high-dimensional correlations requires huge datasets
   With small data, the variance of a complex model outweighs the bias of Naive Bayes
   → Naive Bayes wins the bias-variance trade-off with few samples

When it fails:
   Strong correlations among a few features (e.g., highly correlated lab values), whose evidence gets counted more than once
   When calibrated probabilities matter (not just ranking)
   When continuous features are clearly non-Gaussian, so Gaussian NB's per-feature density fits poorly
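
A toy calculation illustrates both the ranking argument and the calibration failure. It assumes two classes, equal priors, and one piece of evidence with a 7:3 likelihood ratio that Naive Bayes sees as several perfectly correlated copies. The function name posterior_class1 is hypothetical.

```python
import math


def posterior_class1(log_odds_per_feature, n_copies, prior1=0.5):
    """P(class = 1 | x) when the same evidence is counted n_copies times."""
    log_odds = math.log(prior1 / (1 - prior1)) + n_copies * log_odds_per_feature
    return 1.0 / (1.0 + math.exp(-log_odds))


evidence = math.log(0.7 / 0.3)        # one feature favouring class 1 at 7:3
once = posterior_class1(evidence, 1)  # about 0.70, the correct posterior
five = posterior_class1(evidence, 5)  # about 0.99 when five duplicates are treated as independent

print(round(once, 3), round(five, 3))
```

Both posteriors sit above 0.5, so the predicted class is unchanged, but the second is far more confident than the evidence supports.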

Interview Answer

"Naive Bayes applies Bayes' theorem with the conditional independence assumption: P(x₁,...,xₙ|class) = Π P(xᵢ|class). This makes training tractable: estimate one distribution per feature per class rather than a full n-dimensional joint. Three variants: Gaussian (continuous features), Multinomial (word counts, with Laplace smoothing), Bernoulli (binary features). Despite the independence assumption being almost always wrong, Naive Bayes often works because: (1) classification needs correct ranking, not correct probabilities; (2) with many features, pairwise correlations are diluted; (3) with small data, it wins the bias-variance trade-off over complex models. It fails when features are strongly correlated or when calibrated probabilities (not just rankings) matter."
