Naive Bayes Classifier
A complete guide to Naive Bayes — the conditional independence assumption, variants (Gaussian, Multinomial, Bernoulli), when it works despite the assumption, and implementation.
The Core Idea
Bayes' theorem for classification:
P(class | features) ∝ P(features | class) × P(class)
Problem: P(features | class) = P(x₁, x₂, ..., xₙ | class)
With n features, this requires estimating an n-dimensional joint distribution for every class
With binary features and n = 20: 2²⁰ ≈ 1M feature combinations per class, each with its own probability to estimate
Naive Bayes assumption: features are conditionally independent given the class
P(x₁, x₂, ..., xₙ | class) = Πᵢ P(xᵢ | class)
Now: only n one-dimensional distributions to estimate per class, one per feature
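To make the saving concrete, here is a minimal sketch that counts the free parameters per class for binary features, with and without the independence assumption. The function names are illustrative and not part of any library.

```python
def joint_table_params(num_features: int) -> int:
    """Free parameters of a full joint table over binary features, for one class."""
    # 2^num_features configurations, minus 1 because the probabilities sum to 1
    return 2 ** num_features - 1


def naive_bayes_params(num_features: int) -> int:
    """Free parameters under conditional independence, for one class."""
    # One Bernoulli parameter P(x_i = 1 | class) per feature
    return num_features


for num_features in (5, 10, 20):
    print(
        num_features,
        joint_table_params(num_features),
        naive_bayes_params(num_features),
    )
```

The joint table grows exponentially with the number of features, while the naive factorisation grows linearly.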
Tractable even with thousands of features
Three Variants
Gaussian Naive Bayes:
Continuous features
xᵢ | class = k ~ Normal(μᵢₖ, σᵢₖ²)
Estimates: mean and variance of each feature within each class
Use for: continuous measurement data (lab values, sensor readings)
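A minimal pure-Python sketch of the Gaussian variant, assuming a tiny toy dataset of two made-up measurements per sample. The helper names and the `var_floor` constant are assumptions made for illustration; scikit-learn's `GaussianNB` smooths variances differently.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Return {class: (log_prior, means, variances)} from rows X and labels y."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)

    model = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # Per-feature variance within the class; var_floor (an assumption) avoids division by zero
        variances = [
            sum((v - m) ** 2 for v in col) / n + var_floor
            for col, m in zip(zip(*rows), means)
        ]
        model[label] = (math.log(n / len(X)), means, variances)
    return model


def gaussian_log_pdf(x, mean, var):
    """Log density of Normal(mean, var) at x."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def predict_gaussian_nb(model, x):
    """Class with the highest log prior + sum of per-feature log densities."""
    scores = {
        label: log_prior
        + sum(gaussian_log_pdf(xi, m, v) for xi, m, v in zip(x, means, variances))
        for label, (log_prior, means, variances) in model.items()
    }
    return max(scores, key=scores.get)


# Toy data: two hypothetical measurements per sample, two classes
X = [[1.0, 2.1], [1.2, 1.9], [0.9, 2.0], [3.0, 4.1], [3.2, 3.9], [2.9, 4.0]]
y = [0, 0, 0, 1, 1, 1]
model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(model, [1.1, 2.0]))
print(predict_gaussian_nb(model, [3.1, 4.0]))
```

Training amounts to one pass that records a mean and a variance per feature per class; prediction adds up one-dimensional log densities.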
Multinomial Naive Bayes:
Count features (e.g., word counts in a document)
P(word | class) = (count of word in class documents + α) / (total word count in class documents + α × vocabulary size)
Laplace smoothing: α = 1 keeps a word never seen in a class from getting zero probability
Use for: text classification, document classification
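The effect of smoothing can be seen in a small sketch. It assumes the vocabulary is exactly the set of keys in the count dictionary; the word counts are made up and the function name is illustrative.

```python
def smoothed_word_probs(word_counts, alpha=1.0):
    """P(word | class) = (count + alpha) / (total + alpha * vocabulary size)."""
    total = sum(word_counts.values())
    vocab_size = len(word_counts)
    return {
        word: (count + alpha) / (total + alpha * vocab_size)
        for word, count in word_counts.items()
    }


# Hypothetical word counts over all documents of one class
spam_counts = {"free": 3, "offer": 1, "meeting": 0}

unsmoothed = smoothed_word_probs(spam_counts, alpha=0.0)
smoothed = smoothed_word_probs(spam_counts, alpha=1.0)

# With alpha = 0 an unseen word has probability 0, which would zero out
# the likelihood of any document containing it
print(unsmoothed["meeting"])
# With alpha = 1 the same word gets a small, non-zero probability
print(smoothed["meeting"])
```

With α = 1 the three probabilities become 4/7, 2/7 and 1/7, which still sum to 1.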
Bernoulli Naive Bayes:
Binary features (e.g., word present/absent)
P(xᵢ = 1 | class) = pᵢₖ (learned from data)
Explicitly penalises the absence of words that are indicative of the class: every absent feature (xᵢ = 0) contributes a factor (1 − pᵢₖ)
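A short sketch of that penalty, assuming made-up presence probabilities for a single class. The second function is a hypothetical comparison score that skips absent features, which is how zero counts behave in the multinomial model (0 × log p = 0).

```python
import math


def bernoulli_log_likelihood(x, p):
    """log P(x | class) for a binary vector x and per-feature presence probabilities p."""
    return sum(
        math.log(pi) if xi else math.log(1.0 - pi)
        for xi, pi in zip(x, p)
    )


def present_only_log_likelihood(x, p):
    """Comparison score that ignores absent features entirely."""
    return sum(math.log(pi) for xi, pi in zip(x, p) if xi)


# Hypothetical P(word present | class) for three words
p_class = [0.9, 0.8, 0.1]
# A document in which only the third word appears
doc = [0, 0, 1]

full = bernoulli_log_likelihood(doc, p_class)
partial = present_only_log_likelihood(doc, p_class)
print(f"Bernoulli score:    {full:.3f}")
print(f"present-only score: {partial:.3f}")
```

The Bernoulli score is lower because the two words that usually appear in this class are missing, and each absence is counted as evidence against the class.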
Use for: short text, binary feature matrices
Implementation from Scratch
import numpy as np


class NaiveBayesClassifier:
    def __init__(self, smoothing: float = 1.0):
        self.smoothing = smoothing
        self.class_log_priors: dict[int, float] = {}
        self.feature_log_probs: dict[int, np.ndarray] = {}  # class → feature probs

    def fit(self, X: np.ndarray, y: np.ndarray):
        """Multinomial Naive Bayes for count features."""
        classes = np.unique(y)
        n_samples, n_features = X.shape
        for c in classes:
            mask = (y == c)
            n_class = mask.sum()
            # Log prior: log P(class)
            self.class_log_priors[c] = np.log(n_class / n_samples)
            # Feature counts in this class (sum over documents)
            feature_counts = X[mask].sum(axis=0)  # shape: (n_features,)
            # Laplace smoothing
            smoothed_counts = feature_counts + self.smoothing
            total_count = smoothed_counts.sum()
            # Log likelihood: log P(xᵢ | class)
            self.feature_log_probs[c] = np.log(smoothed_counts / total_count)
        return self

    def predict_log_proba(self, X: np.ndarray) -> np.ndarray:
        """Log P(class | x) ∝ log P(class) + Σᵢ xᵢ × log P(featureᵢ | class)."""
        classes = sorted(self.class_log_priors.keys())
        log_proba = np.zeros((X.shape[0], len(classes)))
        for i, c in enumerate(classes):
            # For multinomial: log P(doc | class) = Σ count(word) × log P(word | class)
            log_proba[:, i] = self.class_log_priors[c] + X.dot(self.feature_log_probs[c])
        return log_proba

    def predict(self, X: np.ndarray) -> np.ndarray:
        log_proba = self.predict_log_proba(X)
        return np.array(sorted(self.class_log_priors.keys()))[np.argmax(log_proba, axis=1)]

    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        log_proba = self.predict_log_proba(X)
        # Convert log scores to probabilities via softmax
        log_proba -= log_proba.max(axis=1, keepdims=True)  # numerical stability
        proba = np.exp(log_proba)
        return proba / proba.sum(axis=1, keepdims=True)
Using scikit-learn
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

# Assumed to exist already: X_train_texts / X_test_texts (lists of strings),
# X_train / X_test (numeric arrays), and the matching labels y_train / y_test.

# Text classification pipeline
text_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB(alpha=1.0)),
])
text_clf.fit(X_train_texts, y_train)
y_pred = text_clf.predict(X_test_texts)
print(classification_report(y_test, y_pred))

# Gaussian NB for continuous features
# var_smoothing: fraction of the largest feature variance added to every variance for stability
gnb = GaussianNB(var_smoothing=1e-9)
gnb.fit(X_train, y_train)
print(f"Accuracy: {gnb.score(X_test, y_test):.4f}")

# Access learned parameters
print(f"Class means (theta): {gnb.theta_}")  # (n_classes, n_features)
print(f"Class variances (var): {gnb.var_}")  # (n_classes, n_features)
# GaussianNB exposes class_prior_ (plain probabilities); class_log_prior_ belongs to MultinomialNB/BernoulliNB
print(f"Class priors: {gnb.class_prior_}")  # (n_classes,)
When Naive Bayes Works Despite the Wrong Assumption
The naive assumption (feature independence given class) is almost always wrong.
Yet Naive Bayes often performs well. Why?
1. Decision boundary, not probabilities:
For classification, we only care about which class has highest P(class|x)
Even if absolute probabilities are wrong, the RANKING may be correct
→ Correct classification despite miscalibrated probabilities
2. Many weak features:
Text classification: "warfarin" and "anticoagulant" are correlated
But with 10,000 features, any single correlated pair contributes only a small share of the total score
→ Modelling the correlations explicitly adds little
3. Small data:
Correctly modelling high-dimensional correlations requires huge datasets
With small data, the variance of a complex model can outweigh the bias of Naive Bayes
→ Naive Bayes wins the bias-variance trade-off with few samples
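Point 1 in the list above can be illustrated with a small sketch. It assumes two classes, A and B, with equal priors and made-up likelihoods, and compares one informative feature against the same feature copied five times (an extreme case of correlated features). The function name is illustrative.

```python
import math


def naive_posterior_a(likelihoods_a, likelihoods_b, prior_a=0.5):
    """P(A | x) for two classes A and B, multiplying per-feature likelihoods naively."""
    score_a = prior_a * math.prod(likelihoods_a)
    score_b = (1.0 - prior_a) * math.prod(likelihoods_b)
    return score_a / (score_a + score_b)


# One informative feature observed: P(x | A) = 0.7, P(x | B) = 0.4 (made-up values)
single = naive_posterior_a([0.7], [0.4])

# The same feature copied five times: the copies carry no new evidence,
# but the naive product counts that evidence five times over
duplicated = naive_posterior_a([0.7] * 5, [0.4] * 5)

print(f"one copy:    P(A | x) = {single:.3f}")
print(f"five copies: P(A | x) = {duplicated:.3f}")
```

Both posteriors pick class A, but the duplicated version reports roughly 0.94 instead of 0.64 even though the real evidence is unchanged: the ranking survives while the probability becomes overconfident.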
When it fails:
Strong correlations between few features (e.g., highly correlated lab values)
When calibrated probabilities matter (not just ranking)
When continuous features are strongly non-Gaussian within a class and Gaussian NB is used
Interview Answer
"Naive Bayes applies Bayes' theorem with the conditional independence assumption: P(x₁,...,xₙ|class) = Π P(xᵢ|class). This makes training tractable — estimate one distribution per feature per class rather than a full d-dimensional joint. Three variants: Gaussian (continuous features), Multinomial (word counts, with Laplace smoothing), Bernoulli (binary features). Despite the independence assumption being almost always wrong, Naive Bayes often works because: (1) classification needs correct ranking, not correct probabilities; (2) with many features, pairwise correlations are diluted; (3) with small data, it wins the bias-variance trade-off over complex models. It fails when features are strongly correlated or when calibrated probabilities (not just rankings) matter."