What is Variance in Machine Learning?
Understand variance in ML: sensitivity to training data noise, high-variance models, overfitting connection, and how to measure and reduce variance with regularization, ensembles, and more data.
What is Variance?
Variance measures how much the model's predictions change when it is trained on different subsets of the data. A high-variance model is very sensitive to the specific training examples it sees: small changes in training data lead to very different models.
Low variance: train the model 10 times on 10 different subsets
  → predictions are nearly identical each time
  → model learned a stable pattern
High variance: train the model 10 times on 10 different subsets
  → predictions are very different each time
  → model memorized noise in each subset
High Variance = Overfitting
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
np.random.seed(42)
X = np.random.randn(300, 10)
y = (X[:, 0] + X[:, 1] + np.random.randn(300) * 0.3 > 0).astype(int)
# High variance: unconstrained decision tree
tree = DecisionTreeClassifier(max_depth=None)
tree_scores = cross_val_score(tree, X, y, cv=10, scoring="accuracy")
print(f"Deep tree: mean={tree_scores.mean():.3f}, std={tree_scores.std():.3f}")
# std is high → large variance across folds
# Low variance: logistic regression
lr = LogisticRegression()
lr_scores = cross_val_score(lr, X, y, cv=10, scoring="accuracy")
print(f"Logistic: mean={lr_scores.mean():.3f}, std={lr_scores.std():.3f}")
# std is low → stable across folds
# A high-variance model changes a lot depending on which training data it saw
Demonstrating Variance Directly
from sklearn.model_selection import train_test_split
def measure_model_variance(model_class, model_kwargs, X, y, n_trials=20):
    """
    Train the same model on n different random subsets.
    High variance: predictions vary a lot across trials.
    """
    test_scores = []
    for seed in range(n_trials):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
        model = model_class(**model_kwargs)
        model.fit(X_tr, y_tr)
        test_scores.append(model.score(X_te, y_te))
    scores = np.array(test_scores)
    print(f"Mean: {scores.mean():.3f}, Std: {scores.std():.3f}, Range: [{scores.min():.3f}, {scores.max():.3f}]")
    return scores
print("Deep tree (high variance):")
measure_model_variance(DecisionTreeClassifier, {"max_depth": None}, X, y)
# Std is high → very sensitive to training data
print("Constrained tree (low variance):")
measure_model_variance(DecisionTreeClassifier, {"max_depth": 4}, X, y)
# Std is lower → more stable
What Causes High Variance?
| Cause | Explanation |
|---|---|
| Model too complex | Enough parameters to memorize individual training examples |
| Too little training data | Few examples, so the model is sensitive to each one |
| Too many features | Many irrelevant features create noise that gets memorized |
| Noise in labels | Mislabeled examples get memorized as "true" patterns |
| No regularization | Nothing prevents the model from fitting every data point |
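Two of the causes in the table, model complexity and training set size, can be checked empirically. The sketch below measures prediction variance directly: it trains the same model on many independently drawn training sets and looks at how often the predicted label for a fixed test point changes. The helper names `make_dataset` and `prediction_variance` are hypothetical, and the data generator is only assumed to mirror the synthetic task used earlier in this article.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier


def make_dataset(n, rng):
    """Hypothetical helper: same kind of synthetic task as the article's X, y."""
    X = rng.standard_normal((n, 10))
    noise = rng.standard_normal(n) * 0.3
    y = (X[:, 0] + X[:, 1] + noise > 0).astype(int)
    return X, y


def prediction_variance(make_model, n_train, X_test, n_trials=30, seed=0):
    """Mean per-point variance of predicted labels over freshly drawn training sets."""
    rng = np.random.default_rng(seed)
    preds = np.empty((n_trials, len(X_test)))
    for trial in range(n_trials):
        X_tr, y_tr = make_dataset(n_train, rng)
        preds[trial] = make_model().fit(X_tr, y_tr).predict(X_test)
    # Variance across trials for each test point, then averaged over all points.
    return float(preds.var(axis=0).mean())


def deep_tree():
    """Unconstrained tree: the high-capacity model from the table's first row."""
    return DecisionTreeClassifier(max_depth=None, random_state=0)


# Fixed evaluation points shared by every trial.
X_test, _ = make_dataset(500, np.random.default_rng(123))

results = {
    "deep tree, n=100": prediction_variance(deep_tree, 100, X_test),
    "deep tree, n=2000": prediction_variance(deep_tree, 2000, X_test),
    "logistic regression, n=100": prediction_variance(LogisticRegression, 100, X_test),
}
for name, value in results.items():
    print(f"{name:28s} prediction variance = {value:.4f}")
```

For 0/1 predictions the per-point variance is at most 0.25, which is reached when a point is labeled 1 in exactly half of the trials. On this task the deep tree should show a clearly larger value than logistic regression at the same training size, and its value should shrink as the training set grows.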
Reducing Variance
1. Ensembles (Bagging)
Bagging (Bootstrap Aggregating) trains many models on bootstrap samples (random subsets of the training data drawn with replacement) and averages their predictions. Individual models may have high variance, but the average is much more stable.
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
# Random Forest: bagging of decision trees
rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
print("Random Forest (low variance via bagging):")
measure_model_variance(
    RandomForestClassifier,
    {"n_estimators": 100, "max_depth": 5, "random_state": 0},
    X, y
)
# Typically a much lower std than a single deep tree
Why bagging works: individual trees overfit differently to their bootstrap samples. Because their errors are only partly correlated, they partly cancel out when the predictions are averaged.
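To make the averaging argument concrete, here is a minimal hand-rolled bagging sketch. It is not how `RandomForestClassifier` is implemented (a random forest also subsamples features at each split). The helpers `make_dataset`, `fit_bagged_trees`, and `bagged_predict` are hypothetical names, and the data generator is assumed to match the synthetic task used earlier.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def make_dataset(n, rng):
    """Hypothetical helper: same kind of synthetic task as the article's X, y."""
    X = rng.standard_normal((n, 10))
    y = (X[:, 0] + X[:, 1] + rng.standard_normal(n) * 0.3 > 0).astype(int)
    return X, y


def fit_bagged_trees(X, y, n_trees, rng):
    """Fit unconstrained trees, each on its own bootstrap sample."""
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))  # sample rows with replacement
        tree = DecisionTreeClassifier(max_depth=None, random_state=0)
        trees.append(tree.fit(X[idx], y[idx]))
    return trees


def bagged_predict(trees, X):
    """Majority vote over the individual trees' 0/1 predictions."""
    votes = np.mean([tree.predict(X) for tree in trees], axis=0)
    return (votes >= 0.5).astype(int)


rng = np.random.default_rng(0)
X_test, _ = make_dataset(500, rng)  # fixed evaluation points

single_preds, bagged_preds = [], []
for _ in range(20):  # 20 independent training sets
    X_tr, y_tr = make_dataset(200, rng)
    single = DecisionTreeClassifier(max_depth=None, random_state=0).fit(X_tr, y_tr)
    single_preds.append(single.predict(X_test))
    bagged_preds.append(bagged_predict(fit_bagged_trees(X_tr, y_tr, 25, rng), X_test))

# Per-point variance of the predicted label across training sets, averaged over points.
single_var = float(np.var(single_preds, axis=0).mean())
bagged_var = float(np.var(bagged_preds, axis=0).mean())
print(f"Single deep tree: {single_var:.4f}")
print(f"Bagged (25 trees): {bagged_var:.4f}")
```

An odd number of trees is used so the majority vote never ties. The bagged predictions should change less from one training set to the next than those of a single deep tree, which is the variance reduction described above.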
2. Regularization
from sklearn.linear_model import RidgeClassifier
# More regularization → lower variance, higher bias
# RidgeClassifier is used because y holds class labels; plain Ridge is a regressor and would report R^2
for alpha in [0.001, 0.1, 1.0, 10.0, 100.0]:
    ridge_scores = cross_val_score(RidgeClassifier(alpha=alpha), X, y, cv=10)
    print(f"Ridge alpha={alpha:7.3f}: mean={ridge_scores.mean():.3f}, std={ridge_scores.std():.3f}")
# As alpha increases, std tends to decrease (lower variance), but mean may also decrease (higher bias)
3. More Training Data
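# With more data, the specific training subset matters less
def make_data(n, seed=0):
    # Assumed helper (not defined in the original): same synthetic task as above, with n samples
    rng = np.random.default_rng(seed)
    X_new = rng.standard_normal((n, 10))
    y_new = (X_new[:, 0] + X_new[:, 1] + rng.standard_normal(n) * 0.3 > 0).astype(int)
    return X_new, y_new

for n in [100, 500, 2000, 10000]:
    X_big, y_big = make_data(n)
    scores = cross_val_score(DecisionTreeClassifier(max_depth=None), X_big, y_big, cv=5)
    print(f"n={n:5d}: std={scores.std():.4f}")
# As n increases, variance tends to decrease, even for an unconstrained tree
4. Dropout (Neural Networks)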
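Dropout can be viewed as an approximate form of ensemble learning: each training step randomly deactivates a different subset of neurons, so it updates a different sub-network. At inference time, the full network with appropriately scaled activations approximates averaging the predictions of an exponential number of these sub-networks.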
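The ensemble view can be illustrated with a small NumPy sketch of "inverted" dropout, the common variant that rescales the surviving activations during training so that no rescaling is needed at inference. The function name `dropout`, the shapes, and the drop probability are illustrative assumptions, and the example uses a single linear readout, where the match between the sub-network average and the full network holds in expectation. With nonlinear layers it is only an approximation.

```python
import numpy as np


def dropout(activations, drop_prob, rng, training=True):
    """Inverted dropout: zero units at random and rescale the survivors."""
    if not training or drop_prob == 0.0:
        return activations  # inference: use every unit, no mask
    keep_prob = 1.0 - drop_prob
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob


rng = np.random.default_rng(0)
hidden = rng.standard_normal((4, 8))   # 4 examples, 8 hidden units
weights = rng.standard_normal(8)       # linear readout on top of the hidden layer

# Each training-mode pass uses a different random sub-network.
sub_network_outputs = np.stack(
    [dropout(hidden, 0.5, rng) @ weights for _ in range(20000)]
)
ensemble_mean = sub_network_outputs.mean(axis=0)

# A single inference-mode pass through the full network.
full_network = dropout(hidden, 0.5, rng, training=False) @ weights

print("Spread across sub-networks:", sub_network_outputs.std(axis=0).round(2))
print("Mean over sub-networks:    ", ensemble_mean.round(2))
print("Full network at inference: ", full_network.round(2))
```

Individual sub-networks disagree noticeably, while their average lands close to the single full-network pass, which is the sense in which dropout behaves like a cheap ensemble.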
Variance in Cross-Validation
The standard deviation of cross-validation scores gives a quick, rough estimate of variance (it also reflects differences between the test folds):
scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
print(f"Mean accuracy: {scores.mean():.3f}")
print(f"Std deviation: {scores.std():.3f}")  # Rough proxy for variance
# Rule of thumb (thresholds depend on dataset size and metric):
# std > 0.05: high variance → model is unstable
# std < 0.02: low variance → stable across folds
Interview Answer Template
Q: What is variance in machine learning?
Variance measures how much the model's predictions change depending on the specific training data it was trained on. A high-variance model (like an unconstrained decision tree) memorizes the training set: train it on a slightly different subset and you get a completely different model. This shows up as high standard deviation in cross-validation scores. High variance is essentially overfitting: the model is sensitive to noise in the training data rather than learning the underlying pattern. The fixes include regularization (penalizing complexity), ensembles (bagging averages out variance across many models), collecting more data (larger datasets make any single example less influential), and constraining model capacity. The bias-variance tradeoff is the core tension: reducing variance typically increases bias, and vice versa.
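For squared-error loss, the tradeoff in the answer above can be written down exactly using the standard bias-variance decomposition. The notation is an assumption introduced here, not the article's own: $D$ is a random training set, $\hat f_D$ is the model trained on it, and the target is $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\operatorname{Var}(\varepsilon) = \sigma^2$.

```latex
\mathbb{E}_{D,\varepsilon}\!\left[\big(y - \hat f_D(x)\big)^2\right]
  = \underbrace{\big(\mathbb{E}_D[\hat f_D(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}_D\!\left[\big(\hat f_D(x) - \mathbb{E}_D[\hat f_D(x)]\big)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

The middle term is the variance discussed throughout: the expected squared deviation of the prediction at $x$ from its average over training sets. For classification with 0/1 loss the decomposition is less clean, so this formula is best read as the regression version of the same idea.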