Machine Learning Foundations · Lesson 27 of 70

What is Variance in Machine Learning?

What is Variance?

Variance measures how much the model's predictions change when trained on different subsets of the data. A high-variance model is very sensitive to the specific training examples it sees — small changes in training data lead to very different models.

Low variance:  train the model 10 times on 10 different subsets
               → predictions are nearly identical each time
               → model learned a stable pattern

High variance: train the model 10 times on 10 different subsets
               → predictions are very different each time
               → model memorized noise in each subset
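
One common way to write this down, as a sketch: the symbols below are notation introduced here for illustration, not part of the lesson. Let $D$ be a training set drawn at random, $\hat{f}_{D}$ the model fitted on it, and $x$ a fixed input. The variance at $x$ is the expected squared deviation of the prediction from its average over training sets:

```latex
\operatorname{Var}(x) = \mathbb{E}_{D}\!\left[\left(\hat{f}_{D}(x) - \mathbb{E}_{D}\big[\hat{f}_{D}(x)\big]\right)^{2}\right]
```

The "train 10 times on 10 different subsets" picture above is an empirical version of this expectation, with the average taken over the 10 fitted models.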

High Variance = Overfitting

Python

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

np.random.seed(42)
X = np.random.randn(300, 10)
y = (X[:, 0] + X[:, 1] + np.random.randn(300) * 0.3 > 0).astype(int)

# High variance: unconstrained decision tree
tree = DecisionTreeClassifier(max_depth=None)
tree_scores = cross_val_score(tree, X, y, cv=10, scoring="accuracy")
print(f"Deep tree: mean={tree_scores.mean():.3f}, std={tree_scores.std():.3f}")
# A larger std means scores swing more from fold to fold (higher variance)

# Low variance: logistic regression
lr = LogisticRegression()
lr_scores = cross_val_score(lr, X, y, cv=10, scoring="accuracy")
print(f"Logistic:  mean={lr_scores.mean():.3f}, std={lr_scores.std():.3f}")
# A smaller std means the model is more stable across folds

# A high-variance model changes a lot depending on which training data it saw

Demonstrating Variance Directly

Python

from sklearn.model_selection import train_test_split

def measure_model_variance(model_class, model_kwargs, X, y, n_trials=20):
    """
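    Train the same model on n_trials different random train/test splits
    and report the spread of test accuracy. A large spread suggests high
    variance, though part of it comes from the test set changing as well.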
    """
    test_scores = []
    for seed in range(n_trials):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
        model = model_class(**model_kwargs)
        model.fit(X_tr, y_tr)
        test_scores.append(model.score(X_te, y_te))

    scores = np.array(test_scores)
    print(f"Mean: {scores.mean():.3f}, Std: {scores.std():.3f}, Range: [{scores.min():.3f}, {scores.max():.3f}]")
    return scores

print("Deep tree (high variance):")
measure_model_variance(DecisionTreeClassifier, {"max_depth": None}, X, y)
# Std is high — very sensitive to training data

print("Constrained tree (low variance):")
measure_model_variance(DecisionTreeClassifier, {"max_depth": 4}, X, y)
# Std is lower — more stable
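
The function above tracks test accuracy, which also moves because the test split changes from trial to trial. A more direct view, sketched here under stated assumptions, keeps one evaluation set fixed and checks how often retrained models disagree on the same points. The helper name `prediction_instability`, the bootstrap resampling, and the synthetic data are illustrative choices, not part of the lesson's code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

def prediction_instability(make_model, X_pool, y_pool, X_eval, n_trials=20, seed=0):
    """Mean per-point variance of 0/1 predictions across retrained models.

    Each trial fits a fresh model on a bootstrap resample of the pool and
    predicts the same fixed evaluation points. 0.0 means every model agrees
    everywhere; 0.25 is the maximum possible for binary predictions.
    """
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_trials):
        idx = rng.integers(0, len(X_pool), size=len(X_pool))  # sample with replacement
        model = make_model()
        model.fit(X_pool[idx], y_pool[idx])
        preds.append(model.predict(X_eval))
    preds = np.array(preds)            # shape: (n_trials, n_eval_points)
    return preds.var(axis=0).mean()    # variance across trials, averaged over points

# Synthetic data of the same form as earlier in the lesson (assumed for this sketch)
rng = np.random.default_rng(42)
X_demo = rng.standard_normal((400, 10))
y_demo = (X_demo[:, 0] + X_demo[:, 1] + rng.standard_normal(400) * 0.3 > 0).astype(int)
X_pool, y_pool, X_eval = X_demo[:300], y_demo[:300], X_demo[300:]

deep = prediction_instability(
    lambda: DecisionTreeClassifier(max_depth=None, random_state=0), X_pool, y_pool, X_eval
)
linear = prediction_instability(lambda: LogisticRegression(), X_pool, y_pool, X_eval)

print(f"Deep tree instability:           {deep:.3f}")
print(f"Logistic regression instability: {linear:.3f}")
```

Because the evaluation points never change, any disagreement between trials comes from the training data alone, which is closer to what "variance" means in the definition at the top of the lesson.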

What Causes High Variance?

| Cause | Explanation |
|---|---|
| Model too complex | Enough parameters to memorize individual training examples |
| Too little training data | Few examples → model is sensitive to each one |
| Too many features | Many irrelevant features create noise that gets memorized |
| Noise in labels | Mislabeled examples get memorized as "true" patterns |
| No regularization | Nothing prevents the model from fitting every data point |

Reducing Variance

1. Ensembles (Bagging)

Bagging (Bootstrap Aggregating) trains many models on bootstrap samples (random draws from the training set, with replacement) and averages their predictions. Individual models may have high variance, but the average is much more stable.

Python

from sklearn.ensemble import RandomForestClassifier

# Random Forest: bagging of decision trees (plus random feature subsets at each split)
print("Random Forest (low variance via bagging):")
measure_model_variance(
    RandomForestClassifier,
    {"n_estimators": 100, "max_depth": 5, "random_state": 0},
    X, y
)
# Typically a much lower std than a single deep tree

Why bagging works: individual trees overfit differently to their bootstrap samples, so their errors are only partly correlated. Averaging cancels the uncorrelated part of those errors.
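
How much averaging helps depends on how correlated the individual models' errors are. For B models whose errors each have variance σ² and pairwise correlation ρ, the variance of the average is ρσ² + (1 − ρ)σ²/B. The simulation below is a minimal sketch that checks this formula on synthetic Gaussian errors; it does not train any trees, and the function names are hypothetical.

```python
import numpy as np

def simulate_average_variance(n_models, rho, sigma=1.0, n_draws=20_000, seed=0):
    """Empirical variance of the mean of n_models equally correlated errors."""
    rng = np.random.default_rng(seed)
    shared = rng.standard_normal(n_draws)                # component common to all models
    private = rng.standard_normal((n_draws, n_models))   # component unique to each model
    # Each column has variance sigma^2; any two columns have correlation rho
    errors = sigma * (np.sqrt(rho) * shared[:, None] + np.sqrt(1 - rho) * private)
    return errors.mean(axis=1).var()

def theoretical_average_variance(n_models, rho, sigma=1.0):
    """rho * sigma^2 + (1 - rho) * sigma^2 / B."""
    return rho * sigma**2 + (1 - rho) * sigma**2 / n_models

for n_models in [1, 10, 100]:
    for rho in [0.0, 0.5]:
        emp = simulate_average_variance(n_models, rho)
        theo = theoretical_average_variance(n_models, rho)
        print(f"B={n_models:3d}, rho={rho:.1f}: simulated={emp:.3f}, formula={theo:.3f}")
```

With uncorrelated errors the variance keeps falling as 1/B, but with ρ > 0 it cannot drop below ρσ². Bootstrap samples overlap heavily, so bagged trees are correlated; random forests also pick a random subset of features at each split, which lowers that correlation.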

2. Regularization

Python

from sklearn.linear_model import RidgeClassifier

# More regularization → lower variance, higher bias.
# y holds class labels, so use RidgeClassifier (scored by accuracy) rather than
# the Ridge regressor, whose default score is R².
for alpha in [0.001, 0.1, 1.0, 10.0, 100.0]:
    ridge_scores = cross_val_score(RidgeClassifier(alpha=alpha), X, y, cv=10)
    print(f"Ridge alpha={alpha:7.3f}: mean={ridge_scores.mean():.3f}, std={ridge_scores.std():.3f}")
# As alpha increases, std tends to decrease (lower variance), while the mean may also drop (higher bias)
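
The variance and bias effects of the penalty are easier to see on fitted weights than on accuracy scores. The sketch below is a toy regression setup, not the lesson's classification data. It assumes a fixed design matrix, true weights of all ones, and fresh Gaussian label noise on each trial, and it uses the closed-form ridge solution. The function names are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge solution (no intercept): (X^T X + alpha * I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

def variance_and_bias(alpha, n_samples=30, n_features=10, n_trials=500, seed=0):
    """Refit on the same inputs with fresh label noise; summarize the fitted weights."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_samples, n_features))   # fixed design across trials
    true_w = np.ones(n_features)                       # assumed ground-truth weights
    fits = np.array([
        ridge_fit(X, X @ true_w + rng.standard_normal(n_samples), alpha)
        for _ in range(n_trials)
    ])
    variance = fits.var(axis=0).sum()                          # spread of weights across trials
    squared_bias = ((fits.mean(axis=0) - true_w) ** 2).sum()   # offset of the average fit
    return variance, squared_bias

results = {alpha: variance_and_bias(alpha) for alpha in [0.0, 1.0, 10.0, 100.0]}
for alpha, (variance, squared_bias) in results.items():
    print(f"alpha={alpha:6.1f}: variance={variance:.4f}, squared bias={squared_bias:.4f}")
```

In this setup, larger alpha shrinks every fit toward zero, so the fits cluster more tightly (lower variance) around a point that sits farther from the true weights (higher bias).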

3. More Training Data

Python

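# With more data, the specific training subset matters less
def make_data(n, seed=42):  # same synthetic task as above, with n samples
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, 10))
    y = (X[:, 0] + X[:, 1] + rng.standard_normal(n) * 0.3 > 0).astype(int)
    return X, y

for n in [100, 500, 2000, 10000]:
    X_big, y_big = make_data(n)
    scores = cross_val_score(DecisionTreeClassifier(max_depth=None), X_big, y_big, cv=5)
    print(f"n={n:5d}: std={scores.std():.4f}")
# As n increases, std tends to shrink, even for an unconstrained tree (larger test folds help too)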

4. Dropout (Neural Networks)

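Dropout can be viewed as a form of ensemble learning. During training, each forward pass randomly disables a fraction of the neurons, so every update trains a different sub-network that shares weights with all the others. At inference time all neurons are active, and the full network approximates an average over the exponentially many sub-networks.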
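
The mechanics can be sketched in a few lines of NumPy. This is a minimal illustration of the "inverted dropout" formulation, assuming a drop probability of 0.5 and a dummy activation matrix of ones. The function name `dropout_forward` is hypothetical, and no real network or framework API is involved.

```python
import numpy as np

def dropout_forward(activations, drop_prob, training, rng):
    """Inverted dropout: zero random units in training, pass through at inference."""
    if not training or drop_prob == 0.0:
        return activations
    keep_prob = 1.0 - drop_prob
    mask = rng.random(activations.shape) < keep_prob   # True where the unit is kept
    return activations * mask / keep_prob              # rescale so the expected value is unchanged

rng = np.random.default_rng(0)
h = np.ones((10_000, 8))   # stand-in for one layer's activations

train_out = dropout_forward(h, drop_prob=0.5, training=True, rng=rng)
eval_out = dropout_forward(h, drop_prob=0.5, training=False, rng=rng)

print(f"Fraction of units zeroed in training: {(train_out == 0).mean():.3f}")
print(f"Mean activation in training:          {train_out.mean():.3f}")
print(f"Mean activation at inference:         {eval_out.mean():.3f}")
```

Each training call draws a new mask, which is what makes every pass a different sub-network. Dividing by `keep_prob` keeps the average activation the same in both modes, so the inference-time network can stand in for the average without extra rescaling.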

Variance in Cross-Validation

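The standard deviation of cross-validation scores gives a rough, practical proxy for variance. It also reflects the size of each test fold, so it is most useful for comparing models on the same data rather than as an absolute number: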

Python

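model = DecisionTreeClassifier(max_depth=4)  # example only; any scikit-learn estimator works
scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")

print(f"Mean accuracy: {scores.mean():.3f}")
print(f"Std deviation: {scores.std():.3f}")   # Rough proxy for variance

# Rule of thumb (approximate; small test folds inflate std on their own):
# std > 0.05: high variance, model is unstable
# std < 0.02: low variance, stable across folds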
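
Ten scores from a single run make for a noisy standard deviation, and small test folds add spread of their own. The sketch below makes two assumptions that are not in the lesson: it repeats the 10-fold split five times with `RepeatedStratifiedKFold`, and it compares the observed std against a binomial approximation of the spread expected from fold size alone, sqrt(p(1 − p)/m) for accuracy p and m test points per fold. The data is regenerated in the same form as earlier.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data of the same form as earlier in the lesson (assumed for this sketch)
rng = np.random.default_rng(42)
X_cv = rng.standard_normal((300, 10))
y_cv = (X_cv[:, 0] + X_cv[:, 1] + rng.standard_normal(300) * 0.3 > 0).astype(int)

n_splits = 10
cv = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=5, random_state=0)
models = {
    "deep tree": DecisionTreeClassifier(max_depth=None, random_state=0),
    "depth-4 tree": DecisionTreeClassifier(max_depth=4, random_state=0),
}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_cv, y_cv, cv=cv, scoring="accuracy")  # 50 scores
    p = scores.mean()
    fold_size = len(y_cv) // n_splits
    # Spread expected from test-fold size alone (binomial approximation)
    noise_floor = np.sqrt(p * (1 - p) / fold_size)
    results[name] = (scores, noise_floor)
    print(f"{name:12s}: mean={p:.3f}, std={scores.std():.3f}, fold-size noise ~{noise_floor:.3f}")
```

If the observed std is close to the fold-size noise, most of the spread may come from the small test folds rather than from the model, which is one reason to treat fixed thresholds such as 0.05 with caution on small datasets.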

Interview Answer Template

Q: What is variance in machine learning?

Variance measures how much the model's predictions change depending on the specific training data it was trained on. A high-variance model (like an unconstrained decision tree) memorizes the training set — train it on a slightly different subset and you get a completely different model. This shows up as high standard deviation in cross-validation scores. High variance is essentially overfitting: the model is sensitive to noise in the training data rather than learning the underlying pattern. The fixes include regularization (penalizing complexity), ensembles (bagging averages out variance across many models), collecting more data (larger datasets make any single example less influential), and constraining model capacity. The bias-variance tradeoff is the core tension: reducing variance typically increases bias, and vice versa.
