Learnixo

Machine Learning Foundations · Lesson 27 of 70

What is Variance in Machine Learning?

What is Variance?

Variance measures how much the model's predictions change when trained on different subsets of the data. A high-variance model is very sensitive to the specific training examples it sees — small changes in training data lead to very different models.

Low variance:  train the model 10 times on 10 different subsets
               → predictions are nearly identical each time
               → model learned a stable pattern

High variance: train the model 10 times on 10 different subsets
               → predictions are very different each time
               → model memorized noise in each subset
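The contrast above can be stated more formally. The notation here is an assumption introduced for illustration and is not from the lesson: $\hat{f}_{D}$ is the model fitted on training set $D$, $f$ is the true function, and $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^{2}$. The second line is the standard decomposition for squared-error loss; for classification accuracy it holds only as an analogy.

```latex
\operatorname{Var}\big[\hat{f}(x)\big]
  = \mathbb{E}_{D}\Big[\big(\hat{f}_{D}(x) - \mathbb{E}_{D}[\hat{f}_{D}(x)]\big)^{2}\Big]

\mathbb{E}_{D,\varepsilon}\Big[\big(y - \hat{f}_{D}(x)\big)^{2}\Big]
  = \underbrace{\big(\mathbb{E}_{D}[\hat{f}_{D}(x)] - f(x)\big)^{2}}_{\text{bias}^{2}}
  + \underbrace{\operatorname{Var}\big[\hat{f}(x)\big]}_{\text{variance}}
  + \underbrace{\sigma^{2}}_{\text{irreducible noise}}
```

The expectation over $D$ is what "train the model 10 times on 10 different subsets" approximates in practice.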

High Variance = Overfitting

Python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

np.random.seed(42)
X = np.random.randn(300, 10)
y = (X[:, 0] + X[:, 1] + np.random.randn(300) * 0.3 > 0).astype(int)

# High variance: unconstrained decision tree
tree = DecisionTreeClassifier(max_depth=None)
tree_scores = cross_val_score(tree, X, y, cv=10, scoring="accuracy")
print(f"Deep tree: mean={tree_scores.mean():.3f}, std={tree_scores.std():.3f}")
# std is high → large variance across folds

# Low variance: logistic regression
lr = LogisticRegression()
lr_scores = cross_val_score(lr, X, y, cv=10, scoring="accuracy")
print(f"Logistic:  mean={lr_scores.mean():.3f}, std={lr_scores.std():.3f}")
# std is low → stable across folds

# A high-variance model changes a lot depending on which training data it saw

Demonstrating Variance Directly

Python
from sklearn.model_selection import train_test_split

def measure_model_variance(model_class, model_kwargs, X, y, n_trials=20):
    """
    Train the same model on n_trials different random train/test splits.
    High variance: test accuracy varies a lot across trials.
    """
    test_scores = []
    for seed in range(n_trials):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
        model = model_class(**model_kwargs)
        model.fit(X_tr, y_tr)
        test_scores.append(model.score(X_te, y_te))

    scores = np.array(test_scores)
    print(f"Mean: {scores.mean():.3f}, Std: {scores.std():.3f}, Range: [{scores.min():.3f}, {scores.max():.3f}]")
    return scores

print("Deep tree (high variance):")
measure_model_variance(DecisionTreeClassifier, {"max_depth": None}, X, y)
# Std is high → very sensitive to training data

print("Constrained tree (low variance):")
measure_model_variance(DecisionTreeClassifier, {"max_depth": 4}, X, y)
# Std is lower → more stable
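
The spread of test scores mixes two effects: the model changing and the test set changing. A minimal sketch of a more direct measurement keeps the evaluation points fixed and looks at how much the predicted probability for each point moves across retrainings. The helper `prediction_variance`, the names `X_pool` and `X_eval`, and the synthetic data are assumptions made for this illustration, not part of the lesson's code.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def prediction_variance(make_model, X_pool, y_pool, X_eval,
                        n_trials=30, subset_frac=0.8, seed=0):
    """Average, over evaluation points, of the variance in predicted
    P(class 1) across models trained on different random subsets."""
    rng = np.random.default_rng(seed)
    n_sub = int(subset_frac * len(X_pool))
    probs = []
    for _ in range(n_trials):
        # Each trial trains on a different random subset of the pool
        idx = rng.choice(len(X_pool), size=n_sub, replace=False)
        model = make_model().fit(X_pool[idx], y_pool[idx])
        probs.append(model.predict_proba(X_eval)[:, 1])
    # Variance across trials for each evaluation point, then averaged
    return float(np.var(np.array(probs), axis=0).mean())

# Synthetic data mirroring the lesson's setup (assumed for this sketch)
rng = np.random.default_rng(42)
X_all = rng.standard_normal((400, 10))
y_all = (X_all[:, 0] + X_all[:, 1] + rng.standard_normal(400) * 0.3 > 0).astype(int)
X_pool, y_pool, X_eval = X_all[:300], y_all[:300], X_all[300:]

deep_var = prediction_variance(
    lambda: DecisionTreeClassifier(max_depth=None, random_state=0),
    X_pool, y_pool, X_eval)
shallow_var = prediction_variance(
    lambda: DecisionTreeClassifier(max_depth=2, random_state=0),
    X_pool, y_pool, X_eval)

print(f"Deep tree    prediction variance: {deep_var:.4f}")
print(f"Shallow tree prediction variance: {shallow_var:.4f}")
```

Because the evaluation points never change, any spread here comes from the training subsets alone. One would typically expect the unconstrained tree to show the larger value, though the exact numbers depend on the data and seeds.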

What Causes High Variance?

| Cause | Explanation |
|---|---|
| Model too complex | Enough parameters to memorize individual training examples |
| Too little training data | Few examples → model is sensitive to each one |
| Too many features | Many irrelevant features create noise that gets memorized |
| Noise in labels | Mislabeled examples get memorized as "true" patterns |
| No regularization | Nothing prevents the model from fitting every data point |
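
The "noise in labels" row can be sketched as follows: flip a fraction of the training labels, fit an unconstrained tree, and compare accuracy on the noisy training set with accuracy on clean test labels. The helper `train_test_gap`, the noise rates, and the synthetic data are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_demo = rng.standard_normal((600, 10))
y_clean = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

def train_test_gap(noise_rate, seed=0):
    """Flip a fraction of training labels, fit an unconstrained tree,
    and return (accuracy on noisy train labels, accuracy on clean test labels)."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_clean, test_size=0.3, random_state=seed)
    flip_rng = np.random.default_rng(seed)
    flips = flip_rng.random(len(y_tr)) < noise_rate
    y_noisy = np.where(flips, 1 - y_tr, y_tr)  # mislabel the selected examples
    tree = DecisionTreeClassifier(max_depth=None, random_state=seed)
    tree.fit(X_tr, y_noisy)
    return tree.score(X_tr, y_noisy), tree.score(X_te, y_te)

results = {rate: train_test_gap(rate) for rate in [0.0, 0.1, 0.3]}
for rate, (train_acc, test_acc) in results.items():
    print(f"noise={rate:.1f}: train={train_acc:.3f}, test={test_acc:.3f}")
```

With continuous features and no depth limit, the tree can isolate every training point, so it reproduces even the flipped labels on the training set. The test accuracy shows what that memorization costs.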


Reducing Variance

1. Ensembles (Bagging)

Bagging (Bootstrap Aggregating) trains many models on bootstrap samples (random samples of the training set drawn with replacement) and averages their predictions. Individual models may have high variance, but the average is much more stable.

Python
from sklearn.ensemble import RandomForestClassifier

# Random Forest: bagged decision trees, with a random subset of features considered at each split

print("Random Forest (low variance via bagging):")
measure_model_variance(
    RandomForestClassifier,
    {"n_estimators": 100, "max_depth": 5, "random_state": 0},
    X, y
)
# Much lower std than a single deep tree

Why bagging works: individual trees overfit differently to their bootstrap samples, so their errors are only partly correlated and partially cancel when averaged.
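
How much the errors cancel depends on how correlated the models are. A standard result says that the average of B estimators, each with variance σ² and pairwise correlation ρ, has variance ρσ² + (1 − ρ)σ²/B. The simulation below is a sketch that checks this numerically; the Gaussian errors, the equal-correlation structure, and the function names are simplifying assumptions, not a model of real trees.

```python
import numpy as np

def variance_of_average(n_models, rho, sigma=1.0, n_draws=50_000, seed=0):
    """Empirical variance of the mean of n_models equally correlated errors."""
    rng = np.random.default_rng(seed)
    shared = rng.standard_normal(n_draws)               # part common to every model
    private = rng.standard_normal((n_draws, n_models))  # part unique to each model
    # Each column has variance sigma^2; any two columns have correlation rho
    errors = sigma * (np.sqrt(rho) * shared[:, None] + np.sqrt(1 - rho) * private)
    return float(errors.mean(axis=1).var())

def theoretical_variance(n_models, rho, sigma=1.0):
    """rho * sigma^2 + (1 - rho) * sigma^2 / B"""
    return rho * sigma**2 + (1 - rho) * sigma**2 / n_models

for rho in [0.0, 0.3, 0.9]:
    emp = variance_of_average(n_models=25, rho=rho)
    theo = theoretical_variance(n_models=25, rho=rho)
    print(f"rho={rho:.1f}: empirical={emp:.3f}, formula={theo:.3f}")
```

The first term does not shrink as more models are added, which is one way to see why random forests also randomize the features considered at each split: lowering the correlation between trees lowers the floor on the ensemble's variance.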


2. Regularization

Python
from sklearn.linear_model import RidgeClassifier

# More regularization → lower variance, higher bias
# RidgeClassifier is used because y is a class label; plain Ridge would be scored with R², not accuracy
for alpha in [0.001, 0.1, 1.0, 10.0, 100.0]:
    ridge_scores = cross_val_score(RidgeClassifier(alpha=alpha), X, y, cv=10)
    print(f"Ridge alpha={alpha:7.3f}: mean={ridge_scores.mean():.3f}, std={ridge_scores.std():.3f}")
# As alpha increases: std tends to decrease (lower variance), but mean may also decrease (higher bias)

3. More Training Data

Python
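def make_data(n, seed=42):  # same generating process as the data above, with n samples
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, 10))
    y = (X[:, 0] + X[:, 1] + rng.standard_normal(n) * 0.3 > 0).astype(int)
    return X, y

# With more data, the specific training subset matters less
for n in [100, 500, 2000, 10000]:
    X_big, y_big = make_data(n)
    scores = cross_val_score(DecisionTreeClassifier(max_depth=None), X_big, y_big, cv=5)
    print(f"n={n:5d}: std={scores.std():.4f}")
# As n increases, variance tends to decrease → even for an unconstrained tree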

4. Dropout (Neural Networks)

Dropout can be viewed as a form of ensemble learning: each training step randomly disables a subset of neurons, so it updates a different sub-network. At inference, the full network approximates an average over the exponentially many sub-networks that share its weights.
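
A minimal NumPy sketch of the mechanism, assuming the common "inverted dropout" formulation in which kept activations are scaled by 1 / keep probability during training. The function name and signature are illustrative and do not come from any particular deep learning library.

```python
import numpy as np

def dropout(activations, drop_prob, rng, training=True):
    """Inverted dropout: zero each unit with probability drop_prob and
    rescale the survivors so the expected activation is unchanged."""
    if not training or drop_prob == 0.0:
        return activations  # inference uses the full network
    keep_prob = 1.0 - drop_prob
    mask = rng.random(activations.shape) < keep_prob  # a new sub-network per call
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
hidden = np.ones((2, 6))

train_out = dropout(hidden, drop_prob=0.5, rng=rng)
eval_out = dropout(hidden, drop_prob=0.5, rng=rng, training=False)

# Averaging many randomly masked passes approaches the unmasked activations
mc_average = np.mean([dropout(hidden, 0.5, rng) for _ in range(5000)], axis=0)

print("one training pass:\n", train_out)
print("inference pass:\n", eval_out)
print("average of 5000 masked passes:\n", mc_average.round(2))
```

Each call with `training=True` samples a different mask, which is the "different sub-network" in the ensemble view. The rescaling is what lets the unmasked network stand in for the average at inference time.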


Variance in Cross-Validation

The standard deviation of cross-validation scores gives a quick, rough estimate of variance:

Python
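model = DecisionTreeClassifier(max_depth=None)  # any scikit-learn estimator works here
scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")

print(f"Mean accuracy: {scores.mean():.3f}")
print(f"Std deviation: {scores.std():.3f}")   # Rough proxy for variance

# Rule of thumb (thresholds shift with fold size; small folds are noisy on their own):
# std > 0.05: high variance → model is unstable
# std < 0.02: low variance → stable across folds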
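
Fold-to-fold spread also reflects how few test points each fold contains, so it helps to compare the observed std against what fold size alone would produce. The sketch below repeats stratified 10-fold cross-validation and computes a binomial "noise floor", sqrt(p(1 − p) / fold size), which assumes independent test points and a fixed true accuracy p. The variable names and synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic data mirroring the lesson's setup (assumed for this sketch)
rng = np.random.default_rng(42)
X_cv = rng.standard_normal((300, 10))
y_cv = (X_cv[:, 0] + X_cv[:, 1] + rng.standard_normal(300) * 0.3 > 0).astype(int)

n_splits, n_repeats = 10, 5
cv = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=0)
scores = cross_val_score(LogisticRegression(), X_cv, y_cv, cv=cv, scoring="accuracy")

# Spread expected from fold size alone if accuracy were a fixed p
p = scores.mean()
fold_size = len(y_cv) // n_splits
noise_floor = np.sqrt(p * (1 - p) / fold_size)

print(f"Folds evaluated:       {len(scores)}")
print(f"Mean accuracy:         {p:.3f}")
print(f"Std across folds:      {scores.std():.3f}")
print(f"Binomial noise floor:  {noise_floor:.3f}")
```

If the observed std is close to the noise floor, most of the spread may come from small test folds rather than from an unstable model, so fixed thresholds are best read relative to fold size.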

Interview Answer Template

Q: What is variance in machine learning?

Variance measures how much the model's predictions change depending on the specific training data it was trained on. A high-variance model (like an unconstrained decision tree) memorizes the training set — train it on a slightly different subset and you get a completely different model. This shows up as high standard deviation in cross-validation scores. High variance is essentially overfitting: the model is sensitive to noise in the training data rather than learning the underlying pattern. The fixes include regularization (penalizing complexity), ensembles (bagging averages out variance across many models), collecting more data (larger datasets make any single example less influential), and constraining model capacity. The bias-variance tradeoff is the core tension: reducing variance typically increases bias, and vice versa.