
What is Regularization?

Regularization fundamentals: why models overfit, what regularization adds to the loss function, how it constrains model complexity, and the intuition behind the bias-variance tradeoff it controls.

Asma Hafeez Khan · May 16, 2026 · 5 min read

The Problem Regularization Solves

Without constraints, a sufficiently flexible model can drive training loss down by memorizing the training data: every noisy point and every quirk gets encoded into its parameters.

Python
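from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative synthetic data (an assumption; the article does not specify a dataset).
# flip_y adds label noise, so there is noise available to memorize.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# No regularization: an unlimited-depth tree can memorize the training data
unconstrained = DecisionTreeClassifier(max_depth=None, random_state=42)
unconstrained.fit(X_train, y_train)

train_score = unconstrained.score(X_train, y_train)
cv_scores   = cross_val_score(unconstrained, X_train, y_train, cv=5)

print(f"Train accuracy:   {train_score:.3f}")    # typically 1.000: perfect memorization
print(f"CV accuracy:      {cv_scores.mean():.3f}")   # typically much lower: does not generalize
print(f"Overfit gap:      {train_score - cv_scores.mean():.3f}")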
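For a decision tree, the simplest constraint is a depth limit. The sketch below compares an unlimited tree with a shallow one on synthetic, deliberately noisy data. The dataset, the `overfit_gap` helper, and the choice of `max_depth=3` are assumptions made for illustration, not values from the article.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% label noise (assumption: not the article's dataset)
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)

def overfit_gap(model):
    """Return (train accuracy, mean 5-fold CV accuracy, gap) for a model."""
    train_acc = model.fit(X, y).score(X, y)
    cv_acc = cross_val_score(model, X, y, cv=5).mean()
    return train_acc, cv_acc, train_acc - cv_acc

deep = overfit_gap(DecisionTreeClassifier(max_depth=None, random_state=0))
shallow = overfit_gap(DecisionTreeClassifier(max_depth=3, random_state=0))

for name, (train_acc, cv_acc, gap) in [("unlimited depth", deep), ("max_depth=3", shallow)]:
    print(f"{name:16}: train={train_acc:.3f}, CV={cv_acc:.3f}, gap={gap:.3f}")
```

The shallow tree gives up some training accuracy, and in exchange its training and cross-validation scores sit much closer together.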

What Regularization Does

Regularization adds a penalty term to the loss function that discourages large weights. The model must now minimize both the training loss and the weight penalty simultaneously.

Standard loss:
  Minimize: L(y, ŷ)
  → Model will fit training data perfectly if it can

Regularized loss:
  Minimize: L(y, ŷ) + λ × Penalty(weights)
  → Model must trade off: reduce training loss vs keep weights small
  → Large weights that only explain noise become too expensive to keep

λ (lambda/alpha) = regularization strength
  λ = 0:      no regularization (standard loss)
  λ = large:  heavy regularization (weights forced toward zero)
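To make the tradeoff concrete, here is a minimal NumPy sketch for one specific case: squared-error loss with an L2 penalty, which has a closed-form solution. The synthetic data and the helper name `ridge_fit` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
true_w = np.array([3.0, -2.0] + [0.0] * 8)   # hypothetical ground truth
y = X @ true_w + rng.normal(scale=1.0, size=50)

def ridge_fit(X, y, lam):
    """Minimize ||y - Xw||^2 + lam * ||w||^2 in closed form (no intercept)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

lams = [0.0, 1.0, 10.0, 100.0, 1000.0]
norms, data_losses = [], []
for lam in lams:
    w = ridge_fit(X, y, lam)
    data_loss = np.sum((y - X @ w) ** 2)     # the L(y, ŷ) term
    penalty = lam * np.sum(w ** 2)           # the λ × Penalty(weights) term
    norms.append(np.linalg.norm(w))
    data_losses.append(data_loss)
    print(f"λ={lam:7.1f}: data loss={data_loss:8.2f}, penalty={penalty:8.2f}, ||w||={norms[-1]:.3f}")
```

As λ grows, the weight norm shrinks and the training loss rises: the optimizer accepts a worse fit in return for smaller weights.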

The Weight Penalty in Practice

Python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

# Standardize features so the penalty treats every weight on the same scale.
# X_train, y_train are the training split used throughout the article.
X_train_scaled = StandardScaler().fit_transform(X_train)

# LogisticRegression uses C = 1/λ (inverse of regularization strength)
# Small C = strong regularization, Large C = weak regularization

results = []
for C in [0.001, 0.01, 0.1, 1, 10, 100, 1000]:
    model = LogisticRegression(C=C, max_iter=1000, random_state=42)
    model.fit(X_train_scaled, y_train)

    train_auc = roc_auc_score(y_train, model.predict_proba(X_train_scaled)[:, 1])
    cv_auc    = cross_val_score(model, X_train_scaled, y_train, cv=5, scoring="roc_auc").mean()
    weight_norm = np.linalg.norm(model.coef_)   # L2 norm of the weight vector

    results.append((C, train_auc, cv_auc, weight_norm))
    print(f"C={C:6}: train_AUC={train_auc:.3f}, CV_AUC={cv_auc:.3f}, ||w||={weight_norm:.2f}")

# As C increases (less regularization), you will typically see:
# → Train AUC increases (fits training data better)
# → Weight norm increases (larger weights)
# → CV AUC peaks, then flattens or decreases (overfitting begins)

Regularization and Bias-Variance

More regularization (larger λ):
  → Weights are smaller, so the fitted function is simpler
  → Model is less flexible → higher bias
  → Model is less sensitive to training data → lower variance
  → Result: underfitting if λ is too large

Less regularization (smaller λ):
  → Weights can be large, so the fitted function can be complex
  → Model is more flexible → lower bias
  → Model is more sensitive to the training data → higher variance
  → Result: overfitting if λ is too small

Optimal λ:
  → Minimizes expected test error = bias² + variance + irreducible noise
  → Found by cross-validation
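The tradeoff above can be estimated empirically when the true function is known. The sketch below is a simulation under stated assumptions: a linear ground truth, Gaussian noise, and scikit-learn's `Ridge` (whose `alpha` plays the role of λ). The `bias_variance` helper and all constants are hypothetical choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_train, n_features, noise_sd = 30, 15, 2.0
true_w = rng.normal(size=n_features)          # hypothetical ground-truth weights
X_test = rng.normal(size=(200, n_features))   # fixed evaluation points
f_test = X_test @ true_w                      # noise-free targets at those points

def bias_variance(alpha, n_repeats=200):
    """Estimate bias^2 and variance of Ridge(alpha) predictions by resampling."""
    preds = np.empty((n_repeats, len(X_test)))
    for i in range(n_repeats):
        # Draw a fresh training set from the same data-generating process
        X_tr = rng.normal(size=(n_train, n_features))
        y_tr = X_tr @ true_w + rng.normal(scale=noise_sd, size=n_train)
        model = Ridge(alpha=alpha, fit_intercept=False).fit(X_tr, y_tr)
        preds[i] = model.predict(X_test)
    bias_sq = np.mean((preds.mean(axis=0) - f_test) ** 2)   # average prediction vs truth
    variance = np.mean(preds.var(axis=0))                   # spread across training sets
    return bias_sq, variance

results = {alpha: bias_variance(alpha) for alpha in [0.01, 1.0, 10.0, 100.0, 1000.0]}
for alpha, (bias_sq, variance) in results.items():
    print(f"alpha={alpha:7.2f}: bias²={bias_sq:7.3f}, variance={variance:7.3f}, sum={bias_sq + variance:7.3f}")
```

In practice the true function is unknown, which is why the sum is estimated indirectly through cross-validation error rather than computed like this.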

Common Types of Regularization

Python
from sklearn.linear_model import LogisticRegression
import torch.nn as nn

# L2 regularization (as in Ridge): add the sum of squared weights
# Loss = L(y, ŷ) + λ × Σ(wᵢ²)
# → Shrinks all weights toward zero, but none become exactly zero
lr_l2 = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)

# L1 regularization (as in Lasso): add the sum of absolute weights
# Loss = L(y, ŷ) + λ × Σ|wᵢ|
# → Forces some weights to exactly zero → automatic feature selection
lr_l1 = LogisticRegression(penalty="l1", C=1.0, solver="liblinear", max_iter=1000)

# Elastic Net: L1 + L2 combination (l1_ratio=0.5 weights the two equally)
lr_elastic = LogisticRegression(penalty="elasticnet", C=1.0, l1_ratio=0.5,
                                solver="saga", max_iter=1000)

# Dropout (neural networks): randomly zero out activations during training
class RegularizedNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(20, 64),
            nn.ReLU(),
            nn.Dropout(0.3),     # 30% dropout = regularization
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(32, 1),
        )

    def forward(self, x):
        # Dropout is active only in train() mode; eval() disables it
        return self.layers(x)
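The difference between L1 and L2 is easiest to see by counting exact zeros in the fitted weights. This sketch uses scikit-learn's `Lasso` and `Ridge` regressors on synthetic data where only 5 of 50 features carry signal. The dataset and the `alpha` values are assumptions for illustration, and the two `alpha` values are not directly comparable because the two estimators scale their loss terms differently.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem: 50 features, only 5 informative (assumption)
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=10.0, random_state=0)
X = StandardScaler().fit_transform(X)   # put all features on the same scale

lasso = Lasso(alpha=2.0).fit(X, y)      # L1 penalty
ridge = Ridge(alpha=2.0).fit(X, y)      # L2 penalty

n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))

print(f"Lasso: {n_zero_lasso} of {X.shape[1]} weights are exactly zero")
print(f"Ridge: {n_zero_ridge} of {X.shape[1]} weights are exactly zero")
```

Ridge keeps every feature with a small nonzero weight, while Lasso removes features outright, which is why L1 doubles as a feature selector.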

Regularization Strength Selection

Python
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Always search for regularization strength with cross-validation
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

param_grid = {"model__C": [0.001, 0.01, 0.1, 1, 10, 100]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="roc_auc", n_jobs=-1)
search.fit(X_train, y_train)

print(f"Best C: {search.best_params_['model__C']}")
print(f"Best CV AUC: {search.best_score_:.3f}")

# Print all results
for params, mean_score, std_score in zip(
    search.cv_results_["params"],
    search.cv_results_["mean_test_score"],
    search.cv_results_["std_test_score"],
):
    print(f"  C={params['model__C']:6}: {mean_score:.3f} ± {std_score:.3f}")

When Regularization Helps Most

High feature-to-sample ratio (many features, few samples):
  → Without regularization: model uses irrelevant features, overfits
  → With regularization: irrelevant weights shrink toward zero

Correlated features:
  → Without L2: weights on correlated features are unstable and can be split arbitrarily
  → With L2: weights are spread across correlated features

Sparse signal (few truly predictive features out of many):
  → With L1: automatically zeroes out non-predictive features

Neural networks with large parameter counts:
  → Without dropout: neurons co-adapt and memorize training data
  → With dropout: neurons are pushed to learn more robust, less co-dependent features
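The correlated-features case can be sketched with two nearly identical columns. The data here is synthetic, and the true relationship (`y ≈ 2 × x1`) and the `alpha=1.0` setting are assumptions for illustration. Ordinary least squares has almost no information to decide how to split the weight between the two columns, while the L2 penalty prefers the even split because it has the smallest squared norm.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)          # near-duplicate of x1
X = np.column_stack([x1, x2])
y = 2.0 * x1 + rng.normal(scale=0.5, size=n)  # hypothetical ground truth

ols = LinearRegression().fit(X, y)            # no penalty
ridge = Ridge(alpha=1.0).fit(X, y)            # L2 penalty

print(f"OLS   weights: {ols.coef_.round(2)}  (sum = {ols.coef_.sum():.2f})")
print(f"Ridge weights: {ridge.coef_.round(2)}  (sum = {ridge.coef_.sum():.2f})")
```

Both models recover a combined weight near 2, but only the Ridge weights are individually stable enough to interpret.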

Interview Answer Template

Q: What is regularization and why do we use it?

Regularization adds a penalty term to the loss function that discourages large model weights. Without it, a flexible model can minimize training loss by memorizing noise in the training data, which is overfitting. The regularized objective forces the model to reduce training loss and keep weights small at the same time, which pushes it toward simpler solutions that generalize better. The penalty strength λ (or C = 1/λ in sklearn) controls the tradeoff: larger λ means more regularization, simpler models, higher bias, and lower variance; smaller λ means less regularization, more complex models, lower bias, and higher variance. The right λ is found by cross-validation. The two main forms are L2 (squared weights: shrinks all weights, keeps all features) and L1 (absolute weights: drives some weights to exactly zero, performing feature selection). For neural networks, dropout serves a similar purpose by randomly disabling neurons during training.
