Machine Learning Foundations · Lesson 59 of 70

L1 vs L2: When to Use Each

Side-by-Side Comparison

                      L1 (Lasso)                          L2 (Ridge)
Penalty:              λ·Σ|wᵢ|                             λ·Σ(wᵢ²)
Gradient:             λ·sign(wᵢ) (constant magnitude)     2λwᵢ (proportional to wᵢ)
Effect on weights:    Some → exactly 0                    All shrink, none → exactly 0
Sparsity:             Yes (feature selection)             No
Correlated features:  Tends to keep one, zero the rest    Distributes weight across them
Geometry:             Diamond-shaped constraint           Sphere-shaped constraint
Interpretability:     Fewer features to explain           Coefficients spread over all features
sklearn notation:     C = 1/λ, solver=liblinear or saga   C = 1/λ (l2 is the default penalty)
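
The "Gradient" row can be checked numerically. The sketch below is illustrative only: the function names, the weight vector, and the value of `lam` are assumptions made for this example, not part of the lesson's model. It shows that the L1 pull has the same size for small and large weights, while the L2 pull fades as a weight approaches zero.

```python
import numpy as np

def l1_penalty_grad(w, lam):
    """Subgradient of lam * sum(|w_i|): lam * sign(w_i), taken as 0 at w_i = 0."""
    return lam * np.sign(w)

def l2_penalty_grad(w, lam):
    """Gradient of lam * sum(w_i ** 2): 2 * lam * w_i."""
    return 2.0 * lam * w

w = np.array([-2.0, -0.01, 0.0, 0.01, 2.0])  # illustrative weights (assumed)
lam = 0.1                                     # illustrative penalty strength (assumed)

g1 = l1_penalty_grad(w, lam)
g2 = l2_penalty_grad(w, lam)

print("L1 pull:", g1)  # same magnitude for every non-zero weight
print("L2 pull:", g2)  # shrinks along with the weight itself
```

Because the L1 pull does not fade near zero, a small weight can be pushed all the way to zero, whereas the L2 pull becomes negligible before the weight gets there.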

Code Comparison

Python

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
import numpy as np

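# Drug readmission model with 30 features.
# X_train / y_train are expected to come from your own data. The synthetic
# stand-in below is an assumption (not real readmission data) so the script runs.
from sklearn.datasets import make_classification

X_train, y_train = make_classification(
    n_samples=1000, n_features=30, n_informative=8, random_state=42
)
feature_names = [f"feature_{i}" for i in range(30)]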

models = {
    "L2 (Ridge) — default":  LogisticRegression(penalty="l2", C=1.0, max_iter=1000),
    "L1 (Lasso), strong":    LogisticRegression(penalty="l1", C=0.1, solver="liblinear", max_iter=1000),
    "L1 (Lasso), weak":      LogisticRegression(penalty="l1", C=10.0, solver="liblinear", max_iter=1000),
    "Elastic Net (L1+L2)":   LogisticRegression(penalty="elasticnet", C=1.0, l1_ratio=0.5, solver="saga", max_iter=1000),
    "No regularization":     LogisticRegression(penalty=None, max_iter=1000),
}

print(f"{'Model':<30}  {'CV AUC':>8}  {'Std':>6}  {'Non-zero features':>18}")
print("-" * 68)

for name, model in models.items():
    pipeline = Pipeline([("scaler", StandardScaler()), ("model", model)])
    scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring="roc_auc")
    pipeline.fit(X_train, y_train)
    # coef_ always exists after fit; count the weights that are not exactly zero
    n_nonzero = int((pipeline.named_steps["model"].coef_[0] != 0).sum())
    print(f"{name:<30}  {scores.mean():>8.3f}  {scores.std():>6.3f}  {str(n_nonzero):>18}")
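
Counting non-zero coefficients says how many features survive, but not which ones. A minimal sketch of listing them follows, assuming synthetic data from `make_classification` and hypothetical names (`X_demo`, `y_demo`, `demo_names`, `surviving_features`) in place of the real readmission features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption): 30 features, 5 of them informative.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=30, n_informative=5, n_redundant=0,
    random_state=0,
)
demo_names = [f"feature_{i}" for i in range(30)]

def surviving_features(C):
    """Fit a scaled L1 logistic regression and return names with non-zero weight."""
    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("model", LogisticRegression(penalty="l1", C=C, solver="liblinear", max_iter=1000)),
    ])
    pipe.fit(X_demo, y_demo)
    coef = pipe.named_steps["model"].coef_[0]
    return [name for name, c in zip(demo_names, coef) if c != 0]

kept_strong = surviving_features(C=0.01)  # small C = strong penalty
kept_weak = surviving_features(C=10.0)    # large C = weak penalty

print(f"Strong L1 keeps {len(kept_strong)} features: {kept_strong}")
print(f"Weak L1 keeps {len(kept_weak)} features")
```

The exact features kept depend on the data and on `C`, so treat the printed list as something to inspect rather than a fixed result.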

Geometric Interpretation

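L2 constraint: ||w||₂ ≤ t  (sphere)
  The feasible region is a ball centered at the origin.
  When the unconstrained optimum lies outside the ball, the constrained
  optimum is the point where the loss contours first touch its surface.
  That surface is smooth, with no corners, so the contact point almost never lies exactly on an axis.
  → Weights shrink but are rarely exactly zero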

L1 constraint: ||w||₁ ≤ t  (diamond / cross-polytope)
  The feasible region is a diamond whose sharp corners lie on the coordinate axes.
  The loss contours often touch the diamond first at a corner
  (or, in higher dimensions, along a low-dimensional edge).
  → Corners and edges correspond to sparse solutions (wᵢ = 0 for some features)

This is why L1 produces sparse solutions and L2 doesn't:
it's geometry, not magic.
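
The same contrast shows up in the one-dimensional closed forms. The sketch below assumes a simplified setting that is not in the lesson: a single weight whose unregularized estimate is `z`, with the loss written as `0.5 * (w - z)**2`. Under that assumption the L1 solution is soft-thresholding and the L2 solution is a proportional rescaling; the function names and the values of `z` and `lam` are illustrative.

```python
import numpy as np

def lasso_1d(z, lam):
    """Minimizer of 0.5 * (w - z)**2 + lam * |w| (soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def ridge_1d(z, lam):
    """Minimizer of 0.5 * (w - z)**2 + lam * w**2 (proportional shrinkage)."""
    return z / (1.0 + 2.0 * lam)

z = np.array([-3.0, -0.4, 0.2, 1.5])  # unregularized estimates (assumed values)
lam = 0.5                             # penalty strength (assumed value)

w_l1 = lasso_1d(z, lam)
w_l2 = ridge_1d(z, lam)

print("L1:", w_l1)  # estimates with |z| <= lam become exactly 0
print("L2:", w_l2)  # every estimate is halved, none becomes 0
```

Any estimate smaller in magnitude than `lam` is set to exactly zero by the L1 rule, while the L2 rule only rescales, which matches the corner-versus-smooth-surface picture above.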

Correlated Features: L1 vs L2

Python

import numpy as np
from sklearn.linear_model import Lasso, Ridge

np.random.seed(42)
n = 300

# Two correlated features: medication count, polypharmacy index
med_count = np.random.normal(8, 3, n)
polypharm  = med_count * 0.8 + np.random.normal(0, 1, n)  # highly correlated

# Third independent feature: age
age = np.random.normal(60, 12, n)

X_corr = np.column_stack([med_count, polypharm, age])
y = 0.5 * med_count + 0.5 * polypharm + 0.3 * age + np.random.normal(0, 2, n)

lasso = Lasso(alpha=0.5)
ridge = Ridge(alpha=5.0)

lasso.fit(X_corr, y)
ridge.fit(X_corr, y)

print("With correlated features (med_count and polypharm):")
print(f"  Lasso: med_count={lasso.coef_[0]:.3f}, polypharm={lasso.coef_[1]:.3f}, age={lasso.coef_[2]:.3f}")
print(f"  Ridge: med_count={ridge.coef_[0]:.3f}, polypharm={ridge.coef_[1]:.3f}, age={ridge.coef_[2]:.3f}")

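# The interpretation is not hard-coded as a result: whether polypharm reaches
# exactly 0 depends on alpha, feature scaling, and the sample. Compare the rows above.
print("\nTrue coefficients: med_count=0.5, polypharm=0.5, age=0.3")
print("Lasso: tends to concentrate weight on one of the correlated pair (the other hits exactly 0 once alpha is large enough)")
print("Ridge: tends to spread weight across both correlated features")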
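
How Lasso treats the correlated pair depends heavily on `alpha`, so it helps to sweep it. The sketch below regenerates data with the same recipe as above, though the random generator, seed, and `alphas` list are assumptions chosen for illustration, and it prints the coefficients at each strength.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)  # seed chosen for illustration
n = 300

# Same data recipe as the lesson: two correlated features plus independent age.
med_count = rng.normal(8, 3, n)
polypharm = 0.8 * med_count + rng.normal(0, 1, n)
age = rng.normal(60, 12, n)

X = np.column_stack([med_count, polypharm, age])
y = 0.5 * med_count + 0.5 * polypharm + 0.3 * age + rng.normal(0, 2, n)

alphas = [0.01, 0.5, 2.0, 5.0, 100.0]  # illustrative grid (assumed)
nonzero_counts = {}

for alpha in alphas:
    coef = Lasso(alpha=alpha, max_iter=10000).fit(X, y).coef_
    nonzero_counts[alpha] = int(np.count_nonzero(coef))
    print(f"alpha={alpha:>6}: med_count={coef[0]:.3f}, "
          f"polypharm={coef[1]:.3f}, age={coef[2]:.3f}, "
          f"non-zero={nonzero_counts[alpha]}")
```

At the weakest setting all three coefficients stay non-zero, and at the strongest every coefficient is zeroed; the interesting behaviour for the correlated pair sits in between. The features here are unscaled, so the one with the larger variance tends to be kept longer.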

Elastic Net: Best of Both

Python

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

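# Elastic Net = L1 + L2 (conceptual form; scikit-learn scales the L2 term by 1/2)
# Loss + α × [l1_ratio × Σ|wᵢ| + (1-l1_ratio) × Σ(wᵢ²)]
# In LogisticRegression the strength is set through C = 1/α (smaller C = stronger penalty)
# l1_ratio = 1: pure L1
# l1_ratio = 0: pure L2
# l1_ratio = 0.5: balanced blend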

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(
        penalty="elasticnet",
        solver="saga",
        max_iter=2000,
    )),
])

param_grid = {
    "model__C":        [0.01, 0.1, 1.0, 10.0],
    "model__l1_ratio": [0.1, 0.3, 0.5, 0.7, 0.9],
}

from sklearn.model_selection import StratifiedKFold  # GridSearchCV is already imported above
search = GridSearchCV(
    pipeline, param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="roc_auc", n_jobs=-1
)
search.fit(X_train, y_train)

print(f"Best params: {search.best_params_}")
print(f"Best CV AUC: {search.best_score_:.3f}")
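
To make the role of `l1_ratio` concrete, the penalty term can be evaluated by hand. This is a minimal sketch of the conceptual formula used in this lesson; scikit-learn's own objective scales the L2 term by 1/2, so the numbers will not match its internals exactly. The function name, weight vector, and `alpha` are assumptions for illustration.

```python
import numpy as np

def elastic_net_penalty(w, alpha, l1_ratio):
    """alpha * [l1_ratio * sum(|w_i|) + (1 - l1_ratio) * sum(w_i ** 2)]."""
    l1_term = np.sum(np.abs(w))
    l2_term = np.sum(w ** 2)
    return alpha * (l1_ratio * l1_term + (1.0 - l1_ratio) * l2_term)

w = np.array([0.5, -2.0, 0.0, 1.0])  # illustrative weights (assumed)
alpha = 0.1                          # illustrative strength (assumed)

pure_l1 = elastic_net_penalty(w, alpha, l1_ratio=1.0)  # only the absolute-value term
pure_l2 = elastic_net_penalty(w, alpha, l1_ratio=0.0)  # only the squared term
blend = elastic_net_penalty(w, alpha, l1_ratio=0.5)    # halfway between the two

print(f"pure L1: {pure_l1:.4f}, pure L2: {pure_l2:.4f}, blend: {blend:.4f}")
```

The blended value is the average of the two pure penalties, which is all `l1_ratio = 0.5` means in this form.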

Practical Decision Guide

Python

def choose_regularization(n_features: int, n_samples: int, n_correlated_groups: int,
                          need_feature_selection: bool, sparse_signal: bool) -> str:
    """
    Rule-based selection guide.
    """
    ratio = n_features / n_samples

    if not need_feature_selection and n_correlated_groups > 0:
        return "L2 (Ridge) — correlated features; L2 distributes weight stably"

    if sparse_signal and need_feature_selection:
        if n_correlated_groups > 0:
            return "Elastic Net — sparse signal with correlated features"
        return "L1 (Lasso) — sparse signal, want automatic feature selection"

    if ratio > 0.5:
        return "L2 (Ridge) or Elastic Net — high feature-to-sample ratio needs regularization"

    return "L2 (Ridge, default) — safe default for most cases"

# Examples:
print(choose_regularization(80, 200, 5, need_feature_selection=False, sparse_signal=False))
# → L2 (Ridge): correlated features, no feature selection needed

print(choose_regularization(200, 500, 0, need_feature_selection=True, sparse_signal=True))
# → L1: sparse signal, need feature selection

print(choose_regularization(100, 400, 10, need_feature_selection=True, sparse_signal=True))
# → Elastic Net: sparse signal + correlated features

Quick Reference Table

| Scenario | L1 | L2 | Elastic Net |
|---|---|---|---|
| Many irrelevant features | ✓ Best | OK | ✓ Good |
| Correlated features | Unstable | ✓ Best | ✓ Good |
| Want automatic feature selection | ✓ Yes | No | Partial |
| All features expected useful | Suboptimal | ✓ Best | OK |
| p >> n (more features than samples) | OK | ✓ Best | ✓ Good |
| Neural networks | Dropout instead | Weight decay (✓) | N/A |

Interview Answer Template

Q: When do you use L1 vs L2 regularization?

L1 (Lasso) adds the sum of absolute weights to the loss and drives some weights to exactly zero, so it performs automatic feature selection. L2 (Ridge) adds the sum of squared weights and shrinks all weights toward zero, in proportion to their size, without zeroing any. The geometric reason: L1's diamond-shaped constraint has sharp corners on the axes, where some weights are exactly zero, and the constrained optimum often lands there; L2's spherical constraint is smooth, so the optimum almost never lands exactly on an axis. Use L1 when you have many features and expect only a few to be predictive; it will zero out the rest. Use L2 when most features carry signal, or when features are correlated: L2 distributes weight stably across correlated predictors, while L1 tends to keep one and drop the others somewhat arbitrarily. If you have both concerns (sparse signal and correlated features), Elastic Net combines both penalties with a ratio parameter. In practice, L2 is the safe default for most clinical tabular models; switch to L1 or Elastic Net when you need interpretability through sparsity.
