Machine Learning Foundations · Lesson 59 of 70
L1 vs L2: When to Use Each
Side-by-Side Comparison
| Aspect | L1 (Lasso) | L2 (Ridge) |
|---|---|---|
| Penalty | Σ\|wᵢ\| | Σ(wᵢ²) |
| Gradient | ±λ (constant) | 2λw (proportional) |
| Effect on weights | Some → exactly 0 | All shrink, none → 0 |
| Sparsity | Yes (feature selection) | No |
| Correlated features | Picks one, zeros others | Distributes weight |
| Geometry | Diamond-shaped constraint | Sphere-shaped constraint |
| Interpretability | Fewer features to explain | Coefficients spread over all features |
| sklearn notation | C = 1/λ, solver=liblinear | C = 1/λ (default penalty) |

Code Comparison
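As a warm-up, the "Gradient" and "Effect on weights" rows of the table can be checked in one dimension. The sketch below assumes the simplest possible loss, 0.5·(w − z)², where z stands for the unregularized weight; the helper names `l1_shrink` and `l2_shrink` are illustrative and not part of any library.

```python
import numpy as np

def l1_shrink(z, lam):
    """Minimizer of 0.5*(w - z)**2 + lam*|w| (soft-thresholding)."""
    # The L1 pull has constant size lam, so any |z| <= lam is set to exactly 0.
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def l2_shrink(z, lam):
    """Minimizer of 0.5*(w - z)**2 + lam*w**2."""
    # The L2 pull is 2*lam*w, proportional to w, so w is rescaled but never zeroed.
    return z / (1.0 + 2.0 * lam)

z = np.array([-2.0, -0.3, 0.1, 0.4, 3.0])  # hypothetical unregularized weights
lam = 0.5

w_l1 = l1_shrink(z, lam)
w_l2 = l2_shrink(z, lam)

print("unregularized:", z)
print("L1 shrinkage: ", w_l1)
print("L2 shrinkage: ", w_l2)
```

With these inputs the three small weights fall below λ and are zeroed by the L1 rule, while the L2 rule halves every weight and leaves all of them non-zero.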
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
import numpy as np

# Drug readmission model with 30 features.
# X_train and y_train are assumed to be defined already (30 columns, binary target).
feature_names = [f"feature_{i}" for i in range(30)]

models = {
    "L2 (Ridge), default": LogisticRegression(penalty="l2", C=1.0, max_iter=1000),
    "L1 (Lasso), strong": LogisticRegression(penalty="l1", C=0.1, solver="liblinear", max_iter=1000),
    "L1 (Lasso), weak": LogisticRegression(penalty="l1", C=10.0, solver="liblinear", max_iter=1000),
    "Elastic Net (L1+L2)": LogisticRegression(penalty="elasticnet", C=1.0, l1_ratio=0.5, solver="saga", max_iter=1000),
    # penalty=None requires scikit-learn >= 1.2 (older versions use penalty="none")
    "No regularization": LogisticRegression(penalty=None, max_iter=1000),
}

print(f"{'Model':<30} {'CV AUC':>8} {'Std':>6} {'Non-zero features':>18}")
print("-" * 68)

for name, model in models.items():
    # Scale inside the pipeline so each CV fold is standardized on its own training split
    pipeline = Pipeline([("scaler", StandardScaler()), ("model", model)])
    scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring="roc_auc")

    # Refit on the full training set to count the coefficients that survive
    pipeline.fit(X_train, y_train)
    n_nonzero = int(np.count_nonzero(pipeline.named_steps["model"].coef_[0]))
    print(f"{name:<30} {scores.mean():>8.3f} {scores.std():>6.3f} {n_nonzero:>18}")

Geometric Interpretation
L2 constraint: ||w||₂ ≤ t (ball bounded by a sphere)
The feasible region is a ball centered at the origin.
When the unconstrained optimum lies outside the ball, the constrained optimum is the point where the loss contours first touch the sphere.
The sphere is smooth, so that contact point almost never lies exactly on an axis.
→ Weights are rarely exactly zero
L1 constraint: ||w||₁ ≤ t (diamond / cross-polytope)
The feasible region is a diamond whose sharp corners lie on the coordinate axes.
The constrained optimum often lands at a corner (or, in higher dimensions, on an edge or face where some coordinates are zero).
→ Corners correspond to sparse solutions (wᵢ = 0 for some features)
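The two constraint shapes can be compared numerically. The sketch below assumes the simplest loss, squared distance to an unconstrained optimum `v`, so the constrained optimum is just the Euclidean projection of `v` onto each region. The vector `v`, the budget `t`, and the function names are hypothetical choices for illustration; the L1 projection uses the standard sort-and-threshold method.

```python
import numpy as np

def project_l2_ball(v, t):
    """Closest point to v with ||w||_2 <= t: rescale v if it lies outside."""
    norm = np.linalg.norm(v)
    return v.copy() if norm <= t else v * (t / norm)

def project_l1_ball(v, t):
    """Closest point to v with ||w||_1 <= t (sort-and-threshold projection)."""
    if np.abs(v).sum() <= t:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]          # magnitudes, largest first
    cssv = np.cumsum(u)
    k = np.arange(1, len(u) + 1)
    rho = np.nonzero(u * k > cssv - t)[0][-1]
    theta = (cssv[rho] - t) / (rho + 1.0)  # shift that brings the L1 norm down to t
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

v = np.array([3.0, 0.5, -0.2])  # hypothetical unconstrained optimum
t = 1.0                          # hypothetical constraint budget

w_l2 = project_l2_ball(v, t)
w_l1 = project_l1_ball(v, t)

print("L2 projection:", w_l2, "non-zeros:", np.count_nonzero(w_l2))
print("L1 projection:", w_l1, "non-zeros:", np.count_nonzero(w_l1))
```

For this `v`, the L2 projection only rescales the vector and keeps all three coordinates, while the L1 projection lands on a corner of the diamond and keeps only the first coordinate.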
This is why L1 produces sparse solutions and L2 doesn't: it's geometry, not magic.

Correlated Features: L1 vs L2
import numpy as np
from sklearn.linear_model import Lasso, Ridge

np.random.seed(42)
n = 300

# Two correlated features: medication count, polypharmacy index
med_count = np.random.normal(8, 3, n)
polypharm = med_count * 0.8 + np.random.normal(0, 1, n)  # highly correlated with med_count

# Third independent feature: age
age = np.random.normal(60, 12, n)

X_corr = np.column_stack([med_count, polypharm, age])
y = 0.5 * med_count + 0.5 * polypharm + 0.3 * age + np.random.normal(0, 2, n)

lasso = Lasso(alpha=0.5)
ridge = Ridge(alpha=5.0)
lasso.fit(X_corr, y)
ridge.fit(X_corr, y)

print("With correlated features (med_count and polypharm):")
print(f" Lasso: med_count={lasso.coef_[0]:.3f}, polypharm={lasso.coef_[1]:.3f}, age={lasso.coef_[2]:.3f}")
print(f" Ridge: med_count={ridge.coef_[0]:.3f}, polypharm={ridge.coef_[1]:.3f}, age={ridge.coef_[2]:.3f}")
print("True coefficients: med_count=0.5, polypharm=0.5, age=0.3")

# Lasso tends to concentrate weight on one of a correlated pair as alpha grows,
# while Ridge tends to split it. Whether Lasso zeroes a coefficient outright
# depends on alpha and on the sample, so report what happened instead of assuming it.
names = ["med_count", "polypharm", "age"]
zeroed = [name for name, coef in zip(names, lasso.coef_) if coef == 0]
print(f"\nLasso zeroed: {zeroed if zeroed else 'nothing at this alpha'}")

Elastic Net: Best of Both
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Elastic Net = L1 + L2
# Schematic objective: Loss + α × [l1_ratio × Σ|wᵢ| + (1-l1_ratio) × Σ(wᵢ²)]
# (scikit-learn's LogisticRegression puts C = 1/α on the loss term and halves the L2 term)
# l1_ratio = 1: pure L1
# l1_ratio = 0: pure L2
# l1_ratio = 0.5: balanced blend

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(
        penalty="elasticnet",
        solver="saga",  # the only LogisticRegression solver that supports elasticnet
        max_iter=2000,
    )),
])

# Search jointly over regularization strength (C) and the L1/L2 mix (l1_ratio)
param_grid = {
    "model__C": [0.01, 0.1, 1.0, 10.0],
    "model__l1_ratio": [0.1, 0.3, 0.5, 0.7, 0.9],
}

search = GridSearchCV(
    pipeline, param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="roc_auc", n_jobs=-1
)

# X_train and y_train are assumed to be defined already
search.fit(X_train, y_train)

print(f"Best params: {search.best_params_}")
print(f"Best CV AUC: {search.best_score_:.3f}")

Practical Decision Guide
def choose_regularization(n_features: int, n_samples: int, n_correlated_groups: int,
                          need_feature_selection: bool, sparse_signal: bool) -> str:
    """
    Rule-based selection guide. The rules are checked in order and the first
    match wins; treat the result as a starting point to confirm with cross-validation.
    """
    ratio = n_features / n_samples

    # Correlated features and no need for selection: keep every feature, shrink stably
    if not need_feature_selection and n_correlated_groups > 0:
        return "L2 (Ridge): correlated features; L2 distributes weight stably"

    # Sparse signal with selection wanted: L1, or Elastic Net if features are correlated
    if sparse_signal and need_feature_selection:
        if n_correlated_groups > 0:
            return "Elastic Net: sparse signal with correlated features"
        return "L1 (Lasso): sparse signal, want automatic feature selection"

    # Many features relative to samples
    if ratio > 0.5:
        return "L2 (Ridge) or Elastic Net: high feature-to-sample ratio needs regularization"

    return "L2 (Ridge, default): safe default for most cases"


# Examples:
print(choose_regularization(80, 200, 5, need_feature_selection=False, sparse_signal=False))
# → L2 (Ridge): correlated features, no feature selection needed
print(choose_regularization(200, 500, 0, need_feature_selection=True, sparse_signal=True))
# → L1 (Lasso): sparse signal, need feature selection
print(choose_regularization(100, 400, 10, need_feature_selection=True, sparse_signal=True))
# → Elastic Net: sparse signal + correlated features

Quick Reference Table
| Scenario | L1 | L2 | Elastic Net |
|---|---|---|---|
| Many irrelevant features | ✓ Best | OK | ✓ Good |
| Correlated features | Unstable | ✓ Best | ✓ Good |
| Want automatic feature selection | ✓ Yes | No | Partial |
| All features expected useful | Suboptimal | ✓ Best | OK |
| p >> n (more features than samples) | OK | ✓ Best | ✓ Good |
| Neural networks | Dropout instead | Weight decay (✓) | N/A |
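The "Many irrelevant features" row can be checked on synthetic data. The sketch below uses NumPy only and is an illustration under stated assumptions, not the course's method: the data (3 informative features out of 30), the strength `lam=0.2`, and the helpers `lasso_ista` and `ridge_closed_form` are all hypothetical, and the Lasso is solved with plain proximal gradient descent (ISTA) instead of scikit-learn's coordinate descent.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 30

# Hypothetical synthetic data: only the first 3 of 30 features carry signal.
X = rng.standard_normal((n, p))
w_true = np.zeros(p)
w_true[:3] = [3.0, -2.0, 1.5]
y = X @ w_true + 0.5 * rng.standard_normal(n)

def soft_threshold(v, thresh):
    """Shrink each entry toward zero by thresh, clipping small entries to 0."""
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

def lasso_ista(X, y, lam, n_iter=1000):
    """Minimize (1/2n)||y - Xw||^2 + lam*||w||_1 by proximal gradient descent."""
    n_samples = X.shape[0]
    # Step size = 1 / Lipschitz constant of the smooth part's gradient
    step = n_samples / np.linalg.norm(X, 2) ** 2
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n_samples
        w = soft_threshold(w - step * grad, step * lam)
    return w

def ridge_closed_form(X, y, lam):
    """Minimize (1/2n)||y - Xw||^2 + (lam/2)*||w||^2 via the normal equations."""
    n_samples, n_features = X.shape
    A = X.T @ X / n_samples + lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y / n_samples)

w_lasso = lasso_ista(X, y, lam=0.2)
w_ridge = ridge_closed_form(X, y, lam=0.2)

print("Lasso non-zero weights:", np.count_nonzero(w_lasso), "of", p)
print("Ridge non-zero weights:", np.count_nonzero(w_ridge), "of", p)
print("Lasso weights on the informative features:", np.round(w_lasso[:3], 2))
```

On data like this, the Lasso weights for the noise features typically come out exactly zero, while every ridge weight stays small but non-zero, which is the behaviour the first and third rows of the table describe.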
Interview Answer Template
Q: When do you use L1 vs L2 regularization?
L1 (Lasso) adds the sum of absolute weights to the loss and drives some weights to exactly zero, so it performs automatic feature selection. L2 (Ridge) adds the sum of squared weights and shrinks all weights proportionally toward zero without zeroing any. The geometric reason: L1's diamond constraint has sharp corners at sparse solutions; L2's sphere constraint is smooth, so the optimum rarely lands on an axis.

Use L1 when you have many features and expect only a few to be predictive; it will zero out the rest. Use L2 when most features have signal, or when features are correlated: L2 distributes weight stably across correlated predictors, while L1 tends to keep one of them more or less arbitrarily. If you have both concerns (sparse signal and correlated features), Elastic Net combines both penalties with a ratio parameter.

In practice, L2 is the safe default for most clinical tabular models; switch to L1 or Elastic Net when you need interpretability through sparsity.