Machine Learning Foundations · Lesson 57 of 70

L1 Regularization (Lasso) Explained

The L1 Penalty

L1 loss = Standard loss + λ × Σ|wᵢ|

Where:
  |wᵢ| = absolute value of each weight
  λ    = regularization strength (higher = more regularization)

In sklearn LogisticRegression:
  C = 1/λ (smaller C = stronger L1)
  solver must be "liblinear" or "saga" for L1
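
To make the formula concrete, here is a minimal sketch that evaluates the penalized loss for a small weight vector. The weights, the base loss value, and the helper name `l1_penalized_loss` are illustrative assumptions, not part of the lesson's dataset or of sklearn.

```python
import numpy as np

def l1_penalized_loss(base_loss, weights, lam):
    """Return base_loss + lam * sum(|w_i|)."""
    return base_loss + lam * np.sum(np.abs(weights))

# Hypothetical values, chosen only to make the arithmetic easy to follow.
weights = np.array([0.5, -2.0, 0.0, 1.5])   # sum of |w_i| = 4.0
base_loss = 0.30

for lam in (0.1, 1.0, 10.0):
    total = l1_penalized_loss(base_loss, weights, lam)
    # sklearn's LogisticRegression takes the inverse strength C = 1/lam
    print(f"lambda={lam:<5} C={1 / lam:<5} penalized loss={total:.2f}")
```

The penalty depends only on the magnitudes of the weights, so a weight of -2.0 costs exactly as much as a weight of +2.0, and a weight that is already zero costs nothing.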

Why L1 Drives Weights to Exactly Zero

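The key difference from L2: the gradient of |w| is constant (±1) for any non-zero w, regardless of magnitude. The penalty doesn't "soften" near zero; it keeps pulling with the same force until the weight gets there. At w = 0 the penalty has a kink rather than a gradient, so solvers use a thresholding update that sets a weight to exactly zero once the data's pull on it is weaker than λ.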

Python

import numpy as np

# L2 gradient: d(w²)/dw = 2w → shrinks proportionally to magnitude
# L1 gradient: d|w|/dw = sign(w) → constant pull, direction only

# Consequence: for L2, weights approach zero asymptotically
# For L1, weights that are small enough hit exactly zero

def l1_path_intuition():
    """Show how L1 eliminates weights while L2 just shrinks them."""
    w = 2.0  # initial weight
    lambda_ = 0.5

    print(f"{'Step':>4}  {'L1 weight':>12}  {'L2 weight':>12}")
    print("-" * 32)

    w_l1 = w_l2 = w
    for step in range(8):
        # L1: subtract constant λ × sign(w) each step, clamping to 0 once |w| ≤ λ
        w_l1 = w_l1 - lambda_ * np.sign(w_l1) if abs(w_l1) > lambda_ else 0.0
        # L2: multiply by a fixed factor below 1 each step (0.8 here, for illustration)
        w_l2 = w_l2 * (1 - 0.2)  # proportional shrinkage
        print(f"{step:>4}  {w_l1:>12.4f}  {w_l2:>12.4f}")
    # L1 hits exactly 0.0 after a few steps
    # L2 never quite reaches zero (exponential decay)

l1_path_intuition()
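
The clamp-to-zero step in the loop above has a closed form. As a sketch under a simplifying assumption (a single weight whose unpenalized optimum is z), minimizing 0.5·(w − z)² + λ|w| gives the soft-thresholding rule, while minimizing 0.5·(w − z)² + λw² gives z / (1 + 2λ). The function names and the values in `z` below are illustrative.

```python
import numpy as np

def soft_threshold(z, lam):
    """Minimizer of 0.5*(w - z)**2 + lam*|w| (the L1 case)."""
    return np.where(z > lam, z - lam, np.where(z < -lam, z + lam, 0.0))

def ridge_shrink(z, lam):
    """Minimizer of 0.5*(w - z)**2 + lam*w**2 (the L2 case)."""
    return z / (1.0 + 2.0 * lam)

# Unpenalized solutions for six hypothetical weights
z = np.array([-3.0, -0.4, 0.0, 0.2, 0.5, 2.0])
lam = 0.5

w_l1 = soft_threshold(z, lam)
w_l2 = ridge_shrink(z, lam)

print("unpenalized:", z)
print("L1 solution:", w_l1, "non-zero:", np.count_nonzero(w_l1))
print("L2 solution:", w_l2, "non-zero:", np.count_nonzero(w_l2))
```

Every entry with |z| ≤ λ lands on exactly zero under L1, while L2 halves each entry (for λ = 0.5) and zeroes only the one that was already zero.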

Feature Selection with L1

Python

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as np

# Drug readmission model — 30 features, many may be irrelevant
feature_names = [
    "age", "weight", "height", "bmi", "serum_creatinine", "hba1c",
    "systolic_bp", "diastolic_bp", "heart_rate", "temp",
    "num_medications", "num_diagnoses", "prior_admissions", "length_of_stay",
    "on_insulin", "on_metformin", "on_warfarin", "on_aspirin",
    "has_ckd", "has_heart_failure", "has_copd", "has_diabetes",
    "discharge_to_home", "discharge_to_snf", "discharge_to_rehab",
    "insurance_medicare", "insurance_medicaid", "insurance_private",
    "admission_month", "weekend_admission"
]

pipeline_l1 = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(penalty="l1", C=0.1, solver="liblinear", max_iter=1000)),
])

# X_train (columns ordered as in feature_names) and y_train are assumed to be loaded already
pipeline_l1.fit(X_train, y_train)
coefs = pipeline_l1.named_steps["model"].coef_[0]

# Features that survived L1 (non-zero coefficients)
selected = [(name, coef) for name, coef in zip(feature_names, coefs) if coef != 0]
zeroed   = [name for name, coef in zip(feature_names, coefs) if coef == 0]

print(f"Features selected by L1 ({len(selected)}/{len(feature_names)}):")
for name, coef in sorted(selected, key=lambda x: abs(x[1]), reverse=True):
    direction = "↑" if coef > 0 else "↓"
    print(f"  {name:<30}: {coef:+.4f} {direction}")

print(f"\nFeatures zeroed out ({len(zeroed)}):")
print(" ", ", ".join(zeroed))

The Lasso Path

Python

from sklearn.linear_model import lasso_path
import numpy as np

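# Lasso path: trace how each feature's coefficient changes as λ increases
# (sklearn calls λ "alpha" in Lasso and lasso_path)
# At λ=0: standard regression (all features active)
# As λ increases: features are zeroed out one by one

# For regression (direct Lasso); X_train_scaled and y_train_continuous are assumed to exist.
# lasso_path returns the alphas sorted from largest to smallest.
alphas, coefs, _ = lasso_path(X_train_scaled, y_train_continuous, alphas=np.logspace(-4, 1, 50))

print("Lasso path (every 10th alpha, largest first):")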
print(f"{'Alpha':>10}  {'Non-zero coefs':>15}")
for alpha, coef_vec in zip(alphas[::10], coefs.T[::10]):
    n_nonzero = (coef_vec != 0).sum()
    print(f"{alpha:>10.4f}  {n_nonzero:>15}")
# At small alpha: many non-zero (low regularization)
# At large alpha: few non-zero (high regularization, most features removed)
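
For readers without the lesson's dataset, the same pattern can be reproduced from scratch. The following is a minimal sketch, not sklearn's implementation: a plain cyclic coordinate-descent solver for the objective (1/(2n))·‖y − Xw‖² + α‖w‖₁, run on synthetic data in which only the first three of ten features carry signal. The function names, data sizes, and alpha grid are all assumptions made for illustration.

```python
import numpy as np

def soft_threshold(z, lam):
    """Shrink z toward zero by lam, returning 0 when |z| <= lam."""
    return np.sign(z) * max(abs(z) - lam, 0.0)

def lasso_cd(X, y, alpha, n_iter=200):
    """Minimize (1/(2n))*||y - Xw||^2 + alpha*||w||_1 by cyclic coordinate descent."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            # Residual with feature j's current contribution added back
            r_j = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r_j / n
            w[j] = soft_threshold(rho, alpha) / col_sq[j]
    return w

# Synthetic data: 10 standardized features, only the first 3 are informative
rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.standard_normal((n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)
true_w = np.array([3.0, -2.0, 1.0] + [0.0] * (p - 3))
y = X @ true_w + 0.5 * rng.standard_normal(n)
y = y - y.mean()

print(f"{'Alpha':>10}  {'Non-zero coefs':>15}")
for alpha in [5.0, 1.5, 0.5, 0.1, 0.001]:
    w = lasso_cd(X, y, alpha)
    print(f"{alpha:>10.4f}  {np.count_nonzero(w):>15}")
```

Each coordinate update is the soft-thresholding step applied to one weight at a time, which is why coefficients drop to exactly zero rather than merely becoming small. The largest alpha in the grid exceeds every feature's correlation with the target, so the path starts with all coefficients at zero.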

L1 in Logistic Regression

Python

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
import numpy as np

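# L1 logistic regression: requires liblinear or saga solver
# X_train_scaled is assumed to be X_train after StandardScaler; scaling inside a
# Pipeline instead would keep validation-fold statistics out of the CV estimate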
Cs = [0.001, 0.01, 0.1, 1, 10, 100]
results = []

for C in Cs:
    model = LogisticRegression(
        penalty="l1",
        C=C,
        solver="liblinear",
        max_iter=1000,
        random_state=42
    )
    cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5, scoring="roc_auc")
    model.fit(X_train_scaled, y_train)
    n_selected = (model.coef_[0] != 0).sum()

    results.append((C, cv_scores.mean(), cv_scores.std(), n_selected))
    print(f"C={C:6}: AUC={cv_scores.mean():.3f} ± {cv_scores.std():.3f}, features={n_selected}")

# Pick C that maximizes CV AUC
best = max(results, key=lambda x: x[1])
print(f"\nBest C: {best[0]} — AUC={best[1]:.3f}, features selected: {best[3]}")

Using L1 for Feature Selection Then Refitting

Python

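from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Step 1: Use L1 to identify important features
# (standalone selector, handy for inspecting which features survive via
#  l1_selector.fit(X_train, y_train).named_steps["selector"].get_support())
l1_selector = Pipeline([
    ("scaler", StandardScaler()),
    ("selector", SelectFromModel(
        LogisticRegression(penalty="l1", C=0.1, solver="liblinear", max_iter=1000),
        prefit=False,
    )),
])

# Step 2: Refit a standard (L2) model on selected features
# (this pipeline embeds the same selector, so selection is redone inside each CV fold)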
pipeline_l1_then_l2 = Pipeline([
    ("scaler", StandardScaler()),
    ("selector", SelectFromModel(
        LogisticRegression(penalty="l1", C=0.1, solver="liblinear", max_iter=1000),
    )),
    ("model", LogisticRegression(penalty="l2", C=1.0, max_iter=1000)),
])

from sklearn.model_selection import cross_val_score
scores = cross_val_score(pipeline_l1_then_l2, X_train, y_train, cv=5, scoring="roc_auc")
print(f"L1-select + L2-refit CV AUC: {scores.mean():.3f} ± {scores.std():.3f}")
# Often better than L1 alone: L1 selects, then the L2 refit re-estimates the
# surviving coefficients without L1's extra shrinkage toward zero

When L1 Is Better Than L2

Use L1 when:
  - You have many features and suspect most are irrelevant (sparse signal)
  - Interpretability matters — you want the model to use as few features as possible
  - You need automatic feature selection without a separate selection step
  - The dataset has more features than samples (p >> n): L1 still yields a sparse, usable model here, though the lasso keeps at most n features

Use L2 instead when:
  - Most features are expected to have some signal (dense signal)
  - You want to keep all features but shrink them proportionally
  - Features are highly correlated — L1 picks one arbitrarily, L2 spreads weight
  - Gradient descent stability is needed (the L2 penalty is differentiable everywhere; the L1 penalty has a kink at zero)
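
The correlated-features point in the list above can be seen in an extreme case: two identical copies of one feature. This is a sketch on synthetic data, using a simple coordinate-descent lasso and the closed-form ridge solution; both helper functions and all constants are illustrative assumptions, not the lesson's pipeline.

```python
import numpy as np

def lasso_cd(X, y, alpha, n_iter=100):
    """Minimize (1/(2n))*||y - Xw||^2 + alpha*||w||_1 by cyclic coordinate descent."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ w + X[:, j] * w[j]      # residual without feature j
            rho = X[:, j] @ r_j / n
            w[j] = np.sign(rho) * max(abs(rho) - alpha, 0.0) / col_sq[j]
    return w

def ridge_closed_form(X, y, lam):
    """Minimize ||y - Xw||^2 + lam*||w||^2 via the normal equations."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(1)
x = rng.standard_normal(300)
x = (x - x.mean()) / x.std()
X = np.column_stack([x, x])          # two identical (perfectly correlated) features
y = 2.0 * x + 0.1 * rng.standard_normal(300)
y = y - y.mean()

w_l1 = lasso_cd(X, y, alpha=0.1)
w_l2 = ridge_closed_form(X, y, lam=1.0)

print("L1 weights:", np.round(w_l1, 4))   # nearly all weight on one copy
print("L2 weights:", np.round(w_l2, 4))   # weight split evenly across copies
```

Here L1 gives essentially all of the weight to whichever copy the solver visits first, which is what "arbitrary" means in practice: reordering the columns would flip the choice. L2 splits the weight evenly because its penalty is smallest when equal contributions are shared.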

Interview Answer Template

Q: What is L1 regularization (Lasso) and how does it differ from L2?

L1 regularization adds the sum of absolute weights to the loss function: Loss + λ × Σ|wᵢ|. The key property is that the gradient of |w| is ±1 regardless of the weight's magnitude — it applies a constant pull toward zero. This causes weights to hit exactly zero, which L2 cannot do (L2 gradient is 2w, which shrinks proportionally and never quite reaches zero). The result is automatic feature selection: at high enough regularization strength, L1 zeroes out irrelevant features entirely. I use L1 when I have many features and expect only a few to be truly predictive — clinical models with dozens of lab values, demographics, and diagnosis codes are a good example. The tradeoff: L1 can be unstable with correlated features (it picks one and ignores the rest, which can be arbitrary). For correlated features, L2 or Elastic Net is more stable.
