Machine Learning Foundations · Lesson 64 of 70

Bayesian Optimization for Hyperparameters

Why Bayesian Optimization

Grid search:    exhaustive — tries everything in the grid
Random search:  faster — samples without memory of past results

Bayesian optimization: uses past results to choose the NEXT point intelligently

Intuition:
  If C=0.01 and C=0.001 both gave poor results, no point trying C=0.005.
  If C=1.0 gave the best result, it's worth trying C=0.5 and C=2.0 next.

Bayesian optimization builds a probabilistic model (surrogate) of the
hyperparameter → performance mapping, then uses it to pick the next
trial where performance is likely to be high, or where the model is
still too uncertain to rule the region out.
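The idea of letting past results steer the next trial can be sketched in a few lines. The snippet below is a toy, one-dimensional illustration in the spirit of the Tree-structured Parzen Estimator named in the next section, not Optuna's implementation. The helper names (kde, suggest_next), the bandwidth, and the observed values are assumptions made up for the example.

```python
import numpy as np

def kde(points, x, bandwidth=0.1):
    """Gaussian kernel density built from `points`, evaluated at each value in `x`."""
    z = (x[:, None] - points[None, :]) / bandwidth
    return np.exp(-0.5 * z ** 2).mean(axis=1) / (bandwidth * np.sqrt(2.0 * np.pi))

def suggest_next(x_obs, y_obs, rng, good_fraction=0.5, n_candidates=64):
    """Propose the next value to try, favouring regions where good trials cluster."""
    if len(x_obs) < 2:
        return float(rng.uniform(0.0, 1.0))        # nothing to learn from yet
    n_good = max(1, int(np.ceil(good_fraction * len(x_obs))))
    n_good = min(n_good, len(x_obs) - 1)           # keep at least one "bad" trial
    order = np.argsort(y_obs)[::-1]                # best score first
    good, bad = x_obs[order[:n_good]], x_obs[order[n_good:]]
    # Draw candidates near the good trials, then rank them by good/bad density ratio
    candidates = rng.choice(good, size=n_candidates) + rng.normal(0.0, 0.1, n_candidates)
    candidates = np.clip(candidates, 0.0, 1.0)
    ratio = kde(good, candidates) / (kde(bad, candidates) + 1e-12)
    return float(candidates[np.argmax(ratio)])

# Hypothetical history: a hyperparameter rescaled to [0, 1] and its CV score
x_obs = np.array([0.10, 0.20, 0.80, 0.90])
y_obs = np.array([0.50, 0.55, 0.90, 0.85])

rng = np.random.default_rng(0)
x_next = suggest_next(x_obs, y_obs, rng)
print(f"Next value to try: {x_next:.2f}")
```

Because the two poor trials sit at the low end, the proposal lands in the upper half of the range, which mirrors the C=0.01 / C=0.001 intuition above.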

The Components

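1. Surrogate model:
   A cheap probabilistic model of performance, fitted to past results.
   Common: Gaussian Process (GP), which models p(performance | hyperparameters) directly;
   Tree-structured Parzen Estimator (TPE), which instead models
   p(hyperparameters | performance) separately for good and bad trials.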

2. Acquisition function:
   Given the surrogate, chooses the next hyperparameter to try.
   Balances exploration (uncertain regions) vs exploitation (near current best).
   Common: Expected Improvement (EI), Upper Confidence Bound (UCB).

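3. Loop:
   First few trials: random initialization (no results to learn from yet;
             Optuna's TPE sampler uses 10 random startup trials by default)
   Later trials: surrogate-guided. Pick the point that maximizes the acquisition function,
             evaluate it (train and score the model), update the surrogate with the result, repeat.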
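The three components above can be put together in a minimal sketch: a Gaussian Process surrogate with an RBF kernel, Expected Improvement as the acquisition function, and the loop that alternates between them. This is NumPy only and one-dimensional. The toy_score function, kernel length scale, and grid of candidates are assumptions chosen for illustration, not what Optuna or scikit-optimize run internally.

```python
import math
import numpy as np

def rbf_kernel(a, b, length_scale=0.2):
    """Squared-exponential kernel between two 1-D arrays of points."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_posterior(x_obs, y_obs, x_new, noise=1e-6):
    """Posterior mean and standard deviation of the GP surrogate at x_new."""
    K = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    K_s = rbf_kernel(x_obs, x_new)
    y_mean = y_obs.mean()                          # constant prior mean
    mu = y_mean + K_s.T @ np.linalg.solve(K, y_obs - y_mean)
    var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s), axis=0)
    return mu, np.sqrt(np.clip(var, 1e-12, None))

def expected_improvement(mu, sigma, best, xi=0.01):
    """EI for maximization: E[max(f - best - xi, 0)] with f ~ N(mu, sigma^2)."""
    z = (mu - best - xi) / sigma
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    return (mu - best - xi) * cdf + sigma * pdf

def toy_score(x):
    """Hypothetical stand-in for 'CV score as a function of one hyperparameter'."""
    return -(x - 0.7) ** 2 + 0.05 * np.sin(15 * x)

rng = np.random.default_rng(0)
candidates = np.linspace(0.0, 1.0, 201)            # points the acquisition may pick
x_obs = rng.uniform(0.0, 1.0, size=3)              # random initialization
y_obs = toy_score(x_obs)

for _ in range(12):                                # surrogate-guided trials
    mu, sigma = gp_posterior(x_obs, y_obs, candidates)
    ei = expected_improvement(mu, sigma, y_obs.max())
    x_next = candidates[np.argmax(ei)]             # maximize the acquisition
    x_obs = np.append(x_obs, x_next)               # evaluate and record the result
    y_obs = np.append(y_obs, toy_score(x_next))

best_x = x_obs[np.argmax(y_obs)]
print(f"Best x after {len(x_obs)} evaluations: {best_x:.3f} (score {y_obs.max():.4f})")
```

EI is large where the predicted mean is high (exploitation) or where sigma is large (exploration), so a single argmax handles the trade-off described in component 2.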

Using Optuna

Python

import optuna
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

optuna.logging.set_verbosity(optuna.logging.WARNING)

def objective(trial: optuna.Trial) -> float:
    """
    Returns the mean 5-fold CV AUC for one set of hyperparameters.
    Optuna calls this function once per trial; the sampler proposes the values.
    Assumes X_train and y_train (binary labels) are already defined.
    """
    n_estimators   = trial.suggest_int("n_estimators", 50, 500)
    max_depth      = trial.suggest_int("max_depth", 2, 8)
    learning_rate  = trial.suggest_float("learning_rate", 0.005, 0.5, log=True)  # log-scale
    subsample      = trial.suggest_float("subsample", 0.5, 1.0)
    min_samples_leaf = trial.suggest_int("min_samples_leaf", 1, 30)

    model = Pipeline([
        ("scaler", StandardScaler()),
        ("gbm", GradientBoostingClassifier(
            n_estimators=n_estimators,
            max_depth=max_depth,
            learning_rate=learning_rate,
            subsample=subsample,
            min_samples_leaf=min_samples_leaf,
            random_state=42,
        )),
    ])

    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="roc_auc")
    return scores.mean()

# Run the optimization
study = optuna.create_study(direction="maximize", sampler=optuna.samplers.TPESampler(seed=42))
study.optimize(objective, n_trials=50, show_progress_bar=True)

print("\nBest trial:")
print(f"  AUC: {study.best_value:.4f}")
print(f"  Params: {study.best_params}")
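The best CV AUC is the maximum over all trials, so it tends to be slightly optimistic. A common follow-up is to refit with the best parameters on the full training set and score once on held-out data. The sketch below assumes scikit-learn is installed; the synthetic dataset and the best_params values are placeholders standing in for the real split and for study.best_params.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; replace with the real train/test split
X, y = make_classification(n_samples=600, n_features=20, n_informative=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Hypothetical values standing in for study.best_params
best_params = {
    "n_estimators": 200,
    "max_depth": 3,
    "learning_rate": 0.05,
    "subsample": 0.8,
    "min_samples_leaf": 5,
}

# Refit on all training data, then score once on data the search never saw
final_model = GradientBoostingClassifier(**best_params, random_state=42)
final_model.fit(X_train, y_train)
test_auc = roc_auc_score(y_test, final_model.predict_proba(X_test)[:, 1])
print(f"Hold-out AUC: {test_auc:.4f}")
```

The parameter names match those suggested in the objective, so study.best_params can be unpacked into the classifier the same way.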

Inspecting the Study

Python

import pandas as pd

# All trial results
trials_df = study.trials_dataframe()
print("Top 5 trials by AUC:")
print(trials_df.sort_values("value", ascending=False).head(5)[
    ["number", "value", "params_n_estimators", "params_max_depth", "params_learning_rate"]
].to_string(index=False))

# Convergence: does the best AUC so far improve over time?
print("\nBest AUC so far at selected trials:")
best_so_far = trials_df["value"].cummax()
for i in [0, 5, 10, 20, 49]:
    if i < len(best_so_far):
        print(f"  After trial {i+1:2d}: best AUC = {best_so_far.iloc[i]:.4f}")

# Parameter importance (Optuna built-in)
importance = optuna.importance.get_param_importances(study)
print("\nHyperparameter importance:")
for param, imp in importance.items():
    bar = "█" * int(imp * 30)
    print(f"  {param:<20}: {imp:.3f}  {bar}")

Using scikit-optimize (skopt)

Python

from skopt import BayesSearchCV
from skopt.space import Real, Integer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedKFold

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", GradientBoostingClassifier(random_state=42)),
])

search_space = {
    "model__n_estimators":    Integer(50, 500),
    "model__max_depth":       Integer(2, 8),
    "model__learning_rate":   Real(0.005, 0.5, prior="log-uniform"),
    "model__subsample":       Real(0.5, 1.0),
    "model__min_samples_leaf": Integer(1, 30),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

bayes_search = BayesSearchCV(
    pipeline,
    search_space,
    n_iter=50,
    cv=cv,
    scoring="roc_auc",
    random_state=42,
    n_jobs=-1,
    verbose=1,
)

bayes_search.fit(X_train, y_train)

print(f"Best params: {bayes_search.best_params_}")
print(f"Best CV AUC: {bayes_search.best_score_:.4f}")

Pruning: Stop Bad Trials Early (Optuna)

Python

import numpy as np
import optuna
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def objective_with_pruning(trial: optuna.Trial) -> float:
    """
    Evaluate fold by fold and prune unpromising trials early.
    Assumes X_train and y_train are NumPy arrays (use .iloc for pandas objects).
    """
    n_estimators  = trial.suggest_int("n_estimators", 50, 500)
    max_depth     = trial.suggest_int("max_depth", 2, 8)
    learning_rate = trial.suggest_float("learning_rate", 0.005, 0.5, log=True)

    model = GradientBoostingClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        learning_rate=learning_rate,
        random_state=42,
    )

    # Progressive evaluation — report intermediate results for pruning
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scores = []
    for step, (train_idx, val_idx) in enumerate(cv.split(X_train, y_train)):
        model.fit(X_train[train_idx], y_train[train_idx])
        score = roc_auc_score(y_train[val_idx], model.predict_proba(X_train[val_idx])[:, 1])
        scores.append(score)

        # Report intermediate result to Optuna
        trial.report(np.mean(scores), step)

        # MedianPruner: prune if this trial's best intermediate AUC so far is
        # below the median of earlier trials at the same step
        if trial.should_prune():
            raise optuna.exceptions.TrialPruned()

    return np.mean(scores)

pruning_study = optuna.create_study(
    direction="maximize",
    pruner=optuna.pruners.MedianPruner(n_startup_trials=5, n_warmup_steps=2),
)
pruning_study.optimize(objective_with_pruning, n_trials=50)
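n_pruned = sum(t.state == optuna.trial.TrialState.PRUNED for t in pruning_study.trials)
print(f"Best AUC: {pruning_study.best_value:.4f} ({n_pruned} of {len(pruning_study.trials)} trials pruned)")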
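To make the pruning decision less of a black box, here is a simplified, standard-library sketch of the median rule for a maximization study. It is an illustration of the idea, not Optuna's code. The should_prune helper and the history values are made up for the example, and the startup-trial check is simplified to count all earlier trials.

```python
import statistics

def should_prune(history, current, step, n_startup_trials=5, n_warmup_steps=2):
    """
    Simplified median rule for a maximization study.

    history: intermediate values of earlier trials, one list per trial
    current: intermediate values reported so far by the running trial
    step:    index of the latest reported value
    """
    if len(history) < n_startup_trials or step < n_warmup_steps:
        return False                       # too little evidence to prune yet
    peers = [values[step] for values in history if len(values) > step]
    if not peers:
        return False                       # no earlier trial reached this step
    return max(current) < statistics.median(peers)

# Made-up running mean AUCs after each CV fold for five earlier trials
history = [
    [0.80, 0.81, 0.82, 0.82, 0.83],
    [0.78, 0.79, 0.80, 0.80, 0.81],
    [0.85, 0.86, 0.86, 0.87, 0.87],
    [0.70, 0.72, 0.73, 0.73, 0.74],
    [0.82, 0.83, 0.84, 0.84, 0.85],
]

weak_trial = [0.75, 0.76, 0.77]            # below the step-2 median of 0.82
strong_trial = [0.84, 0.85, 0.86]          # above it

print(should_prune(history, weak_trial, step=2))
print(should_prune(history, strong_trial, step=2))
print(should_prune(history, weak_trial[:2], step=1))
```

The third call shows the warm-up effect: with n_warmup_steps=2, no trial is pruned before its third reported value, however poor it looks.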

Comparison: Grid vs Random vs Bayesian

| Aspect | Grid Search | Random Search | Bayesian Optimization |
|---|---|---|---|
| Strategy | Exhaustive | Random | Surrogate-guided |
| Memory of past results | No | No | Yes |
| Good for small space | ✓ Best | OK | Overkill |
| Good for large space | Infeasible | OK | ✓ Best |
| Parallelizable | Fully | Fully | Partially (async) |
| Interpretable results | Easy | Easy | Needs tools (Optuna) |
| Implementation complexity | Low | Low | Moderate |
| Early stopping of bad trials | No | No | Yes (pruning) |
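A quick count shows why grid search becomes infeasible as the space grows. The sketch assumes a hypothetical grid of 5 candidate values for each of the 5 hyperparameters tuned in this lesson, with 5-fold CV, and compares it with the 50-trial budget used above.

```python
from math import prod

# Hypothetical grid: 5 candidate values per hyperparameter
values_per_param = {
    "n_estimators": 5,
    "max_depth": 5,
    "learning_rate": 5,
    "subsample": 5,
    "min_samples_leaf": 5,
}
n_folds = 5
n_trials = 50                                          # random or Bayesian budget

grid_combinations = prod(values_per_param.values())    # 5 ** 5
grid_fits = grid_combinations * n_folds                # model fits for the full grid
budget_fits = n_trials * n_folds                       # model fits for 50 trials

print(f"Grid search:   {grid_combinations} combinations, {grid_fits} model fits")
print(f"50-trial run:  {budget_fits} model fits")
```

Under these assumptions the grid needs 15,625 fits against 250 for a fixed 50-trial budget, and each extra hyperparameter multiplies the grid cost by another factor of 5.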

Interview Answer Template

Q: What is Bayesian hyperparameter optimization and when do you use it?

Bayesian optimization maintains a probabilistic model (surrogate) of the mapping from hyperparameters to performance and updates it after each trial. An acquisition function then chooses the next point to evaluate, balancing exploration (uncertain regions) against exploitation (regions near the current best). Random search, by contrast, ignores past results when it samples. A popular implementation is Optuna, whose default sampler is TPE (Tree-structured Parzen Estimator).

Bayesian optimization is the right choice when each training run is expensive (large datasets, neural networks, gradient boosting with many trees), when the hyperparameter space has more than 3–4 dimensions, and when the trial budget is limited. With the same budget it typically finds better solutions than random search, though the gain is not guaranteed. I also use Optuna's pruning feature, which stops clearly underperforming trials early, to get more out of the budget.
