Bayesian Hyperparameter Optimization
Bayesian optimization for hyperparameter tuning: surrogate models, acquisition functions, how it differs from grid and random search, and practical usage with Optuna and scikit-optimize.
Why Bayesian Optimization
Grid search: exhaustive; tries every combination in the grid.
Random search: cheaper; samples configurations with no memory of past results.
Bayesian optimization: uses past results to choose the next point to try.
Intuition:
If C=0.01 and C=0.001 both gave poor results, there is little point in trying C=0.005.
If C=1.0 gave the best result, it is worth trying C=0.5 and C=2.0 next.
Bayesian optimization builds a probabilistic model (a surrogate) of the
mapping from hyperparameters to performance, then uses it to pick the next
trial where performance is likely to be high.
The Components
1. Surrogate model:
Built from past results. A Gaussian Process (GP) models p(performance | hyperparameters) directly;
the Tree-structured Parzen Estimator (TPE) instead models p(hyperparameters | performance) separately for good and bad trials.
2. Acquisition function:
Given the surrogate, chooses the next hyperparameter to try.
Balances exploration (uncertain regions) vs exploitation (near current best).
Common: Expected Improvement (EI), Upper Confidence Bound (UCB).
3. Loop:
First trial(s): random initialization (no results to learn from yet).
Later trials: surrogate-guided. Pick the next point that maximizes the acquisition function,
train the model, update the surrogate with the new result, and repeat.
Using Optuna
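Before handing the loop to a library, it can help to see the three components written out by hand. The following is a minimal sketch, not how Optuna works internally (Optuna's default sampler is TPE, not a GP). It assumes a one-dimensional search over log10(C), a zero-mean Gaussian Process with a fixed RBF kernel as the surrogate, and Expected Improvement as the acquisition function. The function `toy_score` is a hypothetical stand-in for an expensive cross-validation run, and all names and constants here are illustrative.

```python
import math
import numpy as np


def rbf_kernel(a, b, length_scale=0.5):
    """Squared-exponential kernel between two 1-D arrays of points."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)


def gp_posterior(x_obs, y_obs, x_new, noise=1e-4):
    """Surrogate: GP posterior mean and std at x_new (unit prior variance)."""
    y_mean = y_obs.mean()
    K = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    K_s = rbf_kernel(x_obs, x_new)
    mu = y_mean + K_s.T @ np.linalg.solve(K, y_obs - y_mean)
    var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s), axis=0)
    return mu, np.sqrt(np.clip(var, 1e-12, None))


def expected_improvement(mu, sigma, best, xi=0.01):
    """Acquisition: E[max(f - best - xi, 0)] under N(mu, sigma^2), for maximization."""
    z = (mu - best - xi) / sigma
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    return (mu - best - xi) * cdf + sigma * pdf


def toy_score(log10_c):
    """Hypothetical stand-in for a CV score; peaks at log10(C) = 0, i.e. C = 1."""
    return 0.9 - 0.05 * log10_c ** 2


def bayes_opt(n_init=3, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    candidates = np.linspace(-3.0, 3.0, 121)       # grid of log10(C) values to score
    x_obs = rng.uniform(-3.0, 3.0, size=n_init)    # random initialization
    y_obs = np.array([toy_score(x) for x in x_obs])
    for _ in range(n_iter):
        mu, sigma = gp_posterior(x_obs, y_obs, candidates)     # fit surrogate
        ei = expected_improvement(mu, sigma, y_obs.max())      # score candidates
        x_next = candidates[int(np.argmax(ei))]                # pick next trial
        x_obs = np.append(x_obs, x_next)
        y_obs = np.append(y_obs, toy_score(x_next))            # "train" and record
    best = int(np.argmax(y_obs))
    return x_obs[best], y_obs[best]


best_log_c, best_score = bayes_opt()
print(f"best C ~ {10 ** best_log_c:.3g}, score = {best_score:.4f}")
```

Where the GP is uncertain, `sigma` is large and Expected Improvement rewards exploration; where the predicted mean beats the current best, it rewards exploitation. Real libraries add kernel fitting, multi-dimensional and categorical spaces, and a proper optimizer for the acquisition function.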
```python
import optuna
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

optuna.logging.set_verbosity(optuna.logging.WARNING)

# X_train and y_train are assumed to be defined already
# (training features and binary labels).


def objective(trial: optuna.Trial) -> float:
    """
    Returns the CV AUC for a given set of hyperparameters.
    Optuna calls this function, choosing hyperparameters based on past results.
    """
    n_estimators = trial.suggest_int("n_estimators", 50, 500)
    max_depth = trial.suggest_int("max_depth", 2, 8)
    learning_rate = trial.suggest_float("learning_rate", 0.005, 0.5, log=True)  # log-scale
    subsample = trial.suggest_float("subsample", 0.5, 1.0)
    min_samples_leaf = trial.suggest_int("min_samples_leaf", 1, 30)

    model = Pipeline([
        ("scaler", StandardScaler()),
        ("gbm", GradientBoostingClassifier(
            n_estimators=n_estimators,
            max_depth=max_depth,
            learning_rate=learning_rate,
            subsample=subsample,
            min_samples_leaf=min_samples_leaf,
            random_state=42,
        )),
    ])

    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="roc_auc")
    return scores.mean()


# Run the optimization
study = optuna.create_study(direction="maximize", sampler=optuna.samplers.TPESampler(seed=42))
study.optimize(objective, n_trials=50, show_progress_bar=True)

print("\nBest trial:")
print(f"  AUC: {study.best_value:.4f}")
print(f"  Params: {study.best_params}")
```
Inspecting the Study
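When reading the trial results in this section, keep in mind that `learning_rate` was searched on a log scale (`log=True` in the objective), so small values are represented far more densely than a linear range would give. The sketch below illustrates the idea with plain log-uniform sampling from the standard library. It is an illustration of log-scale sampling under that assumption, not Optuna's sampler, and the helper names are hypothetical.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def share_below(values, threshold):
    """Fraction of values strictly below the threshold."""
    return sum(v < threshold for v in values) / len(values)


rng = random.Random(0)
low, high = 0.005, 0.5  # same range as learning_rate in the objective

linear = [rng.uniform(low, high) for _ in range(10_000)]
log_scaled = [sample_log_uniform(low, high, rng) for _ in range(10_000)]

# 0.05 is the geometric midpoint of [0.005, 0.5]: about half of the log-scale
# draws fall below it, versus roughly 9% of the linear draws.
print(f"linear:    {share_below(linear, 0.05):.1%} of draws below 0.05")
print(f"log-scale: {share_below(log_scaled, 0.05):.1%} of draws below 0.05")
```

For parameters such as learning rates or regularization strengths, where the useful values span orders of magnitude, a log scale gives each decade the same share of the search budget.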
```python
import pandas as pd  # trials_dataframe() returns a pandas DataFrame

# All trial results
trials_df = study.trials_dataframe()
print("Top 5 trials by AUC:")
print(trials_df.sort_values("value", ascending=False).head(5)[
    ["number", "value", "params_n_estimators", "params_max_depth", "params_learning_rate"]
].to_string(index=False))

# Convergence: does the best AUC improve over time?
print("\nBest AUC so far at selected trials:")
best_so_far = trials_df["value"].cummax()
for i in [0, 5, 10, 20, 49]:
    if i < len(best_so_far):
        print(f"  After trial {i+1:2d}: best AUC = {best_so_far.iloc[i]:.4f}")

# Parameter importance (Optuna built-in)
importance = optuna.importance.get_param_importances(study)
print("\nHyperparameter importance:")
for param, imp in importance.items():
    bar = "█" * int(imp * 30)
    print(f"  {param:<20}: {imp:.3f} {bar}")
```
Using scikit-optimize (skopt)
```python
from skopt import BayesSearchCV
from skopt.space import Real, Integer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedKFold

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", GradientBoostingClassifier(random_state=42)),
])

# Keys use the "<step name>__<parameter>" convention for pipeline steps
search_space = {
    "model__n_estimators": Integer(50, 500),
    "model__max_depth": Integer(2, 8),
    "model__learning_rate": Real(0.005, 0.5, prior="log-uniform"),
    "model__subsample": Real(0.5, 1.0),
    "model__min_samples_leaf": Integer(1, 30),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
bayes_search = BayesSearchCV(
    pipeline,
    search_space,
    n_iter=50,
    cv=cv,
    scoring="roc_auc",
    random_state=42,
    n_jobs=-1,
    verbose=1,
)

# X_train and y_train are assumed to be defined already
bayes_search.fit(X_train, y_train)
print(f"Best params: {bayes_search.best_params_}")
print(f"Best CV AUC: {bayes_search.best_score_:.4f}")
```
Pruning: Stop Bad Trials Early (Optuna)
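The `MedianPruner` used in this section follows a median stopping rule: a running trial is stopped when its intermediate score at a given step falls below the median of what earlier trials had reached at that same step. The following is a simplified standard-library sketch of that idea for a maximized metric. It is not Optuna's implementation; the function name, the `history` structure, and the AUC values are hypothetical.

```python
from statistics import median


def should_prune(history, current, step, n_startup_trials=5, n_warmup_steps=2):
    """Simplified median rule for a metric that is being maximized.

    history: list of earlier trials, each a list of intermediate values by step.
    current: intermediate value of the running trial at `step`.
    """
    if len(history) < n_startup_trials:
        return False  # too few earlier trials to form a meaningful median
    if step < n_warmup_steps:
        return False  # early steps are noisy, so never prune during warm-up
    peers = [values[step] for values in history if len(values) > step]
    if not peers:
        return False  # no earlier trial reached this step
    return current < median(peers)


# Hypothetical running-mean AUCs of five earlier trials after folds 0, 1, 2
history = [
    [0.80, 0.81, 0.82],
    [0.78, 0.79, 0.80],
    [0.85, 0.86, 0.86],
    [0.70, 0.72, 0.71],
    [0.83, 0.84, 0.84],
]

print(should_prune(history, current=0.75, step=2))  # below the median at step 2
print(should_prune(history, current=0.85, step=2))  # above the median at step 2
print(should_prune(history, current=0.10, step=1))  # still inside the warm-up
```

The two guard parameters mirror the `n_startup_trials` and `n_warmup_steps` arguments passed to the pruner in the code that follows.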
```python
import numpy as np
import optuna
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# X_train and y_train are assumed to be NumPy arrays
# (with pandas objects, index rows via .iloc instead).


def objective_with_pruning(trial: optuna.Trial) -> float:
    """
    Evaluate fold by fold and prune unpromising trials early.
    """
    n_estimators = trial.suggest_int("n_estimators", 50, 500)
    max_depth = trial.suggest_int("max_depth", 2, 8)
    learning_rate = trial.suggest_float("learning_rate", 0.005, 0.5, log=True)

    model = GradientBoostingClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        learning_rate=learning_rate,
        random_state=42,
    )

    # Progressive evaluation: report intermediate results for pruning
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scores = []
    for step, (train_idx, val_idx) in enumerate(cv.split(X_train, y_train)):
        model.fit(X_train[train_idx], y_train[train_idx])
        score = roc_auc_score(y_train[val_idx], model.predict_proba(X_train[val_idx])[:, 1])
        scores.append(score)

        # Report the running mean AUC to Optuna
        trial.report(np.mean(scores), step)

        # Prune if this trial is below the median of earlier trials at this step
        if trial.should_prune():
            raise optuna.exceptions.TrialPruned()

    return float(np.mean(scores))


pruning_study = optuna.create_study(
    direction="maximize",
    pruner=optuna.pruners.MedianPruner(n_startup_trials=5, n_warmup_steps=2),
)
pruning_study.optimize(objective_with_pruning, n_trials=50)
print(f"Best AUC (with pruning): {pruning_study.best_value:.4f}")
```
Comparison: Grid vs Random vs Bayesian
| Aspect | Grid Search | Random Search | Bayesian Optimization |
|---|---|---|---|
| Strategy | Exhaustive | Random | Surrogate-guided |
| Memory of past results | No | No | Yes |
| Good for small space | Best | OK | Overkill |
| Good for large space | Infeasible | OK | Best |
| Parallelizable | Fully | Fully | Partially (async) |
| Interpretable results | Easy | Easy | Needs tools (Optuna) |
| Implementation complexity | Low | Low | Moderate |
| Early stopping of bad trials | No | No | Yes (pruning) |
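To put the "Infeasible" entry for grid search in perspective, the sketch below counts the combinations in a deliberately coarse grid over the five hyperparameters tuned earlier, and compares that with a fixed budget of 50 random trials. The grid values are hypothetical choices within the ranges used in this article; only the standard library is needed.

```python
import itertools
import math
import random

# Hypothetical coarse grid: 3 to 4 values per hyperparameter
grid = {
    "n_estimators": [50, 200, 350, 500],
    "max_depth": [2, 4, 6, 8],
    "learning_rate": [0.005, 0.05, 0.5],
    "subsample": [0.5, 0.75, 1.0],
    "min_samples_leaf": [1, 10, 30],
}

grid_points = list(itertools.product(*grid.values()))
n_folds = 5
print(f"Grid combinations: {len(grid_points)}")            # 4 * 4 * 3 * 3 * 3
print(f"Model fits with {n_folds}-fold CV: {len(grid_points) * n_folds}")

# Random search with a fixed budget tries a new value of every
# hyperparameter on each trial instead of revisiting a few grid values.
rng = random.Random(0)
budget = 50
random_learning_rates = {
    10 ** rng.uniform(math.log10(0.005), math.log10(0.5)) for _ in range(budget)
}
print(f"Distinct learning rates in the grid: {len(grid['learning_rate'])}")
print(f"Distinct learning rates in {budget} random trials: {len(random_learning_rates)}")
```

Each extra hyperparameter multiplies the grid size, while the budget for random or Bayesian search stays whatever you set it to. Bayesian optimization keeps that fixed budget and additionally uses earlier results to decide where to spend it.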
Interview Answer Template
Q: What is Bayesian hyperparameter optimization and when do you use it?
Bayesian optimization maintains a probabilistic model (a surrogate) of the mapping from hyperparameters to performance and updates it after each trial. An acquisition function then chooses the next point to evaluate, balancing exploration (uncertain regions) with exploitation (regions near the current best). Unlike random search, which has no memory of past results, every completed trial informs the next one. A widely used implementation is Optuna with the TPE (Tree-structured Parzen Estimator) sampler. Bayesian optimization is a good choice when each training run is expensive (large datasets, neural networks, gradient boosting with many trees), when the hyperparameter space has more than 3 to 4 dimensions, and when the trial budget is limited. Given the same budget, it often finds better solutions than random search in fewer trials. I also use Optuna's pruning feature, which stops clearly underperforming trials early, to get more out of the budget.