Hyperparameters vs Parameters
The distinction between model parameters (learned from data) and hyperparameters (set before training): examples, how each is optimized, and why this matters for model selection and evaluation.
The Core Distinction
Parameters:
- Learned from data during training
- The model's internal state that encodes what it has learned
- Updated by the training algorithm (gradient descent, or an equivalent such as greedy split search for trees) to minimize loss
- Examples: weights and biases in logistic regression, node splits in a tree
Hyperparameters:
- Set BEFORE training begins
- Control the training process or model structure
- NOT updated by gradient descent
- Optimized by a separate loop: train → evaluate → adjust → repeat
- Examples: learning rate, C in logistic regression, max_depth in a tree
Concrete Examples
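The snippets in this section fit models on `X_train` and `y_train`, which the text does not define. A minimal sketch for creating them, assuming a synthetic binary classification problem from scikit-learn's `make_classification` (the dataset, its size, and the 80/20 split are illustrative assumptions, not the author's data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 1,000 rows, 20 features, binary target
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=8, random_state=42
)

# Hold out 20% as a test set; stratify keeps the class balance in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)  # (800, 20) (200, 20)
```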
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
# LOGISTIC REGRESSION
# Parameters: coef_ (weights for each feature), intercept_
# Hyperparameters: C (regularization), penalty (L1/L2), solver, max_iter
lr = LogisticRegression(
    C=0.1,           # ← HYPERPARAMETER: inverse regularization strength (smaller C = stronger penalty)
    penalty="l2",    # ← HYPERPARAMETER: regularization type
    max_iter=1000,   # ← HYPERPARAMETER: training budget
)
lr.fit(X_train, y_train)
print("Parameters (learned):")
print(f" coef_: {lr.coef_[0][:5]}...") # ← weights for each feature
print(f" intercept_: {lr.intercept_}")
print("\nHyperparameters (set before training):")
print(f" C={lr.C}, penalty={lr.penalty}")
# DECISION TREE
# Parameters: threshold values at each split, feature indices, leaf values
# Hyperparameters: max_depth, min_samples_leaf, min_samples_split, criterion
dt = DecisionTreeClassifier(
    max_depth=5,           # ← HYPERPARAMETER
    min_samples_leaf=10,   # ← HYPERPARAMETER
    criterion="gini",      # ← HYPERPARAMETER
)
dt.fit(X_train, y_train)
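print("Parameters (learned):")
print(f" split features: {dt.tree_.feature[:5]}")      # feature index tested at each node (-2 marks a leaf)
print(f" split thresholds: {dt.tree_.threshold[:5]}")  # threshold learned at each node
# Summaries derived from the learned splits:
print(f" n_leaves: {dt.get_n_leaves()}")
print(f" tree depth: {dt.get_depth()}")
print(f" feature_importances_: {dt.feature_importances_[:5]}")
print("Hyperparameters:")
print(f" max_depth={dt.max_depth}, min_samples_leaf={dt.min_samples_leaf}")
# RANDOM FOREST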
# Parameters: each individual tree's internal state (splits and thresholds)
# Hyperparameters: n_estimators, max_depth, max_features, min_samples_leaf
rf = RandomForestClassifier(
    n_estimators=200,      # ← HYPERPARAMETER: how many trees
    max_depth=6,           # ← HYPERPARAMETER: how deep each tree can grow
    max_features="sqrt",   # ← HYPERPARAMETER: features considered at each split
    min_samples_leaf=5,    # ← HYPERPARAMETER: minimum leaf size
    random_state=42,
)
rf.fit(X_train, y_train)
print(f"n_estimators set as hyperparameter: {rf.n_estimators}")
print(f"Parameters: {rf.n_estimators} trees, each with learned thresholds")
How Each Is Optimized
# PARAMETERS: gradient descent (automatic, during training)
# The optimizer computes gradients of the loss w.r.t. each parameter
# and updates them to minimize loss; no human involvement is needed
# Training loop (simplified pseudocode; model, compute_loss, and compute_gradients are placeholders):
for epoch in range(100):
    predictions = model.forward(X_batch)
    loss = compute_loss(predictions, y_batch)
    gradients = compute_gradients(loss, model.parameters)
    model.parameters -= learning_rate * gradients  # parameter update
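The simplified loop above can be made concrete. The following is a minimal sketch, assuming full-batch gradient descent on logistic regression with NumPy and a small synthetic dataset; the data, `learning_rate = 0.1`, and the 100-epoch budget are illustrative assumptions. Note that `weights` and `bias` change on every iteration while `learning_rate` never does.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): 200 samples, 3 features, labels from a known linear rule
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = (X @ true_w + 0.3 > 0).astype(float)

weights = np.zeros(3)   # parameters: updated by gradient descent
bias = 0.0              # parameter: updated by gradient descent
learning_rate = 0.1     # hyperparameter: fixed before the loop starts
n_epochs = 100          # hyperparameter: training budget


def log_loss(p, y):
    """Mean binary cross-entropy, clipped for numerical safety."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))


losses = []
for epoch in range(n_epochs):
    logits = X @ weights + bias
    predictions = 1.0 / (1.0 + np.exp(-logits))  # forward pass (sigmoid)
    losses.append(log_loss(predictions, y))
    error = predictions - y                      # d(loss)/d(logits), per sample
    grad_w = X.T @ error / len(y)                # gradient w.r.t. weights
    grad_b = error.mean()                        # gradient w.r.t. bias
    weights -= learning_rate * grad_w            # parameter update
    bias -= learning_rate * grad_b               # parameter update

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
print(f"learned weights: {weights.round(2)}, bias: {bias:.2f}")
```

With all-zero starting weights every prediction is 0.5, so the first recorded loss is ln 2 (about 0.693); the loss then falls as the parameters are updated.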
# HYPERPARAMETERS: search loop (manual or automated)
# No gradient exists for hyperparameters; you must evaluate and compare
# Typical approach: train multiple models, compare CV performance, pick best
from sklearn.model_selection import GridSearchCV, StratifiedKFold
param_grid = {
    "max_depth": [3, 5, 7, 10],
    "min_samples_leaf": [1, 5, 10, 20],
}
search = GridSearchCV(
    DecisionTreeClassifier(),
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="roc_auc",
)
search.fit(X_train, y_train)
print(f"Best hyperparameters: {search.best_params_}")
# At these hyperparameters, the learned parameters (splits) are determined by training
Common Hyperparameters by Algorithm
| Algorithm | Key Hyperparameters |
|---|---|
| Logistic Regression | C (regularization), penalty (L1/L2), solver |
| Decision Tree | max_depth, min_samples_leaf, criterion |
| Random Forest | n_estimators, max_depth, max_features, min_samples_leaf |
| Gradient Boosting | n_estimators, max_depth, learning_rate, subsample |
| SVM | C, kernel (rbf/linear), gamma |
| k-NN | n_neighbors, metric (euclidean/manhattan), weights |
| Neural Network | learning_rate, hidden_sizes, dropout, batch_size, epochs |
| Ridge/Lasso | alpha (= λ) |
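In scikit-learn, hyperparameters like those in the table are constructor arguments, and every estimator exposes them through `get_params()`; learned parameters only appear after `fit`, as attributes with a trailing underscore. A short sketch of that convention, assuming a synthetic dataset purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumption) so the example is self-contained
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

model = LogisticRegression(C=0.1, max_iter=1000)

# Hyperparameters: available before training, straight from the constructor
hyperparams = model.get_params()
print({k: hyperparams[k] for k in ("C", "max_iter", "solver")})

# Learned parameters do not exist until fit() has run
fitted_before = hasattr(model, "coef_")
print(fitted_before)  # False

model.fit(X, y)
fitted_after = hasattr(model, "coef_")
print(fitted_after)  # True
print(model.coef_.shape, model.intercept_.shape)  # (1, 5) (1,)
```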
Why the Distinction Matters
# Mistake 1: Tuning hyperparameters on the test set
# → Test set measures final performance; using it to pick hyperparameters leaks it
# → Always tune on validation set, evaluate final model on test set
# Mistake 2: Treating learning rate as "just another weight"
# → Learning rate is not updated by gradient descent
# → It's set by the user and controls how parameters are updated
# Mistake 3: Counting parameters when comparing models
# → More parameters ≠ necessarily more complex
# → Hyperparameters like max_depth also control model complexity
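The second point under Mistake 3 can be checked empirically. The sketch below, which assumes a synthetic dataset and two arbitrary `max_depth` settings, fits the same algorithm twice; the only thing that changes is a hyperparameter, yet the number of learned split nodes differs sharply.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption) for illustration only
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

node_counts = {}
for depth in (2, None):  # None lets the tree grow until it can no longer split
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X, y)
    node_counts[depth] = tree.tree_.node_count
    print(f"max_depth={depth}: {tree.tree_.node_count} nodes, depth {tree.get_depth()}")
```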
# Correct workflow:
# 1. Training set: update parameters (model learns)
# 2. Validation set: evaluate and tune hyperparameters
# 3. Test set: final evaluation only (no decisions made from this)
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Hyperparameter tuning happens entirely on train_val (internally using CV)
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [50, 100, 200]},
    cv=5,
    scoring="roc_auc",  # tune on the same metric that is reported at the end
)
search.fit(X_train_val, y_train_val)
# Test evaluation only at the very end
test_proba = search.best_estimator_.predict_proba(X_test)[:, 1]
print(f"Test AUC: {roc_auc_score(y_test, test_proba):.3f}")
Neural Network: Both in One Model
import torch.nn as nn
class WarfarinDoseNet(nn.Module):
    # HYPERPARAMETERS (set before training):
    # n_features = 20 (architecture choice)
    # hidden_dim = 64 (architecture choice)
    # dropout_rate = 0.3 (regularization hyperparameter)
    def __init__(self, n_features: int = 20, hidden_dim: int = 64, dropout_rate: float = 0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden_dim),  # hidden_dim is a hyperparameter
            nn.ReLU(),
            nn.Dropout(dropout_rate),  # dropout_rate is a hyperparameter
            nn.Linear(hidden_dim, 1),
        )
    def forward(self, x):
        return self.net(x)
model = WarfarinDoseNet(hidden_dim=64, dropout_rate=0.3)
# PARAMETERS (learned during training by optimizer):
print("Parameters:")
for name, param in model.named_parameters():
    print(f" {name}: shape={tuple(param.shape)}, requires_grad={param.requires_grad}")
# → net.0.weight (64×20), net.0.bias (64), net.3.weight (1×64), net.3.bias (1)
# (nn.Linear stores its weight as out_features × in_features)
Interview Answer Template
Q: What's the difference between parameters and hyperparameters?
Parameters are the internal values a model learns from data during training: the weights and biases in logistic regression, the split thresholds in a decision tree. They're updated automatically by the training algorithm (gradient descent or an equivalent) to minimize the training loss. Hyperparameters are set before training begins and control how training happens or how the model is structured: the regularization strength C, the number of trees in a random forest, the learning rate. They're not updated by gradient descent; you optimize them through a separate search loop: train the model at a given hyperparameter setting, evaluate on a validation set, adjust, repeat. The practical implication is that hyperparameters must be tuned on a validation set, never on the training set (which rewards settings that overfit, such as the deepest tree or the weakest regularization) and never on the test set (which leaks the final evaluation). Nested cross-validation keeps hyperparameter tuning in an inner loop and performance estimation in an outer loop, so the reported score is not biased by the tuning.
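The nested cross-validation mentioned at the end of the answer can be sketched with scikit-learn by wrapping a `GridSearchCV` (inner loop, tunes hyperparameters) inside `cross_val_score` (outer loop, estimates performance). The dataset, grid, and fold counts below are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption) so the example runs on its own
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop: pick max_depth using only the data in each outer training fold
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 8]},
    cv=inner_cv,
    scoring="roc_auc",
)

# Outer loop: each score comes from a fold the inner search never saw
outer_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="roc_auc")
print(f"Nested CV AUC: {outer_scores.mean():.3f} +/- {outer_scores.std():.3f}")
```

Each outer fold may select a different `max_depth`; the outer scores estimate how well the whole tune-then-fit procedure generalizes, not the performance of one fixed setting.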