Hyperparameters vs Parameters

The Core Distinction

Parameters:
  - Learned from data during training
  - The model's internal state that encodes what it has learned
  - Updated by the training algorithm to minimize loss (gradient descent for linear models and neural networks, greedy split search for trees)
  - Examples: weights and biases in logistic regression, node splits in a tree

Hyperparameters:
  - Set BEFORE training begins
  - Control the training process or model structure
  - NOT updated by gradient descent
  - Optimized by a separate loop: train → evaluate → adjust → repeat
  - Examples: learning rate, C in logistic regression, max_depth in a tree
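
In scikit-learn the two lists above map onto the API: constructor arguments are hyperparameters and can be read with `get_params()` before any data is seen, while learned values only appear after `fit` as attributes with a trailing underscore. The following is a minimal sketch; the synthetic dataset and its size are assumptions made purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, assumed purely for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf = LogisticRegression(C=0.5, max_iter=500)

# Hyperparameters exist before the model has seen any data
print("C before fit:", clf.get_params()["C"])

# Learned parameters do not exist yet
has_coef_before = hasattr(clf, "coef_")
print("coef_ present before fit:", has_coef_before)

clf.fit(X, y)

# After fitting, learned parameters appear as trailing-underscore attributes
has_coef_after = hasattr(clf, "coef_")
print("coef_ shape after fit:", clf.coef_.shape)

# Fitting never changes the hyperparameters
print("C after fit:", clf.get_params()["C"])
```

The trailing underscore is a quick way to tell the two apart when reading unfamiliar scikit-learn code.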

Concrete Examples

Python

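from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Assumption: the snippets in this article need some X_train / y_train to run.
# A synthetic binary classification set stands in for real data here.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)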

# LOGISTIC REGRESSION
# Parameters:    coef_ (weights for each feature), intercept_
# Hyperparameters: C (regularization), penalty (L1/L2), solver, max_iter

lr = LogisticRegression(
    C=0.1,           # ← HYPERPARAMETER: inverse regularization strength (smaller C = stronger regularization)
    penalty="l2",    # ← HYPERPARAMETER: regularization type
    max_iter=1000,   # ← HYPERPARAMETER: training budget
)
lr.fit(X_train, y_train)

print("Parameters (learned):")
print(f"  coef_:      {lr.coef_[0][:5]}...")   # → weights for each feature
print(f"  intercept_: {lr.intercept_}")

print("\nHyperparameters (set before training):")
print(f"  C={lr.C}, penalty={lr.penalty}")

Python

# DECISION TREE
# Parameters:    threshold values at each split, feature indices, leaf values
# Hyperparameters: max_depth, min_samples_leaf, min_samples_split, criterion

dt = DecisionTreeClassifier(
    max_depth=5,           # ← HYPERPARAMETER
    min_samples_leaf=10,   # ← HYPERPARAMETER
    criterion="gini",      # ← HYPERPARAMETER
)
dt.fit(X_train, y_train)

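print("Parameters (learned):")
print(f"  split features:   {dt.tree_.feature[:5]}")     # feature index tested at each node (-2 marks a leaf)
print(f"  split thresholds: {dt.tree_.threshold[:5]}")   # threshold applied at each node
print("Summaries derived from the learned tree:")
print(f"  n_leaves: {dt.get_n_leaves()}")
print(f"  tree depth: {dt.get_depth()}")
print(f"  feature_importances_: {dt.feature_importances_[:5]}")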

print("Hyperparameters:")
print(f"  max_depth={dt.max_depth}, min_samples_leaf={dt.min_samples_leaf}")

Python

# RANDOM FOREST
# Parameters:    each individual tree's internal state (splits and thresholds)
# Hyperparameters: n_estimators, max_depth, max_features, min_samples_leaf

rf = RandomForestClassifier(
    n_estimators=200,     # ← HYPERPARAMETER: how many trees
    max_depth=6,          # ← HYPERPARAMETER: how deep each tree can grow
    max_features="sqrt",  # ← HYPERPARAMETER: features considered at each split
    min_samples_leaf=5,   # ← HYPERPARAMETER: minimum leaf size
    random_state=42,
)
rf.fit(X_train, y_train)

print(f"n_estimators set as hyperparameter: {rf.n_estimators}")
print(f"Parameters: {len(rf.estimators_)} fitted trees, each with learned splits and thresholds")

How Each Is Optimized

Python

# PARAMETERS: gradient descent (automatic, during training)
# The optimizer computes gradients of the loss w.r.t. each parameter
# and updates them to minimize loss — no human involvement needed

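# Training loop (simplified): logistic regression fitted by batch gradient descent.
# Assumes X_train is a NumPy float array and y_train holds 0/1 labels.
import numpy as np

learning_rate = 0.1                                  # hyperparameter: fixed for the whole loop
w, b = np.zeros(X_train.shape[1]), 0.0               # parameters: updated every epoch
for epoch in range(100):
    predictions = 1 / (1 + np.exp(-(X_train @ w + b)))        # forward pass
    error = predictions - y_train                             # gradient of log loss w.r.t. the logit
    w -= learning_rate * (X_train.T @ error) / len(y_train)   # parameter update
    b -= learning_rate * error.mean()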

# HYPERPARAMETERS: search loop (manual or automated)
# The training loss gives no usable gradient for most hyperparameters
# (max_depth, for instance, is discrete), so you evaluate and compare instead
# Typical approach: train multiple models, compare CV performance, pick best

from sklearn.model_selection import GridSearchCV, StratifiedKFold

param_grid = {
    "max_depth": [3, 5, 7, 10],
    "min_samples_leaf": [1, 5, 10, 20],
}

search = GridSearchCV(
    DecisionTreeClassifier(),
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="roc_auc",
)
search.fit(X_train, y_train)
print(f"Best hyperparameters: {search.best_params_}")
# At these hyperparameters, the learned parameters (splits) are determined by training
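
Grid search is only one way to automate the search loop. As a hedged sketch of an alternative, the block below uses scikit-learn's `RandomizedSearchCV` to sample `C` for a logistic regression on a log scale. The synthetic data, the search range, and the budget of 10 settings are assumptions chosen for illustration, not recommendations.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data, assumed purely for illustration
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": loguniform(1e-3, 1e2)},  # sample C on a log scale
    n_iter=10,            # number of hyperparameter settings to try
    cv=5,
    scoring="roc_auc",
    random_state=0,
)
search.fit(X, y)  # outer loop: hyperparameters; inner fits: parameters

best_C = search.best_params_["C"]
n_tried = len(search.cv_results_["params"])
print(f"best C: {best_C:.4f}, CV AUC: {search.best_score_:.3f}")
print(f"settings tried: {n_tried}")
```

Random sampling tends to be a reasonable choice when a hyperparameter spans several orders of magnitude, since a fixed grid would need many points to cover the same range.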

Common Hyperparameters by Algorithm

| Algorithm | Key Hyperparameters |
|---|---|
| Logistic Regression | C (regularization), penalty (L1/L2), solver |
| Decision Tree | max_depth, min_samples_leaf, criterion |
| Random Forest | n_estimators, max_depth, max_features, min_samples_leaf |
| Gradient Boosting | n_estimators, max_depth, learning_rate, subsample |
| SVM | C, kernel (rbf/linear), gamma |
| k-NN | n_neighbors, metric (euclidean/manhattan), weights |
| Neural Network | learning_rate, hidden_sizes, dropout, batch_size, epochs |
| Ridge/Lasso | alpha (= λ) |
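
To make one row of the table concrete, here is a small sketch of how a hyperparameter shapes the learned parameters: as Ridge's `alpha` grows, the norm of the fitted `coef_` shrinks. The synthetic regression data and the particular `alpha` values are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic regression data, assumed for illustration only
X, y = make_regression(n_samples=100, n_features=10, noise=10.0, random_state=0)

coef_norms = []
for alpha in [0.01, 1.0, 100.0, 10000.0]:      # hyperparameter: chosen by us
    model = Ridge(alpha=alpha).fit(X, y)       # parameters: learned for this alpha
    coef_norms.append(np.linalg.norm(model.coef_))
    print(f"alpha={alpha:>8}: ||coef_|| = {coef_norms[-1]:.2f}")
```

The same data and the same training algorithm yield different parameters at each setting, which is why the hyperparameter has to be chosen by an outer comparison.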

Why the Distinction Matters

Python

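# Mistake 1: Tuning hyperparameters on the test set
# → The test set measures final performance; using it to pick hyperparameters
#   leaks test information into the model choice and inflates the reported score
# → Always tune on a validation set (or with cross-validation), evaluate the final model on the test set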

# Mistake 2: Treating learning rate as "just another weight"
# → Learning rate is not updated by gradient descent
# → It's set by the user and controls how parameters are updated

# Mistake 3: Comparing models by parameter count alone
# → A higher parameter count does not necessarily mean a more complex model
# → Hyperparameters like max_depth also control model complexity

# Correct workflow:
# 1. Training set: update parameters (model learns)
# 2. Validation set: evaluate and tune hyperparameters
# 3. Test set: final evaluation only (no decisions made from this)

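from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# stratify keeps the class balance equal in both splits; random_state makes the split reproducible
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Hyperparameter tuning happens entirely on train_val (internally using CV)
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [50, 100, 200]},
    cv=5,
    scoring="roc_auc",   # tune on the same metric reported below
)
search.fit(X_train_val, y_train_val)

# Test evaluation only at the very end
test_auc = roc_auc_score(y_test, search.best_estimator_.predict_proba(X_test)[:, 1])
print(f"Test AUC: {test_auc:.3f}")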
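
The three-step workflow can also be written out with an explicit validation set instead of cross-validation. This is a minimal sketch under stated assumptions: the synthetic data, the 60/20/20 split, and the candidate `max_depth` values are all illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, assumed purely for illustration
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=0)

# 60% train, 20% validation, 20% test
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=0
)

val_scores = {}
for max_depth in [2, 4, 8, None]:
    model = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    model.fit(X_train, y_train)  # step 1: parameters learned on the training set
    # step 2: the hyperparameter setting is judged on the validation set
    val_scores[max_depth] = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])

best_depth = max(val_scores, key=val_scores.get)

# Refit on train + validation with the chosen setting, then use the test set once
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_tmp, y_tmp)
test_auc = roc_auc_score(y_test, final_model.predict_proba(X_test)[:, 1])  # step 3
print(f"best max_depth={best_depth}, test AUC={test_auc:.3f}")
```

No decision in the loop reads `X_test`, so the final number remains an honest estimate of performance on unseen data.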

Neural Network: Both in One Model

Python

import torch.nn as nn

class WarfarinDoseNet(nn.Module):
    # HYPERPARAMETERS (set before training):
    #   hidden_dim = 64 (architecture choice)
    #   dropout_rate = 0.3 (regularization hyperparameter)
    # n_features = 20 is fixed by the number of input columns in the data, not tuned

    def __init__(self, n_features: int = 20, hidden_dim: int = 64, dropout_rate: float = 0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden_dim),  # hidden_dim is a hyperparameter
            nn.ReLU(),
            nn.Dropout(dropout_rate),            # dropout_rate is a hyperparameter
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, x):
        return self.net(x)

model = WarfarinDoseNet(hidden_dim=64, dropout_rate=0.3)

# PARAMETERS (learned during training by optimizer):
print("Parameters:")
for name, param in model.named_parameters():
    print(f"  {name}: shape={tuple(param.shape)}, requires_grad={param.requires_grad}")
# → net.0.weight (64, 20), net.0.bias (64,), net.3.weight (1, 64), net.3.bias (1,)
#   (nn.Linear stores its weight as out_features × in_features)
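
The architecture hyperparameters also fix how many parameters there are to learn. The sketch below does that arithmetic in plain Python for the two-layer network above; the helper names `linear_param_count` and `mlp_param_count` are hypothetical, introduced only for this illustration. In PyTorch, the same total should come out of `sum(p.numel() for p in model.parameters())`, since ReLU and Dropout carry no parameters.

```python
def linear_param_count(in_features: int, out_features: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


def mlp_param_count(n_features: int, hidden_dim: int) -> int:
    """Count for Linear(n_features, hidden_dim) -> ReLU -> Dropout -> Linear(hidden_dim, 1)."""
    return linear_param_count(n_features, hidden_dim) + linear_param_count(hidden_dim, 1)


# Changing one hyperparameter (hidden_dim) changes how many parameters get learned
for hidden_dim in [16, 64, 256]:
    print(f"hidden_dim={hidden_dim}: {mlp_param_count(20, hidden_dim)} learned parameters")
```

Note that `dropout_rate` does not appear in the count: it changes how training behaves without adding anything to learn.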

Interview Answer Template

Q: What's the difference between parameters and hyperparameters?

Parameters are the internal values a model learns from data during training: the weights and biases in logistic regression, the split thresholds in a decision tree. The training algorithm updates them automatically (gradient descent or an equivalent procedure) to minimize the training loss. Hyperparameters are set before training begins and control how training happens or how the model is structured: the regularization strength C, the number of trees in a random forest, the learning rate. They're not updated by gradient descent; you optimize them through a separate search loop: train the model at a given hyperparameter setting, evaluate on a validation set, adjust, repeat. The practical implication: hyperparameters must be tuned on a validation set or with cross-validation, never on the training set (the training score keeps improving as the model gets more flexible, so you would select an overfit model) and never on the test set (that leaks the final evaluation). Nested cross-validation separates hyperparameter tuning from model evaluation for a clean estimate.
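
The nested cross-validation mentioned at the end of the answer can be sketched in a few lines with scikit-learn: an inner `GridSearchCV` picks hyperparameters, and an outer `cross_val_score` evaluates the whole tuning procedure on folds the search never saw. The synthetic data, the grid of `C` values, and the fold counts are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic data, assumed purely for illustration
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)  # tunes hyperparameters
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)  # estimates performance

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=inner_cv,
    scoring="roc_auc",
)

# Each outer fold reruns the full inner search on its own training portion
outer_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="roc_auc")
print(f"nested CV AUC: {outer_scores.mean():.3f} +/- {outer_scores.std():.3f}")
```

The outer scores describe the tuning procedure as a whole rather than one fixed hyperparameter setting, which is what keeps the estimate clean.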
