Machine Learning Foundations · Lesson 4 of 70

How Does a Model Actually Learn?

Learning = Optimization

A model learns by finding the weights (parameters) that minimize a loss function — a number that measures how wrong the model's predictions are.

Goal: minimize Loss(predictions, true_labels)

High loss → model is very wrong → adjust weights
Low loss  → model is mostly right → keep weights
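
To make the idea concrete, here is a minimal sketch that scores two candidate weights with a loss. The toy data (`x`, `y_true`) and the helper `loss_for` are hypothetical and not part of the lesson's running example; they only illustrate that a lower loss means a better setting.

```python
import numpy as np

# Hypothetical toy data generated by y = 2 * x
x = np.array([1.0, 2.0, 3.0])
y_true = np.array([2.0, 4.0, 6.0])

def loss_for(weight: float) -> float:
    """Mean squared error of the one-parameter model y = weight * x."""
    predictions = weight * x
    return float(np.mean((predictions - y_true) ** 2))

bad_loss = loss_for(0.5)    # far from the true slope of 2
good_loss = loss_for(1.9)   # close to the true slope of 2

print(f"weight=0.5 -> loss={bad_loss:.3f}")
print(f"weight=1.9 -> loss={good_loss:.3f}")
```

Training is the search for the weight that makes this number as small as possible.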

Step 1: Forward Pass — Make a Prediction

The model takes input features and produces a prediction using its current weights.

Python

import numpy as np

# Simple linear model: prediction = weights · features + bias
def predict(X, weights, bias):
    return X @ weights + bias   # Dot product for one sample; matrix-vector product for a batch

# Current (random) weights — model doesn't know anything yet
weights = np.array([0.1, -0.3, 0.5])
bias = 0.0

# One patient's features
X = np.array([65.0, 78.0, 1.2])   # age, weight, creatinine

prediction = predict(X, weights, bias)
print(f"Prediction: {prediction:.2f}")   # Some number — probably wrong
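
The same forward pass works on many samples at once. The sketch below assumes a small hypothetical batch (`X_batch`, values invented for illustration) and repeats `predict` so it runs on its own.

```python
import numpy as np

def predict(X, weights, bias):
    # For a 2-D X, each row is one sample; the result has one prediction per row
    return X @ weights + bias

weights = np.array([0.1, -0.3, 0.5])
bias = 0.0

# Hypothetical batch: 3 patients x 3 features (age, weight, creatinine)
X_batch = np.array([
    [65.0, 78.0, 1.2],
    [50.0, 60.0, 0.9],
    [72.0, 85.0, 1.5],
])

batch_predictions = predict(X_batch, weights, bias)
print(batch_predictions.shape)   # (3,)
```

Because `@` handles both the 1-D and 2-D case, the model code does not change between a single sample and a batch.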

Step 2: Compute Loss — How Wrong Is the Prediction?

The loss function compares the prediction to the true label and outputs a scalar error.

Python

import math

def mse_loss(y_pred: float, y_true: float) -> float:
    """Squared error for one sample; averaged over a dataset it gives MSE (regression)."""
    return (y_pred - y_true) ** 2

def binary_cross_entropy(y_pred: float, y_true: int) -> float:
    """Cross-entropy for binary classification; y_pred is a probability in [0, 1]."""
    eps = 1e-7   # Clip predictions to avoid log(0)
    y_pred = max(eps, min(1 - eps, y_pred))
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1 - y_pred))

# True INR: 2.8, predicted: 1.5 — high loss
loss = mse_loss(1.5, 2.8)
print(f"Loss: {loss:.2f}")   # 1.69
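
Cross-entropy behaves differently from squared error: it penalizes confident wrong answers very heavily. The sketch below repeats `binary_cross_entropy` so it runs standalone; the three probabilities are invented for illustration.

```python
import math

def binary_cross_entropy(y_pred: float, y_true: int) -> float:
    eps = 1e-7   # Clip predictions to avoid log(0)
    y_pred = max(eps, min(1 - eps, y_pred))
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1 - y_pred))

# True label is 1 in all three cases; only the predicted probability changes
confident_right = binary_cross_entropy(0.9, 1)   # -ln(0.9), about 0.105
unsure          = binary_cross_entropy(0.5, 1)   # -ln(0.5), about 0.693
confident_wrong = binary_cross_entropy(0.1, 1)   # -ln(0.1), about 2.303

print(f"{confident_right:.3f} {unsure:.3f} {confident_wrong:.3f}")
```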

Step 3: Backward Pass — Compute Gradients

The gradient is the direction and magnitude of the loss's change with respect to each weight. It answers: "if I increase this weight slightly, does the loss go up or down?"

Gradient of loss w.r.t. weight_i:
  ∂Loss/∂w_i = how much does loss change per unit increase in w_i?

Positive gradient → increasing w_i increases loss → decrease w_i
Negative gradient → increasing w_i decreases loss → increase w_i

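In practice, deep learning frameworks such as PyTorch compute these gradients automatically via automatic differentiation (autograd). scikit-learn has no autograd; its estimators rely on gradients derived analytically for each supported loss.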
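
For a linear model with squared-error loss, the gradient can also be written by hand and checked numerically. The sketch below is one way to do that, under assumptions: the feature values in `x` are hypothetical (roughly scaled, unlike the raw patient features above), and the helper names `loss_fn`, `analytic_grad`, and `numerical_grad_w` are invented for this example.

```python
import numpy as np

def loss_fn(weights, bias, x, y_true):
    """Squared error of a linear model on one sample."""
    prediction = x @ weights + bias
    return (prediction - y_true) ** 2

def analytic_grad(weights, bias, x, y_true):
    """Chain rule: dLoss/dw_i = 2 * error * x_i, dLoss/dbias = 2 * error."""
    error = x @ weights + bias - y_true
    return 2 * error * x, 2 * error

def numerical_grad_w(weights, bias, x, y_true, h=1e-6):
    """Central finite differences: nudge each weight and measure the loss change."""
    grad = np.zeros_like(weights)
    for i in range(len(weights)):
        step = np.zeros_like(weights)
        step[i] = h
        loss_up = loss_fn(weights + step, bias, x, y_true)
        loss_down = loss_fn(weights - step, bias, x, y_true)
        grad[i] = (loss_up - loss_down) / (2 * h)
    return grad

weights = np.array([0.1, -0.3, 0.5])
bias = 0.0
x = np.array([0.65, 0.78, 1.2])   # hypothetical scaled features
y_true = 2.8

grad_w, grad_b = analytic_grad(weights, bias, x, y_true)
num_grad_w = numerical_grad_w(weights, bias, x, y_true)

print("analytic :", grad_w)
print("numerical:", num_grad_w)
```

Here every gradient entry is negative because the prediction is below the target, so increasing any weight would reduce the loss.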

Step 4: Weight Update — Move Toward Lower Loss

Python

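import numpy as np

learning_rate = 0.01   # How big a step to take

# Current parameters and illustrative gradient values (placeholders, not computed here)
weights, bias = np.array([0.1, -0.3, 0.5]), 0.0
grad_w, grad_b = np.array([0.4, -0.2, 0.1]), 0.3   # ∂Loss/∂weights, ∂Loss/∂bias

# Gradient descent update rule
# new_weight = old_weight - learning_rate * gradient
weights = weights - learning_rate * grad_w
bias    = bias    - learning_rate * grad_b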

The learning rate controls how large each update step is:

Too large → overshoots the minimum, loss oscillates or even diverges
Too small → converges very slowly
Just right → smooth convergence
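
The three regimes can be seen on a toy problem. This sketch assumes the one-dimensional loss f(w) = w², whose gradient is 2w and whose minimum is at w = 0; the function `run_gradient_descent` and the specific learning rates are chosen only for illustration.

```python
def run_gradient_descent(learning_rate: float, steps: int = 20, start: float = 5.0) -> float:
    """Minimize f(w) = w**2 with plain gradient descent and return the final w."""
    w = start
    for _ in range(steps):
        gradient = 2 * w                    # derivative of w**2
        w = w - learning_rate * gradient    # gradient descent update
    return w

too_large  = run_gradient_descent(1.1)     # each step overshoots and lands farther from 0
too_small  = run_gradient_descent(0.001)   # barely moves away from the start in 20 steps
reasonable = run_gradient_descent(0.3)     # ends very close to the minimum at 0

print(f"too large : w = {too_large:.3g}")
print(f"too small : w = {too_small:.3g}")
print(f"reasonable: w = {reasonable:.3g}")
```

On this particular loss, a learning rate above 1.0 makes the distance from the minimum grow at every step, which is the divergent extreme of overshooting.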

The Training Loop

Python

import numpy as np

# Simplified manual training loop (illustrative — use frameworks in practice)
def train(X_train, y_train, learning_rate=0.01, epochs=100):
    n_features = X_train.shape[1]
    weights = np.zeros(n_features)
    bias = 0.0
    losses = []

    for epoch in range(epochs):
        # Forward pass
        predictions = X_train @ weights + bias

        # Compute MSE loss
        errors = predictions - y_train
        loss = np.mean(errors ** 2)
        losses.append(loss)

        # Compute gradients (MSE gradient)
        grad_w = (2 / len(y_train)) * (X_train.T @ errors)
        grad_b = (2 / len(y_train)) * np.sum(errors)

        # Update weights
        weights -= learning_rate * grad_w
        bias    -= learning_rate * grad_b

        if epoch % 10 == 0:
            print(f"Epoch {epoch:3d}: loss = {loss:.4f}")

    return weights, bias, losses

# With a suitable learning rate, the loss should fall from epoch to epoch.
# Illustrative output (exact values depend on the data):
# Epoch   0: loss = 5.4321
# Epoch  10: loss = 2.1098
# Epoch  20: loss = 0.8734
# ...
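
A quick way to sanity-check a loop like this is to run it on synthetic data where the correct weights are known. The sketch below uses a compact copy of the loop (without printing) so it runs standalone; the data, `true_w`, `true_b`, and the chosen learning rate are assumptions made for the example.

```python
import numpy as np

def train(X_train, y_train, learning_rate=0.01, epochs=100):
    """Compact version of the manual loop: batch gradient descent on MSE."""
    weights = np.zeros(X_train.shape[1])
    bias = 0.0
    losses = []
    for _ in range(epochs):
        errors = X_train @ weights + bias - y_train
        losses.append(float(np.mean(errors ** 2)))
        weights -= learning_rate * (2 / len(y_train)) * (X_train.T @ errors)
        bias -= learning_rate * (2 / len(y_train)) * np.sum(errors)
    return weights, bias, losses

# Synthetic, noiseless data from a known linear rule (hypothetical values)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))
true_w, true_b = np.array([1.5, -2.0]), 0.5
y_train = X_train @ true_w + true_b

weights, bias, losses = train(X_train, y_train, learning_rate=0.1, epochs=200)

print("learned weights:", np.round(weights, 3))
print("learned bias   :", round(bias, 3))
print("first/last loss:", losses[0], losses[-1])
```

If the learned parameters do not approach the known ones on data this clean, the gradient or update code is the first place to look.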

What Happens Over Many Epochs

Loss
 │
5│ ×
4│   ×
3│     ×
2│       × ×
1│           × × × × × ×
0│                         ───────  (convergence)
 └──────────────────────────────────→ Epoch

As training progresses, the model's weights settle into values that minimize loss on the training data. This is convergence.
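
One simple way to detect convergence in code is to watch how much the loss improves between epochs. The helper below is a hypothetical sketch: the name `has_converged`, the `tolerance`, and the `patience` values are assumptions, and real training setups often use validation loss rather than training loss.

```python
def has_converged(losses, tolerance=1e-4, patience=3):
    """True if each of the last `patience` epochs improved the loss by less than `tolerance`.

    Simplification: an epoch where the loss gets worse also counts as "no improvement".
    """
    if len(losses) <= patience:
        return False   # Not enough history to judge
    recent = losses[-(patience + 1):]
    improvements = [recent[i] - recent[i + 1] for i in range(patience)]
    return all(improvement < tolerance for improvement in improvements)

# Hypothetical loss histories
still_improving = [5.0, 3.0, 2.0, 1.5, 1.2]
flattened_out = [5.0, 1.0, 0.50004, 0.50002, 0.50001, 0.50001]

print(has_converged(still_improving))   # False
print(has_converged(flattened_out))     # True
```

A check like this can be called at the end of each epoch to stop training early instead of always running a fixed number of epochs.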

Key Concepts Summary

| Concept | What It Is | Intuition |
|---|---|---|
| Loss function | Measures prediction error | "How wrong am I?" |
| Gradient | Direction of steepest increase in loss | "Which way is up?" |
| Gradient descent | Move weights in opposite direction of gradient | "Walk downhill" |
| Learning rate | Step size per update | "How big a step?" |
| Epoch | One full pass through the training data | "One lap around the track" |
| Convergence | Loss stops decreasing significantly | "Reached the valley" |
| Parameters / Weights | Values updated during training | "What the model learns" |

Interview Answer Template

Q: How does a model actually learn?

Learning is an optimization process. The model starts with random weights and makes predictions. A loss function measures how wrong those predictions are — for regression, that's often mean squared error; for classification, cross-entropy. The training algorithm then computes the gradient of the loss with respect to each weight, which tells us how to adjust them to reduce error. Weights are updated by moving in the opposite direction of the gradient — this is gradient descent. We repeat this for many epochs (passes through the training data) until the loss converges. Frameworks like PyTorch compute gradients automatically via backpropagation, so we only need to define the loss function and the model architecture.
