What is Regression?

The Core Idea

Regression predicts a continuous numeric value — not a category.

Classification:  predict which bin?  → "anticoagulant", "antidiabetic"
Regression:      predict a number?   → 5.2mg, 2.8, $12,400, 3.7 days

Linear Regression

The simplest regression model: a weighted sum of features plus a bias.

prediction = w₁·feature₁ + w₂·feature₂ + ... + wₙ·featureₙ + b

Python

from sklearn.linear_model import LinearRegression
import numpy as np

# Predict warfarin dose (mg/day) from patient features
X_train = np.array([
    [65, 78, 1.1, 2.3],   # age, weight_kg, creatinine, target_inr
    [72, 85, 1.4, 2.5],
    [58, 62, 0.9, 2.0],
    [80, 70, 1.8, 2.5],
])
y_train = np.array([5.0, 4.5, 6.5, 3.0])   # Dose in mg/day

model = LinearRegression()
model.fit(X_train, y_train)

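# Interpretation: each coefficient is the change in predicted dose per unit
# increase in that feature, holding the others fixed. Four toy rows are far
# too few for these values to be meaningful.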
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)

new_patient = np.array([[68, 75, 1.2, 2.5]])
print(f"Predicted dose: {model.predict(new_patient)[0]:.2f} mg/day")
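To connect the fitted model back to the weighted-sum formula, the sketch below recomputes one prediction by hand from `coef_` and `intercept_` and compares it with `model.predict`. It reuses the same toy data as the example above; the variable names `manual` and `library` are only illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same toy data as the warfarin example: age, weight_kg, creatinine, target_inr
X_train = np.array([
    [65, 78, 1.1, 2.3],
    [72, 85, 1.4, 2.5],
    [58, 62, 0.9, 2.0],
    [80, 70, 1.8, 2.5],
])
y_train = np.array([5.0, 4.5, 6.5, 3.0])

model = LinearRegression().fit(X_train, y_train)

new_patient = np.array([68, 75, 1.2, 2.5])

# prediction = w1*feature1 + ... + wn*featuren + b, computed by hand
manual = float(np.dot(model.coef_, new_patient) + model.intercept_)

# The same prediction through the library call
library = float(model.predict(new_patient.reshape(1, -1))[0])

print(f"manual:  {manual:.4f}")
print(f"library: {library:.4f}")
```

The two values should agree up to floating-point rounding, which is all `predict` does for a linear model.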

Loss Functions for Regression

MSE — Mean Squared Error

Python

def mse(y_true, y_pred):
    errors = y_true - y_pred
    return np.mean(errors ** 2)

Penalizes large errors more (squared)
Units are squared (mg² for a dose prediction)
Default for most regression tasks
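As a small worked example of the "penalizes large errors more" point, the sketch below compares two hypothetical sets of predictions with the same total absolute error (4 mg). One spreads the error evenly, the other concentrates it in a single prediction. The numbers are made up for illustration.

```python
import numpy as np

def mse(y_true, y_pred):
    errors = y_true - y_pred
    return np.mean(errors ** 2)

y_true = np.array([5.0, 5.0, 5.0, 5.0])

# Both prediction sets miss by 4 mg in total
spread_out = np.array([6.0, 6.0, 6.0, 6.0])     # four errors of 1 mg each
concentrated = np.array([5.0, 5.0, 5.0, 9.0])   # one error of 4 mg

mse_spread = mse(y_true, spread_out)            # (1 + 1 + 1 + 1) / 4 = 1.0
mse_concentrated = mse(y_true, concentrated)    # (0 + 0 + 0 + 16) / 4 = 4.0

print(f"MSE, spread-out errors:   {mse_spread:.1f}")
print(f"MSE, concentrated errors: {mse_concentrated:.1f}")
```

The single 4 mg miss costs four times as much under MSE as four 1 mg misses, even though the total absolute error is identical.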

RMSE — Root Mean Squared Error

Python

def rmse(y_true, y_pred):
    return np.sqrt(mse(y_true, y_pred))

Same units as the target (mg for dose)
More interpretable than MSE
Still sensitive to outliers

MAE — Mean Absolute Error

Python

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

Same units as target
More robust to outliers than MSE or RMSE (errors are not squared)
Use when large errors aren't disproportionately worse
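To see the difference in outlier sensitivity, the sketch below takes a set of reasonable predictions, replaces one with a hypothetical 5 mg miss, and measures how much each metric grows. The outlier value is an assumption chosen for illustration.

```python
import numpy as np

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

y_true = np.array([5.0, 4.5, 6.5, 3.0, 5.5])
y_pred_clean = np.array([4.8, 4.7, 6.2, 3.3, 5.1])

# Hypothetical outlier: the last prediction is off by 5 mg
y_pred_outlier = y_pred_clean.copy()
y_pred_outlier[-1] = 10.5

# How many times larger each metric becomes once the outlier is present
mae_ratio = mae(y_true, y_pred_outlier) / mae(y_true, y_pred_clean)
rmse_ratio = rmse(y_true, y_pred_outlier) / rmse(y_true, y_pred_clean)

print(f"MAE grows by a factor of  {mae_ratio:.2f}")
print(f"RMSE grows by a factor of {rmse_ratio:.2f}")
```

RMSE inflates more than MAE for the same single bad prediction, which is why MAE is often preferred when occasional large misses should not dominate the score.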

Evaluating Regression Models

Python

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([5.0, 4.5, 6.5, 3.0, 5.5])
y_pred = np.array([4.8, 4.7, 6.2, 3.3, 5.1])

print(f"MSE:  {mean_squared_error(y_true, y_pred):.4f}")
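# Take the square root explicitly; the `squared=False` argument was removed in recent scikit-learn releases
print(f"RMSE: {np.sqrt(mean_squared_error(y_true, y_pred)):.4f}")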
print(f"MAE:  {mean_absolute_error(y_true, y_pred):.4f}")
print(f"R²:   {r2_score(y_true, y_pred):.4f}")

R² (Coefficient of Determination)

R² = 1 - (MSE of model / MSE of baseline mean predictor)

R² = 1.0: perfect predictions
R² = 0.0: model is no better than predicting the mean
R² < 0:   model is worse than predicting the mean (very bad)
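The formula can be checked directly. The sketch below computes R² from the two MSE values and compares it with `r2_score`, then shows that a deliberately poor constant prediction (10.0, a made-up value) yields a negative R².

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([5.0, 4.5, 6.5, 3.0, 5.5])
y_pred = np.array([4.8, 4.7, 6.2, 3.3, 5.1])

# MSE of the model
mse_model = np.mean((y_true - y_pred) ** 2)

# MSE of the baseline that always predicts the mean of y_true
baseline_pred = np.full_like(y_true, y_true.mean())
mse_baseline = np.mean((y_true - baseline_pred) ** 2)

r2_manual = 1 - mse_model / mse_baseline
r2_library = r2_score(y_true, y_pred)

# A constant prediction far from the mean is worse than the baseline
bad_pred = np.full_like(y_true, 10.0)
r2_bad = r2_score(y_true, bad_pred)

print(f"R² by hand:      {r2_manual:.4f}")
print(f"R² from sklearn: {r2_library:.4f}")
print(f"R² of a bad constant predictor: {r2_bad:.4f}")
```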

Polynomial Regression

When the relationship isn't linear, add polynomial features.

Python

from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

# Dose-response curve for a drug may be non-linear (sigmoid-like)
X = np.array([[1], [2], [3], [4], [5]])   # dose
y = np.array([0.1, 0.4, 0.7, 0.85, 0.9])  # response

poly_model = Pipeline([
    ("poly", PolynomialFeatures(degree=3)),   # Add x², x³ features
    ("linear", LinearRegression()),
])
poly_model.fit(X, y)
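One way to check whether the polynomial terms help is to compare the fit against a plain linear model on the same data. The sketch below does this with the training-set R² from `.score`. A higher-degree model will always fit its own training data at least as well, so this shows only that the curve is captured, not that the model generalizes. Held-out data would be needed for that.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Same dose-response toy data as above
X = np.array([[1], [2], [3], [4], [5]])   # dose
y = np.array([0.1, 0.4, 0.7, 0.85, 0.9])  # response

linear_model = LinearRegression().fit(X, y)

poly_model = Pipeline([
    ("poly", PolynomialFeatures(degree=3)),
    ("linear", LinearRegression()),
]).fit(X, y)

# .score returns R² on the data passed in (here: the training data)
r2_linear = linear_model.score(X, y)
r2_poly = poly_model.score(X, y)

print(f"Training R², linear: {r2_linear:.4f}")
print(f"Training R², cubic:  {r2_poly:.4f}")
```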

Common Regression Algorithms

| Algorithm | When to Use |
|---|---|
| Linear Regression | Baseline; interpretable coefficients |
| Ridge Regression | Linear with L2 regularization; multicollinearity |
| Lasso Regression | Linear with L1; automatic feature selection |
| Random Forest Regressor | Non-linear, robust to outliers |
| Gradient Boosting (XGBoost) | Best on structured data |
| Neural Network | Complex patterns, large datasets |
| Support Vector Regressor | High-dimensional, small datasets |
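To illustrate the Ridge and Lasso rows, the sketch below fits both on synthetic data in which only two of five features affect the target. The data, the seed, and the `alpha` values are assumptions chosen for illustration, not recommended settings.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)

# Synthetic data (illustrative assumption): five features,
# but only the first two actually influence the target.
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks coefficients
lasso = Lasso(alpha=0.5).fit(X, y)   # L1 penalty: can set coefficients to exactly 0

print("Ridge coefficients:", np.round(ridge.coef_, 3))
print("Lasso coefficients:", np.round(lasso.coef_, 3))

# Count coefficients that are exactly zero in each model
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
print(f"Exact zeros: ridge={n_zero_ridge}, lasso={n_zero_lasso}")
```

Ridge keeps every feature with a small weight, while Lasso tends to drop the uninformative ones entirely, which is the "automatic feature selection" mentioned in the table.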

Regression for LLM Cost Prediction

Python

# Predicting LLM API cost: a regression problem
# Input features: prompt_tokens, completion_tokens, model_encoded
# Target: cost_usd (continuous)

import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

api_logs = pd.DataFrame({
    "prompt_tokens":     [120, 450, 2100, 800, 350],
    "completion_tokens": [80,  120,  500, 200, 150],
    "model_encoded":     [0,   1,    1,   0,   1],    # 0=mini, 1=full
    "cost_usd":          [0.0003, 0.014, 0.068, 0.020, 0.011],
})

X = api_logs[["prompt_tokens", "completion_tokens", "model_encoded"]]
y = api_logs["cost_usd"]

model = GradientBoostingRegressor(n_estimators=100)
model.fit(X, y)
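# Wrap the new request in a DataFrame so its column names match those used in fit
new_request = pd.DataFrame([[1500, 300, 1]], columns=X.columns)
predicted_cost = model.predict(new_request)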
print(f"Predicted cost: ${predicted_cost[0]:.4f}")

Interview Answer Template

Q: What is regression and when would you use it?

Regression is a supervised learning task where the goal is to predict a continuous numeric value, such as a drug dose in milligrams, a patient's INR level, or an LLM API cost in dollars. The model learns a function from features to a real number. The most common loss function is Mean Squared Error (MSE), which penalizes large errors more heavily. For evaluation, I use RMSE (same units as the target, so it is interpretable) and R² (the fraction of variance explained: 1.0 is perfect, 0 means no better than predicting the mean). Common algorithms include linear regression for baselines and interpretability, and gradient boosting for strong performance on structured data. Regression is appropriate whenever the output is a number on a continuous scale, as opposed to classification, which predicts a category.
