What is Regression?
Understand regression in machine learning: predicting continuous values, linear and polynomial regression, loss functions (MSE, MAE, RMSE), evaluation metrics, and real AI applications like dose prediction and outcome forecasting.
The Core Idea
Regression predicts a continuous numeric value, not a category.
Classification: predict which bin? → "anticoagulant", "antidiabetic"
Regression: predict a number? → 5.2 mg, 2.8, $12,400, 3.7 days
Linear Regression
The simplest regression model: a weighted sum of features plus a bias.
prediction = w₁·feature₁ + w₂·feature₂ + ... + wₙ·featureₙ + b
from sklearn.linear_model import LinearRegression
import numpy as np
# Predict warfarin dose (mg/day) from patient features
X_train = np.array([
    [65, 78, 1.1, 2.3],  # age, weight_kg, creatinine, target_inr
    [72, 85, 1.4, 2.5],
    [58, 62, 0.9, 2.0],
    [80, 70, 1.8, 2.5],
])
y_train = np.array([5.0, 4.5, 6.5, 3.0]) # Dose in mg/day
model = LinearRegression()
model.fit(X_train, y_train)
# Interpretation: how much does each feature affect dose?
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
new_patient = np.array([[68, 75, 1.2, 2.5]])
print(f"Predicted dose: {model.predict(new_patient)[0]:.2f} mg/day")
Loss Functions for Regression
MSE: Mean Squared Error
def mse(y_true, y_pred):
    errors = y_true - y_pred
    return np.mean(errors ** 2)
- Penalizes large errors more (squared)
- Units are squared (mg² for a dose prediction)
- Default for most regression tasks
RMSE: Root Mean Squared Error
def rmse(y_true, y_pred):
    return np.sqrt(mse(y_true, y_pred))
- Same units as the target (mg for dose)
- More interpretable than MSE
- Still sensitive to outliers
MAE: Mean Absolute Error
def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))
- Same units as target
- Robust to outliers (doesn't square errors)
- Use when large errors aren't disproportionately worse
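To make the outlier sensitivity described above concrete, here is a small sketch that computes all three losses on the same targets, once with uniformly small errors and once with a single large miss. The dose values are made up for illustration, and the helper functions are redefined so the block runs on its own.

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def rmse(y_true, y_pred):
    return np.sqrt(mse(y_true, y_pred))

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

# Hypothetical doses in mg/day (illustrative values, not real data)
y_true = np.array([5.0, 4.5, 6.5, 3.0, 5.5])
y_good = np.array([4.5, 5.0, 6.0, 3.5, 5.0])  # every prediction off by 0.5 mg
y_outlier = y_good.copy()
y_outlier[-1] = 10.5                           # one prediction off by 5 mg

for name, y_pred in [("no outlier", y_good), ("one outlier", y_outlier)]:
    print(f"{name:12s} MSE={mse(y_true, y_pred):.2f}  "
          f"RMSE={rmse(y_true, y_pred):.2f}  MAE={mae(y_true, y_pred):.2f}")
```

With these numbers, the single 5 mg miss raises MAE from 0.50 to 1.40 but raises RMSE from 0.50 to about 2.28 (MSE from 0.25 to 5.20), which is the sense in which squared losses react more strongly to outliers.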
Evaluating Regression Models
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
y_true = np.array([5.0, 4.5, 6.5, 3.0, 5.5])
y_pred = np.array([4.8, 4.7, 6.2, 3.3, 5.1])
print(f"MSE: {mean_squared_error(y_true, y_pred):.4f}")
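# RMSE via np.sqrt: the squared=False argument was deprecated and later removed in recent scikit-learn releases
print(f"RMSE: {np.sqrt(mean_squared_error(y_true, y_pred)):.4f}")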
print(f"MAE: {mean_absolute_error(y_true, y_pred):.4f}")
print(f"R²: {r2_score(y_true, y_pred):.4f}")
R² (Coefficient of Determination)
R² = 1 - (MSE of model / MSE of baseline mean predictor)
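The ratio above can be checked by hand. The sketch below computes R² from the two MSE values and compares the result with `r2_score`, reusing the illustrative arrays from the evaluation snippet.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([5.0, 4.5, 6.5, 3.0, 5.5])
y_pred = np.array([4.8, 4.7, 6.2, 3.3, 5.1])

# MSE of the model's predictions
mse_model = np.mean((y_true - y_pred) ** 2)

# MSE of a baseline that always predicts the mean of y_true
baseline = np.full_like(y_true, y_true.mean())
mse_baseline = np.mean((y_true - baseline) ** 2)

r2_manual = 1 - mse_model / mse_baseline
print(f"Manual R²: {r2_manual:.4f}")
print(f"r2_score:  {r2_score(y_true, y_pred):.4f}")
```

Here the model MSE is 0.084 and the baseline MSE is 1.34, so both lines should report an R² of roughly 0.937.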
R² = 1.0: perfect predictions
R² = 0.0: model is no better than predicting the mean
R² < 0: model is worse than predicting the mean (very bad)
Polynomial Regression
When the relationship isn't linear, add polynomial features.
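As a sketch of what "adding polynomial features" means in practice: with `degree=3` and default settings, `PolynomialFeatures` expands a single input x into the columns [1, x, x², x³], so a linear model fitted on those columns is a cubic in x. The input values below are arbitrary.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_demo = np.array([[1.0], [2.0], [3.0]])

# degree=3 with the default include_bias=True yields [1, x, x^2, x^3] per row
expanded = PolynomialFeatures(degree=3).fit_transform(X_demo)
print(expanded)
```

The row for x = 3 becomes [1, 3, 9, 27]; the regression itself stays linear in these expanded columns.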
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
# Dose-response curve for a drug may be non-linear (sigmoid-like)
X = np.array([[1], [2], [3], [4], [5]]) # dose
y = np.array([0.1, 0.4, 0.7, 0.85, 0.9]) # response
poly_model = Pipeline([
    ("poly", PolynomialFeatures(degree=3)),  # Add x², x³ features
    ("linear", LinearRegression()),
])
poly_model.fit(X, y)
Common Regression Algorithms
| Algorithm | When to Use |
|---|---|
| Linear Regression | Baseline; interpretable coefficients |
| Ridge Regression | Linear with L2 regularization; multicollinearity |
| Lasso Regression | Linear with L1; automatic feature selection |
| Random Forest Regressor | Non-linear, robust to outliers |
| Gradient Boosting (XGBoost) | Best on structured data |
| Neural Network | Complex patterns, large datasets |
| Support Vector Regressor | High-dimensional, small datasets |
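To illustrate the Ridge and Lasso rows of the table, the sketch below fits both on synthetic data in which only the first of five features influences the target. The data, the seed, and the `alpha` values are arbitrary choices for illustration, not tuned recommendations.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 5))                            # five synthetic features
y_syn = 3.0 * X_syn[:, 0] + rng.normal(scale=0.5, size=200)  # only feature 0 drives the target

ridge = Ridge(alpha=1.0).fit(X_syn, y_syn)   # L2 penalty: shrinks coefficients toward zero
lasso = Lasso(alpha=0.5).fit(X_syn, y_syn)   # L1 penalty: can set coefficients exactly to zero

print("Ridge coefficients:", np.round(ridge.coef_, 3))
print("Lasso coefficients:", np.round(lasso.coef_, 3))
```

In this setup Lasso should keep a shrunken weight on feature 0 and zero out the four irrelevant features, while Ridge leaves them small but nonzero. That difference is what "automatic feature selection" refers to.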
Regression for LLM Cost Prediction
# Predicting LLM API cost: a regression problem
# Input features: prompt_tokens, completion_tokens, model_encoded
# Target: cost_usd (continuous)
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
api_logs = pd.DataFrame({
    "prompt_tokens": [120, 450, 2100, 800, 350],
    "completion_tokens": [80, 120, 500, 200, 150],
    "model_encoded": [0, 1, 1, 0, 1],  # 0=mini, 1=full
    "cost_usd": [0.0003, 0.014, 0.068, 0.020, 0.011],
})
X = api_logs[["prompt_tokens", "completion_tokens", "model_encoded"]]
y = api_logs["cost_usd"]
model = GradientBoostingRegressor(n_estimators=100)
model.fit(X, y)
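# Wrap the new request in a DataFrame so its feature names match the training data
new_request = pd.DataFrame([[1500, 300, 1]], columns=X.columns)
predicted_cost = model.predict(new_request)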
print(f"Predicted cost: ${predicted_cost[0]:.4f}")
Interview Answer Template
Q: What is regression and when would you use it?
Regression is a supervised learning task where the goal is to predict a continuous numeric value, such as a drug dose in milligrams, a patient's INR level, or an LLM API cost in dollars. The model learns a function from features to a real number. The most common loss function is Mean Squared Error (MSE), which penalizes large errors more heavily. For evaluation, I use RMSE (same units as the target, interpretable) and R² (fraction of variance explained: 1.0 is perfect, 0 means no better than predicting the mean). Common algorithms include linear regression for baselines and interpretability, and gradient boosting for best performance on structured data. Regression is appropriate whenever the output is a number on a continuous scale, as opposed to classification, which predicts a category.