Machine Learning Foundations · Lesson 15 of 70

Interview: Regression vs Classification Scenarios

The Decision Question

Ask yourself: what is the output?

A number on a continuous scale → Regression
A category or label → Classification
Both → often regression (a class label can be derived afterward by thresholding the predicted value)
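The "both" case can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical helper `classify_by_threshold` and made-up risk scores; the 0.5 cut-off is an arbitrary choice, not a recommendation.

```python
def classify_by_threshold(predicted_values, threshold):
    """Turn continuous predictions into 0/1 labels: 1 when value >= threshold."""
    return [1 if value >= threshold else 0 for value in predicted_values]


# Hypothetical continuous outputs, e.g. predicted risk scores in [0, 1]
risk_scores = [0.12, 0.55, 0.91, 0.40]

labels = classify_by_threshold(risk_scores, threshold=0.5)
print(labels)  # [0, 1, 1, 0]
```

Going the other way is not possible: a model trained only on the labels cannot recover the continuous value, which is why the continuous framing is often kept when both outputs are needed.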

Scenario 1: Predict Whether a Patient Will Be Readmitted Within 30 Days

Output: Yes or No — a binary outcome.

Answer: Binary Classification

Python

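# Assumes X_train, y_train, X_test, y_test already exist (features and 0/1 labels)
# y = 0 (not readmitted) or 1 (readmitted within 30 days)
# Model: gradient boosting (scikit-learn here; XGBoost is a common alternative) or logistic regression
# Metric: AUC-ROC, because accuracy is misleading when readmissions are the minority class

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

model = GradientBoostingClassifier(n_estimators=200, max_depth=4)
model.fit(X_train, y_train)   # y_train: 0 or 1

probs = model.predict_proba(X_test)[:, 1]   # P(readmitted)
auc = roc_auc_score(y_test, probs)          # ranking quality, independent of any threshold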

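Interview note: mention class imbalance, since readmission is a minority event. You'd reweight the positive class (scale_pos_weight in XGBoost; class_weight or sample_weight in scikit-learn) or oversample with SMOTE, and evaluate with AUC-ROC rather than accuracy.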
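To see why accuracy misleads here, consider a baseline that never predicts a readmission. The sketch below uses only the standard library and an invented test set (5 readmissions among 100 patients); the helper functions are illustrative, not a library API.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred):
    """Fraction of true positives that the model actually flagged."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return true_pos / sum(y_true)


def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical test set: 5 readmissions among 100 patients
y_true = [1] * 5 + [0] * 95
always_no = [0] * 100          # baseline: predict "not readmitted" for everyone
constant_scores = [0.0] * 100  # the same baseline expressed as scores

baseline_accuracy = accuracy(y_true, always_no)     # 0.95
baseline_recall = recall(y_true, always_no)         # 0.0
baseline_auc = roc_auc(y_true, constant_scores)     # 0.5

print(baseline_accuracy, baseline_recall, baseline_auc)
```

The baseline looks strong on accuracy while catching no readmissions at all; its AUC of 0.5 correctly reports that it has no ranking ability.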

Scenario 2: Predict a Patient's Warfarin Dose

Output: A dose in mg/day — a continuous number.

Answer: Regression

Python

# y = dose in mg (e.g., 3.5, 5.0, 7.5)
# Model: linear regression, ridge, or gradient boosting regressor
# Metric: RMSE (interpretable: same units as dose), R²

from sklearn.linear_model import Ridge
model = Ridge(alpha=1.0)
model.fit(X_train, y_train)   # y_train: float values

predicted_dose = model.predict(X_test)   # e.g., [4.8, 6.2, 3.1]

Interview note: you might also frame this as ordinal regression if doses come in fixed increments (1.5, 2.0, 2.5 mg tablets), but continuous regression is the standard starting point.
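One simple way to reconcile a continuous prediction with fixed tablet strengths is to round afterward. The sketch below assumes a hypothetical 0.5 mg increment and a helper named `snap_to_increment`; real dosing decisions involve far more than rounding.

```python
def snap_to_increment(dose_mg, increment=0.5):
    """Round a continuous dose to the nearest multiple of `increment` (assumed 0.5 mg)."""
    return round(dose_mg / increment) * increment


# Continuous model outputs, matching the example predictions above
predicted_dose = [4.8, 6.2, 3.1]

snapped = [snap_to_increment(dose) for dose in predicted_dose]
print(snapped)  # [5.0, 6.0, 3.0]
```

This keeps the regression model unchanged and treats the increment as a post-processing step, which is usually easier to explain than a full ordinal model.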

Scenario 3: Classify Whether an LLM Response is Safe, Borderline, or Unsafe

Output: One of three categories — safe, borderline, unsafe.

Answer: Multi-Class Classification

Python

# y = 0 (safe), 1 (borderline), 2 (unsafe)
# Model: fine-tuned BERT or logistic regression on embeddings
# Metric: macro F1 (weights all three classes equally), plus recall on the unsafe class

from sklearn.linear_model import LogisticRegression

# Features: sentence embedding of each LLM response (X_train, y_train, X_test assumed to exist)
# With the default lbfgs solver, LogisticRegression fits a multinomial (softmax) model
# for 3+ classes; the explicit multi_class argument is deprecated in recent scikit-learn.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Return probabilities for all three classes
probs = model.predict_proba(X_test)   # shape (n_samples, 3)

Interview note: consider the decision threshold. For safety, you might lower the threshold for the "unsafe" class to catch more true positives (higher recall), accepting more false positives.
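A lowered threshold for one class can be sketched as a rule applied on top of the predicted probabilities. The function name, the 0.2 cut-off, and the probability rows below are all assumptions for illustration; in practice the threshold would be tuned on validation data.

```python
LABELS = ["safe", "borderline", "unsafe"]


def predict_with_unsafe_threshold(prob_rows, unsafe_threshold=0.2):
    """Flag 'unsafe' whenever P(unsafe) >= threshold; otherwise pick the likelier of the other two."""
    predictions = []
    for p_safe, p_borderline, p_unsafe in prob_rows:
        if p_unsafe >= unsafe_threshold:
            predictions.append("unsafe")
        else:
            predictions.append("safe" if p_safe >= p_borderline else "borderline")
    return predictions


# Hypothetical rows of [P(safe), P(borderline), P(unsafe)]
probs = [
    [0.70, 0.20, 0.10],
    [0.45, 0.30, 0.25],
    [0.10, 0.60, 0.30],
]

argmax_labels = [LABELS[row.index(max(row))] for row in probs]
thresholded_labels = predict_with_unsafe_threshold(probs, unsafe_threshold=0.2)

print(argmax_labels)       # ['safe', 'safe', 'borderline']
print(thresholded_labels)  # ['safe', 'unsafe', 'unsafe']
```

Plain argmax flags none of these responses as unsafe, while the lowered threshold flags two of them: more recall on the unsafe class at the cost of more false alarms.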

Scenario 4: Predict Tomorrow's Stock Price for a Pharmaceutical Company

Output: A price in dollars — a continuous number.

Answer: Regression

Python

# y = closing price tomorrow (e.g., $142.50)
# Features: historical prices, volume, earnings reports, drug approval news
# Model: LSTM, gradient boosting regressor, or time-series models (ARIMA, Prophet)
# Metric: RMSE, MAE, MAPE (mean absolute percentage error)

Interview note: predicting stock prices is notoriously difficult. A more useful framing might be "predict whether the price will go up or down" → binary classification. Both are valid; interviewers want to see you recognize the tradeoffs.
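Both framings can be illustrated on the same price series. The sketch below uses invented closing prices and predictions; it computes the three regression metrics named above and then derives up/down labels for the classification framing.

```python
import math

# Hypothetical closing prices and model predictions for the same four days
actual = [100.0, 102.0, 101.0, 105.0]
predicted = [101.0, 101.0, 103.0, 104.0]

abs_errors = [abs(p - a) for p, a in zip(predicted, actual)]

mae = sum(abs_errors) / len(abs_errors)                       # mean absolute error, in dollars
rmse = math.sqrt(sum(e ** 2 for e in abs_errors) / len(abs_errors))  # penalizes large misses more
mape = 100 * sum(e / a for e, a in zip(abs_errors, actual)) / len(actual)  # percent of actual price

# Classification framing: label each day 1 if the NEXT close is higher, else 0
direction_labels = [
    1 if tomorrow > today else 0
    for today, tomorrow in zip(actual, actual[1:])
]

print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  MAPE={mape:.2f}%")
print(direction_labels)  # [1, 0, 1]
```

The direction labels discard the size of each move, which is the tradeoff: an easier target, but one that cannot distinguish a 0.1% rise from a 10% rise.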

Scenario 5: Given a Drug's SMILES String, Predict Whether It Will Pass Phase II Clinical Trials

Output: Pass or Fail — binary.

Answer: Binary Classification

Python

# y = 0 (fail) or 1 (pass Phase II)
# Features: molecular fingerprints, molecular weight, logP, ADMET properties
# Model: Random Forest or graph neural network (GNN) on molecular graphs
# Metric: AUC-ROC (class imbalance: most drugs fail)

# Important: class imbalance (pass rate assumed ~10-20% here; published figures vary by source)
# Use class_weight='balanced' or SMOTE
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=200, class_weight="balanced")
model.fit(X_train, y_train)   # assumes fingerprint features and 0/1 labels already exist
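For intuition about what class_weight="balanced" does, scikit-learn documents the weights as n_samples / (n_classes * count of each class). The sketch below reproduces that formula in plain Python on an invented label list with a 15% pass rate; `balanced_class_weights` is a hypothetical helper, not a library function.

```python
from collections import Counter


def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Hypothetical labels: 15 passes (1) and 85 failures (0)
y = [1] * 15 + [0] * 85

weights = balanced_class_weights(y)
print({label: round(w, 3) for label, w in weights.items()})  # {1: 3.333, 0: 0.588}
```

Each passing drug counts roughly 5.7 times as much as a failing one during training, so the model is penalized more for missing the rare class.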

Scenario 6: Score How Similar Two Clinical Notes Are

Output: A similarity score from 0 to 1 — a continuous number.

Answer: Regression (or a special case — metric learning / similarity learning)

Python

# y = human-rated similarity score (0.0 = completely different, 1.0 = identical)
# Features: embeddings of both notes, or a pair-encoded representation
# Model: Siamese neural network, sentence-transformers, or cross-encoder

import numpy as np
from sentence_transformers import SentenceTransformer

# Unsupervised baseline: cosine similarity of pretrained embeddings (no training on y)
model = SentenceTransformer("all-mpnet-base-v2")
embedding1 = model.encode("Patient is on warfarin 5mg for AF")
embedding2 = model.encode("Anticoagulation with warfarin for atrial fibrillation")

similarity = np.dot(embedding1, embedding2) / (
    np.linalg.norm(embedding1) * np.linalg.norm(embedding2)
)
# Cosine similarity lies in [-1, 1]; the exact value depends on the model
print(f"Cosine similarity: {similarity:.3f}")

Ambiguous Cases

"Predict drug dosage category"

Could be any of three framings:

Regression if you predict the exact dose (5.0 mg) and bin it afterward
Ordinal classification if the dose categories are ordered (low/medium/high)
Multi-class classification if categories aren't ordered (Q1/Q2/Q3 dosing schedule)

The right choice depends on whether the exact value or the rank matters.
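The "predict the exact dose and bin it afterward" option might look like the sketch below. The cut points (3.0 and 6.0 mg) and the category names are hypothetical, chosen only to show the mechanics.

```python
from bisect import bisect_right

# Hypothetical cut points: below 3.0 mg is low, 3.0 up to 6.0 mg is medium, 6.0 mg and above is high
BIN_EDGES_MG = [3.0, 6.0]
CATEGORIES = ["low", "medium", "high"]


def dose_category(dose_mg):
    """Map a continuous dose to an ordered category using the bin edges."""
    return CATEGORIES[bisect_right(BIN_EDGES_MG, dose_mg)]


predicted_doses = [2.5, 5.0, 7.5]
categories = [dose_category(dose) for dose in predicted_doses]
print(categories)  # ['low', 'medium', 'high']
```

Binning after regression keeps the exact value available; training directly on the categories throws it away, which only makes sense when the rank is all that matters.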

Quick Reference

| Scenario | Task | Why |
|---|---|---|
| Readmission risk (yes/no) | Binary classification | Discrete outcome |
| Warfarin dose (mg) | Regression | Continuous number |
| Safety label (safe/borderline/unsafe) | Multi-class classification | 3 discrete categories |
| Drug adverse effects (multiple) | Multi-label classification | Multiple can co-occur |
| Stock price tomorrow | Regression | Continuous number |
| Clinical trial pass/fail | Binary classification | Binary discrete outcome |
| Note similarity score | Regression | Continuous [0, 1] |
| Patient cluster membership | Unsupervised clustering | No labels available |
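The multi-label row differs from multi-class in the shape of its target: each example gets one 0/1 flag per label rather than a single class. A minimal sketch, assuming an invented list of three adverse effects and a hypothetical `to_indicator` helper:

```python
# Hypothetical label vocabulary for adverse effects
EFFECTS = ["nausea", "headache", "bleeding"]


def to_indicator(observed_effects):
    """Encode a set of observed effects as one 0/1 flag per known effect."""
    return [1 if effect in observed_effects else 0 for effect in EFFECTS]


# Each drug can have zero, one, or several effects at once
drug_effects = [
    {"nausea", "bleeding"},
    set(),
    {"headache"},
]

y_multilabel = [to_indicator(effects) for effects in drug_effects]
print(y_multilabel)  # [[1, 0, 1], [0, 0, 0], [0, 1, 0]]
```

A multi-class target would force exactly one label per row; here a row can contain several ones or none.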
