Interview: Regression vs Classification Scenarios
Interview walk-through: identify whether a problem is regression or classification from the task description — with 6 real scenarios covering clinical AI, LLM systems, and healthcare applications.
The Decision Question
Ask yourself: what is the output?
- A number on a continuous scale → Regression
- A category or label → Classification
- Either could work → often start with regression (a class can be derived afterward by thresholding the predicted value)
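The last bullet can be made concrete with a minimal sketch: predict a continuous value first, then derive a label by thresholding it. The function name, the length-of-stay example, and the 7-day cutoff are illustrative assumptions, not taken from any of the scenarios.

```python
def label_from_prediction(predicted_value: float, threshold: float) -> str:
    """Derive a binary label from a regression output by thresholding."""
    return "high" if predicted_value >= threshold else "low"


# Hypothetical regression outputs: predicted length of stay in days
predicted_days = [2.4, 9.1, 7.0, 5.6]

# Hypothetical cutoff: stays of 7 or more days count as "high"
labels = [label_from_prediction(d, threshold=7.0) for d in predicted_days]
print(labels)  # ['low', 'high', 'high', 'low']
```

The reverse is not possible: a model trained only on the labels cannot recover the underlying number, which is one reason regression is often the safer first framing.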
Scenario 1: Predict Whether a Patient Will Be Readmitted Within 30 Days
Output: Yes or No — a binary outcome.
Answer: Binary Classification
# y = 0 (not readmitted) or 1 (readmitted within 30 days)
# Model: gradient-boosted trees (e.g., XGBoost) or logistic regression
# Metric: AUC-ROC rather than accuracy (class imbalance expected: few readmissions)
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier(n_estimators=200, max_depth=4)
model.fit(X_train, y_train)  # y_train: 0 or 1
probs = model.predict_proba(X_test)[:, 1]  # P(readmitted)
Interview note: mention class imbalance, since readmission is a minority event. With XGBoost you'd set scale_pos_weight; scikit-learn's GradientBoostingClassifier has no such parameter, so pass sample_weight to fit() or resample with SMOTE instead. Either way, evaluate with AUC-ROC rather than accuracy.
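To see why accuracy misleads on imbalanced data, here is a small sketch with synthetic labels (5 readmissions out of 100 patients, an assumed ratio). The `roc_auc` helper is a hand-written pairwise version for illustration only; in practice you would call `sklearn.metrics.roc_auc_score`.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties count as half a win. Quadratic in the sample size, so only for small demos.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Synthetic data: 5 readmitted patients, 95 not readmitted
y_true = [1] * 5 + [0] * 95

# A useless "model" that always predicts "not readmitted"
always_negative_labels = [0] * 100
always_negative_scores = [0.0] * 100

baseline_accuracy = accuracy(y_true, always_negative_labels)
baseline_auc = roc_auc(y_true, always_negative_scores)
print(f"accuracy={baseline_accuracy:.2f}, AUC-ROC={baseline_auc:.2f}")
# accuracy=0.95, AUC-ROC=0.50
```

The baseline never identifies a single readmission, yet scores 95% accuracy. AUC-ROC correctly reports chance-level ranking.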
Scenario 2: Predict a Patient's Warfarin Dose
Output: A dose in mg/day — a continuous number.
Answer: Regression
# y = dose in mg (e.g., 3.5, 5.0, 7.5)
# Model: linear regression, ridge, or gradient boosting regressor
# Metric: RMSE (interpretable: same units as dose), R²
from sklearn.linear_model import Ridge
model = Ridge(alpha=1.0)
model.fit(X_train, y_train) # y_train: float values
predicted_dose = model.predict(X_test)  # e.g., [4.8, 6.2, 3.1]
Interview note: you might also frame this as ordinal regression if doses come in fixed increments (1.5, 2.0, 2.5 mg tablets), but continuous regression is the standard starting point.
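As a rough illustration of the metrics named above, the sketch below computes RMSE and MAE by hand and then snaps continuous predictions to a tablet increment. All doses are made-up numbers, and the 0.5 mg increment and the helper names are assumptions chosen for the example.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target (mg)."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error, also in mg."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def snap_to_increment(dose_mg, increment_mg=0.5):
    """Round a continuous prediction to the nearest available increment."""
    return round(dose_mg / increment_mg) * increment_mg


# Made-up values for illustration
true_doses = [3.5, 5.0, 7.5]
predicted_doses = [4.0, 5.0, 6.0]

print(f"RMSE: {rmse(true_doses, predicted_doses):.3f} mg")  # RMSE: 0.913 mg
print(f"MAE:  {mae(true_doses, predicted_doses):.3f} mg")   # MAE:  0.667 mg

snapped = [snap_to_increment(d) for d in [4.8, 6.2, 3.1]]
print(snapped)  # [5.0, 6.0, 3.0]
```

RMSE exceeds MAE here because squaring penalizes the single 1.5 mg miss more heavily, which is often the behavior you want when large dosing errors are the costly ones.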
Scenario 3: Classify Whether an LLM Response is Safe, Borderline, or Unsafe
Output: One of three categories — safe, borderline, unsafe.
Answer: Multi-Class Classification
# y = 0 (safe), 1 (borderline), 2 (unsafe)
# Model: fine-tuned BERT or logistic regression on embeddings
# Metric: macro F1 (weights all three classes equally); also track recall on the unsafe class
from sklearn.linear_model import LogisticRegression
# Features: sentence embedding of LLM response
# 3 classes: safe / borderline / unsafe
# Multinomial (softmax) is the default for 3+ classes with the lbfgs solver;
# the multi_class argument is deprecated in recent scikit-learn releases
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
# Return probabilities for all three classes
probs = model.predict_proba(X_test)  # shape (n_samples, 3)
Interview note: consider the decision threshold. For safety, you might lower the threshold for the "unsafe" class to catch more true positives (higher recall), accepting more false positives.
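One possible way to apply a lowered "unsafe" threshold is sketched below. The 0.3 cutoff, the `decide` helper, and the probability rows are hypothetical; a real threshold would be tuned on a validation set against the recall you need.

```python
CLASSES = ["safe", "borderline", "unsafe"]


def decide(probs, unsafe_threshold=0.3):
    """Flag as unsafe once P(unsafe) clears the lowered threshold.

    Otherwise fall back to the likelier of the two remaining classes.
    """
    p_safe, p_borderline, p_unsafe = probs
    if p_unsafe >= unsafe_threshold:
        return "unsafe"
    return "safe" if p_safe >= p_borderline else "borderline"


# Hypothetical rows in the layout of predict_proba: [P(safe), P(borderline), P(unsafe)]
example_probs = [
    [0.80, 0.15, 0.05],
    [0.50, 0.15, 0.35],
    [0.20, 0.60, 0.20],
]

# Plain argmax decision, for comparison
argmax_labels = [CLASSES[row.index(max(row))] for row in example_probs]
threshold_labels = [decide(row) for row in example_probs]

print(argmax_labels)     # ['safe', 'safe', 'borderline']
print(threshold_labels)  # ['safe', 'unsafe', 'borderline']
```

The second response is the interesting one: argmax calls it safe, while the lowered threshold flags it, trading a possible false positive for higher recall on the unsafe class.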
Scenario 4: Predict Tomorrow's Stock Price for a Pharmaceutical Company
Output: A price in dollars — a continuous number.
Answer: Regression
# y = closing price tomorrow (e.g., $142.50)
# Features: historical prices, volume, earnings reports, drug approval news
# Model: LSTM, gradient boosting regressor, or time-series models (ARIMA, Prophet)
# Metric: RMSE, MAE, MAPE (mean absolute percentage error)
Interview note: predicting stock prices is notoriously difficult. A more useful framing might be "predict whether the price will go up or down" → binary classification. Both are valid; interviewers want to see you recognize the tradeoffs.
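The two framings differ only in how the target is built from the same price series. The sketch below, using invented closing prices and hypothetical helper names, derives up/down labels for the classification framing and computes MAPE for the regression framing.

```python
def direction_labels(closes):
    """Label each day 1 if the next close is strictly higher, else 0."""
    return [1 if nxt > cur else 0 for cur, nxt in zip(closes, closes[1:])]


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent. Undefined if any true value is 0."""
    return 100 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Invented closing prices for five consecutive days
closes = [140.0, 142.5, 141.0, 141.0, 145.0]

# Classification target: one label per day that has a "tomorrow"
labels = direction_labels(closes)
print(labels)  # [1, 0, 0, 1]

# Regression metric on two invented predictions
error_pct = mape([100.0, 200.0], [110.0, 190.0])
print(f"MAPE: {error_pct:.1f}%")  # MAPE: 7.5%
```

Note that a flat day (141.0 to 141.0) becomes a 0 under this labeling; whether "unchanged" counts as down is itself a modeling decision worth raising in an interview.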
Scenario 5: Given a Drug's SMILES String, Predict Whether It Will Pass Phase II Clinical Trials
Output: Pass or Fail — binary.
Answer: Binary Classification
# y = 0 (fail) or 1 (pass Phase II)
# Features: molecular fingerprints, molecular weight, logP, ADMET properties
# Model: Random Forest or graph neural network (GNN) on molecular graphs
# Metric: AUC-ROC (class imbalance: most drugs fail)
# Important: class imbalance. Most candidates fail Phase II; published industry estimates put the pass rate at roughly 30%
# Use class_weight='balanced' or SMOTE
from sklearn.ensemble import RandomForestClassifier
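model = RandomForestClassifier(n_estimators=200, class_weight="balanced")
model.fit(X_train, y_train)  # y_train: 0 (fail) or 1 (pass)
probs = model.predict_proba(X_test)[:, 1]  # P(pass Phase II)
Scenario 6: Score How Similar Two Clinical Notes Are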
Output: A similarity score from 0 to 1 — a continuous number.
Answer: Regression (or a special case — metric learning / similarity learning)
# y = human-rated similarity score (0.0 = completely different, 1.0 = identical)
# Features: embeddings of both notes, or a pair-encoded representation
# Model: Siamese neural network, sentence-transformers, or cross-encoder
import numpy as np
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-mpnet-base-v2")
embedding1 = model.encode("Patient is on warfarin 5mg for AF")
embedding2 = model.encode("Anticoagulation with warfarin for atrial fibrillation")
similarity = np.dot(embedding1, embedding2) / (
    np.linalg.norm(embedding1) * np.linalg.norm(embedding2)
)
print(f"Cosine similarity: {similarity:.3f}")  # ~0.89
Ambiguous Cases
"Predict drug dosage category"
Could be any of three framings:
- Regression if you predict the exact dose (5.0 mg) and bin it afterward
- Ordinal classification if the dose categories are ordered (low/medium/high)
- Multi-class classification if categories aren't ordered (Q1/Q2/Q3 dosing schedule)
The right choice depends on whether the exact value or the rank matters.
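The first two options can be sketched side by side: bin an exact predicted dose into ordered categories, then encode those categories as integers so their order is preserved. The cutoffs at 3.0 and 6.0 mg and the helper names are hypothetical, chosen only to make the example concrete.

```python
from bisect import bisect_right

# Hypothetical cutoffs: below 3.0 mg is low, 3.0 up to 6.0 mg is medium, 6.0 mg and above is high
BIN_EDGES_MG = [3.0, 6.0]
CATEGORIES = ["low", "medium", "high"]


def dose_category(dose_mg):
    """Bin an exact dose into an ordered category (regression first, then bin)."""
    return CATEGORIES[bisect_right(BIN_EDGES_MG, dose_mg)]


def ordinal_code(category):
    """Encode a category as its rank so that low < medium < high is preserved."""
    return CATEGORIES.index(category)


predicted_doses = [2.5, 3.0, 5.0, 7.5]
categories = [dose_category(d) for d in predicted_doses]
codes = [ordinal_code(c) for c in categories]

print(categories)  # ['low', 'medium', 'medium', 'high']
print(codes)       # [0, 1, 1, 2]
```

Binning discards information: 3.0 mg and 5.0 mg land in the same bucket, which is fine if only the rank matters and lossy if the exact value does.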
Quick Reference
| Scenario | Task | Why |
|---|---|---|
| Readmission risk (yes/no) | Binary classification | Discrete outcome |
| Warfarin dose (mg) | Regression | Continuous number |
| Safety label (safe/borderline/unsafe) | Multi-class classification | 3 discrete categories |
| Drug adverse effects (multiple) | Multi-label classification | Multiple can co-occur |
| Stock price tomorrow | Regression | Continuous number |
| Clinical trial pass/fail | Binary classification | Binary discrete outcome |
| Note similarity score | Regression | Continuous [0, 1] |
| Patient cluster membership | Unsupervised clustering | No labels available |
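The multi-label row is the one task in the table without a scenario of its own, so here is a minimal sketch of how its target differs from multi-class: each patient gets one 0/1 indicator per possible label rather than a single class index. The effect names and patient sets are invented for illustration.

```python
# Hypothetical label set for adverse effects
EFFECTS = ["nausea", "headache", "rash"]


def to_indicator(observed_effects):
    """Turn a set of observed effects into one 0/1 flag per known effect."""
    return [1 if effect in observed_effects else 0 for effect in EFFECTS]


# Invented patients: effects can co-occur, or none may occur at all
patients = [{"nausea", "rash"}, set(), {"headache"}]
Y = [to_indicator(p) for p in patients]

for row in Y:
    print(row)
# [1, 0, 1]
# [0, 0, 0]
# [0, 1, 0]
```

A multi-class target would force exactly one label per row; here a row can contain several 1s or none, which is what "multiple can co-occur" means in practice.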