Interview: When to Use Supervised vs Unsupervised?

The Decision Framework

The primary question is simple: do you have labels?

Do you have labeled training data?
  Yes → Supervised learning (classification or regression)
  No  → Unsupervised learning (clustering, dimensionality reduction, anomaly detection)
  Small amount → Semi-supervised learning or pre-train then fine-tune
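
The same branching can be written down as a tiny helper. This is only a sketch: the function name `choose_paradigm` and the `min_labeled_for_supervised` cutoff are assumptions made for illustration, and a real cutoff depends on the task, the model, and label quality. It also ignores class imbalance, which Q2 covers.

```python
def choose_paradigm(n_labeled: int, n_unlabeled: int,
                    min_labeled_for_supervised: int = 1000) -> str:
    """Map label availability to a starting paradigm (hypothetical thresholds)."""
    if n_labeled == 0:
        # No labels at all: explore structure first.
        return "unsupervised"
    if n_labeled < min_labeled_for_supervised and n_unlabeled > 0:
        # A few labels plus a large unlabeled pool.
        return "semi-supervised"
    # Enough labels to learn the input-output mapping directly.
    return "supervised"


print(choose_paradigm(0, 100_000))    # no labels
print(choose_paradigm(200, 99_800))   # small labeled set
print(choose_paradigm(50_000, 0))     # fully labeled
```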

But interviews go deeper. Here's what interviewers actually want to hear.

Q1: You Have a Dataset of 100,000 Clinical Notes. Your Goal Is to Tag Each Note With the Drug Class It Mentions. No Labels Exist. What Do You Do?

Trap: Jump straight to "use an NLP model."

Strong answer:

Step 1 — Check what labels would cost. Could you annotate 500 notes in a week? If yes, that might unlock supervised learning with a pre-trained LLM.

Step 2 — Unsupervised first for exploration: cluster the notes using embeddings (sentence-transformers → K-Means or HDBSCAN). Inspect clusters — they may naturally correspond to drug classes.

Step 3 — Semi-supervised: annotate 200 notes, use self-training or label propagation to propagate labels to the remaining 99,800.

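Step 4 — Lean on pre-trained models. A domain encoder (e.g., BioBERT, ClinicalBERT) can be fine-tuned on the handful of labels you have (few-shot), while an instruction-tuned LLM or an NLI-based zero-shot classifier can tag notes with no custom labels at all, which is enough for an initial deployment.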
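
Step 3 can be sketched with scikit-learn's `SelfTrainingClassifier`. The data here is synthetic: `make_classification` stands in for note embeddings and the three classes stand in for drug classes, so treat the numbers as placeholders. The confidence threshold of 0.9 is also an assumption to tune.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

# Synthetic stand-in for note embeddings (assumption: real inputs would be
# sentence-transformer vectors with drug-class ids as labels).
X, y_true = make_classification(
    n_samples=2000, n_features=20, n_informative=10,
    n_classes=3, n_clusters_per_class=1, random_state=42,
)

# Pretend only 200 notes were annotated; scikit-learn uses -1 for "unlabeled".
rng = np.random.default_rng(42)
labeled_idx = rng.choice(len(X), size=200, replace=False)
y_partial = np.full(len(X), -1)
y_partial[labeled_idx] = y_true[labeled_idx]

# Self-training: fit on labeled rows, pseudo-label confident predictions, repeat.
self_training = SelfTrainingClassifier(LogisticRegression(max_iter=1000), threshold=0.9)
self_training.fit(X, y_partial)

# transduction_ holds the labels used in the final fit (-1 = still unlabeled).
n_labeled_after = int((self_training.transduction_ != -1).sum())
print(f"Rows with a label after self-training: {n_labeled_after} of {len(X)}")
```

Rows that never clear the confidence threshold stay unlabeled, which is usually what you want: they are the notes worth sending back to annotators.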

Python

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Embed all notes (clinical_notes is a list of strings)
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(clinical_notes)   # shape: (100000, 384)

# Cluster the embeddings; n_clusters=10 is a starting guess to tune
kmeans = KMeans(n_clusters=10, n_init=10, random_state=42)
kmeans.fit(embeddings)
labels = kmeans.labels_

# Next: read sample notes from each cluster and check which drugs appear
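
One cheap way to do that inspection is to list the terms that show up in the most notes per cluster. The helper below is a sketch with made-up toy notes; `top_terms_per_cluster` is a hypothetical name, and in practice the cluster ids would come from `kmeans.labels_` and you would pass a proper stopword list.

```python
from collections import Counter, defaultdict

def top_terms_per_cluster(notes, cluster_labels, stopwords=frozenset(), k=3):
    """For each cluster, list the k terms that occur in the most notes."""
    doc_freq = defaultdict(Counter)
    for note, cluster in zip(notes, cluster_labels):
        # Use a set so each term counts once per note (document frequency).
        tokens = {tok.strip(".,;:()").lower() for tok in note.split()}
        doc_freq[cluster].update(t for t in tokens if t and t not in stopwords)
    return {c: [term for term, _ in counts.most_common(k)]
            for c, counts in doc_freq.items()}

# Toy notes and cluster ids (made up for illustration).
toy_notes = [
    "Started metoprolol for hypertension.",
    "Continue metoprolol 25 mg daily.",
    "Prescribed sertraline for depression.",
    "Sertraline dose increased.",
]
toy_labels = [0, 0, 1, 1]

top_terms = top_terms_per_cluster(toy_notes, toy_labels, stopwords={"for"}, k=1)
print(top_terms)
```

If the top terms in a cluster are mostly drugs from one class, the cluster is a good candidate for bulk labeling; if they are mixed, the embedding is probably grouping notes by something else, such as note type or department.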

Q2: You're Building a Fraud Detection System. You Have 1 Million Transactions, and 50 Are Labeled as Fraudulent. What Approach?

Key insight: 50 labels out of 1 million is extreme class imbalance — supervised learning will struggle.

Strong answer:

Anomaly detection (unsupervised): train on the normal transactions (the majority), flag statistically rare transactions. Isolation Forest, One-Class SVM, Autoencoder.
Supervised with SMOTE: oversample the minority class, then train XGBoost or LightGBM.
Combine both: use anomaly scores as a feature in a supervised model — the anomaly score provides useful signal even when labels are scarce.

Python

from sklearn.ensemble import IsolationForest

# Fit only on transactions not labeled as fraud (unlabeled fraud may still be present)
normal_transactions = transactions[~fraud_mask]
# contamination sets the flag threshold: about 1% of training rows fall below it
detector = IsolationForest(contamination=0.01, random_state=42)
detector.fit(normal_transactions)

# Score every transaction, including the labeled fraud cases
scores = detector.decision_function(transactions)
# Negative score = flagged as anomalous; lower = more anomalous
flagged = detector.predict(transactions) == -1
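
The "combine both" option can be sketched as follows. Everything here is synthetic and illustrative: the Gaussian data stands in for transaction features, `add_anomaly_score` is a hypothetical helper, and logistic regression is swapped in for XGBoost or LightGBM only to keep the example dependency-light.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 5,000 normal rows, 50 shifted "fraud" rows.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(5000, 4)), rng.normal(3, 1, size=(50, 4))])
y = np.concatenate([np.zeros(5000, dtype=int), np.ones(50, dtype=int)])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Fit the detector on training rows that are not labeled as fraud.
detector = IsolationForest(random_state=42).fit(X_train[y_train == 0])

def add_anomaly_score(features):
    # score_samples is lower for more abnormal rows, so negate it:
    # after negation, higher = more anomalous.
    score = -detector.score_samples(features)
    return np.column_stack([features, score])

X_train_aug = add_anomaly_score(X_train)
X_test_aug = add_anomaly_score(X_test)

# class_weight="balanced" upweights the rare fraud class.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train_aug, y_train)

pr_auc = average_precision_score(y_test, clf.predict_proba(X_test_aug)[:, 1])
print(f"PR-AUC on held-out rows: {pr_auc:.3f}")
```

PR-AUC is used instead of accuracy because, at this level of imbalance, a model that never predicts fraud is already more than 99% accurate.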

Q3: You Have 10,000 Drug Embeddings From a Pre-Trained Model. You Want to Find Drugs That Are Mechanistically Similar. Which Approach?

This is clearly unsupervised — you don't have labels for "drug similarity," and it's a discovery task.

Strong answer:

Cosine similarity + nearest-neighbor search for individual queries
K-Means or HDBSCAN for global clustering
PCA or UMAP for visualization
Hierarchical clustering if you want a dendrogram showing similarity at multiple scales
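
The first option, cosine similarity with nearest-neighbor lookup, fits in a few lines of NumPy. This is a brute-force sketch on made-up 2-D vectors; `top_k_similar` is a hypothetical helper, and at much larger scale an approximate nearest-neighbor index would be the usual substitute.

```python
import numpy as np

def top_k_similar(embeddings, query_idx, k=5):
    """Return indices and cosine similarities of the k rows nearest to one query row."""
    # L2-normalize rows so a dot product equals cosine similarity.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed[query_idx]
    sims[query_idx] = -np.inf          # exclude the query itself
    top = np.argsort(-sims)[:k]        # highest similarity first
    return top, sims[top]

# Toy "drug embeddings" (made-up values for illustration only).
drug_embeddings = np.array([
    [1.0, 0.0],    # drug 0
    [0.9, 0.1],    # drug 1: nearly the same direction as drug 0
    [0.0, 1.0],    # drug 2: orthogonal to drug 0
    [-1.0, 0.0],   # drug 3: opposite direction
])

neighbors, similarities = top_k_similar(drug_embeddings, query_idx=0, k=2)
print(neighbors, similarities)
```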

Why not supervised? There are no ground-truth "similar drug" labels. Mechanistic similarity is the thing you're trying to discover, not something you already know.

Q4: Your Task is to Predict Whether a Patient Will Be Readmitted Within 30 Days. You Have 5 Years of EHR Data With Readmission Outcomes. Which Approach?

This is clearly supervised — you have the outcome label (readmitted: yes/no) for historical patients.

Strong answer:

Binary classification task: y = readmitted within 30 days (1 or 0)
Features: age, diagnosis codes, procedure codes, lab values, medication count, prior admissions
Model: gradient boosting (XGBoost/LightGBM) often wins on structured EHR data
Key concerns: temporal leakage (don't use features recorded after the prediction time, which is usually discharge), class imbalance (readmission is a minority event)

Python

from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score

# CRITICAL: split by time, not randomly, so the model never trains on the future
train_df = ehr_df[ehr_df.year < 2023]
test_df  = ehr_df[ehr_df.year >= 2023]

X_train, y_train = train_df[features], train_df["readmitted_30d"]
X_test,  y_test  = test_df[features],  test_df["readmitted_30d"]

# Handle class imbalance: weight positives by the negative/positive ratio in training data
pos_weight = (y_train == 0).sum() / (y_train == 1).sum()
model = XGBClassifier(scale_pos_weight=pos_weight)
model.fit(X_train, y_train)
print(f"AUC-ROC: {roc_auc_score(y_test, model.predict_proba(X_test)[:,1]):.3f}")
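
The time-based split handles leakage across patients, but leakage can also hide inside a single patient's features. A sketch of the per-patient cutoff is below, assuming the prediction is made at discharge. The table layout and column names (`events`, `admissions`, `event_time`, `discharge_time`) are hypothetical and the values are made up.

```python
import pandas as pd

# Hypothetical long-format event table: one row per recorded observation.
events = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "event_time": pd.to_datetime(
        ["2022-03-01", "2022-03-04", "2022-03-20", "2022-05-10", "2022-05-12"]
    ),
    "feature": ["lab_creatinine", "med_count", "followup_visit",
                "lab_creatinine", "med_count"],
    "value": [1.4, 7, 1, 0.9, 3],
})

# Hypothetical index admissions; discharge is treated as the prediction time.
admissions = pd.DataFrame({
    "patient_id": [1, 2],
    "discharge_time": pd.to_datetime(["2022-03-05", "2022-05-12"]),
})

# Keep only observations that existed when the prediction would be made.
merged = events.merge(admissions, on="patient_id")
safe = merged[merged["event_time"] <= merged["discharge_time"]]

# One row per patient, one column per feature, using the latest value before the cutoff.
feature_table = safe.pivot_table(
    index="patient_id", columns="feature", values="value", aggfunc="last"
)
print(feature_table)
```

Patient 1's follow-up visit happened after discharge, so it is dropped. A feature like that would look highly predictive in training and be unavailable in production.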

Q5: You're Fine-Tuning an LLM for a Clinical Q&A System. What Paradigm Is This?

Strong answer:

This is a hybrid — and interviewers love this nuance:

Pre-training (self-supervised): the base LLM was trained on unlabeled text — next-token prediction. Technically unsupervised but often called self-supervised.
Supervised Fine-Tuning (SFT): you provide (clinical_question, ideal_answer) pairs. Standard supervised learning — the model minimizes cross-entropy loss on the target tokens.
RLHF (optional): reward model trained on human preference pairs → PPO optimization. This is reinforcement learning, but it uses a supervised reward model as the signal.
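
To make "cross-entropy loss on the target tokens" concrete, here is a minimal NumPy sketch of a masked next-token loss. It is not how any particular training library implements SFT; `sft_loss` and the toy token ids are assumptions for illustration. The key point is that prompt tokens are masked out so only the answer contributes to the loss.

```python
import numpy as np

def sft_loss(logits, token_ids, answer_mask):
    """Mean next-token cross-entropy over answer positions only.

    logits:      (seq_len, vocab) scores; row t predicts the token at t + 1
    token_ids:   (seq_len,) prompt tokens followed by answer tokens
    answer_mask: (seq_len,) 1 where the token belongs to the answer, else 0
    """
    shifted_logits = logits[:-1]           # predictions for positions 1..end
    targets = token_ids[1:]                # tokens actually at those positions
    mask = answer_mask[1:].astype(bool)    # only score answer tokens

    # Numerically stable log-softmax over the vocabulary.
    z = shifted_logits - shifted_logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))

    token_nll = -log_probs[np.arange(len(targets)), targets]
    return token_nll[mask].mean()

# Toy sequence: 2 prompt tokens then 3 answer tokens, vocabulary of 4.
token_ids = np.array([0, 1, 2, 3, 1])
answer_mask = np.array([0, 0, 1, 1, 1])
uniform_logits = np.zeros((5, 4))          # a model with no preference

loss = sft_loss(uniform_logits, token_ids, answer_mask)
print(f"loss = {loss:.4f}, log(vocab) = {np.log(4):.4f}")
```

With uniform logits the loss equals log(vocab size), a handy sanity check. Changing the logits at prompt positions leaves the loss unchanged, which is exactly what the mask is for.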

Decision Cheat Sheet

| Situation | Paradigm | Why |
|---|---|---|
| Labeled examples of the target task | Supervised | Known input-output mapping |
| No labels, need to discover structure | Unsupervised | No supervision available |
| Few labels, lots of unlabeled data | Semi-supervised | Leverage unlabeled data |
| Domain-specific task, pre-trained model exists | Transfer learning + fine-tune | Cheaper than training from scratch |
| Extreme class imbalance (rare events) | Anomaly detection or combined | Supervised struggles with rare class |
| LLM alignment to human preferences | RLHF | Labels are rankings, not direct answers |
| LLM pre-training | Self-supervised | Labels auto-generated from text |
