What is Unsupervised Learning?
Understand unsupervised learning: clustering, dimensionality reduction, and anomaly detection, with practical examples using patient clustering, embedding visualization, and drug similarity search.
The Core Idea
In unsupervised learning, the training data has no labels: you only have inputs (X). The model must find structure, patterns, or groupings on its own.
Supervised: (X, y) → learn to predict y from X
Unsupervised: (X) → discover structure within X
The Three Main Tasks
1. Clustering: Find Natural Groups
Group similar data points together without being told what the groups are.
from sklearn.cluster import KMeans
import numpy as np
# 500 drug embedding vectors (random placeholders here): find natural drug families
drug_embeddings = np.random.randn(500, 128)
# K-Means: partition into k clusters
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
kmeans.fit(drug_embeddings)
cluster_labels = kmeans.labels_ # Which cluster each drug belongs to
centers = kmeans.cluster_centers_ # Centroid of each cluster
print(cluster_labels[:10]) # e.g. [2 0 4 1 2 2 3 0 1 4]; exact labels depend on the data
# Predict cluster for a new drug
new_drug_emb = np.random.randn(1, 128)
print(kmeans.predict(new_drug_emb)) # e.g. [2]: assigned to cluster 2
Common algorithms: K-Means, DBSCAN (density-based, finds arbitrarily shaped clusters), hierarchical clustering, Gaussian Mixture Models.
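To illustrate why the choice of algorithm matters, here is a minimal sketch comparing K-Means with DBSCAN on scikit-learn's synthetic two-crescent dataset. The data is a toy stand-in, not real drug embeddings, and the `eps` and `min_samples` values are illustrative choices for this dataset rather than general defaults.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Synthetic toy data: two interleaved crescents (not real drug embeddings)
X, true_groups = make_moons(n_samples=300, noise=0.05, random_state=0)

# K-Means assigns each point to the nearest centroid, which favors compact blobs
kmeans_labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(X)

# DBSCAN grows clusters through dense regions; eps and min_samples are
# illustrative values picked for this toy data
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Adjusted Rand index: 1.0 means perfect agreement with the known groups.
# true_groups is used only to score the result, never to fit the models.
ari_kmeans = adjusted_rand_score(true_groups, kmeans_labels)
ari_dbscan = adjusted_rand_score(true_groups, dbscan_labels)
print(f"K-Means ARI: {ari_kmeans:.2f}")
print(f"DBSCAN ARI:  {ari_dbscan:.2f}")
```

On data like this, DBSCAN tends to recover the crescents while K-Means cuts across them. The known groups exist only because the data is generated; real unlabeled data offers no such check.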
2. Dimensionality Reduction: Compress Representations
Reduce high-dimensional data to fewer dimensions while preserving structure. Used for visualization, denoising, and making downstream ML faster.
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
# 1000 drug embeddings, 1536 dimensions: visualize in 2D
embeddings = np.random.randn(1000, 1536)
# PCA: linear reduction; fast, good for preprocessing
pca = PCA(n_components=50)
reduced_pca = pca.fit_transform(embeddings) # (1000, 50)
# Variance explained: how much information is retained
explained = sum(pca.explained_variance_ratio_)
print(f"Variance explained by 50 components: {explained:.1%}")
# t-SNE: non-linear, for 2D/3D visualization only
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
reduced_2d = tsne.fit_transform(reduced_pca) # (1000, 2)
# Plot reduced_2d: with real embeddings, similar drugs tend to land near each other
3. Anomaly Detection: Find Outliers
Identify data points that don't fit the normal pattern, without being told what "anomalous" looks like.
from sklearn.ensemble import IsolationForest
# Patient vital signs: detect abnormal patients without labeled anomalies
vitals = np.random.randn(1000, 5) # Normal patients
# Add some anomalies
vitals[990:] *= 5 # Extreme values
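# contamination is the expected fraction of outliers; 0.05 flags about 50 of the 1000 rows, more than the 10 injected above
detector = IsolationForest(contamination=0.05, random_state=42)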
detector.fit(vitals)
predictions = detector.predict(vitals)
# 1 = normal, -1 = anomaly
anomaly_indices = np.where(predictions == -1)[0]
print(f"Detected {len(anomaly_indices)} anomalies")
Unsupervised Learning in AI Systems
Embedding Similarity Search (RAG)
The retrieval step of RAG is unsupervised: no labels tell the system which documents are relevant to a query. Semantic similarity emerges from the embedding model's training, and retrieval returns the documents whose embeddings lie closest to the query's.
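To make "closest in embedding space" concrete, here is a tiny worked example with hand-made 3-dimensional vectors. The names and values are hypothetical stand-ins; real embeddings have hundreds or thousands of dimensions.

```python
import numpy as np

# Hand-made 3-D vectors standing in for embeddings (hypothetical values)
names = ["drug_a", "drug_b", "drug_c"]
corpus = np.array([
    [1.0, 0.0, 0.0],   # drug_a
    [0.9, 0.1, 0.0],   # drug_b: points almost the same way as drug_a
    [0.0, 0.0, 1.0],   # drug_c: orthogonal to drug_a
])
query = np.array([1.0, 0.0, 0.0])

# Cosine similarity is the dot product of unit-length vectors
corpus_unit = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
query_unit = query / np.linalg.norm(query)
similarities = corpus_unit @ query_unit

# Rank from most to least similar
ranking = [names[i] for i in np.argsort(similarities)[::-1]]
print(ranking)  # ['drug_a', 'drug_b', 'drug_c']
```

No label ever says that drug_b is "relevant" to the query; the ranking falls out of vector geometry alone.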
import numpy as np
def find_similar_drugs(query_embedding: np.ndarray, corpus_embeddings: np.ndarray, k: int = 5):
    """Unsupervised: find similar drugs by embedding proximity."""
    # Normalize to unit length so the dot product equals cosine similarity
    query_norm = query_embedding / np.linalg.norm(query_embedding)
    corpus_norms = np.linalg.norm(corpus_embeddings, axis=1, keepdims=True)
    corpus_normalized = corpus_embeddings / corpus_norms
    similarities = corpus_normalized @ query_norm  # (n_drugs,)
    # Indices of the k highest similarities, most similar first
    top_k = np.argsort(similarities)[-k:][::-1]
    return top_k, similarities[top_k]
Topic Modeling (LDA, BERTopic)
Discover latent topics in a corpus of clinical notes without labeled topics.
# BERTopic: combines BERT embeddings + HDBSCAN clustering + TF-IDF topic words
# Topic 0: ["warfarin", "INR", "anticoagulant", "bleeding"]
# Topic 1: ["diabetes", "metformin", "glucose", "insulin"]
# Topic 2: ["hypertension", "ACE", "lisinopril", "blood pressure"]
When to Use Unsupervised Learning
| Situation | Use Case |
|---|---|
| No labels available | Patient clustering from EHR data |
| Too expensive to label | Finding document clusters before manual labeling |
| Exploratory analysis | Understand structure of a new dataset |
| Preprocessing | PCA for dimensionality reduction before supervised model |
| Anomaly detection | Fraud, sensor failures, out-of-distribution inputs |
| Semantic search | Nearest-neighbor retrieval in embedding space |
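The preprocessing row can be sketched as a scikit-learn pipeline: PCA compresses the features without looking at labels, and a supervised classifier is trained on the compressed representation. The data below is synthetic (40 features generated from 3 hidden factors), and the choice of 3 components is an assumption that matches how the data was built.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: 40 observed features driven by 3 hidden factors plus noise
latent = rng.normal(size=(600, 3))
mixing = rng.normal(size=(3, 40))
X = latent @ mixing + 0.1 * rng.normal(size=(600, 40))
y = (latent[:, 0] > 0).astype(int)  # label depends on one hidden factor

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Scaling and PCA use only the features; the classifier is the supervised step
model = make_pipeline(StandardScaler(), PCA(n_components=3), LogisticRegression())
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"Test accuracy with 3 principal components: {accuracy:.2f}")
```

Putting PCA inside the pipeline means it is fit on the training split only, which avoids leaking test data into the preprocessing step.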
Limitations
- No ground truth: hard to know if the clusters are "correct"
- Evaluation is subjective: metrics like silhouette score exist but don't fully capture usefulness
- Sensitive to scale: distance-based methods such as K-Means need features on comparable scales, so standardize them first
- Choosing k: in K-Means, k is a hyperparameter; a poor choice gives clusters that don't reflect real structure
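Two of these limitations, scale sensitivity and choosing k, can be partly addressed in code. The sketch below standardizes the features and then compares silhouette scores across several values of k. The data is synthetic with four planted groups, and the second feature is deliberately put in much larger units to mimic mixed measurement scales.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic data with 4 well-separated groups; the "true" k is known only
# because the data is generated here
centers = [[0, 0], [10, 0], [0, 10], [10, 10]]
X, _ = make_blobs(n_samples=400, centers=centers, cluster_std=0.6, random_state=0)
X[:, 1] *= 100  # pretend the second feature is recorded in much larger units

# Put features on a comparable scale before any distance-based method
X_scaled = StandardScaler().fit_transform(X)

# Fit K-Means for several k and record the silhouette score of each
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)
    print(f"k={k}: silhouette={scores[k]:.3f}")

best_k = max(scores, key=scores.get)
print(f"Highest silhouette at k={best_k}")
```

On clean synthetic data the silhouette peak lines up with the planted number of groups. On real data the curve is often flatter, so treat the score as one input alongside domain knowledge rather than a verdict.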
Interview Answer Template
Q: What is unsupervised learning?
Unsupervised learning works with unlabeled data: there's no y, only X. The model finds structure, patterns, or groupings on its own. The three main tasks are clustering (grouping similar points, like K-Means), dimensionality reduction (compressing representations, like PCA or t-SNE), and anomaly detection (finding outliers). It's used when labels are expensive or unavailable, for exploratory analysis, and in AI systems like RAG, where embedding-based retrieval is fundamentally an unsupervised similarity search. The main challenge is evaluation: without labels, it's harder to measure whether the discovered structure is meaningful.