Machine Learning Foundations · Lesson 7 of 70

What is Unsupervised Learning?

The Core Idea

In unsupervised learning, the training data has no labels — you only have inputs (X). The model must find structure, patterns, or groupings on its own.

Supervised:   (X, y) → learn to predict y from X
Unsupervised: (X)    → discover structure within X

The Three Main Tasks

1. Clustering — Find Natural Groups

Group similar data points together without being told what the groups are.

Python

from sklearn.cluster import KMeans
import numpy as np

# 500 drug embedding vectors (random placeholders here); goal: find natural drug families
drug_embeddings = np.random.randn(500, 128)

# K-Means: partition into k clusters
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
kmeans.fit(drug_embeddings)

cluster_labels = kmeans.labels_   # Which cluster each drug belongs to
centers = kmeans.cluster_centers_ # Centroid of each cluster

print(cluster_labels[:10])   # first ten cluster ids, each in 0..4 (values depend on the data)

# Predict cluster for a new drug
new_drug_emb = np.random.randn(1, 128)
print(kmeans.predict(new_drug_emb))   # one-element array with the assigned cluster id

Common algorithms: K-Means, DBSCAN (density-based, finds clusters of arbitrary shape), hierarchical clustering, and Gaussian Mixture Models.
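To show what "partition into k clusters" involves, here is a minimal NumPy sketch of Lloyd's algorithm, the classic procedure behind K-Means. It is an illustration under simplifying assumptions (random initialisation from data points, a single run), not scikit-learn's implementation; the function names `kmeans_lloyd` and `nearest_center` and the toy blobs are hypothetical.

```python
import numpy as np

def nearest_center(X, centers):
    """Index of the closest centroid for every row of X (Euclidean distance)."""
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans_lloyd(X, k, n_iters=100, seed=0):
    """Plain Lloyd's algorithm: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Initialise centroids as k distinct data points chosen at random
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid
        labels = nearest_center(X, centers)
        # Update step: move each centroid to the mean of its assigned points
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:      # keep the old centroid if a cluster is empty
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break                     # centroids stopped moving: converged
        centers = new_centers
    return nearest_center(X, centers), centers

# Toy data (assumed for illustration): two well-separated blobs in 2D
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(10.0, 0.5, size=(50, 2))])

labels, centers = kmeans_lloyd(X, k=2)
print(np.bincount(labels))   # number of points assigned to each cluster
```

Which blob ends up as cluster 0 depends on the random initialisation, which is one reason scikit-learn runs several initialisations (`n_init`) and keeps the best.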

2. Dimensionality Reduction — Compress Representations

Reduce high-dimensional data to fewer dimensions while preserving structure. Used for visualization, denoising, and making downstream ML faster.

Python

import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# 1000 drug embeddings, 1536 dimensions → visualize in 2D
embeddings = np.random.randn(1000, 1536)

# PCA: linear reduction — fast, good for preprocessing
pca = PCA(n_components=50)
reduced_pca = pca.fit_transform(embeddings)   # (1000, 50)

# Variance explained: how much information is retained
explained = sum(pca.explained_variance_ratio_)
print(f"Variance explained by 50 components: {explained:.1%}")

# t-SNE: non-linear, for 2D/3D visualization only
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
reduced_2d = tsne.fit_transform(reduced_pca)   # (1000, 2)
# Plot: with real embeddings, similar drugs tend to land near each other;
# the random placeholder data used here will show no meaningful clusters
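To make the "variance explained" figure concrete, the sketch below computes PCA directly from the singular value decomposition of the mean-centred data. It is a simplified illustration, not scikit-learn's implementation; the function name `pca_svd` and the toy data are assumptions made for this example.

```python
import numpy as np

def pca_svd(X, n_components):
    """PCA via SVD of the mean-centred data matrix."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # A singular value s corresponds to a component variance of s**2 / (n - 1)
    variances = S**2 / (len(X) - 1)
    explained_ratio = variances[:n_components] / variances.sum()
    # Project the centred data onto the retained directions
    return X_centered @ components.T, components, explained_ratio

# Toy data (assumed): 3 features, where the third is nearly a copy of the first
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X = np.column_stack([a, b, a + 0.01 * rng.normal(size=200)])

projected, components, ratio = pca_svd(X, n_components=2)
print(projected.shape)                          # (200, 2)
print(f"Variance explained: {ratio.sum():.1%}")
```

Because one feature is almost redundant, two components retain nearly all of the variance. Random Gaussian data like the placeholder embeddings above has no such redundancy, so 50 of 1536 components would retain only a small share.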

3. Anomaly Detection — Find Outliers

Identify data points that don't fit the normal pattern — without being told what "anomalous" looks like.

Python

from sklearn.ensemble import IsolationForest

# Patient vital signs — detect abnormal patients without labeled anomalies
vitals = np.random.randn(1000, 5)   # Normal patients
# Add some anomalies
vitals[990:] *= 5   # Extreme values

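# contamination is the expected fraction of outliers; 0.05 flags about 50 of the
# 1000 rows, even though only 10 were injected above
detector = IsolationForest(contamination=0.05, random_state=42)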
detector.fit(vitals)

predictions = detector.predict(vitals)
# 1 = normal, -1 = anomaly
anomaly_indices = np.where(predictions == -1)[0]
print(f"Detected {len(anomaly_indices)} anomalies")
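As a point of comparison for the Isolation Forest example, the sketch below uses a much simpler baseline: a per-feature modified z-score based on the median and the median absolute deviation (MAD). This is not Isolation Forest, just a different, simpler technique for the same task. The function name `robust_zscore_outliers` and the 3.5 threshold are assumptions chosen for illustration.

```python
import numpy as np

def robust_zscore_outliers(X, threshold=3.5):
    """Return indices of rows where any feature's modified z-score exceeds threshold."""
    median = np.median(X, axis=0)
    mad = np.median(np.abs(X - median), axis=0)
    # 0.6745 rescales the MAD so scores are comparable to ordinary z-scores
    # when the data is roughly normally distributed
    scores = 0.6745 * np.abs(X - median) / mad
    return np.where((scores > threshold).any(axis=1))[0]

rng = np.random.default_rng(42)
vitals = rng.normal(size=(1000, 5))   # simulated "normal" patients
vitals[990:] *= 5                     # inject 10 rows with extreme values

flagged = robust_zscore_outliers(vitals)
recovered = np.intersect1d(flagged, np.arange(990, 1000))
print(f"Flagged {len(flagged)} rows, including {len(recovered)} of the 10 injected ones")
```

A per-feature rule like this only catches points that are extreme in at least one individual feature. Isolation Forest can also flag unusual combinations of otherwise ordinary values, which is why it is the more common choice for multivariate data.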

Unsupervised Learning in AI Systems

Embedding Similarity Search (RAG)

The retrieval step of RAG works without labels: nothing tells the system which documents are relevant to a query. Semantic similarity comes from the embedding model's training, and retrieval ranks documents by their proximity to the query in embedding space.

Python

import numpy as np

def find_similar_drugs(query_embedding: np.ndarray, corpus_embeddings: np.ndarray, k: int = 5):
    """Unsupervised: find similar drugs by embedding proximity."""
    # Normalize
    query_norm = query_embedding / np.linalg.norm(query_embedding)
    corpus_norms = np.linalg.norm(corpus_embeddings, axis=1, keepdims=True)
    corpus_normalized = corpus_embeddings / corpus_norms

    similarities = corpus_normalized @ query_norm   # (n_drugs,)
    top_k = np.argsort(similarities)[-k:][::-1]
    return top_k, similarities[top_k]
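A small usage sketch can make the ranking easy to verify by hand. The function is repeated so the block runs on its own, and the 3-dimensional vectors are hypothetical stand-ins for real embeddings, chosen so that the cosine similarities are obvious.

```python
import numpy as np

def find_similar_drugs(query_embedding, corpus_embeddings, k=5):
    """Return indices and cosine similarities of the k nearest corpus rows."""
    query_norm = query_embedding / np.linalg.norm(query_embedding)
    corpus_normalized = corpus_embeddings / np.linalg.norm(
        corpus_embeddings, axis=1, keepdims=True)
    similarities = corpus_normalized @ query_norm
    top_k = np.argsort(similarities)[-k:][::-1]   # highest similarity first
    return top_k, similarities[top_k]

# Hypothetical toy "embeddings" (real ones have hundreds of dimensions)
corpus = np.array([
    [1.0, 0.0, 0.0],    # index 0: same direction as the query
    [0.0, 1.0, 0.0],    # index 1: orthogonal to the query
    [2.0, 2.0, 0.0],    # index 2: 45 degrees from the query
    [-1.0, 0.0, 0.0],   # index 3: opposite direction
])
query = np.array([3.0, 0.0, 0.0])

indices, scores = find_similar_drugs(query, corpus, k=2)
print(indices)   # indices 0 and 2, in that order
print(scores)    # approximately 1.0 and 0.707
```

Vector length does not matter here: the query has norm 3 and index 2 has norm of about 2.83, yet only direction affects the score, because both sides are normalised before the dot product.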

Topic Modeling (LDA, BERTopic)

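Discover latent topics in a corpus of clinical notes without labeled topics. LDA models each document as a mixture of topics over word counts; BERTopic clusters document embeddings and then extracts representative words for each cluster.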

Python

# BERTopic pipeline: transformer embeddings -> UMAP reduction -> HDBSCAN clustering
# -> class-based TF-IDF (c-TF-IDF) to pick topic words
# Illustrative topics such a model might find in clinical notes:
# Topic 0: ["warfarin", "INR", "anticoagulant", "bleeding"]
# Topic 1: ["diabetes", "metformin", "glucose", "insulin"]
# Topic 2: ["hypertension", "ACE", "lisinopril", "blood pressure"]

When to Use Unsupervised Learning

| Situation | Use Case |
|---|---|
| No labels available | Patient clustering from EHR data |
| Too expensive to label | Finding document clusters before manual labeling |
| Exploratory analysis | Understand structure of a new dataset |
| Preprocessing | PCA for dimensionality reduction before supervised model |
| Anomaly detection | Fraud, sensor failures, out-of-distribution inputs |
| Semantic search | Nearest-neighbor retrieval in embedding space |

Limitations

- No ground truth: without labels, it is hard to know whether the clusters are "correct".
- Evaluation is subjective: metrics like silhouette score exist but don't fully capture usefulness.
- Sensitive to scale: distance-based methods such as K-Means need features on comparable scales, so standardize them first.
- Choosing k: in K-Means, k is a hyperparameter; a poor choice merges distinct groups or splits natural ones.
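The scale sensitivity noted above can be seen in a small sketch. The patient records and the helper `nearest_to_first` are hypothetical, made up for this illustration: the nearest neighbour of the first patient changes once each feature is standardized to zero mean and unit variance.

```python
import numpy as np

# Hypothetical patients: [age in years, annual income in dollars]
patients = np.array([
    [25.0, 50_000.0],   # row 0: the reference patient
    [65.0, 51_000.0],   # row 1: 40 years older, almost the same income
    [27.0, 56_000.0],   # row 2: almost the same age, somewhat higher income
    [60.0, 90_000.0],   # row 3: older and much higher income
])

def nearest_to_first(X):
    """Row index (1..n-1) closest to row 0 in Euclidean distance."""
    dists = np.linalg.norm(X[1:] - X[0], axis=1)
    return int(dists.argmin()) + 1

# Raw features: income differences dominate because the numbers are larger
print(nearest_to_first(patients))       # 1

# Standardize each column to zero mean and unit variance, then compare again
standardized = (patients - patients.mean(axis=0)) / patients.std(axis=0)
print(nearest_to_first(standardized))   # 2
```

On the raw data, a 40-year age gap counts for less than a $6,000 income gap, simply because of the units. After standardization, both features contribute on a comparable scale.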

Interview Answer Template

Q: What is unsupervised learning?

Unsupervised learning works with unlabeled data — there's no y, only X. The model finds structure, patterns, or groupings on its own. The three main tasks are clustering (grouping similar points, like K-Means), dimensionality reduction (compressing representations, like PCA or t-SNE), and anomaly detection (finding outliers). It's used when labels are expensive or unavailable, for exploratory analysis, and in AI systems like RAG — where embedding-based retrieval is fundamentally an unsupervised similarity search. The main challenge is evaluation: without labels, it's harder to measure whether the discovered structure is meaningful.
