Machine Learning Foundations · Lesson 9 of 70
What is Semi-Supervised Learning?
The Core Idea
Semi-supervised learning uses a small amount of labeled data combined with a large amount of unlabeled data for training.
Supervised: 1,000 labeled examples → train
Semi-supervised: 100 labeled + 10,000 unlabeled → train
Unsupervised: 10,000 unlabeled → train
This matters because labeling is expensive. Clinical notes, medical images, and legal documents require domain experts to annotate, often at high cost and with slow turnaround. Unlabeled data is cheap.
Why Unlabeled Data Helps
Unlabeled data provides information about the structure of the input space, even without knowing the labels. If two inputs are similar (close in feature space), it is reasonable to assume they share a label. This is often called the smoothness assumption, and it only helps when the features capture what the label depends on.
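As a minimal sketch of this idea, the snippet below gives each unlabeled point the label of its nearest labeled point. The 2-D feature values and the helper name `nearest_labeled_neighbor` are hypothetical, chosen only for illustration; real systems would use learned embeddings and a more careful method.

```python
import numpy as np

# Toy 2-D feature vectors (hypothetical values chosen for illustration).
labeled_X = np.array([[0.0, 0.0], [5.0, 5.0]])
labeled_y = np.array([0, 1])
unlabeled_X = np.array([[0.3, -0.2], [4.6, 5.1], [0.1, 0.4]])


def nearest_labeled_neighbor(unlabeled, labeled, labels):
    """Give each unlabeled point the label of its closest labeled point."""
    # Pairwise Euclidean distances, shape (n_unlabeled, n_labeled).
    diffs = unlabeled[:, None, :] - labeled[None, :, :]
    distances = np.linalg.norm(diffs, axis=2)
    return labels[distances.argmin(axis=1)]


guessed_y = nearest_labeled_neighbor(unlabeled_X, labeled_X, labeled_y)
print(guessed_y)
```

Points near the first labeled example inherit class 0 and the point near the second inherits class 1, which is the intuition that the approaches in this lesson build on.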
Labeled:
"warfarin is an oral anticoagulant" → class: anticoagulant ✓
Unlabeled:
"coumadin prevents blood clots" → class: ???
"rivaroxaban is used for DVT prevention" → class: ???
Semi-supervised insight: these sentences are semantically similar to the labeled one
→ probably also anticoagulant
Common Approaches
1. Self-Training (Pseudo-Labeling)
Train on labeled data, predict labels for unlabeled data, then retrain including high-confidence predictions.
from sklearn.svm import SVC
from sklearn.semi_supervised import SelfTrainingClassifier
import numpy as np
# -1 indicates unlabeled in scikit-learn's semi-supervised API
y_train = np.array([0, 1, 2, -1, -1, -1, -1, -1, -1, -1, 0, 1])
# Indices 0-2 and 10-11 are labeled; indices 3-9 are unlabeled (-1)
# Random features, used here only to demonstrate the API
X_train = np.random.randn(12, 20)
base_classifier = SVC(probability=True, kernel="rbf")
self_training = SelfTrainingClassifier(base_classifier, threshold=0.8)
self_training.fit(X_train, y_train)
Risk: if the initial predictions are wrong, the errors compound as the model retrains on its own mistakes ("confirmation bias" in the model).
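To make the loop inside self-training visible, here is a hand-written sketch of the same idea. It is an assumption-laden illustration, not how `SelfTrainingClassifier` is implemented: the synthetic dataset, the choice of `LogisticRegression`, the five-round cap, and the 0.8 threshold are all picked for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for a real task (an assumption for illustration).
X, y_true = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

# Hide most labels: keep 30 labeled examples, mark the rest as -1.
rng = np.random.default_rng(0)
y = np.full_like(y_true, -1)
labeled_idx = rng.choice(len(y_true), size=30, replace=False)
y[labeled_idx] = y_true[labeled_idx]

threshold = 0.8
for round_number in range(5):
    labeled = y != -1
    if labeled.all():
        break
    # 1. Train on everything that currently has a (real or pseudo) label.
    model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
    # 2. Predict class probabilities for the still-unlabeled points.
    unlabeled_idx = np.flatnonzero(~labeled)
    proba = model.predict_proba(X[unlabeled_idx])
    # 3. Keep only predictions at or above the confidence threshold.
    confident = proba.max(axis=1) >= threshold
    if not confident.any():
        break
    y[unlabeled_idx[confident]] = model.classes_[proba[confident].argmax(axis=1)]
    print(f"round {round_number}: added {confident.sum()} pseudo-labels")
```

Because pseudo-labels are never revisited in this sketch, a wrong early prediction stays in the training set, which is exactly the compounding-error risk noted above.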
2. Label Propagation
Spread labels through a similarity graph — similar points receive the same label.
from sklearn.semi_supervised import LabelPropagation
# Same API: -1 for unlabeled
model = LabelPropagation(kernel="rbf", gamma=20)
model.fit(X_train, y_train)
# Unlabeled points also receive a soft label distribution
label_distributions = model.label_distributions_
# Illustrative shape of the output (not actual values from the random data above):
# [[0.85, 0.10, 0.05],   → 85% likely class 0
#  [0.12, 0.76, 0.12],   → 76% likely class 1
#  ...]
3. Pre-Training on Unlabeled Data (Most Common in NLP)
Use large unlabeled corpora to learn representations, then fine-tune on small labeled datasets. This is the pre-train-then-fine-tune recipe behind models such as BERT and GPT.
Pre-training (self-supervised, no labels):
Train on 100GB of medical text
Task: predict masked words / next token
→ Learns rich language representations
Fine-tuning (supervised, few labels):
100 labeled (drug_mention, context) → (drug_class) examples
Fine-tune pre-trained model on these 100 examples
→ Typically outperforms a model trained from scratch on the same 100 labeled examples
4. Consistency Regularization
Augment the same input in two different ways — the model's predictions should be consistent across augmentations.
# Conceptually (add_noise and kl_divergence are placeholder helpers, not library functions):
def consistency_loss(model, unlabeled_x):
    # Augment the same input in two different ways
    aug1 = add_noise(unlabeled_x, sigma=0.1)
    aug2 = add_noise(unlabeled_x, sigma=0.1)
    pred1 = model(aug1)
    pred2 = model(aug2)
    # Predictions on the same input should agree
    return kl_divergence(pred1, pred2)
Variants of this idea are used in UDA (Unsupervised Data Augmentation) and FixMatch.
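The conceptual function above can be made runnable with a few NumPy helpers. This is a sketch under stated assumptions: `add_noise` uses Gaussian noise as a stand-in for real augmentations, `model` is a hypothetical fixed random linear layer with softmax, and `kl_divergence` is a plain batch-averaged KL. None of this reproduces the actual UDA or FixMatch training procedures.

```python
import numpy as np

rng = np.random.default_rng(0)


def add_noise(x, sigma=0.1):
    """Gaussian-noise augmentation (a simple stand-in for real augmentations)."""
    return x + rng.normal(0.0, sigma, size=x.shape)


def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def kl_divergence(p, q, eps=1e-12):
    """Mean KL(p || q) over the batch; rows of p and q are class probabilities."""
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=1)))


# Hypothetical model: a fixed random linear layer followed by softmax.
weights = rng.normal(size=(20, 3))


def model(x):
    return softmax(x @ weights)


def consistency_loss(model, unlabeled_x, sigma=0.1):
    """Penalize disagreement between predictions on two noisy views of x."""
    pred1 = model(add_noise(unlabeled_x, sigma))
    pred2 = model(add_noise(unlabeled_x, sigma))
    return kl_divergence(pred1, pred2)


unlabeled_x = rng.normal(size=(8, 20))
loss = consistency_loss(model, unlabeled_x)
print(loss)
```

In training, this term would be added to the supervised loss on the labeled examples, so the unlabeled data shapes the decision boundary without needing labels.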
Real Use Cases
| Domain | Labeled | Unlabeled | Task |
|---|---|---|---|
| Clinical NLP | 200 annotated notes | 50,000 raw EHR notes | Named entity recognition (drugs, dosages) |
| Medical imaging | 500 labeled scans | 10,000 unlabeled scans | Tumor detection |
| Drug classification | 100 annotated compounds | 50,000 research papers | Drug-class prediction |
| Fraud detection | 100 known fraudulent cases | Millions of transactions | Fraud detection |
| LLM pre-training | Few human examples (SFT) | Web-scale text corpora | Language model |
Semi-Supervised vs Related Paradigms
| Paradigm | Labels Available | Unlabeled Used? |
|---|---|---|
| Supervised | All (or most) data labeled | No |
| Semi-supervised | Small fraction labeled | Yes |
| Self-supervised | None (labels auto-generated) | Yes (labels come from the data) |
| Unsupervised | None | Yes |
| Transfer learning | Labeled in a different domain | Yes (pre-training data) |
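Self-supervised learning (BERT, GPT pre-training) is a special case where the model generates its own supervision signal from unlabeled data (e.g., predicting masked tokens). It is distinct from semi-supervised learning, but the two are often combined: self-supervised pre-training followed by fine-tuning on a few labeled examples uses both kinds of data.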
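To illustrate how supervision can come from the data itself, the sketch below turns one unlabeled sentence into (masked input, target word) training pairs. The function name `make_masked_examples`, the whitespace tokenization, and the choice to mask two positions are assumptions made for the example; real masked-language-model pre-training uses subword tokenizers and different masking rules.

```python
import random


def make_masked_examples(sentence, mask_token="[MASK]", seed=0):
    """Turn one unlabeled sentence into (masked_input, target_word) pairs."""
    words = sentence.split()
    rng = random.Random(seed)
    positions = list(range(len(words)))
    rng.shuffle(positions)
    examples = []
    for position in positions[:2]:  # mask two positions, one at a time
        masked = words.copy()
        target = masked[position]
        masked[position] = mask_token
        examples.append((" ".join(masked), target))
    return examples


sentence = "warfarin is an oral anticoagulant"
pairs = make_masked_examples(sentence)
for masked_input, target in pairs:
    print(masked_input, "->", target)
```

No human annotated anything here: the hidden word is the label, which is why this counts as self-supervised rather than semi-supervised.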
Interview Answer Template
Q: What is semi-supervised learning and when would you use it?
Semi-supervised learning combines a small amount of labeled data with a large amount of unlabeled data for training. It's motivated by the fact that labeling is expensive (domain experts are needed for medical, legal, or financial data) while unlabeled data is abundant. Common approaches include self-training (predict labels for unlabeled data, then retrain on high-confidence pseudo-labels), label propagation (spread labels through a similarity graph), consistency regularization (require stable predictions under input augmentation), and pre-training on unlabeled data followed by fine-tuning on labeled examples. The most impactful example in modern AI is the pre-training behind BERT and GPT, which uses massive unlabeled corpora to learn representations that can then be fine-tuned with only hundreds or thousands of labeled examples.