What is Semi-Supervised Learning?

The Core Idea

Semi-supervised learning uses a small amount of labeled data combined with a large amount of unlabeled data for training.

Supervised:         1,000 labeled examples → train
Semi-supervised:    100 labeled + 10,000 unlabeled → train
Unsupervised:       10,000 unlabeled → train

This matters because labeling is expensive. Clinical notes, medical images, and legal documents require domain experts to annotate — often at high cost and slow turnaround. Unlabeled data is cheap.
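The 100-labeled / 10,000-unlabeled split above can be simulated by hiding most labels in a dataset. The sketch below assumes a made-up pool of random features and three classes; the `-1` marker follows scikit-learn's convention for "unlabeled", and every name in it is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pool: 10,100 samples with 20 features and 3 classes
n_samples, n_labeled = 10_100, 100
X = rng.standard_normal((n_samples, 20))
y_true = rng.integers(0, 3, size=n_samples)

# Hide every label except for a random subset of 100
y_semi = np.full(n_samples, -1)
labeled_idx = rng.choice(n_samples, size=n_labeled, replace=False)
y_semi[labeled_idx] = y_true[labeled_idx]

print("labeled:  ", int(np.sum(y_semi != -1)))
print("unlabeled:", int(np.sum(y_semi == -1)))
```

A semi-supervised learner sees `X` and `y_semi`; `y_true` is kept only so the result can be evaluated afterwards.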


Why Unlabeled Data Helps

Unlabeled data provides information about the structure of the input space, even without knowing the labels. If two inputs are close in feature space, it's often reasonable to assume they share a label (the smoothness, or cluster, assumption). When that assumption fails, unlabeled data can hurt rather than help.

Labeled:
  "warfarin is an oral anticoagulant" → class: anticoagulant ✓

Unlabeled:
  "coumadin prevents blood clots"        → class: ??? 
  "rivaroxaban is used for DVT prevention" → class: ???

Semi-supervised insight: these sentences are semantically similar to the labeled one
→ probably also anticoagulant
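The intuition above can be sketched as a nearest-neighbor lookup in embedding space. The 3-D vectors here are invented for illustration (a real system would get them from a trained text encoder), and the second labeled sentence is a hypothetical addition so there is more than one class to choose from.

```python
import numpy as np

# Made-up 3-D "embeddings" for illustration only; a real system would
# obtain vectors from a trained text encoder.
labeled = {
    "warfarin is an oral anticoagulant": (np.array([0.9, 0.1, 0.0]), "anticoagulant"),
    "ibuprofen reduces inflammation": (np.array([0.0, 0.2, 0.9]), "nsaid"),
}
unlabeled = {
    "coumadin prevents blood clots": np.array([0.8, 0.2, 0.1]),
    "rivaroxaban is used for DVT prevention": np.array([0.7, 0.3, 0.0]),
}

def cosine(a, b):
    # Cosine similarity between two vectors
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest_label(vec):
    # Return the label of the most similar labeled sentence
    best_text = max(labeled, key=lambda t: cosine(vec, labeled[t][0]))
    return labeled[best_text][1]

guesses = {text: nearest_label(vec) for text, vec in unlabeled.items()}
for text, label in guesses.items():
    print(f"{text!r} -> {label}")
```

With these toy vectors both unlabeled sentences land closest to the warfarin example, so both are guessed to be anticoagulants.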

Common Approaches

1. Self-Training (Pseudo-Labeling)

Train on labeled data, predict labels for unlabeled data, then retrain including high-confidence predictions.

Python
from sklearn.svm import SVC
from sklearn.semi_supervised import SelfTrainingClassifier
import numpy as np

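# -1 marks an unlabeled sample in scikit-learn's semi-supervised API
y_train = np.array([0, 1, 2, -1, -1, -1, -1, -1, -1, -1, 0, 1])
# Indices 3-9 are unlabeled; the other five samples carry labels

# Random features stand in for real data (seeded for reproducibility)
X_train = np.random.default_rng(0).standard_normal((12, 20))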

base_classifier = SVC(probability=True, kernel="rbf")
self_training = SelfTrainingClassifier(base_classifier, threshold=0.8)
self_training.fit(X_train, y_train)

Risk: if initial predictions are wrong, errors compound — "confirmation bias" in the model.
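To make the loop explicit, here is a hand-rolled sketch of pseudo-labeling on synthetic data. It is not scikit-learn's internal implementation; the dataset, the choice of `LogisticRegression`, the 0.8 threshold, and the 10-round cap are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Synthetic data: 300 points in 3 clusters, only 5 labels kept per class
X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)
rng = np.random.default_rng(0)
y = np.full(len(X), -1)
for c in range(3):
    keep = rng.choice(np.flatnonzero(y_true == c), size=5, replace=False)
    y[keep] = c
seed_idx = np.flatnonzero(y != -1)  # the 15 genuinely labeled points

threshold = 0.8
for round_idx in range(10):
    labeled = y != -1
    if labeled.all():
        break
    clf = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])

    # Predict on the still-unlabeled points and keep only confident ones
    unlabeled_idx = np.flatnonzero(~labeled)
    proba = clf.predict_proba(X[unlabeled_idx])
    confident = proba.max(axis=1) >= threshold
    if not confident.any():
        break

    # Confident predictions become pseudo-labels for the next round
    y[unlabeled_idx[confident]] = clf.classes_[proba[confident].argmax(axis=1)]
    print(f"round {round_idx}: added {int(confident.sum())} pseudo-labels")
```

Pseudo-labels are never revisited once assigned, which is exactly where the confirmation-bias risk comes from: a confident mistake in an early round is treated as ground truth in every later one.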


2. Label Propagation

Spread labels through a similarity graph — similar points receive the same label.

Python
from sklearn.semi_supervised import LabelPropagation

# Same API: -1 for unlabeled (X_train and y_train come from the self-training example)
model = LabelPropagation(kernel="rbf", gamma=20)
model.fit(X_train, y_train)

# Every point, labeled or not, gets a soft label distribution
label_distributions = model.label_distributions_
# Illustrative values only; real output depends on the data:
# [[0.85, 0.10, 0.05],    85% likely class 0
#  [0.12, 0.76, 0.12],    76% likely class 1
#  ...]
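Random features have no cluster structure, so propagation is easier to see on data that does. The sketch below uses scikit-learn's `make_moons` with three labels kept per class; the dataset, the label budget, and `max_iter=1000` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelPropagation

# Two interleaved half-circles; keep 3 labels per class, hide the rest
X, y_true = make_moons(n_samples=200, noise=0.05, random_state=0)
rng = np.random.default_rng(0)
y = np.full(len(X), -1)
for c in (0, 1):
    keep = rng.choice(np.flatnonzero(y_true == c), size=3, replace=False)
    y[keep] = c

model = LabelPropagation(kernel="rbf", gamma=20, max_iter=1000)
model.fit(X, y)

# transduction_ holds the label inferred for every training point
inferred = model.transduction_
hidden = y == -1
accuracy = (inferred[hidden] == y_true[hidden]).mean()
print(f"accuracy on originally unlabeled points: {accuracy:.2f}")
```

How well this works depends heavily on `gamma`: too small and labels bleed across the gap between the moons, too large and the graph falls apart into disconnected pieces.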

3. Pre-Training on Unlabeled Data (Most Common in NLP)

Use large unlabeled corpora to learn representations, then fine-tune on a small labeled dataset. BERT and GPT-style models follow this pre-train-then-fine-tune recipe.

Pre-training (self-supervised, no labels):
  Train on 100GB of medical text
  Task: predict masked words / next token
  → Learns rich language representations

Fine-tuning (supervised, few labels):
  100 labeled (drug_mention, context) → (drug_class) examples
  Fine-tune pre-trained model on these 100 examples
  → Typically outperforms a model trained from scratch on the same 100 labeled examples
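Running a real language-model pre-training job is out of scope here, so the sketch below swaps in a much smaller analogue: PCA stands in for the "pre-trained encoder" (fit on all inputs, no labels), and logistic regression is trained on 100 labeled points. The encoder stays frozen, so this is closer to feature extraction than to true fine-tuning; the dataset and all sizes are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_pool, X_test, y_pool, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Pretend only 100 of the pool's labels are known
X_lab, _, y_lab, _ = train_test_split(
    X_pool, y_pool, train_size=100, random_state=0, stratify=y_pool
)

# "Pre-training" stand-in: learn a 16-D representation from ALL pool inputs, no labels
encoder = PCA(n_components=16, random_state=0).fit(X_pool)

# "Fine-tuning" stand-in: train a classifier on the 100 labeled points only
clf = LogisticRegression(max_iter=2000).fit(encoder.transform(X_lab), y_lab)
accuracy = clf.score(encoder.transform(X_test), y_test)
print(f"test accuracy with 100 labels: {accuracy:.2f}")
```

The structural point carries over to BERT-style models: the expensive representation step never touches a label, and the labeled set is only used for the small supervised step at the end.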

4. Consistency Regularization

Augment the same input in two different ways — the model's predictions should be consistent across augmentations.

Python
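# Conceptual NumPy sketch; real training would use an autodiff framework
import numpy as np

rng = np.random.default_rng(0)

def add_noise(x, sigma=0.1):
    # Gaussian noise stands in for a real augmentation
    return x + sigma * rng.standard_normal(x.shape)

def kl_divergence(p, q, eps=1e-8):
    # Mean KL(p || q); rows of p and q are class probabilities
    return np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=1))

def consistency_loss(model, unlabeled_x):
    # Perturb the same unlabeled batch twice, independently
    aug1 = add_noise(unlabeled_x, sigma=0.1)
    aug2 = add_noise(unlabeled_x, sigma=0.1)
    # model is assumed to return class probabilities, shape (batch, classes)
    pred1, pred2 = model(aug1), model(aug2)
    # Predictions on the same input should agree
    return kl_divergence(pred1, pred2)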

Used in UDA (Unsupervised Data Augmentation) and FixMatch.


Real Use Cases

| Domain | Labeled | Unlabeled | Task |
|---|---|---|---|
| Clinical NLP | 200 annotated notes | 50,000 raw EHR notes | Named entity recognition (drugs, dosages) |
| Medical imaging | 500 labeled scans | 10,000 unlabeled scans | Tumor detection |
| Drug classification | 100 annotated compounds | 50,000 research papers | Drug-class prediction |
| Fraud detection | 100 known fraudulent cases | millions of transactions | Fraud detection |
| LLM pre-training | Few human examples (SFT) | Entire internet | Language model |


Semi-Supervised vs Related Paradigms

| Paradigm | Labels Available | Unlabeled Used? |
|---|---|---|
| Supervised | All (or most) data labeled | No |
| Semi-supervised | Small fraction labeled | Yes |
| Self-supervised | None (labels auto-generated) | Yes, labels come from data |
| Unsupervised | None | Yes |
| Transfer learning | Labeled in a different domain | Yes (pre-training data) |

Self-supervised learning (BERT, GPT pre-training) generates its own supervision signal from unlabeled data (e.g., by predicting masked tokens). It is distinct from semi-supervised learning, but the two combine naturally: self-supervised pre-training followed by fine-tuning on a few labels is a semi-supervised pipeline overall.


Interview Answer Template

Q: What is semi-supervised learning and when would you use it?

Semi-supervised learning combines a small amount of labeled data with a large amount of unlabeled data for training. It's motivated by the fact that labeling is expensive — domain experts are needed for medical, legal, or financial data — while unlabeled data is abundant. Common approaches include self-training (predict labels for unlabeled data, retrain on high-confidence pseudo-labels), label propagation (spread labels through a similarity graph), and pre-training on unlabeled data followed by fine-tuning on labeled examples. The most impactful version in modern AI is BERT and GPT pre-training, which uses massive unlabeled corpora to learn representations that can then be fine-tuned with only hundreds or thousands of labeled examples.