Sampling in Machine Learning

Why Sampling Matters

Every ML pipeline involves sampling decisions:

Which examples to include in training
How to split train/validation/test
How to sample mini-batches during training
How to estimate evaluation metric uncertainty

Poor sampling choices create biased models and misleading evaluations.

Simple Random Sampling

Python

import numpy as np
import pandas as pd

np.random.seed(42)
n = 1000
indices = np.random.choice(n, size=200, replace=False)  # 20% sample

# Pandas
df = pd.DataFrame({"x": range(1000), "y": range(1000)})
sample = df.sample(n=200, random_state=42)

# Train/test split (simple random)
# X (features) and y (labels) are assumed to be already-loaded array-likes of equal length
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Problem: a random split may not preserve class proportions.
# If 5% of examples are class 1, a small test set might contain 3% or 8%,
# which distorts evaluation metrics.
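
To make the spread concrete, here is a small simulation using only the standard library. The dataset (1,000 examples, 5% positive) and the helper name `positive_rate_of_random_split` are made up for illustration; it is a sketch of the effect, not a benchmark.

```python
import random

def positive_rate_of_random_split(labels, test_fraction, rng):
    """Draw one simple random test split and return its positive-class rate."""
    n_test = int(len(labels) * test_fraction)
    test_labels = rng.sample(labels, n_test)  # sampling without replacement
    return sum(test_labels) / n_test

# Hypothetical dataset: 1,000 examples, exactly 5% positive
labels = [1] * 50 + [0] * 950
rng = random.Random(42)

# Repeat the 20% split many times and look at how the positive rate varies
rates = [positive_rate_of_random_split(labels, 0.2, rng) for _ in range(1000)]
mean_rate = sum(rates) / len(rates)
print(f"min={min(rates):.3f}  max={max(rates):.3f}  mean={mean_rate:.3f}")
```

The mean stays close to 5%, but individual splits drift noticeably in both directions. The smaller the test set and the rarer the class, the wider that spread becomes.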

Stratified Sampling

Preserves class (or group) proportions in each split.

Python

from sklearn.model_selection import train_test_split, StratifiedKFold

# Stratified train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,          # preserve class proportions
    random_state=42,
)

print(f"Train class 1 rate: {y_train.mean():.3f}")
print(f"Test class 1 rate:  {y_test.mean():.3f}")
# Should be nearly identical (y.mean() is the class 1 rate only for 0/1 labels)

# Stratified k-fold (for evaluating models on imbalanced datasets)
# The indexing below assumes X and y are NumPy arrays; with pandas, use X.iloc[train_idx]
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    X_fold_train, X_fold_val = X[train_idx], X[val_idx]
    y_fold_train, y_fold_val = y[train_idx], y[val_idx]
    # ... train and evaluate ...
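
One way to picture what stratification does is to split each class separately and then merge the pieces. The function below is a minimal standard-library sketch of that idea, not scikit-learn's actual implementation; `stratified_split` and the toy labels are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction, seed=0):
    """Return (train_idx, test_idx), splitting each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)  # same fraction from every class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical dataset: 1,000 examples, 5% positive
labels = [1] * 50 + [0] * 950
train_idx, test_idx = stratified_split(labels, test_fraction=0.2, seed=42)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(f"train rate={train_rate:.3f}, test rate={test_rate:.3f}")
```

Because each class contributes the same fraction to the test set, both splits keep the 5% positive rate here, up to rounding when class sizes do not divide evenly.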

Stratification Beyond Labels

Python

# Keep groups intact while preserving class proportions
from sklearn.model_selection import StratifiedGroupKFold

# Example: hospital_ids holds one hospital ID per sample
# Each hospital appears in only one fold (no data leakage across hospitals),
# and folds are stratified by y as closely as the grouping allows
sgkf = StratifiedGroupKFold(n_splits=5)
for train_idx, val_idx in sgkf.split(X, y, groups=hospital_ids):
    pass  # train and evaluate; no hospital is split between train_idx and val_idx

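# Stratify on a continuous variable (e.g., age) by binning it first
# Ages outside the bin edges become NaN, which stratify= cannot handle
df["age_bin"] = pd.cut(df["age"], bins=[0, 40, 60, 80, 120], labels=False, include_lowest=True)
train_df, test_df = train_test_split(df, stratify=df["age_bin"], test_size=0.2, random_state=42)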
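
The grouping constraint on its own can be sketched without any library: assign whole groups to folds, then look up each sample's fold through its group. This simplified version ignores class balance (which `StratifiedGroupKFold` also tries to preserve); `group_folds` and the hospital IDs are hypothetical.

```python
import random

def group_folds(groups, n_splits, seed=0):
    """Assign every group to exactly one fold; return a fold index per sample."""
    rng = random.Random(seed)
    unique_groups = sorted(set(groups))
    rng.shuffle(unique_groups)
    # Deal groups out to folds round-robin after shuffling
    fold_of_group = {g: i % n_splits for i, g in enumerate(unique_groups)}
    return [fold_of_group[g] for g in groups]

# Hypothetical data: 12 samples from 6 hospitals
hospital_ids = ["A", "A", "B", "B", "B", "C", "D", "D", "E", "E", "F", "F"]
fold_of_sample = group_folds(hospital_ids, n_splits=3, seed=42)

for fold in range(3):
    val = {h for h, f in zip(hospital_ids, fold_of_sample) if f == fold}
    train = {h for h, f in zip(hospital_ids, fold_of_sample) if f != fold}
    print(f"fold {fold}: validation={sorted(val)}, overlap={sorted(val & train)}")
```

The overlap is empty for every fold, which is the property that prevents a model from being validated on hospitals it has already seen during training.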

Bootstrap Sampling

Sample with replacement to estimate variance and confidence intervals.

Python

def bootstrap_metric(
    y_true: np.ndarray,
    y_pred: np.ndarray,
    metric_fn,           # e.g., sklearn.metrics.roc_auc_score
    n_bootstrap: int = 1000,
    confidence: float = 0.95,
    seed: int = 42,
) -> dict:
    # Convert to arrays so positional indexing also works for pandas Series
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    rng = np.random.RandomState(seed)
    bootstrap_scores = []

    n = len(y_true)
    for _ in range(n_bootstrap):
        idx = rng.choice(n, size=n, replace=True)   # sample WITH replacement
        if len(np.unique(y_true[idx])) < 2:
            continue  # metrics such as AUC are undefined when a resample has one class
        score = metric_fn(y_true[idx], y_pred[idx])
        bootstrap_scores.append(score)

    bootstrap_scores = np.array(bootstrap_scores)
    alpha = 1 - confidence

    # Percentile method: take the empirical quantiles of the bootstrap scores
    return {
        "mean": float(bootstrap_scores.mean()),
        "std": float(bootstrap_scores.std()),
        "ci_lower": float(np.percentile(bootstrap_scores, alpha / 2 * 100)),
        "ci_upper": float(np.percentile(bootstrap_scores, (1 - alpha / 2) * 100)),
    }

# Usage: AUC with 95% confidence interval
from sklearn.metrics import roc_auc_score
result = bootstrap_metric(y_test, y_pred_proba, roc_auc_score)
print(f"AUC = {result['mean']:.3f} (95% CI: {result['ci_lower']:.3f}–{result['ci_upper']:.3f})")
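
The usage above depends on `y_test` and `y_pred_proba` from a fitted model. For a run that needs no model or third-party packages, the following sketch applies the same percentile bootstrap to accuracy on made-up predictions. The function name, the toy data, and the order-statistic approximation of the percentiles are all illustrative assumptions.

```python
import random

def bootstrap_accuracy_ci(y_true, y_pred, n_bootstrap=1000, confidence=0.95, seed=42):
    """Percentile bootstrap CI for accuracy, using only the standard library."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_bootstrap):
        idx = [rng.randrange(n) for _ in range(n)]  # resample WITH replacement
        scores.append(sum(y_true[i] == y_pred[i] for i in idx) / n)
    scores.sort()
    alpha = 1 - confidence
    # Approximate the percentiles with order statistics of the sorted scores
    lower = scores[int(alpha / 2 * n_bootstrap)]
    upper = scores[int((1 - alpha / 2) * n_bootstrap) - 1]
    return lower, upper

# Hypothetical predictions: 500 examples, exactly 400 correct (accuracy 0.80)
y_true = [1] * 500
y_pred = [1] * 400 + [0] * 100
point = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
lower, upper = bootstrap_accuracy_ci(y_true, y_pred)
print(f"accuracy = {point:.3f} (95% CI: {lower:.3f} to {upper:.3f})")
```

The interval width shrinks roughly with the square root of the test set size, so a small test set yields a wide interval even when the point estimate looks good.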

Imbalanced Classes: Oversampling and Undersampling

Python

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

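# SMOTE: Synthetic Minority Over-sampling Technique
# Generates synthetic minority examples by interpolating between existing ones
# A float sampling_strategy (binary targets only) is the desired minority/majority ratio:
# 0.5 grows the minority class to 50% of the majority count
smote = SMOTE(sampling_strategy=0.5, random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

# Random undersampling: drop majority class examples until minority/majority = 0.5
under = RandomUnderSampler(sampling_strategy=0.5, random_state=42)
X_under, y_under = under.fit_resample(X_train, y_train)

# Combined pipeline: oversample the minority to 30% of the majority,
# then undersample the majority until the ratio reaches 50%
# (e.g., 950 majority / 50 minority becomes roughly 570 / 285)
pipeline = Pipeline([
    ("smote", SMOTE(sampling_strategy=0.3, random_state=42)),
    ("under", RandomUnderSampler(sampling_strategy=0.5, random_state=42)),
])
X_combined, y_combined = pipeline.fit_resample(X_train, y_train)

# Note: only resample TRAINING data, never test data
# When cross-validating, resample inside each training fold only
# Test set must reflect true class distribution for valid evaluation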
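
The core step of SMOTE, interpolating between a minority point and one of its nearest minority neighbours, can be sketched in a few lines. This is a simplified illustration, not imbalanced-learn's implementation; `smote_like_sample` and the 2-D points are hypothetical.

```python
import math
import random

def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point between a minority point and one of its k nearest minority neighbours."""
    rng = rng or random.Random(0)
    base = rng.choice(minority)
    # Rank the other minority points by Euclidean distance to the base point
    neighbours = sorted(
        (p for p in minority if p is not base),
        key=lambda p: math.dist(base, p),
    )[:k]
    neighbour = rng.choice(neighbours)
    u = rng.random()  # interpolation factor in [0, 1)
    return tuple(b + u * (q - b) for b, q in zip(base, neighbour))

# Hypothetical 2-D minority-class points
minority = [(1.0, 1.0), (1.5, 1.2), (2.0, 0.8), (1.2, 1.8)]
rng = random.Random(42)
synthetic = [smote_like_sample(minority, k=2, rng=rng) for _ in range(5)]
for point in synthetic:
    print(tuple(round(v, 3) for v in point))
```

Every synthetic point lies on a line segment between two real minority points, which is why SMOTE can blur class boundaries when minority examples sit close to majority ones.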

Mini-Batch Sampling During Training

Python

import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

# Weighted sampling for imbalanced datasets
# Each sample's probability is inversely proportional to its class frequency
class_counts = np.bincount(y_train)      # y_train: integer class labels 0..K-1
class_weights = 1.0 / class_counts       # assumes every class occurs at least once
sample_weights = class_weights[y_train]

sampler = WeightedRandomSampler(
    weights=sample_weights,
    num_samples=len(y_train),
    replacement=True,
)

loader = DataLoader(
    dataset,            # a torch Dataset whose items align with y_train
    batch_size=32,
    sampler=sampler,    # do not also pass shuffle=True; batches are class-balanced in expectation
)
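
The effect of inverse-frequency weights can be checked without PyTorch. The sketch below draws indices with `random.choices` using the same weighting scheme on made-up labels; it mimics the sampling idea only and is not a substitute for `WeightedRandomSampler`.

```python
import random
from collections import Counter

# Hypothetical imbalanced labels: 950 negatives, 50 positives
y_train = [0] * 950 + [1] * 50

class_counts = Counter(y_train)
sample_weights = [1.0 / class_counts[label] for label in y_train]

rng = random.Random(42)
# Draw 10,000 indices with replacement, proportional to sample_weights
drawn = rng.choices(range(len(y_train)), weights=sample_weights, k=10_000)
drawn_labels = Counter(y_train[i] for i in drawn)

positive_share = drawn_labels[1] / len(drawn)
print(f"positive share in sampled stream: {positive_share:.3f}")

# Individual batches still vary around an even split
batch = [y_train[i] for i in drawn[:32]]
print(f"positives in first batch of 32: {sum(batch)}")
```

Each class carries the same total weight, so about half of the draws are positives even though positives make up 5% of the data. The trade-off is that each of the 50 positive examples is repeated many times per epoch, which can encourage overfitting to them.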

Interview Answer

"Sampling strategy critically affects both training and evaluation. Random splits are fine for balanced datasets; stratified splits are essential for imbalanced classes — otherwise your test set may not reflect the true class distribution and evaluation metrics will be misleading. Bootstrap sampling estimates metric variance and confidence intervals — always report AUC or accuracy as mean ± 95% CI in clinical settings. For class imbalance, SMOTE oversamples the minority class in training only — never resample the test set. During training, weighted random sampling ensures balanced mini-batches so the model sees enough minority class examples per update."
