Sampling in Machine Learning
How sampling strategies — random, stratified, systematic, and bootstrap — affect model training and evaluation, with practical implementation.
Why Sampling Matters
Every ML pipeline involves sampling decisions:
- Which examples to include in training
- How to split train/validation/test
- How to sample mini-batches during training
- How to estimate evaluation metric uncertainty
Poor sampling choices create biased models and misleading evaluations.
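To make the last point concrete, the sketch below draws many simple random 20% test splits from a small imbalanced label vector and reports how far the positive rate drifts from the true value. The dataset (1,000 labels, 5% positive) and the helper name `random_test_rate` are hypothetical, chosen only for illustration.

```python
import numpy as np

# Hypothetical dataset: 1,000 labels with a 5% positive rate.
y = np.zeros(1000, dtype=int)
y[:50] = 1

def random_test_rate(y, test_fraction, seed):
    """Positive rate in a simple random test split drawn with the given seed."""
    rng = np.random.default_rng(seed)
    n_test = int(round(len(y) * test_fraction))
    test_idx = rng.choice(len(y), size=n_test, replace=False)
    return y[test_idx].mean()

# Repeat the split with 200 different seeds and look at the spread.
rates = np.array([random_test_rate(y, 0.2, seed) for seed in range(200)])
print(f"true rate: {y.mean():.3f}")
print(f"test-split rate across 200 seeds: min={rates.min():.3f}, max={rates.max():.3f}")
```

The spread between the minimum and maximum is the variation that a single unlucky split can silently bake into an evaluation.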
Simple Random Sampling
import numpy as np
import pandas as pd
np.random.seed(42)
n = 1000
indices = np.random.choice(n, size=200, replace=False) # 20% sample
# Pandas
df = pd.DataFrame({"x": range(1000), "y": range(1000)})
sample = df.sample(n=200, random_state=42)
# Train/test split (simple random)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Problem: a random split may not preserve class proportions
# If 5% of examples are class 1, the test set might contain 3% or 8%, which distorts evaluation
Stratified Sampling
Preserves class (or group) proportions in each split.
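One way to see what stratification does is to split each class separately and then recombine the pieces. The following is a minimal NumPy sketch of that idea, not scikit-learn's implementation; the function name `stratified_split_indices` and the demo labels are hypothetical.

```python
import numpy as np

def stratified_split_indices(y, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx), splitting each class at roughly test_fraction."""
    rng = np.random.default_rng(seed)
    train_parts, test_parts = [], []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)   # positions of this class
        rng.shuffle(idx)                   # shuffle within the class
        n_test = int(round(len(idx) * test_fraction))
        test_parts.append(idx[:n_test])
        train_parts.append(idx[n_test:])
    return np.concatenate(train_parts), np.concatenate(test_parts)

# Hypothetical labels: 950 negatives and 50 positives (5% positive rate).
y_demo = np.array([0] * 950 + [1] * 50)
train_idx, test_idx = stratified_split_indices(y_demo)
print(y_demo[train_idx].mean(), y_demo[test_idx].mean())
```

Because every class is divided in the same ratio, both printed rates match the overall positive rate regardless of the seed.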
from sklearn.model_selection import train_test_split, StratifiedKFold
# Stratified train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,  # preserve class proportions
    random_state=42,
)
print(f"Train class 1 rate: {y_train.mean():.3f}")
print(f"Test class 1 rate: {y_test.mean():.3f}")
# Should be nearly identical
# Stratified k-fold (for evaluation of imbalanced datasets)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    # Positional indexing assumes NumPy arrays; use .iloc for pandas objects
    X_fold_train, X_fold_val = X[train_idx], X[val_idx]
    y_fold_train, y_fold_val = y[train_idx], y[val_idx]
    # ... train and evaluate ...
Stratification Beyond Labels
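To stratify on more than one variable at once, a common approach is to fold the variables into a single key and stratify on that key. The sketch below assumes a binary label and a four-level age bin; the arrays `labels`, `age_bins`, and `strata` are hypothetical names with randomly generated contents.

```python
import numpy as np

# Hypothetical data: a binary label and a 4-level age bin per example.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=400)
age_bins = rng.integers(0, 4, size=400)

# Combine both columns into one key: a distinct integer per (label, age_bin) pair.
n_bins = age_bins.max() + 1
strata = labels * n_bins + age_bins

# Rare combinations can break a stratified split, so inspect stratum sizes first.
values, counts = np.unique(strata, return_counts=True)
for value, count in zip(values, counts):
    print(f"stratum {value}: {count} examples")
```

The resulting `strata` array can then be passed as the `stratify` argument of `train_test_split`. Strata with only one member make the stratified split fail, so very small strata are usually merged before splitting.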
# Keep groups intact while preserving class proportions
from sklearn.model_selection import StratifiedGroupKFold
# Example: split by hospital ID while keeping class proportions similar across folds
# Each hospital appears in only one validation fold (no leakage across hospitals)
sgkf = StratifiedGroupKFold(n_splits=5)
for train_idx, val_idx in sgkf.split(X, y, groups=hospital_ids):
    pass  # hospital-level grouping, label-level stratification
# Stratify a continuous variable (e.g., age) by binning
# include_lowest=True keeps age 0 in the first bin instead of producing NaN
df["age_bin"] = pd.cut(df["age"], bins=[0, 40, 60, 80, 120], labels=False, include_lowest=True)
X_train, X_test = train_test_split(df, stratify=df["age_bin"], test_size=0.2)
Bootstrap Sampling
Sample with replacement to estimate variance and confidence intervals.
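In symbols, the percentile interval used in this section can be written as follows. The notation is introduced here for illustration: $m$ is the metric, $B$ is the number of bootstrap resamples, $y^{*(b)}$ and $\hat{y}^{*(b)}$ are the resampled labels and predictions, and $q_p$ is the empirical $p$-quantile of the bootstrap scores.

```latex
\hat{\theta}^{*}_{b} = m\bigl(y^{*(b)}, \hat{y}^{*(b)}\bigr), \qquad b = 1, \dots, B

\mathrm{CI}_{1-\alpha} = \Bigl[\, q_{\alpha/2}\bigl(\hat{\theta}^{*}_{1}, \dots, \hat{\theta}^{*}_{B}\bigr),\;
                               q_{1-\alpha/2}\bigl(\hat{\theta}^{*}_{1}, \dots, \hat{\theta}^{*}_{B}\bigr) \Bigr]
```

For a 95% interval, $\alpha = 0.05$, so the bounds are the 2.5th and 97.5th percentiles of the bootstrap scores.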
def bootstrap_metric(
    y_true: np.ndarray,
    y_pred: np.ndarray,
    metric_fn,  # e.g., sklearn.metrics.roc_auc_score
    n_bootstrap: int = 1000,
    confidence: float = 0.95,
    seed: int = 42,
) -> dict:
    rng = np.random.RandomState(seed)
    bootstrap_scores = []
    n = len(y_true)
    for _ in range(n_bootstrap):
        idx = rng.choice(n, size=n, replace=True)  # sample WITH replacement
        score = metric_fn(y_true[idx], y_pred[idx])
        bootstrap_scores.append(score)
    bootstrap_scores = np.array(bootstrap_scores)
    alpha = 1 - confidence
    return {
        "mean": float(bootstrap_scores.mean()),
        "std": float(bootstrap_scores.std()),
        "ci_lower": float(np.percentile(bootstrap_scores, alpha / 2 * 100)),
        "ci_upper": float(np.percentile(bootstrap_scores, (1 - alpha / 2) * 100)),
    }
# Usage: AUC with 95% confidence interval
from sklearn.metrics import roc_auc_score
result = bootstrap_metric(y_test, y_pred_proba, roc_auc_score)
print(f"AUC = {result['mean']:.3f} (95% CI: {result['ci_lower']:.3f}–{result['ci_upper']:.3f})")
Imbalanced Classes: Oversampling and Undersampling
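SMOTE, used in this section through imbalanced-learn, creates synthetic minority examples by interpolating between a minority point and one of its nearest minority neighbours. The following is a simplified NumPy sketch of that interpolation step only, not the library's implementation; the function name `smote_like_samples` and the demo data are hypothetical.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=5, seed=42):
    """Create n_new synthetic points by interpolating toward random near neighbours."""
    rng = np.random.default_rng(seed)
    n = len(X_minority)
    k = min(k, n - 1)
    # Pairwise Euclidean distances within the minority class.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)              # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)        # starting minority point for each new sample
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]  # one of its k neighbours
    gap = rng.random((n_new, 1))                 # interpolation factor in [0, 1)
    return X_minority[base] + gap * (X_minority[chosen] - X_minority[base])

# Hypothetical minority-class features: 20 points in 2 dimensions.
X_min = np.random.default_rng(0).normal(size=(20, 2))
X_new = smote_like_samples(X_min, n_new=30)
print(X_new.shape)
```

Each synthetic point lies on the line segment between two real minority points, which is why SMOTE fills in the minority region instead of duplicating existing rows.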
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline
# SMOTE: Synthetic Minority Over-sampling Technique
# Generates synthetic examples of the minority class
smote = SMOTE(sampling_strategy=0.5, random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
# Random undersampling: remove majority class examples
under = RandomUnderSampler(sampling_strategy=0.5, random_state=42)
X_under, y_under = under.fit_resample(X_train, y_train)
# Combined pipeline
pipeline = Pipeline([
    ("smote", SMOTE(sampling_strategy=0.3, random_state=42)),
    ("under", RandomUnderSampler(sampling_strategy=0.5, random_state=42)),
])
X_combined, y_combined = pipeline.fit_resample(X_train, y_train)
# Note: resample only the TRAINING data, never the test data
# The test set must reflect the true class distribution for a valid evaluation
Mini-Batch Sampling During Training
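The inverse-frequency weighting used with `WeightedRandomSampler` in this section can be checked without PyTorch. The sketch below uses hypothetical labels with a 9:1 imbalance and shows that each class ends up with the same total sampling probability; `y_demo`, `probs`, and `mass_per_class` are illustrative names.

```python
import numpy as np

# Hypothetical imbalanced labels: 900 of class 0 and 100 of class 1.
y_demo = np.array([0] * 900 + [1] * 100)

class_counts = np.bincount(y_demo)
sample_weights = (1.0 / class_counts)[y_demo]    # one weight per example
probs = sample_weights / sample_weights.sum()    # normalise to a probability distribution

# Total probability mass per class: equal, despite the 9:1 imbalance.
mass_per_class = np.bincount(y_demo, weights=probs)
print(mass_per_class)

# One simulated batch drawn with replacement; the class mix varies from batch to batch.
rng = np.random.default_rng(0)
batch = rng.choice(len(y_demo), size=32, replace=True, p=probs)
print(np.bincount(y_demo[batch], minlength=2))
```

The balance holds in expectation only. Individual batches still fluctuate, and minority examples are repeated far more often than majority ones, which can encourage overfitting to them.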
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler
# Weighted sampling for imbalanced datasets
# Each sample's probability is inversely proportional to class frequency
class_counts = np.bincount(y_train)
class_weights = 1.0 / class_counts
sample_weights = class_weights[y_train]
sampler = WeightedRandomSampler(
    weights=sample_weights,
    num_samples=len(y_train),
    replacement=True,
)
# dataset is assumed to be a torch Dataset over the training examples
loader = DataLoader(
    dataset,
    batch_size=32,
    sampler=sampler,  # use instead of shuffle=True; batches are class-balanced in expectation
)
Interview Answer
"Sampling strategy critically affects both training and evaluation. Random splits are usually fine for large, balanced datasets; for imbalanced classes, use stratified splits, because otherwise the test set may not reflect the true class distribution and the evaluation metrics will be misleading. Bootstrap resampling of the test set estimates metric variance and confidence intervals, so report AUC or accuracy with a 95% CI, especially in clinical settings. For class imbalance, SMOTE oversamples the minority class in the training data only; never resample the test set. During training, weighted random sampling yields mini-batches that are roughly class-balanced in expectation, so the model sees enough minority-class examples per update."