AI Systems · Intermediate

Sampling in Machine Learning

How sampling strategies — random, stratified, systematic, and bootstrap — affect model training and evaluation, with practical implementation.

Asma Hafeez Khan · May 21, 2026 · 4 min read
Tags: Statistics, Sampling, Stratified, Bootstrap, Train-Test Split, Interview

Why Sampling Matters

Every ML pipeline involves sampling decisions:

  • Which examples to include in training
  • How to split train/validation/test
  • How to sample mini-batches during training
  • How to estimate evaluation metric uncertainty

Poor sampling choices create biased models and misleading evaluations.


Simple Random Sampling

Python
import numpy as np
import pandas as pd

np.random.seed(42)
n = 1000
indices = np.random.choice(n, size=200, replace=False)  # 20% sample

# Pandas
df = pd.DataFrame({"x": range(1000), "y": range(1000)})
sample = df.sample(n=200, random_state=42)

# Train/test split (simple random)
# X: feature matrix, y: label vector (assumed to be NumPy arrays already in scope)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Problem: a random split may not preserve class proportions.
# If 5% of examples are class 1, the test set might contain 3% or 8%,
# which makes the evaluation misleading.
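
To see how far a purely random split can drift, the sketch below repeats a 20% random split many times and records the class 1 rate in each test set. It is a NumPy-only illustration under assumed numbers (1,000 examples, 5% positives, 200 repeated splits); the helper name positive_rate_in_test_split is hypothetical.

```python
import numpy as np

def positive_rate_in_test_split(y, test_size=0.2, n_splits=200, seed=0):
    """Class 1 rate of the test set across repeated simple random splits."""
    rng = np.random.default_rng(seed)
    n_test = int(round(len(y) * test_size))
    rates = np.empty(n_splits)
    for i in range(n_splits):
        # A random permutation, truncated, is a sample without replacement
        test_idx = rng.permutation(len(y))[:n_test]
        rates[i] = y[test_idx].mean()
    return rates

# Assumed toy labels: 50 positives out of 1,000 (5%)
y = np.zeros(1000, dtype=int)
y[:50] = 1

rates = positive_rate_in_test_split(y)
print(f"test-set class 1 rate: min={rates.min():.3f}, "
      f"max={rates.max():.3f}, std={rates.std():.3f}")
```

The spread between the minimum and maximum rate is the variation that stratification removes.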

Stratified Sampling

Preserves class (or group) proportions in each split.

Python
from sklearn.model_selection import train_test_split, StratifiedKFold

# Stratified train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,          # preserve class proportions
    random_state=42,
)

print(f"Train class 1 rate: {y_train.mean():.3f}")
print(f"Test class 1 rate:  {y_test.mean():.3f}")
# Should be nearly identical

# Stratified k-fold (for evaluation of imbalanced datasets)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    X_fold_train, X_fold_val = X[train_idx], X[val_idx]
    y_fold_train, y_fold_val = y[train_idx], y[val_idx]
    # ... train and evaluate ...
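
As a fuller sketch of the loop above, the example below fills in the "train and evaluate" step on synthetic data. The dataset (make_classification with roughly 5% positives), the LogisticRegression model, and the average-precision metric are all assumptions chosen for illustration, not recommendations from the article.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data: about 5% of labels are class 1
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.95, 0.05], random_state=0
)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Class 1 rate of every validation fold; stratification keeps these close together
fold_rates = [y[val_idx].mean() for _, val_idx in skf.split(X, y)]
print("overall class 1 rate:", round(y.mean(), 3))
print("per-fold class 1 rate:", np.round(fold_rates, 3))

# cross_val_score runs the fit/score loop once per fold
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=skf, scoring="average_precision"
)
print(f"average precision: {scores.mean():.3f} +/- {scores.std():.3f}")
```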

Stratification Beyond Labels

Python
# Keep groups intact while stratifying on the class label
from sklearn.model_selection import StratifiedGroupKFold

# Example: keep each hospital in a single fold AND roughly preserve class proportions
# No hospital appears in both train and validation (no data leakage across hospitals)
sgkf = StratifiedGroupKFold(n_splits=5)
for train_idx, val_idx in sgkf.split(X, y, groups=hospital_ids):
    pass  # folds are disjoint at the hospital level

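# Stratify on a continuous variable (e.g., age) by binning it first
# Assumes df has an "age" column; ages outside 0-120 would become NaN
df["age_bin"] = pd.cut(
    df["age"], bins=[0, 40, 60, 80, 120], labels=False, include_lowest=True
)
train_df, test_df = train_test_split(
    df, stratify=df["age_bin"], test_size=0.2, random_state=42
)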

Bootstrap Sampling

Sample with replacement to estimate variance and confidence intervals.

Python
def bootstrap_metric(
    y_true: np.ndarray,
    y_pred: np.ndarray,
    metric_fn,           # e.g., sklearn.metrics.roc_auc_score
    n_bootstrap: int = 1000,
    confidence: float = 0.95,
    seed: int = 42,
) -> dict:
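    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    rng = np.random.RandomState(seed)
    bootstrap_scores = []

    n = len(y_true)
    for _ in range(n_bootstrap):
        idx = rng.choice(n, size=n, replace=True)   # sample WITH replacement
        if len(np.unique(y_true[idx])) < 2:
            continue  # skip single-class resamples; metrics such as AUC are undefined there
        score = metric_fn(y_true[idx], y_pred[idx])
        bootstrap_scores.append(score)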
    
    bootstrap_scores = np.array(bootstrap_scores)
    alpha = 1 - confidence
    
    return {
        "mean": float(bootstrap_scores.mean()),
        "std": float(bootstrap_scores.std()),
        "ci_lower": float(np.percentile(bootstrap_scores, alpha / 2 * 100)),
        "ci_upper": float(np.percentile(bootstrap_scores, (1 - alpha / 2) * 100)),
    }

# Usage: AUC with 95% confidence interval
from sklearn.metrics import roc_auc_score
result = bootstrap_metric(y_test, y_pred_proba, roc_auc_score)
print(f"AUC = {result['mean']:.3f} (95% CI: {result['ci_lower']:.3f}–{result['ci_upper']:.3f})")
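
For metrics that are a plain average of per-example scores, such as accuracy, the resampling loop can be vectorized. The sketch below is a self-contained illustration on synthetic labels; the 80% agreement rate, the sample size, and the number of resamples are assumptions, and it uses the same percentile interval as the function above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
y_true = rng.integers(0, 2, size=n)
# Hypothetical predictions that agree with the label about 80% of the time
y_pred = np.where(rng.random(n) < 0.8, y_true, 1 - y_true)

correct = (y_true == y_pred).astype(float)   # per-example accuracy contributions
point = correct.mean()                       # accuracy on the original sample

n_bootstrap = 2000
# Each row of idx is one resample of size n, drawn WITH replacement
idx = rng.integers(0, n, size=(n_bootstrap, n))
scores = correct[idx].mean(axis=1)

ci_lower, ci_upper = np.percentile(scores, [2.5, 97.5])
print(f"accuracy = {point:.3f} (95% CI: {ci_lower:.3f} to {ci_upper:.3f})")
```

Reporting the point estimate from the original sample alongside the interval is a common choice; the interval's width shrinks roughly with the square root of the test-set size.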

Imbalanced Classes: Oversampling and Undersampling

Python
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

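# SMOTE: Synthetic Minority Over-sampling Technique
# Generates synthetic minority examples by interpolating between minority-class neighbors
# A float sampling_strategy is the target minority/majority ratio after resampling
# (floats are only valid for binary classification)
smote = SMOTE(sampling_strategy=0.5, random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)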

# Random undersampling: remove majority class examples
under = RandomUnderSampler(sampling_strategy=0.5, random_state=42)
X_under, y_under = under.fit_resample(X_train, y_train)

# Combined pipeline
pipeline = Pipeline([
    ("smote", SMOTE(sampling_strategy=0.3, random_state=42)),
    ("under", RandomUnderSampler(sampling_strategy=0.5, random_state=42)),
])
X_combined, y_combined = pipeline.fit_resample(X_train, y_train)

# Note: only resample TRAINING data, never test data
# The test set must reflect the true class distribution for a valid evaluation
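
To make the meaning of a float sampling_strategy concrete, here is a NumPy-only sketch of random undersampling to a target minority/majority ratio. It is not imbalanced-learn's implementation; the function name random_undersample and the toy data are hypothetical, and it assumes binary labels.

```python
import numpy as np

def random_undersample(X, y, ratio, seed=0):
    """Drop majority-class rows until n_minority / n_majority equals `ratio`."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    minority_idx = np.flatnonzero(y == minority)
    majority_idx = np.flatnonzero(y != minority)
    # Number of majority rows to keep for the requested ratio
    n_keep = min(len(majority_idx), int(len(minority_idx) / ratio))
    kept_majority = rng.choice(majority_idx, size=n_keep, replace=False)
    keep = np.sort(np.concatenate([minority_idx, kept_majority]))
    return X[keep], y[keep]

# Assumed toy training set: 950 negatives, 50 positives
y_train = np.array([0] * 950 + [1] * 50)
X_train = np.arange(1000).reshape(-1, 1)

X_under, y_under = random_undersample(X_train, y_train, ratio=0.5)
print(np.bincount(y_under))  # 100 majority rows kept, all 50 minority rows kept
```

With ratio=0.5 the minority class ends up half the size of the majority class, not half of the dataset.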

Mini-Batch Sampling During Training

Python
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

# Weighted sampling for imbalanced datasets
# Each sample's probability is inversely proportional to class frequency
class_counts = np.bincount(y_train)
class_weights = 1.0 / class_counts
sample_weights = class_weights[y_train]

sampler = WeightedRandomSampler(
    weights=sample_weights,
    num_samples=len(y_train),
    replacement=True,
)

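loader = DataLoader(
    dataset,            # a torch Dataset aligned with y_train (assumed to be defined)
    batch_size=32,
    sampler=sampler,    # mutually exclusive with shuffle=True; batches are balanced on average, not exactly
)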
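
The effect of inverse-frequency weights can be checked without PyTorch. The sketch below simulates one epoch of weighted draws with NumPy; the 90/10 class split and the batch size of 50 are assumptions, and rng.choice with p= stands in for what WeightedRandomSampler does with replacement=True.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy labels: 900 majority (0) and 100 minority (1) examples
y_train = np.array([0] * 900 + [1] * 100)

class_counts = np.bincount(y_train)
sample_weights = (1.0 / class_counts)[y_train]   # inverse class frequency
probs = sample_weights / sample_weights.sum()    # normalize to probabilities

# One "epoch" of weighted draws with replacement
drawn = rng.choice(len(y_train), size=len(y_train), replace=True, p=probs)
minority_share = y_train[drawn].mean()

# Split the draws into batches of 50 and look at per-batch class balance
batch_shares = y_train[drawn].reshape(-1, 50).mean(axis=1)
print(f"minority share over the epoch: {minority_share:.3f}")
print(f"per-batch minority share: min={batch_shares.min():.2f}, "
      f"max={batch_shares.max():.2f}")
```

The epoch-level share sits near 50%, while individual batches fluctuate around it. Minority examples are also repeated many times within an epoch, which can encourage overfitting to them.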

Interview Answer

"Sampling strategy affects both training and evaluation. Random splits are fine for balanced datasets, but stratified splits are essential for imbalanced classes: otherwise the test set may not reflect the true class distribution, and evaluation metrics will be misleading. Bootstrap sampling estimates metric variance and confidence intervals, so report AUC or accuracy with a 95% CI rather than as a single number, especially in clinical settings. For class imbalance, SMOTE oversamples the minority class in the training data only; never resample the test set. During training, weighted random sampling yields roughly balanced mini-batches, so the model sees enough minority-class examples per update."
