Data Engineering · Intermediate

Statistics Foundations for Data Engineers

The statistical concepts every data engineer must know — from descriptive stats and distributions to the central limit theorem and hypothesis testing, with real pipeline examples.

Learnixo · May 7, 2026 · 10 min read
Statistics · Python · NumPy · SciPy · Data Quality · Pipeline Monitoring

Why Statistics Matters for Data Engineers

Most engineers treat statistics as a data scientist's problem. That's a mistake. Every time you write a pipeline, you're making statistical decisions — whether you know it or not.

Consider three real scenarios:

Scenario 1 — Data Quality: Your daily order ingestion pipeline loaded 98,000 rows. Yesterday it loaded 102,000. Is this normal variance, or did something break upstream? You can't answer that without knowing the distribution of daily row counts.

Scenario 2 — Cost Anomalies: Your cloud spend report shows an average of $12,400/day this month. But last Tuesday it spiked to $31,000. Mean-based alerting missed it. Percentile-based alerting would have caught it immediately.

Scenario 3 — Sampling: You're validating a new transformation on 500M rows. Testing all rows takes 4 hours. A statistically valid sample could give you the same confidence in 3 minutes — if you sample correctly.

This guide gives you the statistical foundation to handle all three.


Descriptive Statistics: When Each Measure Lies

Mean, Median, Mode

These three measures all describe the "center" of a distribution, but they tell completely different stories.

Python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Simulate daily row counts for a pipeline over 90 days
# Most days: 95k-105k rows. Occasionally: incident days with 10k rows.
np.random.seed(42)
normal_days = np.random.normal(loc=100_000, scale=3_000, size=85)
incident_days = np.random.normal(loc=10_000, scale=2_000, size=5)
row_counts = np.concatenate([normal_days, incident_days])
np.random.shuffle(row_counts)
row_counts = row_counts.astype(int)

mean_val   = np.mean(row_counts)
median_val = np.median(row_counts)
mode_val   = stats.mode(row_counts // 5000 * 5000, keepdims=False).mode  # binned mode (keepdims=False for modern SciPy)

print(f"Mean:   {mean_val:,.0f}")
print(f"Median: {median_val:,.0f}")
print(f"Mode:   {mode_val:,.0f} (approx)")

Output:

Mean:   94,722
Median: 99,841
Mode:   100,000 (approx)

The mean is dragged down by the 5 incident days. If you alert when value < mean * 0.8, the suppressed mean lowers your threshold, so partial failures that should fire an alert slip under it. Always use the median for skewed operational data.
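To make that concrete, here's a minimal sketch comparing the two thresholds on a hypothetical partial-failure day (reusing row_counts from above; the 78,000-row figure is illustrative):

Python
# Hypothetical partial failure: roughly 22% below normal volume
day_value = 78_000

mean_threshold   = np.mean(row_counts) * 0.8    # dragged down by incident days
median_threshold = np.median(row_counts) * 0.8  # robust to incident days

print(f"Mean-based alert fires:   {day_value < mean_threshold}")    # False (missed)
print(f"Median-based alert fires: {day_value < median_threshold}")  # True (caught)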

Rule of thumb for data engineers:

  • Mean: use when data is symmetric and you care about total (e.g., computing average latency for capacity planning)
  • Median: use for skewed data, cost distributions, row counts with occasional incidents
  • Mode: use for categorical distributions (e.g., which file format is most common in an S3 prefix)

Variance, Standard Deviation, and What They Tell You

Python
variance = np.var(row_counts)
std_dev   = np.std(row_counts)

print(f"Variance:           {variance:,.0f}")
print(f"Std Deviation:      {std_dev:,.0f}")
print(f"Coefficient of Var: {std_dev / mean_val:.2%}")

The coefficient of variation (std/mean) is more useful than raw std for pipeline monitoring. A std of 5,000 rows means something different for a pipeline averaging 10,000 rows versus 1,000,000 rows.

Practical use: If your pipeline's CV (coefficient of variation) is consistently below 5%, you have a stable, predictable pipeline. If it jumps to 20%+, investigate your upstream sources.
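As a sketch, you could fold this check into a reusable helper (the function name and thresholds mirror the rule of thumb above and are illustrative, not a standard API):

Python
def pipeline_stability(counts: np.ndarray) -> str:
    """Classify pipeline stability by coefficient of variation (CV)."""
    cv = np.std(counts) / np.mean(counts)
    if cv < 0.05:
        return f"stable (CV = {cv:.1%})"
    if cv > 0.20:
        return f"unstable, investigate upstream sources (CV = {cv:.1%})"
    return f"moderate (CV = {cv:.1%})"

print(pipeline_stability(row_counts))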


Percentiles and IQR: The Engineer's Best Friends

Python
# Percentile-based understanding of your data
p25  = np.percentile(row_counts, 25)
p50  = np.percentile(row_counts, 50)   # same as median
p75  = np.percentile(row_counts, 75)
p90  = np.percentile(row_counts, 90)
p95  = np.percentile(row_counts, 95)
p99  = np.percentile(row_counts, 99)

iqr = p75 - p25  # Interquartile range

print(f"P25: {p25:,.0f}  |  P50: {p50:,.0f}  |  P75: {p75:,.0f}")
print(f"P90: {p90:,.0f}  |  P95: {p95:,.0f}  |  P99: {p99:,.0f}")
print(f"IQR: {iqr:,.0f}")

# Tukey fences — classic outlier detection
lower_fence = p25 - 1.5 * iqr
upper_fence = p75 + 1.5 * iqr
outliers = row_counts[(row_counts < lower_fence) | (row_counts > upper_fence)]
print(f"\nOutlier fence: [{lower_fence:,.0f}, {upper_fence:,.0f}]")
print(f"Outliers detected: {len(outliers)} days")
print(f"Outlier values: {sorted(outliers)[:5]}")

Why IQR beats std for alerting: Standard deviation is influenced by the very outliers you're trying to detect — a feedback loop that makes it less sensitive. IQR uses only the middle 50% of your data, making it robust to exactly the kind of anomalies you care about.
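You can check that feedback loop directly: drop the flagged outliers and see how much each measure moves. A quick sanity sketch, reusing the fences computed above:

Python
# Same data with the outlier days removed
clean = row_counts[(row_counts >= lower_fence) & (row_counts <= upper_fence)]
q25_c, q75_c = np.percentile(clean, [25, 75])

print(f"Std with / without outliers: {np.std(row_counts):,.0f} / {np.std(clean):,.0f}")
print(f"IQR with / without outliers: {iqr:,.0f} / {q75_c - q25_c:,.0f}")
# The std collapses once the outliers are gone; the IQR barely moves.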


Skewness: Know What You're Working With

Python
from scipy.stats import skew, kurtosis

# Cloud cost data — classic right skew (most days cheap, rare expensive days)
np.random.seed(0)
costs = np.concatenate([
    np.random.normal(1200, 150, 80),         # normal operations
    1200 + np.random.exponential(500, 15),   # cost spikes above the base load
    [8000, 12000]                            # major incidents
])

print(f"Skewness:  {skew(costs):.3f}")
print(f"Kurtosis:  {kurtosis(costs):.3f}")

Interpreting skewness for data engineers:

  • ā‰ˆ 0: symmetric — mean equals median, std alerting is valid
  • > 1: right-skewed — use log transform or percentile-based thresholds
  • < -1: left-skewed — rare in DE, but watch for data with hard upper caps

Cost data is almost always right-skewed. Never use mean ± 2*std for cost alerting. Use median + 2*IQR instead.
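A minimal sketch of that advice applied to the costs array above:

Python
p25_c, p75_c = np.percentile(costs, [25, 75])
iqr_c = p75_c - p25_c

robust_upper = np.median(costs) + 2 * iqr_c
naive_upper  = np.mean(costs) + 2 * np.std(costs)

print(f"median + 2*IQR threshold: ${robust_upper:,.0f}")
print(f"mean + 2*std threshold:   ${naive_upper:,.0f}  (inflated by the incidents)")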


Understanding Distributions: Which Data Follows Which

Normal Distribution

Python
from scipy.stats import norm, lognorm, poisson

# Normal: query latencies in a well-tuned system
latencies_ms = norm.rvs(loc=45, scale=8, size=10000)
# Reasonable to use mean ± 2*std for alerting here

# Test for normality (Shapiro-Wilk — valid up to ~5000 samples)
from scipy.stats import shapiro
stat, p_value = shapiro(latencies_ms[:500])
print(f"Shapiro-Wilk p-value: {p_value:.4f}")
print(f"Normal? {'Yes' if p_value > 0.05 else 'No'}")

Normal distribution fits well for: query latencies in stable systems, row count deltas day-over-day in mature pipelines, measurement errors in sensors.

Log-Normal Distribution

Python
# Log-normal: file sizes, cost data, query durations with outliers
file_sizes_mb = lognorm.rvs(s=1.2, scale=np.exp(3), size=10000)

# Key: take log first, then the normal rules apply
log_sizes = np.log(file_sizes_mb)
print(f"Raw mean: {np.mean(file_sizes_mb):.1f} MB  |  Raw median: {np.median(file_sizes_mb):.1f} MB")
print(f"Log-space mean: {np.mean(log_sizes):.2f}  |  Log-space std: {np.std(log_sizes):.2f}")

# Convert back for human-readable thresholds
alert_threshold = np.exp(np.mean(log_sizes) + 3 * np.std(log_sizes))
print(f"P99.7 threshold (log-normal): {alert_threshold:.1f} MB")

Log-normal distribution fits well for: S3 file sizes, Spark task durations, cloud costs per service, API response body sizes.

Poisson Distribution

Python
# Poisson: event counts in fixed time windows
# Events per minute in an event stream with average rate 120/min
event_counts = poisson.rvs(mu=120, size=1440)  # one day of minute buckets

print(f"Expected mean: 120  |  Actual mean: {np.mean(event_counts):.1f}")
print(f"Expected std:  {np.sqrt(120):.2f}  |  Actual std: {np.std(event_counts):.2f}")

# For Poisson, variance == mean. Good test for whether your event stream is healthy.
variance_to_mean = np.var(event_counts) / np.mean(event_counts)
print(f"Variance/Mean ratio: {variance_to_mean:.3f}")
print(f"Overdispersed? {'Yes — investigate duplicates or bursts' if variance_to_mean > 1.5 else 'No — stream looks healthy'}")

Poisson distribution fits well for: API calls per minute, Kafka message counts in time windows, database inserts per second, error counts per hour.


The Central Limit Theorem: Why It Matters for Pipeline Monitoring

The CLT states that the distribution of sample means approaches a normal distribution regardless of the underlying distribution's shape — given sufficient sample size (usually n ≄ 30).

Why this matters for data engineers:

Python
# Your pipeline row counts are right-skewed (incident days skew it)
# But the distribution of DAILY AVERAGES over a rolling 7-day window is normal
# This lets you apply mean ± 2*std alerting to rolling averages safely

np.random.seed(42)
skewed_daily_counts = np.concatenate([
    np.random.exponential(scale=90_000, size=300),
    np.random.normal(100_000, 5000, size=65)
])

# Rolling 7-day means — CLT kicks in
window_size = 7
rolling_means = [
    np.mean(skewed_daily_counts[i:i+window_size])
    for i in range(len(skewed_daily_counts) - window_size)
]

print(f"Raw skewness:            {skew(skewed_daily_counts):.3f}")
print(f"Rolling mean skewness:   {skew(rolling_means):.3f}")
# Rolling means will be close to 0 — near-normal

Practical takeaway: Don't alert on single-day values. Alert on 7-day rolling averages. The CLT gives you much more reliable statistical thresholds with far fewer false positives.
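A sketch of what that looks like in practice, reusing rolling_means from above (the two-sided band is one reasonable choice, not the only one):

Python
rm = np.array(rolling_means)
lower_band = rm.mean() - 2 * rm.std()
upper_band = rm.mean() + 2 * rm.std()

# "Today" = the mean of the most recent 7 days
todays_rolling_mean = np.mean(skewed_daily_counts[-7:])
print(f"Alert band: [{lower_band:,.0f}, {upper_band:,.0f}]")
print(f"Fire alert? {not (lower_band <= todays_rolling_mean <= upper_band)}")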


Hypothesis Testing for Data Engineers

The Core Concepts

Null hypothesis (H0):  "Nothing changed — any difference is random variance"
Alt hypothesis  (H1):  "Something actually changed"

p-value:  Probability of seeing a result at least this extreme IF H0 were true
          Small p (< 0.05) → reject H0 → something likely changed

Type I error:  False positive — you say something changed when it didn't
               Controlled by your significance threshold (alpha = 0.05)

Type II error: False negative — you miss a real change
               Controlled by sample size (power analysis)

Practical Example: Did the New ETL Change Data Distribution?

Python
from scipy.stats import ttest_ind, mannwhitneyu

# Row counts before and after a pipeline refactor
np.random.seed(1)
before_refactor = np.random.normal(loc=95_000, scale=4_000, size=30)
after_refactor  = np.random.normal(loc=97_500, scale=4_200, size=30)

# Parametric test (assumes normality)
t_stat, p_val = ttest_ind(before_refactor, after_refactor)
print(f"T-test p-value: {p_val:.4f}")
print(f"Conclusion: {'Significant difference detected' if p_val < 0.05 else 'No significant difference'}")

# Non-parametric alternative (safer for skewed data)
u_stat, p_val_mw = mannwhitneyu(before_refactor, after_refactor, alternative='two-sided')
print(f"Mann-Whitney p-value: {p_val_mw:.4f}")

When to use which test:

  • Data is roughly normal (or samples are large) → t-test
  • Data is skewed (costs, file sizes, latencies) → Mann-Whitney U (non-parametric)
  • Comparing proportions (null rates, error rates) → chi-squared or z-test for proportions (sketched below)
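The first two cases are demonstrated above. For the proportions case, here's a minimal sketch using SciPy's chi-squared contingency test (the error counts are hypothetical):

Python
from scipy.stats import chi2_contingency

# Hypothetical [errors, successes] counts before and after a deploy
table = np.array([
    [120, 99_880],   # before: 0.120% error rate
    [185, 99_815],   # after:  0.185% error rate
])

chi2, p_val_chi, dof, _ = chi2_contingency(table)
print(f"Chi-squared p-value: {p_val_chi:.4f}")
print(f"Error rate changed? {'Likely yes' if p_val_chi < 0.05 else 'Cannot conclude'}")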

Comprehensive Example: Profiling a Real Pipeline Dataset

Python
from scipy import stats as sp_stats

def profile_pipeline_metric(data: np.ndarray, metric_name: str) -> dict:
    """
    Full statistical profile for any pipeline metric.
    Returns a dict suitable for logging to your monitoring system.
    """
    describe = sp_stats.describe(data)
    p25, p50, p75 = np.percentile(data, [25, 50, 75])
    iqr = p75 - p25

    # Normality test (use only on sample <= 5000)
    sample = data[:5000] if len(data) > 5000 else data
    _, normality_pval = sp_stats.shapiro(sample)

    return {
        "metric": metric_name,
        "n":              int(describe.nobs),
        "mean":           round(float(describe.mean), 4),
        "median":         round(float(p50), 4),
        "std":            round(float(np.sqrt(describe.variance)), 4),
        "min":            round(float(describe.minmax[0]), 4),
        "max":            round(float(describe.minmax[1]), 4),
        "p25":            round(float(p25), 4),
        "p75":            round(float(p75), 4),
        "p99":            round(float(np.percentile(data, 99)), 4),
        "iqr":            round(float(iqr), 4),
        "skewness":       round(float(describe.skewness), 4),
        "kurtosis":       round(float(describe.kurtosis), 4),
        "cv_pct":         round(float(np.std(data) / np.mean(data) * 100), 2),
        "is_normal":      normality_pval > 0.05,
        "outlier_lower":  round(float(p25 - 1.5 * iqr), 4),
        "outlier_upper":  round(float(p75 + 1.5 * iqr), 4),
    }


# Example: profile your pipeline's row count history
np.random.seed(42)
row_counts = np.concatenate([
    np.random.normal(100_000, 3_500, 85),
    np.random.normal(12_000, 2_000, 5),
])

profile = profile_pipeline_metric(row_counts, "daily_row_count")
for key, val in profile.items():
    print(f"  {key:<18} {val}")

Output:

  metric             daily_row_count
  n                  90
  mean               94736.1234
  median             99912.4432
  std                18423.2211
  ...
  skewness           -2.3411
  is_normal          False
  outlier_lower      88241.5000
  outlier_upper      113142.8000

Visualising Distributions for Pipeline Review

Python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# 1. Histogram with KDE
axes[0].hist(row_counts, bins=20, density=True, alpha=0.6, color='steelblue')
xmin, xmax = axes[0].get_xlim()
x = np.linspace(xmin, xmax, 100)
axes[0].plot(x, sp_stats.norm.pdf(x, np.mean(row_counts), np.std(row_counts)),
             'r--', label='Normal fit')
axes[0].set_title("Row Count Distribution")
axes[0].legend()

# 2. Box plot — IQR and outliers at a glance
axes[1].boxplot(row_counts, vert=True, patch_artist=True,
                boxprops=dict(facecolor='steelblue', alpha=0.6))
axes[1].set_title("Box Plot — Outliers Visible")

# 3. Q-Q plot — how close to normal?
sp_stats.probplot(row_counts, plot=axes[2])
axes[2].set_title("Q-Q Plot (deviation from normal = skew)")

plt.tight_layout()
plt.savefig("pipeline_row_count_distribution.png", dpi=150)

A Q-Q plot where the points fall along the straight line indicates an approximately normal distribution. Points that curve away at the tails (as you'll see with your skewed pipeline data) tell you to use median-based thresholds.


Key Takeaways

| Situation | Recommended Statistic |
|---|---|
| Symmetric operational data | Mean ± 2 std |
| Skewed cost / row count data | Median ± 2 IQR |
| Alerting thresholds | P95 or P99 from historical baseline |
| Comparing before/after pipeline change | Mann-Whitney U (non-parametric) |
| Detecting distribution change | Kolmogorov-Smirnov test |
| Event stream health check | Variance/mean ratio (should ā‰ˆ 1 for Poisson) |
| Validating normality assumption | Shapiro-Wilk test (n < 5000) |
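One row in that table never appeared in code: the Kolmogorov-Smirnov test for detecting a distribution change. A minimal sketch, reusing the before/after arrays from the refactor example:

Python
from scipy.stats import ks_2samp

ks_stat, ks_pval = ks_2samp(before_refactor, after_refactor)
print(f"KS test p-value: {ks_pval:.4f}")
print(f"Distribution changed? {'Likely yes' if ks_pval < 0.05 else 'Cannot conclude'}")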

The transition from "I eyeball the dashboard" to "I have statistically grounded alerts" is what separates junior data engineers from senior ones. These tools give you that foundation.

Enjoyed this article?

Explore the Data Engineering learning path for more.
