Data Engineering Fundamentals: ETL, ELT, Batch vs Streaming
Master the core concepts of data engineering — ETL vs ELT, pipeline stages, batch vs streaming, data pipeline architecture, and the tools every data engineer uses daily.
What Is Data Engineering?
Data engineers build the infrastructure that moves, transforms, and stores data so that analysts, data scientists, and business stakeholders can use it. They are the plumbers of the data world — without them, data piles up raw and unusable.
A data engineer's daily work involves:
- Ingesting data from source systems (APIs, databases, files, event streams)
- Transforming and cleaning data to make it queryable and consistent
- Loading it into analytical storage (data warehouses, lakes, lakehouses)
- Scheduling, monitoring, and maintaining those pipelines in production
1. ETL vs ELT
These are the two fundamental patterns for data pipelines.
ETL — Extract, Transform, Load
Source DB ──► Extract ──► Transform ──► Load ──► Data Warehouse
                          (in compute)
- Extract raw data from sources
- Transform it in a processing engine (Python, Spark, custom code)
- Load only the clean, structured result into the warehouse
When to use ETL:
- Sensitive data that must be masked/encrypted before storage
- Heavy transformations that the warehouse can't do efficiently
- Legacy pipelines built before cloud warehouses became powerful
ELT — Extract, Load, Transform
Source DB ──► Extract ──► Load (raw) ──► Transform ──► Analytics layer
                          (warehouse)    (SQL in warehouse)
- Extract raw data from sources
- Load it directly into the warehouse as-is
- Transform it inside the warehouse using SQL (dbt, Snowflake, BigQuery)
When to use ELT:
- Modern cloud data warehouses (Snowflake, BigQuery, Redshift) — they're fast and cheap
- When you want to preserve raw data for reprocessing
- SQL-based transformations (dbt models)
- The industry default for analytics engineering
Comparison
| Aspect | ETL | ELT |
|--------|-----|-----|
| Transform location | External compute | Inside warehouse |
| Raw data stored? | No | Yes |
| Flexibility | Lower | Higher (re-transform anytime) |
| Cost | Compute fees | Warehouse query fees |
| Tools | Python, Spark | dbt, SQL, warehouse-native |
| Best for | Sensitive data, complex transforms | Analytics engineering, modern stack |
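To make "transform inside the warehouse" concrete, here is a minimal ELT sketch: the raw data has already been loaded as-is, and the transformation is just SQL that the warehouse executes. The ANALYTICS.RAW.ORDERS_RAW and ANALYTICS.CLEAN.ORDERS names are illustrative, and in a real stack this SELECT would usually live in a dbt model rather than in application code.
# Minimal ELT sketch: the "T" is SQL run by the warehouse itself.
# Table names are illustrative, not from a real schema.
import os
import snowflake.connector

def elt_transform_orders() -> None:
    with snowflake.connector.connect(
        account=os.environ["SNOWFLAKE_ACCOUNT"],
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
    ) as conn:
        # The transform step: the warehouse reads the raw table and writes a clean one
        conn.cursor().execute("""
            CREATE OR REPLACE TABLE ANALYTICS.CLEAN.ORDERS AS
            SELECT
                TRIM(order_id)              AS order_id,
                customer_id,
                TRY_TO_NUMBER(amount)       AS amount_usd,
                TO_TIMESTAMP_TZ(created_at) AS created_at
            FROM ANALYTICS.RAW.ORDERS_RAW
            WHERE order_id IS NOT NULL
        """)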
2. Pipeline Stages
Every data pipeline has these logical stages:
Stage 1: Ingestion (Extract)
Pull data from source systems:
# Pull from REST API
import os
import requests
def ingest_orders_api(date_from: str, date_to: str) -> list[dict]:
    resp = requests.get(
        "https://api.example.com/orders",
        params={"from": date_from, "to": date_to},
        headers={"Authorization": f"Bearer {os.environ['API_TOKEN']}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["orders"]

# Pull from database
import psycopg2

def ingest_from_postgres(query: str) -> list[dict]:
    with psycopg2.connect(os.environ["SOURCE_DB_URL"]) as conn:
        with conn.cursor() as cur:
            cur.execute(query)
            columns = [d[0] for d in cur.description]
            return [dict(zip(columns, row)) for row in cur.fetchall()]
Stage 2: Validation
Fail early — catch bad data before it pollutes the warehouse:
from dataclasses import dataclass
@dataclass
class ValidationResult:
    passed: bool
    errors: list[str]
    row_count: int
def validate_orders(orders: list[dict]) -> ValidationResult:
    errors = []
    required = ["order_id", "customer_id", "amount", "created_at"]
    for i, row in enumerate(orders):
        for field in required:
            if field not in row or row[field] is None:
                errors.append(f"Row {i}: missing {field}")
        # Guard against None before comparing, so a missing amount is an error, not a crash
        amount = row.get("amount")
        if amount is not None and amount < 0:
            errors.append(f"Row {i}: negative amount {amount}")
    return ValidationResult(
        passed=len(errors) == 0,
        errors=errors,
        row_count=len(orders),
    )
Stage 3: Transformation
Clean, normalize, enrich:
import pandas as pd

def transform_orders(raw: list[dict]) -> pd.DataFrame:
    df = pd.DataFrame(raw)
    # Normalize
    df["order_id"] = df["order_id"].astype(str).str.strip()
    df["customer_id"] = df["customer_id"].astype(str)
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
    df["created_at"] = pd.to_datetime(df["created_at"], utc=True)
    # Enrich
    df["year_month"] = df["created_at"].dt.to_period("M").astype(str)
    # df.get falls back to the scalar 1.0 when no exchange_rate column exists
    df["amount_usd"] = df["amount"] * df.get("exchange_rate", 1.0)
    # Drop nulls after coercion
    df = df.dropna(subset=["order_id", "amount"])
    return df
Stage 4: Load
Write to the destination:
import os
import pandas as pd
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas

def load_to_snowflake(df: pd.DataFrame, table: str, database: str, schema: str) -> int:
    with snowflake.connector.connect(
        account=os.environ["SNOWFLAKE_ACCOUNT"],
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
        database=database,
        schema=schema,
    ) as conn:
        # write_pandas bulk-loads the DataFrame and reports how many rows landed
        success, chunks, rows, _ = write_pandas(conn, df, table.upper())
        return rows
3. Batch vs Streaming
This is one of the most important architectural decisions in data engineering.
Batch Processing
Data is collected over a time window, then processed all at once.
Data arrives ──► Accumulate ──► Process at 2am ──► Warehouse
                 (hours/days)
Characteristics:
- High throughput — process millions of rows in one run
- Higher latency — data may be hours or a day old
- Simpler to build, test, and debug
- Cheaper at scale
When to use:
- Daily/weekly reports
- End-of-day financial reconciliation
- ML feature engineering
- Data warehouse loads
Example:
# Run at midnight daily
from datetime import date, timedelta

class PipelineError(Exception):
    """Raised when validation fails; defined here so the example runs standalone."""

def daily_sales_pipeline():
    yesterday = (date.today() - timedelta(days=1)).isoformat()
    raw = ingest_orders_api(yesterday, yesterday)
    validated = validate_orders(raw)
    if not validated.passed:
        raise PipelineError(validated.errors)
    df = transform_orders(raw)
    rows = load_to_snowflake(df, "ORDERS_DAILY", "ANALYTICS", "RAW")
    return rows
Streaming Processing
Data is processed record-by-record or in micro-batches as it arrives.
Event ──► Message Queue ──► Stream Processor ──► Output (seconds)
          (Kafka/Kinesis)   (Flink/Spark Streaming)
Characteristics:
- Low latency — seconds to milliseconds
- Complex to build, test, and operate
- More expensive to operate for the same data volume
- Stateful operations (windowed aggregations, joins) are harder to get right
When to use:
- Real-time dashboards
- Fraud detection
- Live notifications
- IoT sensor data
- Clickstream analytics
Common tools:

| Tool | Type | Use case |
|------|------|----------|
| Apache Kafka | Message queue | Event streaming backbone |
| AWS Kinesis | Managed streaming | AWS ecosystem |
| Apache Flink | Stream processor | Stateful stream processing |
| Spark Structured Streaming | Micro-batch | Existing Spark teams |
| dbt + Snowflake Dynamic Tables | Near-real-time | SQL-based streaming |
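For a feel of what record-by-record processing looks like, here is a minimal consumer sketch. It assumes the confluent-kafka Python package, a broker on localhost, and a hypothetical orders topic; a production consumer would add schema handling, error routing, and careful offset management.
# Minimal streaming consumer sketch (assumes confluent-kafka and a hypothetical
# "orders" topic; not production-ready).
import json
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # assumed local broker
    "group.id": "orders-alerts",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["orders"])

try:
    while True:
        msg = consumer.poll(timeout=1.0)  # wait up to 1s for the next record
        if msg is None:
            continue
        if msg.error():
            print(f"Consumer error: {msg.error()}")
            continue
        order = json.loads(msg.value())
        # Act on each event as it arrives, e.g. flag unusually large orders
        if order.get("amount", 0) > 10_000:
            print(f"Review order {order.get('order_id')}: amount {order['amount']}")
finally:
    consumer.close()
The same logic in batch would run once over yesterday's rows; here it runs continuously, which is where most of the extra operational complexity comes from.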
When to use what
Latency needed?
< 1 second → Streaming (Kafka + Flink)
1-60 seconds → Micro-batch (Spark Streaming, Kinesis)
minutes → Near-real-time (Snowflake Dynamic Tables, dbt)
hours/daily → Batch (Airflow + Python + dbt)
4. Pipeline Architecture Patterns
Medallion Architecture (Bronze / Silver / Gold)
The most common pattern in modern data engineering:
Raw Sources
    │
    ▼
Bronze Layer ── Raw, unmodified data. Full history. Append-only.
                S3/ADLS/Snowflake stage
    │
    ▼
Silver Layer ── Cleaned, validated, typed, deduplicated.
                One table per entity. dbt models.
    │
    ▼
Gold Layer ──── Business-ready aggregations. Facts & dims.
                What analysts and BI tools query.

-- Bronze: raw JSON flattened to columns, no cleaning
CREATE TABLE bronze.orders_raw (
_ingested_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
_source VARCHAR,
order_id VARCHAR, -- may have nulls, wrong types
amount VARCHAR, -- stored as string from JSON
raw_payload VARIANT -- full JSON preserved
);
-- Silver: cleaned, typed, deduplicated
CREATE TABLE silver.orders (
order_id VARCHAR PRIMARY KEY,
customer_id VARCHAR NOT NULL,
amount_usd DECIMAL(12, 2) NOT NULL,
created_at TIMESTAMP_TZ NOT NULL,
_updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Gold: aggregated, business-ready
CREATE TABLE gold.daily_revenue (
date DATE,
total_revenue DECIMAL(15, 2),
order_count INTEGER,
avg_order_value DECIMAL(10, 2)
);
Lambda Architecture
Combines batch and streaming layers:
Input ──► Stream Layer (fast, approximate) ──► Serving Layer
  │                                                 ▲
  └► Batch Layer (accurate, slow) ──────────────────┘

Used in large-scale systems that need both real-time views and accurate historical data. The problem: you maintain two systems — the same logic lives in two places. Most teams prefer the Kappa architecture (streaming only) or the medallion pattern instead.
5. Data Pipeline Checklist
Before any pipeline goes to production:
Reliability:
- [ ] Retry logic on transient failures (API timeouts, DB hiccups; see the retry sketch below)
- [ ] Idempotent loads — re-running produces same result (see the load sketch after this checklist)
- [ ] Dead letter queue / error table for bad records
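As a concrete version of the retry item above, here is a minimal backoff helper using only the standard library; the attempt count, delays, and the broad except clause are illustrative and would be tuned per source.
# Minimal retry-with-backoff sketch (stdlib only; tune attempts/delays per source,
# and catch only genuinely transient exceptions in real code).
import time
import logging

def with_retries(func, *, attempts: int = 3, base_delay: float = 2.0):
    """Call func(); on failure wait base_delay * 2**n seconds and try again."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as exc:
            if attempt == attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            logging.warning("Attempt %d failed (%s); retrying in %.0fs", attempt, exc, delay)
            time.sleep(delay)

# Usage: wrap the flaky ingestion call from Stage 1
# orders = with_retries(lambda: ingest_orders_api("2024-01-01", "2024-01-01"))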
Observability:
- [ ] Logging at each stage (rows ingested, rows rejected, duration)
- [ ] Alerting on failure (PagerDuty, Slack, email)
- [ ] Row count checks between stages
Data Quality:
- [ ] Schema validation (required fields, types)
- [ ] Business rule validation (amounts > 0, dates in range)
- [ ] Freshness check (is data newer than expected?)
Operations:
- [ ] Backfill capability (re-run from a past date)
- [ ] Configurable date range (not hardcoded yesterday)
- [ ] Environment separation (dev, staging, prod)
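To make the idempotency and backfill items concrete, here is a sketch of a delete-then-insert load keyed on a run date: re-running it for the same date replaces that date's rows instead of duplicating them, and the same function backfills any past date. The ORDERS_DAILY table and run_date parameter are illustrative; warehouses with MERGE support can do the same in one statement.
# Idempotent, backfillable load sketch (table name and run_date are illustrative).
import os
import pandas as pd
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas

def load_orders_for_date(df: pd.DataFrame, run_date: str) -> int:
    with snowflake.connector.connect(
        account=os.environ["SNOWFLAKE_ACCOUNT"],
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
        database="ANALYTICS",
        schema="RAW",
    ) as conn:
        # Delete the date's partition first so a re-run or backfill is a clean replace
        conn.cursor().execute(
            "DELETE FROM ORDERS_DAILY WHERE created_at::date = %s", (run_date,)
        )
        _, _, rows, _ = write_pandas(conn, df, "ORDERS_DAILY")
        return rows

# Backfill: the same call works for any past date, not just yesterday
# load_orders_for_date(df, "2024-01-01")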
6. Tools Overview
| Category | Tools |
|----------|-------|
| Orchestration | Apache Airflow, Prefect, Dagster, Mage |
| Ingestion | Fivetran, Airbyte, Stitch, custom Python |
| Transformation | dbt (SQL), Spark (Python), pandas |
| Storage | Snowflake, BigQuery, Redshift, Delta Lake |
| Streaming | Kafka, Kinesis, Flink, Spark Streaming |
| Data Quality | Great Expectations, dbt tests, Soda |
| Monitoring | Datadog, Monte Carlo, Bigeye |
Summary
| Concept | Key Point |
|---------|-----------|
| ETL | Transform outside warehouse — use for sensitive data, complex transforms |
| ELT | Transform inside warehouse — the modern default with dbt + Snowflake |
| Batch | Simple, cheap, high-latency — daily/hourly reports |
| Streaming | Complex, expensive, low-latency — real-time events |
| Medallion | Bronze (raw) → Silver (clean) → Gold (business-ready) |
| Validation | Check schema + business rules before every load |
| Idempotency | Re-running a pipeline must produce the same result |
Next: orchestrating pipelines with Apache Airflow.
Enjoyed this article?
Explore the Data Engineering learning path for more.