Data Engineering Fundamentals: ETL, ELT, Batch vs Streaming
Master the core concepts of data engineering — ETL vs ELT, pipeline stages, batch vs streaming, data pipeline architecture, and the tools every data engineer uses daily.
What Is Data Engineering?
Data engineers build the infrastructure that moves, transforms, and stores data so that analysts, data scientists, and business stakeholders can use it. They are the plumbers of the data world — without them, data piles up raw and unusable.
A data engineer's daily work involves:
- Ingesting data from source systems (APIs, databases, files, event streams)
- Transforming and cleaning data to make it queryable and consistent
- Loading it into analytical storage (data warehouses, lakes, lakehouses)
- Scheduling, monitoring, and maintaining those pipelines in production
1. ETL vs ELT
These are the two fundamental patterns for data pipelines.
ETL — Extract, Transform, Load
Source DB ──► Extract ──► Transform ──► Load ──► Data Warehouse
                          (in compute)
- Extract raw data from sources
- Transform it in a processing engine (Python, Spark, custom code)
- Load only the clean, structured result into the warehouse
When to use ETL:
- Sensitive data that must be masked/encrypted before storage
- Heavy transformations that the warehouse can't do efficiently
- Legacy pipelines built before cloud warehouses became powerful
ELT — Extract, Load, Transform
Source DB ──► Extract ──► Load (raw) ──► Transform ──► Analytics layer
                          (warehouse)    (SQL in warehouse)
- Extract raw data from sources
- Load it directly into the warehouse as-is
- Transform it inside the warehouse using SQL (dbt, Snowflake, BigQuery)
When to use ELT:
- Modern cloud data warehouses (Snowflake, BigQuery, Redshift) — they're fast and cheap
- When you want to preserve raw data for reprocessing
- SQL-based transformations (dbt models)
- The industry default for analytics engineering
Comparison
| Aspect | ETL | ELT |
|--------|-----|-----|
| Transform location | External compute | Inside warehouse |
| Raw data stored? | No | Yes |
| Flexibility | Lower | Higher (re-transform anytime) |
| Cost | Compute fees | Warehouse query fees |
| Tools | Python, Spark | dbt, SQL, warehouse-native |
| Best for | Sensitive data, complex transforms | Analytics engineering, modern stack |
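To make "transform inside the warehouse" concrete, here is a minimal ELT sketch: the raw data has already been loaded as-is, and the transformation is just SQL that the warehouse executes. The ANALYTICS.RAW.ORDERS_RAW and ANALYTICS.CLEAN.ORDERS names are illustrative, and in a real stack this SELECT would usually live in a dbt model rather than in application code.
# Minimal ELT sketch: the "T" is SQL run by the warehouse itself.
# Table names are illustrative, not from a real schema.
import os
import snowflake.connector

def elt_transform_orders() -> None:
    with snowflake.connector.connect(
        account=os.environ["SNOWFLAKE_ACCOUNT"],
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
    ) as conn:
        # The transform step: the warehouse reads the raw table and writes a clean one
        conn.cursor().execute("""
            CREATE OR REPLACE TABLE ANALYTICS.CLEAN.ORDERS AS
            SELECT
                TRIM(order_id)              AS order_id,
                customer_id,
                TRY_TO_NUMBER(amount)       AS amount_usd,
                TO_TIMESTAMP_TZ(created_at) AS created_at
            FROM ANALYTICS.RAW.ORDERS_RAW
            WHERE order_id IS NOT NULL
        """)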
2. Pipeline Stages
Every data pipeline has these logical stages:
Stage 1: Ingestion (Extract)
Pull data from source systems:
# Pull from REST API
import os
import requests
def ingest_orders_api(date_from: str, date_to: str) -> list[dict]:
    resp = requests.get(
        "https://api.example.com/orders",
        params={"from": date_from, "to": date_to},
        headers={"Authorization": f"Bearer {os.environ['API_TOKEN']}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["orders"]

# Pull from database
import psycopg2

def ingest_from_postgres(query: str) -> list[dict]:
    with psycopg2.connect(os.environ["SOURCE_DB_URL"]) as conn:
        with conn.cursor() as cur:
            cur.execute(query)
            columns = [d[0] for d in cur.description]
            return [dict(zip(columns, row)) for row in cur.fetchall()]
Stage 2: Validation
Fail early — catch bad data before it pollutes the warehouse:
from dataclasses import dataclass
@dataclass
class ValidationResult:
    passed: bool
    errors: list[str]
    row_count: int
def validate_orders(orders: list[dict]) -> ValidationResult:
    errors = []
    required = ["order_id", "customer_id", "amount", "created_at"]
    for i, row in enumerate(orders):
        for field in required:
            if field not in row or row[field] is None:
                errors.append(f"Row {i}: missing {field}")
        # Guard against None before comparing, so a missing amount is an error, not a crash
        amount = row.get("amount")
        if amount is not None and amount < 0:
            errors.append(f"Row {i}: negative amount {amount}")
    return ValidationResult(
        passed=len(errors) == 0,
        errors=errors,
        row_count=len(orders),
    )
Stage 3: Transformation
Clean, normalize, enrich:
import pandas as pd

def transform_orders(raw: list[dict]) -> pd.DataFrame:
    df = pd.DataFrame(raw)
    # Normalize
    df["order_id"] = df["order_id"].astype(str).str.strip()
    df["customer_id"] = df["customer_id"].astype(str)
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
    df["created_at"] = pd.to_datetime(df["created_at"], utc=True)
    # Enrich
    df["year_month"] = df["created_at"].dt.to_period("M").astype(str)
    # df.get falls back to the scalar 1.0 when no exchange_rate column exists
    df["amount_usd"] = df["amount"] * df.get("exchange_rate", 1.0)
    # Drop nulls after coercion
    df = df.dropna(subset=["order_id", "amount"])
    return df
Stage 4: Load
Write to the destination:
import os
import pandas as pd
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas

def load_to_snowflake(df: pd.DataFrame, table: str, database: str, schema: str) -> int:
    with snowflake.connector.connect(
        account=os.environ["SNOWFLAKE_ACCOUNT"],
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
        database=database,
        schema=schema,
    ) as conn:
        # write_pandas bulk-loads the DataFrame and reports how many rows landed
        success, chunks, rows, _ = write_pandas(conn, df, table.upper())
        return rows
3. Batch vs Streaming
This is one of the most important architectural decisions in data engineering.
Batch Processing
Data is collected over a time window, then processed all at once.
Data arrives ──► Accumulate ──► Process at 2am ──► Warehouse
                 (hours/days)
Characteristics:
- High throughput — process millions of rows in one run
- Higher latency — data may be hours or a day old
- Simpler to build, test, and debug
- Cheaper at scale
When to use:
- Daily/weekly reports
- End-of-day financial reconciliation
- ML feature engineering
- Data warehouse loads
Example:
# Run at midnight daily
from datetime import date, timedelta

class PipelineError(Exception):
    """Raised when validation fails; defined here so the example runs standalone."""

def daily_sales_pipeline():
    yesterday = (date.today() - timedelta(days=1)).isoformat()
    raw = ingest_orders_api(yesterday, yesterday)
    validated = validate_orders(raw)
    if not validated.passed:
        raise PipelineError(validated.errors)
    df = transform_orders(raw)
    rows = load_to_snowflake(df, "ORDERS_DAILY", "ANALYTICS", "RAW")
    return rows
Streaming Processing
Data is processed record-by-record or in micro-batches as it arrives.
Event ──► Message Queue ──► Stream Processor ──► Output (seconds)
          (Kafka/Kinesis)   (Flink/Spark Streaming)
Characteristics:
- Low latency — seconds to milliseconds
- Complex to build, test, and operate
- More expensive to operate for the same data volume
- Stateful operations (windowed aggregations, joins) are harder to get right
When to use:
- Real-time dashboards
- Fraud detection
- Live notifications
- IoT sensor data
- Clickstream analytics
Common tools:

| Tool | Type | Use case |
|------|------|----------|
| Apache Kafka | Message queue | Event streaming backbone |
| AWS Kinesis | Managed streaming | AWS ecosystem |
| Apache Flink | Stream processor | Stateful stream processing |
| Spark Structured Streaming | Micro-batch | Existing Spark teams |
| dbt + Snowflake Dynamic Tables | Near-real-time | SQL-based streaming |
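For a feel of what record-by-record processing looks like, here is a minimal consumer sketch. It assumes the confluent-kafka Python package, a broker on localhost, and a hypothetical orders topic; a production consumer would add schema handling, error routing, and careful offset management.
# Minimal streaming consumer sketch (assumes confluent-kafka and a hypothetical
# "orders" topic; not production-ready).
import json
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # assumed local broker
    "group.id": "orders-alerts",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["orders"])

try:
    while True:
        msg = consumer.poll(timeout=1.0)  # wait up to 1s for the next record
        if msg is None:
            continue
        if msg.error():
            print(f"Consumer error: {msg.error()}")
            continue
        order = json.loads(msg.value())
        # Act on each event as it arrives, e.g. flag unusually large orders
        if order.get("amount", 0) > 10_000:
            print(f"Review order {order.get('order_id')}: amount {order['amount']}")
finally:
    consumer.close()
The same logic in batch would run once over yesterday's rows; here it runs continuously, which is where most of the extra operational complexity comes from.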
When to use what
Latency needed?
< 1 second → Streaming (Kafka + Flink)
1-60 seconds → Micro-batch (Spark Streaming, Kinesis)
minutes → Near-real-time (Snowflake Dynamic Tables, dbt)
hours/daily → Batch (Airflow + Python + dbt)
4. Pipeline Architecture Patterns
Medallion Architecture (Bronze / Silver / Gold)
The most common pattern in modern data engineering:
Raw Sources
    │
    ▼
Bronze Layer ── Raw, unmodified data. Full history. Append-only.
                S3/ADLS/Snowflake stage
    │
    ▼
Silver Layer ── Cleaned, validated, typed, deduplicated.
                One table per entity. dbt models.
    │
    ▼
Gold Layer ──── Business-ready aggregations. Facts & dims.
                What analysts and BI tools query.

-- Bronze: raw JSON flattened to columns, no cleaning
CREATE TABLE bronze.orders_raw (
_ingested_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
_source VARCHAR,
order_id VARCHAR, -- may have nulls, wrong types
amount VARCHAR, -- stored as string from JSON
raw_payload VARIANT -- full JSON preserved
);
-- Silver: cleaned, typed, deduplicated
CREATE TABLE silver.orders (
order_id VARCHAR PRIMARY KEY,
customer_id VARCHAR NOT NULL,
amount_usd DECIMAL(12, 2) NOT NULL,
created_at TIMESTAMP_TZ NOT NULL,
_updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Gold: aggregated, business-ready
CREATE TABLE gold.daily_revenue (
date DATE,
total_revenue DECIMAL(15, 2),
order_count INTEGER,
avg_order_value DECIMAL(10, 2)
);
Lambda Architecture
Combines batch and streaming layers:
Input ──► Stream Layer (fast, approximate) ──► Serving Layer
  │                                                 ▲
  └► Batch Layer (accurate, slow) ──────────────────┘

Used in large-scale systems that need both real-time views and accurate historical data. The problem: you maintain two systems — the same logic lives in two places. Most teams prefer the Kappa architecture (streaming only) or the medallion pattern instead.
5. Data Pipeline Checklist
Before any pipeline goes to production:
Reliability:
- [ ] Retry logic on transient failures (API timeouts, DB hiccups; see the retry sketch below)
- [ ] Idempotent loads — re-running produces same result (see the load sketch after this checklist)
- [ ] Dead letter queue / error table for bad records
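As a concrete version of the retry item above, here is a minimal backoff helper using only the standard library; the attempt count, delays, and the broad except clause are illustrative and would be tuned per source.
# Minimal retry-with-backoff sketch (stdlib only; tune attempts/delays per source,
# and catch only genuinely transient exceptions in real code).
import time
import logging

def with_retries(func, *, attempts: int = 3, base_delay: float = 2.0):
    """Call func(); on failure wait base_delay * 2**n seconds and try again."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as exc:
            if attempt == attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            logging.warning("Attempt %d failed (%s); retrying in %.0fs", attempt, exc, delay)
            time.sleep(delay)

# Usage: wrap the flaky ingestion call from Stage 1
# orders = with_retries(lambda: ingest_orders_api("2024-01-01", "2024-01-01"))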
Observability:
- [ ] Logging at each stage (rows ingested, rows rejected, duration)
- [ ] Alerting on failure (PagerDuty, Slack, email)
- [ ] Row count checks between stages
Data Quality:
- [ ] Schema validation (required fields, types)
- [ ] Business rule validation (amounts > 0, dates in range)
- [ ] Freshness check (is data newer than expected?)
Operations:
- [ ] Backfill capability (re-run from a past date)
- [ ] Configurable date range (not hardcoded yesterday)
- [ ] Environment separation (dev, staging, prod)
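To make the idempotency and backfill items concrete, here is a sketch of a delete-then-insert load keyed on a run date: re-running it for the same date replaces that date's rows instead of duplicating them, and the same function backfills any past date. The ORDERS_DAILY table and run_date parameter are illustrative; warehouses with MERGE support can do the same in one statement.
# Idempotent, backfillable load sketch (table name and run_date are illustrative).
import os
import pandas as pd
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas

def load_orders_for_date(df: pd.DataFrame, run_date: str) -> int:
    with snowflake.connector.connect(
        account=os.environ["SNOWFLAKE_ACCOUNT"],
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
        database="ANALYTICS",
        schema="RAW",
    ) as conn:
        # Delete the date's partition first so a re-run or backfill is a clean replace
        conn.cursor().execute(
            "DELETE FROM ORDERS_DAILY WHERE created_at::date = %s", (run_date,)
        )
        _, _, rows, _ = write_pandas(conn, df, "ORDERS_DAILY")
        return rows

# Backfill: the same call works for any past date, not just yesterday
# load_orders_for_date(df, "2024-01-01")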
6. Tools Overview
| Category | Tools |
|----------|-------|
| Orchestration | Apache Airflow, Prefect, Dagster, Mage |
| Ingestion | Fivetran, Airbyte, Stitch, custom Python |
| Transformation | dbt (SQL), Spark (Python), pandas |
| Storage | Snowflake, BigQuery, Redshift, Delta Lake |
| Streaming | Kafka, Kinesis, Flink, Spark Streaming |
| Data Quality | Great Expectations, dbt tests, Soda |
| Monitoring | Datadog, Monte Carlo, Bigeye |
Summary
| Concept | Key Point |
|---------|-----------|
| ETL | Transform outside warehouse — use for sensitive data, complex transforms |
| ELT | Transform inside warehouse — the modern default with dbt + Snowflake |
| Batch | Simple, cheap, high-latency — daily/hourly reports |
| Streaming | Complex, expensive, low-latency — real-time events |
| Medallion | Bronze (raw) → Silver (clean) → Gold (business-ready) |
| Validation | Check schema + business rules before every load |
| Idempotency | Re-running a pipeline must produce the same result |
Next: orchestrating pipelines with Apache Airflow.
Enjoyed this article?
Explore the Data Engineering learning path for more.