Data Engineering Interview Prep · Lesson 2 of 3
System Design: 20 Data Engineering Scenarios
How System Design Interviews Work for Data Engineers
Unlike software engineering system design, DE system design focuses on:
- Throughput and latency trade-offs — batch vs. streaming vs. micro-batch
- Data reliability — exactly-once semantics, idempotency, late data
- Storage design — partitioning, clustering, file format choices
- Operational concerns — schema evolution, backfilling, cost management
For each question below: read the question, identify the key dimensions the interviewer is testing, then study the answer.
Q1: Design a Real-Time Clickstream Pipeline
Question: Design a system to collect, process, and store clickstream events from a web application. The system must support 100k events/second peak load, real-time dashboards with 5-second latency, and long-term analytical queries.
Key dimensions: Ingest scalability, stream processing, storage tiers, latency budget, backpressure handling.
Architecture:
Web Clients
│
▼
[Load Balancer]
│
▼
[Kafka Cluster — 12 partitions, 3 brokers]
│ Topic: clickstream.raw (retention: 7d)
│
├──────────────────────────────────┐
▼ ▼
[Spark Structured [Flink — real-time]
Streaming — micro-batch] (5s processing window)
│ │
▼ ▼
[Delta Lake [Redis / ClickHouse]
Bronze/Silver/Gold] (real-time dashboard store)
│
▼
[Snowflake — Analytics]Complete answer:
Ingestion layer: Kafka with 12 partitions (partitioned by user_id to maintain event order per user). Producers use async send with acks=1 for low latency. Set max.in.flight.requests.per.connection=1 if ordering is critical.
Streaming processing (two paths):
- Hot path (Flink): 5-second tumbling windows, aggregate page views and clicks per user per session. Write to Redis for the real-time dashboard. Flink checkpoints every 30s to Kafka offsets — this gives exactly-once semantics with Kafka transactional producers.
- Cold path (Spark Structured Streaming): Micro-batch every 2 minutes. Write raw events to Delta Lake Bronze layer. Transform to Silver (cleaned, deduplicated). Build Gold tables (session aggregations, funnel metrics).
Storage:
- Delta Lake Bronze: raw events, partitioned by
dateandhour. Compaction job runs hourly to merge small files. - Snowflake: loaded from Delta Gold daily, used for ad-hoc analytical queries.
- Retention: Delta Lake retains 90 days. Snowflake retains 3 years.
Capacity: At 100k events/sec, assuming 500 bytes average event size = 50 MB/s = 4.3 TB/day. Use Parquet with Snappy compression (3:1 ratio) = ~1.5 TB/day in Delta Lake.
Q2: Design an Idempotent Batch Pipeline
Question: Design a batch pipeline that loads data from a PostgreSQL OLTP database to a Snowflake data warehouse nightly. It must be idempotent — re-running the same pipeline for the same date must produce exactly the same result without duplicates.
Key dimensions: Idempotency strategy, incremental vs full load, watermark management.
Architecture:
[PostgreSQL OLTP]
│
│ Extract (changed rows only)
▼
[S3 Landing Zone]
│ File: orders_2026-05-07_v1.parquet
│ Idempotency key: date + run_id
▼
[Snowflake Staging Table] ← truncated before each load
│
│ MERGE into warehouse (upsert by primary key)
▼
[Snowflake Warehouse Table]
│
▼
[Audit Log: pipeline_runs]Idempotency design:
-- Idempotency pattern using a staging table + MERGE
-- Step 1: Truncate staging (safe to repeat)
TRUNCATE TABLE orders_staging;
-- Step 2: Load today's extract into staging
COPY INTO orders_staging
FROM @s3_stage/orders/2026-05-07/
FILE_FORMAT = (TYPE = PARQUET);
-- Step 3: MERGE into warehouse (idempotent — same result on repeat)
MERGE INTO orders_warehouse AS target
USING orders_staging AS source
ON target.order_id = source.order_id
WHEN MATCHED AND source.updated_at > target.updated_at
THEN UPDATE SET
target.status = source.status,
target.revenue = source.revenue,
target.updated_at = source.updated_at
WHEN NOT MATCHED
THEN INSERT (order_id, customer_id, status, revenue, created_at, updated_at)
VALUES (source.order_id, source.customer_id, source.status,
source.revenue, source.created_at, source.updated_at);
-- Step 4: Log the run (also idempotent with MERGE)
MERGE INTO pipeline_runs AS t
USING (SELECT 'orders' AS pipeline, '2026-05-07' AS run_date) AS s
ON t.pipeline = s.pipeline AND t.run_date = s.run_date
WHEN MATCHED THEN UPDATE SET t.last_run_ts = CURRENT_TIMESTAMP, t.status = 'success'
WHEN NOT MATCHED THEN INSERT VALUES (s.pipeline, s.run_date, CURRENT_TIMESTAMP, 'success');Q3: Design a Data Warehouse for E-Commerce
Question: Design the data warehouse schema for an e-commerce company. Must support: sales analytics, customer lifetime value, product performance, and marketing attribution.
Key dimensions: Star schema design, SCD handling, fact table granularity.
Schema Design:
┌────────────────┐
│ dim_date │
│ date_key PK │
│ day, week │
│ month, quarter│
└───────┬────────┘
│
┌──────────────┐ │ ┌──────────────────┐
│ dim_customer│ │ │ dim_product │
│ customer_key│◄────────┤ │ product_key │
│ customer_id │ │ │ product_id │
│ email │ │ │ name, category │
│ tier │ │ │ unit_cost │
│ valid_from │ │ └────────┬─────────┘
│ valid_to │ ┌────▼─────┐ │
│ is_current │ │ fact_ │◄───────────┤
└──────────────┘ │ orders │ │
│ │ ┌────────┴──────────┐
┌──────────────┐ │order_key │ │ dim_channel │
│ dim_channel │◄───│date_key │ │ channel_key │
│ channel_key │ │customer_k│ │ name, type │
│ name, type │ │product_ke│ └───────────────────┘
└──────────────┘ │channel_ke│
│revenue │
│quantity │
│discount │
│margin │
└──────────┘Design decisions to discuss in the interview:
-
Granularity:
fact_ordersis at order-line level (one row per product per order). This allows product-level analysis without joiningorder_items. For marketing attribution, add a separatefact_touchpointstable. -
SCD Type 2 on dim_customer: Use
valid_from,valid_to,is_currentto track tier changes over time. This lets you answer "what tier was this customer when they made this purchase?" -
Degenerate dimensions:
order_idlives directly in the fact table (degenerate dimension) since you don't need a dim_order table — the order header information is fully captured in the fact row. -
Slowly Changing Dimension Type 1 vs 2:
- Type 1 (overwrite): product price — you want current price for margin calculations
- Type 2 (history): customer tier, address — you want historical context for past orders
Q4: Design a Data Quality Monitoring System
Question: Design a system that continuously monitors data quality across 50+ pipeline tables and alerts on anomalies.
Architecture:
[All Pipeline Tables in DW]
│
▼
[Quality Check Scheduler — Airflow]
│ Runs quality checks per table after each pipeline load
│
▼
[Quality Engine — Python service]
│ Runs: null rates, row count anomalies, distribution drift,
│ referential integrity, freshness checks
│
▼
[quality_checks_log table — Snowflake]
│ Columns: table, check_type, run_id, passed, observed_value, threshold, ts
│
├──────────────────────┐
▼ ▼
[Quality Dashboard] [Alert Router]
(Grafana/Superset) │
├── Slack (warnings)
└── PagerDuty (critical failures)Key design decisions:
- Check configuration as code: Store thresholds in YAML per table. Version-controlled. No UI needed to change thresholds.
- Escalation logic: First failure → Slack warning. Three consecutive failures → PagerDuty. This reduces alert fatigue from transient issues.
- Check parallelism: Run checks for different tables in parallel. Each check must be idempotent (tagged with
run_id). - Baseline learning: For row count and mean checks, auto-calculate the baseline from the last 30 runs on first deployment. Re-baseline on explicit approval after intentional schema changes.
Q5: Design the Ingestion Architecture for 50 Microservices Sending Events
Question: 50 microservices each emit domain events (order placed, payment processed, inventory updated). Design an ingestion architecture that's decoupled, scalable, and allows each service team to evolve their schema independently.
Architecture:
[Microservice A] [Microservice B] [Microservice N]
│ │ │
└─────────────────┴─────────────────┘
│
▼
[Kafka — per-domain topics]
orders.placed
payments.processed
inventory.updated
(partitioned by entity_id)
│
┌────┴────┐
▼ ▼
[Schema [Dead Letter
Registry] Queue Topic]
(Confluent)
│
▼
[Kafka Connect — S3 Sink]
or [Spark Structured Streaming]
│
▼
[Delta Lake — Raw Events]
Partition: domain / date / hourSchema evolution strategy: Use Apache Avro schemas registered in Confluent Schema Registry. Configure compatibility as BACKWARD — new schema versions must be readable by old consumers. This allows producer teams to add fields without breaking the ingestion pipeline.
Dead letter queue: Events that fail schema validation or deserialization go to events.dlq topic. A monitoring job alerts if DLQ lag grows above threshold.
Q6: How Would You Handle Late-Arriving Data?
Question: Your pipeline aggregates daily sales metrics. An upstream system sometimes sends corrected events 2-3 days late. How do you handle this?
Key dimensions: Reprocessing strategy, idempotency, downstream impact.
Answer: Three strategies, applied based on business requirements:
Strategy 1 — MERGE / Upsert (for small correction volumes):
-- Accept corrections up to 7 days late
-- Use MERGE to correct the affected date's aggregates
MERGE INTO daily_sales_agg AS target
USING (
SELECT sale_date, SUM(revenue) AS daily_revenue
FROM sales_raw
WHERE processed_at >= CURRENT_TIMESTAMP - INTERVAL '1 HOUR'
GROUP BY sale_date
) AS source
ON target.sale_date = source.sale_date
WHEN MATCHED THEN UPDATE SET target.daily_revenue = source.daily_revenue,
target.last_updated = CURRENT_TIMESTAMP;Strategy 2 — Watermark with grace period (Spark Streaming):
# Allow 3 days of late data before closing the window
aggregated = (
events_stream
.withWatermark("event_time", "3 days")
.groupBy(
window("event_time", "1 day"),
"product_id"
)
.agg(F.sum("revenue").alias("daily_revenue"))
)Strategy 3 — Partition reprocessing (for full corrections): Drop and reload the affected date's partition. This is the safest approach if corrections are substantial:
# Idempotent partition replace
df.write.mode("overwrite") \
.option("replaceWhere", "sale_date = '2026-05-04'") \
.format("delta") \
.save(DELTA_PATH)Q7: How Do You Ensure Exactly-Once Processing?
Question: Explain how you guarantee exactly-once semantics in a Kafka → Spark → Delta Lake pipeline.
Answer: Exactly-once is achieved by combining idempotent producers, transactional consumers, and atomic writes.
Kafka Producer:
- enable.idempotence=true
- acks=all
- retries=MAX_INT
Kafka Consumer (Spark):
- Read committed offsets from Kafka
- Process batch
- Write to Delta Lake (atomic)
- Commit Kafka offsets ONLY after successful Delta write
Delta Lake:
- Each write is an ACID transaction
- Delta transaction log ensures atomicity
- If job fails after writing but before committing offsets:
- Restart re-reads from last committed offset
- Delta's idempotency key prevents duplicate rows# Delta Lake write with idempotency key
(
processed_df
.write
.format("delta")
.mode("append")
.option("txnAppId", "clickstream_pipeline")
.option("txnVersion", batch_id) # batch_id from Spark Structured Streaming
.save(DELTA_PATH)
)
# Spark Structured Streaming + Delta gives exactly-once end-to-end
# when using foreachBatch with Delta MERGEQ8: Explain the Medallion Architecture and When to Deviate
Question: What is the medallion architecture? When would you not use it?
Medallion Architecture:
Raw Sources
│
▼
[Bronze Layer] ← Raw data, as-is from source
│ Preserve everything, no transformations
│ Schema: source schema + metadata columns
│ (ingested_at, source_file, batch_id)
▼
[Silver Layer] ← Cleaned, validated, deduplicated
│ Conformed data types
│ PII masked
│ Failed quality checks quarantined
▼
[Gold Layer] ← Business-level aggregations
Dimensional models (star schema)
Ready for BI tools and ML featuresWhen to deviate:
-
Simple pipelines with 1-2 transformations: A two-layer (raw + curated) architecture is simpler and cheaper to maintain. Don't add Bronze-Silver-Gold ceremony for a 5-table pipeline.
-
Real-time requirements: Medallion is fundamentally batch-oriented. If you need sub-second data, add a streaming path that bypasses Bronze/Silver and writes directly to a real-time store (Redis, ClickHouse).
-
When Gold tables become the canonical source: In a lakehouse with Delta Lake and Snowflake, sometimes Gold in Delta IS the data warehouse. Don't add a fourth layer by also loading to Snowflake unless there's a specific reason.
Q9: How Do You Handle Schema Changes in Production?
Question: A source system adds 3 new columns and renames 1 existing column overnight. Your pipeline breaks. How do you design against this?
Answer: Schema change resilience requires contracts at the source, detection in the pipeline, and graceful handling.
Detection:
def detect_schema_changes(
current_schema: dict,
expected_schema: dict,
) -> dict:
current_cols = set(current_schema.keys())
expected_cols = set(expected_schema.keys())
added = current_cols - expected_cols
removed = expected_cols - current_cols
type_changes = {
col: (expected_schema[col], current_schema[col])
for col in current_cols & expected_cols
if current_schema[col] != expected_schema[col]
}
return {"added": added, "removed": removed, "type_changes": type_changes}Response strategy by change type:
| Change | Risk | Strategy |
|---|---|---|
| New nullable column added | Low | Auto-evolve Delta schema (mergeSchema=true), backfill with NULL |
| Column renamed | High | Alert + pause pipeline; update mapping and backfill |
| Column type widened (INT→BIGINT) | Low | Auto-evolve |
| Column type narrowed (BIGINT→INT) | High | Alert + manual review |
| Column removed | High | Alert; keep in warehouse with NULL going forward |
# Delta Lake schema evolution — safe for additive changes
df.write.format("delta") \
.option("mergeSchema", "true") \
.mode("append") \
.save(DELTA_PATH)Q10: Design a Cost-Efficient Snowflake Architecture
Question: Your Snowflake bill is $80k/month. Design an architecture that reduces it by 40% without impacting query performance for end users.
Cost Reduction Architecture:
Current (expensive):
[All queries → 1 XL warehouse running 24/7]
Monthly compute: ~$60k
Storage: ~$20k
Optimised:
┌─────────────────────────────────────────────┐
│ Warehouse Strategy │
│ │
│ DEV_WH (XS) — developer queries │
│ REPORTING_WH (S) — BI tool queries (auto- │
│ suspend 60s, auto-start) │
│ PIPELINE_WH (M) — dbt/pipeline runs only │
│ ADHOC_WH (L) — data science (manual │
│ start only) │
└─────────────────────────────────────────────┘
Storage Optimisations:
- Enable auto-clustering on high-query tables
- Set data retention to 7 days (from 90) for staging tables
- DROP unused clones older than 3 days
- Move infrequently queried data to Iceberg on S3Specific levers:
-
Warehouse right-sizing: Most BI queries run fine on Small. Use Large only for complex joins/aggregations. Enforce via resource monitors.
-
Query result cache: Ensure repeated identical queries hit Snowflake's result cache. Avoid parameterising queries with
CURRENT_TIMESTAMPin the SELECT (it busts the cache). -
Materialised views: Pre-compute expensive aggregations used in dashboards. Snowflake automatically refreshes them.
-
Table partitioning (clustering): On large tables queried by date range, set clustering key to
order_date. Reduces bytes scanned per query. -
Zero-copy clones for dev/test: Use
CREATE TABLE dev_orders CLONE ordersfor development — no storage cost for the clone until rows diverge.
Questions 11-20: Additional Topics
Q11: Design a Metadata and Data Lineage System
Key components: metadata store (Apache Atlas or DataHub), automatic lineage extraction from Airflow DAGs and SQL transformations, column-level lineage for GDPR right-to-erasure compliance.
Q12: Multi-Region Data Replication Strategy
Active-active vs active-passive. Use Kafka MirrorMaker 2 for cross-region topic replication. Design for RPO=0, RTO=15 minutes. Conflict resolution with last-writer-wins using event timestamps.
Q13: Design an ML Feature Store
Online store (Redis): low-latency serving for real-time models. Offline store (Delta Lake): historical feature values for training. Feature computation pipeline: Spark for batch features, Flink for streaming features. Point-in-time correct feature joins for training data.
Q14: Data Mesh Architecture
Domain-oriented decentralised data ownership. Each domain team owns their data products. Central platform provides: storage infrastructure, data catalogue, quality standards. Trade-off: consistency vs. autonomy.
Q15: Designing for GDPR Right to Erasure
Pseudonymise PII at ingest (hash email with rotating salt). Store salt in a separate key management service. To erase: delete the salt for the user — all their data becomes irrecoverable. Never store PII in partition keys or Parquet file names.
Q16-20: Additional System Design Topics
- Backfill strategy for a 5-year historical load with minimal warehouse disruption
- Designing a chargeback system for cloud data infrastructure costs
- Handling PII in a multi-tenant data platform
- Designing a data catalog with automated schema documentation
- Managing 1000+ dbt models: modularity, CI/CD, testing strategy
What Interviewers Are Scoring
Breadth: Can you cover all layers (ingest → process → store → serve → monitor)?
Depth on trade-offs: Why Kafka over Kinesis? Why Delta Lake over Iceberg here? What breaks at 10x scale?
Operability: How do you backfill? What happens when the pipeline fails? How do you know it failed?
Cost awareness: Cloud costs are real. Interviewers want engineers who think about bytes scanned, warehouse sizing, and storage tier selection.
Evolution: How does your design handle schema changes? Adding new data sources? New consumers?
Lead with requirements, then design top-down, then dive into the component the interviewer cares most about. Always offer trade-offs — there's no single right answer.