Data Engineering System Design Interview Questions (20 Complete Answers)
20 data engineering system design interview questions with complete answers, architecture diagrams, and the key trade-offs interviewers expect you to discuss.
How System Design Interviews Work for Data Engineers
Unlike software engineering system design, DE system design focuses on:
- Throughput and latency trade-offs: batch vs. streaming vs. micro-batch
- Data reliability: exactly-once semantics, idempotency, late data
- Storage design: partitioning, clustering, file format choices
- Operational concerns: schema evolution, backfilling, cost management
For each question below: read the question, identify the key dimensions the interviewer is testing, then study the answer.
Q1: Design a Real-Time Clickstream Pipeline
Question: Design a system to collect, process, and store clickstream events from a web application. The system must support 100k events/second peak load, real-time dashboards with 5-second latency, and long-term analytical queries.
Key dimensions: Ingest scalability, stream processing, storage tiers, latency budget, backpressure handling.
Architecture:
Web Clients
     |
     v
[Load Balancer]
     |
     v
[Kafka Cluster - 12 partitions, 3 brokers]
     |   Topic: clickstream.raw (retention: 7d)
     |
     +---------------------------------------+
     v                                       v
[Spark Structured                      [Flink - real-time]
 Streaming - micro-batch]              (5s processing window)
     |                                       |
     v                                       v
[Delta Lake                            [Redis / ClickHouse]
 Bronze/Silver/Gold]                   (real-time dashboard store)
     |
     v
[Snowflake - Analytics]

Complete answer:
Ingestion layer: Kafka with 12 partitions (partitioned by user_id to maintain event order per user). Producers use async send with acks=1 for low latency. Set max.in.flight.requests.per.connection=1 if ordering is critical.
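A minimal sketch of this producer configuration, assuming the confluent-kafka Python client; the broker addresses and callback are illustrative, not part of the original design:

# Hedged sketch: clickstream producer using confluent-kafka (assumed client library)
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "kafka-1:9092,kafka-2:9092,kafka-3:9092",  # illustrative brokers
    "acks": "1",                      # low latency; trades durability for speed
    "linger.ms": 10,                  # small batching window for throughput
    "compression.type": "snappy",
    # "max.in.flight.requests.per.connection": 1,  # only if strict ordering is required
})

def on_delivery(err, msg):
    if err is not None:
        print(f"delivery failed: {err}")

def send_click(event_json: bytes, user_id: str) -> None:
    # Key by user_id so all events for a user land on the same partition (per-user ordering)
    producer.produce("clickstream.raw", key=user_id, value=event_json, callback=on_delivery)
    producer.poll(0)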
Streaming processing (two paths):
- Hot path (Flink): 5-second tumbling windows, aggregating page views and clicks per user per session. Write to Redis for the real-time dashboard. Flink checkpoints every 30s, committing Kafka offsets as part of each checkpoint; combined with Kafka transactional producers this gives exactly-once semantics.
- Cold path (Spark Structured Streaming): Micro-batch every 2 minutes. Write raw events to the Delta Lake Bronze layer, transform to Silver (cleaned, deduplicated), and build Gold tables (session aggregations, funnel metrics). See the sketch after this list.
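A minimal sketch of the cold path, assuming PySpark with the Delta Lake connector; the bootstrap servers, checkpoint location, and storage paths are illustrative:

# Hedged sketch: Spark Structured Streaming cold path, Kafka -> Delta Bronze
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("clickstream_cold_path").getOrCreate()

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka-1:9092")   # illustrative
    .option("subscribe", "clickstream.raw")
    .option("startingOffsets", "latest")
    .load()
)

bronze = raw.select(
    F.col("key").cast("string").alias("user_id"),
    F.col("value").cast("string").alias("event_json"),
    F.col("timestamp").alias("kafka_ts"),
    F.current_timestamp().alias("ingested_at"),
)

(
    bronze.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "s3://bucket/checkpoints/clickstream_bronze/")  # illustrative
    .trigger(processingTime="2 minutes")                 # micro-batch every 2 minutes
    .start("s3://bucket/delta/bronze/clickstream/")      # illustrative path
)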
Storage:
- Delta Lake Bronze: raw events, partitioned by date and hour. A compaction job runs hourly to merge small files.
- Snowflake: loaded from Delta Gold daily, used for ad-hoc analytical queries.
- Retention: Delta Lake retains 90 days. Snowflake retains 3 years.
Capacity: At 100k events/sec, assuming 500 bytes average event size = 50 MB/s = 4.3 TB/day. Use Parquet with Snappy compression (3:1 ratio) = ~1.5 TB/day in Delta Lake.
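The back-of-envelope numbers can be sanity-checked in a few lines; the event size and compression ratio are the same assumptions stated above:

# Hedged sketch: capacity back-of-envelope (assumed 500-byte events, ~3:1 compression)
events_per_sec = 100_000
avg_event_bytes = 500
seconds_per_day = 86_400

raw_bytes_per_day = events_per_sec * avg_event_bytes * seconds_per_day
compressed_bytes_per_day = raw_bytes_per_day / 3          # Parquet + Snappy, ~3:1

print(f"raw:        {raw_bytes_per_day / 1e12:.1f} TB/day")         # ~4.3 TB/day
print(f"compressed: {compressed_bytes_per_day / 1e12:.1f} TB/day")  # ~1.4-1.5 TB/day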
Q2: Design an Idempotent Batch Pipeline
Question: Design a batch pipeline that loads data from a PostgreSQL OLTP database to a Snowflake data warehouse nightly. It must be idempotent: re-running the same pipeline for the same date must produce exactly the same result without duplicates.
Key dimensions: Idempotency strategy, incremental vs full load, watermark management.
Architecture:
[PostgreSQL OLTP]
     |
     |  Extract (changed rows only)
     v
[S3 Landing Zone]
     |  File: orders_2026-05-07_v1.parquet
     |  Idempotency key: date + run_id
     v
[Snowflake Staging Table]  (truncated before each load)
     |
     |  MERGE into warehouse (upsert by primary key)
     v
[Snowflake Warehouse Table]
     |
     v
[Audit Log: pipeline_runs]

Idempotency design:
-- Idempotency pattern using a staging table + MERGE
-- Step 1: Truncate staging (safe to repeat)
TRUNCATE TABLE orders_staging;
-- Step 2: Load today's extract into staging
COPY INTO orders_staging
FROM @s3_stage/orders/2026-05-07/
FILE_FORMAT = (TYPE = PARQUET);
-- Step 3: MERGE into warehouse (idempotent - same result on repeat)
MERGE INTO orders_warehouse AS target
USING orders_staging AS source
ON target.order_id = source.order_id
WHEN MATCHED AND source.updated_at > target.updated_at
THEN UPDATE SET
target.status = source.status,
target.revenue = source.revenue,
target.updated_at = source.updated_at
WHEN NOT MATCHED
THEN INSERT (order_id, customer_id, status, revenue, created_at, updated_at)
VALUES (source.order_id, source.customer_id, source.status,
source.revenue, source.created_at, source.updated_at);
-- Step 4: Log the run (also idempotent with MERGE)
MERGE INTO pipeline_runs AS t
USING (SELECT 'orders' AS pipeline, '2026-05-07' AS run_date) AS s
ON t.pipeline = s.pipeline AND t.run_date = s.run_date
WHEN MATCHED THEN UPDATE SET t.last_run_ts = CURRENT_TIMESTAMP, t.status = 'success'
WHEN NOT MATCHED THEN INSERT VALUES (s.pipeline, s.run_date, CURRENT_TIMESTAMP, 'success');

Q3: Design a Data Warehouse for E-Commerce
Question: Design the data warehouse schema for an e-commerce company. Must support: sales analytics, customer lifetime value, product performance, and marketing attribution.
Key dimensions: Star schema design, SCD handling, fact table granularity.
Schema Design:
                      +-----------------+
                      |    dim_date     |
                      |  date_key PK    |
                      |  day, week      |
                      |  month, quarter |
                      +--------+--------+
                               |
 +----------------+            |            +------------------+
 |  dim_customer  |            |            |   dim_product    |
 |  customer_key  |----+       |      +-----|   product_key    |
 |  customer_id   |    |       |      |     |   product_id     |
 |  email         |    |       |      |     |   name, category |
 |  tier          |    v       v      v     |   unit_cost      |
 |  valid_from    |   +----------------+    +------------------+
 |  valid_to      |   |  fact_orders   |
 |  is_current    |   |  order_key     |    +---------------+
 +----------------+   |  date_key      |----|  dim_channel  |
                      |  customer_key  |    |  channel_key  |
                      |  product_key   |    |  name, type   |
                      |  channel_key   |    +---------------+
                      |  revenue       |
                      |  quantity      |
                      |  discount      |
                      |  margin        |
                      +----------------+

Design decisions to discuss in the interview:
- Granularity: fact_orders is at order-line level (one row per product per order). This allows product-level analysis without joining order_items. For marketing attribution, add a separate fact_touchpoints table.
- SCD Type 2 on dim_customer: Use valid_from, valid_to, is_current to track tier changes over time. This lets you answer "what tier was this customer when they made this purchase?"
- Degenerate dimensions: order_id lives directly in the fact table (degenerate dimension) since you don't need a dim_order table; the order header information is fully captured in the fact row.
- Slowly Changing Dimension Type 1 vs 2 (a Delta Lake SCD2 sketch follows this list):
- Type 1 (overwrite): product price, where you want the current price for margin calculations
- Type 2 (history): customer tier, address, where you want historical context for past orders
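A minimal sketch of maintaining SCD Type 2 on dim_customer with Delta Lake, assuming PySpark and the delta-spark package; the table path, the tracked column (tier), and the helper name are illustrative, and surrogate key generation is omitted:

# Hedged sketch: SCD Type 2 upsert into dim_customer (delta-spark assumed)
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
DIM_PATH = "s3://bucket/delta/gold/dim_customer/"   # illustrative

def scd2_upsert(updates_df):
    dim = DeltaTable.forPath(spark, DIM_PATH)

    # Step 1: close out the current row when a tracked attribute changed
    (
        dim.alias("t")
        .merge(updates_df.alias("s"), "t.customer_id = s.customer_id AND t.is_current = true")
        .whenMatchedUpdate(
            condition="t.tier <> s.tier",
            set={"is_current": "false", "valid_to": "current_timestamp()"},
        )
        .execute()
    )

    # Step 2: append a new current row for every changed or brand-new customer
    # (customer_key surrogate generation omitted for brevity)
    current = spark.read.format("delta").load(DIM_PATH).filter("is_current = true")
    new_rows = (
        updates_df.alias("s")
        .join(current.alias("t"), "customer_id", "left_anti")   # no open row -> needs a new version
        .withColumn("valid_from", F.current_timestamp())
        .withColumn("valid_to", F.lit(None).cast("timestamp"))
        .withColumn("is_current", F.lit(True))
    )
    new_rows.write.format("delta").mode("append").save(DIM_PATH)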
Q4: Design a Data Quality Monitoring System
Question: Design a system that continuously monitors data quality across 50+ pipeline tables and alerts on anomalies.
Architecture:
[All Pipeline Tables in DW]
        |
        v
[Quality Check Scheduler - Airflow]
        |  Runs quality checks per table after each pipeline load
        |
        v
[Quality Engine - Python service]
        |  Runs: null rates, row count anomalies, distribution drift,
        |  referential integrity, freshness checks
        |
        v
[quality_checks_log table - Snowflake]
        |  Columns: table, check_type, run_id, passed, observed_value, threshold, ts
        |
        +------------------------+
        v                        v
[Quality Dashboard]        [Alert Router]
(Grafana/Superset)               |
                                 +-- Slack (warnings)
                                 +-- PagerDuty (critical failures)

Key design decisions:
- Check configuration as code: Store thresholds in YAML per table. Version-controlled. No UI needed to change thresholds.
- Escalation logic: First failure -> Slack warning. Three consecutive failures -> PagerDuty. This reduces alert fatigue from transient issues.
- Check parallelism: Run checks for different tables in parallel. Each check must be idempotent (tagged with run_id); see the sketch after this list.
- Baseline learning: For row count and mean checks, auto-calculate the baseline from the last 30 runs on first deployment. Re-baseline on explicit approval after intentional schema changes.
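A minimal sketch of one check inside the quality engine, assuming thresholds come from the version-controlled YAML config described above; the table names, YAML layout, and logging table schema are illustrative:

# Hedged sketch: a single null-rate check driven by YAML thresholds
import uuid
import yaml
from pyspark.sql import Row, SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

def run_null_rate_check(table: str, column: str, config_path: str) -> dict:
    with open(config_path) as f:
        thresholds = yaml.safe_load(f)      # e.g. {"orders": {"null_rate": {"order_id": 0.0}}}
    threshold = thresholds[table]["null_rate"][column]

    df = spark.table(table)
    total = df.count()
    nulls = df.filter(F.col(column).isNull()).count()
    observed = nulls / total if total else 0.0

    result = {
        "table": table,
        "check_type": f"null_rate:{column}",
        "run_id": str(uuid.uuid4()),        # tags this run so repeated checks stay distinguishable
        "passed": observed <= threshold,
        "observed_value": observed,
        "threshold": threshold,
    }
    # Append to quality_checks_log; the alert router reads failures from this table
    spark.createDataFrame([Row(**result)]).write.mode("append").saveAsTable("quality_checks_log")
    return result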
Q5: Design the Ingestion Architecture for 50 Microservices Sending Events
Question: 50 microservices each emit domain events (order placed, payment processed, inventory updated). Design an ingestion architecture that's decoupled, scalable, and allows each service team to evolve their schema independently.
Architecture:
[Microservice A]   [Microservice B]   [Microservice N]
        |                  |                  |
        +------------------+------------------+
                           |
                           v
            [Kafka - per-domain topics]
               orders.placed
               payments.processed
               inventory.updated
               (partitioned by entity_id)
                           |
                 +---------+---------+
                 v                   v
            [Schema            [Dead Letter
             Registry]          Queue Topic]
            (Confluent)
                           |
                           v
            [Kafka Connect - S3 Sink]
            or [Spark Structured Streaming]
                           |
                           v
            [Delta Lake - Raw Events]
            Partition: domain / date / hour

Schema evolution strategy: Use Apache Avro schemas registered in Confluent Schema Registry. Configure compatibility as BACKWARD: each new schema version must still be able to read data written with the previous version, which in practice means producers may drop fields or add fields with defaults. This allows producer teams to evolve their schemas without breaking the ingestion pipeline.
Dead letter queue: Events that fail schema validation or deserialization go to events.dlq topic. A monitoring job alerts if DLQ lag grows above threshold.
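A minimal sketch of the dead-letter routing on the consumer side, assuming the confluent-kafka Python client; the broker address, group id, and validation function are illustrative (a real pipeline would validate against the schema registry):

# Hedged sketch: route events that fail deserialization/validation to events.dlq
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "kafka-1:9092",        # illustrative
    "group.id": "ingestion-s3-sink",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "kafka-1:9092"})
consumer.subscribe(["orders.placed", "payments.processed", "inventory.updated"])

def validate(event: dict) -> bool:
    # Placeholder for schema-registry-backed validation
    return "event_id" in event and "occurred_at" in event

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    try:
        event = json.loads(msg.value())
        if not validate(event):
            raise ValueError("failed validation")
        # ... hand off to the sink (S3 / Delta) here ...
    except ValueError:
        # Keep the original payload and its source topic for later replay
        producer.produce("events.dlq", value=msg.value(),
                         headers=[("source_topic", msg.topic().encode())])
        producer.poll(0)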
Q6: How Would You Handle Late-Arriving Data?
Question: Your pipeline aggregates daily sales metrics. An upstream system sometimes sends corrected events 2-3 days late. How do you handle this?
Key dimensions: Reprocessing strategy, idempotency, downstream impact.
Answer: Three strategies, applied based on business requirements:
Strategy 1 - MERGE / Upsert (for small correction volumes):
-- Accept corrections up to 7 days late
-- Use MERGE to correct the affected date's aggregates
MERGE INTO daily_sales_agg AS target
USING (
    SELECT sale_date, SUM(revenue) AS daily_revenue
    FROM sales_raw
    -- Re-aggregate every date touched by a recent correction, not just the late rows
    WHERE sale_date IN (
        SELECT DISTINCT sale_date
        FROM sales_raw
        WHERE processed_at >= CURRENT_TIMESTAMP - INTERVAL '1 HOUR'
    )
    GROUP BY sale_date
) AS source
ON target.sale_date = source.sale_date
WHEN MATCHED THEN UPDATE SET target.daily_revenue = source.daily_revenue,
                             target.last_updated = CURRENT_TIMESTAMP;

Strategy 2 - Watermark with grace period (Spark Streaming):
# Allow 3 days of late data before closing the window
from pyspark.sql import functions as F
from pyspark.sql.functions import window

aggregated = (
    events_stream
    .withWatermark("event_time", "3 days")
    .groupBy(
        window("event_time", "1 day"),
        "product_id",
    )
    .agg(F.sum("revenue").alias("daily_revenue"))
)

Strategy 3 - Partition reprocessing (for full corrections): Drop and reload the affected date's partition. This is the safest approach if corrections are substantial:
# Idempotent partition replace
df.write.mode("overwrite") \
.option("replaceWhere", "sale_date = '2026-05-04'") \
.format("delta") \
.save(DELTA_PATH)

Q7: How Do You Ensure Exactly-Once Processing?
Question: Explain how you guarantee exactly-once semantics in a Kafka β Spark β Delta Lake pipeline.
Answer: Exactly-once is achieved by combining idempotent producers, transactional consumers, and atomic writes.
Kafka Producer:
- enable.idempotence=true
- acks=all
- retries=MAX_INT
Kafka Consumer (Spark):
- Read committed offsets from Kafka
- Process batch
- Write to Delta Lake (atomic)
- Commit Kafka offsets ONLY after successful Delta write
Delta Lake:
- Each write is an ACID transaction
- Delta transaction log ensures atomicity
- If job fails after writing but before committing offsets:
- Restart re-reads from last committed offset
- Delta's idempotency key prevents duplicate rows

# Delta Lake write with idempotency key
(
processed_df
.write
.format("delta")
.mode("append")
.option("txnAppId", "clickstream_pipeline")
.option("txnVersion", batch_id) # batch_id from Spark Structured Streaming
.save(DELTA_PATH)
)
# Spark Structured Streaming + Delta gives exactly-once end-to-end
# when using foreachBatch with Delta MERGE
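A minimal sketch of the foreachBatch pattern the comment refers to, assuming PySpark with delta-spark; spark, processed_stream, the table path, and the event_id merge key are illustrative context, not part of the original answer:

# Hedged sketch: exactly-once upserts via foreachBatch + Delta MERGE (delta-spark assumed)
from delta.tables import DeltaTable

DELTA_PATH = "s3://bucket/delta/silver/clicks/"      # illustrative

def upsert_batch(batch_df, batch_id):
    target = DeltaTable.forPath(spark, DELTA_PATH)
    (
        target.alias("t")
        .merge(batch_df.alias("s"), "t.event_id = s.event_id")   # dedup on a natural key
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )

(
    processed_stream.writeStream
    .foreachBatch(upsert_batch)   # MERGE is idempotent on event_id, so replayed micro-batches add no duplicates
    .option("checkpointLocation", "s3://bucket/checkpoints/clicks_silver/")
    .start()
)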
Q8: Explain the Medallion Architecture and When to Deviate
Question: What is the medallion architecture? When would you not use it?
Medallion Architecture:
Raw Sources
     |
     v
[Bronze Layer]  - Raw data, as-is from source
     |            Preserve everything, no transformations
     |            Schema: source schema + metadata columns
     |            (ingested_at, source_file, batch_id)
     v
[Silver Layer]  - Cleaned, validated, deduplicated
     |            Conformed data types
     |            PII masked
     |            Failed quality checks quarantined
     v
[Gold Layer]    - Business-level aggregations
                  Dimensional models (star schema)
                  Ready for BI tools and ML features

When to deviate:
- Simple pipelines with 1-2 transformations: A two-layer (raw + curated) architecture is simpler and cheaper to maintain. Don't add Bronze-Silver-Gold ceremony for a 5-table pipeline.
- Real-time requirements: Medallion is fundamentally batch-oriented. If you need sub-second data, add a streaming path that bypasses Bronze/Silver and writes directly to a real-time store (Redis, ClickHouse).
- When Gold tables become the canonical source: In a lakehouse with Delta Lake and Snowflake, sometimes Gold in Delta IS the data warehouse. Don't add a fourth layer by also loading to Snowflake unless there's a specific reason.
Q9: How Do You Handle Schema Changes in Production?
Question: A source system adds 3 new columns and renames 1 existing column overnight. Your pipeline breaks. How do you design against this?
Answer: Schema change resilience requires contracts at the source, detection in the pipeline, and graceful handling.
Detection:
def detect_schema_changes(
    current_schema: dict,
    expected_schema: dict,
) -> dict:
    current_cols = set(current_schema.keys())
    expected_cols = set(expected_schema.keys())
    added = current_cols - expected_cols
    removed = expected_cols - current_cols
    type_changes = {
        col: (expected_schema[col], current_schema[col])
        for col in current_cols & expected_cols
        if current_schema[col] != expected_schema[col]
    }
    return {"added": added, "removed": removed, "type_changes": type_changes}

Response strategy by change type:
| Change | Risk | Strategy |
|---|---|---|
| New nullable column added | Low | Auto-evolve Delta schema (mergeSchema=true), backfill with NULL |
| Column renamed | High | Alert + pause pipeline; update mapping and backfill |
| Column type widened (INT to BIGINT) | Low | Auto-evolve |
| Column type narrowed (BIGINT to INT) | High | Alert + manual review |
| Column removed | High | Alert; keep in warehouse with NULL going forward |
# Delta Lake schema evolution - safe for additive changes
df.write.format("delta") \
.option("mergeSchema", "true") \
.mode("append") \
.save(DELTA_PATH)

Q10: Design a Cost-Efficient Snowflake Architecture
Question: Your Snowflake bill is $80k/month. Design an architecture that reduces it by 40% without impacting query performance for end users.
Cost Reduction Architecture:
Current (expensive):
[All queries -> 1 XL warehouse running 24/7]
    Monthly compute: ~$60k
    Storage: ~$20k

Optimised:
+-----------------------------------------------+
|              Warehouse Strategy               |
|                                               |
|  DEV_WH (XS)      - developer queries         |
|  REPORTING_WH (S) - BI tool queries (auto-    |
|                     suspend 60s, auto-start)  |
|  PIPELINE_WH (M)  - dbt/pipeline runs only    |
|  ADHOC_WH (L)     - data science (manual      |
|                     start only)               |
+-----------------------------------------------+

Storage Optimisations:
- Enable auto-clustering on high-query tables
- Set data retention to 7 days (from 90) for staging tables
- DROP unused clones older than 3 days
- Move infrequently queried data to Iceberg on S3

Specific levers:
- Warehouse right-sizing: Most BI queries run fine on Small. Use Large only for complex joins/aggregations. Enforce via resource monitors.
- Query result cache: Ensure repeated identical queries hit Snowflake's result cache. Avoid parameterising queries with CURRENT_TIMESTAMP in the SELECT (it busts the cache).
- Materialised views: Pre-compute expensive aggregations used in dashboards. Snowflake automatically refreshes them.
- Table partitioning (clustering): On large tables queried by date range, set the clustering key to order_date. Reduces bytes scanned per query.
- Zero-copy clones for dev/test: Use CREATE TABLE dev_orders CLONE orders for development; no storage cost for the clone until rows diverge. (A rough cost model for the warehouse split is sketched after this list.)
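A rough back-of-envelope for the split-warehouse strategy; the credit price, credits-per-hour mapping, and daily utilisation hours are assumptions for illustration only, not actual Snowflake pricing or the article's figures:

# Hedged sketch: rough monthly compute estimate for the split-warehouse strategy
# Assumptions: ~$4 per credit, 30-day month, credits/hour by size (XS=1, S=2, M=4, L=8, XL=16)
CREDITS_PER_HOUR = {"XS": 1, "S": 2, "M": 4, "L": 8, "XL": 16}
PRICE_PER_CREDIT = 4.0   # assumed; depends on contract

# warehouse -> (size, estimated active hours per day) -- illustrative utilisation
warehouses = {
    "DEV_WH":       ("XS", 6),
    "REPORTING_WH": ("S", 10),   # auto-suspend keeps it off outside business hours
    "PIPELINE_WH":  ("M", 4),
    "ADHOC_WH":     ("L", 2),    # manual start only
}

current = CREDITS_PER_HOUR["XL"] * 24 * 30 * PRICE_PER_CREDIT   # one XL running 24/7
optimised = sum(
    CREDITS_PER_HOUR[size] * hours * 30 * PRICE_PER_CREDIT
    for size, hours in warehouses.values()
)
# Absolute figures depend on contract pricing and real utilisation; the point is the relative saving
print(f"current:   ${current:,.0f}/month")
print(f"optimised: ${optimised:,.0f}/month")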
Questions 11-20: Additional Topics
Q11: Design a Metadata and Data Lineage System
Key components: metadata store (Apache Atlas or DataHub), automatic lineage extraction from Airflow DAGs and SQL transformations, column-level lineage for GDPR right-to-erasure compliance.
Q12: Multi-Region Data Replication Strategy
Active-active vs active-passive. Use Kafka MirrorMaker 2 for cross-region topic replication. Design for near-zero RPO and an RTO of 15 minutes (MirrorMaker replication is asynchronous, so RPO cannot be exactly zero). Conflict resolution with last-writer-wins using event timestamps.
Q13: Design an ML Feature Store
Online store (Redis): low-latency serving for real-time models. Offline store (Delta Lake): historical feature values for training. Feature computation pipeline: Spark for batch features, Flink for streaming features. Point-in-time correct feature joins for training data.
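A minimal sketch of a point-in-time correct feature join in PySpark; the labels and features DataFrames and their column names are illustrative:

# Hedged sketch: point-in-time correct join of labels to the latest feature value
# available at label time (prevents leaking future feature values into training data)
from pyspark.sql import Window, functions as F

# labels:   entity_id, label_ts, label
# features: entity_id, feature_ts, feature_value   (offline store, e.g. Delta Lake)
joined = (
    labels.alias("l")
    .join(features.alias("f"), on="entity_id", how="left")
    .where(F.col("f.feature_ts") <= F.col("l.label_ts"))   # only features known at label time
)

w = Window.partitionBy("entity_id", "label_ts").orderBy(F.col("f.feature_ts").desc())
training_set = (
    joined
    .withColumn("rn", F.row_number().over(w))
    .where(F.col("rn") == 1)            # keep the most recent qualifying feature row per label
    .drop("rn")
)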
Q14: Data Mesh Architecture
Domain-oriented decentralised data ownership. Each domain team owns their data products. Central platform provides: storage infrastructure, data catalogue, quality standards. Trade-off: consistency vs. autonomy.
Q15: Designing for GDPR Right to Erasure
Pseudonymise PII at ingest (hash email with rotating salt). Store salt in a separate key management service. To erase: delete the salt for the user β all their data becomes irrecoverable. Never store PII in partition keys or Parquet file names.
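A minimal sketch of the salted-hash pseudonymisation described above; the salt store is shown as a plain dict for illustration, where a real system would keep salts in a KMS or a dedicated secrets service:

# Hedged sketch: pseudonymise PII at ingest; erasure = delete the user's salt
import hashlib
import secrets

salt_store: dict[str, bytes] = {}   # illustrative; use a KMS / secrets service in practice

def pseudonymise_email(user_id: str, email: str) -> str:
    salt = salt_store.setdefault(user_id, secrets.token_bytes(32))
    return hashlib.sha256(salt + email.lower().encode()).hexdigest()

def erase_user(user_id: str) -> None:
    # Deleting the salt makes every stored hash for this user irrecoverable
    salt_store.pop(user_id, None)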
Q16-20: Additional System Design Topics
- Backfill strategy for a 5-year historical load with minimal warehouse disruption
- Designing a chargeback system for cloud data infrastructure costs
- Handling PII in a multi-tenant data platform
- Designing a data catalog with automated schema documentation
- Managing 1000+ dbt models: modularity, CI/CD, testing strategy
What Interviewers Are Scoring
Breadth: Can you cover all layers (ingest -> process -> store -> serve -> monitor)?
Depth on trade-offs: Why Kafka over Kinesis? Why Delta Lake over Iceberg here? What breaks at 10x scale?
Operability: How do you backfill? What happens when the pipeline fails? How do you know it failed?
Cost awareness: Cloud costs are real. Interviewers want engineers who think about bytes scanned, warehouse sizing, and storage tier selection.
Evolution: How does your design handle schema changes? Adding new data sources? New consumers?
Lead with requirements, then design top-down, then dive into the component the interviewer cares most about. Always offer trade-offs; there's no single right answer.