
Data Engineering System Design Interview Questions (20 Complete Answers)

20 data engineering system design interview questions with complete answers, architecture diagrams, and the key trade-offs interviewers expect you to discuss.

Learnixo · May 7, 2026 · 14 min read
System Design · Interview Prep · Data Engineering · Kafka · Spark · Delta Lake · Snowflake · Architecture

How System Design Interviews Work for Data Engineers

Unlike software engineering system design, DE system design focuses on:

  • Throughput and latency trade-offs - batch vs. streaming vs. micro-batch
  • Data reliability - exactly-once semantics, idempotency, late data
  • Storage design - partitioning, clustering, file format choices
  • Operational concerns - schema evolution, backfilling, cost management

For each question below: read the question, identify the key dimensions the interviewer is testing, then study the answer.


Q1: Design a Real-Time Clickstream Pipeline

Question: Design a system to collect, process, and store clickstream events from a web application. The system must support 100k events/second peak load, real-time dashboards with 5-second latency, and long-term analytical queries.

Key dimensions: Ingest scalability, stream processing, storage tiers, latency budget, backpressure handling.

Architecture:

Web Clients
    │
    ▼
[Load Balancer]
    │
    ▼
[Kafka Cluster - 12 partitions, 3 brokers]
    │  Topic: clickstream.raw  (retention: 7d)
    │
    ├───────────────────────────────────┐
    ▼                                   ▼
[Spark Structured Streaming]     [Flink - real-time]
 (micro-batch)                   (5s processing window)
    │                                   │
    ▼                                   ▼
[Delta Lake                      [Redis / ClickHouse]
 Bronze/Silver/Gold]             (real-time dashboard store)
    │
    ▼
[Snowflake - Analytics]

Complete answer:

Ingestion layer: Kafka with 12 partitions (partitioned by user_id to maintain event order per user). Producers use async send with acks=1 for low latency. Set max.in.flight.requests.per.connection=1 if ordering is critical.
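
A minimal producer sketch (using the confluent-kafka Python client; the client choice, broker addresses, and JSON payload format are assumptions) illustrating the key-by-user_id and acks=1 choices:

Python
import json

from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "broker1:9092,broker2:9092,broker3:9092",  # assumed addresses
    "acks": "1",                  # leader-only ack: lower latency, weaker durability
    "linger.ms": 5,               # small batching window for throughput
    "compression.type": "lz4",
})

def send_click(event: dict) -> None:
    # Keying by user_id keeps each user's events in order on a single partition
    producer.produce(
        topic="clickstream.raw",
        key=str(event["user_id"]),
        value=json.dumps(event).encode("utf-8"),
    )
    producer.poll(0)  # serve delivery callbacks without blocking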

Stream processing (two paths):

  • Hot path (Flink): 5-second tumbling windows, aggregating page views and clicks per user per session. Write results to Redis for the real-time dashboard. Flink checkpoints every 30s, storing Kafka offsets in its checkpointed state; paired with transactional or idempotent sinks this gives effectively exactly-once results.
  • Cold path (Spark Structured Streaming): micro-batch every 2 minutes. Write raw events to the Delta Lake Bronze layer, transform to Silver (cleaned, deduplicated), then build Gold tables (session aggregations, funnel metrics). A minimal Bronze ingestion sketch follows this list.
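
Assuming the broker address and storage paths below, and keeping the payload as a raw string for Bronze:

Python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("clickstream_bronze").getOrCreate()

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")  # assumed address
    .option("subscribe", "clickstream.raw")
    .load()
)

bronze = raw.select(
    F.col("key").cast("string").alias("user_id"),
    F.col("value").cast("string").alias("payload"),      # raw JSON; parsed later in Silver
    F.col("timestamp").alias("kafka_ts"),
    F.to_date("timestamp").alias("date"),
    F.hour("timestamp").alias("hour"),
    F.current_timestamp().alias("ingested_at"),
)

(
    bronze.writeStream
    .format("delta")
    .partitionBy("date", "hour")
    .option("checkpointLocation", "/checkpoints/clickstream_bronze")  # assumed path
    .trigger(processingTime="2 minutes")
    .start("/delta/bronze/clickstream")                               # assumed path
)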

Storage:

  • Delta Lake Bronze: raw events, partitioned by date and hour. Compaction job runs hourly to merge small files.
  • Snowflake: loaded from Delta Gold daily, used for ad-hoc analytical queries.
  • Retention: Delta Lake retains 90 days. Snowflake retains 3 years.

Capacity: At 100k events/sec, assuming 500 bytes average event size = 50 MB/s = 4.3 TB/day. Use Parquet with Snappy compression (3:1 ratio) = ~1.5 TB/day in Delta Lake.
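
The same estimate as a quick back-of-the-envelope calculation:

Python
events_per_sec = 100_000
avg_event_bytes = 500

raw_mb_per_sec = events_per_sec * avg_event_bytes / 1e6       # 50 MB/s
raw_tb_per_day = raw_mb_per_sec * 86_400 / 1e6                # ~4.3 TB/day
compressed_tb_per_day = raw_tb_per_day / 3                    # ~1.4 TB/day at 3:1 Snappy
print(raw_mb_per_sec, raw_tb_per_day, compressed_tb_per_day)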


Q2: Design an Idempotent Batch Pipeline

Question: Design a batch pipeline that loads data from a PostgreSQL OLTP database to a Snowflake data warehouse nightly. It must be idempotent - re-running the same pipeline for the same date must produce exactly the same result without duplicates.

Key dimensions: Idempotency strategy, incremental vs full load, watermark management.

Architecture:

[PostgreSQL OLTP]
    │
    │  Extract (changed rows only)
    ▼
[S3 Landing Zone]
    │  File: orders_2026-05-07_v1.parquet
    │  Idempotency key: date + run_id
    ▼
[Snowflake Staging Table]  ← truncated before each load
    │
    │  MERGE into warehouse (upsert by primary key)
    ▼
[Snowflake Warehouse Table]
    │
    ▼
[Audit Log: pipeline_runs]

Idempotency design:

SQL
-- Idempotency pattern using a staging table + MERGE
-- Step 1: Truncate staging (safe to repeat)
TRUNCATE TABLE orders_staging;

-- Step 2: Load today's extract into staging
COPY INTO orders_staging
FROM @s3_stage/orders/2026-05-07/
FILE_FORMAT = (TYPE = PARQUET);

-- Step 3: MERGE into warehouse (idempotent - same result on repeat)
MERGE INTO orders_warehouse AS target
USING orders_staging AS source
ON target.order_id = source.order_id

WHEN MATCHED AND source.updated_at > target.updated_at
THEN UPDATE SET
    target.status    = source.status,
    target.revenue   = source.revenue,
    target.updated_at = source.updated_at

WHEN NOT MATCHED
THEN INSERT (order_id, customer_id, status, revenue, created_at, updated_at)
VALUES (source.order_id, source.customer_id, source.status,
        source.revenue, source.created_at, source.updated_at);

-- Step 4: Log the run (also idempotent with MERGE)
MERGE INTO pipeline_runs AS t
USING (SELECT 'orders' AS pipeline, '2026-05-07' AS run_date) AS s
ON t.pipeline = s.pipeline AND t.run_date = s.run_date
WHEN MATCHED THEN UPDATE SET t.last_run_ts = CURRENT_TIMESTAMP, t.status = 'success'
WHEN NOT MATCHED THEN INSERT VALUES (s.pipeline, s.run_date, CURRENT_TIMESTAMP, 'success');
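
On the extract side, idempotency comes from a watermark on updated_at plus a deterministic, date-based S3 key, so re-running a date overwrites the same file rather than creating a new one. A minimal sketch (table, bucket, connection string, and the pandas/psycopg2 tooling are assumptions):

Python
from datetime import date, timedelta

import pandas as pd
import psycopg2

def extract_orders(run_date: date) -> str:
    conn = psycopg2.connect("dbname=shop user=etl")  # assumed connection string
    query = """
        SELECT order_id, customer_id, status, revenue, created_at, updated_at
        FROM orders
        WHERE updated_at >= %(start)s AND updated_at < %(end)s
    """
    df = pd.read_sql(query, conn,
                     params={"start": run_date, "end": run_date + timedelta(days=1)})
    # Deterministic key: derived from the run date, not from a run timestamp
    path = f"s3://landing-zone/orders/{run_date.isoformat()}/orders_{run_date.isoformat()}_v1.parquet"
    df.to_parquet(path, index=False)  # requires s3fs; re-runs overwrite the same key
    return path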

Q3: Design a Data Warehouse for E-Commerce

Question: Design the data warehouse schema for an e-commerce company. Must support: sales analytics, customer lifetime value, product performance, and marketing attribution.

Key dimensions: Star schema design, SCD handling, fact table granularity.

Schema Design:

                    ┌──────────────────┐
                    │ dim_date         │
                    │ date_key (PK)    │
                    │ day, week        │
                    │ month, quarter   │
                    └────────┬─────────┘
                             │
┌──────────────────┐   ┌─────┴──────────┐   ┌──────────────────┐
│ dim_customer     │   │ fact_orders    │   │ dim_product      │
│ customer_key (PK)│◄──┤ order_id       ├──►│ product_key (PK) │
│ customer_id      │   │ date_key       │   │ product_id       │
│ email            │   │ customer_key   │   │ name, category   │
│ tier             │   │ product_key    │   │ unit_cost        │
│ valid_from       │   │ channel_key    │   └──────────────────┘
│ valid_to         │   │ revenue        │
│ is_current       │   │ quantity       │   ┌──────────────────┐
└──────────────────┘   │ discount       ├──►│ dim_channel      │
                       │ margin         │   │ channel_key (PK) │
                       └────────────────┘   │ name, type       │
                                            └──────────────────┘

Design decisions to discuss in the interview:

  1. Granularity: fact_orders is at order-line level (one row per product per order). This allows product-level analysis without joining order_items. For marketing attribution, add a separate fact_touchpoints table.

  2. SCD Type 2 on dim_customer: Use valid_from, valid_to, is_current to track tier changes over time. This lets you answer "what tier was this customer when they made this purchase?"

  3. Degenerate dimensions: order_id lives directly in the fact table (degenerate dimension) since you don't need a dim_order table - the order header information is fully captured in the fact row.

  4. Slowly Changing Dimension Type 1 vs 2:

    • Type 1 (overwrite): product price - you want the current price for margin calculations
    • Type 2 (history): customer tier, address - you want historical context for past orders (a minimal SCD2 merge sketch follows this list)
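
Using the Delta Lake Python MERGE API (paths are assumptions, and updates_df is assumed to contain only customers whose tracked attributes actually changed):

Python
from delta.tables import DeltaTable
from pyspark.sql import functions as F

dim = DeltaTable.forPath(spark, "/delta/dim_customer")  # assumed path

# Step 1: close the current row for each changed customer
(
    dim.alias("t")
    .merge(updates_df.alias("s"), "t.customer_id = s.customer_id AND t.is_current = true")
    .whenMatchedUpdate(set={
        "is_current": "false",
        "valid_to": "current_timestamp()",
    })
    .execute()
)

# Step 2: append a new current row carrying the new attribute values
new_rows = (
    updates_df
    .withColumn("valid_from", F.current_timestamp())
    .withColumn("valid_to", F.lit(None).cast("timestamp"))
    .withColumn("is_current", F.lit(True))
)
new_rows.write.format("delta").mode("append").save("/delta/dim_customer")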

Q4: Design a Data Quality Monitoring System

Question: Design a system that continuously monitors data quality across 50+ pipeline tables and alerts on anomalies.

Architecture:

[All Pipeline Tables in DW]
    │
    ▼
[Quality Check Scheduler - Airflow]
    │  Runs quality checks per table after each pipeline load
    │
    ▼
[Quality Engine - Python service]
    │  Runs: null rates, row count anomalies, distribution drift,
    │        referential integrity, freshness checks
    │
    ▼
[quality_checks_log table - Snowflake]
    │  Columns: table, check_type, run_id, passed, observed_value, threshold, ts
    │
    ├───────────────────────┐
    ▼                       ▼
[Quality Dashboard]   [Alert Router]
 (Grafana/Superset)       │
                          ├── Slack (warnings)
                          └── PagerDuty (critical failures)

Key design decisions:

  • Check configuration as code: Store thresholds in YAML per table. Version-controlled. No UI needed to change thresholds.
  • Escalation logic: First failure → Slack warning. Three consecutive failures → PagerDuty. This reduces alert fatigue from transient issues.
  • Check parallelism: Run checks for different tables in parallel. Each check must be idempotent (tagged with run_id).
  • Baseline learning: For row count and mean checks, auto-calculate the baseline from the last 30 runs on first deployment. Re-baseline on explicit approval after intentional schema changes (a minimal sketch of this check follows).
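
Assuming a 3-sigma band learned from the last 30 run counts, and producing a record shaped like the quality_checks_log columns above:

Python
import statistics

def row_count_check(history: list[int], current: int, sigma: float = 3.0) -> dict:
    # Compare the current load's row count to a band learned from prior runs
    mean = statistics.mean(history)
    stdev = statistics.stdev(history) if len(history) > 1 else 0.0
    lower, upper = mean - sigma * stdev, mean + sigma * stdev
    return {
        "check_type": "row_count_anomaly",
        "passed": lower <= current <= upper,
        "observed_value": current,
        "threshold": f"[{lower:.0f}, {upper:.0f}]",
    }

# Example: baseline from the last 30 runs, today's load looks normal
print(row_count_check(history=[98_000, 101_500, 99_700] * 10, current=100_200))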

Q5: Design the Ingestion Architecture for 50 Microservices Sending Events

Question: 50 microservices each emit domain events (order placed, payment processed, inventory updated). Design an ingestion architecture that's decoupled, scalable, and allows each service team to evolve their schema independently.

Architecture:

[Microservice A]  [Microservice B]  [Microservice N]
       │                 │                 │
       └─────────────────┴─────────────────┘
                         │
                         ▼
              [Kafka - per-domain topics]
               orders.placed
               payments.processed
               inventory.updated
               (partitioned by entity_id)
                         │
                    ┌────┴────┐
                    ▼         ▼
             [Schema        [Dead Letter
              Registry]      Queue Topic]
              (Confluent)
                    │
                    ▼
           [Kafka Connect - S3 Sink]
           or [Spark Structured Streaming]
                    │
                    ▼
           [Delta Lake - Raw Events]
           Partition: domain / date / hour

Schema evolution strategy: Use Apache Avro schemas registered in Confluent Schema Registry. Configure compatibility as BACKWARD - consumers reading with the new schema can still decode events written with older schemas, and adding optional fields (with defaults) is an allowed change. This lets producer teams add fields without breaking the ingestion pipeline.
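
For illustration, a hedged sketch of a change that passes a BACKWARD compatibility check (the event and field names are hypothetical): the added field is nullable with a default, so a reader on the new schema can still decode records written with the old one, and readers on the old schema simply ignore it.

Python
ORDER_PLACED_V2 = {
    "type": "record",
    "name": "OrderPlaced",
    "namespace": "orders",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "customer_id", "type": "string"},
        {"name": "total", "type": "double"},
        # Added in v2: nullable with a default, so the change is backward compatible
        {"name": "coupon_code", "type": ["null", "string"], "default": None},
    ],
}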

Dead letter queue: Events that fail schema validation or deserialization go to events.dlq topic. A monitoring job alerts if DLQ lag grows above threshold.


Q6: How Would You Handle Late-Arriving Data?

Question: Your pipeline aggregates daily sales metrics. An upstream system sometimes sends corrected events 2-3 days late. How do you handle this?

Key dimensions: Reprocessing strategy, idempotency, downstream impact.

Answer: Three strategies, applied based on business requirements:

Strategy 1 - MERGE / Upsert (for small correction volumes):

SQL
-- Accept corrections up to 7 days late
-- Recompute full-day totals for any date touched by recently processed rows,
-- then MERGE the corrected aggregates back in
MERGE INTO daily_sales_agg AS target
USING (
    SELECT sale_date, SUM(revenue) AS daily_revenue
    FROM sales_raw
    WHERE sale_date IN (
        SELECT DISTINCT sale_date
        FROM sales_raw
        WHERE processed_at >= CURRENT_TIMESTAMP - INTERVAL '1 HOUR'
    )
    GROUP BY sale_date
) AS source
ON target.sale_date = source.sale_date
WHEN MATCHED THEN UPDATE SET target.daily_revenue = source.daily_revenue,
                             target.last_updated  = CURRENT_TIMESTAMP;

Strategy 2 - Watermark with grace period (Spark Streaming):

Python
from pyspark.sql import functions as F
from pyspark.sql.functions import window

# Allow 3 days of late data before closing the window
aggregated = (
    events_stream
    .withWatermark("event_time", "3 days")
    .groupBy(
        window("event_time", "1 day"),
        "product_id"
    )
    .agg(F.sum("revenue").alias("daily_revenue"))
)

Strategy 3 - Partition reprocessing (for full corrections): Drop and reload the affected date's partition. This is the safest approach if corrections are substantial:

Python
# Idempotent partition replace
df.write.mode("overwrite") \
    .option("replaceWhere", "sale_date = '2026-05-04'") \
    .format("delta") \
    .save(DELTA_PATH)

Q7: How Do You Ensure Exactly-Once Processing?

Question: Explain how you guarantee exactly-once semantics in a Kafka → Spark → Delta Lake pipeline.

Answer: Exactly-once is achieved by combining idempotent producers, transactional consumers, and atomic writes.

Kafka Producer:
  - enable.idempotence=true
  - acks=all
  - retries=MAX_INT

Kafka Consumer (Spark):
  - Read committed offsets from Kafka
  - Process batch
  - Write to Delta Lake (atomic)
  - Commit Kafka offsets ONLY after successful Delta write

Delta Lake:
  - Each write is an ACID transaction
  - Delta transaction log ensures atomicity
  - If job fails after writing but before committing offsets:
    - Restart re-reads from last committed offset
    - Delta's idempotency key prevents duplicate rows
Python
# Delta Lake write with idempotency key
(
    processed_df
    .write
    .format("delta")
    .mode("append")
    .option("txnAppId", "clickstream_pipeline")
    .option("txnVersion", batch_id)  # batch_id from Spark Structured Streaming
    .save(DELTA_PATH)
)
# Spark Structured Streaming + Delta gives exactly-once end-to-end
# when using foreachBatch with Delta MERGE
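
A minimal foreachBatch sketch of that pattern (a unique event_id to dedupe on, plus processed_stream and DELTA_PATH as stand-ins for the streaming DataFrame and target path, are assumptions):

Python
from delta.tables import DeltaTable

def upsert_batch(micro_batch_df, batch_id: int) -> None:
    # Dedupe within the micro-batch, then MERGE so a replayed batch cannot insert twice
    deduped = micro_batch_df.dropDuplicates(["event_id"])
    target = DeltaTable.forPath(spark, DELTA_PATH)
    (
        target.alias("t")
        .merge(deduped.alias("s"), "t.event_id = s.event_id")
        .whenNotMatchedInsertAll()
        .execute()
    )

(
    processed_stream.writeStream
    .foreachBatch(upsert_batch)
    .option("checkpointLocation", "/checkpoints/clickstream_gold")  # assumed path
    .start()
)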

Q8: Explain the Medallion Architecture and When to Deviate

Question: What is the medallion architecture? When would you not use it?

Medallion Architecture:

Raw Sources
    │
    ▼
[Bronze Layer]  ← Raw data, as-is from source
    │              Preserve everything, no transformations
    │              Schema: source schema + metadata columns
    │              (ingested_at, source_file, batch_id)
    ▼
[Silver Layer]  ← Cleaned, validated, deduplicated
    │              Conformed data types
    │              PII masked
    │              Failed quality checks quarantined
    ▼
[Gold Layer]    ← Business-level aggregations
                   Dimensional models (star schema)
                   Ready for BI tools and ML features

When to deviate:

  1. Simple pipelines with 1-2 transformations: A two-layer (raw + curated) architecture is simpler and cheaper to maintain. Don't add Bronze-Silver-Gold ceremony for a 5-table pipeline.

  2. Real-time requirements: Medallion is fundamentally batch-oriented. If you need sub-second data, add a streaming path that bypasses Bronze/Silver and writes directly to a real-time store (Redis, ClickHouse).

  3. When Gold tables become the canonical source: In a lakehouse with Delta Lake and Snowflake, sometimes Gold in Delta IS the data warehouse. Don't add a fourth layer by also loading to Snowflake unless there's a specific reason.


Q9: How Do You Handle Schema Changes in Production?

Question: A source system adds 3 new columns and renames 1 existing column overnight. Your pipeline breaks. How do you design against this?

Answer: Schema change resilience requires contracts at the source, detection in the pipeline, and graceful handling.

Detection:

Python
def detect_schema_changes(
    current_schema: dict,
    expected_schema: dict,
) -> dict:
    current_cols  = set(current_schema.keys())
    expected_cols = set(expected_schema.keys())

    added     = current_cols - expected_cols
    removed   = expected_cols - current_cols
    type_changes = {
        col: (expected_schema[col], current_schema[col])
        for col in current_cols & expected_cols
        if current_schema[col] != expected_schema[col]
    }
    return {"added": added, "removed": removed, "type_changes": type_changes}

Response strategy by change type:

| Change | Risk | Strategy |
|---|---|---|
| New nullable column added | Low | Auto-evolve Delta schema (mergeSchema=true), backfill with NULL |
| Column renamed | High | Alert + pause pipeline; update mapping and backfill |
| Column type widened (INT → BIGINT) | Low | Auto-evolve |
| Column type narrowed (BIGINT → INT) | High | Alert + manual review |
| Column removed | High | Alert; keep in warehouse with NULL going forward |

Python
# Delta Lake schema evolution β€” safe for additive changes
df.write.format("delta") \
    .option("mergeSchema", "true") \
    .mode("append") \
    .save(DELTA_PATH)

Q10: Design a Cost-Efficient Snowflake Architecture

Question: Your Snowflake bill is $80k/month. Design an architecture that reduces it by 40% without impacting query performance for end users.

Cost Reduction Architecture:

Current (expensive):
[All queries → 1 XL warehouse running 24/7]
    Monthly compute: ~$60k
    Storage:          ~$20k

Optimised:
┌────────────────────────────────────────────────────────┐
│  Warehouse Strategy                                    │
│                                                        │
│  DEV_WH (XS)       - developer queries                 │
│  REPORTING_WH (S)  - BI tool queries                   │
│                      (auto-suspend 60s, auto-resume)   │
│  PIPELINE_WH (M)   - dbt/pipeline runs only            │
│  ADHOC_WH (L)      - data science (manual start only)  │
└────────────────────────────────────────────────────────┘

Storage Optimisations:
- Enable auto-clustering on high-query tables
- Set data retention to 7 days (from 90) for staging tables
- DROP unused clones older than 3 days
- Move infrequently queried data to Iceberg on S3

Specific levers:

  1. Warehouse right-sizing: Most BI queries run fine on Small. Use Large only for complex joins/aggregations. Enforce via resource monitors.

  2. Query result cache: Ensure repeated identical queries hit Snowflake's result cache. Avoid parameterising queries with CURRENT_TIMESTAMP in the SELECT (it busts the cache).

  3. Materialised views: Pre-compute expensive aggregations used in dashboards. Snowflake automatically refreshes them.

  4. Table partitioning (clustering): On large tables queried by date range, set clustering key to order_date. Reduces bytes scanned per query.

  5. Zero-copy clones for dev/test: Use CREATE TABLE dev_orders CLONE orders for development - no storage cost for the clone until rows diverge. A sketch of these cost controls follows this list.
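
Using the Snowflake Python connector (account details, quota numbers, and warehouse names are assumptions):

Python
import snowflake.connector

conn = snowflake.connector.connect(
    account="acme", user="admin", password="...", role="ACCOUNTADMIN",  # assumed credentials
)
cur = conn.cursor()

# Right-size the BI warehouse and suspend it aggressively when idle
cur.execute(
    "ALTER WAREHOUSE REPORTING_WH SET WAREHOUSE_SIZE = 'SMALL' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE"
)

# Cap ad-hoc spend with a resource monitor
cur.execute("""
    CREATE OR REPLACE RESOURCE MONITOR adhoc_monthly
    WITH CREDIT_QUOTA = 200
    TRIGGERS ON 90 PERCENT DO NOTIFY
             ON 100 PERCENT DO SUSPEND
""")
cur.execute("ALTER WAREHOUSE ADHOC_WH SET RESOURCE_MONITOR = adhoc_monthly")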


Questions 11-20: Additional Topics

Q11: Design a Metadata and Data Lineage System

Key components: metadata store (Apache Atlas or DataHub), automatic lineage extraction from Airflow DAGs and SQL transformations, column-level lineage for GDPR right-to-erasure compliance.

Q12: Multi-Region Data Replication Strategy

Active-active vs active-passive. Use Kafka MirrorMaker 2 for cross-region topic replication. Design for RPO=0, RTO=15 minutes. Conflict resolution with last-writer-wins using event timestamps.

Q13: Design an ML Feature Store

Online store (Redis): low-latency serving for real-time models. Offline store (Delta Lake): historical feature values for training. Feature computation pipeline: Spark for batch features, Flink for streaming features. Point-in-time correct feature joins for training data.
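
A minimal sketch of a point-in-time correct join in PySpark (column names are assumptions): for each training label, take the latest feature value observed at or before the label timestamp, never after.

Python
from pyspark.sql import Window, functions as F

# labels: entity_id, label_ts, label      features: entity_id, feature_ts, feature_value
candidates = (
    labels.join(features, "entity_id")
    .where(F.col("feature_ts") <= F.col("label_ts"))  # no peeking into the future
)
latest = Window.partitionBy("entity_id", "label_ts").orderBy(F.col("feature_ts").desc())
training_set = (
    candidates
    .withColumn("rn", F.row_number().over(latest))
    .where(F.col("rn") == 1)
    .drop("rn")
)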

Q14: Data Mesh Architecture

Domain-oriented decentralised data ownership. Each domain team owns their data products. Central platform provides: storage infrastructure, data catalogue, quality standards. Trade-off: consistency vs. autonomy.

Q15: Designing for GDPR Right to Erasure

Pseudonymise PII at ingest (hash the email with a per-user salt). Store the salt in a separate key management service. To erase: delete the salt for the user - all their data becomes irrecoverable. Never store PII in partition keys or Parquet file names.
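
A minimal sketch of the pseudonymisation step (using an HMAC rather than a bare hash; the per-user salt is assumed to live in a separate key management service):

Python
import hashlib
import hmac

def pseudonymise_email(email: str, user_salt: bytes) -> str:
    # Stable pseudonym for joins; deleting the user's salt later makes it irrecoverable
    return hmac.new(user_salt, email.strip().lower().encode("utf-8"), hashlib.sha256).hexdigest()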

Q16-20: Additional System Design Topics

  • Backfill strategy for a 5-year historical load with minimal warehouse disruption
  • Designing a chargeback system for cloud data infrastructure costs
  • Handling PII in a multi-tenant data platform
  • Designing a data catalog with automated schema documentation
  • Managing 1000+ dbt models: modularity, CI/CD, testing strategy

What Interviewers Are Scoring

Breadth: Can you cover all layers (ingest → process → store → serve → monitor)?

Depth on trade-offs: Why Kafka over Kinesis? Why Delta Lake over Iceberg here? What breaks at 10x scale?

Operability: How do you backfill? What happens when the pipeline fails? How do you know it failed?

Cost awareness: Cloud costs are real. Interviewers want engineers who think about bytes scanned, warehouse sizing, and storage tier selection.

Evolution: How does your design handle schema changes? Adding new data sources? New consumers?

Lead with requirements, then design top-down, then dive into the component the interviewer cares most about. Always offer trade-offs - there's no single right answer.

Enjoyed this article?

Explore the Data Engineering learning path for more.
