Learnixo

Data Engineering Interview Prep · Lesson 2 of 3

System Design: 20 Data Engineering Scenarios

How System Design Interviews Work for Data Engineers

Unlike software engineering system design, DE system design focuses on:

  • Throughput and latency trade-offs — batch vs. streaming vs. micro-batch
  • Data reliability — exactly-once semantics, idempotency, late data
  • Storage design — partitioning, clustering, file format choices
  • Operational concerns — schema evolution, backfilling, cost management

For each question below: read the question, identify the key dimensions the interviewer is testing, then study the answer.


Q1: Design a Real-Time Clickstream Pipeline

Question: Design a system to collect, process, and store clickstream events from a web application. The system must support 100k events/second peak load, real-time dashboards with 5-second latency, and long-term analytical queries.

Key dimensions: Ingest scalability, stream processing, storage tiers, latency budget, backpressure handling.

Architecture:

Web Clients
    │
    ▼
[Load Balancer]
    │
    ▼
[Kafka Cluster — 12 partitions, 3 brokers]
    │  Topic: clickstream.raw  (retention: 7d)
    │
    ├──────────────────────────────────┐
    ▼                                  ▼
[Spark Structured           [Flink — real-time]
 Streaming — micro-batch]    (5s processing window)
    │                                  │
    ▼                                  ▼
[Delta Lake                      [Redis / ClickHouse]
 Bronze/Silver/Gold]           (real-time dashboard store)
    │
    ▼
[Snowflake — Analytics]

Complete answer:

Ingestion layer: Kafka with 12 partitions (partitioned by user_id to maintain event order per user). Producers use async send with acks=1 for low latency. Set max.in.flight.requests.per.connection=1 if ordering is critical.

Streaming processing (two paths):

  • Hot path (Flink): 5-second tumbling windows, aggregate page views and clicks per user per session. Write to Redis for the real-time dashboard. Flink checkpoints every 30s to Kafka offsets — this gives exactly-once semantics with Kafka transactional producers.
  • Cold path (Spark Structured Streaming): Micro-batch every 2 minutes. Write raw events to Delta Lake Bronze layer. Transform to Silver (cleaned, deduplicated). Build Gold tables (session aggregations, funnel metrics).

Storage:

  • Delta Lake Bronze: raw events, partitioned by date and hour. Compaction job runs hourly to merge small files.
  • Snowflake: loaded from Delta Gold daily, used for ad-hoc analytical queries.
  • Retention: Delta Lake retains 90 days. Snowflake retains 3 years.

Capacity: At 100k events/sec, assuming 500 bytes average event size = 50 MB/s = 4.3 TB/day. Use Parquet with Snappy compression (3:1 ratio) = ~1.5 TB/day in Delta Lake.


Q2: Design an Idempotent Batch Pipeline

Question: Design a batch pipeline that loads data from a PostgreSQL OLTP database to a Snowflake data warehouse nightly. It must be idempotent — re-running the same pipeline for the same date must produce exactly the same result without duplicates.

Key dimensions: Idempotency strategy, incremental vs full load, watermark management.

Architecture:

[PostgreSQL OLTP]
    │
    │  Extract (changed rows only)
    ▼
[S3 Landing Zone]
    │  File: orders_2026-05-07_v1.parquet
    │  Idempotency key: date + run_id
    ▼
[Snowflake Staging Table]  ← truncated before each load
    │
    │  MERGE into warehouse (upsert by primary key)
    ▼
[Snowflake Warehouse Table]
    │
    ▼
[Audit Log: pipeline_runs]

Idempotency design:

SQL
-- Idempotency pattern using a staging table + MERGE
-- Step 1: Truncate staging (safe to repeat)
TRUNCATE TABLE orders_staging;

-- Step 2: Load today's extract into staging
COPY INTO orders_staging
FROM @s3_stage/orders/2026-05-07/
FILE_FORMAT = (TYPE = PARQUET);

-- Step 3: MERGE into warehouse (idempotent — same result on repeat)
MERGE INTO orders_warehouse AS target
USING orders_staging AS source
ON target.order_id = source.order_id

WHEN MATCHED AND source.updated_at > target.updated_at
THEN UPDATE SET
    target.status    = source.status,
    target.revenue   = source.revenue,
    target.updated_at = source.updated_at

WHEN NOT MATCHED
THEN INSERT (order_id, customer_id, status, revenue, created_at, updated_at)
VALUES (source.order_id, source.customer_id, source.status,
        source.revenue, source.created_at, source.updated_at);

-- Step 4: Log the run (also idempotent with MERGE)
MERGE INTO pipeline_runs AS t
USING (SELECT 'orders' AS pipeline, '2026-05-07' AS run_date) AS s
ON t.pipeline = s.pipeline AND t.run_date = s.run_date
WHEN MATCHED THEN UPDATE SET t.last_run_ts = CURRENT_TIMESTAMP, t.status = 'success'
WHEN NOT MATCHED THEN INSERT VALUES (s.pipeline, s.run_date, CURRENT_TIMESTAMP, 'success');

Q3: Design a Data Warehouse for E-Commerce

Question: Design the data warehouse schema for an e-commerce company. Must support: sales analytics, customer lifetime value, product performance, and marketing attribution.

Key dimensions: Star schema design, SCD handling, fact table granularity.

Schema Design:

                 ┌────────────────┐
                 │  dim_date      │
                 │  date_key PK   │
                 │  day, week     │
                 │  month, quarter│
                 └───────┬────────┘
                         │
┌──────────────┐         │         ┌──────────────────┐
│  dim_customer│         │         │  dim_product      │
│  customer_key│◄────────┤         │  product_key      │
│  customer_id │         │         │  product_id       │
│  email       │         │         │  name, category   │
│  tier        │         │         │  unit_cost        │
│  valid_from  │         │         └────────┬─────────┘
│  valid_to    │    ┌────▼─────┐            │
│  is_current  │    │ fact_    │◄───────────┤
└──────────────┘    │ orders   │            │
                    │          │   ┌────────┴──────────┐
┌──────────────┐    │order_key │   │  dim_channel       │
│  dim_channel │◄───│date_key  │   │  channel_key       │
│  channel_key │    │customer_k│   │  name, type        │
│  name, type  │    │product_ke│   └───────────────────┘
└──────────────┘    │channel_ke│
                    │revenue   │
                    │quantity  │
                    │discount  │
                    │margin    │
                    └──────────┘

Design decisions to discuss in the interview:

  1. Granularity: fact_orders is at order-line level (one row per product per order). This allows product-level analysis without joining order_items. For marketing attribution, add a separate fact_touchpoints table.

  2. SCD Type 2 on dim_customer: Use valid_from, valid_to, is_current to track tier changes over time. This lets you answer "what tier was this customer when they made this purchase?"

  3. Degenerate dimensions: order_id lives directly in the fact table (degenerate dimension) since you don't need a dim_order table — the order header information is fully captured in the fact row.

  4. Slowly Changing Dimension Type 1 vs 2:

    • Type 1 (overwrite): product price — you want current price for margin calculations
    • Type 2 (history): customer tier, address — you want historical context for past orders

Q4: Design a Data Quality Monitoring System

Question: Design a system that continuously monitors data quality across 50+ pipeline tables and alerts on anomalies.

Architecture:

[All Pipeline Tables in DW]
    │
    ▼
[Quality Check Scheduler — Airflow]
    │  Runs quality checks per table after each pipeline load
    │
    ▼
[Quality Engine — Python service]
    │  Runs: null rates, row count anomalies, distribution drift,
    │         referential integrity, freshness checks
    │
    ▼
[quality_checks_log table — Snowflake]
    │  Columns: table, check_type, run_id, passed, observed_value, threshold, ts
    │
    ├──────────────────────┐
    ▼                      ▼
[Quality Dashboard]  [Alert Router]
  (Grafana/Superset)   │
                        ├── Slack (warnings)
                        └── PagerDuty (critical failures)

Key design decisions:

  • Check configuration as code: Store thresholds in YAML per table. Version-controlled. No UI needed to change thresholds.
  • Escalation logic: First failure → Slack warning. Three consecutive failures → PagerDuty. This reduces alert fatigue from transient issues.
  • Check parallelism: Run checks for different tables in parallel. Each check must be idempotent (tagged with run_id).
  • Baseline learning: For row count and mean checks, auto-calculate the baseline from the last 30 runs on first deployment. Re-baseline on explicit approval after intentional schema changes.

Q5: Design the Ingestion Architecture for 50 Microservices Sending Events

Question: 50 microservices each emit domain events (order placed, payment processed, inventory updated). Design an ingestion architecture that's decoupled, scalable, and allows each service team to evolve their schema independently.

Architecture:

[Microservice A]  [Microservice B]  [Microservice N]
       │                 │                 │
       └─────────────────┴─────────────────┘
                         │
                         ▼
              [Kafka — per-domain topics]
               orders.placed
               payments.processed
               inventory.updated
               (partitioned by entity_id)
                         │
                    ┌────┴────┐
                    ▼         ▼
             [Schema       [Dead Letter
              Registry]     Queue Topic]
              (Confluent)
                    │
                    ▼
           [Kafka Connect — S3 Sink]
           or [Spark Structured Streaming]
                    │
                    ▼
           [Delta Lake — Raw Events]
           Partition: domain / date / hour

Schema evolution strategy: Use Apache Avro schemas registered in Confluent Schema Registry. Configure compatibility as BACKWARD — new schema versions must be readable by old consumers. This allows producer teams to add fields without breaking the ingestion pipeline.

Dead letter queue: Events that fail schema validation or deserialization go to events.dlq topic. A monitoring job alerts if DLQ lag grows above threshold.


Q6: How Would You Handle Late-Arriving Data?

Question: Your pipeline aggregates daily sales metrics. An upstream system sometimes sends corrected events 2-3 days late. How do you handle this?

Key dimensions: Reprocessing strategy, idempotency, downstream impact.

Answer: Three strategies, applied based on business requirements:

Strategy 1 — MERGE / Upsert (for small correction volumes):

SQL
-- Accept corrections up to 7 days late
-- Use MERGE to correct the affected date's aggregates
MERGE INTO daily_sales_agg AS target
USING (
    SELECT sale_date, SUM(revenue) AS daily_revenue
    FROM sales_raw
    WHERE processed_at >= CURRENT_TIMESTAMP - INTERVAL '1 HOUR'
    GROUP BY sale_date
) AS source
ON target.sale_date = source.sale_date
WHEN MATCHED THEN UPDATE SET target.daily_revenue = source.daily_revenue,
                              target.last_updated   = CURRENT_TIMESTAMP;

Strategy 2 — Watermark with grace period (Spark Streaming):

Python
# Allow 3 days of late data before closing the window
aggregated = (
    events_stream
    .withWatermark("event_time", "3 days")
    .groupBy(
        window("event_time", "1 day"),
        "product_id"
    )
    .agg(F.sum("revenue").alias("daily_revenue"))
)

Strategy 3 — Partition reprocessing (for full corrections): Drop and reload the affected date's partition. This is the safest approach if corrections are substantial:

Python
# Idempotent partition replace
df.write.mode("overwrite") \
    .option("replaceWhere", "sale_date = '2026-05-04'") \
    .format("delta") \
    .save(DELTA_PATH)

Q7: How Do You Ensure Exactly-Once Processing?

Question: Explain how you guarantee exactly-once semantics in a Kafka → Spark → Delta Lake pipeline.

Answer: Exactly-once is achieved by combining idempotent producers, transactional consumers, and atomic writes.

Kafka Producer:
  - enable.idempotence=true
  - acks=all
  - retries=MAX_INT

Kafka Consumer (Spark):
  - Read committed offsets from Kafka
  - Process batch
  - Write to Delta Lake (atomic)
  - Commit Kafka offsets ONLY after successful Delta write

Delta Lake:
  - Each write is an ACID transaction
  - Delta transaction log ensures atomicity
  - If job fails after writing but before committing offsets:
    - Restart re-reads from last committed offset
    - Delta's idempotency key prevents duplicate rows
Python
# Delta Lake write with idempotency key
(
    processed_df
    .write
    .format("delta")
    .mode("append")
    .option("txnAppId", "clickstream_pipeline")
    .option("txnVersion", batch_id)  # batch_id from Spark Structured Streaming
    .save(DELTA_PATH)
)
# Spark Structured Streaming + Delta gives exactly-once end-to-end
# when using foreachBatch with Delta MERGE

Q8: Explain the Medallion Architecture and When to Deviate

Question: What is the medallion architecture? When would you not use it?

Medallion Architecture:

Raw Sources
    │
    ▼
[Bronze Layer]  ← Raw data, as-is from source
    │              Preserve everything, no transformations
    │              Schema: source schema + metadata columns
    │              (ingested_at, source_file, batch_id)
    ▼
[Silver Layer]  ← Cleaned, validated, deduplicated
    │              Conformed data types
    │              PII masked
    │              Failed quality checks quarantined
    ▼
[Gold Layer]    ← Business-level aggregations
                   Dimensional models (star schema)
                   Ready for BI tools and ML features

When to deviate:

  1. Simple pipelines with 1-2 transformations: A two-layer (raw + curated) architecture is simpler and cheaper to maintain. Don't add Bronze-Silver-Gold ceremony for a 5-table pipeline.

  2. Real-time requirements: Medallion is fundamentally batch-oriented. If you need sub-second data, add a streaming path that bypasses Bronze/Silver and writes directly to a real-time store (Redis, ClickHouse).

  3. When Gold tables become the canonical source: In a lakehouse with Delta Lake and Snowflake, sometimes Gold in Delta IS the data warehouse. Don't add a fourth layer by also loading to Snowflake unless there's a specific reason.


Q9: How Do You Handle Schema Changes in Production?

Question: A source system adds 3 new columns and renames 1 existing column overnight. Your pipeline breaks. How do you design against this?

Answer: Schema change resilience requires contracts at the source, detection in the pipeline, and graceful handling.

Detection:

Python
def detect_schema_changes(
    current_schema: dict,
    expected_schema: dict,
) -> dict:
    current_cols  = set(current_schema.keys())
    expected_cols = set(expected_schema.keys())

    added     = current_cols - expected_cols
    removed   = expected_cols - current_cols
    type_changes = {
        col: (expected_schema[col], current_schema[col])
        for col in current_cols & expected_cols
        if current_schema[col] != expected_schema[col]
    }
    return {"added": added, "removed": removed, "type_changes": type_changes}

Response strategy by change type:

| Change | Risk | Strategy | |---|---|---| | New nullable column added | Low | Auto-evolve Delta schema (mergeSchema=true), backfill with NULL | | Column renamed | High | Alert + pause pipeline; update mapping and backfill | | Column type widened (INT→BIGINT) | Low | Auto-evolve | | Column type narrowed (BIGINT→INT) | High | Alert + manual review | | Column removed | High | Alert; keep in warehouse with NULL going forward |

Python
# Delta Lake schema evolution  safe for additive changes
df.write.format("delta") \
    .option("mergeSchema", "true") \
    .mode("append") \
    .save(DELTA_PATH)

Q10: Design a Cost-Efficient Snowflake Architecture

Question: Your Snowflake bill is $80k/month. Design an architecture that reduces it by 40% without impacting query performance for end users.

Cost Reduction Architecture:

Current (expensive):
[All queries → 1 XL warehouse running 24/7]
    Monthly compute: ~$60k
    Storage:          ~$20k

Optimised:
┌─────────────────────────────────────────────┐
│  Warehouse Strategy                          │
│                                             │
│  DEV_WH (XS)    — developer queries         │
│  REPORTING_WH (S) — BI tool queries (auto-  │
│                     suspend 60s, auto-start) │
│  PIPELINE_WH (M)  — dbt/pipeline runs only  │
│  ADHOC_WH (L)     — data science (manual    │
│                     start only)             │
└─────────────────────────────────────────────┘

Storage Optimisations:
- Enable auto-clustering on high-query tables
- Set data retention to 7 days (from 90) for staging tables
- DROP unused clones older than 3 days
- Move infrequently queried data to Iceberg on S3

Specific levers:

  1. Warehouse right-sizing: Most BI queries run fine on Small. Use Large only for complex joins/aggregations. Enforce via resource monitors.

  2. Query result cache: Ensure repeated identical queries hit Snowflake's result cache. Avoid parameterising queries with CURRENT_TIMESTAMP in the SELECT (it busts the cache).

  3. Materialised views: Pre-compute expensive aggregations used in dashboards. Snowflake automatically refreshes them.

  4. Table partitioning (clustering): On large tables queried by date range, set clustering key to order_date. Reduces bytes scanned per query.

  5. Zero-copy clones for dev/test: Use CREATE TABLE dev_orders CLONE orders for development — no storage cost for the clone until rows diverge.


Questions 11-20: Additional Topics

Q11: Design a Metadata and Data Lineage System

Key components: metadata store (Apache Atlas or DataHub), automatic lineage extraction from Airflow DAGs and SQL transformations, column-level lineage for GDPR right-to-erasure compliance.

Q12: Multi-Region Data Replication Strategy

Active-active vs active-passive. Use Kafka MirrorMaker 2 for cross-region topic replication. Design for RPO=0, RTO=15 minutes. Conflict resolution with last-writer-wins using event timestamps.

Q13: Design an ML Feature Store

Online store (Redis): low-latency serving for real-time models. Offline store (Delta Lake): historical feature values for training. Feature computation pipeline: Spark for batch features, Flink for streaming features. Point-in-time correct feature joins for training data.

Q14: Data Mesh Architecture

Domain-oriented decentralised data ownership. Each domain team owns their data products. Central platform provides: storage infrastructure, data catalogue, quality standards. Trade-off: consistency vs. autonomy.

Q15: Designing for GDPR Right to Erasure

Pseudonymise PII at ingest (hash email with rotating salt). Store salt in a separate key management service. To erase: delete the salt for the user — all their data becomes irrecoverable. Never store PII in partition keys or Parquet file names.

Q16-20: Additional System Design Topics

  • Backfill strategy for a 5-year historical load with minimal warehouse disruption
  • Designing a chargeback system for cloud data infrastructure costs
  • Handling PII in a multi-tenant data platform
  • Designing a data catalog with automated schema documentation
  • Managing 1000+ dbt models: modularity, CI/CD, testing strategy

What Interviewers Are Scoring

Breadth: Can you cover all layers (ingest → process → store → serve → monitor)?

Depth on trade-offs: Why Kafka over Kinesis? Why Delta Lake over Iceberg here? What breaks at 10x scale?

Operability: How do you backfill? What happens when the pipeline fails? How do you know it failed?

Cost awareness: Cloud costs are real. Interviewers want engineers who think about bytes scanned, warehouse sizing, and storage tier selection.

Evolution: How does your design handle schema changes? Adding new data sources? New consumers?

Lead with requirements, then design top-down, then dive into the component the interviewer cares most about. Always offer trade-offs — there's no single right answer.