System Design Interview Deep Dive — Worked Examples, Trade-offs, and Common Mistakes

This article assumes you know the basic framework (clarify, estimate, design, deep-dive, bottlenecks). Here we work through three complete examples end-to-end, with the level of trade-off reasoning that separates senior from mid-level answers.

The 5-Step Framework (Reminder)

Clarify requirements (5 min) — functional and non-functional
Estimate scale (5 min) — back-of-envelope for storage, throughput, bandwidth
High-level design (10 min) — major components and data flow
Deep dive (20 min) — 2–3 interesting sub-problems in detail
Bottlenecks and failure modes (5 min) — what breaks first, how you'd fix it

Worked Example 1: URL Shortener

Requirements

Functional:

Shorten a long URL → return a 7-char code
Redirect short code → original URL
Optional: click analytics (count per URL)

Non-functional (agreed with interviewer):

100M redirects/day, 10M new URLs/day
Redirect latency: p99 under 50ms
Data durability: URLs must never be silently lost
No URL expiry for v1

Scale Estimates

Redirects: 100M/day ÷ 86,400s = ~1,200/sec (peak 3x = ~3,600/sec)
Writes:    10M/day ÷ 86,400s = ~115/sec

Storage: 10M URLs/day × 365 days × 200 bytes/record = ~730 GB/year
Redis cache (hot 20%): 10M × 200B = ~2 GB → fits on a single node

Read:write = 100M:10M = 10:1 → heavily read-biased → cache aggressively

High-Level Design

POST /shorten → API Server → Counter Service → encode → Postgres → return code
GET  /{code}  → API Server → Redis lookup → (hit) 301 redirect
                                          → (miss) Postgres → cache → 301 redirect
                                          → async: Kafka → Analytics Service → ClickHouse

Deep Dive 1: Short Code Generation

Option A: Hash the URL

MD5(long_url) → take first 7 chars of hex → "a3f9b12"

Problem: collision probability grows with URL count. At 1B URLs, ~0.07% collision rate — that's 700K broken redirects. Need collision detection and retry → complex, latency hit.

Option B: Auto-increment ID → base62 encode

PostgreSQL SERIAL → 1, 2, 3, ... → base62 encode
ID 1,000,000 → base62 = "4c92" (4 chars)
ID 3,521,614,606,208 → base62 = "zzzzzzz" (7 chars) = 3.5 trillion URLs

Problem: predictable. Anyone can enumerate sho.rt/0000001, sho.rt/0000002 and scrape all URLs. Also: single database as the counter is a bottleneck at high write rates.

Option C: Pre-allocated counter ranges (recommended)

Each API server requests a block of 1,000 IDs from Redis: INCRBY counter 1000
Server uses IDs in memory — no further Redis calls per URL
ID → bit-shuffle (not reversible) → base62 encode → not guessable
On server restart: wasted IDs in current range are acceptable (< 0.01% waste)

Trade-off: IDs are not sequential in the database (bit-shuffled), making range scans harder — acceptable since we never do range scans on URLs, only point lookups by code.

Deep Dive 2: Redirect Path at 3,600 req/sec Under 50ms

1. DNS lookup → cached by browser after first visit
2. CDN (CloudFront / Azure Front Door):
   - Cache 301 redirects at CDN edge for popular URLs
   - Hit rate ~60% → eliminates 60% of traffic from origin
3. Origin: Redis cache lookup → O(1) → target URL
   - Cache TTL: 24h for URLs older than 1 day, 1h for new URLs
   - Hit rate: additional 30% → total 90% served from cache/CDN
4. Database: only for cache misses (~10%)
   - Index: hash index on short_code column (O(1) lookup)

At 3,600 redirect/sec with 90% cache hit: ~360 PostgreSQL queries/sec — easily handled by a single replica pair.

Deep Dive 3: Analytics Without Blocking Redirects

Never write to analytics on the redirect path — adds latency and creates a dependency.

On redirect: publish {short_code, ts, user_agent, ip_hash} to Kafka
Analytics Aggregator:
  - Reads Kafka, batches every 5 seconds
  - Bulk inserts to ClickHouse
  - ClickHouse is a columnar store: COUNT(*) over billions of rows in milliseconds

API query: SELECT count(*) FROM clicks WHERE short_code = 'abc123'
  → ClickHouse: 1-5ms on billions of rows

Bottlenecks and Failure Modes

| Component | Failure mode | Mitigation | |---|---|---| | Redis | Goes down | Circuit breaker → fall through to Postgres; degraded but functional | | Counter service | Redis restart | Pre-allocated ranges in memory for 1,000 IDs; brief outage acceptable | | Postgres | Primary fails | Read replica promoted (60-120s); CDN cache serves popular URLs during failover | | Kafka | Consumer lag | Analytics is delayed, not lost; add consumer instances |

Worked Example 2: Real-Time Chat

Requirements

Functional:

1:1 and group chat (max 500 members)
Real-time delivery (under 100ms)
Message history: 7 years
Delivery and read receipts

Non-functional:

50M DAU, average 40 messages/day
50M × 40 / 86,400 = ~23,000 messages/sec

Scale Estimates

Messages:    23,000/sec writes, assume 10x reads = 230,000/sec
Storage:     23,000 msg/sec × 86,400s × 365 days × 7 years × 1KB/msg = ~5 PB
             (with compression: ~1 PB)
Connections: 50M DAU × 20% concurrent = 10M open WebSocket connections
             10,000 connections/server → 1,000 WebSocket servers

High-Level Design

Client A → Load Balancer → WebSocket Server A ──────────────────→ Client B
                               ↓                                      ↑
                         Message Service                       WebSocket Server B
                               ↓                                      ↑
                           Kafka                               Redis Pub/Sub
                               ↓                ─────────────────────┘
                         Cassandra         Fanout Service reads Kafka,
                        (message store)    publishes to Redis channel per user

Deep Dive 1: WebSocket Connection Routing

When Client A sends a message to Client B, WebSocket Server A doesn't know which server Client B is on.

Solution: Redis Pub/Sub with user channels

Client B connects → WebSocket Server B subscribes to Redis channel "user:{B_id}"

Message from A to B:
  1. Server A receives message
  2. Server A publishes to Redis: PUBLISH user:{B_id} {message_json}
  3. Redis delivers to all servers subscribed to "user:{B_id}"
  4. Server B receives → sends to Client B's open connection

If Client B is offline:
  5. Server A also writes to "pending_messages:{B_id}" in Redis or Cassandra
  6. On reconnect, Client B polls for pending messages

Deep Dive 2: Message Storage — Why Cassandra

Requirements: 23,000 writes/sec, read by conversation ID, time-ordered, 7 years retention

PostgreSQL: vertical scaling, row-level locks under write load, single-node limit Cassandra: horizontal scaling, tunable consistency, designed for time-series append workloads

Partition key:   (conversation_id)    → all messages for a conv on one node
Clustering key:  (message_id DESC)    → newest first, fast range scans
Columns:         sender_id, content, created_at, type

Trade-off: Cassandra sacrifices ad-hoc queries (no joins, no GROUP BY). Acceptable — we only ever query by conversation_id.

Message ID: Snowflake or ULID — time-sortable, globally unique, generated by the API server without a database round-trip.

Deep Dive 3: Delivery and Read Receipts

"Sent":     Server received message → immediate 200 OK + message_id to sender
"Delivered": WebSocket Server B forwarded to Client B → B's server publishes
              ack to Redis → Server A forwards to Client A
"Read":     Client B opens conversation → sends READ event with last_read_message_id
            → Server publishes to Redis → Client A updates UI

Read receipts are high-volume. Batch them: client sends READ event at most once per second per conversation, server debounces at 500ms.

Bottlenecks and Failure Modes

10M concurrent WebSocket connections: OS file descriptor limit (~65K per process) — run multiple processes per host, or use an async framework (Node, Go, .NET Kestrel) that handles 100K+ connections per process.

Redis Pub/Sub fan-out in group chats: 500 members = 500 Redis publishes per message. At 23,000 msg/sec in 500-person groups: 11.5M Redis publishes/sec — too high. Solution: batch group messages, or use a dedicated group message queue.

Worked Example 3: Social Media Feed

Requirements

500M users, 1M posts/day
Feed shows posts from followed accounts, newest first (v1 — no ranking algorithm)
80% of users follow fewer than 200 accounts; 1% follow more than 10,000
Feed load time under 200ms

Scale Estimates

Posts:    1M/day = ~12/sec
Feed reads: 500M users × 5 feed loads/day = 2.5B reads/day = ~29,000/sec
Post views need multiplied by followers:
  Average 200 followers × 12 posts/sec = 2,400 fanout writes/sec → manageable
  Celebrity: 50M followers × 1 post → 50M fanout writes for 1 post → not manageable

Fan-Out Decision

| Strategy | Write cost | Read cost | Celebrity problem | |---|---|---|---| | Write (push) | High — write to all followers | O(1) read | 50M writes per post | | Read (pull) | O(1) write | High — query all followed users | Naturally handled | | Hybrid | Medium | O(1) for pre-computed feeds | Manageable |

Hybrid strategy:

Regular users (< 10K followers): fanout on write — post goes into followers' feed caches immediately via Kafka consumer
Celebrities (> 10K followers): fanout on read — feed reader pulls celebrity posts from their timeline and merges with pre-computed feed

Data Model

Feed cache (Redis ZSET):
  Key: feed:{user_id}
  Score: post_timestamp (Unix ms)
  Member: post_id
  Size: capped at 500 entries

User timeline (Cassandra):
  Partition: (user_id)
  Clustering: (post_id DESC)
  Used for: celebrity post retrieval, feed rebuild after cache miss

Feed Read Path

1. ZRANGE feed:{user_id} 0 49 → 50 most recent post IDs from Redis
2. MGET post:{id} for each post_id from Redis (post detail cache)
3. Fetch followed celebrities (< 5% of users follow any celebrity)
   → ZRANGEBYSCORE timeline:{celebrity_id} {48h_ago} +inf → last 48h posts
4. Merge: sort by timestamp, deduplicate, return top 50

Bottlenecks and Failure Modes

Cache cold start: User hasn't opened the app in 30 days — their feed cache is empty. Solution: rebuild feed on first request from Cassandra (fan-out on read for this user only), cap rebuild at last 500 posts from followed accounts.

Fanout lag spike: 1M users all post at 9am Monday. Kafka consumer lag grows. Solution: Kafka with 100 partitions and 100 consumer threads — fanout throughput scales horizontally.

Trade-Off Patterns to Know Cold

SQL vs NoSQL

Use SQL (PostgreSQL) when:

You need ACID transactions
Your access patterns involve joins and aggregations
Schema is well-defined and stable

Use NoSQL when:

Writes exceed 10,000/sec and you need horizontal scale
Access patterns are by a single key (no joins)
You need flexible schema (documents) or time-series (wide-column)

Sync vs Async Processing

Sync (inline): user waits for result → simple, consistent, but slower
Async (queue): user gets immediate ack, work done later → better UX, but eventual consistency

Rule: if the user needs the result to continue, do it sync. If it's a side effect (email, analytics, cache update), do it async.

Cache Strategies

| Strategy | Write path | Read path | Stale risk | |---|---|---|---| | Cache-aside | Write to DB only | Read from cache, miss → DB → cache | Possible | | Write-through | Write to DB + cache | Always from cache | Low | | Write-behind | Write to cache, async to DB | Always from cache | High (data loss risk) |

Cache-aside is the most common and safest — the application controls the cache explicitly.

Mistakes That Fail Senior Candidates

Not knowing the numbers: "We'd add more servers" — how many? What load does each handle? What's the cost? Numbers reveal that you've done this before.

Picking technology before understanding the problem: "We should use Kafka" before you know the message volume or whether async is even needed. Tech choice follows requirements.

Ignoring the CAP trade-off: For any distributed data store, you'll give up either consistency or availability under partition. State which you choose and why — it's not a trick question, it's showing you understand distributed systems.

Over-engineering for day 1: A system for 100K users doesn't need Cassandra and a Kafka cluster. PostgreSQL with a read replica and Redis cache handles 50K requests/sec. Start simple, know when to add complexity.

Not finishing: Running out of time in the deep-dive and never addressing bottlenecks. Practice timing. Interviewers penalise incomplete answers more than imperfect ones.

System Design Interview Deep Dive — Worked Examples, Trade-offs, and Common Mistakes

System Design Interview Deep Dive — Worked Examples, Trade-offs, and Common Mistakes

The 5-Step Framework (Reminder)

Worked Example 1: URL Shortener

Requirements

Scale Estimates

High-Level Design

Deep Dive 1: Short Code Generation

Deep Dive 2: Redirect Path at 3,600 req/sec Under 50ms

Deep Dive 3: Analytics Without Blocking Redirects

Bottlenecks and Failure Modes

Worked Example 2: Real-Time Chat

Requirements

Scale Estimates

High-Level Design

Deep Dive 1: WebSocket Connection Routing

Deep Dive 2: Message Storage — Why Cassandra

Deep Dive 3: Delivery and Read Receipts

Bottlenecks and Failure Modes

Worked Example 3: Social Media Feed

Requirements

Scale Estimates

Fan-Out Decision

Data Model

Feed Read Path

Bottlenecks and Failure Modes

Trade-Off Patterns to Know Cold

SQL vs NoSQL

Sync vs Async Processing

Cache Strategies

Mistakes That Fail Senior Candidates

Enjoyed this article?

Leave a comment