System Design · Lesson 26 of 26

System Design Interview: 40 Questions & Model Answers

These are the questions that actually appear in system design interviews, organised from foundational concepts to architect-level trade-offs. Each answer is a concise model — expand on any that match your target role.


Foundations (Junior → Mid)

1. What is horizontal vs vertical scaling? When would you choose each?

Vertical scaling (scaling up) means adding more resources to a single machine — more CPU, RAM, faster disk. Horizontal scaling (scaling out) means adding more machines and distributing the load.

Choose vertical first when your bottleneck is a single process that can't easily be distributed (e.g., a heavily stateful service). Choose horizontal when you need availability (no single point of failure), when you've hit hardware limits, or when your workload is stateless and easy to distribute. Most production systems use both — vertical up to a point, then horizontal.


2. What is a CDN and when should you use it?

A Content Delivery Network is a globally distributed network of edge servers that cache content close to users. Use it for: static assets (JS, CSS, images), read-heavy content that doesn't change per-user (product pages, documentation), and video streaming. Don't use it for personalised or authentication-required content: a CDN caches by URL (plus whatever headers you add to the cache key), so by default it serves the same response to every user.


3. What is the difference between latency and throughput?

Latency is the time to complete one operation (e.g., 50ms to fetch a record). Throughput is the number of operations per unit time (e.g., 10,000 requests/second). Optimising one doesn't always improve the other — batching improves throughput but increases per-item latency. Most systems need to optimise for both, and the right trade-off depends on the use case.


4. What is caching and what are the main invalidation strategies?

Caching stores results of expensive operations in fast storage (usually memory) to serve future requests faster. Main invalidation strategies:

  • TTL (time-to-live): Cache expires after N seconds. Simple, but stale data is possible.
  • Write-through: Update cache on every write. Cache is always fresh, but writes are slower.
  • Write-behind (write-back): Write to cache immediately, sync to DB asynchronously. Fast writes, risk of data loss if cache fails before sync.
  • Cache-aside (lazy loading): Application checks cache first; on miss, reads from DB and populates cache. Most common pattern.
  • Event-driven invalidation: A message is published on write; cache subscribers evict the key. Accurate but complex.
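
The cache-aside pattern from the list above can be sketched in a few lines of Python; this is a minimal illustration with a dict standing in for Redis and `load_fn` standing in for the database query:

```python
import time

class CacheAside:
    """Minimal cache-aside sketch: a dict stands in for Redis,
    and load_fn stands in for the database query."""

    def __init__(self, load_fn, ttl_seconds=60):
        self.load_fn = load_fn
        self.ttl = ttl_seconds
        self.store = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self.store.get(key)
        if entry and entry[1] > time.time():
            return entry[0]                       # cache hit
        value = self.load_fn(key)                 # cache miss: read from "DB"
        self.store[key] = (value, time.time() + self.ttl)
        return value

    def invalidate(self, key):
        self.store.pop(key, None)                 # evict on write

db = {"user:1": "Ada"}
cache = CacheAside(lambda k: db[k])
print(cache.get("user:1"))  # miss -> loads "Ada" from db
db["user:1"] = "Grace"
print(cache.get("user:1"))  # hit -> still "Ada" until TTL or invalidation
cache.invalidate("user:1")
print(cache.get("user:1"))  # miss -> "Grace"
```

The second read returning stale "Ada" is exactly the TTL trade-off described above: simplicity in exchange for a staleness window.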

5. What is a load balancer and what algorithms does it use?

A load balancer distributes incoming requests across multiple servers. Common algorithms:

  • Round-robin: Requests distributed sequentially. Good for homogeneous servers.
  • Least connections: Route to server with fewest active connections. Good when requests vary in processing time.
  • IP hash: Same client always routes to same server. Useful for session stickiness, but can create imbalance.
  • Weighted: Servers with more capacity receive proportionally more traffic.

Layer 4 (TCP) load balancers operate at the transport level — fast but no HTTP awareness. Layer 7 (HTTP) can route based on URL path, headers, or cookies — more flexible, slightly more overhead.
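
The first two algorithms can be sketched as in-process pickers (a simplification; a real load balancer also tracks health checks and connection lifecycle):

```python
import itertools

class RoundRobin:
    """Cycle through servers in fixed order."""
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)

class LeastConnections:
    """Route to the server with the fewest active connections."""
    def __init__(self, servers):
        self.active = {s: 0 for s in servers}

    def pick(self):
        server = min(self.active, key=self.active.get)
        self.active[server] += 1   # caller must release() when the request finishes
        return server

    def release(self, server):
        self.active[server] -= 1

rr = RoundRobin(["a", "b", "c"])
print([rr.pick() for _ in range(4)])   # ['a', 'b', 'c', 'a']

lc = LeastConnections(["a", "b"])
lc.pick(); lc.pick()                   # one in-flight request each
lc.release("a")                        # "a" frees up first
print(lc.pick())                       # 'a'
```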


6. What is the difference between SQL and NoSQL? When do you choose each?

SQL (relational) databases enforce a schema, support ACID transactions, and excel at complex queries with joins. Choose when: you need transactions across multiple entities, your data relationships are complex, your query patterns are diverse and unpredictable.

NoSQL databases trade some of this for horizontal scalability, flexible schemas, and optimised access patterns. Choose when: you have a specific, known access pattern (key-value, document, time-series); you need to scale writes horizontally; your data doesn't fit a tabular model well.

The most important rule: pick based on your access pattern, not hype. Cassandra is excellent for time-ordered writes with partition-key reads. Redis is excellent for cache and pub/sub. PostgreSQL handles 99% of business application needs.


7. What is eventual consistency vs strong consistency?

With strong consistency, after a write completes, every subsequent read reflects that write — no matter which replica serves the read.

With eventual consistency, after a write, reads may return stale data for some window of time, but all replicas will eventually converge to the latest value.

Strong consistency is easier to reason about but limits availability during network partitions (CAP theorem). Eventual consistency enables higher availability and throughput but requires the application to handle stale reads. Use strong consistency for financial transactions, inventory counts. Use eventual consistency for social media feeds, view counts, recommendation systems.


8. What is database sharding and when do you need it?

Sharding splits your database into multiple partitions (shards), each holding a subset of the data. A shard key determines which shard a row belongs to.

You need sharding when a single database node can't handle your write throughput or total data volume — typically millions of writes/second or terabytes of data. Before sharding, exhaust: read replicas (offload reads), vertical scaling (more RAM/CPU), caching (reduce DB load), and archival (move old data out).

Sharding introduces complexity: cross-shard queries and transactions become difficult; rebalancing shards when adding nodes requires data migration. Common shard keys: user ID (for user-scoped data), date (for time-series).
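
Routing by shard key is typically a stable hash modulo the shard count; a minimal sketch, assuming a hypothetical fixed four-shard layout:

```python
import hashlib

NUM_SHARDS = 4  # assumption: fixed shard count; changing it requires data migration

def shard_for(user_id: str) -> int:
    """Map a shard key to a shard with a stable hash.
    Avoid Python's built-in hash(): it is randomised per process."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# The same key always routes to the same shard
assert shard_for("user-42") == shard_for("user-42")
print({u: shard_for(u) for u in ["user-1", "user-2", "user-3"]})
```

Note the `% NUM_SHARDS` is exactly why resharding is painful; consistent hashing (question 16) is the usual answer.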


9. What is a message queue and why use one?

A message queue decouples producers (who generate work) from consumers (who process it). Benefits:

  • Absorb traffic spikes: Queue buffers work during peak; consumers process at steady rate
  • Decoupling: Producer doesn't wait for the consumer to finish
  • Retry: Failed messages can be re-queued
  • Fan-out: One message can be consumed by multiple services

Use cases: email delivery, notification fan-out, order processing, async job execution. Common options: AWS SQS (managed, simple), Kafka (high throughput, log retention, replay), RabbitMQ (rich routing, lower throughput).


10. What is the CAP theorem?

CAP theorem states a distributed system can guarantee at most two of: Consistency (every read sees the latest write), Availability (every request gets a response), and Partition Tolerance (system works even if network partitions occur between nodes).

Since network partitions happen in any real distributed system, you're really choosing between CP (consistent but may reject requests during partition) and AP (always available but may return stale data). Most systems choose AP for general reads and CP for critical writes (e.g., balance checks).


Intermediate (Mid → Senior)

11. How would you design a caching layer for a high-traffic API?

Layer the cache: CDN for public, cacheable responses; Redis in front of the DB for dynamic data. Cache the results of expensive queries, not raw DB rows. Set TTLs based on how stale the data can be. Use cache-aside pattern: check cache, on miss read DB, populate cache. Handle cache stampede (thundering herd) with mutex locks or probabilistic early expiry. Monitor hit rate — below 80% suggests you're caching the wrong things.


12. What is the difference between optimistic and pessimistic locking?

Pessimistic locking acquires a lock before reading data to prevent concurrent modification. Safe but reduces throughput — row locks block other transactions.

Optimistic locking reads data without locking, includes a version number. On write, checks if the version has changed — if it has (someone else wrote), the transaction fails and must retry. Better throughput for low-contention scenarios; the retry cost is only paid when contention actually occurs.

EF Core supports optimistic concurrency via the [ConcurrencyCheck] attribute or rowversion columns ([Timestamp]). Use pessimistic for financial transactions where you can't afford retries; optimistic for most CRUD operations.
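
The version-check-and-retry loop can be sketched in plain Python, with a dict standing in for the table:

```python
class VersionConflict(Exception):
    pass

# A row with a version column; a dict stands in for the table
table = {"account:1": {"balance": 100, "version": 1}}

def read(key):
    row = table[key]
    return row["balance"], row["version"]

def write(key, new_balance, expected_version):
    """Compare-and-swap: fail if someone else wrote since we read."""
    row = table[key]
    if row["version"] != expected_version:
        raise VersionConflict
    row["balance"] = new_balance
    row["version"] += 1

def deposit(key, amount, max_retries=3):
    for _ in range(max_retries):
        balance, version = read(key)
        try:
            write(key, balance + amount, version)
            return
        except VersionConflict:
            continue  # someone wrote in between: re-read and retry
    raise RuntimeError("too much contention")

deposit("account:1", 50)
print(table["account:1"])  # {'balance': 150, 'version': 2}
```

In a real database the compare-and-swap is one statement (`UPDATE ... WHERE version = @expected`), so the check and the write are atomic.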


13. How do you handle database migrations without downtime?

Use expand-and-contract migrations:

  1. Expand: Add the new column/table (nullable or with default). Application writes to both old and new. Run data migration in background.
  2. Contract: Once all rows are migrated and the old column is unused, remove it.

Never rename or drop a column in a single deployment — the running app will break. Additive changes are safe; destructive changes require the two-phase approach. Tools: Flyway, Liquibase, EF Core migrations (with careful review of generated SQL).


14. What is the outbox pattern?

The outbox pattern ensures that a DB write and an event publish are atomic — you either do both or neither.

Instead of publishing directly to Kafka/RabbitMQ (which is a separate system and can fail independently), write the event to an outbox table in the same DB transaction as the business data. A separate process reads the outbox table and publishes events, then marks them as published.

BEGIN TRANSACTION;
INSERT INTO orders (id, status) VALUES (...);
INSERT INTO outbox (event_type, payload) VALUES ('OrderCreated', '...');
COMMIT;
-- Background poller reads outbox, publishes to Kafka, marks sent

Combined with idempotent consumers on the receiving side, this gives effectively exactly-once processing. The outbox itself guarantees at-least-once publishing: the poller may crash after publishing but before marking a row as sent, so duplicates are possible and consumers must tolerate them.
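
The poller half of the pattern can be sketched in Python, with lists standing in for the outbox table and the broker (`publish` is a hypothetical broker-client call):

```python
# In-memory sketch of the outbox poller loop. The lists stand in for
# the outbox table and the broker.
outbox = [
    {"id": 1, "event_type": "OrderCreated", "payload": "{...}", "sent": False},
]
broker = []

def publish(event):
    broker.append((event["event_type"], event["payload"]))

def poll_once():
    for row in outbox:
        if not row["sent"]:
            publish(row)          # re-published if we crash before marking: at-least-once
            row["sent"] = True    # mark only after a successful publish

poll_once()
print(broker)        # [('OrderCreated', '{...}')]
poll_once()          # already-sent rows are skipped
print(len(broker))   # 1
```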


15. How would you scale a read-heavy service to 1M requests/second?

  1. CDN for public, cacheable content — serves majority of traffic without hitting origin
  2. Redis cache in front of the DB — sub-millisecond for hot data
  3. Read replicas — scale the DB read path horizontally
  4. Stateless API servers behind a load balancer — scale horizontally as needed
  5. Connection pooling — don't open a new DB connection per request
  6. Async for non-critical reads — return stale data from cache while refreshing in background

Start with caching. Most read-heavy systems can serve 90% of traffic from cache, reducing DB load to a manageable level.


16. What is consistent hashing and why is it used?

Consistent hashing distributes keys across nodes such that adding or removing a node only remaps a fraction of keys (1/N, where N is the number of nodes) — not all keys.

With simple modulo hashing (key % N), adding one node changes the mapping for almost every key, causing massive cache invalidation and DB pressure.

Used in: distributed caches (Redis Cluster), load balancers that need session stickiness, distributed storage (DynamoDB, Cassandra). Virtual nodes handle uneven distribution by giving each physical node multiple positions on the hash ring.
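
A toy hash ring with virtual nodes shows the key property: adding a fourth node moves only around a quarter of the keys, not all of them (a sketch, using MD5 purely as a stable hash):

```python
import bisect
import hashlib

class HashRing:
    """Consistent hash ring with virtual nodes (a sketch)."""

    def __init__(self, nodes, vnodes=100):
        self.ring = []  # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):
                self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()
        self._keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key):
        # First ring position clockwise from the key's hash
        i = bisect.bisect(self._keys, self._hash(key)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring.node_for(k) for k in (f"key{i}" for i in range(1000))}
ring2 = HashRing(["node-a", "node-b", "node-c", "node-d"])
moved = sum(1 for k in before if ring2.node_for(k) != before[k])
print(f"{moved} of 1000 keys moved")  # roughly a quarter, not all 1000
```

With simple `key % N` hashing, the same experiment would move roughly three quarters of the keys.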


17. What is a circuit breaker and when do you use it?

A circuit breaker wraps calls to external services and monitors failure rate. If failures exceed a threshold, it "opens" — subsequent calls fail immediately without hitting the downstream service, giving it time to recover. After a timeout, it moves to "half-open" and tries a probe request.

Use it when: calling downstream services that may be slow or unavailable; you need to prevent cascade failures (a slow DB causing all threads to block, cascading into full service outage). Libraries: Polly (C#), Resilience4j (Java), Hystrix.
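
A minimal breaker can be sketched in a few lines; libraries like Polly add thread safety, richer policies, and metrics on top of the same state machine:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker sketch: closed -> open after N failures,
    half-open after a cooldown, closed again on a successful probe."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            # cooldown elapsed: half-open, let one probe through
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()
            raise
        self.failures = 0
        self.opened_at = None  # probe succeeded: close the circuit
        return result

cb = CircuitBreaker(failure_threshold=2, reset_timeout=30.0)
def flaky():
    raise TimeoutError("downstream is down")
for _ in range(2):
    try: cb.call(flaky)
    except TimeoutError: pass
try:
    cb.call(flaky)
except RuntimeError as e:
    print(e)   # circuit open: failing fast
```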


18. How do you design for idempotency in APIs?

An idempotent operation produces the same result whether called once or N times. GET, PUT, DELETE are naturally idempotent. POST is not — calling POST /orders twice creates two orders.

For POST, use idempotency keys: client generates a unique key (UUID) per request, sends it as a header (Idempotency-Key). Server stores the response for each key. On duplicate request with the same key, return the stored response without re-executing.

POST /payments
Idempotency-Key: 7b3c9a2e-...

→ First call: process payment, store result
→ Second call (retry): return stored result, do not charge again

Critical for payment APIs, order creation, any operation the client might retry on timeout.
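
The server side of the idempotency-key flow can be sketched with a dict standing in for a durable response store (in production the check-and-store must be atomic, and keys are scoped per endpoint):

```python
# Sketch of server-side idempotency-key handling.
responses = {}   # idempotency_key -> stored response
charges = []     # the side effect we must not repeat

def handle_payment(idempotency_key, amount):
    if idempotency_key in responses:
        return responses[idempotency_key]       # replay the stored response
    charges.append(amount)                      # the side effect runs exactly once
    response = {"status": "charged", "amount": amount}
    responses[idempotency_key] = response
    return response

first = handle_payment("7b3c9a2e", 100)
retry = handle_payment("7b3c9a2e", 100)        # client retried on timeout
print(retry == first, len(charges))  # True 1
```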


19. What is a saga pattern?

Saga manages distributed transactions across multiple services without a two-phase commit (which requires all services to lock and coordinate, killing performance and coupling).

A saga is a sequence of local transactions. Each service performs its transaction and publishes an event. If a step fails, compensating transactions roll back previous steps.

Two implementations:

  • Choreography: Services emit events; other services react. Decoupled but hard to trace.
  • Orchestration: A central coordinator sends commands to each service. Easier to trace and reason about; the coordinator is a potential bottleneck.

Use saga when you need "eventual transaction" semantics across services — order placement (reserve inventory → charge payment → confirm order).
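
An orchestration-style saga can be sketched by pairing each step with its compensating action; here the payment step is rigged to fail so the rollback path runs:

```python
# Orchestration-style saga sketch: each step has a compensating action
# that undoes it if a later step fails.
log = []

def reserve_inventory(order):  log.append("inventory reserved")
def release_inventory(order):  log.append("inventory released")
def charge_payment(order):     raise RuntimeError("card declined")
def refund_payment(order):     log.append("payment refunded")
def confirm_order(order):      log.append("order confirmed")

SAGA = [
    (reserve_inventory, release_inventory),
    (charge_payment, refund_payment),
    (confirm_order, None),
]

def run_saga(order):
    done = []
    for step, compensate in SAGA:
        try:
            step(order)
            done.append(compensate)
        except Exception:
            # Roll back completed steps in reverse order
            for comp in reversed(done):
                if comp:
                    comp(order)
            return "rolled back"
    return "committed"

print(run_saga({"id": 1}))  # rolled back
print(log)  # ['inventory reserved', 'inventory released']
```

Note that only completed steps are compensated: the payment never succeeded, so no refund is issued.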


20. How does Kafka differ from a traditional message queue?

Traditional queues (SQS, RabbitMQ): messages are consumed and deleted. Each consumer gets each message once. Good for task queues.

Kafka is a distributed log: messages are persisted for a configurable retention period. Multiple consumer groups can read the same messages independently. Consumers track their own offset — they can re-read messages. Good for: event sourcing, audit trails, multiple consumers reading the same stream, replaying events after a bug.

Key concepts: topics (channels), partitions (parallelism unit), consumer groups (independent sets of consumers), offsets (position in the log).


Advanced (Senior → Architect)

21. How would you design a global distributed system with users in multiple regions?

Active-active multi-region: each region has its own stack (API + cache + DB). Use geo-DNS to route users to nearest region. Replicate data across regions asynchronously (accept eventual consistency for cross-region reads). For writes that must be globally consistent (balance, inventory), route to a "home" region or use a globally distributed DB (CockroachDB, Spanner).

Key decisions: conflict resolution strategy for concurrent writes to same data from different regions; cross-region latency for synchronous calls (usually > 100ms — avoid synchronous cross-region calls on the hot path).


22. How would you design the Twitter/X home feed?

Two approaches:

Fan-out on write (push): When a user tweets, immediately write to the feed of all followers. Feed reads are fast (just read pre-computed feed). Problem: celebrities with 50M followers — one tweet = 50M writes.

Fan-out on read (pull): When a user opens their feed, fetch tweets from all accounts they follow and merge. No write amplification, but feed reads are slow and expensive at scale.

Hybrid (what Twitter actually does): Fan-out on write for most users; pull for celebrities. When loading a feed, merge the pre-computed feed with recent posts from followed celebrities.
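
The hybrid read path is essentially a merge of two sorted streams; a sketch with `heapq.merge` and hypothetical timestamped posts:

```python
import heapq

# Hybrid feed read (a sketch): merge the precomputed feed with recent
# posts from followed celebrities, newest first. Tuples are (timestamp, post).
precomputed_feed = [(105, "friend: lunch pic"), (101, "friend: hello")]
celebrity_posts = [(110, "celebrity: tour dates"), (90, "celebrity: old post")]

def load_feed(precomputed, celebrity, limit=3):
    merged = heapq.merge(
        sorted(precomputed, reverse=True),   # both inputs sorted descending
        sorted(celebrity, reverse=True),
        reverse=True,
    )
    return [post for _, post in list(merged)[:limit]]

print(load_feed(precomputed_feed, celebrity_posts))
# ['celebrity: tour dates', 'friend: lunch pic', 'friend: hello']
```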


23. What is the difference between CQRS and event sourcing?

CQRS (Command Query Responsibility Segregation): Separate the write model (commands that mutate state) from the read model (queries that return data). Write model optimised for consistency; read model optimised for query performance (denormalised, projected views). They can use different stores.

Event sourcing: Instead of storing current state, store the sequence of events that led to the state. Current state is derived by replaying events. Provides a complete audit trail and enables temporal queries ("what was the state on Tuesday?").

They're often used together but are independent concepts. You can have CQRS without event sourcing (just separate read and write stores). Event sourcing without CQRS is unusual but possible.
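
Deriving state by replaying events is just a fold over the event list; a minimal sketch with a hypothetical account aggregate:

```python
# Event-sourcing sketch: current state is derived by folding events in order.
events = [
    {"type": "Deposited", "amount": 100},
    {"type": "Withdrawn", "amount": 30},
    {"type": "Deposited", "amount": 5},
]

def apply(balance, event):
    if event["type"] == "Deposited":
        return balance + event["amount"]
    if event["type"] == "Withdrawn":
        return balance - event["amount"]
    return balance  # unknown event types are ignored

def replay(events):
    balance = 0
    for e in events:
        balance = apply(balance, e)
    return balance

print(replay(events))      # 75  (current state)
print(replay(events[:2]))  # 70  ("what was the state after event 2?" -- a temporal query)
```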


24. How do you prevent thundering herd in a cache?

Thundering herd: when a popular cache key expires, thousands of requests simultaneously miss the cache and all hit the DB.

Solutions:

  • Mutex/lock: The first miss acquires a lock and fetches from the DB; other misses wait and read the freshly cached value once the lock releases.
  • Probabilistic early expiry: Before the TTL expires, occasionally re-fetch the value proactively with some probability that increases as expiry approaches.
  • Background refresh: A background job refreshes hot keys before they expire. Application always reads from cache.
  • Jittered TTL: Add random jitter to TTLs (e.g., 3600 ± 300 seconds) so not all related keys expire simultaneously.
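
Two of these mitigations in sketch form: a jittered TTL and a mutex on miss (a single in-process lock here; across multiple app servers you'd use a distributed lock):

```python
import random
import threading

def jittered_ttl(base_seconds=3600, jitter=300):
    """Spread expiries so related keys don't all expire at once."""
    return base_seconds + random.uniform(-jitter, jitter)

# Mutex on miss: only one caller fetches; the rest wait for the lock.
_lock = threading.Lock()
cache = {}

def get_with_lock(key, load_fn):
    if key in cache:
        return cache[key]
    with _lock:
        if key in cache:           # double-check: another thread may have filled it
            return cache[key]
        cache[key] = load_fn(key)  # exactly one DB fetch per miss
        return cache[key]

calls = []
value = get_with_lock("hot", lambda k: calls.append(k) or "data")
print(value, len(calls))  # data 1
get_with_lock("hot", lambda k: calls.append(k) or "data")
print(len(calls))         # still 1: served from cache
```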

25. How do you handle distributed tracing across microservices?

Distributed tracing follows a request across service boundaries. Each request gets a unique trace ID. Each service adds a span (start time, end time, service name, parent span). Spans are collected by a tracing backend (Jaeger, Zipkin, AWS X-Ray).

Implementation in .NET: OpenTelemetry.Extensions.Hosting + OpenTelemetry.Instrumentation.AspNetCore. Trace IDs are propagated via HTTP headers (traceparent) automatically. Instrument DB calls, queue publishes, and outbound HTTP calls as child spans.

Key benefit: identify where latency is introduced across services — is it the API, the DB call, or the downstream service?


26. How would you implement distributed locking?

Use Redis SETNX (set if not exists) with a TTL for the lock:

SET lock:resource_X unique_token NX PX 30000

The token is unique per lock holder — on release, check the token matches before deleting (prevents releasing another holder's lock). The TTL ensures the lock releases if the holder crashes.

For more complex cases (lock renewal, fencing tokens to prevent stale holders from acting), use Redlock (which runs across N independent Redis instances for fault tolerance) or a purpose-built coordination system like ZooKeeper or etcd.
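
The token-checked acquire/release semantics can be simulated in pure Python (a dict stands in for Redis and TTL expiry is omitted; in real Redis the check-and-delete on release must run as a Lua script so it is atomic):

```python
import uuid

# Simulation of SET ... NX acquire and token-checked release.
locks = {}

def acquire(resource):
    token = str(uuid.uuid4())
    if resource not in locks:          # SET key token NX
        locks[resource] = token
        return token
    return None                        # someone else holds the lock

def release(resource, token):
    # Delete only if the token matches, so a holder whose lock already
    # expired and was re-acquired cannot delete the new holder's lock.
    if locks.get(resource) == token:
        del locks[resource]
        return True
    return False

t1 = acquire("resource_X")
assert acquire("resource_X") is None            # second holder is rejected
assert release("resource_X", "wrong") is False  # can't release someone else's lock
assert release("resource_X", t1) is True
```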


27. What are the main patterns for service-to-service communication in microservices?

Synchronous: REST (HTTP) or gRPC. Easy to implement, immediate response, but creates temporal coupling — both services must be up simultaneously. Use for: queries that need an immediate response.

Asynchronous: Message queues (Kafka, SQS). Producer and consumer are decoupled in time. Producer doesn't wait. Use for: commands where the caller doesn't need an immediate result, notifications, fan-out.

Service mesh (Istio, Linkerd): Infrastructure-level handling of mutual TLS, retries, circuit breaking, observability — transparent to service code. Useful when you have 50+ services and don't want to implement retry/circuit-breaker logic in each.


28. How do you handle schema evolution in event-driven systems?

Consumers that read old events must not break when producers add new fields. And producers that write new events must not break consumers that haven't deployed yet.

Rules for safe schema evolution:

  • Add fields as optional — old consumers ignore unknown fields (forward compatible)
  • Never remove fields — old consumers that depend on the field break
  • Never change field types — a consumer expecting a string that now gets an int breaks

Enforce this with a schema registry (Confluent Schema Registry for Kafka). Producers register schemas; schemas must pass compatibility checks before a message can be published.
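
The forward-compatibility rule looks like this from the consumer's side, sketched with plain JSON rather than a schema registry:

```python
import json

# Forward compatibility sketch: the consumer reads only the fields it
# knows about and ignores anything new the producer added.
OLD_EVENT = '{"order_id": 1, "total": 40}'
NEW_EVENT = '{"order_id": 2, "total": 55, "currency": "EUR"}'  # field added later

def consume(raw):
    event = json.loads(raw)
    # dict access by known field names ignores unknown keys;
    # event.get("field", default) also tolerates optional fields that are absent
    return {"order_id": event["order_id"], "total": event["total"]}

print(consume(OLD_EVENT))  # {'order_id': 1, 'total': 40}
print(consume(NEW_EVENT))  # {'order_id': 2, 'total': 55} -- 'currency' ignored
```

The same consumer handles both event versions, which is exactly what lets producers deploy ahead of consumers.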


29. What is blue-green deployment vs canary deployment?

Blue-green: Run two identical environments (blue = current, green = new). Switch traffic all at once from blue to green. Easy rollback — just switch back. Requires double the infrastructure during the switch.

Canary: Route a small percentage of traffic (1%, 5%) to the new version. Gradually increase as confidence grows. Monitor error rates, latency. Roll back by routing 0% to the canary. Less infrastructure waste than blue-green; slower to complete a full rollout.

Use canary when: the change is risky, you have good observability, and you can tolerate some users getting the new version first. Use blue-green when: you need instant rollback capability and the change is all-or-nothing.


30. How would you design an audit log that can't be tampered with?

Write audit events to an append-only store — never update or delete. Options: Kafka (immutable log, configurable retention), a separate audit_log table with no UPDATE/DELETE grants, or AWS CloudTrail.

For tamper-evidence: hash chaining. Each log entry includes the hash of the previous entry. Any modification or deletion breaks the chain and is detectable.

For compliance (SOC 2, HIPAA): store audit logs in a separate account/region from application data. Restrict delete permissions to a break-glass role. Retain for the required period (often 7 years for financial records).
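
The hash chaining described above can be sketched with `hashlib` (SHA-256 over each entry plus the previous entry's hash):

```python
import hashlib
import json

# Hash-chained audit log sketch: each entry commits to the previous
# entry's hash, so any tampering breaks the chain.

def entry_hash(entry):
    return hashlib.sha256(json.dumps(entry, sort_keys=True).encode()).hexdigest()

def append(log, event):
    prev = log[-1]["hash"] if log else "0" * 64
    entry = {"event": event, "prev_hash": prev}
    entry["hash"] = entry_hash({"event": event, "prev_hash": prev})
    log.append(entry)

def verify(log):
    prev = "0" * 64
    for entry in log:
        if entry["prev_hash"] != prev:
            return False
        if entry["hash"] != entry_hash({"event": entry["event"], "prev_hash": prev}):
            return False
        prev = entry["hash"]
    return True

audit = []
append(audit, "user 1 logged in")
append(audit, "user 1 deleted record 7")
print(verify(audit))                    # True
audit[0]["event"] = "nothing happened"  # tamper with history
print(verify(audit))                    # False
```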


Classic Case Studies (Quick Reference)

31. Design a web crawler. Seeds queue → crawler workers fetch pages → extract links → deduplicate (Bloom filter or distributed set) → add new URLs to queue → store content in object storage. Scale: thousands of crawler workers, distributed queue (Kafka), robots.txt compliance, politeness delay per domain.

32. Design a video streaming platform (YouTube). Upload path: video → transcoding workers (multiple resolutions/formats) → CDN origin. Watch path: CDN edge → origin. Metadata (title, views) in PostgreSQL. Video segments in object storage (S3). Recommendation engine runs on separate analytics pipeline. Key insight: video bytes ≠ metadata — separate them.

33. Design an autocomplete/typeahead service. Trie in memory for top-N completions per prefix. Precompute top suggestions per prefix (top 5 by search frequency) and store in Redis. On each keystroke, Redis lookup by prefix. Update the trie/cache periodically from search logs (not in real-time — eventual is fine). Use prefix sharding for scale.

34. Design a distributed key-value store (like DynamoDB). Consistent hashing to distribute keys across nodes. Virtual nodes for even distribution. Replication factor N (typically 3) — write to N nodes. Quorum reads/writes (R + W > N for strong consistency). Vector clocks for conflict resolution. Anti-entropy with Merkle trees for replica synchronisation.

35. Design a ride-hailing dispatch system (Uber). Driver location: drivers send GPS updates every 4 seconds → geospatial store (Redis GEOADD). Rider request → find nearby drivers (GEORADIUS) → match algorithm → assign. Real-time communication over WebSocket. Trip events via Kafka. Key challenge: location data at 1M drivers updating every 4s = 250,000 writes/second — Redis Cluster handles this.


Rapid-Fire (Last 5 Minutes of Interview)

36. How does DNS work? The client asks a recursive resolver, which queries a root nameserver → TLD nameserver → authoritative nameserver and returns the final IP to the client. Responses are cached at every level according to their TTL.

37. What is connection pooling? Reuse existing DB connections instead of opening a new one per request. Opening a connection is expensive (TCP handshake, auth). Pool maintains N open connections; requests borrow and return.

38. What is a reverse proxy? Server that sits in front of backend servers and forwards client requests. Provides: load balancing, SSL termination, caching, rate limiting. Nginx and Cloudflare are common reverse proxies.

39. What is two-phase commit? Protocol for atomic transactions across multiple nodes. Phase 1 (prepare): coordinator asks all nodes if they can commit. Phase 2 (commit): if all say yes, coordinator tells all to commit. Problem: blocking if coordinator crashes between phases. Rarely used in microservices — prefer saga instead.

40. What is the difference between a process and a thread? A process has its own memory space; threads share memory within a process. Thread switches are cheaper than process switches (no memory context switch). In modern cloud-native apps, the relevant analogy is: containers ≈ processes (isolated), goroutines/async tasks ≈ threads (shared within a process).