System Design Mental Models: How Real Systems Actually Work

Scaling, caching, databases, queues, consistency — every system design concept explained through real-world analogies. Stop memorising patterns. Start understanding why they exist.

SystemForge · April 20, 2026 · 15 min read

System Design · Distributed Systems · Scaling · Caching · Databases · Interview Prep · Architecture · CAP Theorem · Message Queues
Share:š•

Why Mental Models Beat Memorisation

System design interviews test whether you understand trade-offs — not whether you can recite architecture patterns. An interviewer asking "design a ride-sharing system" does not want you to list Uber's actual tech stack. They want to see you reason about scale, failure modes, and trade-offs in real time.

Mental models let you derive the right answer, rather than recall a memorised one.


1. Vertical vs Horizontal Scaling — "One Super-Chef vs Many Chefs"

Vertical Scaling

Make the single server more powerful. More RAM, faster CPUs, bigger disks.

Analogy: A single chef working faster. Buy them better knives, train them more, give them a bigger kitchen. There is a limit — a person can only work so fast, no matter how good the tools.

When it works: Databases that are hard to shard (PostgreSQL under moderate load), admin panels, batch jobs.

The ceiling problem: The largest single server available from a major cloud provider tops out around 24TB of RAM and 448 vCPUs. Once you hit that ceiling, you cannot scale further — and long before then, each increment in capacity gets disproportionately expensive.

Horizontal Scaling

Add more servers. Distribute the load.

Analogy: Hire more chefs. Each handles a subset of orders. Unlimited theoretical scale — but now you need a maître d' (load balancer) to direct customers, and all chefs must have access to the same recipe book (shared state: database, cache, session store).

The statelessness requirement: If Chef A memorises a customer's dietary preferences in his head, and the customer comes back but is seated at Chef B's station — Chef B has no idea. You must store state externally (Redis for sessions, database for everything else) so any server can handle any request.
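A minimal sketch of what "store state externally" looks like, assuming a Redis-backed session store via redis-py — the host name and TTL are illustrative, not any particular company's setup:

```python
import json
import redis

# Shared session store — every application server points at the same Redis.
r = redis.Redis(host="sessions.internal", port=6379)

SESSION_TTL_SECONDS = 3600  # evict idle sessions after an hour (illustrative)

def save_session(session_id: str, data: dict) -> None:
    # Any server can write the session; none keeps it in local memory.
    r.setex(f"session:{session_id}", SESSION_TTL_SECONDS, json.dumps(data))

def load_session(session_id: str) -> dict | None:
    # Any server can read it back, so the load balancer can route freely.
    raw = r.get(f"session:{session_id}")
    return json.loads(raw) if raw else None
```

Because no server holds the "dietary preferences" in its own head, Chef B can pick up exactly where Chef A left off.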

Real Example: Netflix

Netflix serves approximately 15% of all North American internet traffic during peak hours (8–11pm EST). No single server could handle this. They run thousands of identical, stateless instances behind load balancers. Each request is self-contained: authentication via JWT (no session lookup), data from shared databases (DynamoDB, Cassandra). Adding capacity is adding another identical container.


2. Caching — "The Sticky Note on Your Monitor"

The Core Idea

Fetching data from a database is slow (milliseconds). Reading from memory is fast (microseconds). Cache frequently-read, rarely-changing data in memory so you do not hit the database on every request.

The Analogy

A reference manual sits on a bookshelf across the room. Every time you need a formula, you walk over, find the page, and walk back — 30 seconds each time.

You write the three formulas you use constantly on a sticky note on your monitor. Now it is instant.

The sticky note is the cache. The bookshelf is the database. The trade-off: the sticky note might be out of date if the book was revised.

Cache Invalidation — "When Does the Sticky Note Go Stale?"

This is the hardest problem in caching. Options:

| Strategy | Analogy | Use when |
|---|---|---|
| TTL (time-to-live) | Sticky note expires after 1 hour | Data changes infrequently, some staleness acceptable |
| Write-through | Update sticky note every time you update the book | Read-heavy, low write volume |
| Cache-aside | Check sticky note first; if missing, read book and write new note | General purpose — most common pattern |
| Event-driven invalidation | Someone texts you "the book was updated, throw away your note" | When you need freshness and have a message bus |
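Since cache-aside is the most common pattern, here is a minimal sketch of it in Python with redis-py — `get_user_from_db` and the one-hour TTL are illustrative assumptions:

```python
import json
import redis

cache = redis.Redis()

def get_user(user_id: int) -> dict:
    key = f"user:{user_id}"
    cached = cache.get(key)           # 1. check the sticky note
    if cached:
        return json.loads(cached)     # cache hit — no trip to the bookshelf
    user = get_user_from_db(user_id)  # 2. cache miss — walk to the bookshelf
    cache.setex(key, 3600, json.dumps(user))  # 3. write a fresh note (1h TTL)
    return user
```

Note the built-in TTL: even if invalidation never fires, the sticky note throws itself away after an hour.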

Real Example: Reddit's Front Page

Reddit's front page was historically one of the most-hit database queries on the internet. Every visitor triggered a complex ranking query across millions of posts. Solution: pre-compute the front page every 30 seconds and cache it. 99.9% of requests hit the cache. The database sees only the 0.1% of requests that miss (cache cold start or expired TTL).


3. Database Sharding — "The Filing Cabinet Problem"

The Concept

A single database server can only hold so much data and handle so many queries. Sharding splits the data across multiple database servers based on a partition key.

The Analogy

A law firm has 10 years of client files in a single filing cabinet. It is full and staff are queuing to access it.

Vertical approach: Replace the cabinet with a larger one. Still a single point of contention.

Sharding: Buy 26 smaller cabinets labelled A–Z. Files for clients whose surname starts with A go in cabinet A, B in cabinet B, and so on. Any query instantly knows which cabinet to open. No queue.

The Hot Partition Problem

What if 80% of your clients have surnames starting with S (a common initial in some regions)? Cabinet S is overloaded while most of the others sit near-empty.

Solution: Hash-based sharding. Instead of sharding on the first letter (non-uniform), take hash(client_id) % number_of_shards. The hash function distributes values uniformly regardless of the actual value.
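A small sketch of that routing rule. Python's built-in `hash()` is not stable across processes, so this uses MD5 as a deterministic stand-in — the shard count is illustrative:

```python
import hashlib

NUM_SHARDS = 4

def shard_for(client_id: str) -> int:
    # Deterministic hash, then modulo — uniform regardless of the raw value.
    digest = hashlib.md5(client_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Surnames clustered on "S" still spread across shards:
for name in ["Smith", "Singh", "Sanchez", "Suzuki"]:
    print(name, "->", shard_for(name))
```

The letter-cabinet scheme keys on a non-uniform value (the initial); the hash keys on a uniform one.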

Real Example: Instagram's User Data

Instagram sharded user data early in their growth. Their sharding key was user_id. A consistent hash function mapped each user to one of 512 logical shards. When they needed more capacity, they split shards — adding capacity without changing the application's sharding logic.


4. The Read Replica Pattern — "The Photocopy Room"

The Analogy

Your company has one master copy of all contracts (primary database). Every time a lawyer wants to reference a contract, they must go to the vault. The vault is a bottleneck.

Solution: make photocopies of all contracts and put them in a reading room (read replicas). Lawyers read from the reading room. Only changes go to the vault. The vault's load drops dramatically.

Trade-off: Replication Lag

Photocopies are made every few minutes. If a contract was just amended, the reading room copy might be minutes old.

For reads where slight staleness is acceptable (reporting dashboards, analytics, product catalogue pages) — read replicas are ideal. For reads that must be immediately consistent after a write (bank balance after a transfer, inventory after a purchase) — always read from the primary.
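That routing decision can live in a tiny abstraction. A sketch under the photocopy-room model — connection objects and the `must_be_fresh` flag are illustrative, not a specific library's API:

```python
import random

class Router:
    """Routes queries between the vault (primary) and the reading room (replicas)."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas

    def for_write(self):
        return self.primary  # all changes go to the vault

    def for_read(self, must_be_fresh: bool = False):
        if must_be_fresh or not self.replicas:
            return self.primary  # e.g. a balance check right after a transfer
        return random.choice(self.replicas)  # staleness-tolerant reads
```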

Real Example: GitHub

GitHub uses MySQL with read replicas extensively. Code browsing, issue lists, and pull request views — all go to read replicas. Creating a pull request, merging code — these go to the primary. The read replicas handle the majority of GitHub's enormous read traffic, leaving the primary focused on writes.


5. Event-Driven Architecture — "The News Broadcast Model"

The Concept

Instead of Service A calling Service B directly when something happens, Service A publishes an event ("something happened"). Any interested service subscribes and reacts. Services have no knowledge of each other.

The Analogy

Direct calls (tight coupling): A mayor announces a road closure by personally phoning every affected resident. If one resident's phone is off, they never hear. The mayor must know every resident's number. If the city grows, the mayor's job grows linearly.

Event-driven (loose coupling): The mayor broadcasts on the local radio station: "Road closure on Main Street." Everyone who is interested and listening hears it. New residents automatically receive news by turning on their radio. The mayor does not need a contact list.

The Trade-off

Benefits:

  • Services are decoupled — can be developed and deployed independently
  • Adding a new consumer requires no changes to the producer
  • A slow consumer does not slow down the producer

Drawbacks:

  • Harder to trace a full transaction across services
  • Eventually consistent — consumers process events at their own pace
  • You must handle duplicate events (at-least-once delivery is typical)
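An in-memory sketch of the broadcast model — a real system would put a broker (Kafka, RabbitMQ, SNS) between producer and consumers, but the decoupling is the same. Topic and field names are illustrative:

```python
from collections import defaultdict
from typing import Callable

# topic -> list of handlers; the producer never sees this list's contents
subscribers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

def subscribe(topic: str, handler: Callable[[dict], None]) -> None:
    subscribers[topic].append(handler)  # a resident turns on the radio

def publish(topic: str, event: dict) -> None:
    for handler in subscribers[topic]:  # the mayor broadcasts once
        handler(event)

# Adding a new consumer touches only the consumer, never the producer:
subscribe("TripRequested", lambda e: print("matching driver for", e["trip_id"]))
subscribe("TripRequested", lambda e: print("locking price for", e["trip_id"]))
publish("TripRequested", {"trip_id": 42})
```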

Real Example: Uber's Trip Events

When you book a ride, Uber publishes a TripRequested event. The driver matching service consumes it to find a nearby driver. The pricing service consumes it to lock the price estimate. The notification service consumes it to send you "We're finding a driver." All three react independently. Adding a new feature (e.g., an ML fraud detection service) requires only subscribing to the event — no changes to the booking service.


6. The Outbox Pattern — "Sending the Letter After You Update the Ledger"

The Problem

A service must both update its database AND send a message to a queue. What if it writes to the database successfully but then crashes before sending the message? Or sends the message but then the database write fails? You have inconsistency.

The Analogy

A Victorian-era merchant writes a letter to a client AND updates the ledger. If he posts the letter but then his pen runs out before updating the ledger — the ledger is wrong. If he updates the ledger but forgets to post the letter — the client never knows.

The outbox solution: The merchant writes BOTH the ledger entry AND the letter in the same book simultaneously. A separate postboy checks the book periodically, finds unsent letters, posts them, and marks them sent. The two actions are now atomic — both succeed or neither does.

In Software

Transaction (atomic):
  1. UPDATE orders SET status = 'confirmed'
  2. INSERT INTO outbox (event_type, payload) VALUES ('OrderConfirmed', {...})

Separate background worker:
  - Poll outbox for unprocessed rows
  - Publish to message queue
  - Mark row as processed

If the service crashes after step 1 but before the worker runs, the outbox row is still there. The worker will publish the event when the service recovers. This guarantees at-least-once publishing — if the worker crashes between publishing and marking the row, the event can be delivered twice, so consumers must deduplicate, but no event is ever lost.
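A runnable sketch of both halves, using SQLite for brevity — the table layout and the injected `publish` callable are illustrative assumptions:

```python
import json
import sqlite3

db = sqlite3.connect("shop.db")

def confirm_order(order_id: int) -> None:
    with db:  # one atomic transaction: both writes commit or neither does
        db.execute("UPDATE orders SET status = 'confirmed' WHERE id = ?",
                   (order_id,))
        db.execute(
            "INSERT INTO outbox (event_type, payload, processed) VALUES (?, ?, 0)",
            ("OrderConfirmed", json.dumps({"order_id": order_id})),
        )

def relay_outbox(publish) -> None:
    # The background worker (the postboy): publish unsent rows, then mark them.
    rows = db.execute(
        "SELECT id, event_type, payload FROM outbox WHERE processed = 0")
    for row_id, event_type, payload in rows.fetchall():
        publish(event_type, payload)  # may repeat after a crash: at-least-once
        with db:
            db.execute("UPDATE outbox SET processed = 1 WHERE id = ?", (row_id,))
```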

Real Example: Healthcare Booking Systems

Booking a patient appointment requires: updating the appointment database AND notifying the EHR system AND sending the patient a confirmation SMS. Without the Outbox pattern, a crash between any two of these leaves the system in an inconsistent state. The Outbox ensures all three happen, eventually — even if the service restarts mid-process.


7. Rate Limiting — "The Nightclub Bouncer"

The Analogy

A nightclub has a capacity of 500 people. The bouncer lets in 10 people per minute regardless of the queue size. If 1,000 people arrive at once, they wait in line. Nobody is turned away permanently — they are just throttled.

This protects the club (your API) from being overwhelmed while still serving everyone who is patient.

The Token Bucket Algorithm

Imagine each user has a bucket that fills with tokens at a fixed rate (e.g., 10 tokens per second). Each API request costs 1 token. If the bucket is empty, the request is rejected (HTTP 429).

A user who bursts occasionally has a full bucket to draw on, so short spikes succeed. A user making sustained high-volume requests drains the bucket faster than it refills and gets throttled.

Why not just a counter? A counter that resets every minute allows a burst of all 600 requests in the first second, then silence for 59 seconds. A token bucket caps bursts at the bucket's capacity and caps the sustained rate at the refill rate — no burst exploitation.
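A minimal token bucket in Python — the capacity and refill numbers below match the example above, and a real rate limiter would keep this state per user (typically in Redis):

```python
import time

class TokenBucket:
    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity        # max burst size
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller responds with HTTP 429

bucket = TokenBucket(capacity=10, refill_rate=10)  # 10 req/s, bursts of 10
```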

Real Example: GitHub API

GitHub limits unauthenticated requests to 60/hour. Authenticated requests: 5,000/hour. Enterprise GitHub apps: 15,000/hour. The limit is per-token, not per-IP, which means a shared corporate network does not penalise all developers if one person writes a bad script.


8. Database Indexing — "The Book Index vs Reading Every Page"

The Analogy

You need to find every mention of "TCP/IP" in a 1,000-page networking textbook.

Without an index: Read every page from cover to cover. Correct, but slow — O(n).

With an index: Turn to the back. Find "TCP/IP" in the sorted index. It says "pages 47, 203, 418, 891." Jump directly there. O(log n).

A database index is the same concept. A B-tree index on email in a users table lets the database jump directly to the matching row instead of scanning all 50 million rows.
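The same jump-versus-scan behaviour in miniature, using a sorted list and binary search as a stand-in for the B-tree — the data and sizes are illustrative:

```python
import bisect

# A "table column" kept in sorted order — the index.
emails = sorted(f"user{i}@example.com" for i in range(100_000))

def full_scan(target: str) -> bool:
    # O(n): read every page, cover to cover.
    return any(e == target for e in emails)

def index_lookup(target: str) -> bool:
    # O(log n): jump straight to the right spot via binary search.
    i = bisect.bisect_left(emails, target)
    return i < len(emails) and emails[i] == target
```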

The Trade-off

Indexes speed up reads but slow down writes. Every INSERT or UPDATE must update the index as well as the data.

For a table that is mostly read (product catalogue, user profiles) — index heavily. For a table that is mostly written (event logs, sensor telemetry) — index sparingly. Adding an index to an append-only log table is like maintaining a 1,000-page index for a newspaper that prints a new edition every minute.

Covering Indexes

A covering index stores the values you are querying directly in the index leaf nodes — no need to access the table at all.

Analogy: Instead of the index saying "page 47," it writes the full paragraph on TCP/IP inline. You never need to open the book.

```sql
-- Index covers both the WHERE and the SELECT — zero table access
CREATE INDEX idx_orders_cover ON orders (tenant_id, status)
  INCLUDE (created_at, total_amount);
```

9. CDN — "The Local Warehouse Network"

The Concept

A Content Delivery Network stores copies of static assets (images, CSS, JavaScript, videos) in data centres distributed globally. Users fetch assets from the nearest data centre — not from your origin server in Frankfurt.

The Analogy

Amazon's logistics network. When you order a popular product, Amazon does not ship it from a single central warehouse in Germany to every customer in Europe. They pre-position inventory in regional fulfilment centres — Oslo, Madrid, Warsaw, London. Your order ships from the centre nearest you. Delivery is next-day instead of a week.

CDN = the regional fulfilment centre. Your origin server = the central warehouse. Static assets = pre-positioned inventory.

Real Example: Cloudflare

Cloudflare has 250+ data centres globally. A user in Lagos requesting a website hosted in Dublin gets JavaScript and images served from Cloudflare's Lagos node — milliseconds away. Without a CDN, a round trip from Lagos to the Dublin origin and back takes 300–400ms. With the CDN, the asset is served from Lagos in around 5ms. For a page with 50 assets, the difference is seconds vs milliseconds.


10. Consistent Hashing — "Expanding the Restaurant Without Reassigning All Tables"

The Problem

You have 4 cache servers. You assign requests using hash(key) % 4. If you add a 5th server, the modulus changes. hash(key) % 4 → hash(key) % 5. Almost every key now maps to a different server. Your entire cache is invalidated. Every request hits the origin database. At scale, this causes a cascade failure.

Consistent Hashing Solution

Analogy: Arrange the cache servers around a clock face. When a request comes in, hash it to a point on the clock. Walk clockwise from that point until you hit a server — that server owns the request.

When you add a new server, it takes a position on the clock. It only takes responsibility for requests between its predecessor and itself — a small fraction of total traffic. Removing a server passes only its portion to its successor.

Result: Adding or removing a server invalidates only a small percentage of cache entries, not all of them.
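A minimal hash ring in Python: servers hash to points on the circle, and a key walks clockwise to the first server at or after its own point. Production rings (including Cassandra's) add many virtual nodes per server to balance load, omitted here for brevity:

```python
import bisect
import hashlib

def point(value: str) -> int:
    # Deterministic position on the "clock face".
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes):
        self.ring = sorted((point(n), n) for n in nodes)

    def add(self, node: str) -> None:
        bisect.insort(self.ring, (point(node), node))

    def node_for(self, key: str) -> str:
        p = point(key)
        i = bisect.bisect(self.ring, (p, ""))    # first node clockwise of p
        return self.ring[i % len(self.ring)][1]  # wrap around the clock face

ring = HashRing(["cache-a", "cache-b", "cache-c", "cache-d"])
owner = ring.node_for("user:42")
ring.add("cache-e")  # only keys in cache-e's new segment change owner
```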

Real Use: Amazon DynamoDB, Apache Cassandra

DynamoDB uses consistent hashing internally to distribute data across storage nodes. When the system adds capacity (more nodes), only the data that falls in the new node's ring segment needs to migrate. The rest of the data remains in place — no full reshuffle.


The Decision Framework

When you are designing a system, walk through these questions in order:

1. What is the scale?
   < 1M users → single server + PostgreSQL may be enough
   > 10M users → need to plan for horizontal scaling

2. What is the read/write ratio?
   Read-heavy (10:1+) → add read replicas and caching
   Write-heavy → focus on write throughput, avoid indexes on hot tables

3. Where is the consistency requirement?
   Money, inventory, medical records → strong consistency (relational DB, ACID)
   Analytics, social feeds, product catalogue → eventual consistency acceptable

4. What are the failure modes?
   What happens if one service crashes?
   What happens if the network splits?
   Add message queues and outbox where you cannot afford to lose events

5. What is the latency requirement?
   < 100ms → cache aggressively, avoid cross-region calls
   < 10ms → in-memory data structures, CDN for static assets

Summary: Every Pattern Has a Problem It Solves

| Pattern | The problem it solves |
|---|---|
| Horizontal scaling | Single server hit its ceiling |
| Caching | Database too slow for repeated reads |
| Read replicas | Too many read queries for one database |
| Sharding | Too much data for one database server |
| Event-driven architecture | Services too tightly coupled, changes cascade |
| Outbox pattern | Database update and message must be atomic |
| Rate limiting | Protect the system from burst abuse |
| CDN | Static assets served from a single distant origin |
| Consistent hashing | Cache invalidation during cluster resize |
| Indexing | Full table scans too slow |

Every senior engineer carries these patterns as intuitions, not memorised facts. When you see a problem, you should immediately think: "this looks like a hot partition problem" or "this needs an outbox." That pattern recognition — not recitation — is what separates senior from mid-level engineers.
