System Design · Lesson 2 of 26
Scalability, Availability & Reliability — The Core Trade-offs
Every system design interview — and every real architectural decision — comes back to the same set of trade-offs. Before you pick a database, choose a queue, or decide whether to shard, you need to understand what you're actually optimizing for.
This article covers the foundational vocabulary and calculations you'll use in every design discussion.
Scalability
Scalability is the system's ability to handle more load without a complete redesign.
There are two dimensions:
Vertical Scaling (Scale Up)
Add more resources to a single machine: bigger CPU, more RAM, faster disk.
Before: After:
┌────────────┐ ┌──────────────────┐
│ 4 CPU │ → │ 32 CPU │
│ 8 GB RAM │ │ 128 GB RAM │
│ 1 TB SSD │ │ 10 TB NVMe │
└────────────┘      └──────────────────┘

Pros: Simple. No application changes. No distributed systems complexity.
Cons: Hard ceiling (you can't infinitely scale one machine). Single point of failure. Expensive at the top end. Requires downtime to upgrade.
Use when: Early stage, relational databases (PostgreSQL scales surprisingly far vertically), or when the cost of distributed complexity isn't justified.
Horizontal Scaling (Scale Out)
Add more machines and distribute load across them.
┌──────────────┐
│ Load Balancer│
└──────┬───────┘
┌─────────┼─────────┐
┌────▼───┐ ┌───▼────┐ ┌──▼─────┐
│ App 1 │ │ App 2 │ │ App 3 │
└────────┘ └────────┘ └────────┘

Pros: Near-unlimited scale. No single point of failure. Can add capacity incrementally.
Cons: Application must be stateless (or state must be externalized). Distributed systems are hard. Network becomes a failure domain.
Key insight: Horizontal scaling requires stateless services. If each request can hit any server, that server can't rely on in-memory state from a previous request. Session data, user state, and cache must live outside the app (Redis, databases).
Stateless vs Stateful Services
This is one of the most important architectural distinctions.
Stateless service: Every request contains all the information needed to handle it. No server-side state between requests.
Request 1 → Server A (handles it completely)
Request 2 → Server B (handles it completely, no knowledge of Request 1)

Stateful service: The server remembers something between requests. A WebSocket connection is stateful. A game server tracking player positions is stateful.
Client ──── persistent connection ──── Server A
        (if Server A dies, the client must reconnect, and state is lost)

Why Stateless Scales Better
A stateless service can have 1 or 100 instances behind a load balancer — they're all identical. You can add or remove instances without coordination.
A stateful service must route the same client to the same server (sticky sessions), making load balancing harder and failover painful.
Design rule: Push state to dedicated stores (Redis, databases). Keep your application layer stateless.
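A minimal sketch of that rule. Here a dict-backed `SessionStore` stands in for Redis; the class, its `get`/`set` interface, and the handler names are illustrative, not a real client library:

```python
# Sketch: a stateless app layer with externalized session state.
# SessionStore is a stand-in for Redis; in production every app
# instance would talk to the same external store.
import uuid

class SessionStore:
    """Dict-backed stand-in for an external store such as Redis."""
    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value

store = SessionStore()  # shared by every app instance

def handle_login(user_id):
    """Any instance can create the session; the state lives in the store."""
    session_id = str(uuid.uuid4())
    store.set(session_id, {"user_id": user_id})
    return session_id

def handle_request(session_id):
    """A different instance can serve the next request, because
    nothing was kept in this process's memory."""
    session = store.get(session_id)
    if session is None:
        raise KeyError("session expired or unknown")
    return session["user_id"]

sid = handle_login("alice")
assert handle_request(sid) == "alice"
```

Because the handlers hold no in-process state, any number of identical instances can sit behind a load balancer and share the store.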
Availability
Availability is the percentage of time a system is operational.
The Nines
| Availability | Downtime per year | Downtime per month | Common name |
|---|---|---|---|
| 99% | 87.6 hours | 7.3 hours | Two nines |
| 99.9% | 8.76 hours | 43.8 minutes | Three nines |
| 99.99% | 52.6 minutes | 4.4 minutes | Four nines |
| 99.999% | 5.26 minutes | 26.3 seconds | Five nines |
Key takeaway: Going from 99.9% to 99.99% means reducing downtime by ~8 hours per year. That last decimal point gets exponentially harder and more expensive.
Most SaaS targets 99.9% (three nines). Financial systems and telecom target 99.99% or higher.
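The table values follow directly from the arithmetic; a quick sketch to reproduce them:

```python
# Downtime budget implied by an availability target.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes(availability_pct, period_minutes=MINUTES_PER_YEAR):
    """Minutes of allowed downtime in the period for a given availability %."""
    return period_minutes * (1 - availability_pct / 100)

for pct in (99.0, 99.9, 99.99, 99.999):
    m = downtime_minutes(pct)
    print(f"{pct}% -> {m:,.1f} min/year ({m / 60:,.2f} hours)")
```

Running this reproduces the per-year column: 99.9% allows 525.6 minutes (8.76 hours), 99.99% allows 52.6 minutes.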
Achieving High Availability
High availability requires redundancy at every layer:
DNS (TTL matters)
│
┌──────────▼──────────┐
│ Load Balancer │ ← Multiple LBs (active/passive)
│ (active/active) │
└──────────┬──────────┘
┌────┴────┐
┌─────▼──┐ ┌───▼────┐
│ App 1 │ │ App 2 │ ← Multiple app instances
└─────┬──┘ └───┬────┘
└────┬───┘
┌──────────▼──────────┐
│ DB Primary │ ← Primary + replica(s)
│ DB Replica │
└─────────────────────┘

Single points of failure (SPOF) kill availability. Find them and eliminate them.
Reliability vs Availability
These terms are related but distinct — and confusing them in an interview hurts.
Availability: Is the system up right now? Reliability: When the system is up, does it behave correctly?
A system can be highly available but unreliable — it's always running, but returns wrong answers 5% of the time.
A system can be reliable but not highly available — when it's running it's perfectly correct, but it has scheduled maintenance windows.
In practice, you want both. But they require different solutions:
- Availability: redundancy, failover, health checks
- Reliability: testing, idempotency, data validation, circuit breakers
Latency vs Throughput
Latency: How long does a single request take? (milliseconds) Throughput: How many requests can the system handle per second? (RPS / QPS)
They are related but not the same. You can have:
- High throughput, high latency: A batch processing system that processes 1M records/second but each record takes 500ms
- Low latency, low throughput: A single-threaded server that responds in 1ms but can only handle 1 request at a time
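The two scenarios above follow from Little's Law (in-flight requests = throughput × latency); rearranged, a system's throughput ceiling is concurrency divided by latency. A minimal sketch (the concurrency figures are illustrative):

```python
# Little's Law rearranged: throughput ceiling = concurrency / latency.

def throughput_ceiling(concurrency, latency_seconds):
    """Maximum sustainable requests/second for a given concurrency
    and per-request latency."""
    return concurrency / latency_seconds

# Single-threaded server, 1 ms per request: low latency, modest throughput.
print(throughput_ceiling(1, 0.001))        # 1,000 RPS at best
# Batch system, 500 ms per record but 500k records in flight:
print(throughput_ceiling(500_000, 0.5))    # 1,000,000 records/second
```

High throughput with high latency just means a lot of work is in flight at once.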
Latency Numbers Every Engineer Should Know
Operation Latency
─────────────────────────────────────────────────
L1 cache reference 0.5 ns
L2 cache reference 7 ns
Mutex lock/unlock 25 ns
Main memory reference 100 ns
SSD random read (4KB) 150 µs
Network round trip (same datacenter) 500 µs
SSD sequential read (1MB) 1 ms
HDD seek 10 ms
Network round trip (US to Europe)        150 ms

Design implication: A database call in the same datacenter costs ~1 ms. Calling a service in another region costs ~150 ms. Design your service topology accordingly.
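A rough latency-budget sketch using the numbers above (all values are approximations, and the sequential-call model is a simplification):

```python
# Back-of-envelope latency budget for sequential backend calls.
SAME_DC_DB_CALL_MS = 1.0     # ~network round trip + query, same datacenter
CROSS_REGION_RTT_MS = 150.0  # US <-> Europe round trip

def budget_ms(db_calls, cross_region_calls):
    """Approximate total latency if the calls run one after another."""
    return db_calls * SAME_DC_DB_CALL_MS + cross_region_calls * CROSS_REGION_RTT_MS

print(budget_ms(db_calls=5, cross_region_calls=0))  # 5.0 ms
print(budget_ms(db_calls=5, cross_region_calls=1))  # 155.0 ms
```

One cross-region hop dominates everything else in the request, which is why topology matters more than query tuning at this scale.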
Read-Heavy vs Write-Heavy Systems
Knowing your system's read/write ratio drives major architectural decisions.
Read-heavy (10:1+ reads to writes):
- Social media feeds, product catalogs, news sites
- Optimize with: caching, read replicas, CDN
- Eventual consistency is usually acceptable
Write-heavy (many writes, fewer reads):
- IoT telemetry, logging, financial transactions
- Optimize with: write buffers, async writes, append-only structures
- Consistency is usually critical
Mixed (roughly equal):
- Chat apps, collaborative documents
- Need to balance both; often partition by feature
Interview tip
When you get a system design question, always ask: "Is this read-heavy or write-heavy?" It determines whether you invest in caching, read replicas, and CDN — or write batching, async queues, and sharding.
Back-of-Envelope Estimation
This is a required skill for system design interviews. The goal is not precision — it's getting within an order of magnitude to justify architectural decisions.
Step 1: Clarify Scale
Always ask: "How many users? What's the expected QPS? How much data?"
If they don't know, estimate from first principles using daily active users (DAU).
Step 2: Estimate QPS
Formula:
QPS = (DAU × requests_per_user_per_day) / 86,400
Peak QPS ≈ QPS × 2–3 (traffic is not uniform)

Example: Twitter-like system
- 200M DAU
- Average user makes 10 read requests and 1 write request per day
- Read QPS = (200M × 10) / 86,400 ≈ 23,000 read QPS
- Write QPS = (200M × 1) / 86,400 ≈ 2,300 write QPS
- Peak read QPS ≈ 70,000 (3× for morning rush)
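The same arithmetic as a small helper, using the numbers from the example:

```python
SECONDS_PER_DAY = 86_400

def qps(dau, requests_per_user_per_day):
    """Average queries per second implied by daily active users."""
    return dau * requests_per_user_per_day / SECONDS_PER_DAY

read_qps = qps(200_000_000, 10)   # ~23,148
write_qps = qps(200_000_000, 1)   # ~2,315
peak_read_qps = read_qps * 3      # ~69,444 -- call it 70,000
print(round(read_qps), round(write_qps), round(peak_read_qps))
```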
Step 3: Estimate Storage
Example: Photo storage (Instagram-like)
- 200M DAU
- 10% of users post 1 photo per day = 20M photos/day
- Average photo size: 2MB (after compression)
- Storage per day: 20M × 2MB = 40TB/day
- Storage per year: 40TB × 365 ≈ 14.6 PB
That tells you: you need distributed object storage (S3-class), not a database.
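The storage arithmetic as a sketch (decimal units throughout, so 1 TB = 10^6 MB):

```python
def yearly_storage_tb(dau, posting_fraction, photos_per_poster, photo_mb):
    """Back-of-envelope yearly storage in TB (decimal units)."""
    photos_per_day = dau * posting_fraction * photos_per_poster
    tb_per_day = photos_per_day * photo_mb / 1_000_000  # MB -> TB
    return tb_per_day * 365

# 200M DAU, 10% post one 2 MB photo per day:
print(yearly_storage_tb(200_000_000, 0.10, 1, 2))  # 14600.0 TB ~= 14.6 PB
```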
Step 4: Estimate Bandwidth
Read bandwidth:
Read bandwidth = Peak read QPS × average response size

Example: Video streaming
- 10M concurrent streams
- Average bitrate: 5 Mbps
- Total bandwidth: 10M × 5 Mbps = 50 Tbps
That's a CDN problem, not a single-server problem.
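The bandwidth arithmetic as a sketch (1 Tbps = 10^6 Mbps in decimal units):

```python
def bandwidth_tbps(concurrent_streams, bitrate_mbps):
    """Aggregate egress bandwidth in Tbps."""
    return concurrent_streams * bitrate_mbps / 1_000_000  # Mbps -> Tbps

# 10M concurrent streams at 5 Mbps each:
print(bandwidth_tbps(10_000_000, 5))  # 50.0 Tbps
```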
Common Estimates to Memorize
1M users × 1 request/day ≈ 12 RPS
1M users × 1 request/hour ≈ 278 RPS
1M users × 1 request/minute ≈ 16,667 RPS
1 KB × 1M = 1 GB
1 KB × 1B = 1 TB
1 MB × 1M = 1 TB
1 MB × 1B = 1 PB
Powers of 2:
2^10 = 1,024 ≈ 1K
2^20 ≈ 1M
2^30 ≈ 1B
2^40 ≈ 1T

Putting It Together: A Design Conversation
When you start a system design, run through this mental checklist:
- What's the scale? (DAU, QPS estimate)
- Read-heavy or write-heavy? (Drives caching vs write optimization)
- What availability do we need? (Determines redundancy investment)
- Stateless or stateful? (Determines horizontal scaling strategy)
- Latency requirements? (Determines whether to use CDN, cache, read replicas)
Get these answers before drawing a single box. They determine everything else.
Key Takeaways
- Vertical scaling is simple but has a ceiling. Horizontal scaling requires stateless services.
- Stateless services scale effortlessly. Push all state to Redis or databases.
- 99.9% availability = 8.76 hours downtime/year. Four nines = 52 minutes.
- Reliability (correctness) is distinct from availability (uptime).
- Latency is per-request time. Throughput is requests per second. Know the difference.
- Back-of-envelope estimation tells you whether you need a cache, a CDN, or distributed storage — before you design anything.
In the next article, we'll cover CAP theorem — the fundamental constraint that governs every distributed system's consistency and availability guarantees.