System Design · Lesson 6 of 26

Database Choices — SQL vs NoSQL, Sharding & Replicas

The database is the heart of most systems. Choose the wrong one, and you'll feel it for years. This article gives you the framework to match database type to workload, understand scaling techniques, and make confident decisions in interviews.


The Database Landscape

There is no "best database" — there are databases that fit certain access patterns well.

┌─────────────────────────────────────────────────────────────────┐
│                         Database Types                          │
├──────────────┬──────────────────────────────────────────────────┤
│ Relational   │ PostgreSQL, MySQL, SQL Server                    │
│ Document     │ MongoDB, CouchDB, Firestore                      │
│ Key-Value    │ Redis, DynamoDB, Riak                            │
│ Column-Family│ Cassandra, HBase, ScyllaDB                       │
│ Graph        │ Neo4j, Amazon Neptune, ArangoDB                  │
│ Time-Series  │ InfluxDB, TimescaleDB, Prometheus                │
│ Search       │ Elasticsearch, OpenSearch, Typesense             │
│ NewSQL       │ CockroachDB, Google Spanner, TiDB                │
└──────────────┴──────────────────────────────────────────────────┘

Relational Databases (SQL)

Best for: Structured data with relationships, ACID transactions, complex queries with joins.

When to use PostgreSQL/MySQL:

  • Financial data (accounts, transactions, ledgers)
  • E-commerce orders (order → items → products → inventory)
  • User management (users → roles → permissions)
  • Any system where data integrity is critical

ACID guarantees:

Atomicity   — transaction either fully completes or fully rolls back
Consistency — data always meets defined constraints (foreign keys, checks)
Isolation   — concurrent transactions don't interfere with each other
Durability  — committed transactions survive crashes (WAL on disk)

Scaling relational databases:

  • Vertical scaling takes you surprisingly far (PostgreSQL handles 10K+ QPS on a large instance)
  • Connection pooling (PgBouncer) — critical at scale
  • Read replicas — distribute read load
  • Partitioning — built-in table partitioning for large tables
  • Sharding — last resort (complex, breaks joins across shards)

Document Databases

Best for: Hierarchical, nested data that's read together; flexible schema; high write throughput.

MongoDB example:

JSON
{
  "_id": "order_123",
  "userId": "user_456",
  "status": "shipped",
  "items": [
    { "productId": "prod_789", "quantity": 2, "price": 29.99 },
    { "productId": "prod_012", "quantity": 1, "price": 49.99 }
  ],
  "address": {
    "street": "123 Main St",
    "city": "Seattle",
    "zip": "98101"
  }
}

Pros: Schema flexibility. Nested data reads are fast (no joins needed). Horizontal scaling built-in.

Cons: Cross-document transactions are limited (MongoDB has supported multi-document transactions since 4.0, but they carry a performance cost and are often avoided). Denormalization leads to data duplication and inconsistency. Bad for complex relational queries.

Use when: The data is naturally hierarchical and read together. Content management, product catalogs, user profiles with variable fields.


Key-Value Stores

Best for: Simple lookups by key, caching, session storage, distributed counters.

Key               Value
────────────────────────────────────
session:abc123  → {userId:1, cart:[...]}
rate:192.168.1.1:2026041415 → 47
feature:dark-mode → enabled
lock:payment:order_123 → processing

Redis dominates here (with data structures beyond simple strings).

DynamoDB serves as both a key-value and document store with managed scaling.

Use when: You need sub-millisecond reads on simple key lookups. Sessions, caches, leaderboards, rate limiting.
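The core contract of a key-value store is tiny: get/set by key, optional expiry. A toy in-memory sketch in Python — the `TTLStore` name and lazy-expiry approach are illustrative, not how Redis is implemented, but the interface mirrors what Redis gives you with `SET key value EX seconds`:

```python
import time

class TTLStore:
    """Minimal in-memory key-value store with per-key expiry (a toy
    stand-in for a Redis-style store; not production code)."""

    def __init__(self):
        self._data = {}  # key -> (value, expires_at or None)

    def set(self, key, value, ttl_seconds=None):
        expires_at = time.time() + ttl_seconds if ttl_seconds else None
        self._data[key] = (value, expires_at)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and time.time() >= expires_at:
            del self._data[key]  # lazy expiry on read
            return None
        return value

store = TTLStore()
store.set("session:abc123", {"userId": 1}, ttl_seconds=1800)
store.set("feature:dark-mode", "enabled")
```

The TTL is what makes this shape so good for sessions and rate-limit counters: stale entries clean themselves up instead of needing a batch delete job.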


Column-Family Databases

Best for: Write-heavy workloads, time-series-like data, massive scale with eventual consistency.

Cassandra data model:

Partition key determines which node stores the data.
Clustering key determines row ordering within a partition.

CREATE TABLE sensor_readings (
    device_id  UUID,
    timestamp  TIMESTAMP,
    temperature FLOAT,
    humidity    FLOAT,
    PRIMARY KEY (device_id, timestamp)  -- device_id=partition, timestamp=clustering
) WITH CLUSTERING ORDER BY (timestamp DESC);

Writes are extremely fast — Cassandra appends to an in-memory structure (memtable) and a commit log. No random writes.

Reads must match the data model — you cannot query a column-family store like a relational DB. Design tables for your access patterns.
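The partition-key / clustering-key split can be modeled in a few lines: each partition is an independent bucket, and rows within it stay ordered by the clustering key. A rough Python sketch (the `ColumnFamilyTable` name is made up here, and real Cassandra distributes partitions across nodes — this only illustrates the data model):

```python
from collections import defaultdict

class ColumnFamilyTable:
    """Toy model of the sensor_readings table above:
    partition key = device_id, clustering key = timestamp."""

    def __init__(self):
        self._partitions = defaultdict(list)  # device_id -> [(ts, row)]

    def insert(self, device_id, timestamp, row):
        # Write lands in exactly one partition, kept ordered by timestamp
        part = self._partitions[device_id]
        part.append((timestamp, row))
        part.sort(key=lambda tr: tr[0])

    def latest(self, device_id, limit=10):
        # Efficient reads exist only *within* a partition — the access
        # pattern the table was designed for (CLUSTERING ORDER BY ts DESC)
        rows = self._partitions[device_id]
        return [row for ts, row in reversed(rows[-limit:])]

t = ColumnFamilyTable()
t.insert("dev-1", 100, {"temperature": 21.5})
t.insert("dev-1", 200, {"temperature": 22.0})
```

Note what's missing: there is no way to query across partitions by temperature. That's not an oversight — it's the trade the data model makes for fast writes and fast per-partition reads.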

Use when: IoT telemetry, time-series sensor data, activity logs, write-heavy workloads at massive scale.


Graph Databases

Best for: Highly connected data where relationships are first-class.

Relational approach for "friends of friends":
  SELECT u2.name FROM users u1
  JOIN friendships f1 ON u1.id = f1.user_id
  JOIN friendships f2 ON f1.friend_id = f2.user_id
  JOIN users u2 ON f2.friend_id = u2.id
  WHERE u1.id = 123
  -- Gets exponentially complex at 3+ hops

Graph approach (Cypher query, Neo4j):
  MATCH (u:User {id: 123})-[:FRIENDS_WITH*1..3]-(recommended)
  RETURN recommended.name
  -- Natural traversal, optimized for graph walks

Use when: Social graphs, recommendation engines, fraud detection (unusual connection patterns), knowledge graphs, network topology.


Read Replicas

Read replicas are asynchronous copies of the primary database. Reads are routed to replicas; writes go to the primary.

Write:
  App → Primary DB (writes acknowledged here)
          ↓ async replication
       Replica 1 (reads routed here)
       Replica 2 (reads routed here)
       Replica 3 (cross-region, for disaster recovery)

Read:
  App → Replica 1 (may be slightly behind primary)

Consistency trade-off: Replicas lag behind the primary. Replication lag is typically 10ms–1s under normal conditions. For most read workloads, this is acceptable. For "read your own write" scenarios, route the user's first post-write read to the primary.

Scale factor: 3 read replicas give you roughly 3× the read throughput of a single server (a bit more if the primary also serves reads, a bit less once replication-apply overhead on each replica is counted).
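The routing rule above often lives in the application or a database proxy. A hedged Python sketch of "read your own write" — the connection objects are just strings here, and the lag window is a placeholder you'd tune to your observed replication lag:

```python
import random
import time

class ReplicaRouter:
    """Writes go to the primary; reads go to a random replica, except for
    a short window after a user's own write (to mask replication lag)."""

    def __init__(self, primary, replicas, lag_window_seconds=1.0):
        self.primary = primary
        self.replicas = replicas
        self.lag_window = lag_window_seconds
        self._last_write = {}  # user_id -> timestamp of their last write

    def connection_for_write(self, user_id):
        self._last_write[user_id] = time.time()
        return self.primary

    def connection_for_read(self, user_id):
        wrote_recently = (
            time.time() - self._last_write.get(user_id, 0.0) < self.lag_window
        )
        if wrote_recently:
            return self.primary            # read-your-own-writes: primary
        return random.choice(self.replicas)  # otherwise spread read load

router = ReplicaRouter(primary="primary-db",
                       replicas=["replica-1", "replica-2", "replica-3"])
```

The per-user timestamp map is the simplest version of this; real systems often carry a replication position (e.g. a log sequence number) in the user's session instead of a fixed time window.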


Sharding

Sharding is horizontal partitioning — splitting data across multiple database nodes. Each node holds a subset of the data.

Without sharding: 1 node, all data
  Node 1: all 100M users

With sharding: 4 nodes, data divided
  Shard 1: users 0–24M
  Shard 2: users 25M–49M
  Shard 3: users 50M–74M
  Shard 4: users 75M–99M

Sharding Strategies

Range-based sharding:

Shard by user_id range:
  Shard 1: user_id 1 – 1,000,000
  Shard 2: user_id 1,000,001 – 2,000,000
  ...

Pros: Simple range queries (find all users signed up this month). Cons: Hot spots — new users all land on the latest shard. Uneven data distribution.

Hash-based sharding:

shard_id = hash(user_id) % num_shards
user_id 12345 → hash = 98765432 → 98765432 % 4 = 0 → Shard 0
user_id 12346 → hash = 87654321 → 87654321 % 4 = 1 → Shard 1

Pros: Even distribution. Eliminates hot spots. Cons: Range queries span all shards. Resharding when adding nodes is painful (all keys remap).
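The resharding pain is easy to demonstrate. A sketch using Python's `zlib.crc32` as a stand-in hash (real sharding layers pick their own hash function) — counting how many keys move when the shard count changes:

```python
import zlib

NUM_SHARDS = 4

def shard_for(user_id: int) -> int:
    """Hash-based sharding: the same key always maps to the same shard,
    and keys spread evenly across shards."""
    return zlib.crc32(str(user_id).encode()) % NUM_SHARDS

# Resharding pain: count how many of 10,000 keys land on a different
# shard when the shard count goes from 4 to 5 — with plain modulo,
# most of them do, forcing a massive data migration.
moved = sum(
    1 for uid in range(10_000)
    if zlib.crc32(str(uid).encode()) % 4 != zlib.crc32(str(uid).encode()) % 5
)
```

With modulo sharding, going from N to N+1 shards remaps the large majority of keys, which is exactly the problem consistent hashing addresses.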

Consistent hashing solves the resharding problem:

Nodes arranged on a hash ring
New node added → only keys between it and its predecessor remap
Result: ~1/N of keys remap instead of all keys
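A minimal ring fits in about 25 lines of Python. This sketch omits virtual nodes, which real implementations add to smooth out the key distribution, and uses MD5 purely as a convenient fixed hash:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring (no virtual nodes, no replication)."""

    def __init__(self, nodes=()):
        self._positions = []   # sorted hash positions on the ring
        self._nodes = {}       # position -> node name
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node: str) -> None:
        pos = self._hash(node)
        bisect.insort(self._positions, pos)
        self._nodes[pos] = node

    def node_for(self, key: str) -> str:
        # First node clockwise from the key's position, wrapping at the top
        idx = bisect.bisect_right(self._positions, self._hash(key))
        return self._nodes[self._positions[idx % len(self._positions)]]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
```

When a node is added, only the keys that fall on the arc between the new node and its predecessor change owners — everything else keeps its current home.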

Directory-based sharding: A lookup service maps each key to its shard.

Lookup service: {
  user_123 → Shard 2,
  user_456 → Shard 1,
  ...
}

Pros: Maximum flexibility — individual keys can be moved between shards without a global reshuffle. Cons: The lookup service is an extra hop, a bottleneck, and a single point of failure (SPOF).

Hot Spots in Sharding

A hot spot is when one shard receives disproportionate traffic.

Problem: celebrity user has 100M followers
  All their timeline writes → same shard
  That shard overwhelmed while others are idle
  
Solution options:
  - Add a random suffix to the key (user_123_1, user_123_2, ...) and aggregate reads
  - Route celebrity data differently
  - Use an append-only fan-out model
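The random-suffix trick from the first option can be sketched in a few lines. The split factor and key format here are illustrative choices, not a fixed convention:

```python
import random

SPLIT_FACTOR = 4  # number of sub-keys a hot key is split into

def write_key(hot_key: str) -> str:
    # Spread writes for one hot key across several physical keys,
    # which land on different shards and dilute the hot spot
    return f"{hot_key}_{random.randrange(SPLIT_FACTOR)}"

def read_keys(hot_key: str) -> list[str]:
    # The cost: reads must fetch and aggregate over every sub-key
    return [f"{hot_key}_{i}" for i in range(SPLIT_FACTOR)]
```

The trade is explicit: writes get SPLIT_FACTOR× cheaper per shard, reads get SPLIT_FACTOR× more fan-out. That's why you only do this for keys that are actually hot.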

Sharding Limitations

Cross-shard joins don't work. If User and Order are on different shards, you cannot JOIN them in a single query. Solutions:

  • Denormalize (embed enough data in each document/row)
  • Aggregate at application level
  • Accept that some queries require cross-shard scatter-gather

Sharding is a last resort. Explore vertical scaling, read replicas, caching, and partitioning before sharding.


Database Indexes Deep Dive

Indexes are the most impactful query optimization available. Understand them.

B-Tree Index (Default)

Table: orders (100M rows)
Query: SELECT * FROM orders WHERE user_id = 12345

Without index: Full table scan = 100M row comparisons
With B-tree index on user_id:
  ~log₂(100M) ≈ 27 key comparisons — and since each B-tree node holds
  hundreds of keys, that's only 3-4 page reads from disk

B-tree indexes support: =, <, >, BETWEEN, LIKE 'prefix%'
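The arithmetic behind those numbers, assuming a fanout of roughly 300 keys per B-tree page (a plausible figure; the real value depends on key size and page size):

```python
import math

rows = 100_000_000
fanout = 300  # assumed keys per B-tree page; varies with key/page size

# Binary-search view: comparisons needed to narrow 100M sorted keys
binary_comparisons = math.log2(rows)

# Disk view: tree levels (page reads) at the assumed fanout
btree_depth = math.ceil(math.log(rows, fanout))
```

`binary_comparisons` comes out just under 27, while `btree_depth` is 4 — which is why index lookups cost a handful of page reads even on huge tables.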

Hash Index

Good for: exact equality lookups only (=)
Bad for: range queries (< > BETWEEN)

Composite Index

SQL
-- Index on (user_id, created_at)
CREATE INDEX idx_orders_user_created ON orders (user_id, created_at DESC);

-- This query uses the index fully:
SELECT * FROM orders WHERE user_id = 123 ORDER BY created_at DESC LIMIT 10;

-- This query uses only the first column of the index (user_id):
SELECT * FROM orders WHERE user_id = 123;

-- This query cannot use the index (doesn't start with user_id):
SELECT * FROM orders WHERE created_at > '2026-01-01';

Left-prefix rule: Composite indexes are used left-to-right. An index on (a, b, c) helps queries filtering on a, a+b, or a+b+c, but not on b alone or c alone.

Partial Index

SQL
-- Index only on active orders — smaller, faster
CREATE INDEX idx_active_orders ON orders (user_id) WHERE status = 'active';

-- Very efficient for:
SELECT * FROM orders WHERE user_id = 123 AND status = 'active';

Index Trade-offs

Every index speeds up reads but slows down writes (the index must be updated on insert/update/delete). Don't over-index write-heavy tables.


Connection Pooling

Database connections are expensive. Each PostgreSQL connection consumes ~5-10MB RAM and a process/thread. At scale, you need pooling.

Without pooling:
  100 app servers × 10 connections each = 1,000 PG connections
  PostgreSQL limit ≈ 100-300 connections before performance degrades

With PgBouncer (transaction mode):
  100 app servers → PgBouncer → 50 PG connections
  PgBouncer multiplexes app connections over the pool
  App gets a connection for the duration of one transaction, then returns it
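The pooling idea — a small fixed set of real connections, many clients borrowing them briefly — fits in a few lines. A toy Python sketch, where `connect` is a placeholder for a real driver call (this models the transaction-mode behavior described above, not PgBouncer itself):

```python
import queue

class ConnectionPool:
    """Toy transaction-mode pool: a client holds a connection only for
    one unit of work, then returns it for the next client to reuse."""

    def __init__(self, connect, size):
        self._pool = queue.Queue()
        for _ in range(size):
            self._pool.put(connect())  # open a fixed number of connections

    def run(self, work):
        conn = self._pool.get()        # blocks if all connections are busy
        try:
            return work(conn)
        finally:
            self._pool.put(conn)       # always return the connection

# Placeholder "driver": each call stands in for opening one PG connection
pool = ConnectionPool(connect=lambda: object(), size=5)
```

However many clients call `run`, the database only ever sees 5 connections — blocking callers queue up instead of exhausting the server.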

NewSQL — Strong Consistency + Horizontal Scale

NewSQL databases attempt to give you relational semantics (ACID, SQL) with horizontal scalability.

CockroachDB: Distributed SQL. Postgres-compatible. Multi-region. Survives datacenter failures.

SQL
-- Standard SQL syntax
SELECT o.id, u.name, SUM(oi.price * oi.quantity) as total
FROM orders o
JOIN users u ON o.user_id = u.id
JOIN order_items oi ON o.id = oi.order_id
GROUP BY o.id, u.name;
-- But data is distributed across nodes automatically

Google Spanner: Google's globally distributed database — the storage layer beneath the F1 system that powers Google Ads. Achieves external consistency using TrueTime (synchronized atomic clocks + GPS).

Use NewSQL when: You need both horizontal scalability AND relational semantics. Warning: they add operational complexity and cost.


Choosing a Database: Real Examples

Social Media Feed (Twitter/Instagram)

  • Timeline writes: Cassandra — write-heavy, append-only, time-ordered, eventually consistent is fine
  • User profiles: PostgreSQL or DynamoDB — structured, occasional updates
  • Social graph: Neo4j or adjacency list in PostgreSQL (for simpler graphs)
  • Media storage: Object storage (S3) — databases are not for images/videos

E-commerce Cart

  • Cart contents: Redis — fast reads/writes, TTL-based expiry, needs to be fast
  • Orders: PostgreSQL — transactions, inventory consistency critical
  • Product catalog: MongoDB or PostgreSQL — read-heavy, flexible schema per product type

Real-Time Chat App

  • Messages: Cassandra — write-heavy, time-series per conversation, high volume
  • Active presence/online status: Redis — fast TTL-based status
  • User/channel metadata: PostgreSQL

IoT Time-Series Sensor System

  • Sensor readings: TimescaleDB (PostgreSQL extension) or InfluxDB — time-series optimized, compression, time-based queries
  • Device registry: PostgreSQL — structured, relational device metadata
  • Alerting state: Redis — fast read of current alert status

Key Takeaways

  • Relational (PostgreSQL): ACID transactions, complex queries, financial data. Scales further than people expect.
  • Document (MongoDB): Hierarchical data, flexible schema, high write throughput.
  • Key-Value (Redis, DynamoDB): Sub-millisecond lookups, sessions, caching.
  • Column-Family (Cassandra): Massive write throughput, time-series, eventual consistency.
  • Graph (Neo4j): Highly connected data, relationship traversals.
  • Read replicas: Scale reads horizontally. Accept small replication lag.
  • Sharding: Last resort. Breaks joins. Use consistent hashing to minimize resharding pain.
  • Hot spots happen with range sharding + uneven access — design shard keys carefully.
  • Indexes: B-tree for most queries, composite indexes follow left-prefix rule, partial indexes for filtered subsets.
  • PgBouncer: Essential for PostgreSQL at scale — prevents connection exhaustion.
  • NewSQL: CockroachDB/Spanner when you need both distributed scale AND relational semantics.