System Design · Lesson 5 of 26
Load Balancing — Distribute Traffic Without Thinking About It
A load balancer is the front door of any scaled system. It sits between clients and servers and distributes incoming requests so no single server bears the entire load. Simple concept — complex reality.
What a Load Balancer Does
Without a load balancer:

```
Client → Server   (single point of failure, capacity ceiling)
```

With a load balancer:

```
                   ┌─────────────────────┐
Client ──────────→ │    Load Balancer    │
                   └──────────┬──────────┘
                      ┌───────┼───────┐
                  ┌───▼───┐ ┌─▼────┐ ┌─▼─────┐
                  │ App 1 │ │App 2 │ │ App 3 │
                  └───────┘ └──────┘ └───────┘
```

A load balancer does more than just distribute traffic:
- Routes requests to healthy instances
- Health checks — removes unhealthy instances from the pool
- TLS termination — decrypts HTTPS at the edge so backend servers handle plain HTTP
- Connection management — maintains persistent connections to backends
- Observability — request logging, metrics, request tracing headers
L4 vs L7 Load Balancing
This is one of the most important distinctions. It's about which layer of the network stack the load balancer operates at.
L4 — Transport Layer (TCP/UDP)
The load balancer sees IP addresses and port numbers. It does not look at the content of the request.
L4 sees:

```
Source IP: 192.168.1.50
Dest IP:   10.0.0.1
Dest Port: 443

Routing decision: based only on IP/port
```

Characteristics:
- Very fast (minimal processing per packet)
- No understanding of HTTP, WebSocket, gRPC
- Cannot route based on URL paths or headers
- Lower CPU overhead
- Works with any TCP/UDP protocol
Use when: Raw throughput is critical, or when you're using non-HTTP protocols (database connections, custom TCP protocols).
L7 — Application Layer (HTTP)
The load balancer understands HTTP. It can inspect URLs, headers, cookies, and request bodies.
L7 sees:

```
GET /api/users/123 HTTP/2
Host: api.example.com
Authorization: Bearer eyJ...
Cookie: session=abc123

Routing decision: based on URL, headers, cookies
```

Characteristics:
- Slower per request (parses HTTP)
- Content-based routing (route /api/* to API servers, /static/* to a CDN)
- Can inject/rewrite headers
- Can terminate TLS and re-encrypt
- Sticky sessions via cookies
- More observability (knows request paths, status codes)
Use when: HTTP/HTTPS traffic (which is most web applications).
Load Balancing Algorithms
Round Robin
Requests are distributed in order: Server 1, Server 2, Server 3, Server 1, Server 2...
```
Request 1 → Server 1
Request 2 → Server 2
Request 3 → Server 3
Request 4 → Server 1 (wraps around)
```

Pros: Simple. Uniform distribution (assuming equal server capacity).
Cons: Doesn't account for server load. A slow server may accumulate requests.
Use when: All servers have equal capacity and requests are roughly equal in processing time.
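Round robin is just a wrapping iterator. A minimal sketch in Python (server names are placeholders, not from any real deployment):

```python
from itertools import cycle

servers = ["server-1", "server-2", "server-3"]
rotation = cycle(servers)  # endless iterator that wraps around

# First four requests hit server-1, server-2, server-3, then wrap to server-1
picks = [next(rotation) for _ in range(4)]
print(picks)  # ['server-1', 'server-2', 'server-3', 'server-1']
```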
Weighted Round Robin
Servers with higher capacity get proportionally more requests.
```
Server 1 weight=3, Server 2 weight=1

Request 1 → Server 1
Request 2 → Server 1
Request 3 → Server 1
Request 4 → Server 2
Request 5 → Server 1
...
```

Use when: Servers have different hardware specs, or you're doing a gradual canary rollout (the new version starts at weight=1 and is gradually increased).
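The simplest way to implement weighting is to repeat each server in the rotation by its weight — a sketch only; production balancers like NGINX use a "smooth" weighted variant that interleaves picks instead of sending bursts to the heavy server:

```python
from itertools import cycle

weights = {"server-1": 3, "server-2": 1}

# Naive expansion: server-1 appears 3 times, server-2 once
expanded = [name for name, w in weights.items() for _ in range(w)]
rotation = cycle(expanded)

picks = [next(rotation) for _ in range(8)]
# Over 8 requests: server-1 gets 6, server-2 gets 2 — the 3:1 ratio holds
print(picks.count("server-1"), picks.count("server-2"))  # 6 2
```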
Least Connections
New request goes to the server with the fewest active connections.
```
Server 1: 50 active connections
Server 2: 10 active connections ← next request goes here
Server 3: 35 active connections
```

Pros: Better for long-running requests. Naturally handles slow servers.
Use when: Long-lived connections (WebSockets, file uploads), or mixed workloads where some requests are much slower.
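The selection logic is a minimum over a per-server counter that the balancer maintains. A sketch using the connection counts from the example above:

```python
active = {"server-1": 50, "server-2": 10, "server-3": 35}

def pick_least_connections(active: dict) -> str:
    # Choose the backend with the fewest in-flight requests
    return min(active, key=active.get)

chosen = pick_least_connections(active)
print(chosen)  # server-2

# The balancer increments on dispatch and decrements when the response finishes
active[chosen] += 1
```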
IP Hash (Session Stickiness via Hashing)
The client's IP is hashed to always map to the same server.
```
hash(client_ip) % num_servers = server_index

192.168.1.50 → hash → mod 3 = 1 → always Server 2
```

Pros: Session affinity without cookies. Consistent routing for the same client.
Cons: Uneven distribution if many clients come from the same IP (e.g., corporate NAT). If a server dies, all hashed clients re-route and lose state.
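Both properties — stable affinity and the mass remapping on pool changes — fall out of the modulo formula. A sketch (SHA-256 is used because Python's built-in hash() is salted per process and would not be stable):

```python
import hashlib

def pick_server(client_ip: str, servers: list) -> str:
    digest = hashlib.sha256(client_ip.encode()).hexdigest()
    return servers[int(digest, 16) % len(servers)]

servers = ["server-1", "server-2", "server-3"]

# Affinity: the same client always maps to the same server
assert pick_server("192.168.1.50", servers) == pick_server("192.168.1.50", servers)

# Failure mode: removing one server changes num_servers, so roughly a
# third of clients remap and lose state — this is why consistent hashing exists
moved = sum(
    pick_server(f"10.0.0.{i}", servers) != pick_server(f"10.0.0.{i}", servers[:2])
    for i in range(100)
)
print(moved, "of 100 clients remapped")
```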
Random
Randomly select a server for each request.
Use when: Servers are truly identical and you want simplicity without round-robin's state management. Converges to even distribution at scale due to the law of large numbers.
Sticky Sessions
Sticky sessions ensure that a client always hits the same backend server.
First request:

```
Client → LB → Server A
LB sets cookie: SERVERID=server-a
```

Subsequent requests:

```
Client → LB (reads SERVERID cookie) → Server A (always)
```

Why sticky sessions exist: Stateful applications that store session data in memory need the same server to handle all requests from the same user.
Why sticky sessions hurt scalability:
- If Server A crashes, all its sticky clients lose their sessions
- Load becomes uneven as users "pile up" on certain servers
- Adding new servers doesn't immediately rebalance existing clients
- Prevents true horizontal scaling
The fix: eliminate the need for sticky sessions by externalizing session state to Redis. Then any server can handle any request.
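The request path under that design can be sketched in a few lines of Python (a plain dict stands in for Redis here; in production it would be a shared Redis instance reachable from every app server):

```python
# Shared store: any server can read any session
session_store = {"session:abc": {"cart": ["item1", "item2"]}}

def handle_request(server_name: str, session_id: str):
    # State lives in the shared store, not in server memory,
    # so the LB can send this request to any server
    session = session_store.get(f"session:{session_id}", {})
    return server_name, session.get("cart", [])

# The same session works regardless of which server the LB picks
assert handle_request("server-a", "abc") == ("server-a", ["item1", "item2"])
assert handle_request("server-b", "abc") == ("server-b", ["item1", "item2"])
```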
Without Redis (needs sticky sessions):

```
Server A: {session:abc → {cart: [item1, item2]}}
Server B: {session:def → {cart: [item3]}}
```

With Redis (no sticky sessions needed):

```
Redis: {
  session:abc → {cart: [item1, item2]},
  session:def → {cart: [item3]}
}

Any server reads from Redis on every request
```

Health Checks
Load balancers must detect unhealthy instances and stop routing to them.
Active Health Checks
The load balancer periodically sends test requests to each backend.
```
Every 10 seconds:
LB → Server 1: GET /health → 200 OK  ✓
LB → Server 2: GET /health → 200 OK  ✓
LB → Server 3: GET /health → timeout ✗ (remove from pool)
```

Your /health endpoint should check:
- Application is running
- Database connection is alive
- Key dependencies are reachable
- It should not check things that could trigger a cascade (don't call external APIs from a health check)
```csharp
// ASP.NET Core; UIResponseWriter comes from the AspNetCore.HealthChecks.UI.Client package
app.MapHealthChecks("/health", new HealthCheckOptions
{
    ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse
}).AllowAnonymous();
```

Passive Health Checks
The load balancer monitors real traffic responses. If a server returns 5xx errors, it's marked unhealthy.
More responsive (no polling interval), but requires real user traffic to detect failures.
Most production LBs use both: active checks for basic liveness, passive checks for quality degradation.
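The passive side can be sketched as an error-rate tracker over a rolling window — the idea only; real balancers add time windows, ejection periods, and gradual re-introduction of recovered hosts:

```python
from collections import deque

class PassiveHealthTracker:
    """Marks a backend unhealthy when too many recent responses are 5xx."""

    def __init__(self, window: int = 10, max_error_rate: float = 0.5):
        self.recent = deque(maxlen=window)  # rolling window of error flags
        self.max_error_rate = max_error_rate

    def record(self, status_code: int) -> None:
        self.recent.append(status_code >= 500)

    def healthy(self) -> bool:
        if not self.recent:
            return True  # no traffic yet — active checks cover this case
        return sum(self.recent) / len(self.recent) <= self.max_error_rate

tracker = PassiveHealthTracker()
for code in [200, 200, 503, 500, 502, 500]:
    tracker.record(code)
print(tracker.healthy())  # 4 errors out of 6 > 50% → False
```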
Software Load Balancers
NGINX
The most widely used. Can be both a web server and a reverse proxy/load balancer.
```nginx
upstream api_servers {
    least_conn;                          # algorithm
    server app1.internal:8080 weight=3;
    server app2.internal:8080 weight=1;
    server app3.internal:8080 backup;    # only used if others fail
    keepalive 32;                        # connection pool to backends
}

server {
    listen 443 ssl;
    location /api/ {
        proxy_pass http://api_servers;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```

HAProxy
High Availability Proxy — often preferred for TCP load balancing and high-performance HTTP.
```haproxy
frontend http-in
    bind *:80
    default_backend web-servers

backend web-servers
    balance roundrobin
    option httpchk GET /health
    server web1 10.0.0.1:8080 check
    server web2 10.0.0.2:8080 check
    server web3 10.0.0.3:8080 check
```

Cloud Load Balancers
- AWS Application Load Balancer (ALB) — L7, HTTP/HTTPS/WebSocket/gRPC
- AWS Network Load Balancer (NLB) — L4, ultra-high throughput, static IPs
- Azure Application Gateway — L7 with WAF, SSL offload
- Azure Load Balancer — L4, internal and external
DNS Load Balancing
DNS returns multiple IP addresses for the same hostname. Clients connect to one of the returned IPs.
```
DNS query: api.example.com
DNS response:
  api.example.com → 10.0.0.1
  api.example.com → 10.0.0.2
  api.example.com → 10.0.0.3

Client picks one (usually the first, sometimes round-robin)
```

Pros: Simple. No single load balancer. Can route to different geographies.
Cons: DNS TTL means clients cache the IP — you can't quickly remove a failed server. No health checks (DNS doesn't know if an IP is alive). Client-side load balancing is unpredictable.
Use for: Geographic routing (different DNS responses for US vs EU clients), or as a first hop before regional load balancers.
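The TTL problem can be made concrete with a toy resolver cache — hypothetical names, not a real DNS client; the point is that a dead server keeps receiving traffic until the cached record expires:

```python
import time

class CachedDnsClient:
    def __init__(self, resolve, ttl: float = 300, clock=time.monotonic):
        self.resolve = resolve  # function: hostname -> list of IPs
        self.ttl = ttl
        self.clock = clock
        self.cache = {}         # hostname -> (ips, fetched_at)

    def lookup(self, hostname: str):
        entry = self.cache.get(hostname)
        if entry and self.clock() - entry[1] < self.ttl:
            return entry[0]     # cached — even if a server died meanwhile
        ips = self.resolve(hostname)
        self.cache[hostname] = (ips, self.clock())
        return ips

# Simulated clock and authoritative answer
now = [0.0]
answers = [["10.0.0.1", "10.0.0.2"]]
client = CachedDnsClient(lambda host: list(answers[0]), ttl=300, clock=lambda: now[0])

first = client.lookup("api.example.com")
answers[0] = ["10.0.0.2"]   # 10.0.0.1 fails and is pulled from DNS
now[0] = 100.0              # 100s later: still within TTL
assert client.lookup("api.example.com") == first   # stale — dead IP still served
now[0] = 301.0              # TTL expired — cache refreshes
assert client.lookup("api.example.com") == ["10.0.0.2"]
```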
Global Load Balancing with Anycast
Anycast assigns the same IP address to servers in multiple geographic locations. BGP routing directs clients to the nearest server.
```
IP: 1.2.3.4 exists at:
  - New York datacenter
  - London datacenter
  - Tokyo datacenter

Client in London   → BGP routes to London (shortest AS path)
Client in New York → BGP routes to New York
Client in Seoul    → BGP routes to Tokyo
```

Cloudflare, AWS CloudFront, and Google Cloud CDN use anycast for their edge networks. This is how CDNs route you to the nearest edge with a single IP.
Database Load Balancing
Application-level load balancers don't work well for databases. Databases have state.
PgBouncer — PostgreSQL Connection Pooling
PostgreSQL connections are expensive (each consumes ~5-10MB RAM). PgBouncer acts as a proxy that manages a pool of connections.
Without PgBouncer:

```
1000 app instances × 10 DB connections each = 10,000 PG connections
PostgreSQL struggles above ~500-1000 connections
```

With PgBouncer:

```
1000 app instances → PgBouncer (pool of 100 PG connections) → PostgreSQL
PgBouncer multiplexes app requests over the pool
```

```ini
# pgbouncer.ini
[pgbouncer]
pool_mode = transaction    # connection released after each transaction
max_client_conn = 1000     # app connections PgBouncer accepts
default_pool_size = 100    # connections to PostgreSQL
```

Load Balancer vs API Gateway
These are frequently confused. They're different tools:
| | Load Balancer | API Gateway |
|---|---|---|
| Primary purpose | Distribute traffic | Manage API access |
| Routing | By IP, URL prefix | By route, method, version |
| Authentication | No | Yes (JWT, API keys, OAuth) |
| Rate limiting | No | Yes |
| Request transformation | No | Yes |
| Service aggregation | No | Yes (BFF pattern) |
| Protocol translation | Limited | Yes (REST → gRPC) |
| Examples | NGINX, HAProxy, AWS ALB | Kong, YARP, AWS API Gateway |
In practice: Production systems often have both — a load balancer for raw traffic distribution and a layer 7 API gateway for authentication, routing, and rate limiting.
Key Takeaways
- L4 load balancing is fast and protocol-agnostic. L7 understands HTTP and enables content-based routing.
- Round robin for equal servers. Least connections for long-running workloads. IP hash for session affinity without cookies.
- Sticky sessions are a smell — externalize session state to Redis and go stateless instead.
- Health checks: active (LB polls /health) + passive (monitor real traffic errors).
- NGINX and HAProxy dominate software LBs. Cloud LBs (ALB, Azure App Gateway) are managed equivalents.
- DNS load balancing works for geographic routing. Anycast routes by network proximity.
- PgBouncer solves database connection exhaustion.
- A load balancer distributes traffic. An API gateway manages API access, auth, and rate limiting. Use both.