System Design · Lesson 23 of 26
Case Study: Design a Rate Limiter
Rate limiters protect services from being overwhelmed — by clients making too many requests, by scrapers, by credential-stuffing bots, or by your own buggy retry loops. Designing one correctly requires choosing the right algorithm for the right use case. This case study covers the full design.
Requirements
Functional:
- Limit requests by: user ID, API key, IP address, or endpoint
- Configurable limits: 100 requests/minute, 10 requests/second, etc.
- Return 429 Too Many Requests with a Retry-After header when the limit is exceeded
- Limits can differ per client tier (free: 100/min, paid: 10,000/min)
Non-functional:
- Low latency — rate limit check must add <5ms to every request
- Distributed — multiple API servers must share rate limit state
- Accurate — a client at exactly the limit should not be blocked; one over should
- Resilient — if the rate limiter fails, the system should degrade gracefully (fail open or fail closed, depending on the use case)
The Four Algorithms
1. Fixed Window Counter
Divide time into fixed windows (e.g., 1-minute buckets). Count requests in the current window. Reset the counter at the window boundary.
Window: [12:00:00 — 12:00:59]
Limit: 100 requests/minute
Request count in window: 73 → allow
Request count in window: 100 → allow
Request count in window: 101 → block

Redis implementation:

INCR user:123:2026041512:00
EXPIRE user:123:2026041512:00 60

Problem: boundary burst. A client can fire 100 requests at 12:00:59 and another 100 at 12:01:00 — 200 requests in 2 seconds, and both windows allow it. This breaks the spirit of the limit.
Use when: The exact burst pattern doesn't matter and you need the simplest possible implementation.
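The counter logic can be sketched without Redis. Here is a minimal single-process version (the class name and parameters are illustrative, not from the lesson):

```python
import time

class FixedWindowLimiter:
    """Single-process fixed window counter (illustrative sketch)."""

    def __init__(self, limit: int, window_seconds: int = 60):
        self.limit = limit
        self.window = window_seconds
        self.counts = {}  # (key, window_id) -> request count

    def allow(self, key: str, now=None) -> bool:
        now = time.time() if now is None else now
        bucket = (key, int(now // self.window))  # id of the current window
        count = self.counts.get(bucket, 0)
        if count >= self.limit:
            return False  # window is full
        self.counts[bucket] = count + 1
        return True
```

Note that requests at now=59 and now=60 land in different windows even though they are one second apart — exactly the boundary-burst weakness described above.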
2. Sliding Window Log
Store a timestamp log for each client. For each request, remove entries older than the window, count remaining entries.
Window: 1 minute
Limit: 100 requests
Log for user:123: [12:00:01, 12:00:15, 12:00:23, ..., 12:00:58]
New request at 12:01:10:
- Remove entries before 12:00:10 (older than 1 minute)
- Count remaining
- If count < 100: allow, add 12:01:10 to log

Redis implementation: Use a sorted set with the timestamp as the score.

ZADD user:123:log 1714900870 1714900870
ZREMRANGEBYSCORE user:123:log 0 (now - 60)
ZCARD user:123:log

Accurate, with no boundary burst. But memory-heavy: it stores every timestamp. For 1M users each at 100 requests/minute, that is 100M sorted set entries.
Use when: Accuracy is critical and you can tolerate higher memory usage.
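A single-process sketch of the same log-eviction logic (names are illustrative; a deque keeps eviction of old timestamps cheap):

```python
import time
from collections import defaultdict, deque

class SlidingWindowLog:
    """Single-process sliding window log (illustrative sketch)."""

    def __init__(self, limit: int, window_seconds: float = 60.0):
        self.limit = limit
        self.window = window_seconds
        self.logs = defaultdict(deque)  # key -> timestamps, oldest first

    def allow(self, key: str, now=None) -> bool:
        now = time.time() if now is None else now
        log = self.logs[key]
        # Evict timestamps that have fallen out of the window
        while log and log[0] <= now - self.window:
            log.popleft()
        if len(log) >= self.limit:
            return False
        log.append(now)
        return True
```

The memory cost is visible here: every allowed request leaves a timestamp in the log until it ages out.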
3. Sliding Window Counter (Hybrid — Best for Most Cases)
Combine fixed window simplicity with sliding accuracy. Use two adjacent windows and weight by overlap.
Previous window count: 80 (12:00:00 — 12:00:59)
Current window count: 30 (12:01:00 — 12:01:59)
Current time: 12:01:30 (30 seconds into the current window, so 50% of the previous window still falls inside the sliding 60-second window)
Weighted estimate = 80 * (1 - 0.5) + 30 = 40 + 30 = 70 → allow

This approximates a true sliding window with just two counters per user. Accurate enough for 99% of use cases — the approximation error is <1%.
Redis implementation:
-- Lua script for atomicity
local current_key = "user:123:2026041512:01"
local prev_key = "user:123:2026041512:00"
local now = tonumber(ARGV[1])
local window_size = 60
local limit = 100
local current_count = tonumber(redis.call("GET", current_key) or 0)
local prev_count = tonumber(redis.call("GET", prev_key) or 0)
local elapsed_ratio = (now % window_size) / window_size
local estimate = prev_count * (1 - elapsed_ratio) + current_count
if estimate >= limit then
return 0 -- blocked
end
redis.call("INCR", current_key)
redis.call("EXPIRE", current_key, window_size * 2)
return 1 -- allowed

Use when: You want accuracy close to a sliding window log with the memory footprint of fixed window counters.
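The same weighted-estimate logic as the Lua script, sketched as a single-process Python class (names are illustrative) — useful for checking the arithmetic against the worked example above:

```python
import time

class SlidingWindowCounter:
    """Single-process sliding window counter (illustrative sketch)."""

    def __init__(self, limit: int, window_seconds: int = 60):
        self.limit = limit
        self.window = window_seconds
        self.counts = {}  # (key, window_id) -> count

    def allow(self, key: str, now=None) -> bool:
        now = time.time() if now is None else now
        window_id = int(now // self.window)
        current = self.counts.get((key, window_id), 0)
        previous = self.counts.get((key, window_id - 1), 0)
        elapsed_ratio = (now % self.window) / self.window
        # Weight the previous window by how much of it is still in view
        estimate = previous * (1 - elapsed_ratio) + current
        if estimate >= self.limit:
            return False
        self.counts[(key, window_id)] = current + 1
        return True
```

With previous=80, current=30, and 30 seconds elapsed, the estimate is 80 * 0.5 + 30 = 70, matching the example.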
4. Token Bucket
Each client gets a bucket with a maximum capacity of tokens. Tokens are added at a fixed refill rate. Each request consumes one token. If the bucket is empty, the request is rejected.
Bucket capacity: 100 tokens
Refill rate: 10 tokens/second
Current tokens: 0
→ block; retry in 0.1 seconds

Properties:
- Allows short bursts up to bucket capacity
- Average rate is bounded by refill rate
- Natural model for APIs that want to allow "bursting" for occasional heavy users
Redis implementation:
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2]) -- tokens/second
local now = tonumber(ARGV[3])
local state = redis.call("HMGET", key, "tokens", "last_refill")
local tokens = tonumber(state[1]) or capacity
local last = tonumber(state[2]) or now
-- Refill
local elapsed = now - last
local new_tokens = math.min(capacity, tokens + elapsed * refill_rate)
if new_tokens < 1 then
return 0 -- no tokens
end
redis.call("HMSET", key, "tokens", new_tokens - 1, "last_refill", now)
redis.call("EXPIRE", key, 3600)
return 1 -- allowed

Use when: You want to allow bursts but enforce a long-term average rate. Good for APIs.
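A single-process token bucket sketch mirroring the refill logic of the Lua script (constructor and parameter names are illustrative; the `now` parameter makes the refill math testable):

```python
import time

class TokenBucket:
    """Single-process token bucket (illustrative sketch)."""

    def __init__(self, capacity: float, refill_rate: float, now=None):
        self.capacity = capacity        # max tokens, i.e. max burst size
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = capacity          # start full
        self.last_refill = time.monotonic() if now is None else now

    def allow(self, now=None) -> bool:
        now = time.monotonic() if now is None else now
        # Refill based on elapsed time, capped at capacity
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens < 1:
            return False  # bucket empty
        self.tokens -= 1
        return True
```

The burst behavior is easy to see: a full bucket absorbs `capacity` back-to-back requests, after which the client is held to the refill rate.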
5. Leaky Bucket
Requests are added to a queue (the bucket). A consumer processes them at a fixed rate. If the queue is full, new requests are dropped.
Queue size: 10
Drain rate: 1 request/second
Burst of 20 requests → 10 queued, 10 dropped
Queue drains at 1/second → smooth output

Produces a smooth, constant output rate regardless of input burstiness. Good for payment processors or downstream services that can't handle bursts.
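A minimal sketch of the queue-and-drain behavior (class and method names are illustrative; draining is simulated from elapsed time rather than a background consumer):

```python
from collections import deque

class LeakyBucket:
    """Single-process leaky bucket (illustrative sketch). Requests queue up
    and drain at a fixed rate; arrivals beyond queue capacity are dropped."""

    def __init__(self, queue_size: int, drain_rate: float, now: float = 0.0):
        self.queue_size = queue_size
        self.drain_rate = drain_rate  # requests processed per second
        self.queue = deque()
        self.last_drain = now

    def _drain(self, now: float) -> None:
        # Remove as many requests as the drain rate allows since last check
        drained = int((now - self.last_drain) * self.drain_rate)
        if drained > 0:
            for _ in range(min(drained, len(self.queue))):
                self.queue.popleft()
            self.last_drain = now

    def offer(self, request, now: float) -> bool:
        """Enqueue a request; returns False if the bucket is full (dropped)."""
        self._drain(now)
        if len(self.queue) >= self.queue_size:
            return False
        self.queue.append(request)
        return True
```

A burst of 20 offers against a size-10 queue accepts exactly 10 and drops 10, matching the example above.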
Algorithm Comparison
| Algorithm | Burst allowed? | Memory | Accuracy | Complexity | Best for |
|-----------|----------------|--------|----------|------------|----------|
| Fixed window | Yes (boundary) | Low | Low | Trivial | Internal tools |
| Sliding log | No | High | Exact | Medium | Billing-critical APIs |
| Sliding counter | Minimal | Low | ~99% | Medium | Most production APIs |
| Token bucket | Yes (burst) | Low | Good | Medium | Developer APIs |
| Leaky bucket | No | Medium | Exact | Medium | Payment/downstream protection |
The answer in 90% of interviews: sliding window counter or token bucket. Know why you're choosing one over the other.
Distributed Rate Limiting
With multiple API servers, a local in-memory counter doesn't work — each server has its own count. You need a shared store.
Option 1: Redis (recommended)
Redis is single-threaded. Lua scripts execute atomically. This is the standard approach.
API Server 1 ──→ Redis Cluster
API Server 2 ──→ Redis Cluster
API Server 3 ──→ Redis Cluster

Lua scripts guarantee the check-and-increment is atomic — no race condition where two servers both read 99 and both allow a request at limit 100.
Option 2: Local approximation + sync
Each server maintains a local counter and periodically syncs it to Redis (e.g., every 100ms), accepting slight overcounting during the sync window in exchange for lower Redis load.
For very high-traffic systems (millions of req/sec), the Redis round-trip on every request becomes the bottleneck. Local approximation trades some accuracy for throughput.
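One way to sketch the local-counter-with-periodic-sync idea (all names are illustrative, and a plain dict stands in for the shared Redis store):

```python
class LocalApproximateLimiter:
    """Sketch of local counting with periodic sync to a shared store.
    The 'shared store' here is a plain dict standing in for Redis."""

    def __init__(self, limit: int, shared_store: dict, sync_interval: float = 0.1):
        self.limit = limit
        self.shared = shared_store   # last-known global count per key
        self.local = {}              # this server's un-synced increments
        self.sync_interval = sync_interval
        self.last_sync = 0.0

    def allow(self, key: str, now: float) -> bool:
        # Servers can't see each other's un-synced increments, so the
        # cluster as a whole may briefly allow more than the limit.
        seen = self.shared.get(key, 0) + self.local.get(key, 0)
        if seen >= self.limit:
            return False
        self.local[key] = self.local.get(key, 0) + 1
        if now - self.last_sync >= self.sync_interval:
            self.sync(now)
        return True

    def sync(self, now: float) -> None:
        # Flush local increments into the shared store
        for key, n in self.local.items():
            self.shared[key] = self.shared.get(key, 0) + n
        self.local.clear()
        self.last_sync = now
```

Two limiter instances sharing one store show the race window: each can admit a request the other hasn't seen yet, until the next sync reconciles the counts.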
Response Headers
Always return these headers — clients need them to implement polite retry:
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1714900920 ← Unix timestamp when the limit resets
Retry-After: 42 ← Seconds until the client can retry

On allowed requests, still return X-RateLimit-Remaining so clients can back off before hitting the limit.
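A small helper that assembles these headers might look like this (the function name and signature are illustrative, not a standard API):

```python
import math

def rate_limit_headers(limit: int, remaining: int, reset_at: int, now: float) -> dict:
    """Build rate limit response headers using the X-RateLimit-* convention."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(reset_at),  # Unix timestamp of window reset
    }
    if remaining <= 0:
        # Only blocked (429) responses need Retry-After
        headers["Retry-After"] = str(max(0, math.ceil(reset_at - now)))
    return headers
```

Computing Retry-After from the reset timestamp keeps the two headers consistent, so clients can honor either one.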
Where to Place the Rate Limiter
Three placement options:
1. API Gateway (recommended for most teams) Rate limiting at the gateway before requests reach any application code. Zero application changes. Works for all services behind the gateway.
2. Middleware per service Rate limiting as middleware inside each service. More granular control but requires each service to implement it.
3. Client-side Never rely on this alone — clients can bypass it. Use it in addition to server-side limiting to reduce unnecessary traffic.
Failure Handling
What happens when Redis is unavailable?
Fail open: Allow all requests when the rate limiter is down. Risk: unprotected traffic during outage. Best for: APIs where availability matters more than strict enforcement.
Fail closed: Block all requests when the rate limiter is down. Risk: total service outage if Redis fails. Best for: security-sensitive endpoints.
Circuit breaker: Track the Redis failure rate. If >50% of checks fail, switch to fail open automatically; monitor recovery and switch back once Redis is healthy.
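A sketch of fail-open circuit breaking around the Redis check. For brevity this uses a consecutive-failure count and a fixed probe delay rather than a failure-rate window; all thresholds and names are illustrative:

```python
import time

class FailOpenBreaker:
    """Illustrative fail-open circuit breaker around a rate limit check."""

    def __init__(self, max_failures: int = 5, retry_after: float = 30.0):
        self.max_failures = max_failures
        self.retry_after = retry_after  # seconds before probing Redis again
        self.failures = 0
        self.open_until = 0.0

    def check(self, redis_check, now=None) -> bool:
        """redis_check() returns True to allow / False to block, and raises
        on Redis errors. While the circuit is open, all requests are allowed."""
        now = time.monotonic() if now is None else now
        if now < self.open_until:
            return True                 # circuit open: fail open, skip Redis
        try:
            allowed = redis_check()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.open_until = now + self.retry_after  # trip the circuit
                self.failures = 0
            return True                 # fail open on this error too
        self.failures = 0               # Redis healthy again: circuit closes
        return allowed
```

While the circuit is open the limiter adds no Redis latency at all; after `retry_after` seconds the next request probes Redis, and a success closes the circuit.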
What Interviewers Are Actually Testing
- You know at least two algorithms and can explain when to use each — not just "use a token bucket"
- You explain the boundary burst problem with fixed windows
- You know Redis Lua scripts for atomicity — critical for distributed correctness
- You discuss failure modes — fail open vs fail closed
- You return correct headers — shows API design thinking
- You place the limiter at the right layer — gateway vs middleware
Quick Reference
Best algorithm: Sliding window counter (accurate, low memory)
Storage: Redis (Lua scripts for atomic check-and-increment)
Burst use case: Token bucket
Smooth output: Leaky bucket
Response: 429 + Retry-After + X-RateLimit-* headers
Failure: Fail open (availability) or fail closed (security) — state your assumption
Placement: API gateway for simplicity, middleware for per-endpoint granularity