System Design · Lesson 22 of 26
Case Study: Design a Notification System (Email/SMS/Push)
Notification systems appear deceptively simple — "just send an email" — until you're handling 100M users, fanout to millions of devices, deduplication, delivery guarantees, and rate limits per channel. This case study walks through the full design.
Requirements
Functional:
- Support multiple channels: email, SMS, push (mobile), in-app
- Notifications can be triggered by events (user action) or scheduled (marketing campaign)
- Users can configure notification preferences (opt out of SMS, disable emails)
- Support templates with variable substitution
- Delivery receipts (sent, delivered, read)
Non-functional:
- High throughput: up to 1M notifications/second during peak (marketing blasts)
- At-least-once delivery — it's worse to miss a notification than deliver it twice
- Low latency for transactional notifications (OTP codes, order confirmations): <5s end-to-end
- Best-effort for marketing/bulk: minutes is acceptable
- Deduplication — the same event must not produce duplicate sends even if processed multiple times
Scale estimates:
Users: 500M
Daily sends: 1B notifications/day → ~11,600/second average
Peak (campaign): 100M sends in 1 hour → ~27,700/second
Channels: email (50%), push (35%), SMS (10%), in-app (5%)
High-Level Architecture
Event Sources
├── User actions (API events)
├── Scheduled jobs (marketing, reminders)
└── Internal systems (order service, auth service)
↓
Notification Service
(validates, resolves preferences, builds message)
↓
Message Queue (per-channel topics)
├── email-queue
├── sms-queue
├── push-queue
└── in-app-queue
↓
Channel Workers (independent services)
├── Email Worker → SES / SendGrid
├── SMS Worker → Twilio / SNS
├── Push Worker → FCM / APNs
└── In-App Worker → WebSocket / DB

The key design decision: decouple channel delivery from notification creation. The Notification Service puts messages on queues; Channel Workers consume and deliver. They scale independently.
Notification Service (Intake Layer)
Responsibilities:
- Accept notification requests from event sources
- Look up user preferences — does the user want this notification on this channel?
- Resolve the template → render the actual message
- Publish one message per channel per user to the appropriate queue
# Pseudocode
def handle_notification_request(event: NotificationEvent):
    user = user_service.get(event.user_id)
    prefs = preference_service.get(event.user_id, event.notification_type)
    for channel in prefs.enabled_channels:
        message = template_engine.render(event.template_id, event.variables)
        queue_client.publish(
            topic=f"{channel}-queue",
            payload={
                "user_id": event.user_id,
                "channel": channel,
                "destination": user.channel_address(channel),  # email/phone/device_token
                "message": message,
                "idempotency_key": f"{event.event_id}:{channel}",
            },
        )

The idempotency_key is critical — it's how downstream workers deduplicate.
Message Queue Design
Use a durable queue per channel. Kafka or AWS SQS both work. Strict ordering isn't required here — each notification is independent — so SQS Standard is fine; use Kafka partition keys or SQS FIFO only if per-user ordering matters.
Why separate queues per channel?
- SMS is slow (providers throttle) — don't let SMS volume block email delivery
- Push has different retry semantics (FCM/APNs have their own delivery guarantees)
- Allows independent scaling of consumers per channel
Queue configuration:
email-queue: high-throughput, retention 7 days
sms-queue: lower throughput, retention 24h (SMS expires quickly anyway)
push-queue: high-throughput, retention 72h (TTL for push tokens)
in-app-queue: short retention 1h (user is either online or they're not)
Channel Workers
Each Channel Worker:
- Consumes from its queue
- Checks the idempotency store (Redis SET NX) — if already sent, skip
- Calls the provider API (SendGrid, Twilio, FCM)
- Records send status in the notification log
- Handles retries with exponential backoff
def process_message(msg: QueueMessage):
    key = msg.idempotency_key
    if not redis.set(key, "sent", nx=True, ex=86400):
        # Already processed — skip
        return
    try:
        provider.send(msg.destination, msg.message)
        notification_log.record(msg.user_id, msg.channel, "sent")
    except ProviderRateLimitError:
        # Release the idempotency claim, or the retry would be skipped as a duplicate
        redis.delete(key)
        # Re-queue with delay
        queue.publish_delayed(msg, delay_seconds=60)
    except ProviderError as e:
        notification_log.record(msg.user_id, msg.channel, "failed", error=str(e))
        redis.delete(key)
        if msg.retry_count < 3:
            queue.publish_delayed(
                msg_with_retry_count(msg),
                delay_seconds=exponential_backoff(msg.retry_count),
            )
        else:
            dead_letter_queue.publish(msg)  # give up after 3 retries

At-least-once vs exactly-once:
- Queues guarantee at-least-once delivery (messages can be re-delivered on crash)
- Idempotency store in Redis prevents duplicate sends even if the same message is processed twice
- Redis key TTL (24h) is fine — retrying a notification from yesterday is not useful anyway
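A minimal runnable sketch of the dedup-plus-backoff pattern. `FakeRedis` is an in-memory stand-in for the idempotency store (real code would call redis-py's `r.set(key, value, nx=True, ex=ttl)`), and `exponential_backoff` is one reasonable shape for the retry delay:

```python
class FakeRedis:
    """In-memory stand-in for the Redis idempotency store."""

    def __init__(self):
        self.store = {}

    def set(self, key, value, nx=False, ex=None):
        if nx and key in self.store:
            return None  # mirrors redis-py: None when the NX condition fails
        self.store[key] = value
        return True


def exponential_backoff(retry_count, base_seconds=60, cap_seconds=3600):
    """Delay doubles per retry: 60s, 120s, 240s... capped at one hour."""
    return min(base_seconds * (2 ** retry_count), cap_seconds)


redis = FakeRedis()
key = "evt-123:email"
first = redis.set(key, "sent", nx=True, ex=86400)   # claims the send -> True
second = redis.set(key, "sent", nx=True, ex=86400)  # duplicate -> None, skip
```

The second worker to see the same message gets `None` back from SET NX and skips the provider call — that's the entire exactly-once-ish guarantee.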
Fanout for Push Notifications
Push notifications have a unique challenge: one marketing campaign notification needs to fan out to 100M device tokens.
Naive approach: Loop over 100M users and publish 100M messages from a single producer. Even at 10,000 publishes/second that's nearly 3 hours — the campaign window is over before the fanout finishes.
Better approach: Two-stage fanout
Campaign Service
↓
Fanout Queue (one message per segment)
↓
Fanout Workers (read user segment, expand into per-user messages)
↓
push-queue (per-user messages)
↓
Push Workers → FCM/APNs

Fanout Workers fetch user IDs in pages (1,000 at a time), then publish per-user push messages to push-queue. 100M users at 1,000/page = 100,000 DB queries — spread across workers, completable in minutes.
For truly large-scale push (>100M), use FCM's Topic Messaging or Notification Campaigns — the provider does the fanout.
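The Fanout Worker's expansion step can be sketched like this, with the DB page fetch and the queue publish injected as callables (`fetch_page` and `publish` are illustrative names, not part of the design above):

```python
def expand_segment(campaign_id, fetch_page, publish, page_size=1000):
    """Expand one segment message into per-user push messages.

    fetch_page(after_id, limit) returns user IDs > after_id in ascending
    order (keyset pagination, stable even while users churn);
    publish(payload) enqueues one per-user message on push-queue.
    Returns the total number of messages published.
    """
    last_id = 0
    total = 0
    while True:
        user_ids = fetch_page(last_id, page_size)
        if not user_ids:
            return total
        for uid in user_ids:
            publish({
                "user_id": uid,
                "idempotency_key": f"{campaign_id}:{uid}:push",
            })
        last_id = user_ids[-1]
        total += len(user_ids)
```

Keyset pagination (`WHERE id > last_id LIMIT 1000`) matters at this scale — `OFFSET` would rescan ever-longer prefixes of the table. The per-user idempotency key means a crashed worker can safely replay its segment.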
Preference Store
Users should control what they receive and on which channels.
CREATE TABLE notification_preferences (
user_id BIGINT NOT NULL,
notification_type VARCHAR(100) NOT NULL, -- "order_shipped", "marketing", etc.
channel VARCHAR(20) NOT NULL, -- "email", "sms", "push"
enabled BOOLEAN DEFAULT true,
PRIMARY KEY (user_id, notification_type, channel)
);

Cache preferences per user in Redis (TTL: 5 minutes). This table is read on every notification — it must be fast.
Global unsubscribe must be respected immediately. Store it as a hard block that overrides all preferences, synced to cache within seconds.
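A cache-aside sketch of the read path, with the global-unsubscribe hard block checked first. All names here are illustrative; the cache is a plain dict standing in for Redis:

```python
PREF_TTL_SECONDS = 300  # matches the 5-minute cache TTL above

def get_enabled_channels(user_id, notif_type, cache, load_prefs, is_unsubscribed):
    """cache: dict-like {key: [channels]}; load_prefs reads the
    notification_preferences table; is_unsubscribed checks the
    global-unsubscribe hard block."""
    if is_unsubscribed(user_id):
        return []  # hard block overrides all per-type preferences
    key = f"prefs:{user_id}:{notif_type}"
    channels = cache.get(key)
    if channels is None:
        channels = load_prefs(user_id, notif_type)  # DB read on cache miss
        cache[key] = channels  # real code: redis.set(key, ..., ex=PREF_TTL_SECONDS)
    return channels
```

Note the ordering: the unsubscribe check runs before the cache, so a fresh unsubscribe takes effect even while a stale preference entry is still cached.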
Rate Limiting
Providers have rate limits. So should you (to protect users from notification spam).
Provider limits (examples):
- Twilio SMS: 1 message/second per long code; 100/second per short code
- FCM: 600,000 messages/minute project-wide
- SES: 14 sends/second on free tier; scales with account reputation
User-level rate limits (protect the user):
- Max 3 SMS per user per hour (OTP codes are exempt)
- Max 5 push notifications per user per day for marketing
- No limit on transactional (order confirmations, security alerts)
Implement with Redis sliding window rate limiter per user per channel.
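A pure-Python sketch of the sliding-window check. In production the timestamp list would live in a Redis sorted set per user+channel (ZADD to record, ZREMRANGEBYSCORE to expire, ZCARD to count); a plain list plays that role here:

```python
def allow_send(timestamps, now, limit, window_seconds):
    """Return True and record the send if the user is under `limit`
    sends in the trailing window; otherwise False.

    `timestamps` is mutated in place -- it stands in for the
    per-user-per-channel sorted set in Redis.
    """
    cutoff = now - window_seconds
    timestamps[:] = [t for t in timestamps if t > cutoff]  # drop expired entries
    if len(timestamps) >= limit:
        return False
    timestamps.append(now)
    return True
```

With `limit=3, window_seconds=3600` this implements the "max 3 SMS per user per hour" rule: the fourth send inside the hour is refused, and capacity frees up as old timestamps slide out of the window.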
Delivery Tracking
Providers offer delivery webhooks. Wire them up:
FCM callback → Push Worker → notification_log (delivered/read)
Twilio webhook → SMS Worker → notification_log (delivered/failed)
SendGrid webhook → Email Worker → notification_log (opened/bounced)

Update the log asynchronously — don't block the send path on status updates.
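A sketch of the SMS branch. The payload fields follow Twilio's status callback (`MessageSid`, `MessageStatus`); `record` is an illustrative hook that appends to the notification log, ideally via a queue so status updates never touch the send path:

```python
# Map provider statuses onto our notification-log statuses.
# Intermediate states ("queued", "sending", "sent") are ignored here
# because the worker already logged "sent" at send time.
STATUS_MAP = {
    "delivered": "delivered",
    "failed": "failed",
    "undelivered": "failed",
}

def handle_sms_status_webhook(payload, record):
    """record(provider_message_id, status) writes to notification_log.
    Returns True if a terminal status was recorded."""
    status = STATUS_MAP.get(payload.get("MessageStatus"))
    if status is None:
        return False  # not a terminal state, nothing to record yet
    record(payload["MessageSid"], status)
    return True
```

The provider's message ID (`MessageSid`) is the join key back to your own notification log, so the worker must store it when the send API call returns.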
Failure Scenarios
| Failure | Handling |
|---------|---------|
| Provider outage | Retry queue with exponential backoff; switch to secondary provider if available |
| Queue consumer crashes | Message is re-queued (at-least-once); idempotency prevents duplicate send |
| Database outage | Notification Service queues the message; delivery delayed until DB recovers |
| Invalid device token | FCM returns NotRegistered; mark token as invalid in DB, stop sending to it |
| SMS delivery failure | Retry up to 3 times; fall back to email if available and user permits |
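The "switch to secondary provider" row can be sketched as an ordered-failover send. The provider callables and the error type are illustrative, not a real SDK:

```python
class ProviderError(Exception):
    """Stand-in for a provider SDK's send failure."""


def send_with_failover(message, providers):
    """Try each provider in order (primary first); return the first success.

    If every provider fails, re-raise the last error so the worker's
    normal retry/backoff and dead-letter handling take over.
    """
    last_error = None
    for send in providers:
        try:
            return send(message)
        except ProviderError as e:
            last_error = e  # fall through to the next provider
    raise last_error
```

In practice you'd also put a circuit breaker in front of a provider that is failing consistently, so an outage doesn't double every send's latency while the primary times out.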
What Interviewers Are Actually Testing
- You decouple intake from delivery — not a monolith that synchronously calls Twilio
- You separate queues per channel — shows understanding of independent scaling
- You handle idempotency — at-least-once delivery without duplicate sends
- You explain two-stage fanout for marketing blasts
- You mention preference management and global unsubscribe
- You discuss provider rate limits and your own user-level rate limits
Quick Reference
Intake: Notification Service → validates prefs → publishes to queue
Queues: One per channel (email, SMS, push, in-app); Kafka or SQS
Workers: One per channel; call provider; record to log
Dedup: Redis SET NX with idempotency_key per event+channel
Fanout: Two-stage for large campaigns; FCM Topics for massive scale
Prefs: DB + Redis cache; respected on every send; global unsubscribe = hard block
Rate limit: Per-user per-channel sliding window in Redis
Retries: 3 attempts with exponential backoff; dead-letter queue after