System Design · Intermediate

Case Study: Design a Notification System (Email/SMS/Push)

Design a scalable multi-channel notification system from scratch — message queues, fanout, deduplication, rate limiting, delivery guarantees, and the trade-offs that come up in system design interviews.

Learnixo · April 15, 2026 · 7 min read

Tags: System Design, Case Study, Message Queue, Fanout, Notifications, Interview Prep
Share:š•

Notification systems appear deceptively simple — "just send an email" — until you're handling 100M users, fanout to millions of devices, deduplication, delivery guarantees, and rate limits per channel. This case study walks through the full design.


Requirements

Functional:

  • Support multiple channels: email, SMS, push (mobile), in-app
  • Notifications can be triggered by events (user action) or scheduled (marketing campaign)
  • Users can configure notification preferences (opt out of SMS, disable emails)
  • Support templates with variable substitution
  • Delivery receipts (sent, delivered, read)

Non-functional:

  • High throughput: up to 1M notifications/second during peak (marketing blasts)
  • At-least-once delivery — it's worse to miss a notification than deliver it twice
  • Low latency for transactional notifications (OTP codes, order confirmations): <5s end-to-end
  • Best-effort for marketing/bulk: minutes is acceptable
  • Deduplication — the same event must not produce duplicate sends even if processed multiple times

Scale estimates:

Users:          500M
Daily sends:    1B notifications/day → ~11,600/second average
Peak (campaign): 100M sends in 1 hour → ~27,700/second
Channels:       email (50%), push (35%), SMS (10%), in-app (5%)
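These estimates can be sanity-checked with quick back-of-envelope arithmetic:

```python
# Back-of-envelope check of the scale estimates above.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

daily_sends = 1_000_000_000
avg_rate = daily_sends / SECONDS_PER_DAY  # ~11,574/second average

campaign_sends = 100_000_000
peak_rate = campaign_sends / 3600  # ~27,778/second during a 1-hour blast

print(f"average: {avg_rate:,.0f}/s, campaign peak: {peak_rate:,.0f}/s")
```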

High-Level Architecture

Event Sources
├── User actions (API events)
├── Scheduled jobs (marketing, reminders)
└── Internal systems (order service, auth service)
         ↓
    Notification Service
    (validates, resolves preferences, builds message)
         ↓
    Message Queue (per-channel topics)
    ├── email-queue
    ├── sms-queue
    ├── push-queue
    └── in-app-queue
         ↓
    Channel Workers (independent services)
    ├── Email Worker → SES / SendGrid
    ├── SMS Worker   → Twilio / SNS
    ├── Push Worker  → FCM / APNs
    └── In-App Worker → WebSocket / DB

The key design decision: decouple channel delivery from notification creation. The Notification Service puts messages on queues; Channel Workers consume and deliver. They scale independently.


Notification Service (Intake Layer)

Responsibilities:

  1. Accept notification requests from event sources
  2. Look up user preferences — does the user want this notification on this channel?
  3. Resolve the template → render the actual message
  4. Publish one message per channel per user to the appropriate queue
Python
# Pseudocode; error handling and batching omitted
def handle_notification_request(event: NotificationEvent):
    user = user_service.get(event.user_id)
    prefs = preference_service.get(event.user_id, event.notification_type)

    for channel in prefs.enabled_channels:
        # In practice templates are channel-specific (SMS is short, email is rich)
        message = template_engine.render(event.template_id, event.variables)
        queue_client.publish(
            topic=f"{channel}-queue",
            payload={
                "user_id": event.user_id,
                "channel": channel,
                "destination": user.channel_address(channel),  # email / phone / device token
                "message": message,
                "idempotency_key": f"{event.event_id}:{channel}",
            },
        )

The idempotency_key is critical — it's how downstream workers deduplicate.


Message Queue Design

Use a durable queue per channel. Kafka or AWS SQS both work (note that standard SQS does not preserve ordering; use SQS FIFO if per-user ordering matters, at the cost of lower throughput).

Why separate queues per channel?

  • SMS is slow (providers throttle) — don't let SMS volume block email delivery
  • Push has different retry semantics (FCM/APNs have their own delivery guarantees)
  • Allows independent scaling of consumers per channel

Queue configuration:

email-queue:    high-throughput, retention 7 days
sms-queue:      lower throughput, retention 24h (SMS expires quickly anyway)
push-queue:     high-throughput, retention 72h (TTL for push tokens)
in-app-queue:   short retention 1h (user is either online or they're not)

Channel Workers

Each Channel Worker:

  1. Consumes from its queue
  2. Checks the idempotency store (Redis SET NX) — if already sent, skip
  3. Calls the provider API (SendGrid, Twilio, FCM)
  4. Records send status in the notification log
  5. Handles retries with exponential backoff
Python
def process_message(msg: QueueMessage):
    key = msg.idempotency_key
    # Claim the key before sending; NX makes the claim atomic across workers
    if not redis.set(key, "sent", nx=True, ex=86400):
        # Already processed, skip
        return

    try:
        provider.send(msg.destination, msg.message)
        notification_log.record(msg.user_id, msg.channel, "sent")
    except ProviderRateLimitError:
        # Release the claim, otherwise the retry would be skipped by the dedup check
        redis.delete(key)
        queue.publish_delayed(msg, delay_seconds=60)
    except ProviderError as e:
        notification_log.record(msg.user_id, msg.channel, "failed", error=str(e))
        if msg.retry_count < 3:
            redis.delete(key)  # release the claim so the retry can send
            queue.publish_delayed(
                msg_with_retry_count(msg),
                delay_seconds=exponential_backoff(msg.retry_count),
            )

At-least-once vs exactly-once:

  • Queues guarantee at-least-once delivery (messages can be re-delivered on crash)
  • Idempotency store in Redis prevents duplicate sends even if the same message is processed twice
  • Redis key TTL (24h) is fine — retrying a notification from yesterday is not useful anyway
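The `exponential_backoff` helper referenced by the worker isn't shown above; here is a minimal sketch with full jitter (the base and cap values are assumptions, not from the original design):

```python
import random


def exponential_backoff(retry_count: int, base: float = 30.0, cap: float = 900.0) -> float:
    """Return a delay in seconds for the given retry attempt, with full jitter.

    Attempt 0 draws from [0, 30s], attempt 1 from [0, 60s], attempt 2 from
    [0, 120s], capped at 15 minutes. Jitter spreads retries out so a provider
    outage doesn't produce a synchronized thundering herd when it recovers.
    """
    delay = min(cap, base * (2 ** retry_count))
    return random.uniform(0, delay)
```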

Fanout for Push Notifications

Push notifications have a unique challenge: one marketing campaign notification needs to fan out to 100M device tokens.

Naive approach: Loop over 100M users and publish 100M messages from a single producer. At even 10,000 publishes/second, the last user isn't enqueued for nearly three hours, far too slow for a time-sensitive campaign.

Better approach: Two-stage fanout

Campaign Service
    ↓
Fanout Queue (one message per segment)
    ↓
Fanout Workers (read user segment, expand into per-user messages)
    ↓
push-queue (per-user messages)
    ↓
Push Workers → FCM/APNs

Fanout Workers fetch user IDs in pages (1,000 at a time), publish per-user push messages to push-queue. 100M users at 1,000/page = 100,000 DB queries — spread across workers, completable in minutes.

For truly large-scale push (>100M), use FCM's Topic Messaging or Notification Campaigns — the provider does the fanout.
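A fanout worker from the diagram above might look like the following sketch; `user_store.page_ids` and `queue.publish_batch` are hypothetical interfaces standing in for a real user store and queue client:

```python
PAGE_SIZE = 1_000


def fanout_segment(segment_id: str, campaign: dict, user_store, queue) -> int:
    """Expand one segment message into per-user push messages.

    Pages through user IDs rather than loading the whole segment at once,
    so worker memory stays flat no matter how large the segment is.
    """
    published = 0
    cursor = None
    while True:
        user_ids, cursor = user_store.page_ids(segment_id, cursor, PAGE_SIZE)
        if not user_ids:
            break
        queue.publish_batch(
            topic="push-queue",
            payloads=[
                {
                    "user_id": uid,
                    "channel": "push",
                    "message": campaign["message"],
                    "idempotency_key": f"{campaign['campaign_id']}:{uid}:push",
                }
                for uid in user_ids
            ],
        )
        published += len(user_ids)
        if cursor is None:
            break
    return published
```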


Preference Store

Users should control what they receive and on which channels.

SQL
CREATE TABLE notification_preferences (
    user_id             BIGINT NOT NULL,
    notification_type   VARCHAR(100) NOT NULL,  -- "order_shipped", "marketing", etc.
    channel             VARCHAR(20) NOT NULL,    -- "email", "sms", "push"
    enabled             BOOLEAN DEFAULT true,
    PRIMARY KEY (user_id, notification_type, channel)
);

Cache preferences per user in Redis (TTL: 5 minutes). This table is read on every notification — it must be fast.

Global unsubscribe must be respected immediately. Store it as a hard block that overrides all preferences, synced to cache within seconds.
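The read path can be sketched as a cache-aside lookup. The 5-minute TTL matches the policy above; the `cache` and `db` client interfaces are assumptions, and the global-unsubscribe check runs before per-type preferences:

```python
import json

PREF_TTL_SECONDS = 300  # 5 minutes, matching the cache policy above


def get_enabled_channels(user_id: int, notification_type: str, cache, db) -> list:
    """Cache-aside preference lookup with a hard global-unsubscribe override."""
    # Global unsubscribe is a hard block that wins over everything else
    if cache.get(f"unsub:{user_id}"):
        return []

    key = f"prefs:{user_id}:{notification_type}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)

    channels = db.fetch_enabled_channels(user_id, notification_type)
    cache.set(key, json.dumps(channels), ex=PREF_TTL_SECONDS)
    return channels
```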


Rate Limiting

Providers have rate limits. So should you (to protect users from notification spam).

Provider limits (examples):

  • Twilio SMS: 1 message/second per long code; 100/second per short code
  • FCM: 600,000 messages/minute project-wide
  • SES: 14 sends/second on free tier; scales with account reputation

User-level rate limits (protect the user):

  • Max 3 SMS per user per hour (OTP codes are exempt)
  • Max 5 push notifications per user per day for marketing
  • No limit on transactional (order confirmations, security alerts)

Implement with Redis sliding window rate limiter per user per channel.
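Here is a minimal in-memory version of that sliding window (a production system would keep the timestamps in a Redis sorted set so all workers share state; the limit values in the usage example match the SMS policy above):

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` sends per `window_seconds` per (user, channel)."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.events = defaultdict(deque)  # (user_id, channel) -> send timestamps

    def allow(self, user_id: int, channel: str, now: float = None) -> bool:
        now = time.monotonic() if now is None else now
        q = self.events[(user_id, channel)]
        # Drop timestamps that have aged out of the window
        while q and q[0] <= now - self.window:
            q.popleft()
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True


# Usage: max 3 SMS per user per hour
limiter = SlidingWindowLimiter(limit=3, window_seconds=3600)
```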


Delivery Tracking

Providers offer delivery webhooks. Wire them up:

FCM callback → Push Worker → notification_log (delivered/read)
Twilio webhook → SMS Worker → notification_log (delivered/failed)
SendGrid webhook → Email Worker → notification_log (opened/bounced)

Update the log asynchronously — don't block the send path on status updates.


Failure Scenarios

| Failure | Handling |
|---------|----------|
| Provider outage | Retry queue with exponential backoff; switch to secondary provider if available |
| Queue consumer crashes | Message is re-queued (at-least-once); idempotency prevents duplicate send |
| Database outage | Notification Service queues the message; delivery delayed until DB recovers |
| Invalid device token | FCM returns NotRegistered; mark token as invalid in DB, stop sending to it |
| SMS delivery failure | Retry up to 3 times; fall back to email if available and user permits |


What Interviewers Are Actually Testing

  1. You decouple intake from delivery — not a monolith that synchronously calls Twilio
  2. You separate queues per channel — shows understanding of independent scaling
  3. You handle idempotency — at-least-once delivery without duplicate sends
  4. You explain two-stage fanout for marketing blasts
  5. You mention preference management and global unsubscribe
  6. You discuss provider rate limits and your own user-level rate limits

Quick Reference

Intake:      Notification Service → validates prefs → publishes to queue
Queues:      One per channel (email, SMS, push, in-app); Kafka or SQS
Workers:     One per channel; call provider; record to log
Dedup:       Redis SET NX with idempotency_key per event+channel
Fanout:      Two-stage for large campaigns; FCM Topics for massive scale
Prefs:       DB + Redis cache; respected on every send; global unsubscribe = hard block
Rate limit:  Per-user per-channel sliding window in Redis
Retries:     3 attempts with exponential backoff; dead-letter queue after

Enjoyed this article?

Explore the System Design learning path for more.
