System Design · Lesson 22 of 26
Case Study: Design a Notification System (Email/SMS/Push)
Notification systems appear deceptively simple — "just send an email" — until you're handling 100M users, fanout to millions of devices, deduplication, delivery guarantees, and rate limits per channel. This case study walks through the full design.
Requirements
Functional:
- Support multiple channels: email, SMS, push (mobile), in-app
- Notifications can be triggered by events (user action) or scheduled (marketing campaign)
- Users can configure notification preferences (opt out of SMS, disable emails)
- Support templates with variable substitution
- Delivery receipts (sent, delivered, read)
Non-functional:
- High throughput: up to 1M notifications/second during peak (marketing blasts)
- At-least-once delivery — it's worse to miss a notification than deliver it twice
- Low latency for transactional notifications (OTP codes, order confirmations): <5s end-to-end
- Best-effort for marketing/bulk: minutes is acceptable
- Deduplication — the same event must not produce duplicate sends even if processed multiple times
Scale estimates:
Users: 500M
Daily sends: 1B notifications/day → ~11,600/second average
Peak (campaign): 100M sends in 1 hour → ~27,700/second
Channels: email (50%), push (35%), SMS (10%), in-app (5%)
High-Level Architecture
Event Sources
├── User actions (API events)
├── Scheduled jobs (marketing, reminders)
└── Internal systems (order service, auth service)
↓
Notification Service
(validates, resolves preferences, builds message)
↓
Message Queue (per-channel topics)
├── email-queue
├── sms-queue
├── push-queue
└── in-app-queue
↓
Channel Workers (independent services)
├── Email Worker → SES / SendGrid
├── SMS Worker → Twilio / SNS
├── Push Worker → FCM / APNs
└── In-App Worker → WebSocket / DB

The key design decision: decouple channel delivery from notification creation. The Notification Service puts messages on queues; Channel Workers consume and deliver. They scale independently.
Notification Service (Intake Layer)
Responsibilities:
- Accept notification requests from event sources
- Look up user preferences — does the user want this notification on this channel?
- Resolve the template → render the actual message
- Publish one message per channel per user to the appropriate queue
# Pseudocode
def handle_notification_request(event: NotificationEvent):
    user = user_service.get(event.user_id)
    prefs = preference_service.get(event.user_id, event.notification_type)
    for channel in prefs.enabled_channels:
        message = template_engine.render(event.template_id, event.variables)
        queue_client.publish(
            topic=f"{channel}-queue",
            payload={
                "user_id": event.user_id,
                "channel": channel,
                "destination": user.channel_address(channel),  # email/phone/device_token
                "message": message,
                "idempotency_key": f"{event.event_id}:{channel}",
            },
        )

The idempotency_key is critical — it's how downstream workers deduplicate.
Message Queue Design
Use a durable queue per channel. Kafka or AWS SQS both work. Strict ordering isn't required here — each notification is independent — so SQS Standard is fine; use Kafka partition keys or SQS FIFO only if per-user ordering matters.
Why separate queues per channel?
- SMS is slow (providers throttle) — don't let SMS volume block email delivery
- Push has different retry semantics (FCM/APNs have their own delivery guarantees)
- Allows independent scaling of consumers per channel
Queue configuration:
email-queue: high-throughput, retention 7 days
sms-queue: lower throughput, retention 24h (SMS expires quickly anyway)
push-queue: high-throughput, retention 72h (TTL for push tokens)
in-app-queue: short retention 1h (user is either online or they're not)
Channel Workers
Each Channel Worker:
- Consumes from its queue
- Checks the idempotency store (Redis SET NX) — if already sent, skip
- Calls the provider API (SendGrid, Twilio, FCM)
- Records send status in the notification log
- Handles retries with exponential backoff
def process_message(msg: QueueMessage):
    key = msg.idempotency_key
    if not redis.set(key, "sent", nx=True, ex=86400):
        # Already processed — skip
        return
    try:
        provider.send(msg.destination, msg.message)
        notification_log.record(msg.user_id, msg.channel, "sent")
    except ProviderRateLimitError:
        # Release the idempotency claim, or the retry would be skipped as a duplicate
        redis.delete(key)
        # Re-queue with delay
        queue.publish_delayed(msg, delay_seconds=60)
    except ProviderError as e:
        notification_log.record(msg.user_id, msg.channel, "failed", error=str(e))
        redis.delete(key)
        if msg.retry_count < 3:
            queue.publish_delayed(
                msg_with_retry_count(msg),
                delay_seconds=exponential_backoff(msg.retry_count),
            )
        else:
            dead_letter_queue.publish(msg)  # give up after 3 retries

At-least-once vs exactly-once:
- Queues guarantee at-least-once delivery (messages can be re-delivered on crash)
- Idempotency store in Redis prevents duplicate sends even if the same message is processed twice
- Redis key TTL (24h) is fine — retrying a notification from yesterday is not useful anyway
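A minimal runnable sketch of the dedup-plus-backoff pattern. `FakeRedis` is an in-memory stand-in for the idempotency store (real code would call redis-py's `r.set(key, value, nx=True, ex=ttl)`), and `exponential_backoff` is one reasonable shape for the retry delay:

```python
class FakeRedis:
    """In-memory stand-in for the Redis idempotency store."""

    def __init__(self):
        self.store = {}

    def set(self, key, value, nx=False, ex=None):
        if nx and key in self.store:
            return None  # mirrors redis-py: None when the NX condition fails
        self.store[key] = value
        return True


def exponential_backoff(retry_count, base_seconds=60, cap_seconds=3600):
    """Delay doubles per retry: 60s, 120s, 240s... capped at one hour."""
    return min(base_seconds * (2 ** retry_count), cap_seconds)


redis = FakeRedis()
key = "evt-123:email"
first = redis.set(key, "sent", nx=True, ex=86400)   # claims the send -> True
second = redis.set(key, "sent", nx=True, ex=86400)  # duplicate -> None, skip
```

The second worker to see the same message gets `None` back from SET NX and skips the provider call — that's the entire exactly-once-ish guarantee.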
Fanout for Push Notifications
Push notifications have a unique challenge: one marketing campaign notification needs to fan out to 100M device tokens.
Naive approach: Loop over 100M users and publish 100M messages from a single producer. Even at 10,000 publishes/second that's nearly 3 hours — the campaign window is over before the fanout finishes.
Better approach: Two-stage fanout
Campaign Service
↓
Fanout Queue (one message per segment)
↓
Fanout Workers (read user segment, expand into per-user messages)
↓
push-queue (per-user messages)
↓
Push Workers → FCM/APNs

Fanout Workers fetch user IDs in pages (1,000 at a time), then publish per-user push messages to push-queue. 100M users at 1,000/page = 100,000 DB queries — spread across workers, completable in minutes.
For truly large-scale push (>100M), use FCM's Topic Messaging or Notification Campaigns — the provider does the fanout.
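The Fanout Worker's expansion step can be sketched like this, with the DB page fetch and the queue publish injected as callables (`fetch_page` and `publish` are illustrative names, not part of the design above):

```python
def expand_segment(campaign_id, fetch_page, publish, page_size=1000):
    """Expand one segment message into per-user push messages.

    fetch_page(after_id, limit) returns user IDs > after_id in ascending
    order (keyset pagination, stable even while users churn);
    publish(payload) enqueues one per-user message on push-queue.
    Returns the total number of messages published.
    """
    last_id = 0
    total = 0
    while True:
        user_ids = fetch_page(last_id, page_size)
        if not user_ids:
            return total
        for uid in user_ids:
            publish({
                "user_id": uid,
                "idempotency_key": f"{campaign_id}:{uid}:push",
            })
        last_id = user_ids[-1]
        total += len(user_ids)
```

Keyset pagination (`WHERE id > last_id LIMIT 1000`) matters at this scale — `OFFSET` would rescan ever-longer prefixes of the table. The per-user idempotency key means a crashed worker can safely replay its segment.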
Preference Store
Users should control what they receive and on which channels.
CREATE TABLE notification_preferences (
user_id BIGINT NOT NULL,
notification_type VARCHAR(100) NOT NULL, -- "order_shipped", "marketing", etc.
channel VARCHAR(20) NOT NULL, -- "email", "sms", "push"
enabled BOOLEAN DEFAULT true,
PRIMARY KEY (user_id, notification_type, channel)
);

Cache preferences per user in Redis (TTL: 5 minutes). This table is read on every notification — it must be fast.
Global unsubscribe must be respected immediately. Store it as a hard block that overrides all preferences, synced to cache within seconds.
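A cache-aside sketch of the read path, with the global-unsubscribe hard block checked first. All names here are illustrative; the cache is a plain dict standing in for Redis:

```python
PREF_TTL_SECONDS = 300  # matches the 5-minute cache TTL above

def get_enabled_channels(user_id, notif_type, cache, load_prefs, is_unsubscribed):
    """cache: dict-like {key: [channels]}; load_prefs reads the
    notification_preferences table; is_unsubscribed checks the
    global-unsubscribe hard block."""
    if is_unsubscribed(user_id):
        return []  # hard block overrides all per-type preferences
    key = f"prefs:{user_id}:{notif_type}"
    channels = cache.get(key)
    if channels is None:
        channels = load_prefs(user_id, notif_type)  # DB read on cache miss
        cache[key] = channels  # real code: redis.set(key, ..., ex=PREF_TTL_SECONDS)
    return channels
```

Note the ordering: the unsubscribe check runs before the cache, so a fresh unsubscribe takes effect even while a stale preference entry is still cached.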
Rate Limiting
Providers have rate limits. So should you (to protect users from notification spam).
Provider limits (examples):
- Twilio SMS: 1 message/second per long code; 100/second per short code
- FCM: 600,000 messages/minute project-wide
- SES: 14 sends/second on free tier; scales with account reputation
User-level rate limits (protect the user):
- Max 3 SMS per user per hour (OTP codes are exempt)
- Max 5 push notifications per user per day for marketing
- No limit on transactional (order confirmations, security alerts)
Implement with Redis sliding window rate limiter per user per channel.
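A pure-Python sketch of the sliding-window check. In production the timestamp list would live in a Redis sorted set per user+channel (ZADD to record, ZREMRANGEBYSCORE to expire, ZCARD to count); a plain list plays that role here:

```python
def allow_send(timestamps, now, limit, window_seconds):
    """Return True and record the send if the user is under `limit`
    sends in the trailing window; otherwise False.

    `timestamps` is mutated in place -- it stands in for the
    per-user-per-channel sorted set in Redis.
    """
    cutoff = now - window_seconds
    timestamps[:] = [t for t in timestamps if t > cutoff]  # drop expired entries
    if len(timestamps) >= limit:
        return False
    timestamps.append(now)
    return True
```

With `limit=3, window_seconds=3600` this implements the "max 3 SMS per user per hour" rule: the fourth send inside the hour is refused, and capacity frees up as old timestamps slide out of the window.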
Delivery Tracking
Providers offer delivery webhooks. Wire them up:
FCM callback → Push Worker → notification_log (delivered/read)
Twilio webhook → SMS Worker → notification_log (delivered/failed)
SendGrid webhook → Email Worker → notification_log (opened/bounced)

Update the log asynchronously — don't block the send path on status updates.
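A sketch of the SMS branch. The payload fields follow Twilio's status callback (`MessageSid`, `MessageStatus`); `record` is an illustrative hook that appends to the notification log, ideally via a queue so status updates never touch the send path:

```python
# Map provider statuses onto our notification-log statuses.
# Intermediate states ("queued", "sending", "sent") are ignored here
# because the worker already logged "sent" at send time.
STATUS_MAP = {
    "delivered": "delivered",
    "failed": "failed",
    "undelivered": "failed",
}

def handle_sms_status_webhook(payload, record):
    """record(provider_message_id, status) writes to notification_log.
    Returns True if a terminal status was recorded."""
    status = STATUS_MAP.get(payload.get("MessageStatus"))
    if status is None:
        return False  # not a terminal state, nothing to record yet
    record(payload["MessageSid"], status)
    return True
```

The provider's message ID (`MessageSid`) is the join key back to your own notification log, so the worker must store it when the send API call returns.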
Failure Scenarios
| Failure | Handling |
|---------|---------|
| Provider outage | Retry queue with exponential backoff; switch to secondary provider if available |
| Queue consumer crashes | Message is re-queued (at-least-once); idempotency prevents duplicate send |
| Database outage | Notification Service queues the message; delivery delayed until DB recovers |
| Invalid device token | FCM returns NotRegistered; mark token as invalid in DB, stop sending to it |
| SMS delivery failure | Retry up to 3 times; fall back to email if available and user permits |
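The "switch to secondary provider" row can be sketched as an ordered-failover send. The provider callables and the error type are illustrative, not a real SDK:

```python
class ProviderError(Exception):
    """Stand-in for a provider SDK's send failure."""


def send_with_failover(message, providers):
    """Try each provider in order (primary first); return the first success.

    If every provider fails, re-raise the last error so the worker's
    normal retry/backoff and dead-letter handling take over.
    """
    last_error = None
    for send in providers:
        try:
            return send(message)
        except ProviderError as e:
            last_error = e  # fall through to the next provider
    raise last_error
```

In practice you'd also put a circuit breaker in front of a provider that is failing consistently, so an outage doesn't double every send's latency while the primary times out.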
What Interviewers Are Actually Testing
- You decouple intake from delivery — not a monolith that synchronously calls Twilio
- You separate queues per channel — shows understanding of independent scaling
- You handle idempotency — at-least-once delivery without duplicate sends
- You explain two-stage fanout for marketing blasts
- You mention preference management and global unsubscribe
- You discuss provider rate limits and your own user-level rate limits
Quick Reference
Intake: Notification Service → validates prefs → publishes to queue
Queues: One per channel (email, SMS, push, in-app); Kafka or SQS
Workers: One per channel; call provider; record to log
Dedup: Redis SET NX with idempotency_key per event+channel
Fanout: Two-stage for large campaigns; FCM Topics for massive scale
Prefs: DB + Redis cache; respected on every send; global unsubscribe = hard block
Rate limit: Per-user per-channel sliding window in Redis
Retries: 3 attempts with exponential backoff; dead-letter queue after