Case Study: Design a Notification System (Email/SMS/Push)
Design a scalable multi-channel notification system from scratch: message queues, fanout, deduplication, rate limiting, delivery guarantees, and the trade-offs that come up in system design interviews.
Notification systems appear deceptively simple ("just send an email") until you're handling 100M users, fanout to millions of devices, deduplication, delivery guarantees, and rate limits per channel. This case study walks through the full design.
Requirements
Functional:
- Support multiple channels: email, SMS, push (mobile), in-app
- Notifications can be triggered by events (user action) or scheduled (marketing campaign)
- Users can configure notification preferences (opt out of SMS, disable emails)
- Support templates with variable substitution
- Delivery receipts (sent, delivered, read)
Non-functional:
- High throughput: up to 1M notifications/second during peak (marketing blasts)
- At-least-once delivery: it's worse to miss a notification than deliver it twice
- Low latency for transactional notifications (OTP codes, order confirmations): <5s end-to-end
- Best-effort for marketing/bulk: minutes is acceptable
- Deduplication: the same event must not produce duplicate sends even if processed multiple times
Scale estimates:
Users: 500M
Daily sends: 1B notifications/day → ~11,600/second average
Peak (campaign): 100M sends in 1 hour → ~27,700/second
Channels: email (50%), push (35%), SMS (10%), in-app (5%)
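Those per-second figures fall out of plain arithmetic on the numbers above; a quick back-of-envelope check:

daily_sends = 1_000_000_000
print(f"{daily_sends / 86_400:,.0f}/second")      # ~11,574/second average

campaign_sends = 100_000_000
print(f"{campaign_sends / 3_600:,.0f}/second")    # ~27,778/second during a 1-hour blast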
High-Level Architecture

Event Sources
├── User actions (API events)
├── Scheduled jobs (marketing, reminders)
└── Internal systems (order service, auth service)
        ↓
Notification Service
(validates, resolves preferences, builds message)
        ↓
Message Queue (per-channel topics)
├── email-queue
├── sms-queue
├── push-queue
└── in-app-queue
        ↓
Channel Workers (independent services)
├── Email Worker  → SES / SendGrid
├── SMS Worker    → Twilio / SNS
├── Push Worker   → FCM / APNs
└── In-App Worker → WebSocket / DB

The key design decision: decouple channel delivery from notification creation. The Notification Service puts messages on queues; Channel Workers consume and deliver. They scale independently.
Notification Service (Intake Layer)
Responsibilities:
- Accept notification requests from event sources
- Look up user preferences: does the user want this notification on this channel?
- Resolve the template and render the actual message
- Publish one message per channel per user to the appropriate queue
# Pseudocode
def handle_notification_request(event: NotificationEvent):
    user = user_service.get(event.user_id)
    prefs = preference_service.get(event.user_id, event.notification_type)
    for channel in prefs.enabled_channels:
        message = template_engine.render(event.template_id, event.variables)
        queue_client.publish(
            topic=f"{channel}-queue",
            payload={
                "user_id": event.user_id,
                "channel": channel,
                "destination": user.channel_address(channel),  # email/phone/device_token
                "message": message,
                "idempotency_key": f"{event.event_id}:{channel}",
            },
        )

The idempotency_key is critical: it's how downstream workers deduplicate.
Message Queue Design
Use a durable queue per channel. Kafka or AWS SQS both work (note that standard SQS doesn't guarantee ordering; use SQS FIFO queues if you need it).
Why separate queues per channel?
- SMS is slow (providers throttle), so don't let SMS volume block email delivery
- Push has different retry semantics (FCM/APNs have their own delivery guarantees)
- Allows independent scaling of consumers per channel
Queue configuration:
email-queue: high-throughput, retention 7 days
sms-queue: lower throughput, retention 24h (SMS expires quickly anyway)
push-queue: high-throughput, retention 72h (TTL for push tokens)
in-app-queue: short retention, 1h (user is either online or they're not)

Channel Workers
Each Channel Worker:
- Consumes from its queue
- Checks the idempotency store (Redis SET NX); if already sent, skip
- Calls the provider API (SendGrid, Twilio, FCM)
- Records send status in the notification log
- Handles retries with exponential backoff
def process_message(msg: QueueMessage):
    key = msg.idempotency_key
    if not redis.set(key, "sent", nx=True, ex=86400):
        # Already processed; skip
        return
    try:
        provider.send(msg.destination, msg.message)
        notification_log.record(msg.user_id, msg.channel, "sent")
    except ProviderRateLimitError:
        # Release the idempotency key so the retry isn't skipped, then re-queue with delay
        redis.delete(key)
        queue.publish_delayed(msg, delay_seconds=60)
    except ProviderError as e:
        notification_log.record(msg.user_id, msg.channel, "failed", error=str(e))
        if msg.retry_count < 3:
            redis.delete(key)  # let the retry attempt pass the dedup check
            queue.publish_delayed(
                msg_with_retry_count(msg),
                delay_seconds=exponential_backoff(msg.retry_count),
            )
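The exponential_backoff helper referenced above isn't defined in the snippet. A minimal sketch, assuming a 30-second base and a one-hour cap (both numbers are illustrative), with jitter so retries from many workers don't synchronize:

import random

def exponential_backoff(retry_count: int, base: int = 30, cap: int = 3600) -> int:
    """30s, 60s, 120s, ... doubling per attempt, capped at one hour."""
    delay = min(cap, base * (2 ** retry_count))
    # Jitter: pick a random delay between half and all of the computed value
    return random.randint(delay // 2, delay)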
At-least-once vs exactly-once:
- Queues guarantee at-least-once delivery (messages can be re-delivered on crash)
- Idempotency store in Redis prevents duplicate sends even if the same message is processed twice
- Redis key TTL (24h) is fine: retrying a notification from yesterday is not useful anyway
Fanout for Push Notifications
Push notifications have a unique challenge: one marketing campaign notification needs to fan out to 100M device tokens.
Naive approach: Loop over 100M users and publish 100M messages from a single process. Even at 10,000 publishes/second, that's nearly three hours; far too slow for a time-sensitive campaign.
Better approach: Two-stage fanout
Campaign Service
        ↓
Fanout Queue (one message per segment)
        ↓
Fanout Workers (read user segment, expand into per-user messages)
        ↓
push-queue (per-user messages)
        ↓
Push Workers → FCM/APNs

Fanout Workers fetch user IDs in pages (1,000 at a time) and publish per-user push messages to push-queue. 100M users at 1,000/page = 100,000 DB queries; spread across workers, the fanout completes in minutes.
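A minimal sketch of a Fanout Worker under these assumptions: a keyset-paginated segment query (user_store.get_segment_page is a hypothetical helper) and the queue_client from the intake snippet:

PAGE_SIZE = 1_000

def fan_out_segment(campaign, segment_id):
    """Expand one segment-level message into per-user push messages."""
    last_user_id = 0
    while True:
        # Keyset pagination; much cheaper than OFFSET at 100M rows
        user_ids = user_store.get_segment_page(
            segment_id, after=last_user_id, limit=PAGE_SIZE
        )
        if not user_ids:
            break
        for user_id in user_ids:
            queue_client.publish(
                topic="push-queue",
                payload={
                    "user_id": user_id,
                    "channel": "push",
                    "message": campaign.rendered_message,
                    "idempotency_key": f"{campaign.id}:{user_id}:push",
                },
            )
        last_user_id = user_ids[-1]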
For truly large-scale push (>100M), use FCM's Topic Messaging or Notification Campaigns, where the provider does the fanout.
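With topic messaging, devices subscribe to a topic client-side and a single API call reaches all of them. A sketch using the firebase_admin Python SDK (the topic name and notification text are illustrative):

import firebase_admin
from firebase_admin import messaging

firebase_admin.initialize_app()  # uses default credentials

# One send call; FCM fans out to every device subscribed to the topic
message = messaging.Message(
    notification=messaging.Notification(
        title="Spring Sale",
        body="Everything 20% off, today only",
    ),
    topic="marketing-all-users",
)
message_id = messaging.send(message)  # returns a message ID string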
Preference Store
Users should control what they receive and on which channels.
CREATE TABLE notification_preferences (
user_id BIGINT NOT NULL,
notification_type VARCHAR(100) NOT NULL, -- "order_shipped", "marketing", etc.
channel VARCHAR(20) NOT NULL, -- "email", "sms", "push"
enabled BOOLEAN DEFAULT true,
PRIMARY KEY (user_id, notification_type, channel)
);

Cache preferences per user in Redis (TTL: 5 minutes). This table is read on every notification, so it must be fast.
Global unsubscribe must be respected immediately. Store it as a hard block that overrides all preferences, synced to cache within seconds.
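Putting the table, the cache, and the hard block together; a sketch assuming redis-py and a generic db client (the Redis set named global_unsubscribes is our assumption for the hard block):

import json

PREF_TTL = 300  # 5-minute cache

def get_enabled_channels(user_id: int, notification_type: str) -> list[str]:
    # Hard block first: a globally unsubscribed user receives nothing
    if redis.sismember("global_unsubscribes", user_id):
        return []
    cache_key = f"prefs:{user_id}:{notification_type}"
    cached = redis.get(cache_key)
    if cached is not None:
        return json.loads(cached)
    rows = db.query(
        "SELECT channel FROM notification_preferences "
        "WHERE user_id = %s AND notification_type = %s AND enabled = true",
        (user_id, notification_type),
    )
    channels = [row["channel"] for row in rows]
    redis.set(cache_key, json.dumps(channels), ex=PREF_TTL)
    return channels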
Rate Limiting
Providers have rate limits. So should you (to protect users from notification spam).
Provider limits (examples):
- Twilio SMS: 1 message/second per long code; 100/second per short code
- FCM: 600,000 messages/minute project-wide
- SES: 14 sends/second on free tier; scales with account reputation
User-level rate limits (protect the user):
- Max 3 SMS per user per hour (OTP codes are exempt)
- Max 5 push notifications per user per day for marketing
- No limit on transactional (order confirmations, security alerts)
Implement with a Redis sliding-window rate limiter per user per channel.
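One way to build it: a Redis sorted set per user per channel, scored by timestamp. A sketch using redis-py (the key naming is our convention):

import time
import uuid

def allow_send(user_id: int, channel: str, limit: int, window: int) -> bool:
    """Sliding window: allow at most `limit` sends in the last `window` seconds."""
    key = f"ratelimit:{user_id}:{channel}"
    now = time.time()
    pipe = redis.pipeline()
    pipe.zremrangebyscore(key, 0, now - window)  # evict entries outside the window
    pipe.zcard(key)                              # count what remains
    _, current = pipe.execute()
    if current >= limit:
        return False
    redis.zadd(key, {str(uuid.uuid4()): now})    # unique member, timestamp score
    redis.expire(key, window)
    return True

# Example: max 3 marketing SMS per user per hour
# if allow_send(user_id, "sms", limit=3, window=3600): ...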
Delivery Tracking
Providers offer delivery webhooks. Wire them up:
FCM callback     → Push Worker  → notification_log (delivered/read)
Twilio webhook   → SMS Worker   → notification_log (delivered/failed)
SendGrid webhook → Email Worker → notification_log (opened/bounced)

Update the log asynchronously; don't block the send path on status updates.
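A sketch of the Twilio side (Twilio's status callbacks POST form fields named MessageSid and MessageStatus; routing the update through a status queue rather than writing to the DB inline is our design choice):

def handle_twilio_status_webhook(form: dict) -> None:
    sid = form["MessageSid"]
    status = form["MessageStatus"]  # e.g. sent, delivered, undelivered, failed
    # Enqueue the update and return 200 immediately; a consumer writes to notification_log
    queue_client.publish(
        topic="delivery-status-queue",
        payload={"provider": "twilio", "provider_message_id": sid, "status": status},
    )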
Failure Scenarios
| Failure | Handling |
|---------|---------|
| Provider outage | Retry queue with exponential backoff; switch to secondary provider if available |
| Queue consumer crashes | Message is re-queued (at-least-once); idempotency prevents duplicate send |
| Database outage | Notification Service queues the message; delivery delayed until DB recovers |
| Invalid device token | FCM returns NotRegistered; mark token as invalid in DB, stop sending to it |
| SMS delivery failure | Retry up to 3 times; fall back to email if available and user permits |
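The "switch to secondary provider" row can be a thin wrapper over an ordered list of clients; a sketch (is_healthy is a hypothetical circuit-breaker check, not a real provider SDK method):

class FailoverSender:
    """Try providers in order, skipping any currently marked unhealthy."""

    def __init__(self, providers):
        self.providers = providers  # e.g. [sendgrid_client, ses_client]

    def send(self, destination, message):
        last_error = None
        for provider in self.providers:
            if not provider.is_healthy():
                continue
            try:
                return provider.send(destination, message)
            except ProviderError as e:
                last_error = e  # fall through to the next provider
        raise last_error or ProviderError("no healthy provider available")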
What Interviewers Are Actually Testing
- You decouple intake from delivery, not a monolith that synchronously calls Twilio
- You separate queues per channel, which shows understanding of independent scaling
- You handle idempotency: at-least-once delivery without duplicate sends
- You explain two-stage fanout for marketing blasts
- You mention preference management and global unsubscribe
- You discuss provider rate limits and your own user-level rate limits
Quick Reference
Intake: Notification Service → validates prefs → publishes to queue
Queues: One per channel (email, SMS, push, in-app); Kafka or SQS
Workers: One per channel; call provider; record to log
Dedup: Redis SET NX with idempotency_key per event+channel
Fanout: Two-stage for large campaigns; FCM Topics for massive scale
Prefs: DB + Redis cache; respected on every send; global unsubscribe = hard block
Rate limit: Per-user per-channel sliding window in Redis
Retries: 3 attempts with exponential backoff; dead-letter queue after that