System Design Interview
Design a Push Notification System
Sending 10 billion notifications a day across iOS, Android, email, and SMS, reliably and without duplication
The Interview Question
"Design a push notification system that sends notifications to users across iOS, Android, email, and SMS. The system must handle 10 billion notifications per day reliably, without duplicate delivery."
This question tests whether you understand the difference between notification generation (your system) and notification delivery (someone else's system), and how to build a reliable pipeline between them.
Step 1: Requirements
Functional
- Send notifications triggered by events: new message, order shipped, price drop, security alert
- Support channels: push (iOS APNs, Android FCM), email (SendGrid), SMS (Twilio)
- Users can configure preferences: opt out of channels, set quiet hours
- Retry failed deliveries
- Deduplicate: the same notification should not be delivered twice
Non-functional
- 10 billion notifications per day (~115,000/second on average; peak traffic will be higher)
- Delivery within 5 seconds for high-priority notifications
- At-least-once delivery guarantee (retries on failure)
- Effectively-once from the user's point of view: deduplication suppresses the duplicates that retries would otherwise create
- No silent failures — every failed delivery should be recorded
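The throughput requirement is plain arithmetic: 10 billion a day divided by 86,400 seconds. A minimal sketch of that calculation follows; the peak-to-average factor of 3 is an assumption for illustration, not a figure from the requirements.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rate(per_day: int) -> float:
    """Average notifications per second for a given daily volume."""
    return per_day / SECONDS_PER_DAY


avg = average_rate(10_000_000_000)

# Assumed peak-to-average ratio, for illustration only.
PEAK_FACTOR = 3
peak = avg * PEAK_FACTOR

print(f"average: {avg:,.0f}/s, assumed peak: {peak:,.0f}/s")
# prints "average: 115,741/s, assumed peak: 347,222/s"
```

Capacity planning should use the peak figure, since notification traffic tends to cluster around events and times of day.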
Step 2: Architecture Overview
Event Sources                    Notification Pipeline
─────────────────────────────────────────────────────────────────
Order Service ──┐
Chat Service  ──┤                ┌──────────────────────┐
Payment Svc   ──┼──► Kafka ──►   │ Notification Worker  │
Marketing     ──┘                │ (fan-out, routing)   │
                                 └──────────┬───────────┘
                                            │
                             ┌──────────────┼──────────────┐
                             ▼              ▼              ▼
                        Push Queue     Email Queue     SMS Queue
                       (Kafka topic)  (Kafka topic)  (Kafka topic)
                             │              │              │
                     ┌───────▼──┐    ┌──────▼──┐     ┌─────▼─────┐
                     │ Push     │    │ Email   │     │ SMS       │
                     │ Sender   │    │ Sender  │     │ Sender    │
                     └───┬──────┘    └────┬────┘     └─────┬─────┘
                         │                │                │
                     APNs/FCM         SendGrid          Twilio
Step 3: The Two Separate Problems
Problem 1: Notification Generation
Something happens → who needs to be notified? This is business logic. The Order Service knows an order shipped; it needs to tell the notification system who to notify and what to say.
Problem 2: Notification Delivery
Given "send this message to this device/email/phone", get it there reliably. This is infrastructure.
Keep these separated. Your notification system receives notification events, not raw business events. The Order Service doesn't say "order 12345 status changed to shipped" — it says "notify user_id 67890: Your order has shipped."
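One way to picture that contract is a small event type that the source service fills in and the notification system consumes. This is a sketch under assumptions: the `NotificationEvent` fields and the `order_shipped_event` helper are hypothetical names chosen for illustration, not part of any real service.

```python
from dataclasses import dataclass, field
from uuid import uuid4


@dataclass(frozen=True)
class NotificationEvent:
    """What the notification system receives: who to notify and what to say."""
    user_id: str
    type: str
    title: str
    body: str
    priority: str = "low"
    # Generated by the source service; used later as the idempotency key.
    notification_id: str = field(default_factory=lambda: str(uuid4()))


def order_shipped_event(user_id: str, order_id: str) -> NotificationEvent:
    """Hypothetical Order Service code: business logic decides recipient and text."""
    return NotificationEvent(
        user_id=user_id,
        type="order_shipped",
        title="Your order has shipped",
        body=f"Order {order_id} is on its way.",
        priority="high",
        # Deterministic ID, so republishing the same business event
        # produces the same notification_id.
        notification_id=f"order_shipped:{order_id}",
    )


event = order_shipped_event("67890", "12345")
```

The notification system never sees the order's status field; it only sees a finished instruction, which keeps business rules out of the delivery pipeline.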
Step 4: Kafka as the Backbone
Why Kafka and not a simpler queue?
At 115,000 notifications/second (rough, order-of-magnitude figures that vary with message size and hardware):
  One RabbitMQ queue → ~50,000 messages/second maximum
  Kafka topic spread across partitions → ~1,000,000 messages/second (about 20x more headroom)
Kafka also gives you:
- Replay capability: reprocess failed notifications from up to 7 days ago (Kafka's default retention period)
- Fan-out: multiple consumers can read the same topic independently
- Exactly-once processing between Kafka topics (with Kafka transactions); calls to external providers such as APNs still need the deduplication in Step 6
- Durable: messages survive consumer crashes
Each notification type gets its own Kafka topic:
  notification.push.high_priority (order confirmations, security alerts)
  notification.push.low_priority (marketing, suggestions)
  notification.email
  notification.sms
High-priority topics have more consumer workers → lower latency.
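The topic layout above can be expressed as a small routing function. This is a minimal sketch; `topic_for` is a hypothetical helper, and the only names taken from the text are the four topic names.

```python
def topic_for(channel: str, priority: str = "low") -> str:
    """Map a channel and priority to one of the topic names listed above."""
    if channel == "push":
        if priority not in ("high", "low"):
            raise ValueError(f"unknown priority: {priority}")
        return f"notification.push.{priority}_priority"
    if channel in ("email", "sms"):
        # Email and SMS are not split by priority in this design.
        return f"notification.{channel}"
    raise ValueError(f"unknown channel: {channel}")


print(topic_for("push", "high"))  # prints "notification.push.high_priority"
print(topic_for("sms"))           # prints "notification.sms"
```

When producing to these topics, using user_id as the message key keeps one user's notifications in a single partition, which preserves their relative order.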
Step 5: User Preference Routing
Before sending anything, check user preferences:
Notification event arrives:
  user_id: 12345
  type: "order_shipped"
  channel: "push"
Preference lookup (Redis cache):
  Key: prefs:12345
  Value: {
    push_enabled: true,
    email_enabled: false,
    quiet_hours: {start: "22:00", end: "08:00", tz: "Europe/Oslo"},
    email: "user@example.com",
    push_tokens: [{token: "abc...", platform: "ios"}]
  }
Decision logic:
  Is push_enabled? Yes
  Is current time in quiet_hours? No
  → Route to Push Sender with token "abc..."
  If yes to quiet hours:
  → Delay notification: enqueue to "scheduled" queue with deliver_at = end_of_quiet_hours
Preferences are cached in Redis (TTL 5 minutes), since this is a read-heavy, write-rare dataset.
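The quiet-hours check has one subtlety: a window such as 22:00 to 08:00 wraps past midnight, so a simple "start <= now < end" comparison is wrong. The sketch below shows one way to handle it; `in_quiet_hours` and `deliver_at` are hypothetical helper names. It assumes the caller has already converted the current time to the user's time zone, for example with `datetime.now(ZoneInfo("Europe/Oslo"))` from the standard-library `zoneinfo` module; the example uses a fixed UTC+1 offset as a stand-in.

```python
from datetime import datetime, time, timedelta, timezone


def in_quiet_hours(local: time, start: time, end: time) -> bool:
    """True if `local` is in [start, end), including windows that wrap past midnight."""
    if start <= end:
        return start <= local < end
    # Wrapping window, e.g. 22:00 -> 08:00.
    return local >= start or local < end


def deliver_at(now_local: datetime, start: time, end: time) -> datetime:
    """Return `now_local` if sending is allowed, else the next end of quiet hours."""
    if not in_quiet_hours(now_local.time(), start, end):
        return now_local
    candidate = now_local.replace(
        hour=end.hour, minute=end.minute, second=0, microsecond=0
    )
    if candidate <= now_local:
        # Quiet hours end tomorrow morning, not earlier today.
        candidate += timedelta(days=1)
    return candidate


# Fixed UTC+1 offset as a stand-in for Europe/Oslo in winter.
oslo_winter = timezone(timedelta(hours=1))
now = datetime(2024, 1, 15, 23, 30, tzinfo=oslo_winter)
print(deliver_at(now, time(22, 0), time(8, 0)))
# prints "2024-01-16 08:00:00+01:00"
```

The returned timestamp is what would be stored as deliver_at on the "scheduled" queue.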
Step 6: Deduplication — Preventing Duplicate Delivery
Retries are necessary (networks fail, APNs times out). But retries risk sending the same notification twice.
Solution: idempotency keys + Redis deduplication window
Every notification event has a unique notification_id generated by the source service.
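Before sending:
  Redis: SET dedup:{notification_id} "sent" NX EX 86400
  (NX = set only if the key does not already exist; EX 86400 = expire after 24 hours)
  If SET returned OK → first attempt → proceed to send
  If SET returned nil → already claimed → skip (idempotent drop)
If the sender crashes after sending but before recording success, the retry's SET NX finds the key already set, so the message is not sent again. The trade-off is that claiming the key before the send is at-most-once for that key: if the provider call fails, or the sender crashes between SET NX and the provider call, the claim has to be released (delete the key, or write it first as a short-lived "in_progress" marker) so that the retry is allowed through. With that release step, retries provide at-least-once delivery and the dedup window suppresses the duplicates.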
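The claim-then-send pattern can be sketched as follows. To keep the example runnable without a Redis server, `InMemoryRedis` is a stand-in written for this sketch; it mimics only the `set(name, value, ex=..., nx=...)` and `delete(name)` calls of the redis-py client, where `set` with `nx=True` returns `None` when the key already exists. `send_once` is a hypothetical helper name.

```python
import time
from typing import Callable, Optional


class InMemoryRedis:
    """Stand-in for the subset of redis-py's `set`/`delete` used below."""

    def __init__(self) -> None:
        self._data: dict = {}  # name -> (value, expiry timestamp)

    def set(self, name: str, value: str, ex: Optional[int] = None, nx: bool = False):
        now = time.monotonic()
        current = self._data.get(name)
        if current is not None and current[1] <= now:
            current = None  # treat expired keys as absent
        if nx and current is not None:
            return None  # key exists, so NX refuses to overwrite
        expires = now + ex if ex is not None else float("inf")
        self._data[name] = (value, expires)
        return True

    def delete(self, name: str) -> int:
        return 1 if self._data.pop(name, None) is not None else 0


DEDUP_TTL_SECONDS = 24 * 60 * 60


def send_once(client, notification_id: str, send: Callable[[], None]) -> bool:
    """Send at most once per dedup window. Returns True if this call sent."""
    key = f"dedup:{notification_id}"
    if not client.set(key, "sent", ex=DEDUP_TTL_SECONDS, nx=True):
        return False  # an earlier attempt already claimed this notification
    try:
        send()
    except Exception:
        # Release the claim so the retry is not silently dropped.
        client.delete(key)
        raise
    return True
```

Deleting the key on failure covers provider errors. A crash between the claim and the send would still leave the key set, which is why a short-lived "in progress" value is a common refinement.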
Step 7: Delivery to APNs / FCM
Your system doesn't deliver to phones directly — that's Apple's (APNs) and Google's (FCM) job. You call their APIs, and they push to the device.
Push Sender workflow:
  1. Receive notification task from Kafka
  2. Look up device token for user (from device_tokens table)
  3. Build platform-specific payload:
     iOS (APNs): {"aps": {"alert": {"title": "...", "body": "..."}, "badge": 1}}
     Android (FCM HTTP v1): {"message": {"token": token, "notification": {"title": "...", "body": "..."}}}
  4. Call APNs/FCM API
  5. Handle response:
     200 OK → provider accepted the notification; record as sent in notification_log
     410 Gone (APNs) or 404 UNREGISTERED (FCM) → device token is invalid (user uninstalled app) → delete token from DB
     500 / timeout → retry with exponential backoff
Device tokens expire when users reinstall apps or change phones. Cleaning up stale tokens prevents wasting sends and keeps your sender reputation healthy.
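The payload and response-handling steps can be sketched without any network calls. The function names here are hypothetical, and two choices are assumptions of this sketch: the Android payload uses the FCM HTTP v1 shape (a `message` object with a `token` field) instead of the older `to` form, and HTTP 429 is treated as retryable. The backoff uses full jitter so that many workers retrying at once do not hit the provider in lockstep.

```python
import random
from typing import Callable, Optional


def build_payload(platform: str, token: str, title: str, body: str) -> dict:
    """Build the request body for one device."""
    if platform == "ios":
        # For APNs the device token goes in the request path, not the body.
        return {"aps": {"alert": {"title": title, "body": body}, "badge": 1}}
    if platform == "android":
        return {"message": {"token": token,
                            "notification": {"title": title, "body": body}}}
    raise ValueError(f"unknown platform: {platform}")


def next_action(status: Optional[int]) -> str:
    """Decide what to do with a provider response; None means a timeout."""
    if status == 200:
        return "record_sent"
    if status == 410:
        return "delete_token"
    if status is None or status == 429 or status >= 500:
        return "retry"
    return "record_failed"  # other 4xx: retrying will not help


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 300.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Exponential backoff with full jitter, in seconds."""
    return rng() * min(cap, base * 2 ** attempt)


print(next_action(410))   # prints "delete_token"
print(next_action(None))  # prints "retry"
```

Every branch ends in a recorded outcome, which is what the "no silent failures" requirement asks for.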
Step 8: Rate Limiting and Abuse Prevention
Sending 10B notifications/day is only legitimate if they're genuinely wanted. Spam ruins deliverability.
Per-user rate limits:
  Max 50 push notifications per user per day
  Max 10 marketing emails per user per week
  Max 5 SMS per user per week
Implementation:
  Redis counter: notif_count:{user_id}:{channel}:{window}
  ({window} is the date for daily limits and the ISO week for weekly limits; set an expiry on the key so old counters clean themselves up)
  INCR before each send
  If count > limit → drop notification, log as "rate limited"
Providers (APNs, Gmail, carriers) can throttle or block your app or domain when too many of your sends bounce or are reported as spam. Rate limiting protects your platform-wide deliverability.
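A fixed-window counter along these lines might look like the sketch below. `InMemoryCounter` is a stand-in written for this example that mimics the `incr(name)` and `expire(name, time)` calls of the redis-py client, so the code runs without a Redis server. `LIMITS`, `window_id`, and `allow_send` are hypothetical names, and applying the email limit to every email instead of only marketing email is a simplification.

```python
from datetime import date


class InMemoryCounter:
    """Stand-in for the subset of redis-py's `incr`/`expire` used below."""

    def __init__(self) -> None:
        self.counts: dict = {}
        self.ttls: dict = {}

    def incr(self, name: str, amount: int = 1) -> int:
        self.counts[name] = self.counts.get(name, 0) + amount
        return self.counts[name]

    def expire(self, name: str, time: int) -> bool:
        self.ttls[name] = time  # recorded only; this stand-in never evicts
        return name in self.counts


# (limit, window) per channel, taken from the limits listed above.
LIMITS = {"push": (50, "day"), "email": (10, "week"), "sms": (5, "week")}
KEY_TTL_SECONDS = 8 * 24 * 60 * 60  # outlives the longest (weekly) window


def window_id(period: str, today: date) -> str:
    """Bucket name: the date for daily limits, the ISO week for weekly ones."""
    if period == "day":
        return today.isoformat()
    year, week, _ = today.isocalendar()
    return f"{year}-W{week:02d}"


def allow_send(client, user_id: str, channel: str, today: date) -> bool:
    """Count this send; return False once the per-user limit is exceeded."""
    limit, period = LIMITS[channel]
    key = f"notif_count:{user_id}:{channel}:{window_id(period, today)}"
    count = client.incr(key)
    if count == 1:
        client.expire(key, KEY_TTL_SECONDS)  # first hit in this window
    return count <= limit
```

A caller would log a rejected send as "rate limited" instead of dropping it silently.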
Step 9: Database Schema
NOTIFICATION_LOG
  id               UUID PRIMARY KEY
  user_id          UUID NOT NULL
  type             TEXT (order_shipped, message_received, etc.)
  channel          ENUM (push, email, sms)
  status           ENUM (queued, sent, delivered, failed)
  payload          JSONB
  idempotency_key  TEXT UNIQUE
  created_at       TIMESTAMPTZ
  sent_at          TIMESTAMPTZ
  delivered_at     TIMESTAMPTZ
DEVICE_TOKENS
  id               UUID PRIMARY KEY
  user_id          UUID NOT NULL
  token            TEXT NOT NULL
  platform         ENUM (ios, android)
  created_at       TIMESTAMPTZ
  last_used_at     TIMESTAMPTZ
USER_PREFERENCES
  user_id          UUID PRIMARY KEY
  push_enabled     BOOLEAN DEFAULT true
  email_enabled    BOOLEAN DEFAULT true
  sms_enabled      BOOLEAN DEFAULT false
  quiet_start      TIME (nullable)
  quiet_end        TIME (nullable)
  timezone         TEXT
What the Interviewer Is Actually Testing
- Do you clearly separate notification generation from delivery?
- Do you choose Kafka and justify it over a simpler queue?
- Do you explain deduplication using idempotency keys and Redis NX?
- Do you route through APNs/FCM rather than claiming to deliver to phones directly?
- Do you handle preference routing including quiet hours?
- Do you think about rate limiting to protect deliverability?
- Do you handle stale device tokens gracefully?