System Design Interview

Design a Push Notification System

Sending 10 billion notifications a day across iOS, Android, email, and SMS, reliably and without duplication

Key outcome: 10B notifications/day, <3% failure

The Interview Question

"Design a push notification system that sends notifications to users across iOS, Android, email, and SMS. The system must handle 10 billion notifications per day reliably, without duplicate delivery."

This question tests whether you understand the difference between notification generation (your system) and notification delivery (someone else's system), and how to build a reliable pipeline between them.


Step 1: Requirements

Functional

  • Send notifications triggered by events: new message, order shipped, price drop, security alert
  • Support channels: push (iOS APNs, Android FCM), email (SendGrid), SMS (Twilio)
  • Users can configure preferences: opt out of channels, set quiet hours
  • Retry failed deliveries
  • Deduplicate: the same notification should not be delivered twice

Non-functional

  • 10 billion notifications per day (~115,000/second on average; peak traffic will be higher)
  • Delivery within 5 seconds for high-priority notifications
  • At-least-once delivery guarantee (retries on failure)
  • Effectively exactly-once: deduplication suppresses the duplicates that retries would otherwise cause
  • No silent failures — every failed delivery should be recorded
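The per-second figure comes from dividing the daily volume by the 86,400 seconds in a day, which gives an average rather than a peak. The short calculation below shows the arithmetic; the peak multiplier of 3 is purely an assumed value for illustration, not a number from the requirements.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rate(per_day: int) -> float:
    """Average events per second for a given daily volume."""
    return per_day / SECONDS_PER_DAY


def peak_rate(per_day: int, peak_factor: float) -> float:
    """Average rate scaled by an assumed peak-to-average ratio."""
    return average_rate(per_day) * peak_factor


DAILY_VOLUME = 10_000_000_000  # 10 billion notifications/day (from the requirements)
ASSUMED_PEAK_FACTOR = 3.0      # hypothetical; measure real traffic to pick this

print(f"average: {average_rate(DAILY_VOLUME):,.0f}/s")                     # average: 115,741/s
print(f"assumed peak: {peak_rate(DAILY_VOLUME, ASSUMED_PEAK_FACTOR):,.0f}/s")  # assumed peak: 347,222/s
```

Sizing consumers and partitions against the peak figure rather than the average leaves headroom for bursts such as marketing campaigns.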

Step 2: Architecture Overview

Event Sources                      Notification Pipeline
─────────────────────────────────────────────────────────────────

  Order Service ──┐
  Chat Service  ──┤              ┌──────────────────────┐
  Payment Svc   ──┼──► Kafka ──► │  Notification Worker │
  Marketing     ──┘              │  (fan-out, routing)  │
                                 └──────────┬───────────┘
                                            │
                             ┌──────────────┼──────────────┐
                             ▼              ▼              ▼
                       Push Queue      Email Queue    SMS Queue
                      (Kafka topic)  (Kafka topic)  (Kafka topic)
                             │              │              │
                     ┌───────▼──┐    ┌──────▼──┐    ┌─────▼─────┐
                     │  Push    │    │  Email  │    │   SMS     │
                     │ Sender   │    │ Sender  │    │  Sender   │
                     └───┬──────┘    └────┬────┘    └─────┬─────┘
                         │               │                │
                    APNs/FCM          SendGrid          Twilio

Step 3: The Two Separate Problems

Problem 1: Notification Generation. Something happens → who needs to be notified? This is business logic. The Order Service knows an order shipped; it needs to tell the notification system who to notify and what to say.

Problem 2: Notification Delivery. Given "send this message to this device/email/phone", get it there reliably. This is infrastructure.

Keep these separated. Your notification system receives notification events, not raw business events. The Order Service doesn't say "order 12345 status changed to shipped" — it says "notify user_id 67890: Your order has shipped."
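To make the boundary concrete, here is a minimal sketch of what a notification event might look like when it leaves the Order Service. The class name, field names, and the helper `order_shipped_event` are all hypothetical. The one design choice worth noting is that `notification_id` is derived from the business event, so re-emitting the same event yields the same ID.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class NotificationEvent:
    """What the notification system receives: who to notify and what to say."""
    notification_id: str  # generated by the source service, stable across re-emits
    user_id: str
    type: str
    channel: str
    priority: str
    title: str
    body: str

    def to_json(self) -> str:
        """Serialize for publishing to the message bus."""
        return json.dumps(asdict(self), sort_keys=True)


def order_shipped_event(user_id: str, order_id: str) -> NotificationEvent:
    """Hypothetical helper living in the Order Service, not in the notification system."""
    return NotificationEvent(
        notification_id=f"order_shipped:{order_id}:{user_id}",
        user_id=user_id,
        type="order_shipped",
        channel="push",
        priority="high",
        title="Order update",
        body="Your order has shipped.",
    )


event = order_shipped_event(user_id="67890", order_id="12345")
print(event.to_json())
```

The notification system never sees the order's status field; it only sees a recipient, a channel, and the text to deliver.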


Step 4: Kafka as the Backbone

Why Kafka and not a simpler queue?

At 115,000 notifications/second:
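  One RabbitMQ queue → roughly 50,000 messages/second as a rule-of-thumb ceiling
  Kafka topic        → scales with partition count; a partitioned topic on a modest cluster can sustain ~1,000,000 messages/second (about 20x the single-queue figure)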
  
Kafka also gives you:
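  Replay capability: reprocess failed notifications for as long as the topic's retention allows (for example, 7 days)
  Fan-out: multiple consumer groups can read the same topic independently
  Exactly-once semantics inside Kafka (with Kafka transactions); calls to external providers still need their own deduplication
  Durable: messages survive consumer crashes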

Each channel (and, for push, each priority level) gets its own Kafka topic:

notification.push.high_priority   (order confirmations, security alerts)
notification.push.low_priority    (marketing, suggestions)
notification.email
notification.sms

High-priority topics have more consumer workers → lower latency.
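The topic choice can be expressed as a small pure function, sketched below. The topic names come from the list above; the function names and the choice of `user_id` as the partition key are assumptions (keying by user keeps one user's notifications in a single partition, and therefore in order).

```python
PUSH_PRIORITIES = ("high", "low")
SINGLE_TOPIC_CHANNELS = ("email", "sms")


def topic_for(channel: str, priority: str = "low") -> str:
    """Map a channel and priority to one of the topic names listed above."""
    if channel == "push":
        if priority not in PUSH_PRIORITIES:
            raise ValueError(f"unknown priority: {priority!r}")
        return f"notification.push.{priority}_priority"
    if channel in SINGLE_TOPIC_CHANNELS:
        return f"notification.{channel}"
    raise ValueError(f"unknown channel: {channel!r}")


def partition_key(user_id: str) -> bytes:
    """Assumed keying scheme: all of a user's messages hash to the same partition."""
    return user_id.encode("utf-8")


print(topic_for("push", "high"))  # notification.push.high_priority
print(topic_for("email"))         # notification.email
```

Rejecting unknown channels loudly, rather than falling back to a default topic, fits the "no silent failures" requirement.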


Step 5: User Preference Routing

Before sending anything, check user preferences:

Notification event arrives:
  user_id: 12345
  type: "order_shipped"
  channel: "push"

Preference lookup (Redis cache):
  Key: prefs:12345
  Value: {
    push_enabled: true,
    email_enabled: false,
    quiet_hours: {start: "22:00", end: "08:00", tz: "Europe/Oslo"},
    email: "user@example.com",
    push_tokens: [{token: "abc...", platform: "ios"}]
  }

Decision logic:
  Is push_enabled? Yes
  Is current time in quiet_hours? No
  → Route to Push Sender with token "abc..."
  
  If yes to quiet hours:
  → Delay notification: enqueue to "scheduled" queue with deliver_at = end_of_quiet_hours

Preferences are cached in Redis (TTL 5 minutes) — this is a read-heavy, write-rare dataset.
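The quiet-hours check has one subtle case: a window such as 22:00 to 08:00 crosses midnight. The sketch below handles that and computes `deliver_at` for the scheduled queue. Function names are hypothetical, and the demo uses a fixed UTC+1 offset as a stand-in for `zoneinfo.ZoneInfo("Europe/Oslo")` so that it runs without a timezone database.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo
from typing import Optional


def in_quiet_hours(local: time, start: time, end: time) -> bool:
    """True if `local` is in [start, end), allowing windows that cross midnight."""
    if start == end:
        return False  # assumption: identical bounds mean no quiet hours configured
    if start < end:
        return start <= local < end
    return local >= start or local < end  # e.g. 22:00 -> 08:00


def deliver_at(now_utc: datetime, start: time, end: time, tz: tzinfo) -> Optional[datetime]:
    """Return None to send immediately, else the UTC instant when quiet hours end."""
    local_now = now_utc.astimezone(tz)
    if not in_quiet_hours(local_now.time(), start, end):
        return None
    end_local = local_now.replace(hour=end.hour, minute=end.minute, second=0, microsecond=0)
    if end_local <= local_now:
        end_local += timedelta(days=1)  # window ends tomorrow morning
    return end_local.astimezone(timezone.utc)


# Fixed-offset stand-in for the user's zone; real code would resolve the tz name.
utc_plus_1 = timezone(timedelta(hours=1))
quiet_start = time.fromisoformat("22:00")
quiet_end = time.fromisoformat("08:00")

late_evening = datetime(2024, 1, 15, 22, 30, tzinfo=timezone.utc)  # 23:30 local
print(deliver_at(late_evening, quiet_start, quiet_end, utc_plus_1))
```

Storing `deliver_at` in UTC keeps the scheduled queue independent of each user's timezone.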


Step 6: Deduplication — Preventing Duplicate Delivery

Retries are necessary (networks fail, APNs times out). But retries risk sending the same notification twice.

Solution: idempotency keys + Redis deduplication window

Every notification event has a unique notification_id generated by the source service.

Before sending:
  Redis: SET dedup:{notification_id} "sent" NX EX 86400
  NX = set only if the key does Not eXist; EX 86400 = expire after 24 hours
  
  If SET returned OK → first attempt → proceed to send
  If SET returned nil → already sent → skip (idempotent drop)

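Because the key is set before the send, two failure cases need care. If the sender crashes after sending but before recording success, the retry's SET NX returns nil and the message is not sent again, which is the intended behavior. If the send itself fails with a retryable error, the key is already set and a naive retry would be dropped as a duplicate, so the sender must delete the key on a retryable failure to let the retry through. A worker crash between SET NX and the send is rarer but leaves the notification unsent, so it should be detectable from the notification log. Retries provide at-least-once processing, and the dedup window suppresses repeat sends for 24 hours; together they approximate exactly-once delivery.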
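A minimal sketch of this check follows. `InMemoryStore` is a hypothetical stand-in that mimics only the `set(name, value, nx=..., ex=...)` and `delete(name)` calls of a redis-py client, so the example runs without a server; `send_once` and `RetryableError` are illustrative names as well. The sketch also releases the claim when a send fails with a retryable error, an assumption that goes slightly beyond the prose so that a retry is not mistaken for a duplicate.

```python
import time
from typing import Callable, Dict, Optional, Tuple

DEDUP_TTL_SECONDS = 24 * 60 * 60  # 24h dedup window


class InMemoryStore:
    """Stand-in for a Redis client: only set(nx=, ex=) and delete are modeled."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._data: Dict[str, Tuple[str, Optional[float]]] = {}
        self._clock = clock

    def set(self, name: str, value: str, nx: bool = False, ex: Optional[int] = None):
        entry = self._data.get(name)
        if entry is not None and entry[1] is not None and entry[1] <= self._clock():
            del self._data[name]  # key has expired
        if nx and name in self._data:
            return None  # like Redis: nil when NX fails
        expires = self._clock() + ex if ex is not None else None
        self._data[name] = (value, expires)
        return True

    def delete(self, name: str) -> int:
        return 1 if self._data.pop(name, None) is not None else 0


class RetryableError(Exception):
    """Raised by the sender for timeouts and transient provider errors."""


def send_once(store, notification_id: str, send: Callable[[], None]) -> str:
    """Claim the notification ID, send, and release the claim on retryable failure."""
    key = f"dedup:{notification_id}"
    if not store.set(key, "sent", nx=True, ex=DEDUP_TTL_SECONDS):
        return "duplicate"  # already claimed: idempotent drop
    try:
        send()
    except RetryableError:
        store.delete(key)  # let the retry through instead of dropping it
        raise
    return "sent"
```

Calling `send_once` twice with the same ID sends once and returns `"duplicate"` the second time; a call that raises `RetryableError` leaves no key behind, so the next attempt proceeds.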


Step 7: Delivery to APNs / FCM

Your system doesn't deliver to phones directly — that's Apple's (APNs) and Google's (FCM) job. You call their APIs, and they push to the device.

Push Sender workflow:
  1. Receive notification task from Kafka
  2. Look up device token for user (from device_tokens table)
  3. Build platform-specific payload:
     iOS (APNs): {"aps": {"alert": {"title": "...", "body": "..."}, "badge": 1}}
     Android (FCM HTTP v1): {"message": {"token": token, "notification": {"title": "...", "body": "..."}}}
  4. Call APNs/FCM API
  5. Handle response:
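     200 OK → provider accepted the message → record as sent in notification_log (acceptance is not proof of delivery to the device)
     410 Gone (APNs) or 404 UNREGISTERED (FCM) → device token is no longer valid (for example, the user uninstalled the app) → delete token from DB
     429 / 5xx / timeout → retry with exponential backoff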
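Step 5 of the workflow can be sketched as a status classifier plus a backoff schedule. The function names, base delay, and cap are assumptions. The invalid-token codes (410 for APNs, 404 for FCM HTTP v1) and the treatment of 429 as retryable follow the providers' documented behavior as I understand it, so verify them against current provider docs before relying on them.

```python
import random
from typing import Optional

# Status that signals an unregistered device token, per platform (assumed mapping).
INVALID_TOKEN_STATUS = {"ios": 410, "android": 404}


def classify(platform: str, status: Optional[int]) -> str:
    """Map a provider response to an action. `status=None` models a timeout."""
    if status is None:
        return "retry"
    if status == 200:
        return "sent"
    if status == INVALID_TOKEN_STATUS.get(platform):
        return "delete_token"
    if status == 429 or status >= 500:
        return "retry"
    return "fail"  # other 4xx: bad payload or credentials; retrying will not help


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng: Optional[random.Random] = None) -> float:
    """Exponential backoff: base * 2^attempt, capped. With `rng`, apply full jitter."""
    delay = min(cap, base * (2 ** attempt))
    if rng is None:
        return delay
    return rng.uniform(0.0, delay)  # jitter spreads retries out after an outage


print([backoff_delay(n) for n in range(8)])
# [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

Jitter matters at this scale: without it, every worker that saw the same provider outage retries at the same instant.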

Device tokens expire when users reinstall apps or change phones. Cleaning up stale tokens prevents wasting sends and keeps your sender reputation healthy.


Step 8: Rate Limiting and Abuse Prevention

Sending 10B notifications/day is only legitimate if they're genuinely wanted. Spam ruins deliverability.

Per-user rate limits:
  Max 50 push notifications per user per day
  Max 10 marketing emails per user per week
  Max 5 SMS per user per week

Implementation:
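  Redis counter: notif_count:{user_id}:{channel}:{window}  (window = the date for daily limits, the ISO week for weekly limits)
  INCR before each send; set a TTL on the first increment so old counters expire
  If count > limit → drop notification, log as "rate limited"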
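A minimal sketch of the counter follows. `InMemoryCounter` is a hypothetical stand-in exposing only `incr(name)` and `expire(name, seconds)`, mirroring the redis-py calls of the same names. The limits come from the list above, with the email limit applied to the whole email channel for simplicity. The ISO-week key for weekly limits and the TTL values are assumptions.

```python
from datetime import date
from typing import Dict

# channel -> (limit, period); values taken from the per-user limits above
LIMITS = {"push": (50, "day"), "email": (10, "week"), "sms": (5, "week")}
TTL_SECONDS = {"day": 2 * 86400, "week": 8 * 86400}  # assumed: slightly past the window


class InMemoryCounter:
    """Stand-in for a Redis client: only incr and expire are modeled."""

    def __init__(self) -> None:
        self.counts: Dict[str, int] = {}
        self.ttls: Dict[str, int] = {}

    def incr(self, name: str) -> int:
        self.counts[name] = self.counts.get(name, 0) + 1
        return self.counts[name]

    def expire(self, name: str, seconds: int) -> bool:
        self.ttls[name] = seconds
        return True


def counter_key(user_id: str, channel: str, today: date) -> str:
    """Daily limits key on the date; weekly limits key on the ISO year and week."""
    _, period = LIMITS[channel]
    if period == "day":
        window = today.isoformat()
    else:
        iso = today.isocalendar()
        window = f"{iso[0]}-W{iso[1]:02d}"
    return f"notif_count:{user_id}:{channel}:{window}"


def allow(store, user_id: str, channel: str, today: date) -> bool:
    """INCR the window counter and report whether this send is within the limit."""
    limit, period = LIMITS[channel]
    key = counter_key(user_id, channel, today)
    count = store.incr(key)
    if count == 1:
        store.expire(key, TTL_SECONDS[period])  # first hit in this window
    return count <= limit


print(counter_key("12345", "sms", date(2024, 1, 15)))  # notif_count:12345:sms:2024-W03
```

A caller that gets `False` drops the notification and logs it as "rate limited", so the drop is recorded rather than silent.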

Providers (APNs/FCM, inbox providers such as Gmail, SMS carriers) can throttle or block your app or domain when complaint, bounce, or invalid-token rates climb past their thresholds. Rate limiting protects your platform-wide deliverability.


Step 9: Database Schema

NOTIFICATION_LOG
  id              UUID  PRIMARY KEY
  user_id         UUID  NOT NULL
  type            TEXT  (order_shipped, message_received, etc.)
  channel         ENUM  (push, email, sms)
  status          ENUM  (queued, sent, delivered, failed)
  payload         JSONB
  idempotency_key TEXT  UNIQUE
  created_at      TIMESTAMPTZ
  sent_at         TIMESTAMPTZ
  delivered_at    TIMESTAMPTZ

DEVICE_TOKENS
  id              UUID  PRIMARY KEY
  user_id         UUID  NOT NULL
  token           TEXT  NOT NULL
  platform        ENUM  (ios, android)
  created_at      TIMESTAMPTZ
  last_used_at    TIMESTAMPTZ

USER_PREFERENCES
  user_id         UUID  PRIMARY KEY
  push_enabled    BOOLEAN DEFAULT true
  email_enabled   BOOLEAN DEFAULT true
  sms_enabled     BOOLEAN DEFAULT false
  quiet_start     TIME  (nullable)
  quiet_end       TIME  (nullable)
  timezone        TEXT
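The UNIQUE constraint on `idempotency_key` gives a durable backstop behind the Redis dedup window. The sketch below demonstrates the idea with SQLite from the Python standard library, purely so that it runs anywhere. The column subset, the CHECK constraints standing in for ENUM types, and the helper names are assumptions, and a production system would more likely target a server database such as PostgreSQL.

```python
import sqlite3


def open_log() -> sqlite3.Connection:
    """Create an in-memory NOTIFICATION_LOG with a subset of the columns above."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE notification_log (
            id              TEXT PRIMARY KEY,
            user_id         TEXT NOT NULL,
            type            TEXT,
            channel         TEXT CHECK (channel IN ('push', 'email', 'sms')),
            status          TEXT CHECK (status IN ('queued', 'sent', 'delivered', 'failed')),
            idempotency_key TEXT UNIQUE,
            created_at      TEXT DEFAULT CURRENT_TIMESTAMP
        )
        """
    )
    return conn


def record_queued(conn: sqlite3.Connection, log_id: str, user_id: str,
                  type_: str, channel: str, idempotency_key: str) -> bool:
    """Insert a 'queued' row; return False if the idempotency key was already logged."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO notification_log "
                "(id, user_id, type, channel, status, idempotency_key) "
                "VALUES (?, ?, ?, ?, 'queued', ?)",
                (log_id, user_id, type_, channel, idempotency_key),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # duplicate id or idempotency key (or a failed CHECK)


conn = open_log()
print(record_queued(conn, "a1", "67890", "order_shipped", "push", "order_shipped:12345"))  # True
print(record_queued(conn, "a2", "67890", "order_shipped", "push", "order_shipped:12345"))  # False
```

Because the constraint lives in the database, it still rejects a duplicate after the 24-hour Redis key has expired.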

What the Interviewer Is Actually Testing

  • Do you clearly separate notification generation from delivery?
  • Do you choose Kafka and justify it over a simpler queue?
  • Do you explain deduplication using idempotency keys and Redis NX?
  • Do you route through APNs/FCM rather than claiming to deliver to phones directly?
  • Do you handle preference routing including quiet hours?
  • Do you think about rate limiting to protect deliverability?
  • Do you handle stale device tokens gracefully?
