System design interview guide

Design a Notification System

TL;DR: Product teams enqueue **notifications**; your platform fans out to **push**, **email**, and **SMS** with **preferences**, **templates**, **retries**, and **provider** rate limits—without becoming a spam cannon or melting APNs/FCM quotas. The interview is **queueing**, **delivery semantics**, and **ops**—not drawing three vendor logos.

Problem statement

You’re designing a multi-channel notification platform: push, email, SMS (and maybe in-app), with templates, scheduling, user preferences, retries, delivery status, and rate limits per channel and tenant.

Constraints. Functional: enqueue from many services; respect opt-outs; track attempts; idempotent APIs where needed. Non-functional: high throughput; a high delivery-rate target (not perfection); availability of the enqueue path. Scale: 100M+ notifications/day, peaks at thousands/sec.

Center: durable queue + channel workers + provider backoff + prefs—not fire-and-forget HTTP to vendors.

Introduction

Notification platforms are distributed queues with policy. Interviewers want backpressure, per-channel failure modes, and preference evaluation before fan-out—not a diagram with three arrows to SendGrid.

Weak answers treat “delivery” as boolean. Strong answers separate enqueue (must succeed fast) from delivery (retries, quotas, 410 Gone on dead tokens).

How to approach

Split ingress (durable record + ack) from delivery (async, retriable). Define transactional vs marketing traffic and compliance (STOP, unsubscribe) early. Then one path: enqueue → prefs → template → provider.

Interview tips

  • FCM/APNs: tokens go stale; sweep invalid tokens; batch sends where APIs allow.
  • DLQ is not optional at scale—poison payloads should not block the whole partition.
  • Email: bounces and suppression lists; SMS: opt-in evidence—one sentence each unless they drill in.
  • Per-user ordering: password reset before marketing promo in the same minute—priority queues or separate pools.
  • Queue lag is your user-visible SLO when latency matters—alert on oldest message age, not only depth.
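The last tip can be made concrete: alert on the age of the oldest unprocessed message, since depth alone hides a stuck partition. A minimal sketch, with illustrative timestamps:

```python
import time

def oldest_message_age_s(enqueue_timestamps: list[float], now: float) -> float:
    """Queue-age SLO: seconds since the oldest pending message was enqueued.
    A deep-but-fresh queue is fine; a shallow-but-old one is an incident."""
    if not enqueue_timestamps:
        return 0.0
    return now - min(enqueue_timestamps)

now = time.time()
deep_but_fresh = [now - 2.0] * 10_000   # 10k messages, all ~2s old: healthy
shallow_but_stuck = [now - 900.0]       # one message stuck for 15 minutes: page someone
```

Alerting on `oldest_message_age_s` fires for the second case but not the first, which is the opposite of what a depth-only alert would do.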

Capacity estimation

| Dimension | Implication |
| --- | --- |
| 100M+ notifications/day | Partition by tenant or hash(user) for fairness |
| Peak 5K/s enqueue | Kafka/SQS with provisioned throughput or autoscaled consumers |
| Provider caps | Token bucket per provider account; smooth bursts |
| Device registry | Hundreds of millions of rows; TTL stale tokens |

Implications: you cannot call APNs once per user in a single synchronous loop—batch, async, throttle.
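"Batch, async, throttle" is usually implemented as a token bucket per provider account. A minimal sketch, with illustrative rates that are not from any vendor SDK:

```python
import time

class TokenBucket:
    """Smooths bursts down to a provider's sustained rate."""

    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def try_acquire(self, n: int = 1) -> bool:
        # Refill based on elapsed time, capped at the burst capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False  # caller should back off or requeue, not busy-loop

bucket = TokenBucket(rate_per_sec=500, burst=1000)
sent = sum(1 for _ in range(2000) if bucket.try_acquire())
# roughly the burst capacity succeeds immediately; the rest must wait for refill
```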

High-level architecture

Producer services call Notification API with { user_id, template_id, vars, channel_hint, idempotency_key }. API persists a notification row and enqueues to Kafka/SQS (or internal queue). Router workers load prefs and template, drop if opted out, route to per-channel queues. Channel workers (push, email, SMS) respect provider rate limits, call vendor APIs, record attempts and terminal status. Scheduler ticks delayed jobs into the same pipeline.

Who owns what:

  • Ingress API — Auth between services; idempotency; schema validation.
  • Preference / device registry — Hot reads; cache with version.
  • Template service — i18n, MJML/HTML for email.
  • Channel workers — Adapter code per provider; circuit breakers.

[ Producers ] --> [ Notification API ] --> [ Durable queue ]
                              |
                    router workers (prefs + template)
                              |
         +--------------------+--------------------+
         v                    v                    v
   [ Push queue ]       [ Email queue ]      [ SMS queue ]
         |                    |                    |
         v                    v                    v
   FCM / APNs          SendGrid / SES         Twilio etc.

  Scheduled: [ Scheduler ] --> same enqueue path

In the room: Say at-least-once delivery and idempotency keys so retries do not become duplicate SMS.
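At-least-once plus idempotency keys can be sketched as a unique-key check at enqueue time. Here an in-memory dict stands in for a DB unique index, and the key derivation (`order_id` + type) is illustrative:

```python
import hashlib

_seen: dict[str, str] = {}  # idempotency_key -> notification_id (stand-in for a DB unique index)

def idempotency_key(order_id: str, kind: str) -> str:
    """Derive the key from the logical event, e.g. order_id + type."""
    return hashlib.sha256(f"{order_id}:{kind}".encode()).hexdigest()

def enqueue(order_id: str, kind: str) -> tuple[str, bool]:
    """Returns (notification_id, created). Retries map to the same id."""
    key = idempotency_key(order_id, kind)
    if key in _seen:
        return _seen[key], False          # duplicate: no second SMS
    nid = f"ntf_{len(_seen) + 1}"
    _seen[key] = nid
    return nid, True

first = enqueue("order-42", "receipt")
retry = enqueue("order-42", "receipt")    # producer retried the same event
```

The retry returns the original id without creating a second send, which is exactly what protects users when a producer's HTTP call times out and is replayed.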

Core design approaches

Transactional vs marketing

Transactional: OTP, receipt—often bypass marketing throttle with abuse checks only.

Marketing: stricter frequency caps, unsubscribe enforcement, separate quota.

Single vs multi-queue

Separate queues reduce head-of-line blocking—OTP never waits behind a blast.
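One way to express "OTP never waits behind a blast" is strict priority with a small anti-starvation budget for marketing. A sketch with illustrative budget values:

```python
from collections import deque

class TwoLaneQueue:
    """Transactional drains first, but marketing gets one slot in every
    `cycle` picks so promos are delayed rather than starved forever."""

    def __init__(self, cycle: int = 5):
        self.txn = deque()
        self.mkt = deque()
        self.cycle = cycle
        self.picks = 0

    def put(self, msg, lane: str):
        (self.txn if lane == "transactional" else self.mkt).append(msg)

    def get(self):
        self.picks += 1
        if self.picks % self.cycle == 0 and self.mkt:
            return self.mkt.popleft()     # marketing's guaranteed turn
        if self.txn:
            return self.txn.popleft()     # transactional always wins otherwise
        if self.mkt:
            return self.mkt.popleft()
        return None

q = TwoLaneQueue()
for i in range(3):
    q.put(f"promo-{i}", "marketing")
q.put("otp-1", "transactional")
order = [q.get() for _ in range(4)]       # OTP jumps the promo backlog
```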

Detailed design

Write path (enqueue)

  1. Validate caller; insert notifications row PENDING.
  2. Enqueue {notification_id} to stream.
  3. Return 202 with the id; do not wait for FCM.
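The write path above can be sketched end to end. The dict and list stand in for a real table and for Kafka/SQS; field names are illustrative:

```python
import uuid

DB: dict[str, dict] = {}       # notification_id -> row (stand-in for a real table)
QUEUE: list[dict] = []         # stand-in for Kafka/SQS

def post_notification(user_id: str, template_id: str) -> tuple[int, dict]:
    """Enqueue path: durable record + ack. No provider call happens here."""
    if not user_id or not template_id:
        return 400, {"error": "bad request"}
    nid = str(uuid.uuid4())
    DB[nid] = {"status": "PENDING", "user_id": user_id, "template_id": template_id}
    QUEUE.append({"notification_id": nid})   # delivery is async and retriable
    return 202, {"id": nid}                  # return before FCM/SES/Twilio are touched

status, body = post_notification("u1", "receipt_v2")
```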

Read path (delivery)

  1. Worker pulls batch.
  2. Load prefs—if channel disabled, mark SUPPRESSED.
  3. Render template; send via provider.
  4. Record attempt; on 429/5xx requeue with backoff; on 410 delete token.
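Step 4's status-code handling can be sketched as a single decision function inside the worker. Backoff parameters are illustrative:

```python
import random

def handle_provider_response(code: int, attempt: int, max_attempts: int = 5) -> str:
    """Map a provider status code to the worker's next action."""
    if code in (200, 201):
        return "DELIVERED"
    if code == 410:
        return "DELETE_TOKEN"             # dead device token: never retry
    if code == 429 or code >= 500:
        if attempt >= max_attempts:
            return "DLQ"                  # give up; a human inspects it
        delay = min(60, 2 ** attempt) * random.uniform(0.5, 1.5)  # backoff + jitter
        return f"REQUEUE after {delay:.1f}s"
    return "DLQ"                          # other 4xx: a bug, not transient
```

Keeping the mapping in one pure function makes the retry policy testable without mocking any provider SDK.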

Key challenges

  • Quota: one noisy tenant can burn the shared provider account; enforce per-tenant caps.
  • Duplicate user-visible notifications: idempotency key per logical event (e.g. order_id + type).
  • Timezone for scheduled reminders—store UTC + IANA zone.
  • Hot incident: everyone enqueues at once—admission control at API or queue TTL shedding for non-critical categories.
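The timezone challenge above can be shown concretely: store the UTC instant plus the user's IANA zone, and convert at dispatch so DST shifts are honored (the zone and date are examples):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def local_send_time(utc_instant: datetime, iana_zone: str) -> datetime:
    """Store UTC + IANA zone; convert at send time, not at enqueue time."""
    return utc_instant.astimezone(ZoneInfo(iana_zone))

# A reminder stored as 2024-07-01T16:00Z for a Berlin user (CEST, UTC+2 in July)
stored = datetime(2024, 7, 1, 16, 0, tzinfo=timezone.utc)
local = local_send_time(stored, "Europe/Berlin")
```

Storing a pre-computed local time instead would silently drift by an hour when the zone enters or leaves DST between enqueue and send.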

Scaling the system

  • Horizontal workers per channel; partition Kafka by tenant_id or user_id.
  • Isolate noisy tenants to dedicated queues or lower priority.
  • Regional FCM/APNs endpoints—pin workers near provider regions if needed.

Failure handling

| Scenario | Mitigation |
| --- | --- |
| Provider 429 | Exponential backoff; jitter; circuit open to pause the partition |
| Invalid token | 410: remove token; do not retry forever |
| Template render error | DLQ; alert (a bug, not transient) |
| Queue backlog | Scale consumers; shed marketing first in an incident |
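The "circuit open, pause the partition" mitigation can be sketched as a counter-based breaker in front of the provider adapter. Thresholds and cooldowns are illustrative:

```python
import time

class CircuitBreaker:
    """Opens after `threshold` consecutive failures; while open, the worker
    pauses the partition instead of hammering a throttling provider."""

    def __init__(self, threshold: int = 5, cooldown_s: float = 30.0):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at: float | None = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown_s:
            self.opened_at = None          # half-open: let one probe through
            self.failures = 0
            return True
        return False

    def record(self, ok: bool):
        if ok:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()

cb = CircuitBreaker(threshold=3)
for _ in range(3):
    cb.record(ok=False)                    # e.g. three 429s in a row
```

After the third failure the breaker opens and `allow()` returns False until the cooldown elapses, which is the "pause partition" behavior from the table.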

Degraded UX is delayed marketing; an outage is lost OTPs, so the transactional class needs a stricter SLO.

API design

| Endpoint | Role |
| --- | --- |
| POST /v1/notifications | Enqueue; body: user_id, template_id, data, channels?, send_after? |
| GET /v1/notifications/{id} | Status and attempts |
| POST /v1/users/{id}/preferences | Update prefs (also from product UI) |

POST /v1/notifications

| Field | Role |
| --- | --- |
| idempotency_key | Dedup logical sends |
| category | `transactional` or `marketing`; routing and caps |
| template_id | Which template version |

Internal flow diagram:

POST /v1/notifications --> DB row + queue message --> 202 Accepted
       |
       v
router --> prefs OK? --> channel worker --> provider
              | no --> SUPPRESSED (still auditable)

Errors: 400 bad template; 429 tenant over quota; 503 enqueue path unhealthy—rare if queue is healthy.

Production angles

Notification platforms sit between your reliability and Apple’s, Google’s, ESPs’, and carriers’. The worst incidents are silent: queues age, marketing drowns transactional, email reputation collapses, and compliance discovers stale opt-outs in cache. You learn these patterns after owning multi-channel dispatch through peak traffic and security incidents.

Push “delivered” but the user never sees it (and email “sent” but not read)

What it looks like — Provider dashboards show delivered; the user claims they never got an OTP or receipt. MFA support tickets spike; fraud rises when people fall back to weaker channels. iOS Focus modes and Android notification channels silently swallow categories.

Why it happens — Delivery to device is not visibility in UI. Tokens go stale after reinstall; battery optimization defers FCM; email lands in Promotions or spam because SPF/DKIM posture slipped. Transactional and marketing traffic that share a domain reputation mean one blast can poison receipts.

What good teams do — In-app inbox plus SMS fallback for high-value transactional (cost-aware); separate subdomains or IPs for marketing vs receipts; bounce and complaint webhooks feeding suppression immediately. Metrics on delivery by provider error code, not just HTTP 202 from your API.

Email reputation death spiral after a “small” campaign

What it looks like — Sudden spike in spam placement; open rates collapse; Gmail Postmaster turns red. Product wants to re-send to “make up” volume—making it worse.

Why it happens — Blast without list hygiene; purchased lists; no double opt-in; ignoring hard bounces; shared IP with bad neighbors on your ESP tier. Warm-up skipped for new sending domains.

What good teams do — Suppression list as source of truth; automatic pause on bounce-rate SLO; segment transactional from marketing at the architecture level; tight feedback loops with marketing ops. SES/SendGrid reputation is borrowed, not owned.

Queue backlog: transactional starves while workers drown in marketing payload

What it looks like — Transactional p99 breaches; a marketing enqueue spike fills Redis or Kafka; queue age SLO goes red before CPU. DLQ depth climbs—often provider throttling (429 from APNs) during a viral moment.

Why it happens — Shared infrastructure without lane isolation; large payloads or attachment generation blocking workers; no global concurrency cap per tenant and channel.

What good teams do — Weighted fair queuing or separate queues for transactional vs marketing; shed lower priority first; dynamic rate limits aligned to provider quotas. Measure enqueue p99, queue age by lane, provider error rate by code, DLQ depth, and end-to-end latency from business event to device.

[ Traffic spike ] --> queue depth grows --> age SLO red
       --> shed marketing --> protect transactional lane

Stale preferences in cache — compliance incident, not performance bug

What it looks like — User opts out; marketing still arrives for minutes. A regulator letter follows.

Why it happens — Aggressive caching of preferences without invalidation on update; eventual consistency between profile service and notification router.

What good teams do — Short TTL plus explicit invalidation on change; pessimistic read for marketing category flags; audit every send decision with the version of prefs used.
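A minimal sketch of that pattern, assuming a router-side cache with a short TTL and an explicit invalidation hook called by the preferences write path (all names are illustrative):

```python
import time

PREFS_DB: dict[str, dict] = {"u1": {"marketing": True}}   # source of truth (stand-in)

class PrefsCache:
    """Short TTL bounds worst-case staleness; explicit invalidation on
    update removes it entirely for the common case."""

    def __init__(self, ttl_s: float = 5.0):
        self.ttl_s = ttl_s
        self._entries: dict[str, tuple[dict, float]] = {}

    def get(self, user_id: str) -> dict:
        hit = self._entries.get(user_id)
        if hit and time.monotonic() - hit[1] < self.ttl_s:
            return hit[0]
        prefs = PREFS_DB[user_id]                  # cache miss: read-through
        self._entries[user_id] = (prefs, time.monotonic())
        return prefs

    def invalidate(self, user_id: str):
        self._entries.pop(user_id, None)

cache = PrefsCache()
cache.get("u1")                          # router warms the cache
PREFS_DB["u1"] = {"marketing": False}    # user opts out
cache.invalidate("u1")                   # called by the preferences write path
opted_out = cache.get("u1")["marketing"] is False
```

Without the `invalidate` call, the opt-out would only take effect after the TTL expired, which is exactly the window the regulator letter is about.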

How to use this in an interview — State clearly that channel delivery ≠ human attention, and that lanes and reputation are first-class. One sentence on why stale opt-out cache is worse than slow email.

Bottlenecks and tradeoffs

  • Throughput vs cost: more workers and provider spend.
  • Strong prefs vs latency: cache prefs, but a stale opt-out is a compliance incident; use short TTL + invalidation.

What interviewers expect

  • Architecture: ingress API → durable queue → router (prefs + templates) → channel workers → provider adapters.
  • Data: notification record, per-attempt delivery rows, user channel preferences, device token registry.
  • Reliability: at-least-once with idempotency keys; retries with jitter; DLQ for poison.
  • Rate limits: per user, per tenant, per provider account; separate queues for transactional vs marketing.
  • Templates: versioning, localization, variable substitution with safe defaults.
  • Observability: delivery metrics, provider error codes, queue lag SLO.
  • Compliance: SMS STOP, email unsubscribe, audit for regulated content.
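The templates bullet above (variable substitution with safe defaults) can be sketched with Python's `string.Template`; the template body and fallback names are illustrative:

```python
from string import Template

def render(template_body: str, vars: dict, defaults: dict) -> str:
    """Missing variables fall back to defaults instead of raising
    mid-delivery and poisoning the partition."""
    merged = {**defaults, **vars}            # caller vars win over defaults
    return Template(template_body).safe_substitute(merged)

out = render("Hi $name, your order $order_id shipped.",
             {"order_id": "A42"}, {"name": "there"})
```

`safe_substitute` also leaves any still-unknown `$var` in place rather than throwing, so a template bug degrades output instead of blocking the worker.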

Interview workflow (template)

  1. Clarify requirements. Confirm functional scope, users, consistency needs, and which non-functional goals matter most (latency, availability, cost).
  2. Rough capacity. Estimate QPS, storage, and bandwidth so your data model and partitioning story are grounded.
  3. APIs and core flows. Define a minimal API and walk 1–2 critical read/write paths end to end.
  4. Data model and storage. Choose stores for each access pattern; call out hot keys, indexes, and retention.
  5. Scale and failure. Add caching, sharding, replication, queues, or fan-out as needed; say what breaks in failure modes.
  6. Tradeoffs. Name alternatives you rejected and why (e.g. strong vs eventual consistency, sync vs async).

Frequently asked follow-ups

  • How do you guarantee delivery?
  • How do you handle provider rate limits?
  • What’s the difference between transactional and marketing notifications?
  • How do user preferences get applied?
  • How do you avoid duplicate notifications?

Deep-dive questions and strong answer outlines

Walk through sending a push notification to 1M users.

Enqueue work items (shard by user segments); fan-out workers pull batches; respect FCM/APNs batch APIs and token invalidation on failure. Throttle globally and per app; monitor queue lag. Not one giant HTTP call.
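The fan-out above reduces to chunking recipients into provider-sized batches; the 500-per-call size mirrors common push batch limits but should be treated as an assumption to verify per provider:

```python
def batches(user_ids: list[str], batch_size: int = 500):
    """Yield provider-sized chunks; each chunk becomes one batch API call."""
    for i in range(0, len(user_ids), batch_size):
        yield user_ids[i:i + batch_size]

users = [f"u{i}" for i in range(1_000_000)]
n_calls = sum(1 for _ in batches(users))   # 2,000 batch calls instead of 1M sends
```

In practice these chunks become work items on the queue, so each is independently retriable and throttleable rather than one giant request.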

How do retries work without spamming users?

Exponential backoff; max attempts; DLQ for manual inspection. Idempotency key per logical notification so retries don’t duplicate user-visible alerts if client dedupes poorly.

How are preferences stored and enforced?

Per-channel booleans or granular categories; evaluate before expensive fan-out. Cache with version stamp; invalidate on user change—stale prefs cause trust incidents.

Production angles

  • APNs token flood of 410 Gone—**sweep** invalid tokens or users get ghost failures.
  • Provider outage—**failover** secondary SMS route or **delay** with user messaging.

AI feedback on your design

After a practice session, InterviewCrafted summarizes strengths, gaps, and interviewer-style expectations—similar to a written debrief. See a static example report, then practice this problem to get feedback on your own answer.

FAQs

Q: Is Kafka the whole system?

A: Kafka (or similar) is often the backbone, but you still need state for attempts, template service, and device token registry. Transport ≠ product.

Q: Do I need exactly-once delivery?

A: End-to-end exactly-once to humans is hard. Aim for at-least-once with dedupe keys and best-effort user experience; be honest about email/push quirks.

Q: How is this different from a job scheduler?

A: Overlap on delayed sends, but notifications emphasize channel adapters, quotas, and per-user policy—schedulers emphasize cron and compute jobs.

Q: How do you prioritize urgent vs marketing notifications?

A: Separate queues or priority classes with budgets; transactional messages bypass marketing throttle. Prevent starvation with aging or caps per channel so promos don’t block OTP forever.

Practice interactively

Open the practice session to use the canvas and stages, then review AI feedback.