Flash sale spike · notifications 28 min late · duplicates · DB at 100%

E-commerce Ops · Incident brief

The Notification System That Broke During a Flash Sale

Problem statement

During a flash sale on a large e-commerce platform, the notification system failed under a ~60× traffic spike (5k → 300k notifications/min). Order confirmations and payment alerts were delayed by 20–30 minutes. Some users received duplicate SMS messages and emails; others received nothing. PostgreSQL CPU hit 100% and the connection pool was exhausted (498/500). An external SMS provider timed out intermittently, which caused OOM crashes in the sender pods and retry storms.

The team's whiteboard shows only Client Apps → Notification API → Notification service → PostgreSQL. Delivery workers, message queues, idempotency keys, circuit breakers, and provider integrations were never drawn, yet the incident shows that each of these roles is exercised under load, whether by a real component or by a gap where one should be.

During a flash sale, notifications were delayed by 20–30 minutes.
Some users received duplicate messages.
Others received no notifications at all.
Database CPU hit 100% and DB connections were exhausted.
An external SMS provider randomly timed out and caused sender crashes.

Architecture

Team whiteboard from design review — incomplete and misleading. Critical paths (queue, workers, SMS/push providers, retries) are missing.

The sketch on your whiteboard is the team’s incomplete draft from a design review — not a correct or complete architecture. It omits major runtime paths and components implied by the incident. Extend or replace it so it matches what you believe is really happening under load.

Impacted services

PostgreSQL (notifications DB) · critical
CPU 100%, connections 498/500, poll queries dominating
notification-sender · critical
14 restarts in 10 min, OOM on SMS provider timeouts
Notification API · degraded
Accepting requests but downstream backlog growing
External SMS provider · degraded
22% timeout rate during spike
Email / Push channels · degraded
Backlogged behind same Postgres poll path
Customer support · critical
Social complaints: triple confirmations, missing alerts
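
One way to keep a degraded provider from taking the sender down with it is a circuit breaker in front of the provider call. The sketch below is a minimal illustration under stated assumptions, not the platform's actual code: the class name, the thresholds, and the injectable clock are all hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the provider call is skipped."""


class CircuitBreaker:
    """Stops calling a failing dependency for a cool-down period.

    failure_threshold and reset_after_s are illustrative values, not tuned ones.
    """

    def __init__(self, failure_threshold=5, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                # Fail fast instead of piling up in-flight requests in memory.
                raise CircuitOpenError("provider circuit is open")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, a sender would typically park the message for later delivery or fall back to another channel rather than drop it.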

Flash sale · Notification pipeline

Metrics from the incident window

Notification API · PostgreSQL · SMS provider · Sender pods

Notification delivery lag (p99)

Last 40m · Grafana

Stable under 2s until flash sale at T+12m

PostgreSQL — CPU utilization

Last 40m · Grafana

PostgreSQL — active connections

Last 40m · Grafana

498 / 500

Pool exhausted; poll queries hold connections
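
The link between poll queries and pool exhaustion can be made concrete with Little's Law: the average number of connections in use is roughly the query rate multiplied by the time each query holds a connection. The sketch below only illustrates that arithmetic. The brief does not state a query rate or hold time, so the numbers in the usage note are hypothetical.

```python
def connections_in_use(queries_per_s: float, mean_hold_s: float) -> float:
    """Little's Law: average busy connections = arrival rate x mean hold time."""
    if queries_per_s < 0 or mean_hold_s < 0:
        raise ValueError("rate and hold time must be non-negative")
    return queries_per_s * mean_hold_s


def max_sustainable_qps(pool_size: int, mean_hold_s: float) -> float:
    """Highest query rate a pool can absorb before callers start waiting."""
    if mean_hold_s <= 0:
        raise ValueError("mean_hold_s must be positive")
    return pool_size / mean_hold_s
```

For example, with an assumed 0.25 s hold time, a 500-connection pool saturates at about 2,000 queries/s. If load doubles the hold time to 0.5 s, the same pool saturates at 1,000 queries/s, which is why slow poll queries on a CPU-bound database tend to exhaust the pool faster than the traffic increase alone would suggest.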

Notification API — requests/min

Last 40m · Grafana

Series: ingress · successful enqueue

Kafka consumer lag (notification-send)

Last 40m · Grafana

No dedicated queue — lag proxy from pending DB rows
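
A lag proxy of this kind can be derived from the pending rows themselves: count them and take the age of the oldest one. The sketch below is a guess at what such a proxy might look like, not the team's actual query. The table name, column names, and status values are hypothetical, and SQLite stands in for PostgreSQL only so that the example runs on its own.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory table with a hypothetical schema for this sketch."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE notifications ("
        "id INTEGER PRIMARY KEY, status TEXT NOT NULL, created_at_s REAL NOT NULL)"
    )
    return conn


def pending_lag(conn, now_s):
    """Return (pending_count, oldest_age_s) for rows still waiting to be sent."""
    count, oldest = conn.execute(
        "SELECT COUNT(*), MIN(created_at_s) FROM notifications WHERE status = 'pending'"
    ).fetchone()
    # MIN() yields NULL (None) when nothing is pending, so report zero lag.
    oldest_age_s = 0.0 if oldest is None else now_s - oldest
    return count, oldest_age_s
```

A metric computed this way adds its own read load to the database it is measuring, so it would normally be sampled at a low frequency.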

External SMS provider — timeout rate

Last 40m · Grafana

22%

Timeouts trigger immediate retries → sender OOM
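
A common alternative to immediate retries is capped exponential backoff with jitter and a fixed retry budget, so that timed-out sends spread out over time instead of stacking up. The following is a minimal sketch under stated assumptions: the function names, the default delays, and the attempt limit are illustrative, and `send` is any callable that raises `TimeoutError` on a provider timeout.

```python
import random
import time


def backoff_delays(max_attempts=5, base_s=0.5, cap_s=30.0, rng=random.random):
    """Return the wait before each retry: exponential growth, capped, full jitter."""
    delays = []
    for retry in range(max_attempts - 1):
        ceiling = min(cap_s, base_s * (2 ** retry))
        delays.append(rng() * ceiling)  # uniform in [0, ceiling)
    return delays


def send_with_retry(send, payload, max_attempts=5, sleep=time.sleep, **backoff_kwargs):
    """Call send(payload), retrying on TimeoutError with jittered backoff."""
    delays = backoff_delays(max_attempts, **backoff_kwargs)
    for attempt in range(max_attempts):
        try:
            return send(payload)
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # retry budget spent: surface the failure to the caller
            sleep(delays[attempt])
```

A timeout does not prove the provider failed to deliver, so a retry can still produce a duplicate unless the send is also protected by an idempotency key.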

notification-sender — pod restarts

Last 40m · Grafana

14 in 10m

Failed deliveries by channel

Last 40m · Grafana

SMS: 18k
Email: 6.2k
Push: 4.1k
In-app: 890

SMS provider timeouts dominate; email/push backlogged behind DB poll

Duplicate notification rate

Last 40m · Grafana
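
Duplicates like the triple confirmations reported by support are usually suppressed with an idempotency key: a stable identifier for "this event, to this user, on this channel" that is claimed before sending. The sketch below is illustrative only. The key fields and class name are hypothetical, and the in-memory set stands in for a shared store (for example, a table with a unique constraint) that all sender pods would need to consult.

```python
import hashlib


def idempotency_key(user_id, event_id, channel):
    """Derive a stable key for one event delivered to one user on one channel."""
    raw = f"{user_id}:{event_id}:{channel}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class DedupingSender:
    """Wraps a send function so each key is delivered at most once."""

    def __init__(self, send):
        self._send = send
        self._claimed = set()  # stand-in for a shared, durable store

    def deliver(self, user_id, event_id, channel, body):
        key = idempotency_key(user_id, event_id, channel)
        if key in self._claimed:
            return False  # already sent, or being sent: skip the duplicate
        self._claimed.add(key)
        try:
            self._send(channel, user_id, body)
        except Exception:
            # Release the claim so a later retry is not mistaken for a duplicate.
            self._claimed.discard(key)
            raise
        return True
```

Releasing the claim on failure trades a small duplicate risk (the provider may have delivered before the error) for not silently losing the message, which matters here because the incident involved both duplicates and missing notifications.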


Triage questions

Think pool sizing, queue decoupling, and whether Postgres should act as a work queue.
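
As a starting point for the queue-decoupling question, the sketch below shows the general shape of the pattern: producers enqueue into a bounded buffer and get an immediate yes/no, while a fixed pool of workers drains it. This is an in-process illustration only. The standard-library `queue.Queue` stands in for a real broker, and the function names, queue size, and worker count are hypothetical.

```python
import queue
import threading


def try_enqueue(q, item):
    """Non-blocking enqueue; False tells the caller to shed or defer load."""
    try:
        q.put_nowait(item)
        return True
    except queue.Full:
        return False


def start_workers(q, handler, n_workers):
    """Start a fixed pool of worker threads that drain the queue."""

    def loop():
        while True:
            item = q.get()
            try:
                if item is None:  # sentinel: stop this worker
                    return
                handler(item)
            except Exception:
                pass  # a real worker would log and dead-letter the item
            finally:
                q.task_done()

    threads = [threading.Thread(target=loop, daemon=True) for _ in range(n_workers)]
    for t in threads:
        t.start()
    return threads


def stop_workers(q, threads):
    """Send one sentinel per worker and wait for all of them to exit."""
    for _ in threads:
        q.put(None)
    for t in threads:
        t.join()
```

The bounded size is the point of the design: when the buffer is full the API learns about it at enqueue time and can push back, instead of accepting work that silently accumulates as pending rows. The fixed worker count likewise caps how many downstream connections and in-flight provider calls exist at once.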
