InterviewCrafted

Flash sale spike · notifications 28 min late · duplicates · DB at 100%

E-commerce Ops · Incident brief

The Notification System That Broke During a Flash Sale

Problem statement

During a flash sale on a large e-commerce platform, the notification system failed under a ~60× traffic spike (5k → 300k notifications/min). Order confirmations and payment alerts were delayed by 20–30 minutes. Some users received duplicate SMS messages and emails; others received nothing. PostgreSQL CPU hit 100% and the connection pool was all but exhausted (498 of 500 connections in use). An external SMS provider timed out intermittently, which caused sender pods to crash with out-of-memory (OOM) errors and set off retry storms.
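A rough back-of-the-envelope model helps show how a spike of this size turns into delays of tens of minutes. The sketch below is illustrative only: the 5k and 300k per-minute rates come from the brief, while the drain rate of 150k notifications/min and the 25-minute spike duration are assumed values chosen for the example, not measurements from the incident. The function names are likewise hypothetical.

```python
def spike_factor(baseline_per_min: int, peak_per_min: int) -> float:
    """Ratio of peak to baseline arrival rate."""
    return peak_per_min / baseline_per_min


def backlog_after(minutes: float, arrival_per_min: int, drain_per_min: int) -> float:
    """Undelivered notifications after `minutes` of sustained load."""
    return max(0.0, float(arrival_per_min - drain_per_min) * minutes)


def wait_minutes(backlog: float, drain_per_min: int) -> float:
    """Time a newly queued notification waits behind the backlog (FIFO)."""
    return backlog / drain_per_min


# Rates stated in the brief.
BASELINE = 5_000
PEAK = 300_000

# ASSUMPTIONS for illustration only; the brief does not give these numbers.
ASSUMED_DRAIN = 150_000
ASSUMED_SPIKE_MINUTES = 25

factor = spike_factor(BASELINE, PEAK)
backlog = backlog_after(ASSUMED_SPIKE_MINUTES, PEAK, ASSUMED_DRAIN)
delay = wait_minutes(backlog, ASSUMED_DRAIN)
print(f"spike: {factor:.0f}x, backlog: {backlog:,.0f}, wait: {delay:.0f} min")
# prints: spike: 60x, backlog: 3,750,000, wait: 25 min
```

Under these assumed numbers, any sustained gap between arrival rate and drain rate makes the delay grow linearly with the length of the spike, which is consistent with delays in the 20–30 minute range.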

The team's whiteboard shows only Client Apps → Notification API → Notification service → PostgreSQL. Delivery workers, message queues, idempotency keys, circuit breakers, and provider integrations were never drawn, but the incident shows that they are implicitly part of the system under load.

  • During a flash sale, notifications were delayed by 20–30 minutes.
  • Some users received duplicate messages.
  • Others received no notifications at all.
  • Database CPU hit 100% and DB connections were exhausted.
  • An external SMS provider randomly timed out and caused sender crashes.
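Duplicate messages of the kind listed above are commonly addressed with idempotency keys. The following is a minimal sketch of the idea, not the platform's actual code: `idempotency_key`, `DedupingSender`, and the choice of order ID, channel, and template as key fields are all assumptions made for illustration.

```python
import hashlib
from typing import Callable


def idempotency_key(order_id: str, channel: str, template: str) -> str:
    """Derive a stable key: the same logical notification always maps to the same key."""
    raw = f"{order_id}:{channel}:{template}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class DedupingSender:
    """Wraps a send function and skips notifications that were already sent.

    ASSUMPTION: keys live in an in-memory set to keep the sketch self-contained.
    A real system would need a shared, durable store (for example a table with
    a unique constraint), because a restarted pod would lose this set.
    """

    def __init__(self, send_fn: Callable[[str, str], None]) -> None:
        self._send_fn = send_fn
        self._sent_keys = set()

    def send(self, order_id: str, channel: str, template: str, payload: str) -> bool:
        key = idempotency_key(order_id, channel, template)
        if key in self._sent_keys:
            return False  # retry of an already delivered notification: skip it
        self._send_fn(channel, payload)
        self._sent_keys.add(key)  # recorded only after the send succeeded
        return True


outbox = []
sender = DedupingSender(lambda channel, payload: outbox.append((channel, payload)))
first = sender.send("order-1", "sms", "order_confirmed", "Your order is confirmed")
retry = sender.send("order-1", "sms", "order_confirmed", "Your order is confirmed")
email = sender.send("order-1", "email", "order_confirmed", "Your order is confirmed")
print(first, retry, email, len(outbox))  # prints: True False True 2
```

Recording the key after the send favors at-least-once delivery: a crash between the send and the record can still produce a duplicate. Recording it before the send would instead risk a lost message, so the ordering is a design choice rather than a fixed rule.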

Architecture

Team whiteboard from design review — incomplete and misleading. Critical paths (queue, workers, SMS/push providers, retries) are missing.

The sketch on your whiteboard is the team's incomplete draft from a design review, not a correct or complete architecture. It omits major runtime paths and components that the incident implies. Extend or replace it so that it matches what you believe is really happening under load.
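One possible starting point is to write down the components the brief itself names and compare them with what was drawn. The sketch below is a hypothetical model, not the team's actual topology: the edges for `notification-sender` are inferred from the impacted-services notes (senders polling PostgreSQL and calling the SMS provider and the email/push channels), and `IMPLIED_EDGES` and `undrawn_components` are names invented for illustration.

```python
# Path drawn on the whiteboard, as described in the brief.
WHITEBOARD = ["Client Apps", "Notification API", "Notification service", "PostgreSQL"]

# ASSUMED runtime edges, inferred from the incident notes rather than confirmed.
IMPLIED_EDGES = {
    "Client Apps": ["Notification API"],
    "Notification API": ["Notification service"],
    "Notification service": ["PostgreSQL"],
    "notification-sender": ["PostgreSQL", "External SMS provider", "Email / Push channels"],
}


def undrawn_components(whiteboard, edges):
    """Return components that appear in the edge map but not on the whiteboard."""
    drawn = set(whiteboard)
    missing = []
    for source, targets in edges.items():
        for node in [source, *targets]:
            if node not in drawn and node not in missing:
                missing.append(node)
    return missing


missing = undrawn_components(WHITEBOARD, IMPLIED_EDGES)
print(missing)
# prints: ['notification-sender', 'External SMS provider', 'Email / Push channels']
```

The edge map is deliberately incomplete: queues, retries, and any deduplication layer would need to be added once you decide where they sit in the path.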

Impacted services

  • PostgreSQL (notifications DB) · critical

    CPU 100%, connections 498/500, poll queries dominating

  • notification-sender · critical

    14 restarts in 10 min, OOM on SMS provider timeouts

  • Notification API · degraded

    Accepting requests but downstream backlog growing

  • External SMS provider · degraded

    22% timeout rate during spike

  • Email / Push channels · degraded

    Backlogged behind same Postgres poll path

  • Customer support · critical

    Social complaints: triple confirmations, missing alerts
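The notification-sender and SMS provider entries above describe a dependency that keeps timing out while callers keep calling it. A circuit breaker is one common way to contain that pattern. The following is a minimal sketch under stated assumptions: the brief does not say how notification-sender calls the provider, so `CircuitBreaker`, `CircuitOpenError`, the thresholds, and the injectable clock are all hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_after_s=30.0, clock=time.monotonic):
        self._failure_threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._clock = clock          # injectable so the cool-down can be tested
        self._failures = 0           # consecutive failures seen so far
        self._opened_at = None       # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after_s:
                # Fail fast instead of holding memory and connections on a
                # provider that is known to be timing out.
                raise CircuitOpenError("provider circuit is open")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            trial_failed = self._opened_at is not None
            if trial_failed or self._failures >= self._failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        self._failures = 0
        self._opened_at = None       # success closes the circuit
        return result


def flaky_sms_send():
    """Stand-in for a provider call that times out (hypothetical)."""
    raise TimeoutError("SMS provider timed out")


now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, reset_after_s=10.0, clock=lambda: now[0])
outcomes = []
for _ in range(3):
    try:
        breaker.call(flaky_sms_send)
    except CircuitOpenError:
        outcomes.append("rejected")
    except TimeoutError:
        outcomes.append("timeout")
print(outcomes)  # prints: ['timeout', 'timeout', 'rejected']
```

A breaker only helps if each provider call also has its own timeout and if rejected messages are re-queued with backoff rather than retried immediately; otherwise the rejections simply feed the retry storm.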