[Chart: Notification delivery lag (p99). Stable under 2s until the flash sale at T+12m.]
Flash sale spike · notifications 28 min late · duplicates · DB at 100%
E-commerce Ops · Incident brief
During a flash sale on a large e-commerce platform, the notification system failed under a ~60× traffic spike (5k → 300k notifications/min). Order confirmations and payment alerts were delayed by 20–30 minutes. Some users received duplicate SMS messages and emails; others received nothing. PostgreSQL CPU hit 100% and the connection pool was nearly exhausted (498 of 500 connections in use). An external SMS provider timed out intermittently, which led to OOM crashes in the sender pods and to retry storms.
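The retry storms mentioned above suggest that failed sends were retried quickly and without a cap. The brief does not say how the sender pods actually retry, so what follows is only a minimal Python sketch of one common mitigation: capped exponential backoff with full jitter and a fixed attempt budget. `ProviderTimeout`, `retry_delay`, `send_with_retries`, and every default value are hypothetical, chosen for illustration.

```python
import random
import time


class ProviderTimeout(Exception):
    """Hypothetical error raised when the SMS provider does not answer in time."""


def retry_delay(attempt, base=0.5, cap=30.0, rng=random):
    """Full-jitter delay in seconds before retry number `attempt` (0-based).

    The upper bound doubles with each attempt and is clamped to `cap`;
    the actual delay is drawn uniformly between 0 and that bound so that
    many workers do not retry in lockstep.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def send_with_retries(send, max_attempts=4, sleep=time.sleep, rng=random):
    """Call `send()` up to `max_attempts` times, then give up.

    Returns True on success and False if every attempt timed out.
    The fixed attempt budget is what stops a provider outage from
    turning into an unbounded stream of retries.
    """
    for attempt in range(max_attempts):
        try:
            send()
            return True
        except ProviderTimeout:
            if attempt == max_attempts - 1:
                break  # budget spent, no point sleeping again
            sleep(retry_delay(attempt, rng=rng))
    return False
```

A notification that returns False here would still need somewhere to go, for example a dead-letter queue or a later sweep; that choice is outside this sketch.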
The team's whiteboard shows only Client Apps → Notification API → Notification service → PostgreSQL. Delivery workers, message queues, idempotency keys, circuit breakers, and provider integrations were never drawn, yet the incident symptoms show that these paths are in play under load.
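To make the idempotency-key idea concrete, here is a small Python sketch of how a sender could skip repeated attempts for the same logical notification. It is an assumption-laden illustration, not the platform's implementation: the key fields (`order_id`, `event_type`, `channel`), the `DedupingSender` class, and the in-memory set are all hypothetical. The set stands in for a shared store, since pods restart and several workers run at once.

```python
import hashlib


def idempotency_key(order_id, event_type, channel):
    """Derive a stable key so retries of the same logical notification collide."""
    raw = f"{order_id}:{event_type}:{channel}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class DedupingSender:
    """Skips a notification whose key has already been delivered.

    The in-memory set is a stand-in for a shared, durable store
    (hypothetically, a table with a unique constraint on the key).
    """

    def __init__(self, deliver):
        self._deliver = deliver  # callable(channel, body) that does the real send
        self._sent = set()

    def send(self, order_id, event_type, channel, body):
        key = idempotency_key(order_id, event_type, channel)
        if key in self._sent:
            return False  # duplicate attempt, skipped
        self._deliver(channel, body)
        self._sent.add(key)  # recorded only after delivery succeeds
        return True
```

Recording the key only after a successful delivery means a crash between the two steps can still cause one repeat; recording it first would trade that for the risk of a lost message. Which trade-off fits order confirmations versus payment alerts is a design decision the brief leaves open.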
Team whiteboard from design review — incomplete and misleading. Critical paths (queue, workers, SMS/push providers, retries) are missing.
The sketch on your whiteboard is the team’s incomplete draft from a design review — not a correct or complete architecture. It omits major runtime paths and components implied by the incident. Extend or replace it so it matches what you believe is really happening under load.
CPU 100%, connections 498/500, poll queries dominating
14 restarts in 10 min, OOM on SMS provider timeouts
Accepting requests but downstream backlog growing
22% timeout rate during spike
Backlogged behind same Postgres poll path
Social complaints: triple confirmations, missing alerts
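The 22% timeout rate and the repeated OOM restarts on provider timeouts listed above are the kind of symptoms a circuit breaker is meant to contain. The sketch below is a minimal, single-threaded Python illustration under stated assumptions: `CircuitBreaker`, `CircuitOpenError`, the failure threshold, and the cool-down period are hypothetical and do not come from the incident data.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Stops calling a failing dependency for a cool-down period.

    After `failure_threshold` consecutive failures the breaker opens and
    rejects calls immediately. Once `reset_after` seconds have passed, the
    next call is let through as a trial: success closes the breaker,
    failure re-opens it.
    """

    def __init__(self, failure_threshold=5, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the breaker is closed

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("provider calls suspended")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        self._failures = 0
        self._opened_at = None
        return result
```

Rejecting calls immediately while the breaker is open keeps workers from holding memory and connections for requests that are likely to time out anyway. With concurrent workers, more than one trial call could pass after the cool-down; a production version would need a lock or a shared state store.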