Cache flush · API pool exhausted · DB 100% · p99 30s
Platform SRE · Incident brief
The Cache That Took Down Our API
Problem statement
A coordinated Redis flush during peak traffic caused a cache miss storm. Miss rate hit 95% for 8 minutes. API connection pools exhausted and PostgreSQL CPU pegged at 100%. p99 latency reached 30s.
The whiteboard shows a simple cache-aside path with no singleflight (coalescing concurrent misses for the same key into one origin read), no origin protection, and no stale-while-revalidate. The incident exposed each of those gaps.
- Cache flush at peak traffic triggered 95% miss rate for 8 minutes.
- API connection pools were effectively exhausted (480 of 500 connections in use).
- PostgreSQL CPU hit 100%; p99 API latency climbed to 30s.
- Circuit breakers did not trip, so the origin kept accepting traffic.
- A manual rate limit brought partial recovery, but errors persisted for 40 minutes.
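The missing singleflight step could look roughly like the sketch below. It is an illustration under assumptions, not the team's code: the `SingleFlight` class, its `do` method, and the key format are hypothetical names, and it uses threads only. When many requests miss on the same key at once, one caller (the leader) runs the loader against the origin and the rest wait for its result.

```python
import threading
from typing import Any, Callable, Dict, Optional


class _Call:
    """State for one in-flight load, shared by every caller waiting on it."""

    def __init__(self) -> None:
        self.done = threading.Event()
        self.result: Any = None
        self.error: Optional[BaseException] = None


class SingleFlight:
    """Hypothetical helper: at most one concurrent load per key."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._calls: Dict[str, _Call] = {}

    def do(self, key: str, loader: Callable[[], Any]) -> Any:
        # Decide under the lock whether this caller leads or follows.
        with self._lock:
            call = self._calls.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._calls[key] = call

        if leader:
            try:
                call.result = loader()  # the only origin read for this key
            except BaseException as exc:
                call.error = exc  # followers see the same failure
            finally:
                # Forget the call so a later miss triggers a fresh load,
                # then wake every follower.
                with self._lock:
                    del self._calls[key]
                call.done.set()
        else:
            call.done.wait()

        if call.error is not None:
            raise call.error
        return call.result
```

In a cache-aside read, the miss branch would call something like `flight.do("product:42", load_from_postgres)` and write the result back to Redis, so a burst of misses on one hot key costs one database query instead of one per request. The sketch only coalesces within a single process; coalescing across API instances would need a shared lock or a similar mechanism.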
Live evidence
- Deploy bot, T+0: Coordinated Redis FLUSHALL completed on catalog cluster (change #8842)
- Datadog, T+6m: Cache miss rate 95% · API pool 480/500 · Postgres CPU pegged
- Slack #platform, T+12m: Manual rate limit applied; still seeing 30s p99 on reads
Architecture
Team whiteboard: an incomplete draft from a design review, not a correct or complete architecture. It omits major runtime paths and components implied by the incident.
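One component the draft omits is a breaker in front of the origin that actually trips. The brief does not say why the existing breakers stayed closed, so the following is only a minimal sketch of the general pattern. The `CircuitBreaker` class, its thresholds, and the injectable `clock` are hypothetical, and the class is not thread-safe as written. It counts consecutive failures, including any timeout exception raised by the wrapped call, and rejects calls while open.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without contacting the origin."""


class CircuitBreaker:
    """Hypothetical consecutive-failure breaker with a cool-down period."""

    def __init__(
        self,
        failure_threshold: int = 5,      # illustrative value, not from the brief
        reset_after_s: float = 10.0,     # illustrative value, not from the brief
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], Any]) -> Any:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                # Open: fail fast instead of adding load to the origin.
                raise CircuitOpenError("origin circuit is open")
            # Cool-down elapsed: half-open, let this call through as a probe.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            half_open = self._opened_at is not None
            if half_open or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed probe
            raise
        # Success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

For a breaker like this to trip during a miss storm, the wrapped database call needs a timeout short enough that slow queries surface as exceptions. Calls that eventually succeed after many seconds would otherwise count as successes.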
Impacted services
- Redis cluster (degraded): Cold after flush; hit rate collapsed
- Product API (critical): Pool 480/500; timeouts cascading
- PostgreSQL (critical): CPU 100%; read QPS 8× baseline
- Downstream clients (critical): Error rate 34% at peak
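The pattern in the list above, a nearly full pool and read QPS at 8× baseline, is what origin protection is meant to prevent. One common form is a cap on concurrent origin reads that sheds excess calls immediately rather than queueing them. The sketch below assumes that approach. `OriginLimiter`, `OriginBusyError`, and the `max_in_flight` parameter are hypothetical names, and any cap value would have to be chosen from the database's measured capacity, which the brief does not give.

```python
import threading
from typing import Any, Callable


class OriginBusyError(RuntimeError):
    """Raised when the origin is at its concurrency cap and the call is shed."""


class OriginLimiter:
    """Hypothetical cap on the number of concurrent origin reads."""

    def __init__(self, max_in_flight: int) -> None:
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def call(self, fn: Callable[[], Any]) -> Any:
        # Non-blocking acquire: shed right away instead of waiting behind
        # a slow database and holding an API connection open.
        if not self._slots.acquire(blocking=False):
            raise OriginBusyError("origin concurrency limit reached")
        try:
            return fn()
        finally:
            self._slots.release()
```

A caller that receives `OriginBusyError` can return a fast error or a degraded response, which keeps API connections from piling up while the cache refills.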