Platform SRE · Incident brief

The Cache That Took Down Our API

Cache flush · API pool exhausted · DB 100% · p99 30s

Problem statement

A coordinated Redis flush during peak traffic caused a cache miss storm. Miss rate hit 95% for 8 minutes. API connection pools exhausted and PostgreSQL CPU pegged at 100%. p99 latency reached 30s.

The whiteboard shows a simple cache-aside path with no singleflight, no origin protection, and no stale-while-revalidate — the incident exposes those gaps.
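To make the first of those gaps concrete, here is a minimal sketch of request coalescing (singleflight), assuming a threaded API process. The `SingleFlight` class and the `loader` callable are hypothetical names for illustration and are not taken from the team's codebase. The idea is that concurrent misses on the same key share one origin read instead of each issuing their own.

```python
import threading
from concurrent.futures import Future


class SingleFlight:
    """Coalesce concurrent loads of the same key into one loader call."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> Future shared by every waiter on that key

    def do(self, key, loader):
        # Decide under the lock whether this caller leads or follows.
        with self._lock:
            fut = self._inflight.get(key)
            leader = fut is None
            if leader:
                fut = Future()
                self._inflight[key] = fut

        if leader:
            try:
                # Only the leader touches the origin (e.g. PostgreSQL).
                fut.set_result(loader())
            except Exception as exc:
                # Followers see the same failure instead of retrying in a herd.
                fut.set_exception(exc)
            finally:
                with self._lock:
                    self._inflight.pop(key, None)

        # Leader and followers all read the shared outcome.
        return fut.result()
```

A cache-aside read path would call something like `sf.do("product:42", lambda: load_from_db(42))` on a miss, where `load_from_db` is a placeholder for the real query. This only coalesces within one process; coalescing across API instances would need a shared lock or a similar mechanism.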

Cache flush at peak traffic triggered 95% miss rate for 8 minutes.
API connection pools exhausted (480/500).
PostgreSQL CPU hit 100%; p99 API latency climbed to 30s.
Circuit breakers did not trip — origin kept accepting traffic.
Partial recovery after manual rate limit, but errors persisted 40 minutes.
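One way to picture origin protection for the symptoms listed above is a fail-fast concurrency cap in front of the database. The sketch below is illustrative only: `OriginLimiter`, `OriginOverloaded`, and the `max_in_flight` value are assumptions, not components from the incident. It rejects new origin reads once a fixed number are in flight, so excess load is shed quickly rather than queued until the connection pool fills.

```python
import threading
from contextlib import contextmanager


class OriginOverloaded(Exception):
    """Raised when the origin concurrency cap is already reached."""


class OriginLimiter:
    """Cap concurrent origin reads and reject the excess immediately."""

    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    @contextmanager
    def slot(self):
        # Non-blocking acquire: shed load instead of queueing behind the DB.
        if not self._slots.acquire(blocking=False):
            raise OriginOverloaded("origin concurrency limit reached")
        try:
            yield
        finally:
            self._slots.release()
```

A caller would wrap each database read in `with limiter.slot():` and translate `OriginOverloaded` into a fast error response or a degraded fallback. Choosing the cap is a tuning decision; setting it below the pool size leaves headroom for writes and health checks.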

Live evidence

Deploy bot · T+0
Coordinated Redis FLUSHALL completed on catalog cluster (change #8842)
Datadog · T+6m
Cache miss rate 95% · API pool 480/500 · Postgres CPU pegged
Slack #platform · T+12m
Manual rate limit applied; still seeing 30s p99 on reads

Architecture

Team whiteboard — incomplete. Missing paths implied by the incident.

The sketch on your whiteboard is the team's incomplete draft from a design review — not a correct or complete architecture. It omits major runtime paths and components implied by the incident.

Impacted services

Redis cluster · degraded
Cold after flush; hit rate collapsed
Product API · critical
Pool 480/500; timeouts cascading
PostgreSQL · critical
CPU 100%; read QPS 8× baseline
Downstream clients · critical
Error rate 34% at peak
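The 8× read QPS figure can be related to the 95% miss rate with a rough calculation. The sketch below assumes request volume stayed constant through the incident and that every cache miss costs the same number of origin reads. Neither assumption is stated in the brief, so the result is an estimate and not a measured value.

```python
def implied_baseline_miss_rate(miss_rate_now, origin_multiplier):
    """Baseline miss rate implied by a miss rate and an origin load multiplier.

    Assumes constant request volume and a fixed number of origin reads per miss,
    so origin QPS scales linearly with miss rate.
    """
    return miss_rate_now / origin_multiplier


# Figures from the brief: 95% miss rate, read QPS at 8x baseline.
baseline_miss = implied_baseline_miss_rate(0.95, 8.0)
baseline_hit = 1.0 - baseline_miss

print(f"implied baseline miss rate: {baseline_miss:.1%}")
print(f"implied baseline hit rate:  {baseline_hit:.1%}")
```

Under those assumptions the cache was normally absorbing roughly 88% of reads, which is why losing it multiplied origin load so sharply.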

Triage questions

Think coalescing, backpressure, and origin limits.

Grafana panels (Cache layer · Redis & origin load): cache miss rate, API p99 latency, PostgreSQL CPU, API connection pool, origin read QPS vs baseline.
