InterviewCrafted

Cache flush · API pool exhausted · DB 100% · p99 30s

Platform SRE · Incident brief

The Cache That Took Down Our API

Problem statement

A coordinated Redis flush during peak traffic caused a cache miss storm. Miss rate hit 95% for 8 minutes. API connection pools exhausted and PostgreSQL CPU pegged at 100%. p99 latency reached 30s.

The whiteboard shows a simple cache-aside path with no singleflight, no origin protection, and no stale-while-revalidate — the incident exposes those gaps.
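
One of the gaps named above, the missing singleflight, can be sketched as follows. This is a minimal in-process illustration, not the team's code: `SingleFlightCache`, `loader`, and the dict standing in for Redis are all hypothetical. The idea is that concurrent misses on the same key share one origin read instead of each issuing its own database query.

```python
import threading
from typing import Any, Callable, Dict, Optional


class _Call:
    """One in-flight load; followers wait on `done` for the leader's result."""

    def __init__(self) -> None:
        self.done = threading.Event()
        self.value: Any = None
        self.error: Optional[BaseException] = None


class SingleFlightCache:
    """Cache-aside with request coalescing: one origin load per missing key."""

    def __init__(self, loader: Callable[[str], Any]) -> None:
        self._loader = loader                   # hypothetical origin read (e.g. a SQL query)
        self._cache: Dict[str, Any] = {}        # stand-in for Redis (assumption)
        self._inflight: Dict[str, _Call] = {}
        self._lock = threading.Lock()

    def get(self, key: str) -> Any:
        with self._lock:
            if key in self._cache:
                return self._cache[key]         # cache hit
            call = self._inflight.get(key)
            leader = call is None
            if leader:                          # first miss for this key: we load it
                call = _Call()
                self._inflight[key] = call

        if not leader:                          # someone else is already loading
            call.done.wait()
            if call.error is not None:
                raise call.error
            return call.value

        try:
            call.value = self._loader(key)      # the only origin read for this key
            with self._lock:
                self._cache[key] = call.value
        except BaseException as exc:
            call.error = exc                    # followers see the same failure
            raise
        finally:
            with self._lock:
                self._inflight.pop(key, None)
            call.done.set()                     # wake followers
        return call.value
```

With coalescing like this, a cold cache costs roughly one origin read per distinct hot key per process rather than one per request. It does not bound load across many distinct keys or many API instances, so it complements origin protection rather than replacing it.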

  • Cache flush at peak traffic triggered 95% miss rate for 8 minutes.
  • API connection pools exhausted (480/500).
  • PostgreSQL CPU hit 100%; p99 API latency climbed to 30s.
  • Circuit breakers did not trip — origin kept accepting traffic.
  • Partial recovery after a manual rate limit, but errors persisted for 40 minutes.
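
The brief does not say how the circuit breakers were configured. One plausible reason they stayed closed (an assumption, not a finding) is that the database was slow rather than failing, so an error-count threshold was never reached. The sketch below shows a breaker that also counts slow successes as failures. `LatencyAwareBreaker`, its thresholds, and `CircuitOpenError` are hypothetical names, and the class is single-threaded for brevity.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without touching the origin."""


class LatencyAwareBreaker:
    """Opens after `max_failures` consecutive errors or slow calls."""

    def __init__(self, max_failures: int = 5, slow_after_s: float = 1.0,
                 cooldown_s: float = 10.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.slow_after_s = slow_after_s    # illustrative threshold, not from the incident
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[..., Any], *args: Any) -> Any:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.cooldown_s:
                raise CircuitOpenError("origin is shedding load")
            # Cooldown elapsed: half-open, let this call through as a probe.

        start = self._clock()
        try:
            result = fn(*args)
        except Exception:
            self._record_failure()
            raise

        if self._clock() - start > self.slow_after_s:
            self._record_failure()          # a slow success still counts against the origin
        else:
            self._failures = 0              # a healthy call closes the breaker
            self._opened_at = None
        return result

    def _record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.max_failures:
            self._opened_at = self._clock() # open, or re-open after a failed probe
```

Failing fast while the breaker is open frees pool connections that would otherwise sit waiting on a saturated database.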

Live evidence

  • Deploy bot · T+0

    Coordinated Redis FLUSHALL completed on catalog cluster (change #8842)

  • Datadog · T+6m

    Cache miss rate 95% · API pool 480/500 · Postgres CPU pegged

  • Slack #platform · T+12m

    Manual rate limit applied — still seeing 30s p99 on reads
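
In this timeline the rate limit was applied by hand at T+12m, after the pool was already at 480/500. As a sketch of what an automatic limit on origin-bound reads could look like, the token bucket below rejects excess cache-miss loads immediately instead of queueing them. `TokenBucket`, `OriginBusyError`, `load_from_origin`, and the rate and burst values are hypothetical; the brief does not describe where or how the manual limit was enforced.

```python
import threading
import time
from typing import Any, Callable


class OriginBusyError(RuntimeError):
    """Raised when a read is shed before it reaches the database."""


class TokenBucket:
    """Allows `rate_per_s` calls per second on average, with bursts up to `burst`."""

    def __init__(self, rate_per_s: float, burst: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_s
        self.capacity = float(burst)
        self._tokens = float(burst)
        self._clock = clock
        self._last = clock()
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        with self._lock:
            now = self._clock()
            # Refill in proportion to elapsed time, capped at the burst size.
            self._tokens = min(self.capacity,
                               self._tokens + (now - self._last) * self.rate)
            self._last = now
            if self._tokens >= 1.0:
                self._tokens -= 1.0
                return True
            return False


def load_from_origin(bucket: TokenBucket, loader: Callable[[str], Any], key: str) -> Any:
    """Run `loader` only if the bucket has capacity; otherwise fail fast."""
    if not bucket.try_acquire():
        raise OriginBusyError(key)      # caller can return 503 or a fallback quickly
    return loader(key)
```

Rejecting early keeps the number of in-flight database reads bounded, so requests that are admitted do not queue behind work the origin cannot finish in time.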

Architecture

Team whiteboard (incomplete).

The sketch on your whiteboard is the team's draft from a design review, not a correct or complete architecture. It omits major runtime paths and components implied by the incident.

Impacted services

  • Redis cluster · degraded

    Cold after flush; hit rate collapsed

  • Product API · critical

    Pool 480/500; timeouts cascading

  • PostgreSQL · critical

    CPU 100%; read QPS 8× baseline

  • Downstream clients · critical

    Error rate 34% at peak
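
The 95% miss rate and the 8× read QPS figure can be related with a back-of-envelope calculation. The sketch below assumes that request volume stayed roughly constant, that every database read comes from a cache miss, and that each miss costs one read. None of these assumptions is stated in the brief, and a database pegged at 100% CPU may also have capped the observed QPS, so treat the result as a rough estimate. The function name is hypothetical.

```python
def implied_baseline_miss_rate(incident_miss_rate: float, qps_multiplier: float) -> float:
    """Baseline miss rate m0 such that incident_miss_rate / m0 == qps_multiplier.

    Assumes DB read QPS is proportional to the cache miss rate at constant traffic.
    """
    if qps_multiplier <= 0:
        raise ValueError("qps_multiplier must be positive")
    return incident_miss_rate / qps_multiplier


# Figures from the brief: 95% miss rate, read QPS 8x baseline.
baseline_miss = implied_baseline_miss_rate(0.95, 8.0)
print(f"implied baseline miss rate: {baseline_miss:.1%}")
print(f"implied baseline hit rate:  {1 - baseline_miss:.1%}")
```

Under those assumptions the cache was absorbing close to 88% of reads before the flush, which is why losing it multiplied database load so sharply.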