Celebrity post · one shard 100% CPU · feed p99 45s · fan-out backlog 2.1M
Social Feed · Incident brief
The Feed That Melted When a Celebrity Posted
Problem statement
A celebrity with 40M followers posted during peak. One timeline shard hit 100% CPU; feed p99 reached 45s. Fan-out backlog grew to 2.1M writes while other shards stayed idle.
The architecture uses uniform push fan-out, with no celebrity tier and no backpressure on the post path.
- One timeline shard hit 100% CPU while others stayed under 30%.
- Feed p99 latency reached 45s for affected users.
- Fan-out worker backlog grew to 2.1M writes.
- A hot user_id caused shard skew.
- Read cache could not warm before traffic arrived.
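The missing celebrity tier can be illustrated with hybrid fan-out: ordinary accounts push each post to every follower timeline, while accounts above a follower threshold store the post once and followers merge it in at read time. This is a minimal sketch under assumptions, not the incident system's code. `CELEBRITY_THRESHOLD`, the in-memory stores, and the function names `publish` and `read_feed` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical cutoff; a real value would come from load testing.
CELEBRITY_THRESHOLD = 1_000_000

timelines = defaultdict(list)        # follower_id -> pushed (ts, author_id, post_id)
celebrity_posts = defaultdict(list)  # author_id -> (ts, author_id, post_id), stored once


def publish(author_id, post_id, ts, followers, threshold=CELEBRITY_THRESHOLD):
    """Push to follower timelines unless the author is in the celebrity tier."""
    entry = (ts, author_id, post_id)
    if len(followers) >= threshold:
        celebrity_posts[author_id].append(entry)  # one write, no fan-out
        return "pull"
    for follower_id in followers:
        timelines[follower_id].append(entry)      # one write per follower
    return "push"


def read_feed(user_id, followed_celebrities, limit=50):
    """Merge pushed entries with celebrity posts fetched at read time."""
    entries = list(timelines[user_id])
    for author_id in followed_celebrities:
        entries.extend(celebrity_posts[author_id])
    entries.sort(reverse=True)  # newest first, ordered by timestamp
    return entries[:limit]
```

The trade-off in this sketch is that write cost for a large account drops from one write per follower to a single write, while each feed read pays for an extra lookup per followed celebrity.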
Live evidence
- Trending · T+0
@celebrity post went viral: 40M followers · fan-out queue depth 200× normal
- Shard monitor · T+4m
Shard-7 CPU 100% · write QPS 50× peers · hot key detected
- Mobile errors · T+9m
Feed load failures 34%: timeouts on timeline-read for affected users
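The shard monitor entry compares one shard's write QPS against its peers. A check of that kind might look like the sketch below, which flags any shard whose write rate is a large multiple of the median of the others. The function name `find_hot_shards`, the `ratio` default, and the sample QPS figures are assumptions for illustration. Only the 50× relationship mirrors the evidence above.

```python
from statistics import median


def find_hot_shards(write_qps, ratio=10.0):
    """Return shards whose write QPS is at least `ratio` times the median of their peers."""
    hot = []
    for shard, qps in write_qps.items():
        peers = [v for s, v in write_qps.items() if s != shard]
        if not peers:
            continue  # a single shard has nothing to be compared against
        baseline = median(peers)
        if baseline > 0 and qps / baseline >= ratio:
            hot.append(shard)
    return sorted(hot)


# Made-up sample: shard-7 writes at 50x the median of its peers.
sample = {"shard-5": 1000, "shard-6": 1100, "shard-7": 50000, "shard-8": 900}
print(find_hot_shards(sample))  # ['shard-7']
```

Comparing against the peer median rather than the mean keeps the hot shard itself from inflating the baseline.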
Architecture
The whiteboard sketch is the team's incomplete draft from a design review, not a correct or complete architecture. It omits major runtime paths and components implied by the incident.
Impacted services
- Timeline shard-7 · critical
CPU 100%; write queue depth 2.1M
- Fan-out workers · critical
Backlogged; lag climbing
- Feed read API · degraded
p99 45s for affected users
- Other shards · healthy
CPU < 30%; skew visible
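The backlog figures above reflect a write queue that grows without limit. One possible form of backpressure is a bounded queue that refuses new work once it is full, so the post path can defer or shed writes instead of piling them up. The sketch below uses Python's standard `queue.Queue`. The class name `FanoutQueue`, its methods, and the idea that callers handle a rejected write are assumptions, not a description of the incident system.

```python
import queue


class FanoutQueue:
    """Bounded queue that rejects new work instead of growing without limit."""

    def __init__(self, max_depth):
        self._q = queue.Queue(maxsize=max_depth)
        self.rejected = 0  # count of writes refused because the queue was full

    def offer(self, write):
        """Try to enqueue; return False so the caller can defer or shed the write."""
        try:
            self._q.put_nowait(write)
            return True
        except queue.Full:
            self.rejected += 1
            return False

    def take(self):
        """Remove and return the oldest queued write, or None if the queue is empty."""
        try:
            return self._q.get_nowait()
        except queue.Empty:
            return None

    def depth(self):
        """Approximate number of queued writes."""
        return self._q.qsize()
```

A caller that receives `False` from `offer` could retry later, fall back to read-time delivery, or slow the producer. Which of these is appropriate depends on the product's delivery guarantees.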