InterviewCrafted

Celebrity post · one shard 100% CPU · feed p99 45s · fan-out backlog 2.1M

Social Feed · Incident brief

The Feed That Melted When a Celebrity Posted

Problem statement

A celebrity with 40M followers posted during peak traffic. One timeline shard hit 100% CPU, and feed p99 latency reached 45s. The fan-out backlog grew to 2.1M writes while the other shards stayed under 30% CPU.

The architecture uses uniform push fan-out for every account, with no celebrity tier and no backpressure on the post path.

  • One timeline shard hit 100% CPU while others stayed under 30%.
  • Feed p99 latency 45s for affected users.
  • Fan-out worker backlog grew to 2.1M writes.
  • Hot user_id caused shard skew.
  • Read cache could not warm before traffic arrived.
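
To make "celebrity tier" and "backpressure on the post path" concrete, here is a minimal sketch of what a post path with both could look like. It is an illustration under stated assumptions, not the team's design: the class `FanOutPlanner`, the 1M-follower cutoff, and the 100k queue limit are all hypothetical.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field


@dataclass
class FanOutPlanner:
    # Hypothetical tuning values; the brief does not specify any thresholds.
    celebrity_threshold: int = 1_000_000
    max_queue_depth: int = 100_000
    # Pending (follower_id, post_id) timeline writes for the fan-out workers.
    queue: deque = field(default_factory=deque)
    # Posts stored once per author and merged into feeds at read time.
    pull_posts: dict = field(default_factory=lambda: defaultdict(list))

    def publish(self, author_id, post_id, follower_ids):
        """Route a post and return the path taken: 'pull', 'deferred', or 'push'."""
        if len(follower_ids) >= self.celebrity_threshold:
            # Celebrity tier: one write instead of one write per follower.
            self.pull_posts[author_id].append(post_id)
            return "pull"
        if len(self.queue) + len(follower_ids) > self.max_queue_depth:
            # Backpressure: do not grow the backlog; serve this post at read time.
            self.pull_posts[author_id].append(post_id)
            return "deferred"
        # Normal accounts: push one timeline write per follower.
        for follower_id in follower_ids:
            self.queue.append((follower_id, post_id))
        return "push"


planner = FanOutPlanner()
# range() stands in for a follower list; len() on it is O(1).
celebrity_path = planner.publish(author_id=1, post_id=100, follower_ids=range(40_000_000))
regular_path = planner.publish(author_id=2, post_id=101, follower_ids=range(500))
print(celebrity_path, regular_path, len(planner.queue))
```

With uniform push fan-out, the first call would instead enqueue 40M timeline writes. In this sketch the celebrity post costs one write, and readers pay a merge cost at feed-read time.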

Live evidence

  • Trending · T+0

    @celebrity post went viral — 40M followers · fan-out queue depth 200× normal

  • Shard monitor · T+4m

    Shard-7 CPU 100% · write QPS 50× peers · hot key detected

  • Mobile errors · T+9m

    Feed load failures 34% — timeouts on timeline-read for affected users

Architecture

Team whiteboard (incomplete). The sketch is the team's draft from a design review, not a correct or complete architecture. It omits major runtime paths and components implied by the incident.

Impacted services

  • Timeline shard-7 · critical

    CPU 100%; write queue depth 2.1M

  • Fan-out workers · critical

    Backlogged; lag climbing

  • Feed read API · degraded

    p99 45s for affected users

  • Other shards · healthy

    CPU < 30% — skew visible
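
The skew in this status list (one shard at 100% CPU, peers under 30%) is the kind of signal a simple check can flag. The sketch below is one possible way to do it, not the shard monitor's actual logic: the function `find_hot_shards`, its thresholds, and the per-shard peer readings are illustrative assumptions chosen to match the "under 30%" figure.

```python
from statistics import mean


def find_hot_shards(cpu_by_shard, ratio_threshold=3.0, min_cpu=80.0):
    """Return shards that are busy in absolute terms and far above their peers.

    ratio_threshold and min_cpu are hypothetical tuning values.
    """
    hot = []
    for shard, cpu in cpu_by_shard.items():
        peers = [v for s, v in cpu_by_shard.items() if s != shard]
        if not peers:
            continue  # A single shard has nothing to be skewed against.
        if cpu >= min_cpu and cpu >= ratio_threshold * mean(peers):
            hot.append(shard)
    return hot


# Illustrative readings: shard-7 at 100%, peers under 30% as in the brief.
cpu_readings = {"shard-5": 22.0, "shard-6": 28.0, "shard-7": 100.0, "shard-8": 25.0}
hot_shards = find_hot_shards(cpu_readings)
print(hot_shards)  # ['shard-7']
```

Comparing each shard against the mean of its peers, rather than against a fixed CPU limit alone, separates a hot key from a fleet-wide load increase, where every shard would be busy and none would stand out.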