InterviewCrafted

One API key · 80% capacity · tenants starved · 429 storm

API Platform · Incident brief

The Rate Limiter That Let One Tenant DDoS the API

Problem statement

A misconfigured enterprise API key consumed 80% of gateway capacity, and other tenants hit 429s. Redis-backed counters drifted across gateway nodes, and a brief Redis blip caused the limiter to fail open, briefly removing all limits.

The whiteboard shows a single token bucket per key, with no tenant tiers or route-level limits.

  • One API key consumed 80% of gateway capacity for 45 minutes.
  • Other tenants saw 429 storms and elevated p99.
  • Redis limit drift between nodes caused uneven enforcement.
  • Fail-open on Redis blip briefly removed all limits.
  • No per-route limits: only a single global bucket per key.
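
The missing layers can be sketched as a limiter that checks three token buckets per request: one per API key, one per tenant (sized by tier), and one per tenant and route. This is a minimal in-process illustration, not the team's implementation. The class names, the tier names, and every rate below are assumptions, and it does not address counter drift between gateway nodes.

```python
import time


class TokenBucket:
    """Classic token bucket: refills at `rate` tokens/s up to `capacity`."""

    def __init__(self, rate, capacity, now):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full so a new caller may burst
        self.updated = now

    def refill(self, now):
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now


class LayeredLimiter:
    """Admit a request only if the key, tenant, and route buckets all have a token."""

    def __init__(self, key_limit, tier_limits, route_limits):
        self.key_limit = key_limit        # (rate, capacity) for each API key
        self.tier_limits = tier_limits    # tier name -> (rate, capacity) per tenant
        self.route_limits = route_limits  # route -> (rate, capacity) per tenant and route
        self.buckets = {}

    def _bucket(self, name, limit, now):
        if name not in self.buckets:
            rate, capacity = limit
            self.buckets[name] = TokenBucket(rate, capacity, now)
        return self.buckets[name]

    def allow(self, tenant, tier, api_key, route, now=None):
        now = time.monotonic() if now is None else now
        tier_limit = self.tier_limits[tier]
        layers = [
            self._bucket(("key", api_key), self.key_limit, now),
            self._bucket(("tenant", tenant), tier_limit, now),
            # Routes without an explicit limit fall back to the tenant's tier limit.
            self._bucket(("route", tenant, route),
                         self.route_limits.get(route, tier_limit), now),
        ]
        for bucket in layers:
            bucket.refill(now)
        if any(bucket.tokens < 1 for bucket in layers):
            return False  # reject without charging any layer
        for bucket in layers:
            bucket.tokens -= 1
        return True
```

With a tenant-level bucket in place, a single key can exhaust only its own tenant's allowance, so one runaway key cannot starve the other tenants. The route-level bucket additionally keeps one expensive endpoint from using up a tenant's whole budget.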

Live evidence

  • Tenant alert · T+10m

    Enterprise tenant AcmeCorp: 429 rate 0% while shared tier saturated

  • Grafana · T+18m

    Single tenant consuming 78% of gateway capacity — global limit never tripped

  • On-call note · T+25m

    Limit configured globally (10k rps) — no per-tenant bucket in config
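
The evidence explains why the global limit never tripped: 78% of a 10k rps budget is about 7,800 rps, which is below the limit even though it leaves only about 2,200 rps for every other tenant. One way to bound this is to derive a per-tenant cap from the global budget. The sketch below is illustrative only: the function name, the weights, the tenant names other than AcmeCorp, and the 40% ceiling are assumptions, not values from the incident.

```python
def tenant_caps(global_rps, weights, max_share_pct):
    """Split a global rps budget by weight, capping any one tenant's share.

    Integer arithmetic only; budget left over from capped tenants is not
    redistributed in this sketch.
    """
    total_weight = sum(weights.values())
    ceiling = global_rps * max_share_pct // 100
    return {
        tenant: min(global_rps * weight // total_weight, ceiling)
        for tenant, weight in weights.items()
    }


# Hypothetical weights: one enterprise tenant and two shared-tier tenants.
caps = tenant_caps(10_000, {"AcmeCorp": 8, "tenant-b": 1, "tenant-c": 1}, max_share_pct=40)
print(caps)  # {'AcmeCorp': 4000, 'tenant-b': 1000, 'tenant-c': 1000}
```

By weight alone AcmeCorp would receive 8,000 rps. The ceiling holds it to 4,000 rps, so a misconfigured key cannot take most of the gateway even when its tier is generous.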

Architecture

The sketch on your whiteboard is the team's incomplete draft from a design review, not a correct or complete architecture. It omits major runtime paths and components implied by the incident.

Impacted services

  • API Gateway · critical

    Unfair capacity allocation

  • Redis (limiter) · degraded

    Drift + brief outage fail-open

  • Other tenants · critical

    429 rate 22% on shared pool

  • Backend services · degraded

    Elevated p99 from overload
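
The Redis entry above combines two failure modes, drift and fail-open. For the second, one common alternative is to fall back to a conservative per-node limit when the shared store is unreachable, instead of admitting everything. The sketch below assumes a hypothetical `shared_check` callable standing in for the Redis-backed check, and uses Python's built-in `ConnectionError` as the failure signal. A real client would raise its own exception type, and the local limit is an arbitrary illustrative value.

```python
import time


class FallbackLimiter:
    """Use the shared limiter when reachable; otherwise enforce a local cap."""

    def __init__(self, shared_check, local_limit, window_seconds=1.0):
        self.shared_check = shared_check      # hypothetical: key -> bool, may raise
        self.local_limit = local_limit        # requests per key per window, per node
        self.window_seconds = window_seconds
        self.local_counts = {}                # key -> (window index, requests seen)

    def _local_allow(self, key, now):
        window = int(now // self.window_seconds)
        seen_window, count = self.local_counts.get(key, (window, 0))
        if seen_window != window:
            count = 0  # a new fixed window has started
        if count >= self.local_limit:
            self.local_counts[key] = (window, count)
            return False
        self.local_counts[key] = (window, count + 1)
        return True

    def allow(self, key, now=None):
        now = time.monotonic() if now is None else now
        try:
            return self.shared_check(key)
        except ConnectionError:
            # Degrade to a bounded local limit rather than failing open.
            return self._local_allow(key, now)
```

The local cap is per node, so total admitted traffic during an outage scales with the number of gateway nodes. Sizing it as a fraction of each key's normal limit keeps the worst case bounded.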