One API key · 80% capacity · tenants starved · 429 storm
API Platform · Incident brief
The Rate Limiter That Let One Tenant DDoS the API
Problem statement
A misconfigured enterprise API key consumed 80% of gateway capacity, and other tenants were hit with 429s. The Redis-backed counters drifted across gateway nodes, and a brief Redis blip made the limiter fail open, which lifted all limits until Redis recovered.
Whiteboard shows a single token bucket per key with no tenant tiers or route-level limits.
- One API key consumed 80% of gateway capacity for 45 minutes.
- Other tenants saw 429 storms and elevated p99.
- Redis-backed limit counters drifted between gateway nodes, causing uneven enforcement.
- Fail-open on Redis blip briefly removed all limits.
- No per-route limits — only per-key global bucket.
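The gaps listed above (one bucket per key, no tenant tiers, no per-route limits) can be illustrated with a minimal in-process sketch. It is not the gateway's actual limiter: the class names, tier names, routes, and rates are all hypothetical, and a production version would also need shared state across nodes. A request is admitted only if both the tenant's bucket and that tenant's bucket for the route have a token.

```python
import time
from typing import Callable, Dict, Hashable, Tuple

Limit = Tuple[float, float]  # (refill rate in tokens/sec, burst capacity)


class TokenBucket:
    """Token bucket that refills continuously at `rate` up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self.tokens = capacity
        self._updated = clock()

    def available(self, cost: float = 1.0) -> bool:
        # Refill based on elapsed time, then report whether `cost` fits.
        now = self._clock()
        elapsed = now - self._updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self._updated = now
        return self.tokens >= cost

    def take(self, cost: float = 1.0) -> None:
        self.tokens -= cost


class TieredLimiter:
    """Hypothetical two-level limiter: per-tenant bucket plus per-(tenant, route) bucket."""

    def __init__(self, tier_limits: Dict[str, Limit],
                 route_limits: Dict[str, Limit],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._tier_limits = tier_limits    # assumed tiers, e.g. "shared"
        self._route_limits = route_limits  # assumed routes, e.g. "/v1/items"
        self._clock = clock
        self._buckets: Dict[Hashable, TokenBucket] = {}

    def _bucket(self, key: Hashable, limit: Limit) -> TokenBucket:
        if key not in self._buckets:
            rate, capacity = limit
            self._buckets[key] = TokenBucket(rate, capacity, self._clock)
        return self._buckets[key]

    def allow(self, tenant: str, tier: str, route: str) -> bool:
        tenant_bucket = self._bucket(("tenant", tenant), self._tier_limits[tier])
        route_bucket = self._bucket(("route", tenant, route), self._route_limits[route])
        # Check both before spending, so a rejection never burns a token.
        if tenant_bucket.available() and route_bucket.available():
            tenant_bucket.take()
            route_bucket.take()
            return True
        return False
```

Because every tenant gets its own bucket, one key that exhausts its allowance is rejected without draining capacity that other tenants depend on. The route-level bucket additionally keeps a single expensive endpoint from absorbing a tenant's whole budget.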
Live evidence
- Tenant alert · T+10m
Enterprise tenant AcmeCorp: 429 rate 0% while shared tier saturated
- Grafana · T+18m
Single tenant consuming 78% of gateway capacity — global limit never tripped
- On-call note · T+25m
Limit configured globally (10k rps) — no per-tenant bucket in config
Architecture
Team whiteboard — incomplete. Missing paths implied by the incident.
The whiteboard sketch is the team's incomplete draft from a design review, not a correct or complete architecture. It omits major runtime paths and components implied by the incident.
Impacted services
- API Gateway · critical
Unfair capacity allocation
- Redis (limiter) · degraded
Drift + brief outage fail-open
- Other tenants · critical
429 rate 22% on shared pool
- Backend services · degraded
Elevated p99 from overload
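The Redis (limiter) entry above records a fail-open during a brief outage. One possible alternative, sketched here under stated assumptions, is to fall back to a conservative per-node limit while the shared store is unreachable, instead of removing limits entirely. The names `LimiterUnavailable`, `LocalFallbackLimiter`, and `FailSafeLimiter` are hypothetical, and `shared_check` stands in for whatever call the gateway makes to Redis. The brief does not describe that call, so it is modelled as a plain function that raises on failure.

```python
import time
from typing import Callable, Dict, Tuple


class LimiterUnavailable(Exception):
    """Hypothetical error raised when the shared limiter store cannot be reached."""


class LocalFallbackLimiter:
    """Per-node fixed-window counter, intended for use only during a store outage."""

    def __init__(self, limit_per_window: int, window_s: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._limit = limit_per_window
        self._window_s = window_s
        self._clock = clock
        self._counts: Dict[str, Tuple[int, int]] = {}  # key -> (window index, count)

    def allow(self, key: str) -> bool:
        window = int(self._clock() // self._window_s)
        seen_window, count = self._counts.get(key, (window, 0))
        if seen_window != window:
            # A new window has started, so the count resets.
            seen_window, count = window, 0
        if count >= self._limit:
            self._counts[key] = (seen_window, count)
            return False
        self._counts[key] = (seen_window, count + 1)
        return True


class FailSafeLimiter:
    """Use the shared check when it works; degrade to the local limit when it does not."""

    def __init__(self, shared_check: Callable[[str], bool],
                 fallback: LocalFallbackLimiter) -> None:
        self._shared_check = shared_check
        self._fallback = fallback

    def allow(self, key: str) -> bool:
        try:
            return self._shared_check(key)
        except LimiterUnavailable:
            # Degraded mode: enforce a local cap rather than admitting everything.
            return self._fallback.allow(key)
```

The fallback limit is a judgment call. A per-node cap set to roughly the key's normal limit divided by the number of gateway nodes would keep aggregate traffic near the intended ceiling, at the cost of rejecting some legitimate bursts during the outage.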