Gateway 502 · downstream 'healthy' · thread pools saturated · retry amplification
Edge Platform · Incident brief
The API Gateway That Dropped Half of Production Traffic
Gateway 502 · downstream 'healthy' · thread pools saturated · retry amplification
Problem statement
The API gateway returned 502 for 48% of requests over 22 minutes. Inventory service latency spiked but TCP health checks still passed. Shared thread pools saturated; client retries doubled load.
Whiteboard shows one gateway pool fanning out to all routes with no bulkheads or timeouts.
- Gateway returned 502 for 48% of requests over 22 minutes.
- Downstream health checks still passed (shallow TCP).
- Thread pools saturated; no queue limits.
- Client retries doubled effective load.
- One slow dependency blocked shared pool.
Live evidence
- Synthetic probeT+5m
502 rate 48% through gateway — upstream timeouts not propagated as backpressure
- DeployT+0
Inventory service deploy — thread pool shrink (undocumented in runbook)
- SRE bridgeT+12m
Gateway still accepting full traffic — no circuit breaker to failing upstream
Architecture
Team whiteboard — incomplete. Missing paths implied by the incident.
The sketch on your whiteboard is the team's incomplete draft from a design review — not a correct or complete architecture. It omits major runtime paths and components implied by the incident.
Impacted services
- API Gatewaycritical
502 rate 48%; pools exhausted
- Inventory servicedegraded
p99 12s; root slow dependency
- Orders servicedegraded
Collateral latency from shared pool
- All API clientscritical
Half of requests failed