InterviewCrafted

Gateway 502 · downstream 'healthy' · thread pools saturated · retry amplification

Edge Platform · Incident brief

The API Gateway That Dropped Half of Production Traffic

Gateway 502 · downstream 'healthy' · thread pools saturated · retry amplification

Problem statement

The API gateway returned 502 for 48% of requests over 22 minutes. Inventory service latency spiked but TCP health checks still passed. Shared thread pools saturated; client retries doubled load.

Whiteboard shows one gateway pool fanning out to all routes with no bulkheads or timeouts.

  • Gateway returned 502 for 48% of requests over 22 minutes.
  • Downstream health checks still passed (shallow TCP).
  • Thread pools saturated; no queue limits.
  • Client retries doubled effective load.
  • One slow dependency blocked shared pool.

Live evidence

  • Synthetic probeT+5m

    502 rate 48% through gateway — upstream timeouts not propagated as backpressure

  • DeployT+0

    Inventory service deploy — thread pool shrink (undocumented in runbook)

  • SRE bridgeT+12m

    Gateway still accepting full traffic — no circuit breaker to failing upstream

Architecture

Team whiteboard — incomplete. Missing paths implied by the incident.

The sketch on your whiteboard is the team's incomplete draft from a design review — not a correct or complete architecture. It omits major runtime paths and components implied by the incident.

Impacted services

  • API Gatewaycritical

    502 rate 48%; pools exhausted

  • Inventory servicedegraded

    p99 12s; root slow dependency

  • Orders servicedegraded

    Collateral latency from shared pool

  • All API clientscritical

    Half of requests failed