InterviewCrafted

Retry storm · double charge · 2% of checkout · support flood

Payment Ops · Incident brief

The Payment Service That Double-Charged Customers

Live evidence

  • Stripe webhook · T+5m

    Elevated capture timeouts (18%) — clients reporting 504 on /v1/capture

  • Finance Slack · T+45m

    Reconciliation batch: $240k duplicate captures across 2% of checkouts

  • Support queue · T+1h

    Ticket volume +400% — 'charged twice for same order'

Problem statement

After PSP timeouts, clients retried capture requests. 2% of checkouts were double-charged because the first capture often succeeded despite the timeout. Reconciliation found $240k in duplicates.

Architecture sketch shows sync PSP calls with no idempotency store and retries delegated to clients.
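A minimal sketch of what an idempotency store on the capture path could look like. It assumes the PSP honours a caller-supplied idempotency key, which the brief does not confirm; `FakePSP`, `CaptureService`, and the in-memory dictionaries are hypothetical stand-ins, not the team's actual components. The point is only that a retry carrying the same key should not produce a second charge, even when the first response was lost to a timeout.

```python
from typing import Dict


class FakePSP:
    """Hypothetical stand-in for a PSP that deduplicates on an idempotency key."""

    def __init__(self) -> None:
        self.charges: Dict[str, int] = {}  # idempotency_key -> amount_cents captured
        self.fail_next_response = False    # simulate "charged, but the reply timed out"

    def capture(self, idempotency_key: str, amount_cents: int) -> str:
        # The charge is applied at most once per key, even if the response is lost.
        self.charges.setdefault(idempotency_key, amount_cents)
        if self.fail_next_response:
            self.fail_next_response = False
            raise TimeoutError("capture applied, response timed out")
        return f"cap_{idempotency_key}"


class CaptureService:
    """Hypothetical capture endpoint backed by an idempotency store."""

    def __init__(self, psp: FakePSP) -> None:
        self.psp = psp
        self.completed: Dict[str, str] = {}  # idempotency store: key -> capture id

    def capture(self, idempotency_key: str, amount_cents: int) -> str:
        # A completed key is answered from the store without calling the PSP.
        if idempotency_key in self.completed:
            return self.completed[idempotency_key]
        # The same key is forwarded to the PSP, so a retry after a timeout
        # resolves to the original charge instead of creating a new one.
        capture_id = self.psp.capture(idempotency_key, amount_cents)
        self.completed[idempotency_key] = capture_id
        return capture_id
```

In a real service the store would need to be durable and shared across instances (for example a table with a unique constraint on the key); the in-memory dictionary here only illustrates the control flow.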

  • 2% of checkouts double-charged after PSP timeout + client retry.
  • Support tickets spiked 400% in 2 hours.
  • Retry storm increased PSP error rate to 18%.
  • No idempotency keys on capture endpoint.
  • Reconciliation batch found $240k in duplicate captures.
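The reconciliation finding above can be illustrated with a small duplicate-detection pass. This is a sketch under stated assumptions: the `Capture` record shape and the rule "same order and same amount captured more than once" are hypothetical, and the brief does not describe how the actual batch identifies duplicates.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Capture(NamedTuple):
    capture_id: str
    order_id: str
    amount_cents: int


def find_duplicate_captures(
    captures: Iterable[Capture],
) -> Dict[Tuple[str, int], List[Capture]]:
    """Group captures by (order_id, amount) and return every capture after the first."""
    groups: Dict[Tuple[str, int], List[Capture]] = defaultdict(list)
    for capture in captures:
        groups[(capture.order_id, capture.amount_cents)].append(capture)
    # The first capture in each group is treated as legitimate; the rest are duplicates.
    return {key: group[1:] for key, group in groups.items() if len(group) > 1}


def duplicate_total_cents(duplicates: Dict[Tuple[str, int], List[Capture]]) -> int:
    """Sum the amount that would need to be refunded."""
    return sum(c.amount_cents for group in duplicates.values() for c in group)
```

Grouping on amount as well as order is a deliberate simplification so that legitimate partial captures of different amounts are not flagged; a production check would likely also bound the time window.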

Architecture

The sketch on your whiteboard is the team's incomplete draft from a design review, not a correct or complete architecture. It omits major runtime paths and components implied by the incident.

Impacted services

  • Payment API · critical

    Duplicate captures; retry amplification

  • External PSP · degraded

    18% error rate under retry storm

  • PostgreSQL · degraded

    Lock contention on payment rows

  • Support / Finance · critical

    400% ticket spike; manual refunds
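The retry amplification listed for the Payment API and the PSP is the kind of load that bounded retries with exponential backoff and jitter are meant to dampen. The sketch below is one possible shape, not the team's client code: `retry_with_backoff` and its parameters are hypothetical, and it only reduces pressure on the PSP. It does not prevent double charges unless every attempt also reuses the same idempotency key.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.2,
    max_delay: float = 5.0,
    sleep: Callable[[float], object] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Retry `call` on TimeoutError with capped exponential backoff and full jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted; surface the failure
            # Upper bound doubles per attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: wait a random fraction of the ceiling so clients
            # that failed together do not retry in lockstep.
            sleep(rng() * ceiling)
    raise AssertionError("unreachable")
```

The `sleep` and `rng` parameters are injectable purely so the behaviour can be exercised without real delays.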