InterviewCrafted

Reconnect storm · duplicate messages · 12% dup rate · gateway overload

Messaging · Incident brief

The Chat System That Sent Duplicate Messages

Live evidence

  • Mobile crash report · T+6m

    iOS clients reconnecting in a loop after a brief network blip; resend-on-reconnect is enabled

  • Support · T+15m

    Users report duplicate messages in group chats after resuming the app

  • Kafka lag · T+22m

    Consumer lag spiked; dedup store lookups timing out under the reconnect flood
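The T+6m evidence describes iOS clients stuck in a reconnect loop with resend-on-reconnect enabled, so every client that dropped during the same blip retries at roughly the same instant. One common client-side mitigation, which the brief does not say the team used, is capped exponential backoff with full jitter. This is a minimal sketch under that assumption: `reconnect_delays` is a hypothetical helper, and its `base` and `cap` defaults are illustrative, not values from the incident. It only computes delays and opens no sockets.

```python
import random
from typing import List, Optional


def reconnect_delays(attempts: int, base: float = 0.5, cap: float = 30.0,
                     rng: Optional[random.Random] = None) -> List[float]:
    """Seconds to wait before each reconnect attempt (capped backoff, full jitter).

    Attempt n waits uniform(0, min(cap, base * 2**n)). The defaults for base
    and cap are illustrative assumptions, not values taken from the incident.
    """
    if rng is None:
        rng = random.Random()
    delays = []
    for n in range(attempts):
        ceiling = min(cap, base * 2 ** n)       # exponential growth, capped
        delays.append(rng.uniform(0, ceiling))  # full jitter spreads clients apart
    return delays


if __name__ == "__main__":
    # Ten clients dropped by the same blip pick different first-retry delays
    # instead of all reconnecting at the same moment.
    rng = random.Random(42)
    print(sorted(round(reconnect_delays(1, rng=rng)[0], 3) for _ in range(10)))
```

The jitter matters as much as the exponent: without it, clients that disconnected together stay synchronized and hit the gateway in waves.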

Problem statement

After a mobile network blip, clients reconnected and retried unacked messages. The duplicate delivery rate hit 12%. The gateway saw a reconnect storm, and the server never deduplicated on the client message_id.

The whiteboard shows WebSocket → chat service → DB, with no idempotency boundary or dedup store.
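A minimal sketch of the missing idempotency boundary, assuming each client generates a `client_message_id` once per message and reuses it on every resend. `DedupStore` and `ChatIngest` are hypothetical names, the 300-second TTL is an illustrative default, and the in-memory dict stands in for a shared store. In a real deployment the check-and-record step would have to be atomic across chat-service instances (for example a set-if-absent write with a TTL); this single-process sketch does not provide that.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple

Key = Tuple[str, str]  # (sender_id, client_message_id)


class DedupStore:
    """Remembers accepted messages for ttl_seconds (in-memory stand-in)."""

    def __init__(self, ttl_seconds: float = 300.0,  # illustrative TTL
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock
        self._seen: Dict[Key, Tuple[str, float]] = {}  # key -> (server_id, expires_at)

    def get(self, key: Key) -> Optional[str]:
        entry = self._seen.get(key)
        if entry is None:
            return None
        server_id, expires_at = entry
        if self.clock() >= expires_at:
            del self._seen[key]  # expired: treat as never seen
            return None
        return server_id

    def put(self, key: Key, server_id: str) -> None:
        self._seen[key] = (server_id, self.clock() + self.ttl)


class ChatIngest:
    """Hypothetical chat-service entry point, idempotent per client message."""

    def __init__(self, store: DedupStore) -> None:
        self.store = store
        self.persisted: List[Tuple[str, str, str]] = []  # stands in for the message DB
        self._next_id = 0

    def send(self, sender_id: str, client_message_id: str, body: str) -> str:
        key = (sender_id, client_message_id)
        existing = self.store.get(key)
        if existing is not None:
            # A resend of a message already accepted: re-ack, do not re-insert.
            return existing
        self._next_id += 1
        server_id = f"srv-{self._next_id}"
        self.persisted.append((server_id, sender_id, body))
        self.store.put(key, server_id)
        return server_id
```

Returning the original server id on a resend is deliberate: the client gets an ack for the message it already sent and can stop retrying, instead of seeing an error and resending again.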

  • Duplicate message rate hit 12% after a mobile network blip.
  • Reconnect storm overloaded the connection gateway.
  • No server-side dedup on client message_id.
  • At-least-once delivery without idempotent consumers.
  • Message store write contention on hot threads.
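Beyond any dedup cache in front of the chat service, the message store itself can enforce idempotency with a unique constraint, so a redelivered message (a client resend or an at-least-once consumer replay) cannot create a second row. This sketch uses Python's built-in `sqlite3` as a stand-in for the real store; the table layout, the `open_store` and `insert_message` helpers, and the `(sender_id, client_message_id)` key are assumptions, not the team's schema. The `ON CONFLICT ... DO NOTHING` clause requires SQLite 3.24 or newer.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS messages (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    thread_id TEXT NOT NULL,
    sender_id TEXT NOT NULL,
    client_message_id TEXT NOT NULL,
    body TEXT NOT NULL,
    UNIQUE (sender_id, client_message_id)  -- the idempotency boundary
)
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open a hypothetical message store and create the schema if needed."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def insert_message(conn: sqlite3.Connection, thread_id: str, sender_id: str,
                   client_message_id: str, body: str) -> int:
    """Store a message at most once; a redelivery returns the original row id."""
    with conn:  # commit on success, roll back on exception
        conn.execute(
            "INSERT INTO messages (thread_id, sender_id, client_message_id, body) "
            "VALUES (?, ?, ?, ?) "
            "ON CONFLICT (sender_id, client_message_id) DO NOTHING",
            (thread_id, sender_id, client_message_id, body),
        )
        row = conn.execute(
            "SELECT id FROM messages WHERE sender_id = ? AND client_message_id = ?",
            (sender_id, client_message_id),
        ).fetchone()
    return row[0]
```

The same pattern makes a consumer idempotent under at-least-once delivery: replaying an event with the same key leaves the table unchanged, so redelivery is harmless rather than visible in the UI.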

Architecture

Team whiteboard (incomplete; paths implied by the incident are missing)

The sketch on your whiteboard is the team's incomplete draft from a design review, not a correct or complete architecture. It omits major runtime paths and components implied by the incident.

Impacted services

  • WS Gateway · critical

    Connection storm; CPU 92%

  • Chat service · degraded

    Duplicate inserts; write contention

  • Message store · degraded

    Hot thread rows; lock wait time up

  • End users · critical

    12% of messages duplicated in the UI
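The WS Gateway entry (connection storm, CPU 92%) suggests the gateway accepted reconnects as fast as they arrived. One standard way to shed that load, not something the brief says the team used, is a token bucket in front of the WebSocket handshake: admit up to `capacity` reconnects in a burst, refill at `rate` per second, and reject the rest with a retry hint so clients back off. `TokenBucket` and its parameters are hypothetical; the clock is injectable so the behavior can be tested deterministically.

```python
import time
from typing import Callable


class TokenBucket:
    """Admit up to `capacity` reconnects at once, refilled at `rate` per second."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity  # start full so normal traffic is not throttled
        self.last = clock()

    def _refill(self) -> None:
        now = self.clock()
        # Add tokens for the time elapsed, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def try_acquire(self) -> bool:
        """True if this reconnect may proceed; False if it should be rejected."""
        self._refill()
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

    def retry_after(self) -> float:
        """Seconds until one token is available; a hint for rejected clients."""
        self._refill()
        return max(0.0, (1 - self.tokens) / self.rate)
```

Rejecting early at the handshake is cheaper than accepting the connection and letting it compete for CPU, and the retry hint pairs naturally with jittered client backoff.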