Reconnect storm · duplicate messages · 12% dup rate · gateway overload
Messaging · Incident brief
The Chat System That Sent Duplicate Messages
Live evidence
- Mobile crash report (T+6m): iOS clients reconnecting in a loop after a brief network blip; resend on reconnect is enabled.
- Support (T+15m): Users report duplicate messages in group chats after app resume.
- Kafka lag (T+22m): Consumer lag spiked; dedup store lookups are timing out under the reconnect flood.
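The reconnect loop in the first evidence item is the kind of behavior that capped exponential backoff with jitter is commonly used to dampen. The sketch below is illustrative only: the brief does not describe the client's actual retry logic, the real clients are iOS rather than Python, and `reconnect_delay`, `base`, and `cap` are hypothetical names and values.

```python
import random
from typing import Optional


def reconnect_delay(attempt: int, base: float = 0.5, cap: float = 30.0,
                    rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before reconnect attempt `attempt` (0-based).

    Hypothetical helper; base and cap are assumed values, not from the brief.
    """
    rng = rng if rng is not None else random.Random()
    # Exponential ceiling, capped so a long outage never produces huge waits.
    # The exponent is clamped to keep the arithmetic small for large attempts.
    ceiling = min(cap, base * (2 ** min(attempt, 16)))
    # Full jitter: choose uniformly in [0, ceiling] so that clients which
    # dropped at the same moment do not all reconnect at the same moment.
    return rng.uniform(0.0, ceiling)
```

A client would sleep for `reconnect_delay(n)` before its n-th consecutive reconnect attempt and reset `n` to zero once a connection stays up. Backoff only spreads the load; it does not remove duplicates, which still need server-side handling.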
Problem statement
After a mobile network blip, clients reconnected and retried unacked messages. Duplicate delivery rate hit 12%. The gateway saw a reconnect storm; the server never deduplicated on client message_id.
Whiteboard shows WebSocket → chat service → DB with no idempotency boundary or dedup store.
- Duplicate message rate hit 12% after mobile network blip.
- Reconnect storm overloaded connection gateway.
- No server-side dedup on client message_id.
- At-least-once delivery without idempotent consumers.
- Message store write contention on hot threads.
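One way to picture the missing idempotency boundary is a uniqueness constraint on the client-supplied `message_id`, so a resend after reconnect becomes a no-op at write time. The following is a minimal sketch under stated assumptions: it uses an in-memory SQLite table as a stand-in for the real message store, and the schema, the `(sender_id, message_id)` key, and the `accept_message` helper are all hypothetical, not taken from the team's design.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory stand-in for the message store (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE messages (
            thread_id  TEXT NOT NULL,
            sender_id  TEXT NOT NULL,
            message_id TEXT NOT NULL,  -- generated by the client, reused on resend
            body       TEXT NOT NULL,
            PRIMARY KEY (sender_id, message_id)
        )
        """
    )
    return conn


def accept_message(conn: sqlite3.Connection, thread_id: str, sender_id: str,
                   message_id: str, body: str) -> bool:
    """Store the message at most once. True if new, False if it was a resend."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO messages "
            "(thread_id, sender_id, message_id, body) VALUES (?, ?, ?, ?)",
            (thread_id, sender_id, message_id, body),
        )
    # rowcount is 1 when a row was inserted, 0 when the key already existed.
    return cur.rowcount == 1
```

In this sketch the server would acknowledge the client in both cases, so the retry stops, but fan the message out to recipients only when `accept_message` returns `True`. Putting the check in the write itself avoids a separate lookup before each insert; whether that suits the real store depends on details the brief does not give.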
Architecture
Team whiteboard — incomplete. Missing paths implied by the incident.
The sketch on your whiteboard is the team's incomplete draft from a design review — not a correct or complete architecture. It omits major runtime paths and components implied by the incident.
Impacted services
- WS Gateway (critical): Connection storm; CPU 92%
- Chat service (degraded): Duplicate inserts; write contention
- Message store (degraded): Hot thread rows; lock wait up
- End users (critical): 12% of messages duplicated in UI
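For the gateway's connection storm, one possible mitigation (not something the brief says the team has or plans) is to rate-limit new connection handshakes with a token bucket, so excess reconnects are rejected cheaply instead of driving CPU up. The class name, rate, and burst size below are assumptions chosen for illustration.

```python
import time
from typing import Callable


class ConnectionAdmitter:
    """Token bucket limiting how fast new connections are accepted (sketch)."""

    def __init__(self, rate_per_sec: float, burst: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_sec        # tokens added per second
        self.capacity = float(burst)    # maximum tokens the bucket can hold
        self.tokens = float(burst)      # start full so a normal burst is admitted
        self.clock = clock              # injectable so tests can control time
        self.last = clock()

    def try_admit(self) -> bool:
        """Return True and consume a token if a new connection may proceed."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A gateway using this would call `try_admit()` before completing each handshake and, on `False`, close the connection with a retry hint so that clients back off rather than loop.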