This is a story about how a simple refactor—splitting payment processing into two services—introduced a deadlock that froze 200 transactions during Black Friday checkout. It's about lock ordering, transaction design, and why concurrency bugs only appear when it hurts most.
Context
Architecture:
- Payment Service A: Charges the customer, updates the payments table, then the orders table
- Payment Service B: Validates inventory, updates the orders table, then the inventory table
- Same order: Both services touch the orders table, but at different points in their sequences
- Lock order: Service A takes payments → orders. Service B takes orders → inventory. When both process the same order, A locks payments and B locks orders, so A's next step blocks behind B. That is the setup for the deadlock.
The Bug:
Service A: BEGIN; UPDATE payments; UPDATE orders; COMMIT;
Service B: BEGIN; UPDATE orders; UPDATE inventory; COMMIT;
Order X: A holds the payments lock and waits for orders. B holds the orders lock and moves on to inventory.
B's transaction in turn needs a lock that A already holds, so each side waits on the other. Circular wait. Deadlock.
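The circular wait can be made concrete with a small wait-for graph, which is roughly how a database's deadlock detector reasons about blocked transactions. This is a minimal sketch rather than any database's actual implementation: `build_waits_for` and `find_wait_cycle` are hypothetical helpers, and the second scenario, where B waits on `payments`, is an assumption chosen to close the cycle, since the write-up only says B needed something A held.

```python
from typing import Dict, List, Optional, Set


def build_waits_for(held: Dict[str, Set[str]], wanted: Dict[str, str]) -> Dict[str, str]:
    """Map each blocked transaction to the transaction holding the lock it wants."""
    owner = {lock: txn for txn, locks in held.items() for lock in locks}
    return {
        txn: owner[lock]
        for txn, lock in wanted.items()
        if lock in owner and owner[lock] != txn
    }


def find_wait_cycle(waits_for: Dict[str, str]) -> Optional[List[str]]:
    """Return the transactions forming a circular wait, or None if there is none."""
    for start in waits_for:
        path: List[str] = []
        node = start
        # Follow "waits for" edges until we run out or revisit a transaction.
        while node in waits_for and node not in path:
            path.append(node)
            node = waits_for[node]
        if node in path:
            return path[path.index(node):]
    return None


# Locks held for order X, as described above.
held = {"A": {"payments"}, "B": {"orders"}}

# A waits on orders; B waits on a lock nobody holds: a wait chain, not a cycle.
chain = build_waits_for(held, {"A": "orders", "B": "inventory"})
# Assumed for illustration: B waits on something A holds, closing the cycle.
cycle = build_waits_for(held, {"A": "orders", "B": "payments"})

print(find_wait_cycle(chain))  # None
print(find_wait_cycle(cycle))  # ['A', 'B']
```

The first case shows that one transaction blocking behind another is only a delay. The deadlock needs the second edge, where the blocker is itself waiting on the blocked transaction.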
The Incident
- 10:00 AM: Black Friday. Checkout volume at 5x normal.
- 10:23 AM: First deadlock detected. 2 transactions killed by the DB.
- 10:25 AM: Deadlock rate spiking. 50 transactions blocked. Users stuck on 'Processing payment'.
- 10:28 AM: 200 transactions hung. Payment system effectively down.
- 10:31 AM: Identified deadlock. Emergency deploy: unified lock order across services.
- 10:39 AM: Fix deployed. Backlog cleared. 8 minutes of payment freeze. $40K in abandoned carts.
Root Cause Analysis
Primary Cause: Inconsistent lock ordering across services. Services A and B acquired locks in different orders when processing the same order, creating circular wait conditions.
Fix: Established global lock order: inventory → orders → payments. All services must acquire locks in this order. Documented in coding standards. Added integration test that simulates concurrent order processing.
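One way to enforce a global order in code, instead of only documenting it, is to route every multi-lock acquisition through a helper that sorts the requested resources by rank first. The sketch below uses in-process `threading.Lock` objects as stand-ins for database locks. `LOCK_ORDER`, `LOCKS`, and `acquire_in_order` are hypothetical names, and this is not the team's actual fix. In a database, the equivalent would be issuing the UPDATE statements in the agreed table order.

```python
import threading
from contextlib import contextmanager

# Global order from the fix: inventory -> orders -> payments.
LOCK_ORDER = ["inventory", "orders", "payments"]
RANK = {name: i for i, name in enumerate(LOCK_ORDER)}
# Stand-ins for database locks (an assumption for this sketch).
LOCKS = {name: threading.Lock() for name in LOCK_ORDER}


@contextmanager
def acquire_in_order(*names):
    """Acquire the named locks in global rank order, whatever order the caller listed."""
    # An unknown name raises KeyError here, before any lock is taken.
    ordered = sorted(set(names), key=RANK.__getitem__)
    acquired = []
    try:
        for name in ordered:
            LOCKS[name].acquire()
            acquired.append(name)
        yield ordered
    finally:
        # Release in reverse order of acquisition.
        for name in reversed(acquired):
            LOCKS[name].release()


# Service A lists payments first, but orders is still locked first.
with acquire_in_order("payments", "orders") as order_a:
    print(order_a)  # ['orders', 'payments']

# Service B lists orders first, but inventory is still locked first.
with acquire_in_order("orders", "inventory") as order_b:
    print(order_b)  # ['inventory', 'orders']
```

Because both callers now reach `orders` through the same ranking, neither can hold a higher-ranked lock while waiting for a lower-ranked one, which removes the circular wait.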
Key Lessons
- Lock ordering: When multiple resources are involved, always acquire locks in the same order everywhere. Document it.
- Deadlock detection: Databases can detect and kill one victim—but under load, deadlocks cascade. Prevent, don't rely on detection.
- Short transactions: The shorter the transaction, the smaller the deadlock window. Do validation before opening transactions.
- Test under concurrency: Unit tests won't find deadlocks. You need load tests that simulate concurrent access patterns.
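The last lesson can be sketched as a small concurrency test: run two workers at once, force them to overlap, and treat a lock wait that times out as a detected deadlock instead of letting the test hang. Everything here is illustrative. `take_both` and `run_pair` are hypothetical helpers, `threading.Lock` objects stand in for database locks, and the opposite-order pair in the first scenario is an assumed worst case, not the exact sequence from the incident.

```python
import threading


def take_both(first, second, rendezvous, timeout=1.0):
    """Take two locks in the given order; return False if the second wait times out."""
    with first:
        try:
            # Best effort: give the peer time to grab its first lock as well.
            rendezvous.wait(timeout=timeout)
        except threading.BrokenBarrierError:
            pass  # The peer is queued behind our first lock; carry on alone.
        if not second.acquire(timeout=timeout):
            return False  # Would have waited indefinitely: treat as a deadlock.
        second.release()
        return True


def run_pair(order_a, order_b):
    """Run workers A and B concurrently; report whether each one completed."""
    rendezvous = threading.Barrier(2)
    results = {}

    def run(name, order):
        results[name] = take_both(order[0], order[1], rendezvous)

    threads = [
        threading.Thread(target=run, args=("A", order_a)),
        threading.Thread(target=run, args=("B", order_b)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    payments, orders, inventory = (threading.Lock() for _ in range(3))
    # Opposite orders on a shared pair: at least one worker should time out.
    print(run_pair((payments, orders), (orders, payments)))
    # Global order inventory -> orders -> payments: both should complete.
    print(run_pair((orders, payments), (inventory, orders)))
```

A real integration test would run the actual service code against a test database and assert that no transaction is rolled back as a deadlock victim. The timeout-based check here only approximates that signal.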