Real Engineering Stories

The Race Condition That Only Happened in Production

A production bug where a race condition in inventory management caused items to be oversold. The bug was impossible to reproduce in testing but happened frequently in production. Learn about race conditions, distributed locks, and debugging production-only bugs.

Advanced · 30 min read

This is a story about a bug that only happened in production, that we couldn't reproduce in testing, and that cost us money and customer trust. It's also about why concurrency bugs are so hard to find, and how we learned to think about race conditions from the start.

Context

We were running an e-commerce platform with an inventory management system. When customers purchased items, we needed to check inventory, reserve items, and update stock levels. The system handled about 1M purchase requests per day.

Original Architecture:

Technology Choices:

API: Node.js with Express
Database: PostgreSQL with transactions
Inventory Service: Node.js microservice
Concurrency: Multiple API instances handling requests

Assumptions Made:

Database transactions would prevent race conditions
Inventory checks and updates would be atomic
High concurrency wouldn't cause issues

The Incident

Day 1

Feature deployed: real-time inventory updates

Day 3

First report of oversold item (1 item, dismissed as edge case)

Day 5

5 reports of oversold items (investigation started)

Day 7

20 reports of oversold items (bug confirmed)

Day 7, 2:00 PM

On-call engineer paged

Day 7, 2:30 PM

Attempted to reproduce bug (failed)

Day 7, 3:00 PM

Added logging to production

Day 7, 4:00 PM

Logs showed race condition pattern

Day 7, 5:00 PM

Identified race condition in inventory check

Day 7, 6:00 PM

Hotfix deployed (distributed lock)

Day 7, 7:00 PM

Bug fixed, but 50 items already oversold

Symptoms

What We Saw:

Oversold Items: Items sold beyond available inventory
Customer Complaints: Customers received "out of stock" notices after their purchase had already completed
Inventory Discrepancies: Database showed negative inventory
Error Rate: No application errors were raised; the failures were purely in business logic
User Impact: ~50 customers affected, refunds required

How We Detected It:

Customer support reports of oversold items
Inventory audit showed negative stock levels
Orders where payment succeeded but no stock remained to fulfill them

Monitoring Gaps:

No alert for negative inventory
No alert for oversold items
No logging of inventory check/update sequence
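
The first two gaps could be closed with a periodic invariant check. This is a minimal sketch, not the monitoring the team built: it assumes a node-postgres style `db.query` that resolves to `{ rows }`, and `alert` is a hypothetical callback standing in for whatever paging or chat integration is in use.

```javascript
// Returns every item whose stock has gone below zero.
async function findNegativeStock(db) {
  const { rows } = await db.query('SELECT id, stock FROM items WHERE stock < 0');
  return rows;
}

// Runs the check once and raises an alert if the invariant is broken.
// `alert` is a hypothetical callback: (message, details) => Promise<void>.
async function checkInventoryInvariant(db, alert) {
  const offenders = await findNegativeStock(db);
  if (offenders.length > 0) {
    await alert(`Negative inventory on ${offenders.length} item(s)`, offenders);
  }
  return offenders.length;
}
```

Run on a schedule, a check like this turns a silent business-logic failure into a page within minutes instead of days.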

Root Cause Analysis

Primary Cause: Race condition in inventory check and update.

The Bug:

// BAD CODE (simplified; assumes a node-postgres style client: $n placeholders, result.rows)
async function purchaseItem(userId, itemId, quantity) {
  // Step 1: Check inventory (plain SELECT, takes no lock)
  const { rows } = await db.query('SELECT stock FROM items WHERE id = $1', [itemId]);
  const item = rows[0];

  if (!item || item.stock < quantity) {
    throw new Error('Insufficient stock');
  }

  // Step 2: Process payment (takes 2 seconds)
  await processPayment(userId, itemId, quantity);

  // Step 3: Update inventory (race condition: stock may have changed since Step 1)
  await db.query('UPDATE items SET stock = stock - $1 WHERE id = $2', [quantity, itemId]);
}

What Happened:

Two requests arrive simultaneously for the last item in stock
Both requests check inventory at the same time (both see stock = 1)
Both requests pass the inventory check
Both requests process payment (both succeed)
Both requests update inventory (stock becomes -1)
Result: Item oversold, negative inventory
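
The sequence above can be reproduced without a database. The following sketch is an illustration only, not the production code: it swaps the `items` table for an in-memory `Map` and the payment call for a timer, so the names `purchaseUnsafe` and `reproduceRace` and the 20 ms delay are all assumptions.

```javascript
// Simulated payment: resolves after a delay, like a slow external call.
const simulatePayment = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Same shape as the buggy handler: check, pay, then decrement.
async function purchaseUnsafe(stock, itemId, quantity, paymentMs = 20) {
  if (stock.get(itemId) < quantity) {
    throw new Error('Insufficient stock');
  }
  // While this request waits on payment, other requests run their own check.
  await simulatePayment(paymentMs);
  stock.set(itemId, stock.get(itemId) - quantity);
}

// Fires two purchases for the last unit at the same time.
async function reproduceRace() {
  const stock = new Map([['sku-1', 1]]); // in-memory stand-in for the items table
  const results = await Promise.allSettled([
    purchaseUnsafe(stock, 'sku-1', 1),
    purchaseUnsafe(stock, 'sku-1', 1),
  ]);
  return {
    succeeded: results.filter((r) => r.status === 'fulfilled').length,
    finalStock: stock.get('sku-1'),
  };
}
```

Calling `reproduceRace()` resolves to `{ succeeded: 2, finalStock: -1 }`: both requests pass the check before either one writes, which is the oversell from the incident in miniature.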

Why It Was So Bad:

No locking: Inventory check and update weren't atomic
Payment before inventory update: Payment processed before inventory reserved
High concurrency: Bug only appeared under load
Hard to reproduce: It required overlapping requests for the same item, which our tests did not generate

Contributing Factors:

No distributed locking mechanism
Payment processing took 2 seconds (window for race condition)
Multiple API instances handling concurrent requests
No transaction isolation for inventory operations
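
The last factor explains why the "transactions will protect us" assumption failed: under PostgreSQL's default READ COMMITTED isolation, a plain SELECT takes no row lock, so two transactions can both read stock = 1. One database-level alternative is `SELECT ... FOR UPDATE`, which makes the second transaction wait for the first. The sketch below assumes a node-postgres style client; `reserveWithRowLock` is a hypothetical name, and the source does not say the team used this approach.

```javascript
// `client` must be a single connection (for example from pool.connect() in
// node-postgres), because a transaction is bound to one connection.
async function reserveWithRowLock(client, itemId, quantity) {
  await client.query('BEGIN');
  try {
    // FOR UPDATE locks the row until COMMIT or ROLLBACK, so a concurrent
    // transaction running the same SELECT blocks here instead of reading stale stock.
    const { rows } = await client.query(
      'SELECT stock FROM items WHERE id = $1 FOR UPDATE',
      [itemId]
    );
    if (rows.length === 0 || rows[0].stock < quantity) {
      await client.query('ROLLBACK');
      return false;
    }
    await client.query('UPDATE items SET stock = stock - $1 WHERE id = $2', [quantity, itemId]);
    await client.query('COMMIT');
    return true;
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  }
}
```

The transaction is kept short on purpose: holding a row lock across a multi-second payment call would serialize every purchase of that item.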

Fix & Mitigation

Immediate Fix:

// FIXED CODE (assumes a node-postgres style client: $n placeholders, result.rowCount)
async function purchaseItem(userId, itemId, quantity) {
  // Use a distributed lock to serialize purchases of this item across API instances
  const lock = await acquireLock(`inventory:${itemId}`);

  try {
    // Check and update inventory atomically: the WHERE clause is the stock check
    const result = await db.query(
      'UPDATE items SET stock = stock - $1 WHERE id = $2 AND stock >= $3',
      [quantity, itemId, quantity]
    );

    // No row matched: the item is missing or has too little stock
    if (result.rowCount === 0) {
      throw new Error('Insufficient stock');
    }

    // Process payment (inventory already reserved)
    await processPayment(userId, itemId, quantity);

  } finally {
    await releaseLock(lock);
  }
}
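
Reserving before payment raises one case the fix above does not show: what happens to the reserved units if the payment then fails. The sketch below is one possible way to handle it and is not described in the original hotfix. It assumes a node-postgres style `db.query` resolving to `{ rowCount }`, injects `processPayment` so it can be stubbed, and leaves the lock out to keep the example short; `purchaseWithCompensation` is a hypothetical name.

```javascript
async function purchaseWithCompensation({ db, processPayment }, userId, itemId, quantity) {
  // Reserve stock with the same atomic UPDATE pattern as the fix above.
  const reserved = await db.query(
    'UPDATE items SET stock = stock - $1 WHERE id = $2 AND stock >= $3',
    [quantity, itemId, quantity]
  );
  if (reserved.rowCount === 0) {
    throw new Error('Insufficient stock');
  }

  try {
    await processPayment(userId, itemId, quantity);
  } catch (err) {
    // Payment failed after the reservation: put the units back, then rethrow.
    await db.query('UPDATE items SET stock = stock + $1 WHERE id = $2', [quantity, itemId]);
    throw err;
  }
}
```

Without the compensating update, every declined card would quietly remove sellable stock, trading an oversell problem for an undersell one.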

Long-Term Improvements:

Distributed Locking:
- Implemented Redis-based distributed locks
- Added a lock timeout so a crashed lock holder cannot block other requests indefinitely
- Added lock acquisition retry logic
Atomic Operations:
- Changed inventory update to atomic SQL (UPDATE with WHERE condition)
- Moved inventory check into update query
- Added optimistic locking with version numbers
Monitoring & Alerting:
- Added alert for negative inventory
- Added alert for oversold items
- Added logging of inventory operations
Process Improvements:
- Added concurrency testing to CI/CD
- Added load testing with race condition scenarios
- Created runbook for concurrency bugs
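
The distributed locking items can be sketched as follows. This is an illustrative single-Redis-node lock, not the team's implementation: the `redis` argument is assumed to follow node-redis v4 call shapes (`set` with `{ NX, PX }` options, `eval` with `{ keys, arguments }`), and the option names `ttlMs`, `retries`, and `retryDelayMs` are hypothetical. Unlike the `acquireLock(key)` call in the fix, the client is passed in explicitly so the sketch stays self-contained.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Lua script: delete the key only if it still holds our token, so a slow
// holder whose lock already expired cannot release someone else's lock.
const RELEASE_SCRIPT = `
if redis.call("get", KEYS[1]) == ARGV[1] then
  return redis.call("del", KEYS[1])
end
return 0`;

async function acquireLock(redis, key, { ttlMs = 5000, retries = 10, retryDelayMs = 50 } = {}) {
  // Token identifying this holder; unique enough for a sketch.
  const token = `${process.pid}-${Date.now()}-${Math.random().toString(36).slice(2)}`;
  for (let attempt = 0; attempt <= retries; attempt++) {
    // NX: set only if the key is absent. PX: expire after ttlMs, which is the
    // lock timeout that frees the lock if the holder crashes.
    const ok = await redis.set(key, token, { NX: true, PX: ttlMs });
    if (ok === 'OK') return { key, token };
    if (attempt < retries) await sleep(retryDelayMs);
  }
  throw new Error(`Could not acquire lock ${key}`);
}

async function releaseLock(redis, lock) {
  const released = await redis.eval(RELEASE_SCRIPT, {
    keys: [lock.key],
    arguments: [lock.token],
  });
  return released === 1;
}
```

The timeout has to exceed the longest expected critical section, otherwise the lock can expire while its holder is still working. That is one reason to keep the atomic SQL update in place as a second line of defense.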

Architecture After Fix

Key Changes:

Distributed locking for inventory operations
Atomic SQL updates (check and update in one query)
Inventory reserved before payment
Lock monitoring and alerting

Key Lessons

Race conditions are timing-dependent: They only appear under specific concurrency conditions, making them hard to reproduce and test.
Use distributed locks: For distributed systems, use distributed locks (Redis, etcd) to prevent race conditions across instances.
Make operations atomic: Use atomic SQL operations (UPDATE with WHERE) instead of check-then-update patterns.
Reserve before processing: Reserve inventory before processing payment, not after. This prevents overselling.
Test under load: Race conditions are far more likely to surface under high concurrency. Load test with concurrent requests against the same items.

Interview Takeaways

Common Questions:

"What is a race condition?"
"How do you prevent race conditions?"
"How do you debug production-only bugs?"

What Interviewers Are Looking For:

Understanding of race conditions and concurrency
Knowledge of distributed locking mechanisms
Experience with debugging production issues
Awareness of atomic operations

What a Senior Engineer Would Do Differently

From the Start:

Use atomic operations: UPDATE with WHERE condition instead of check-then-update
Add distributed locks: Use Redis locks for critical sections across instances
Reserve before processing: Reserve inventory before payment, not after
Add logging: Log all inventory operations to debug race conditions
Test under load: Load test with concurrent requests to catch race conditions

The Real Lesson: Race conditions are invisible until they're not. Design for concurrency from the start: use locks and atomic operations, and test under load.

FAQs

Q: What is a race condition?

A: A race condition occurs when the outcome depends on the timing of events. Two or more operations access shared data concurrently, and the result depends on which operation completes first.

Q: How do you prevent race conditions?

A: Use locks (distributed locks when several service instances share the data), atomic operations (SQL UPDATE with a WHERE condition), or a stricter transaction isolation level. Design operations to be atomic from the start.

Q: Why are race conditions hard to reproduce?

A: Race conditions depend on exact timing of concurrent operations. They only appear under specific concurrency conditions, making them hard to reproduce in testing.

Q: How do you debug production-only race conditions?

A: Add extensive logging, use distributed tracing, monitor for patterns (like negative inventory), and use load testing to reproduce the conditions.

Q: Should you always use locks?

A: Not always. Locks add latency and complexity. Use atomic operations when possible, locks when necessary. Consider optimistic locking for low-contention scenarios.
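
As an illustration of the optimistic option, here is a minimal sketch. It assumes a `version` integer column on `items` and a node-postgres style `db.query` resolving to `{ rows, rowCount }`; the function name `reserveOptimistic` and the `maxAttempts` parameter are hypothetical.

```javascript
async function reserveOptimistic(db, itemId, quantity, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    // Read without locking and remember the version we saw.
    const { rows } = await db.query(
      'SELECT stock, version FROM items WHERE id = $1',
      [itemId]
    );
    if (rows.length === 0 || rows[0].stock < quantity) return false;

    // Write only if nobody changed the row since the read.
    const result = await db.query(
      'UPDATE items SET stock = stock - $1, version = version + 1 WHERE id = $2 AND version = $3',
      [quantity, itemId, rows[0].version]
    );
    if (result.rowCount === 1) return true;
    // Version moved: another writer won the race, so loop and re-read.
  }
  throw new Error('Too many concurrent updates, giving up');
}
```

No request ever waits on another, which is why this suits low contention; under heavy contention most attempts fail the version check and retry, and a lock or a single atomic UPDATE tends to be cheaper.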

Q: What's the difference between a race condition and a deadlock?

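A: A race condition is when the result depends on the order in which concurrent operations run, so they can corrupt shared state. A deadlock is when two or more operations each wait for a resource the other holds, so none can proceed. Both are concurrency problems, but a race produces wrong results while a deadlock produces no progress.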

Q: How do you test for race conditions?

A: Load test with concurrent requests, use chaos engineering to introduce timing variations, and add logging to detect race condition patterns. Consider formal verification for critical systems.
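
A concurrency test does not need a full load-testing setup to be useful. The sketch below fires many simulated buyers at a small in-memory inventory and reports the invariant that matters: units sold never exceed initial stock. Everything here (`createInventory`, `runConcurrencyTest`, the random jitter) is an illustrative assumption; a real CI test would call the actual purchase endpoint instead.

```javascript
// In-memory inventory whose check and decrement happen with no await in between,
// so no other request can interleave between them.
function createInventory(initial) {
  const stock = new Map(Object.entries(initial));
  return {
    tryReserve(itemId, quantity) {
      const current = stock.get(itemId) ?? 0;
      if (current < quantity) return false;
      stock.set(itemId, current - quantity);
      return true;
    },
    get: (itemId) => stock.get(itemId) ?? 0,
  };
}

// Random delay so arrival order varies between runs.
const jitter = () => new Promise((resolve) => setTimeout(resolve, Math.random() * 5));

async function purchase(inventory, itemId) {
  await jitter();
  if (!inventory.tryReserve(itemId, 1)) throw new Error('Insufficient stock');
  await jitter(); // simulated payment
}

// Launches `buyers` concurrent purchases against `initialStock` units.
async function runConcurrencyTest({ buyers, initialStock }) {
  const inventory = createInventory({ 'sku-1': initialStock });
  const results = await Promise.allSettled(
    Array.from({ length: buyers }, () => purchase(inventory, 'sku-1'))
  );
  const sold = results.filter((r) => r.status === 'fulfilled').length;
  return { sold, remaining: inventory.get('sku-1') };
}
```

With 50 buyers and 10 units, a correct implementation reports 10 sold and 0 remaining on every run; swapping in a check-then-update purchase function makes the same test fail, which is the signal to look for in CI.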
