Real Engineering Stories

The Race Condition That Only Happened in Production

A production bug where a race condition in inventory management caused items to be oversold. The bug was impossible to reproduce in testing but happened frequently in production. Learn about race conditions, distributed locks, and debugging production-only bugs.

Advanced · 30 min read

This is a story about a bug that only happened in production, that we couldn't reproduce in testing, and that cost us money and customer trust. It's also about why concurrency bugs are so hard to find, and how we learned to think about race conditions from the start.


Context

We were running an e-commerce platform with an inventory management system. When customers purchased items, we needed to check inventory, reserve items, and update stock levels. The system handled about 1M purchase requests per day.

Original Architecture:

graph TB
    Client[Client] --> API[API Server]
    API --> Inventory[Inventory Service]
    Inventory --> DB[(Database)]
    API --> Payment[Payment Service]

Technology Choices:

  • API: Node.js with Express
  • Database: PostgreSQL with transactions
  • Inventory Service: Node.js microservice
  • Concurrency: Multiple API instances handling requests

Assumptions Made:

  • Database transactions would prevent race conditions
  • Inventory checks and updates would be atomic
  • High concurrency wouldn't cause issues

The Incident

Timeline:

  • Day 1: Feature deployed: real-time inventory updates
  • Day 3: First report of oversold item (1 item, dismissed as edge case)
  • Day 5: 5 reports of oversold items (investigation started)
  • Day 7: 20 reports of oversold items (bug confirmed)
  • Day 7, 2:00 PM: On-call engineer paged
  • Day 7, 2:30 PM: Attempted to reproduce bug (failed)
  • Day 7, 3:00 PM: Added logging to production
  • Day 7, 4:00 PM: Logs showed race condition pattern
  • Day 7, 5:00 PM: Identified race condition in inventory check
  • Day 7, 6:00 PM: Hotfix deployed (distributed lock)
  • Day 7, 7:00 PM: Bug fixed, but 50 items already oversold

Symptoms

What We Saw:

  • Oversold Items: Items sold beyond available inventory
  • Customer Complaints: Customers received "out of stock" notices after completing a purchase
  • Inventory Discrepancies: Database showed negative inventory
  • Error Rate: No application errors were raised; the failures were in business logic only
  • User Impact: ~50 customers affected, refunds required

How We Detected It:

  • Customer support reports of oversold items
  • Inventory audit showed negative stock levels
  • Payment succeeded but inventory check failed

Monitoring Gaps:

  • No alert for negative inventory
  • No alert for oversold items
  • No logging of inventory check/update sequence
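
The first gap could be closed with a periodic invariant check. The following is a minimal sketch, assuming inventory rows shaped like `{ id, stock }`. The helper name `findNegativeStock` and the row shape are illustrative and not taken from the original system.

```javascript
// Hypothetical helper: returns the rows that violate the "stock >= 0" invariant.
// In a real system the rows would come from the items table, for example
// via a query that selects id and stock for every row with stock < 0.
function findNegativeStock(rows) {
  return rows.filter((row) => row.stock < 0);
}

// Example input: one healthy item and one oversold item.
const violations = findNegativeStock([
  { id: 'sku-1', stock: 4 },
  { id: 'sku-2', stock: -1 },
]);

if (violations.length > 0) {
  // Stand-in for paging on-call or posting to an alerting channel.
  console.warn('Negative inventory detected:', violations.map((row) => row.id));
}
```

Run on a schedule, a check like this would have flagged the first oversold item instead of waiting for customer reports.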

Root Cause Analysis

Primary Cause: Race condition in inventory check and update.

The Bug:

// BAD CODE (simplified; assumes db is a node-postgres Pool)
async function purchaseItem(userId, itemId, quantity) {
  // Step 1: Check inventory (not locked, so the value can be stale by Step 3)
  const { rows } = await db.query('SELECT stock FROM items WHERE id = $1', [itemId]);
  const item = rows[0];

  if (!item || item.stock < quantity) {
    throw new Error('Insufficient stock');
  }

  // Step 2: Process payment (takes 2 seconds)
  await processPayment(userId, itemId, quantity);

  // Step 3: Update inventory (race condition here: no re-check of stock)
  await db.query('UPDATE items SET stock = stock - $1 WHERE id = $2', [quantity, itemId]);
}

What Happened:

  1. Two requests arrive simultaneously for the last item in stock
  2. Both requests check inventory at the same time (both see stock = 1)
  3. Both requests pass the inventory check
  4. Both requests process payment (both succeed)
  5. Both requests update inventory (stock becomes -1)
  6. Result: Item oversold, negative inventory
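
The interleaving above can be reproduced without a database. This is a minimal sketch, assuming an in-memory `Map` in place of the items table and a 50 ms delay in place of the 2-second payment call. `purchaseItemRacy` and `demoRace` are illustrative names.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Simulated payment call; 50 ms stands in for the ~2 s in the story.
async function processPayment() {
  await sleep(50);
}

// Same check-then-update shape as the buggy purchaseItem,
// but against an in-memory store instead of PostgreSQL.
async function purchaseItemRacy(store, itemId, quantity) {
  const item = store.get(itemId);
  if (item.stock < quantity) {
    throw new Error('Insufficient stock');
  }
  await processPayment(); // other requests run while this one waits
  item.stock -= quantity; // decrement based on a check that is now stale
}

async function demoRace() {
  const store = new Map([['widget', { stock: 1 }]]);
  const results = await Promise.allSettled([
    purchaseItemRacy(store, 'widget', 1),
    purchaseItemRacy(store, 'widget', 1),
  ]);
  return {
    succeeded: results.filter((r) => r.status === 'fulfilled').length,
    finalStock: store.get('widget').stock,
  };
}

demoRace().then((outcome) => console.log(outcome));
// logs { succeeded: 2, finalStock: -1 }
```

Both calls pass the stock check before either one writes, so both succeed and the stock ends at -1, matching steps 2 to 5.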

Why It Was So Bad:

  • No locking: Inventory check and update weren't atomic
  • Payment before inventory update: Payment processed before inventory reserved
  • High concurrency: Bug only appeared under load
  • Hard to reproduce in testing: Required two requests for the same item to overlap between the check and the update, which never happened in our test environment

Contributing Factors:

  • No distributed locking mechanism
  • Payment processing took 2 seconds (window for race condition)
  • Multiple API instances handling concurrent requests
  • No transaction or row lock around the inventory check and update

Fix & Mitigation

Immediate Fix:

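// FIXED CODE (assumes db is a node-postgres Pool)
async function purchaseItem(userId, itemId, quantity) {
  // Use distributed lock to prevent race conditions
  const lock = await acquireLock(`inventory:${itemId}`);

  try {
    // Check and update inventory atomically: the WHERE clause is the stock check
    const result = await db.query(
      'UPDATE items SET stock = stock - $1 WHERE id = $2 AND stock >= $3',
      [quantity, itemId, quantity]
    );

    // rowCount is 0 when no row matched, i.e. there was not enough stock
    if (result.rowCount === 0) {
      throw new Error('Insufficient stock');
    }

    try {
      // Process payment (inventory already reserved)
      await processPayment(userId, itemId, quantity);
    } catch (err) {
      // Payment failed: return the reserved stock so it can be sold again
      await db.query('UPDATE items SET stock = stock + $1 WHERE id = $2', [quantity, itemId]);
      throw err;
    }
  } finally {
    await releaseLock(lock);
  }
}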
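
The property that the conditional UPDATE relies on can be checked without a database. This is a minimal sketch, assuming an in-memory `Map` in place of the items table. `reserveStock` mimics an UPDATE whose WHERE clause requires enough stock, so the check and the write happen in one step. `releaseStock` is an illustrative compensation step for a failed payment. All names here are hypothetical.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Mimics a conditional UPDATE: decrement only if enough stock remains.
// Returns the number of "rows" changed: 1 on success, 0 otherwise.
function reserveStock(store, itemId, quantity) {
  const item = store.get(itemId);
  if (!item || item.stock < quantity) return 0;
  item.stock -= quantity; // no await between the check and the write
  return 1;
}

// Illustrative compensation: give the stock back after a failed payment.
function releaseStock(store, itemId, quantity) {
  store.get(itemId).stock += quantity;
}

async function purchaseItem(store, itemId, quantity, processPayment) {
  if (reserveStock(store, itemId, quantity) === 0) {
    throw new Error('Insufficient stock');
  }
  try {
    await processPayment(); // stock is already reserved at this point
  } catch (err) {
    releaseStock(store, itemId, quantity);
    throw err;
  }
}

// Two concurrent buyers compete for the last item.
async function demoAtomic() {
  const store = new Map([['widget', { stock: 1 }]]);
  const pay = () => sleep(50);
  const results = await Promise.allSettled([
    purchaseItem(store, 'widget', 1, pay),
    purchaseItem(store, 'widget', 1, pay),
  ]);
  return {
    succeeded: results.filter((r) => r.status === 'fulfilled').length,
    finalStock: store.get('widget').stock,
  };
}
```

In this sketch `demoAtomic()` resolves with one success and a final stock of 0. In PostgreSQL the same guarantee comes from the row lock that the UPDATE statement takes while it evaluates and applies the change.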

Long-Term Improvements:

  1. Distributed Locking:

    • Implemented Redis-based distributed locks
    • Added lock timeout so that a crashed or stalled lock holder cannot block other requests indefinitely
    • Added lock acquisition retry logic
  2. Atomic Operations:

    • Changed inventory update to atomic SQL (UPDATE with WHERE condition)
    • Moved inventory check into update query
    • Added optimistic locking with version numbers
  3. Monitoring & Alerting:

    • Added alert for negative inventory
    • Added alert for oversold items
    • Added logging of inventory operations
  4. Process Improvements:

    • Added concurrency testing to CI/CD
    • Added load testing with race condition scenarios
    • Created runbook for concurrency bugs
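
The locking behaviour in item 1 can be sketched as follows. This is not the production implementation. `FakeLockStore` is a hypothetical in-memory stand-in for Redis: `setIfAbsent` plays the role of the Redis command `SET key value NX PX ttl`, and `deleteIfEquals` plays the role of a compare-and-delete script. The option names and defaults are assumptions.

```javascript
const { randomUUID } = require('node:crypto');

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical in-memory stand-in for a Redis server.
class FakeLockStore {
  constructor() {
    this.entries = new Map();
  }

  // Stores the value only if the key is absent or its TTL has expired.
  setIfAbsent(key, value, ttlMs) {
    const now = Date.now();
    const current = this.entries.get(key);
    if (current && current.expiresAt > now) return false; // still held
    this.entries.set(key, { value, expiresAt: now + ttlMs });
    return true;
  }

  // Deletes the key only if it still holds the caller's token.
  deleteIfEquals(key, value) {
    const current = this.entries.get(key);
    if (current && current.value === value) {
      this.entries.delete(key);
      return true;
    }
    return false;
  }
}

// Tries to take the lock, retrying a bounded number of times.
async function acquireLock(store, key, { ttlMs = 5000, retries = 20, retryDelayMs = 25 } = {}) {
  const token = randomUUID(); // identifies this holder
  for (let attempt = 0; attempt <= retries; attempt += 1) {
    if (store.setIfAbsent(key, token, ttlMs)) return { key, token };
    await sleep(retryDelayMs);
  }
  throw new Error(`Could not acquire lock ${key}`);
}

// Releases the lock only if this holder still owns it.
async function releaseLock(store, lock) {
  return store.deleteIfEquals(lock.key, lock.token);
}
```

The token matters because of the timeout: if a holder's lock expires and another request takes it over, the late release from the first holder must not delete the new holder's lock.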

Architecture After Fix

graph TB
    Client[Client] --> API[API Server]
    API --> Lock[Distributed Lock<br/>Redis]
    API --> Inventory[Inventory Service]
    Inventory --> DB[(Database<br/>Atomic Updates)]
    API --> Payment[Payment Service]
    Lock --> Monitor[Lock Monitoring]

Key Changes:

  • Distributed locking for inventory operations
  • Atomic SQL updates (check and update in one query)
  • Inventory reserved before payment
  • Lock monitoring and alerting

Key Lessons

  1. Race conditions are timing-dependent: They only appear under specific concurrency conditions, making them hard to reproduce and test.

  2. Use distributed locks: For distributed systems, use distributed locks (Redis, etcd) to prevent race conditions across instances.

  3. Make operations atomic: Use atomic SQL operations (UPDATE with WHERE) instead of check-then-update patterns.

  4. Reserve before processing: Reserve inventory before processing payment, not after. This prevents overselling.

  5. Test under load: Race conditions are far more likely to surface under high concurrency. Load test with concurrent requests against the same items.


Interview Takeaways

Common Questions:

  • "What is a race condition?"
  • "How do you prevent race conditions?"
  • "How do you debug production-only bugs?"

What Interviewers Are Looking For:

  • Understanding of race conditions and concurrency
  • Knowledge of distributed locking mechanisms
  • Experience with debugging production issues
  • Awareness of atomic operations

What a Senior Engineer Would Do Differently

From the Start:

  1. Use atomic operations: UPDATE with WHERE condition instead of check-then-update
  2. Add distributed locks: Use Redis locks for critical sections across instances
  3. Reserve before processing: Reserve inventory before payment, not after
  4. Add logging: Log all inventory operations to debug race conditions
  5. Test under load: Load test with concurrent requests to catch race conditions

The Real Lesson: Race conditions are invisible until they're not. Design for concurrency from the start: use locks and atomic operations, and test under load.


FAQs

Q: What is a race condition?

A: A race condition occurs when the outcome depends on the timing of events. Two or more operations access shared data concurrently, and the result depends on which operation completes first.

Q: How do you prevent race conditions?

A: Use locks (distributed locks when several service instances share the data), atomic operations (a conditional SQL UPDATE with WHERE), or a suitable transaction isolation level. Design operations to be atomic from the start.

Q: Why are race conditions hard to reproduce?

A: Race conditions depend on exact timing of concurrent operations. They only appear under specific concurrency conditions, making them hard to reproduce in testing.

Q: How do you debug production-only race conditions?

A: Add extensive logging, use distributed tracing, monitor for patterns (like negative inventory), and use load testing to reproduce the conditions.
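
One way to spot the pattern in logs is to look for requests on the same item whose check-to-update windows overlap. The following is a minimal sketch, assuming each log entry has the shape `{ requestId, itemId, step, at }`, where `step` is `'check'` or `'update'` and `at` is a millisecond timestamp. The log format and the function name are assumptions for illustration.

```javascript
// Returns pairs of request IDs that touched the same item with
// overlapping [check, update] windows, the signature of this race.
function findOverlappingPurchases(entries) {
  // Build one window per request from its check and update entries.
  const windows = new Map();
  for (const entry of entries) {
    const win = windows.get(entry.requestId) ?? {
      requestId: entry.requestId,
      itemId: entry.itemId,
    };
    if (entry.step === 'check') win.checkAt = entry.at;
    if (entry.step === 'update') win.updateAt = entry.at;
    windows.set(entry.requestId, win);
  }

  // Keep only requests that logged both steps.
  const complete = [...windows.values()].filter(
    (win) => win.checkAt !== undefined && win.updateAt !== undefined
  );

  // Pairwise comparison; fine for an investigation, quadratic in request count.
  const overlaps = [];
  for (let i = 0; i < complete.length; i += 1) {
    for (let j = i + 1; j < complete.length; j += 1) {
      const a = complete[i];
      const b = complete[j];
      const sameItem = a.itemId === b.itemId;
      if (sameItem && a.checkAt < b.updateAt && b.checkAt < a.updateAt) {
        overlaps.push([a.requestId, b.requestId]);
      }
    }
  }
  return overlaps;
}
```

For example, if request A checks at 0 ms and updates at 2000 ms while request B checks the same item at 100 ms and updates at 2100 ms, the function reports the pair `['A', 'B']`.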

Q: Should you always use locks?

A: Not always. Locks add latency and complexity. Use atomic operations when possible, locks when necessary. Consider optimistic locking for low-contention scenarios.
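
Optimistic locking with a version number can be sketched as follows, assuming an in-memory `Map` of rows shaped like `{ stock, version }`. `compareAndSetStock` mimics an UPDATE that only applies when the row still has the version that was read. The function names, the retry limit, and the 10 ms delay are illustrative assumptions.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Mimics a versioned UPDATE: write only if nobody changed the row
// since it was read. Returns 1 if applied, 0 if the version moved on.
function compareAndSetStock(store, itemId, expectedVersion, newStock) {
  const row = store.get(itemId);
  if (row.version !== expectedVersion) return 0;
  row.stock = newStock;
  row.version += 1;
  return 1;
}

// Reads, computes, and writes; retries from a fresh read on conflict.
// Resolves with the number of attempts that were needed.
async function reserveOptimistic(store, itemId, quantity, maxAttempts = 5) {
  for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
    const { stock, version } = store.get(itemId); // snapshot read
    if (stock < quantity) {
      throw new Error('Insufficient stock');
    }
    await sleep(10); // gap between read and write; other requests can run
    if (compareAndSetStock(store, itemId, version, stock - quantity) === 1) {
      return attempt;
    }
  }
  throw new Error('Too many concurrent updates');
}
```

No request waits for a lock. A request that loses the race simply re-reads and tries again, which is cheap when conflicts are rare and wasteful when many requests target the same row.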

Q: What's the difference between a race condition and a deadlock?

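A: In a race condition, concurrent operations interleave so that the result depends on timing and can be wrong. In a deadlock, operations wait on each other indefinitely and none makes progress. Both are concurrency problems, but a race condition produces an incorrect result while a deadlock produces no result at all.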

Q: How do you test for race conditions?

A: Load test with concurrent requests, use chaos engineering to introduce timing variations, and add logging to detect race condition patterns. Consider formal verification for critical systems.
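
A concurrency test can be as simple as firing many purchases at once and checking an invariant afterwards. The following is a minimal sketch with two hypothetical in-memory implementations under test. `makeRacyShop` updates stock after the payment delay, and `makeSafeShop` reserves stock before it. The 5 ms delay stands in for the payment call.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Fires `attempts` purchases at once and counts how many resolve.
async function countSuccessfulPurchases(purchaseOnce, attempts) {
  const results = await Promise.allSettled(
    Array.from({ length: attempts }, () => purchaseOnce())
  );
  return results.filter((r) => r.status === 'fulfilled').length;
}

// Check, then pay, then update: the pattern from the incident.
function makeRacyShop(stock) {
  const state = { stock };
  return {
    state,
    async buy() {
      if (state.stock < 1) throw new Error('Insufficient stock');
      await sleep(5);
      state.stock -= 1;
    },
  };
}

// Check and reserve in one step, then pay.
function makeSafeShop(stock) {
  const state = { stock };
  return {
    state,
    async buy() {
      if (state.stock < 1) throw new Error('Insufficient stock');
      state.stock -= 1;
      await sleep(5);
    },
  };
}

// Invariant: never sell more than the initial stock, never go negative.
async function checkNoOversell(shop, initialStock, attempts) {
  const sold = await countSuccessfulPurchases(() => shop.buy(), attempts);
  const finalStock = shop.state.stock;
  return { sold, finalStock, ok: sold <= initialStock && finalStock >= 0 };
}
```

With 5 items and 20 concurrent buyers, `checkNoOversell(makeRacyShop(5), 5, 20)` reports `ok: false` (20 sold, stock -15), while `checkNoOversell(makeSafeShop(5), 5, 20)` reports `ok: true` (5 sold, stock 0). Against a real service the same harness would call the purchase endpoint and read stock from the database.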
