Real Engineering Stories

The Message Queue Lag That Overwhelmed Our Order Processing

A Kafka consumer fell behind during a flash sale, causing 6 hours of message lag. When the consumer caught up, it overwhelmed the database with a burst of writes. Learn about backpressure, consumer lag monitoring, and graceful degradation.

Intermediate · 25 min read

This is a story about event-driven architecture's hidden failure mode: when your consumer falls behind, it doesn't just delay processing—it can create a burst of work that overwhelms downstream systems when it finally catches up.


Context

We had an order processing pipeline: orders → Kafka → consumer → database. Under normal load, consumer lag stayed under 1 minute. We'd never seen it exceed 5 minutes.

Architecture:

  • Producer: Order service publishing to Kafka (order-created topic)
  • Consumer: Single consumer group, 3 partitions, writing to PostgreSQL
  • Database: Sized for 100 writes/second sustained
  • Monitoring: Consumer lag alert at 10,000 messages (we'd never hit it)

Assumptions Made:

  • Consumer could always catch up when traffic eased
  • Database could handle any burst we'd realistically see
  • Lag was a delay problem, not a blast-radius problem

The Incident

  • 9:00 AM: Flash sale starts. Order rate: 10x normal (1,000 orders/min)
  • 9:30 AM: Consumer lag: 15,000 messages. Database CPU at 80%
  • 10:00 AM: Consumer lag: 100,000 messages. Consumer struggling
  • 3:00 PM: Flash sale ends. Lag peaked at 360,000 messages (6 hours of orders)
  • 3:15 PM: Consumer catches up, sending a 1,000 writes/sec burst to the database. Database overwhelmed
  • 3:20 PM: Database connection pool exhausted. Order processing fully down
  • 4:00 PM: Added rate limiting to the consumer. Gradual catch-up over 2 hours
  • 6:00 PM: Lag cleared. Service restored. The catch-up burst caused roughly 3 hours of additional outage
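The back-of-envelope arithmetic behind the timeline is worth making explicit. Using the incident's numbers (360,000-message backlog, database sized for 100 writes/sec):

```python
def drain_seconds(backlog: int, drain_rate: float, incoming_rate: float = 0.0) -> float:
    """Seconds to clear a backlog while new messages keep arriving.
    Only converges when drain_rate exceeds incoming_rate."""
    net = drain_rate - incoming_rate
    if net <= 0:
        raise ValueError("consumer never catches up")
    return backlog / net

# Unthrottled at ~1,000 writes/sec the consumer drains 360,000 messages
# in 6 minutes -- but that is 10x what the database was sized for.
fast = drain_seconds(360_000, 1_000)  # 360 s: the stampede
# Capped at the database's sustained capacity, the same backlog drains
# in an hour with the database healthy the whole time.
safe = drain_seconds(360_000, 100)    # 3,600 s
```

A slower drain that keeps the database alive beats a fast one that takes it down; our actual catch-up took longer than an hour because normal order traffic kept arriving while we drained.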

Root Cause Analysis

Primary Cause: Consumer designed for steady-state, not burst catch-up. When lag cleared, it processed as fast as possible, overwhelming the database.

Key Insight: Lag isn't just delay—it's stored work. When released all at once, it becomes a stampede.


Fix & Mitigation

  1. Rate limiting on consumer: Max N messages/second, even when lag is high
  2. Consumer lag alerts: Alert at 1 min, 5 min, 15 min lag—not just message count
  3. Horizontal scaling: More consumer instances during high load
  4. Database protection: Connection limits, query rate limiting, bulk insert batching
  5. Circuit breaker: If DB is struggling, slow down consumer rather than keep pushing

Key Lessons

  1. Monitor consumer lag in time, not just count—100K messages could be 1 min or 6 hours
  2. Design for catch-up bursts—when lag clears, work arrives in a flood
  3. Backpressure: Consumer should slow down when downstream can't keep up
  4. Rate limit everything—even "catching up" needs to be throttled
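Lesson #1 is a one-line conversion, but it changes what your alerts mean. A sketch (the rates are hypothetical examples, not measured numbers):

```python
def lag_seconds(lag_messages: int, consume_rate_per_sec: float) -> float:
    """Translate a message-count lag into time-to-catch-up at the
    consumer's current throughput. Alert on this, not on the raw count."""
    if consume_rate_per_sec <= 0:
        return float("inf")  # consumer stalled: lag in time is unbounded
    return lag_messages / consume_rate_per_sec

# The same 100,000-message lag:
# at ~1,667 msg/s it clears in about a minute;
# at ~4.6 msg/s it is roughly six hours of delay.
```

Alerting at 1, 5, and 15 minutes of lag-in-time catches a stalled consumer early regardless of how bursty the message count is.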

Keep exploring

Real engineering stories work best when combined with practice. Explore more stories or apply what you've learned in our system design practice platform.