This is a story about a hidden failure mode of event-driven architecture: when your consumer falls behind, it doesn't just delay processing. It stores up a burst of work that can overwhelm downstream systems when the consumer finally catches up.
Context
We had an order processing pipeline: orders → Kafka → consumer → database. Under normal load, consumer lag stayed under 1 minute. We'd never seen it exceed 5 minutes.
Architecture:
- Producer: Order service publishing to Kafka (order-created topic)
- Consumer: Single consumer group, 3 partitions, writing to PostgreSQL
- Database: Sized for 100 writes/second sustained
- Monitoring: Consumer lag alert at 10,000 messages (we'd never hit it)
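Those sizing numbers imply a lot of steady-state headroom, which is worth making explicit. A quick back-of-envelope check (the 100 orders/min normal rate is inferred from "10x normal (1000/min)" in the incident timeline):

```python
# Back-of-envelope sizing for the pipeline above.
NORMAL_ORDERS_PER_MIN = 100   # inferred: 10x normal = 1000/min
DB_CAPACITY_WPS = 100         # "sized for 100 writes/second sustained"

normal_wps = NORMAL_ORDERS_PER_MIN / 60   # ~1.7 writes/sec in steady state
headroom = DB_CAPACITY_WPS / normal_wps   # ~60x headroom at normal load
```

Even the 10x flash-sale rate (~17 writes/sec) fits comfortably inside the database's capacity. The steady-state numbers never hinted at the failure mode; it was the catch-up behavior that did.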
Assumptions Made:
- Consumer could always catch up when traffic eased
- Database could handle any burst we'd realistically see
- Lag was a delay problem, not a blast-radius problem
The Incident
9:00 AM
Flash sale starts. Order rate: 10x normal (1000/min)
9:30 AM
Consumer lag: 15,000 messages. Database CPU 80%
10:00 AM
Consumer lag: 100,000 messages. Consumer at maximum throughput and still falling behind
3:00 PM
Flash sale ends. Lag peaked at 360,000 messages (6 hours of orders)
3:15 PM
Consumer begins catching up, pushing a burst of roughly 1000 writes/sec to the database. Database overwhelmed
3:20 PM
Database connection pool exhausted. Order processing fully down
4:00 PM
Added rate limiting to consumer. Gradual catch-up over 2 hours
6:00 PM
Lag cleared. Service restored. 3-hour additional outage from catch-up burst
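The timeline numbers make the blast radius easy to quantify. A small sketch of the drain-time arithmetic (same figures as the timeline; it assumes post-sale incoming traffic returns to the normal 100 orders/min):

```python
# How long does a 360,000-message backlog take to drain at a given
# consume rate, while normal traffic keeps arriving?
BACKLOG = 360_000           # peak lag from the timeline
INCOMING_WPS = 100 / 60     # post-sale order rate, ~1.7 msgs/sec (assumed)

def drain_minutes(consume_wps):
    """Minutes to clear the backlog at a net drain rate of
    (consume rate - incoming rate)."""
    net_wps = consume_wps - INCOMING_WPS
    return BACKLOG / net_wps / 60

# Unthrottled at 1000 writes/sec: drains in ~6 minutes, but at 10x the
# database's rated capacity -- this is the stampede that took it down.
unthrottled = drain_minutes(1000)   # ~6 min
# At exactly DB capacity (100 writes/sec): ~61 minutes.
at_capacity = drain_minutes(100)    # ~61 min
# At 50 writes/sec, leaving the database real headroom: ~124 minutes,
# consistent with the "gradual catch-up over 2 hours" after the fix.
gentle = drain_minutes(50)          # ~124 min
```

The uncomfortable tradeoff is visible in the numbers: an unthrottled drain is fast but lethal, and a safe drain necessarily takes hours.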
Root Cause Analysis
Primary Cause: Consumer designed for steady-state, not burst catch-up. When lag cleared, it processed as fast as possible, overwhelming the database.
Key Insight: Lag isn't just delay—it's stored work. When released all at once, it becomes a stampede.
Fix & Mitigation
- Rate limiting on consumer: Max N messages/second, even when lag is high
- Consumer lag alerts: Alert at 1 min, 5 min, 15 min lag—not just message count
- Horizontal scaling: More consumer instances during high load
- Database protection: Connection limits, query rate limiting, bulk insert batching
- Circuit breaker: If DB is struggling, slow down consumer rather than keep pushing
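The first mitigation can be sketched as a token bucket in front of the database write path: the consumer pulls a token before each write, so catch-up throughput is capped no matter how large the lag is. This is a minimal illustration under assumed names and numbers, not our production code:

```python
import time
from collections import deque

class TokenBucket:
    """Allows at most `rate` events/second on average, with a small
    burst allowance of `capacity` tokens."""
    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def acquire(self):
        # Block until one token is available, refilling based on elapsed time.
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            time.sleep((1 - self.tokens) / self.rate)

def drain(backlog, bucket, write):
    """Drain a backlog through the rate limiter, one write per message.
    Even with a huge backlog, writes never exceed the bucket's rate."""
    while backlog:
        bucket.acquire()
        write(backlog.popleft())

# Demo: drain a 20-message backlog at 200 msgs/sec with a 5-token burst.
backlog = deque(range(20))
written = []
bucket = TokenBucket(rate=200, capacity=5)
start = time.monotonic()
drain(backlog, bucket, written.append)
elapsed = time.monotonic() - start
```

The same `acquire` call site is a natural place to hang the circuit-breaker behavior: when database health checks degrade, shrink `bucket.rate` instead of continuing to push.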
Key Lessons
- Monitor consumer lag in time, not just count—100K messages could be 1 min or 6 hours
- Design for catch-up bursts—when lag clears, work arrives in a flood
- Backpressure: Consumer should slow down when downstream can't keep up
- Rate limit everything—even "catching up" needs to be throttled
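The first lesson reduces to a one-line metric: divide the lag count by the consumer's recent throughput to express lag as time-to-clear. A minimal sketch (function name and sampling window are hypothetical):

```python
def lag_seconds(lag_messages, consumed_last_minute):
    """Estimate lag as time-to-clear: backlog size divided by the
    consumer's recent throughput. Alert on this, not on raw count."""
    rate_per_sec = consumed_last_minute / 60
    return float("inf") if rate_per_sec == 0 else lag_messages / rate_per_sec

# The same 100K-message lag at two very different throughputs:
fast = lag_seconds(100_000, consumed_last_minute=100_000)  # ~60 seconds
slow = lag_seconds(100_000, consumed_last_minute=278)      # ~6 hours
```

Note the degenerate case: a stalled consumer (zero recent throughput) means infinite time-based lag, which is exactly the signal a count-based alert hides.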