Real Engineering Stories

The Cache Stampede That Took Down Our API

A production incident where an accidental cache flush caused a cache stampede, overwhelming the database connection pool and taking down the API. Learn about cache stampedes, connection pool exhaustion, and how to prevent them.

Intermediate · 25 min read

This is a story about how a simple mistake—running a cache flush command in production instead of staging—caused a cascading failure that took down our API for 45 minutes. It's also a story about what we learned and how we changed our system design to prevent it from happening again.


Context

We were running a social media API that served user profiles, posts, and feeds. The system handled about 10M requests per day, with most traffic during peak hours (evenings). User profiles were cached in Redis with a 1-hour TTL to reduce database load.

Original Architecture:

graph TB
    Client[Client] --> LB[Load Balancer]
    LB --> API1[API Server 1]
    LB --> API2[API Server 2]
    LB --> API3[API Server 3]
    API1 --> Cache[Redis Cache]
    API2 --> Cache
    API3 --> Cache
    API1 --> DB[(PostgreSQL Database)]
    API2 --> DB
    API3 --> DB

Technology Choices:

  • API Layer: Node.js with Express (3 instances behind load balancer)
  • Cache: Redis (single instance, 8GB memory)
  • Database: PostgreSQL (primary + 2 read replicas)
  • Caching Strategy: Cache-aside pattern with a 1-hour TTL (original read path sketched below)
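
A minimal sketch of that original read path, assuming a node-redis v4 client and a node-postgres (pg) pool. The function and table names, the key format, and the pool size are illustrative, not the actual production code.

// Original cache-aside read path (illustrative names, not the real codebase).
// Assumes node-redis v4 and node-postgres (pg); redis.connect() is assumed
// to have been called at startup.
import { createClient } from "redis";
import { Pool } from "pg";

const redis = createClient({ url: process.env.REDIS_URL });
const db = new Pool({ connectionString: process.env.DATABASE_URL, max: 100 });

const PROFILE_TTL_SECONDS = 3600; // the 1-hour TTL

async function getUserProfile(userId: string): Promise<unknown> {
  const cacheKey = `profile:${userId}`;

  // 1. Try the cache first.
  const cached = await redis.get(cacheKey);
  if (cached !== null) return JSON.parse(cached);

  // 2. On a miss, read from the database...
  const { rows } = await db.query(
    "SELECT * FROM user_profiles WHERE user_id = $1",
    [userId]
  );
  const profile = rows[0] ?? null;

  // 3. ...and populate the cache for the next hour.
  if (profile) {
    await redis.set(cacheKey, JSON.stringify(profile), {
      EX: PROFILE_TTL_SECONDS,
    });
  }
  return profile;
}

Nothing here limits how many concurrent misses reach the database at once; that gap is what turned the flush into a stampede.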

Assumptions Made:

  • Cache hit rate would be > 80%
  • Database could handle cache misses during normal traffic
  • TTL expiration would be staggered (not all keys expire at once)

The Incident

Timeline:

  • 2:00 AM: Scheduled deployment completed successfully
  • 2:15 AM: Cache flush command accidentally executed (meant for staging, ran in production)
  • 2:16 AM: All cached user profiles invalidated
  • 2:17 AM: First requests hit API, cache misses trigger database queries
  • 2:18 AM: Database connection pool exhausted (100 connections max)
  • 2:19 AM: API response times spike from 50ms to 5+ seconds
  • 2:20 AM: Error rate jumps to 15%
  • 2:25 AM: On-call engineer paged
  • 2:30 AM: Database read replicas also overwhelmed
  • 2:35 AM: Decision to restart API servers (hoping to clear connection pool)
  • 2:40 AM: Restart causes brief service interruption
  • 2:45 AM: Cache warming script executed manually
  • 3:00 AM: Service fully recovered

Symptoms

What We Saw:

  • Error Rate: Spiked from 0.1% to 15% in 4 minutes
  • Response Time: p50 went from 50ms to 5 seconds, p99 exceeded 30 seconds
  • Database Connections: Hit 100/100 (pool exhausted)
  • Cache Hit Rate: Dropped from 85% to 0% (all keys expired)
  • User Impact: ~500K requests failed or timed out during the incident

How We Detected It:

  • Alert fired when error rate exceeded 5% threshold
  • Dashboard showed database connection pool at 100%
  • On-call engineer noticed spike in database query latency

Monitoring Gaps:

  • No alert for cache hit rate drops
  • No alert for database connection pool usage
  • No alert for cache flush operations

Root Cause Analysis

Primary Cause: Cache stampede after accidental cache flush.

What Happened:

  1. Cache flush invalidated all 2M cached user profiles
  2. Next 10,000 requests (within 1 minute) all resulted in cache misses
  3. Each cache miss triggered a database query
  4. Database connection pool (100 connections) was exhausted
  5. New requests waited for available connections, causing timeouts
  6. Timeouts caused retries, amplifying the load
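
The arithmetic behind steps 4-6 is easy to reproduce without a real database. The toy simulation below (illustrative numbers only) models a 100-connection pool where each query holds a connection for 50 ms and 10,000 misses arrive at once; the last request waits roughly (10,000 / 100) × 50 ms ≈ 5 seconds, which is exactly the latency spike in the timeline.

// Toy simulation of connection-pool exhaustion: no real database involved.
// 10,000 concurrent cache misses contend for 100 "connections",
// each holding a connection for ~50 ms.
const POOL_SIZE = 100;
const QUERY_MS = 50;
const CONCURRENT_MISSES = 10_000;

const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

// A minimal semaphore standing in for the connection pool.
class FakePool {
  private available = POOL_SIZE;
  private waiters: Array<() => void> = [];

  async query(): Promise<void> {
    if (this.available === 0) {
      // Pool exhausted: wait for a connection to be handed back.
      await new Promise<void>((resolve) => this.waiters.push(resolve));
    } else {
      this.available--;
    }
    await sleep(QUERY_MS); // "run" the query
    const next = this.waiters.shift();
    if (next) next(); // hand the connection straight to the next waiter
    else this.available++;
  }
}

async function main(): Promise<void> {
  const pool = new FakePool();
  const latencies = await Promise.all(
    Array.from({ length: CONCURRENT_MISSES }, async () => {
      const start = Date.now();
      await pool.query();
      return Date.now() - start;
    })
  );
  console.log(`worst-case latency: ${Math.max(...latencies)} ms`); // ~5000 ms
}

main();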

Why It Was So Bad:

  • No cache warming: After flush, cache was empty
  • No connection pool monitoring: We didn't know we were hitting limits
  • No rate limiting: All requests tried to hit database simultaneously
  • No circuit breaker: System kept trying even when database was overwhelmed

Contributing Factors:

  • Single Redis instance (no redundancy)
  • Small database connection pool (100 connections for 3 API servers)
  • No cache stampede protection (no locking or probabilistic early expiration)
  • Manual cache flush command (should have been restricted or automated)

Fix & Mitigation

Immediate Fixes (During Incident):

  1. Restarted API servers: Cleared connection pool, gave database breathing room
  2. Manually warmed cache: Ran a script to pre-populate the top 100K user profiles (sketched after this list)
  3. Increased database connection pool: From 100 to 200 (temporary)
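
A sketch of that kind of warming script, assuming the same node-redis v4 and pg setup as the read-path sketch above. The "top N by recent views" query, the table and column names, and the batch size are illustrative assumptions.

// One-off cache warming script: pre-load the hottest profiles after a flush
// or deployment. Table, column, and key names are illustrative.
import { createClient } from "redis";
import { Pool } from "pg";

const redis = createClient({ url: process.env.REDIS_URL });
const db = new Pool({ connectionString: process.env.DATABASE_URL, max: 10 });

const TOP_N = 100_000;
const TTL_SECONDS = 3600;
const BATCH_SIZE = 500; // keep DB and Redis load modest while warming

async function warmCache(): Promise<void> {
  await redis.connect();

  // Hottest profiles first, so partial completion still helps the most.
  const { rows } = await db.query(
    "SELECT * FROM user_profiles ORDER BY recent_view_count DESC LIMIT $1",
    [TOP_N]
  );

  for (let i = 0; i < rows.length; i += BATCH_SIZE) {
    const batch = rows.slice(i, i + BATCH_SIZE);
    await Promise.all(
      batch.map((profile) =>
        redis.set(`profile:${profile.user_id}`, JSON.stringify(profile), {
          EX: TTL_SECONDS,
        })
      )
    );
  }

  console.log(`warmed ${rows.length} profiles`);
  await redis.quit();
  await db.end();
}

warmCache().catch((err) => {
  console.error("cache warming failed:", err);
  process.exit(1);
});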

Long-Term Improvements:

  1. Cache Stampede Protection:

    • Implemented probabilistic early expiration (cache keys expire 10% early randomly)
    • Added distributed locking for cache misses (only one request per key fetches from the DB); both protections are sketched after this list
    • Added cache warming on startup
  2. Database Connection Pool:

    • Increased pool size to 300 connections
    • Added connection pool monitoring and alerts
    • Implemented connection pool per API server (not shared)
  3. Monitoring & Alerting:

    • Added cache hit rate alert (alert if < 70%)
    • Added database connection pool usage alert (alert if > 80%)
    • Added cache flush operation audit log
  4. Process Improvements:

    • Restricted cache flush commands (require approval)
    • Added staging/production environment checks
    • Created runbook for cache-related incidents
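
A condensed sketch of the stampede protection from item 1, assuming node-redis v4 (with the client connected at startup). The key format, the 10% early-refresh window, the lock timeout, and the retry delay are illustrative choices, not the exact production values.

// Cache read with probabilistic early expiration plus a per-key lock.
// Helper names and constants are illustrative.
import { createClient } from "redis";

const redis = createClient({ url: process.env.REDIS_URL });

const TTL_SECONDS = 3600;
const EARLY_WINDOW = 0.1; // eligible for early refresh in the last 10% of TTL
const LOCK_TTL_MS = 5000; // lock auto-expires if its holder dies

async function getWithProtection(
  key: string,
  loadFromDb: () => Promise<string>
): Promise<string> {
  const [value, ttl] = await Promise.all([redis.get(key), redis.ttl(key)]);

  // Probabilistic early expiration: occasionally treat a soon-to-expire key
  // as a miss, so one request refreshes it before the real expiry.
  const refreshEarly =
    value !== null &&
    ttl >= 0 &&
    ttl < TTL_SECONDS * EARLY_WINDOW &&
    Math.random() < 0.1;

  if (value !== null && !refreshEarly) return value;

  // Per-key lock: only the request that wins SET NX goes to the database.
  const lockKey = `lock:${key}`;
  const gotLock = await redis.set(lockKey, "1", { NX: true, PX: LOCK_TTL_MS });

  if (gotLock !== "OK") {
    // Someone else is already refreshing: serve the stale value if we have
    // one, otherwise wait briefly and re-check the cache.
    if (value !== null) return value;
    await new Promise((r) => setTimeout(r, 100));
    return getWithProtection(key, loadFromDb);
  }

  try {
    const fresh = await loadFromDb();
    await redis.set(key, fresh, { EX: TTL_SECONDS });
    return fresh;
  } finally {
    // Production code would release via a token check so one request cannot
    // delete another's lock; plain DEL keeps the sketch short.
    await redis.del(lockKey);
  }
}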

Architecture After Fix

graph TB
    Client[Client] --> LB[Load Balancer]
    LB --> API1[API Server 1]
    LB --> API2[API Server 2]
    LB --> API3[API Server 3]
    API1 --> Cache1[Redis Cache<br/>Primary]
    API2 --> Cache1
    API3 --> Cache1
    Cache1 --> Cache2[Redis Cache<br/>Replica]
    API1 --> Lock[Distributed Lock<br/>Redis]
    API2 --> Lock
    API3 --> Lock
    API1 --> DB[(PostgreSQL<br/>Pool: 300)]
    API2 --> DB
    API3 --> DB
    DB --> Monitor[Monitoring<br/>& Alerts]

Key Changes:

  • Redis replica for redundancy
  • Distributed locking for cache stampede protection
  • Larger database connection pool (300 connections)
  • Enhanced monitoring and alerting (pool sizing and the pool-usage alert are sketched below)
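
A sketch of the enlarged pool plus the 80% pool-usage alert, assuming node-postgres (pg). The sampling interval and sendAlert are hypothetical stand-ins for whatever metrics or paging pipeline is in place.

// Database pool sized per the post-incident config, with a usage check.
// sendAlert is a hypothetical stand-in for a real paging/metrics client.
import { Pool } from "pg";

const POOL_MAX = 300; // raised from 100 after the incident

const db = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: POOL_MAX,
  connectionTimeoutMillis: 2000, // fail fast instead of queueing forever
  idleTimeoutMillis: 30_000,
});

const POOL_USAGE_ALERT_THRESHOLD = 0.8; // alert if > 80% of the pool is in use

function sendAlert(message: string): void {
  console.error(`[ALERT] ${message}`);
}

// Sample pool usage every 10 seconds and alert when it crosses the threshold
// or when requests are already queueing for a connection.
setInterval(() => {
  const inUse = db.totalCount - db.idleCount;
  if (inUse / POOL_MAX > POOL_USAGE_ALERT_THRESHOLD || db.waitingCount > 0) {
    sendAlert(
      `DB pool pressure: ${inUse}/${POOL_MAX} in use, ${db.waitingCount} waiting`
    );
  }
}, 10_000);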

Key Lessons

  1. Cache stampedes are real: When a hot key expires (or the cache is flushed), every concurrent request for that data falls through to the database at once. Use probabilistic early expiration or locking.

  2. Monitor connection pools: Database connection pools are a common bottleneck. Monitor usage and set alerts.

  3. Cache warming matters: After cache flushes or deployments, warm the cache with hot data.

  4. Restrict dangerous operations: Cache flush commands should require approval or be automated with safeguards.

  5. Circuit breakers help: When the database is overwhelmed, a circuit breaker fails fast instead of letting retries pile up into a cascading failure (a minimal sketch follows).
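
Below is a minimal circuit breaker sketch, not a specific library (in Node.js you would more likely reach for something like opossum); the thresholds are illustrative.

// Minimal circuit breaker around a flaky dependency such as the database.
type State = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: State = "closed";
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly failureThreshold = 5, // trip after 5 consecutive failures
    private readonly resetTimeoutMs = 10_000 // allow a probe after 10 s
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === "open") {
      if (Date.now() - this.openedAt < this.resetTimeoutMs) {
        throw new Error("circuit open: failing fast instead of piling onto the DB");
      }
      this.state = "half-open"; // let a single probe request through
    }
    try {
      const result = await fn();
      this.state = "closed"; // probe (or normal call) succeeded
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures++;
      if (this.state === "half-open" || this.failures >= this.failureThreshold) {
        this.state = "open";
        this.openedAt = Date.now();
      }
      throw err;
    }
  }
}

// Usage (hypothetical): const profile = await breaker.call(() => fetchProfileFromDb(id));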


Interview Takeaways

Common Questions:

  • "How do you prevent cache stampedes?"
  • "What happens when cache expires?"
  • "How do you handle database connection pool exhaustion?"

What Interviewers Are Looking For:

  • Understanding of cache-aside pattern and its failure modes
  • Knowledge of cache stampede mitigation strategies
  • Awareness of database connection pool limits
  • Experience with production incident response

What a Senior Engineer Would Do Differently

From the Start:

  1. Implement cache stampede protection: Probabilistic early expiration or distributed locking
  2. Monitor connection pools: Set up alerts before hitting limits
  3. Use cache warming: Pre-populate cache after deployments
  4. Add circuit breakers: Prevent cascading failures when database is overwhelmed
  5. Restrict dangerous operations: Cache flush should require approval

The Real Lesson: Caching is powerful, but cache invalidation is hard. Design for cache failures, not just cache hits.


FAQs

Q: How do you prevent cache stampedes?

A: Use probabilistic early expiration (cache keys expire 10% early randomly) or distributed locking (only one request per key fetches from database). Cache warming after deployments also helps.

Q: What's the best way to handle database connection pool exhaustion?

A: Monitor connection pool usage and set alerts before hitting limits. Increase pool size if needed, but also optimize queries and add caching to reduce database load.

Q: Should you always use caching?

A: Caching is powerful, but cache invalidation is hard. Use caching for read-heavy workloads, but design for cache failures. Not everything needs to be cached.

Q: How do you detect cache stampedes before they cause problems?

A: Monitor cache hit rate (alert if < 70%), database connection pool usage (alert if > 80%), and response times. Sudden drops in cache hit rate combined with increased database load indicate a potential stampede.

Q: What's the difference between cache stampede and thundering herd?

A: The terms are often used interchangeably. Thundering herd is the general pattern of many clients waking up and competing for a single resource at the same moment; a cache stampede is the cache-specific case, where an expired or flushed key sends every concurrent request for it to the backing store. Both can be prevented with locking or probabilistic early expiration.

Q: How do you warm a cache after a flush?

A: Pre-populate cache with hot data (most frequently accessed keys) before traffic hits. This can be done via a script that queries the database for top N items and stores them in cache.

Q: Is it better to use a single large connection pool or multiple smaller ones?

A: It depends on your architecture. For microservices, connection pools per service are better for isolation. For monoliths, a shared pool might be simpler. Monitor usage and adjust based on actual load.

Keep exploring

Real engineering stories work best when combined with practice. Explore more stories or apply what you've learned in our system design practice platform.