Real Engineering Stories

The Cache Stampede That Took Down Our API

A production incident where an accidental cache flush caused a cache stampede, overwhelming the database connection pool and taking down the API. Learn about cache stampedes, connection pool exhaustion, and how to prevent them.

Intermediate · 25 min read

This is a story about how a simple mistake—running a cache flush command in production instead of staging—caused a cascading failure that took down our API for 45 minutes. It's also a story about what we learned and how we changed our system design to prevent it from happening again.


Context

We were running a social media API that served user profiles, posts, and feeds. The system handled about 10M requests per day, with most traffic during peak hours (evenings). User profiles were cached in Redis with a 1-hour TTL to reduce database load.

Original Architecture:

graph TB
    Client[Client] --> LB[Load Balancer]
    LB --> API1[API Server 1]
    LB --> API2[API Server 2]
    LB --> API3[API Server 3]
    API1 --> Cache[Redis Cache]
    API2 --> Cache
    API3 --> Cache
    API1 --> DB[(PostgreSQL Database)]
    API2 --> DB
    API3 --> DB

Technology Choices:

  • API Layer: Node.js with Express (3 instances behind load balancer)
  • Cache: Redis (single instance, 8GB memory)
  • Database: PostgreSQL (primary + 2 read replicas)
  • Caching Strategy: Cache-aside pattern with a 1-hour TTL (original read path sketched below)
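
A minimal sketch of that original read path, assuming a node-redis v4 client and a node-postgres (pg) pool. The function and table names, the key format, and the pool size are illustrative, not the actual production code.

// Original cache-aside read path (illustrative names, not the real codebase).
// Assumes node-redis v4 and node-postgres (pg); redis.connect() is assumed
// to have been called at startup.
import { createClient } from "redis";
import { Pool } from "pg";

const redis = createClient({ url: process.env.REDIS_URL });
const db = new Pool({ connectionString: process.env.DATABASE_URL, max: 100 });

const PROFILE_TTL_SECONDS = 3600; // the 1-hour TTL

async function getUserProfile(userId: string): Promise<unknown> {
  const cacheKey = `profile:${userId}`;

  // 1. Try the cache first.
  const cached = await redis.get(cacheKey);
  if (cached !== null) return JSON.parse(cached);

  // 2. On a miss, read from the database...
  const { rows } = await db.query(
    "SELECT * FROM user_profiles WHERE user_id = $1",
    [userId]
  );
  const profile = rows[0] ?? null;

  // 3. ...and populate the cache for the next hour.
  if (profile) {
    await redis.set(cacheKey, JSON.stringify(profile), {
      EX: PROFILE_TTL_SECONDS,
    });
  }
  return profile;
}

Nothing here limits how many concurrent misses reach the database at once; that gap is what turned the flush into a stampede.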

Assumptions Made:

  • Cache hit rate would be > 80%
  • Database could handle cache misses during normal traffic
  • TTL expiration would be staggered (not all keys expire at once)

The Incident

Timeline:

  • 2:00 AM: Scheduled deployment completed successfully
  • 2:15 AM: Cache flush command accidentally executed (meant for staging, ran in production)
  • 2:16 AM: All cached user profiles invalidated
  • 2:17 AM: First requests hit API, cache misses trigger database queries
  • 2:18 AM: Database connection pool exhausted (100 connections max)
  • 2:19 AM: API response times spike from 50ms to 5+ seconds
  • 2:20 AM: Error rate jumps to 15%
  • 2:25 AM: On-call engineer paged
  • 2:30 AM: Database read replicas also overwhelmed
  • 2:35 AM: Decision to restart API servers (hoping to clear connection pool)
  • 2:40 AM: Restart causes brief service interruption
  • 2:45 AM: Cache warming script executed manually
  • 3:00 AM: Service fully recovered

Symptoms

What We Saw:

  • Error Rate: Spiked from 0.1% to 15% in 4 minutes
  • Response Time: p50 went from 50ms to 5 seconds, p99 exceeded 30 seconds
  • Database Connections: Hit 100/100 (pool exhausted)
  • Cache Hit Rate: Dropped from 85% to 0% (all keys expired)
  • User Impact: ~500K requests failed or timed out during the incident

How We Detected It:

  • Alert fired when error rate exceeded 5% threshold
  • Dashboard showed database connection pool at 100%
  • On-call engineer noticed spike in database query latency

Monitoring Gaps:

  • No alert for cache hit rate drops
  • No alert for database connection pool usage
  • No alert for cache flush operations

Root Cause Analysis

Primary Cause: Cache stampede after accidental cache flush.

What Happened:

  1. Cache flush invalidated all 2M cached user profiles
  2. Next 10,000 requests (within 1 minute) all resulted in cache misses
  3. Each cache miss triggered a database query
  4. Database connection pool (100 connections) was exhausted
  5. New requests waited for available connections, causing timeouts
  6. Timeouts caused retries, amplifying the load
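
The arithmetic behind steps 4-6 is easy to reproduce without a real database. The toy simulation below (illustrative numbers only) models a 100-connection pool where each query holds a connection for 50 ms and 10,000 misses arrive at once; the last request waits roughly (10,000 / 100) × 50 ms ≈ 5 seconds, which is exactly the latency spike in the timeline.

// Toy simulation of connection-pool exhaustion: no real database involved.
// 10,000 concurrent cache misses contend for 100 "connections",
// each holding a connection for ~50 ms.
const POOL_SIZE = 100;
const QUERY_MS = 50;
const CONCURRENT_MISSES = 10_000;

const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

// A minimal semaphore standing in for the connection pool.
class FakePool {
  private available = POOL_SIZE;
  private waiters: Array<() => void> = [];

  async query(): Promise<void> {
    if (this.available === 0) {
      // Pool exhausted: wait for a connection to be handed back.
      await new Promise<void>((resolve) => this.waiters.push(resolve));
    } else {
      this.available--;
    }
    await sleep(QUERY_MS); // "run" the query
    const next = this.waiters.shift();
    if (next) next(); // hand the connection straight to the next waiter
    else this.available++;
  }
}

async function main(): Promise<void> {
  const pool = new FakePool();
  const latencies = await Promise.all(
    Array.from({ length: CONCURRENT_MISSES }, async () => {
      const start = Date.now();
      await pool.query();
      return Date.now() - start;
    })
  );
  console.log(`worst-case latency: ${Math.max(...latencies)} ms`); // ~5000 ms
}

main();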

Why It Was So Bad:

  • No cache warming: After flush, cache was empty
  • No connection pool monitoring: We didn't know we were hitting limits
  • No rate limiting: All requests tried to hit database simultaneously
  • No circuit breaker: System kept trying even when database was overwhelmed

Contributing Factors:

  • Single Redis instance (no redundancy)
  • Small database connection pool (100 connections for 3 API servers)
  • No cache stampede protection (no locking or probabilistic early expiration)
  • Manual cache flush command (should have been restricted or automated)

Fix & Mitigation

Immediate Fixes (During Incident):

  1. Restarted API servers: Cleared connection pool, gave database breathing room
  2. Manually warmed cache: Ran a script to pre-populate the top 100K user profiles (sketched after this list)
  3. Increased database connection pool: From 100 to 200 (temporary)
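
A sketch of that kind of warming script, assuming the same node-redis v4 and pg setup as the read-path sketch above. The "top N by recent views" query, the table and column names, and the batch size are illustrative assumptions.

// One-off cache warming script: pre-load the hottest profiles after a flush
// or deployment. Table, column, and key names are illustrative.
import { createClient } from "redis";
import { Pool } from "pg";

const redis = createClient({ url: process.env.REDIS_URL });
const db = new Pool({ connectionString: process.env.DATABASE_URL, max: 10 });

const TOP_N = 100_000;
const TTL_SECONDS = 3600;
const BATCH_SIZE = 500; // keep DB and Redis load modest while warming

async function warmCache(): Promise<void> {
  await redis.connect();

  // Hottest profiles first, so partial completion still helps the most.
  const { rows } = await db.query(
    "SELECT * FROM user_profiles ORDER BY recent_view_count DESC LIMIT $1",
    [TOP_N]
  );

  for (let i = 0; i < rows.length; i += BATCH_SIZE) {
    const batch = rows.slice(i, i + BATCH_SIZE);
    await Promise.all(
      batch.map((profile) =>
        redis.set(`profile:${profile.user_id}`, JSON.stringify(profile), {
          EX: TTL_SECONDS,
        })
      )
    );
  }

  console.log(`warmed ${rows.length} profiles`);
  await redis.quit();
  await db.end();
}

warmCache().catch((err) => {
  console.error("cache warming failed:", err);
  process.exit(1);
});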

Long-Term Improvements:

  1. Cache Stampede Protection:

    • Implemented probabilistic early expiration (cache keys expire 10% early randomly)
    • Added distributed locking for cache misses (only one request per key fetches from the DB); both protections are sketched after this list
    • Added cache warming on startup
  2. Database Connection Pool:

    • Increased pool size to 300 connections
    • Added connection pool monitoring and alerts
    • Implemented connection pool per API server (not shared)
  3. Monitoring & Alerting:

    • Added cache hit rate alert (alert if < 70%)
    • Added database connection pool usage alert (alert if > 80%)
    • Added cache flush operation audit log
  4. Process Improvements:

    • Restricted cache flush commands (require approval)
    • Added staging/production environment checks
    • Created runbook for cache-related incidents
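
A condensed sketch of the stampede protection from item 1, assuming node-redis v4 (with the client connected at startup). The key format, the 10% early-refresh window, the lock timeout, and the retry delay are illustrative choices, not the exact production values.

// Cache read with probabilistic early expiration plus a per-key lock.
// Helper names and constants are illustrative.
import { createClient } from "redis";

const redis = createClient({ url: process.env.REDIS_URL });

const TTL_SECONDS = 3600;
const EARLY_WINDOW = 0.1; // eligible for early refresh in the last 10% of TTL
const LOCK_TTL_MS = 5000; // lock auto-expires if its holder dies

async function getWithProtection(
  key: string,
  loadFromDb: () => Promise<string>
): Promise<string> {
  const [value, ttl] = await Promise.all([redis.get(key), redis.ttl(key)]);

  // Probabilistic early expiration: occasionally treat a soon-to-expire key
  // as a miss, so one request refreshes it before the real expiry.
  const refreshEarly =
    value !== null &&
    ttl >= 0 &&
    ttl < TTL_SECONDS * EARLY_WINDOW &&
    Math.random() < 0.1;

  if (value !== null && !refreshEarly) return value;

  // Per-key lock: only the request that wins SET NX goes to the database.
  const lockKey = `lock:${key}`;
  const gotLock = await redis.set(lockKey, "1", { NX: true, PX: LOCK_TTL_MS });

  if (gotLock !== "OK") {
    // Someone else is already refreshing: serve the stale value if we have
    // one, otherwise wait briefly and re-check the cache.
    if (value !== null) return value;
    await new Promise((r) => setTimeout(r, 100));
    return getWithProtection(key, loadFromDb);
  }

  try {
    const fresh = await loadFromDb();
    await redis.set(key, fresh, { EX: TTL_SECONDS });
    return fresh;
  } finally {
    // Production code would release via a token check so one request cannot
    // delete another's lock; plain DEL keeps the sketch short.
    await redis.del(lockKey);
  }
}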

Architecture After Fix

graph TB
    Client[Client] --> LB[Load Balancer]
    LB --> API1[API Server 1]
    LB --> API2[API Server 2]
    LB --> API3[API Server 3]
    API1 --> Cache1[Redis Cache<br/>Primary]
    API2 --> Cache1
    API3 --> Cache1
    Cache1 --> Cache2[Redis Cache<br/>Replica]
    API1 --> Lock[Distributed Lock<br/>Redis]
    API2 --> Lock
    API3 --> Lock
    API1 --> DB[(PostgreSQL<br/>Pool: 300)]
    API2 --> DB
    API3 --> DB
    DB --> Monitor[Monitoring<br/>& Alerts]

Key Changes:

  • Redis replica for redundancy
  • Distributed locking for cache stampede protection
  • Larger database connection pool (300 connections)
  • Enhanced monitoring and alerting (pool sizing and the pool-usage alert are sketched below)
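
A sketch of the enlarged pool plus the 80% pool-usage alert, assuming node-postgres (pg). The sampling interval and sendAlert are hypothetical stand-ins for whatever metrics or paging pipeline is in place.

// Database pool sized per the post-incident config, with a usage check.
// sendAlert is a hypothetical stand-in for a real paging/metrics client.
import { Pool } from "pg";

const POOL_MAX = 300; // raised from 100 after the incident

const db = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: POOL_MAX,
  connectionTimeoutMillis: 2000, // fail fast instead of queueing forever
  idleTimeoutMillis: 30_000,
});

const POOL_USAGE_ALERT_THRESHOLD = 0.8; // alert if > 80% of the pool is in use

function sendAlert(message: string): void {
  console.error(`[ALERT] ${message}`);
}

// Sample pool usage every 10 seconds and alert when it crosses the threshold
// or when requests are already queueing for a connection.
setInterval(() => {
  const inUse = db.totalCount - db.idleCount;
  if (inUse / POOL_MAX > POOL_USAGE_ALERT_THRESHOLD || db.waitingCount > 0) {
    sendAlert(
      `DB pool pressure: ${inUse}/${POOL_MAX} in use, ${db.waitingCount} waiting`
    );
  }
}, 10_000);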

Key Lessons

  1. Cache stampedes are real: When a hot key expires (or the cache is flushed), every concurrent request for that data falls through to the database at once. Use probabilistic early expiration or locking.

  2. Monitor connection pools: Database connection pools are a common bottleneck. Monitor usage and set alerts.

  3. Cache warming matters: After cache flushes or deployments, warm the cache with hot data.

  4. Restrict dangerous operations: Cache flush commands should require approval or be automated with safeguards.

  5. Circuit breakers help: When the database is overwhelmed, a circuit breaker fails fast instead of letting retries pile up into a cascading failure (a minimal sketch follows).
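
Below is a minimal circuit breaker sketch, not a specific library (in Node.js you would more likely reach for something like opossum); the thresholds are illustrative.

// Minimal circuit breaker around a flaky dependency such as the database.
type State = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: State = "closed";
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly failureThreshold = 5, // trip after 5 consecutive failures
    private readonly resetTimeoutMs = 10_000 // allow a probe after 10 s
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === "open") {
      if (Date.now() - this.openedAt < this.resetTimeoutMs) {
        throw new Error("circuit open: failing fast instead of piling onto the DB");
      }
      this.state = "half-open"; // let a single probe request through
    }
    try {
      const result = await fn();
      this.state = "closed"; // probe (or normal call) succeeded
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures++;
      if (this.state === "half-open" || this.failures >= this.failureThreshold) {
        this.state = "open";
        this.openedAt = Date.now();
      }
      throw err;
    }
  }
}

// Usage (hypothetical): const profile = await breaker.call(() => fetchProfileFromDb(id));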


Interview Takeaways

Common Questions:

  • "How do you prevent cache stampedes?"
  • "What happens when cache expires?"
  • "How do you handle database connection pool exhaustion?"

What Interviewers Are Looking For:

  • Understanding of cache-aside pattern and its failure modes
  • Knowledge of cache stampede mitigation strategies
  • Awareness of database connection pool limits
  • Experience with production incident response

What a Senior Engineer Would Do Differently

From the Start:

  1. Implement cache stampede protection: Probabilistic early expiration or distributed locking
  2. Monitor connection pools: Set up alerts before hitting limits
  3. Use cache warming: Pre-populate cache after deployments
  4. Add circuit breakers: Prevent cascading failures when database is overwhelmed
  5. Restrict dangerous operations: Cache flush should require approval

The Real Lesson: Caching is powerful, but cache invalidation is hard. Design for cache failures, not just cache hits.


FAQs

Q: How do you prevent cache stampedes?

A: Use probabilistic early expiration (cache keys expire 10% early randomly) or distributed locking (only one request per key fetches from database). Cache warming after deployments also helps.

Q: What's the best way to handle database connection pool exhaustion?

A: Monitor connection pool usage and set alerts before hitting limits. Increase pool size if needed, but also optimize queries and add caching to reduce database load.

Q: Should you always use caching?

A: Caching is powerful, but cache invalidation is hard. Use caching for read-heavy workloads, but design for cache failures. Not everything needs to be cached.

Q: How do you detect cache stampedes before they cause problems?

A: Monitor cache hit rate (alert if < 70%), database connection pool usage (alert if > 80%), and response times. Sudden drops in cache hit rate combined with increased database load indicate a potential stampede.

Q: What's the difference between cache stampede and thundering herd?

A: The terms are often used interchangeably. Thundering herd is the general pattern of many clients waking up and competing for a single resource at the same moment; a cache stampede is the cache-specific case, where an expired or flushed key sends every concurrent request for it to the backing store. Both can be prevented with locking or probabilistic early expiration.

Q: How do you warm a cache after a flush?

A: Pre-populate cache with hot data (most frequently accessed keys) before traffic hits. This can be done via a script that queries the database for top N items and stores them in cache.

Q: Is it better to use a single large connection pool or multiple smaller ones?

A: It depends on your architecture. For microservices, connection pools per service are better for isolation. For monoliths, a shared pool might be simpler. Monitor usage and adjust based on actual load.

Keep exploring

Real engineering stories work best when combined with practice. Explore more stories or apply what you've learned in our system design practice platform.