Real Engineering Stories
The Circuit Breaker That Didn't Break
A production incident where a misconfigured circuit breaker allowed cascading failures to propagate, taking down multiple services. Learn about circuit breaker patterns, failure isolation, and resilience.
This is a story about how a circuit breaker that was supposed to prevent failures actually made them worse. It's also about why understanding resilience patterns matters, and how we learned that configuration is as important as implementation.
Context
We had a microservices architecture with a payment service that processed transactions. The payment service called an external payment gateway API. We implemented a circuit breaker to prevent cascading failures if the gateway was down.
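At a high level, every payment request went through the breaker rather than calling the gateway directly. A minimal sketch of that wiring, assuming Opossum and an HTTP client such as axios (the function names, URL, and payload shape are illustrative placeholders, not the actual service code):

```js
// Sketch of the original wiring; names and URL are placeholders.
const CircuitBreaker = require('opossum');
const axios = require('axios');

// The protected call: a single HTTP request to the external payment gateway.
async function callGateway(order) {
  const response = await axios.post('https://gateway.example.com/charge', order);
  return response.data;
}

// Payment requests go through the breaker instead of calling the gateway directly.
const circuitBreaker = new CircuitBreaker(callGateway, { /* configuration discussed below */ });

async function processPayment(order) {
  return circuitBreaker.fire(order); // rejects on timeout, gateway error, or an open circuit
}
```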
Original Architecture:
```mermaid
graph TB
  Client[Client] --> API[API Gateway]
  API --> Payment[Payment Service]
  Payment --> CircuitBreaker[Circuit Breaker]
  CircuitBreaker --> Gateway[External Payment Gateway]
  Payment --> DB[(Database)]
```
Technology Choices:
- Payment Service: Node.js microservice
- Circuit Breaker: Opossum library (Node.js)
- External Gateway: Third-party payment API
- Database: PostgreSQL for transaction records
Assumptions Made:
- Circuit breaker would prevent cascading failures
- Configuration was correct (we copied from documentation)
- Circuit breaker would open quickly when gateway failed
The Incident
Timeline:
- 11:00 AM: External payment gateway started experiencing issues (high latency)
- 11:05 AM: Payment service requests started timing out
- 11:06 AM: Circuit breaker should have opened, but didn't
- 11:07 AM: Payment service threads blocked waiting for gateway
- 11:08 AM: Payment service stopped processing new requests
- 11:09 AM: API Gateway couldn't reach payment service
- 11:10 AM: Checkout flow completely broken
- 11:12 AM: On-call engineer paged
- 11:15 AM: Identified circuit breaker misconfiguration
- 11:20 AM: Circuit breaker configuration fixed
- 11:25 AM: Service recovered, but 15 minutes of transactions lost
Symptoms
What We Saw:
- Payment Service: Stopped responding to requests
- Error Rate: Jumped from 0.1% to 100% for payment requests
- Response Time: All payment requests timing out
- Thread Pool: All threads blocked waiting for gateway
- User Impact: ~50K checkout attempts failed, revenue loss
How We Detected It:
- Alert fired when payment service stopped responding
- Dashboard showed all payment requests failing
- External gateway status page showed issues
Monitoring Gaps:
- No alert for circuit breaker state
- No alert for thread pool exhaustion
- No alert for external service latency
Root Cause Analysis
Primary Cause: A circuit breaker configuration (timeout, error threshold, and statistics window) that did not match the service's concurrency.
The Bug:
```js
// BAD CONFIGURATION
const circuitBreaker = new CircuitBreaker(callGateway, {
  timeout: 5000,                 // 5 second timeout: a slow call holds a worker for 5s
  errorThresholdPercentage: 50,  // open after 50% of sampled requests fail
  resetTimeout: 30000,           // try again after 30 seconds
  // PROBLEM: with a 5 second timeout, requests stuck waiting on the gateway
  // don't count as failures until they finally time out, and the error
  // percentage is computed over the small set of recently completed requests.
  // With 100 requests in flight, the breaker's statistics never reflected what
  // was actually happening, so the circuit stayed closed while every worker
  // blocked on the gateway.
});
```
What Happened:
- The external payment gateway started responding with high latency (5+ seconds)
- The payment service had ~100 concurrent requests in flight to the gateway
- With a 5-second timeout, those slow requests did not register as failures until they finally timed out
- The breaker's rolling statistics window (left at the library default) only reflected the handful of requests that had already completed, not the 100 stuck in flight
- The error percentage therefore lagged far behind reality, and the circuit never opened
- Every new payment request kept being sent to the struggling gateway
- All payment service threads ended up blocked waiting on the gateway
- The payment service stopped processing new requests
- The failure cascaded up to the API Gateway and broke checkout
Why It Was So Bad:
- Mismatched threshold: a 50% error percentage, evaluated over a small sample of completed requests, said little about 100+ concurrent requests
- Small window: the rolling statistics only covered recently completed requests, not the requests still in flight
- No thread pool limits: All threads blocked, no capacity for new requests
- No fallback: No graceful degradation when circuit should open
Contributing Factors:
- Configuration copied from documentation without understanding
- No testing of circuit breaker under failure scenarios
- No monitoring of circuit breaker state
- High concurrency (100+ requests) not considered in configuration
Fix & Mitigation
Immediate Fix:
```js
// FIXED CONFIGURATION
const circuitBreaker = new CircuitBreaker(callGateway, {
  timeout: 2000,                // 2 second timeout (fail fast)
  errorThresholdPercentage: 80, // open after 80% errors (higher threshold)
  resetTimeout: 60000,          // wait 60 seconds before probing the gateway again
  volumeThreshold: 20,          // need at least 20 requests in the window before opening
  rollingCountTimeout: 10000,   // 10 second statistics window
  rollingCountBuckets: 10,      // 10 buckets for smoother tracking
});

// Register a fallback for when the circuit is open or a call fails
circuitBreaker.fallback(() => {
  return { error: 'Payment service temporarily unavailable' };
});
```
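For completeness, here is roughly how a request handler consumes the breaker and the fallback result. The Express route, payload shape, and status codes are assumptions for illustration, not the actual service code:

```js
// Illustrative only: route, payload, and response shapes are assumptions.
const express = require('express');
const app = express();
app.use(express.json());

app.post('/payments', async (req, res) => {
  try {
    // fire() invokes callGateway; when the circuit is open (or the call fails
    // and a fallback is registered), the fallback result comes back instead.
    const result = await circuitBreaker.fire(req.body.order);
    if (result && result.error) {
      // Fallback path: degrade gracefully instead of letting the request hang.
      return res.status(503).json(result);
    }
    return res.json({ status: 'ok', charge: result });
  } catch (err) {
    // Failures not covered by the fallback (e.g. errors thrown before the gateway call).
    return res.status(502).json({ error: 'payment failed' });
  }
});
```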
Long-Term Improvements:
- Circuit Breaker Configuration:
  - Increased error threshold to 80% (more appropriate for high concurrency)
  - Added volume threshold (need minimum requests before opening)
  - Added proper rolling window configuration
  - Added fallback mechanism
- Thread Pool Management:
  - Added thread pool size limits
  - Added thread pool monitoring
  - Added alert for thread pool exhaustion
- Monitoring & Alerting (see the event-listener sketch after this list):
  - Added circuit breaker state monitoring
  - Added alert when circuit opens
  - Added external service latency monitoring
  - Added fallback usage tracking
- Process Improvements:
  - Required circuit breaker testing in staging
  - Added resilience testing to CI/CD
  - Created runbook for circuit breaker incidents
  - Added configuration review process
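As referenced above, a sketch of the circuit-state monitoring, using the event emitter that Opossum exposes; `logger` and `metrics` stand in for whatever observability client you actually use:

```js
// Wire breaker state changes and fallback usage into logs and metrics.
// `logger` and `metrics` are placeholders for your observability stack.
circuitBreaker.on('open', () => {
  logger.warn('payment gateway circuit OPEN');
  metrics.increment('payment.circuit.open'); // the "circuit opened" alert keys off this
});
circuitBreaker.on('halfOpen', () => logger.info('payment gateway circuit half-open, probing gateway'));
circuitBreaker.on('close', () => logger.info('payment gateway circuit closed, normal traffic restored'));
circuitBreaker.on('timeout', () => metrics.increment('payment.gateway.timeout'));  // external latency signal
circuitBreaker.on('fallback', () => metrics.increment('payment.fallback.used'));   // fallback usage tracking
```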
Architecture After Fix
```mermaid
graph TB
  Client[Client] --> API[API Gateway]
  API --> Payment[Payment Service<br/>Thread Pool Limit]
  Payment --> CircuitBreaker[Circuit Breaker<br/>Proper Config]
  CircuitBreaker --> Gateway[External Payment Gateway]
  CircuitBreaker --> Fallback[Fallback Handler]
  Payment --> DB[(Database)]
  CircuitBreaker --> Monitor[Circuit State<br/>Monitoring]
```
Key Changes:
- Proper circuit breaker configuration
- Thread pool limits
- Fallback mechanism
- Circuit state monitoring
Key Lessons
- Circuit breaker configuration matters: Wrong thresholds can make failures worse. Understand your concurrency patterns before configuring.
- Test resilience patterns: Don't just implement circuit breakers; test them under failure scenarios. Staging should simulate production failures.
- Monitor circuit state: Know when your circuit breaker opens and closes. Alert on state changes.
- Have fallbacks: When the circuit opens, have a graceful degradation strategy. Don't just fail requests.
- Consider concurrency: High concurrency changes how circuit breakers behave. Configure accordingly.
Interview Takeaways
Common Questions:
- "How do circuit breakers work?"
- "How do you configure circuit breakers?"
- "What happens when a circuit breaker opens?"
What Interviewers Are Looking For:
- Understanding of circuit breaker pattern
- Knowledge of configuration parameters
- Experience with resilience patterns
- Awareness of failure isolation strategies
What a Senior Engineer Would Do Differently
From the Start:
- Understand configuration: Don't copy configs blindly. Understand what each parameter does.
- Test under failure: Test circuit breakers with actual failures, not just happy paths.
- Monitor circuit state: Track when circuits open/close and alert on state changes.
- Add fallbacks: Always have a graceful degradation strategy.
- Consider concurrency: Configure circuit breakers for your actual concurrency patterns.
The Real Lesson: Resilience patterns are powerful, but misconfiguration can make failures worse. Test, monitor, and understand your configuration.
FAQs
Q: How do circuit breakers work?
A: Circuit breakers monitor request success/failure rates. When error rate exceeds a threshold, the circuit "opens" and stops sending requests to the failing service. After a timeout, it "half-opens" to test if the service recovered, then closes if successful.
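To make those states concrete, here is a deliberately simplified breaker written from scratch; it illustrates the state machine only, not how Opossum or any other library implements it internally:

```js
// Minimal illustrative circuit breaker: CLOSED -> OPEN -> HALF_OPEN -> CLOSED.
// Thresholds and timings are arbitrary; real libraries use rolling windows and percentages.
class SimpleBreaker {
  constructor(fn, { failureThreshold = 5, resetTimeout = 30000 } = {}) {
    this.fn = fn;
    this.failureThreshold = failureThreshold;
    this.resetTimeout = resetTimeout;
    this.failures = 0;
    this.state = 'CLOSED';
    this.openedAt = 0;
  }

  async call(...args) {
    if (this.state === 'OPEN') {
      if (Date.now() - this.openedAt < this.resetTimeout) {
        throw new Error('circuit open'); // fail fast, do not touch the failing service
      }
      this.state = 'HALF_OPEN'; // let one probe request through
    }
    try {
      const result = await this.fn(...args);
      this.failures = 0;
      this.state = 'CLOSED'; // success (or a successful probe) closes the circuit
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.state === 'HALF_OPEN' || this.failures >= this.failureThreshold) {
        this.state = 'OPEN'; // too many failures, or the probe failed: open again
        this.openedAt = Date.now();
      }
      throw err;
    }
  }
}
```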
Q: How do you configure circuit breaker thresholds?
A: Error threshold should be high enough (70-80%) to avoid false positives but low enough to catch real failures. Volume threshold ensures you have enough data before opening. Consider your concurrency patterns when configuring.
Q: What happens when a circuit breaker opens?
A: The circuit stops sending requests to the failing service and immediately returns an error or calls a fallback function. This prevents cascading failures and gives the failing service time to recover.
Q: Should you always use circuit breakers?
A: Circuit breakers are useful for external service calls, but not always necessary for internal services. Use them when you want to prevent cascading failures and have a fallback strategy.
Q: How do you test circuit breakers?
A: Simulate failures in staging: slow down external services, return errors, or time out requests. Verify that circuits open correctly and fallbacks work. Test recovery scenarios.
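A sketch of what that can look like in practice with Opossum, using an artificially slow stand-in for the gateway (all numbers are arbitrary test values):

```js
// Drive the breaker with a dependency that is always slower than its timeout
// and verify that the circuit opens and the fallback takes over.
const CircuitBreaker = require('opossum');

const slowGateway = () => new Promise((resolve) => setTimeout(resolve, 5000)); // always too slow

const breaker = new CircuitBreaker(slowGateway, {
  timeout: 200,                 // every call will time out
  errorThresholdPercentage: 50,
  volumeThreshold: 5,           // need a handful of samples before opening
  resetTimeout: 1000,
});
breaker.fallback(() => ({ error: 'unavailable' }));
breaker.on('open', () => console.log('circuit opened'));
breaker.on('halfOpen', () => console.log('circuit half-open, probing for recovery'));

(async () => {
  // Enough concurrent requests to cross the volume threshold; all of them time out.
  await Promise.allSettled(Array.from({ length: 10 }, () => breaker.fire()));
  console.log('opened?', breaker.opened); // expect: true

  // With the circuit open, calls short-circuit straight to the fallback.
  console.log(await breaker.fire()); // { error: 'unavailable' }
})();
```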
Q: What's the difference between circuit breaker and retry?
A: Retries attempt the same request multiple times. Circuit breakers stop sending requests after detecting failures. Use retries for transient failures, circuit breakers for persistent failures.
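One common combination, sketched under the same assumptions as the earlier snippets (retry counts and delays are arbitrary):

```js
// Retries absorb short transient blips; the breaker stops hammering a dependency
// that keeps failing. callGateway is the illustrative gateway call from earlier.
async function withRetry(fn, attempts = 3, baseDelayMs = 100) {
  let lastError;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * attempt)); // simple backoff
    }
  }
  throw lastError;
}

// The breaker wraps the retrying call, so persistent failure opens the circuit
// instead of multiplying load with endless retries.
const resilientCharge = new CircuitBreaker(
  (order) => withRetry(() => callGateway(order)),
  { timeout: 8000, errorThresholdPercentage: 80, volumeThreshold: 20, resetTimeout: 60000 }
);
```

Whether retries sit inside or outside the breaker is a design choice: inside, each breaker attempt is slower but counted once; outside, retries against an open circuit fail fast without adding load to the gateway.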
Q: How do you choose circuit breaker timeouts?
A: Timeout should be shorter than your request timeout (fail fast). Error threshold should reflect your error tolerance. Reset timeout should give the service enough time to recover. Test and adjust based on actual behavior.
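As a concrete illustration of that layering (the URL is a placeholder and all numbers are illustrative, not recommendations):

```js
// The HTTP client gives up at 3s, but the breaker marks the attempt failed at 2s,
// so callers see a fast failure instead of waiting out the full transport timeout.
const axios = require('axios');
const CircuitBreaker = require('opossum');

const http = axios.create({ timeout: 3000 });                            // transport-level timeout
const chargeBreaker = new CircuitBreaker(
  (order) => http.post('https://gateway.example.com/charge', order),
  { timeout: 2000, errorThresholdPercentage: 80, resetTimeout: 60000 }   // breaker fails first
);
```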
Keep exploring
Real engineering stories work best when combined with practice. Explore more stories or apply what you've learned in our system design practice platform.