Real Engineering Stories
The Circuit Breaker That Didn't Break
A production incident where a misconfigured circuit breaker allowed cascading failures to propagate, taking down multiple services. Learn about circuit breaker patterns, failure isolation, and resilience.
This is a story about how a circuit breaker that was supposed to prevent failures actually made them worse. It's also about why understanding resilience patterns matters, and how we learned that configuration is as important as implementation.
Context
We had a microservices architecture with a payment service that processed transactions. The payment service called an external payment gateway API. We implemented a circuit breaker to prevent cascading failures if the gateway was down.
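At a high level, every payment request went through the breaker rather than calling the gateway directly. A minimal sketch of that wiring, assuming Opossum and an HTTP client such as axios (the function names, URL, and payload shape are illustrative placeholders, not the actual service code):

```js
// Sketch of the original wiring; names and URL are placeholders.
const CircuitBreaker = require('opossum');
const axios = require('axios');

// The protected call: a single HTTP request to the external payment gateway.
async function callGateway(order) {
  const response = await axios.post('https://gateway.example.com/charge', order);
  return response.data;
}

// Payment requests go through the breaker instead of calling the gateway directly.
const circuitBreaker = new CircuitBreaker(callGateway, { /* configuration discussed below */ });

async function processPayment(order) {
  return circuitBreaker.fire(order); // rejects on timeout, gateway error, or an open circuit
}
```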
Original Architecture:
```mermaid
graph TB
  Client[Client] --> API[API Gateway]
  API --> Payment[Payment Service]
  Payment --> CircuitBreaker[Circuit Breaker]
  CircuitBreaker --> Gateway[External Payment Gateway]
  Payment --> DB[(Database)]
```
Technology Choices:
- Payment Service: Node.js microservice
- Circuit Breaker: Opossum library (Node.js)
- External Gateway: Third-party payment API
- Database: PostgreSQL for transaction records
Assumptions Made:
- Circuit breaker would prevent cascading failures
- Configuration was correct (we copied from documentation)
- Circuit breaker would open quickly when gateway failed
The Incident
Timeline:
- 11:00 AM: External payment gateway started experiencing issues (high latency)
- 11:05 AM: Payment service requests started timing out
- 11:06 AM: Circuit breaker should have opened, but didn't
- 11:07 AM: Payment service threads blocked waiting for gateway
- 11:08 AM: Payment service stopped processing new requests
- 11:09 AM: API Gateway couldn't reach payment service
- 11:10 AM: Checkout flow completely broken
- 11:12 AM: On-call engineer paged
- 11:15 AM: Identified circuit breaker misconfiguration
- 11:20 AM: Circuit breaker configuration fixed
- 11:25 AM: Service recovered, but 15 minutes of transactions lost
Symptoms
What We Saw:
- Payment Service: Stopped responding to requests
- Error Rate: Jumped from 0.1% to 100% for payment requests
- Response Time: All payment requests timing out
- Thread Pool: All threads blocked waiting for gateway
- User Impact: ~50K checkout attempts failed, revenue loss
How We Detected It:
- Alert fired when payment service stopped responding
- Dashboard showed all payment requests failing
- External gateway status page showed issues
Monitoring Gaps:
- No alert for circuit breaker state
- No alert for thread pool exhaustion
- No alert for external service latency
Root Cause Analysis
Primary Cause: A circuit breaker configuration (timeout, error threshold, and statistics window) that did not match the service's concurrency.
The Bug:
```js
// BAD CONFIGURATION
const circuitBreaker = new CircuitBreaker(callGateway, {
  timeout: 5000,                 // 5 second timeout: a slow call holds a worker for 5s
  errorThresholdPercentage: 50,  // open after 50% of sampled requests fail
  resetTimeout: 30000,           // try again after 30 seconds
  // PROBLEM: with a 5 second timeout, requests stuck waiting on the gateway
  // don't count as failures until they finally time out, and the error
  // percentage is computed over the small set of recently completed requests.
  // With 100 requests in flight, the breaker's statistics never reflected what
  // was actually happening, so the circuit stayed closed while every worker
  // blocked on the gateway.
});
```
What Happened:
- The external payment gateway started responding with high latency (5+ seconds)
- The payment service had ~100 concurrent requests in flight to the gateway
- With a 5-second timeout, those slow requests did not register as failures until they finally timed out
- The breaker's rolling statistics window (left at the library default) only reflected the handful of requests that had already completed, not the 100 stuck in flight
- The error percentage therefore lagged far behind reality, and the circuit never opened
- Every new payment request kept being sent to the struggling gateway
- All payment service threads ended up blocked waiting on the gateway
- The payment service stopped processing new requests
- The failure cascaded up to the API Gateway and broke checkout
Why It Was So Bad:
- Mismatched threshold: a 50% error percentage, evaluated over a small sample of completed requests, said little about 100+ concurrent requests
- Small window: the rolling statistics only covered recently completed requests, not the requests still in flight
- No thread pool limits: All threads blocked, no capacity for new requests
- No fallback: No graceful degradation when circuit should open
Contributing Factors:
- Configuration copied from documentation without understanding
- No testing of circuit breaker under failure scenarios
- No monitoring of circuit breaker state
- High concurrency (100+ requests) not considered in configuration
Fix & Mitigation
Immediate Fix:
```js
// FIXED CONFIGURATION
const circuitBreaker = new CircuitBreaker(callGateway, {
  timeout: 2000,                // 2 second timeout (fail fast)
  errorThresholdPercentage: 80, // open after 80% errors (higher threshold)
  resetTimeout: 60000,          // wait 60 seconds before probing the gateway again
  volumeThreshold: 20,          // need at least 20 requests in the window before opening
  rollingCountTimeout: 10000,   // 10 second statistics window
  rollingCountBuckets: 10,      // 10 buckets for smoother tracking
});

// Register a fallback for when the circuit is open or a call fails
circuitBreaker.fallback(() => {
  return { error: 'Payment service temporarily unavailable' };
});
```
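For completeness, here is roughly how a request handler consumes the breaker and the fallback result. The Express route, payload shape, and status codes are assumptions for illustration, not the actual service code:

```js
// Illustrative only: route, payload, and response shapes are assumptions.
const express = require('express');
const app = express();
app.use(express.json());

app.post('/payments', async (req, res) => {
  try {
    // fire() invokes callGateway; when the circuit is open (or the call fails
    // and a fallback is registered), the fallback result comes back instead.
    const result = await circuitBreaker.fire(req.body.order);
    if (result && result.error) {
      // Fallback path: degrade gracefully instead of letting the request hang.
      return res.status(503).json(result);
    }
    return res.json({ status: 'ok', charge: result });
  } catch (err) {
    // Failures not covered by the fallback (e.g. errors thrown before the gateway call).
    return res.status(502).json({ error: 'payment failed' });
  }
});
```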
Long-Term Improvements:
- Circuit Breaker Configuration:
  - Increased error threshold to 80% (more appropriate for high concurrency)
  - Added volume threshold (need minimum requests before opening)
  - Added proper rolling window configuration
  - Added fallback mechanism
- Thread Pool Management:
  - Added thread pool size limits
  - Added thread pool monitoring
  - Added alert for thread pool exhaustion
- Monitoring & Alerting (see the event-listener sketch after this list):
  - Added circuit breaker state monitoring
  - Added alert when circuit opens
  - Added external service latency monitoring
  - Added fallback usage tracking
- Process Improvements:
  - Required circuit breaker testing in staging
  - Added resilience testing to CI/CD
  - Created runbook for circuit breaker incidents
  - Added configuration review process
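As referenced above, a sketch of the circuit-state monitoring, using the event emitter that Opossum exposes; `logger` and `metrics` stand in for whatever observability client you actually use:

```js
// Wire breaker state changes and fallback usage into logs and metrics.
// `logger` and `metrics` are placeholders for your observability stack.
circuitBreaker.on('open', () => {
  logger.warn('payment gateway circuit OPEN');
  metrics.increment('payment.circuit.open'); // the "circuit opened" alert keys off this
});
circuitBreaker.on('halfOpen', () => logger.info('payment gateway circuit half-open, probing gateway'));
circuitBreaker.on('close', () => logger.info('payment gateway circuit closed, normal traffic restored'));
circuitBreaker.on('timeout', () => metrics.increment('payment.gateway.timeout'));  // external latency signal
circuitBreaker.on('fallback', () => metrics.increment('payment.fallback.used'));   // fallback usage tracking
```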
Architecture After Fix
```mermaid
graph TB
  Client[Client] --> API[API Gateway]
  API --> Payment[Payment Service<br/>Thread Pool Limit]
  Payment --> CircuitBreaker[Circuit Breaker<br/>Proper Config]
  CircuitBreaker --> Gateway[External Payment Gateway]
  CircuitBreaker --> Fallback[Fallback Handler]
  Payment --> DB[(Database)]
  CircuitBreaker --> Monitor[Circuit State<br/>Monitoring]
```
Key Changes:
- Proper circuit breaker configuration
- Thread pool limits
- Fallback mechanism
- Circuit state monitoring
Key Lessons
- Circuit breaker configuration matters: Wrong thresholds can make failures worse. Understand your concurrency patterns before configuring.
- Test resilience patterns: Don't just implement circuit breakers; test them under failure scenarios. Staging should simulate production failures.
- Monitor circuit state: Know when your circuit breaker opens and closes. Alert on state changes.
- Have fallbacks: When the circuit opens, have a graceful degradation strategy. Don't just fail requests.
- Consider concurrency: High concurrency changes how circuit breakers behave. Configure accordingly.
Interview Takeaways
Common Questions:
- "How do circuit breakers work?"
- "How do you configure circuit breakers?"
- "What happens when a circuit breaker opens?"
What Interviewers Are Looking For:
- Understanding of circuit breaker pattern
- Knowledge of configuration parameters
- Experience with resilience patterns
- Awareness of failure isolation strategies
What a Senior Engineer Would Do Differently
From the Start:
- Understand configuration: Don't copy configs blindly. Understand what each parameter does.
- Test under failure: Test circuit breakers with actual failures, not just happy paths.
- Monitor circuit state: Track when circuits open/close and alert on state changes.
- Add fallbacks: Always have a graceful degradation strategy.
- Consider concurrency: Configure circuit breakers for your actual concurrency patterns.
The Real Lesson: Resilience patterns are powerful, but misconfiguration can make failures worse. Test, monitor, and understand your configuration.
FAQs
Q: How do circuit breakers work?
A: Circuit breakers monitor request success/failure rates. When error rate exceeds a threshold, the circuit "opens" and stops sending requests to the failing service. After a timeout, it "half-opens" to test if the service recovered, then closes if successful.
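To make those states concrete, here is a deliberately simplified breaker written from scratch; it illustrates the state machine only, not how Opossum or any other library implements it internally:

```js
// Minimal illustrative circuit breaker: CLOSED -> OPEN -> HALF_OPEN -> CLOSED.
// Thresholds and timings are arbitrary; real libraries use rolling windows and percentages.
class SimpleBreaker {
  constructor(fn, { failureThreshold = 5, resetTimeout = 30000 } = {}) {
    this.fn = fn;
    this.failureThreshold = failureThreshold;
    this.resetTimeout = resetTimeout;
    this.failures = 0;
    this.state = 'CLOSED';
    this.openedAt = 0;
  }

  async call(...args) {
    if (this.state === 'OPEN') {
      if (Date.now() - this.openedAt < this.resetTimeout) {
        throw new Error('circuit open'); // fail fast, do not touch the failing service
      }
      this.state = 'HALF_OPEN'; // let one probe request through
    }
    try {
      const result = await this.fn(...args);
      this.failures = 0;
      this.state = 'CLOSED'; // success (or a successful probe) closes the circuit
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.state === 'HALF_OPEN' || this.failures >= this.failureThreshold) {
        this.state = 'OPEN'; // too many failures, or the probe failed: open again
        this.openedAt = Date.now();
      }
      throw err;
    }
  }
}
```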
Q: How do you configure circuit breaker thresholds?
A: Error threshold should be high enough (70-80%) to avoid false positives but low enough to catch real failures. Volume threshold ensures you have enough data before opening. Consider your concurrency patterns when configuring.
Q: What happens when a circuit breaker opens?
A: The circuit stops sending requests to the failing service and immediately returns an error or calls a fallback function. This prevents cascading failures and gives the failing service time to recover.
Q: Should you always use circuit breakers?
A: Circuit breakers are useful for external service calls, but not always necessary for internal services. Use them when you want to prevent cascading failures and have a fallback strategy.
Q: How do you test circuit breakers?
A: Simulate failures in staging: slow down external services, return errors, or time out requests. Verify that circuits open correctly and fallbacks work. Test recovery scenarios.
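A sketch of what that can look like in practice with Opossum, using an artificially slow stand-in for the gateway (all numbers are arbitrary test values):

```js
// Drive the breaker with a dependency that is always slower than its timeout
// and verify that the circuit opens and the fallback takes over.
const CircuitBreaker = require('opossum');

const slowGateway = () => new Promise((resolve) => setTimeout(resolve, 5000)); // always too slow

const breaker = new CircuitBreaker(slowGateway, {
  timeout: 200,                 // every call will time out
  errorThresholdPercentage: 50,
  volumeThreshold: 5,           // need a handful of samples before opening
  resetTimeout: 1000,
});
breaker.fallback(() => ({ error: 'unavailable' }));
breaker.on('open', () => console.log('circuit opened'));
breaker.on('halfOpen', () => console.log('circuit half-open, probing for recovery'));

(async () => {
  // Enough concurrent requests to cross the volume threshold; all of them time out.
  await Promise.allSettled(Array.from({ length: 10 }, () => breaker.fire()));
  console.log('opened?', breaker.opened); // expect: true

  // With the circuit open, calls short-circuit straight to the fallback.
  console.log(await breaker.fire()); // { error: 'unavailable' }
})();
```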
Q: What's the difference between circuit breaker and retry?
A: Retries attempt the same request multiple times. Circuit breakers stop sending requests after detecting failures. Use retries for transient failures, circuit breakers for persistent failures.
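One common combination, sketched under the same assumptions as the earlier snippets (retry counts and delays are arbitrary):

```js
// Retries absorb short transient blips; the breaker stops hammering a dependency
// that keeps failing. callGateway is the illustrative gateway call from earlier.
async function withRetry(fn, attempts = 3, baseDelayMs = 100) {
  let lastError;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * attempt)); // simple backoff
    }
  }
  throw lastError;
}

// The breaker wraps the retrying call, so persistent failure opens the circuit
// instead of multiplying load with endless retries.
const resilientCharge = new CircuitBreaker(
  (order) => withRetry(() => callGateway(order)),
  { timeout: 8000, errorThresholdPercentage: 80, volumeThreshold: 20, resetTimeout: 60000 }
);
```

Whether retries sit inside or outside the breaker is a design choice: inside, each breaker attempt is slower but counted once; outside, retries against an open circuit fail fast without adding load to the gateway.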
Q: How do you choose circuit breaker timeouts?
A: Timeout should be shorter than your request timeout (fail fast). Error threshold should reflect your error tolerance. Reset timeout should give the service enough time to recover. Test and adjust based on actual behavior.
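As a concrete illustration of that layering (the URL is a placeholder and all numbers are illustrative, not recommendations):

```js
// The HTTP client gives up at 3s, but the breaker marks the attempt failed at 2s,
// so callers see a fast failure instead of waiting out the full transport timeout.
const axios = require('axios');
const CircuitBreaker = require('opossum');

const http = axios.create({ timeout: 3000 });                            // transport-level timeout
const chargeBreaker = new CircuitBreaker(
  (order) => http.post('https://gateway.example.com/charge', order),
  { timeout: 2000, errorThresholdPercentage: 80, resetTimeout: 60000 }   // breaker fails first
);
```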
Keep exploring
Real engineering stories work best when combined with practice. Explore more stories or apply what you've learned in our system design practice platform.