
Circuit Breaker Pattern

Learn the circuit breaker pattern for fault tolerance. Understand states (closed, open, half-open), implementation, and preventing cascading failures.

The circuit breaker pattern prevents cascading failures in distributed systems by stopping requests to failing services, allowing them to recover, and providing fallback mechanisms.


What is a Circuit Breaker?

A circuit breaker is a design pattern that:

  • Monitors the health of a downstream service
  • Stops sending requests while the service is failing
  • Gives the service time to recover
  • Resumes requests once the service has recovered

Analogy: Like an electrical circuit breaker, which cuts the current when a circuit is overloaded to prevent damage.


Circuit Breaker States

1. Closed (Normal Operation)

State: Circuit is closed, requests flow through.

Requests → Circuit Breaker → Service
          (All requests pass)

Behavior:

  • Requests forwarded to service
  • Failures counted
  • If failure threshold reached → Open

2. Open (Failing)

State: Circuit is open, requests blocked.

Requests → Circuit Breaker → [BLOCKED]
          (Requests fail fast)

Behavior:

  • Requests immediately fail (no service call)
  • Returns error or fallback
  • After timeout → Half-Open

3. Half-Open (Testing)

State: Circuit is half-open, testing if service recovered.

Request → Circuit Breaker → Service (test)
          (One request allowed)

Behavior:

  • Allows one test request
  • If successful → Closed
  • If failed → Open

State Transitions

Closed → Open: Failure threshold reached
Open → Half-Open: Timeout elapsed
Half-Open → Closed: Test request succeeds
Half-Open → Open: Test request fails
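
The four transitions above can be written down as a lookup table. This is only a sketch to make the state machine explicit: the `State` and `Event` enums and the `next_state` helper are names invented for illustration, and any state/event pair not listed is treated as a no-op.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class Event(Enum):
    THRESHOLD_REACHED = "threshold_reached"
    TIMEOUT_ELAPSED = "timeout_elapsed"
    TEST_SUCCEEDED = "test_succeeded"
    TEST_FAILED = "test_failed"


# The four transitions listed above; every other pair leaves the state unchanged.
TRANSITIONS = {
    (State.CLOSED, Event.THRESHOLD_REACHED): State.OPEN,
    (State.OPEN, Event.TIMEOUT_ELAPSED): State.HALF_OPEN,
    (State.HALF_OPEN, Event.TEST_SUCCEEDED): State.CLOSED,
    (State.HALF_OPEN, Event.TEST_FAILED): State.OPEN,
}


def next_state(state: State, event: Event) -> State:
    """Return the state that follows `state` when `event` occurs."""
    return TRANSITIONS.get((state, event), state)
```

Keeping the transitions in one table makes it easy to check that no path leads from Open straight back to Closed without a test request.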

Implementation

Basic Circuit Breaker

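import time


class CircuitBreakerOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    # Not thread-safe: guard call() with a lock if the breaker is shared across threads
    def __init__(self, failure_threshold=5, timeout=60):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.timeout = timeout                      # seconds to stay open before testing
        self.failures = 0
        self.last_failure_time = None
        self.state = 'CLOSED'  # CLOSED, OPEN, HALF_OPEN

    def call(self, func, *args, **kwargs):
        if self.state == 'OPEN':
            # time.monotonic() is unaffected by wall-clock adjustments
            if time.monotonic() - self.last_failure_time > self.timeout:
                self.state = 'HALF_OPEN'
            else:
                raise CircuitBreakerOpenError('Circuit breaker is OPEN')

        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            self.last_failure_time = time.monotonic()

            # A failed test request reopens the circuit immediately
            if self.state == 'HALF_OPEN' or self.failures >= self.failure_threshold:
                self.state = 'OPEN'

            raise  # re-raise with the original traceback

        # Success: close the circuit and reset the count, so that only
        # consecutive failures count toward the threshold
        self.state = 'CLOSED'
        self.failures = 0
        return result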
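
To see the three states in action without waiting for a real timeout, the same logic can be packaged as a decorator that accepts a replaceable clock. This is a sketch under assumptions: the `circuit_breaker` decorator, its `clock` parameter, the `breaker_state` attribute, and the `flaky` function are all invented for illustration and are not part of the class above.

```python
import functools
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


def circuit_breaker(failure_threshold=5, timeout=60, clock=time.monotonic):
    """Wrap a function in a circuit breaker (illustrative decorator form)."""
    def decorator(func):
        state = {"name": "CLOSED", "failures": 0, "opened_at": 0.0}

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if state["name"] == "OPEN":
                if clock() - state["opened_at"] < timeout:
                    raise CircuitOpenError(f"{func.__name__}: circuit is open")
                state["name"] = "HALF_OPEN"  # timeout elapsed: allow a test call
            try:
                result = func(*args, **kwargs)
            except Exception:
                state["failures"] += 1
                if state["name"] == "HALF_OPEN" or state["failures"] >= failure_threshold:
                    state["name"] = "OPEN"
                    state["opened_at"] = clock()
                raise
            state["name"] = "CLOSED"
            state["failures"] = 0
            return result

        wrapper.breaker_state = state  # exposed so callers and tests can inspect it
        return wrapper
    return decorator


now = [0.0]  # fake clock, advanced by hand


@circuit_breaker(failure_threshold=2, timeout=30, clock=lambda: now[0])
def flaky(ok):
    if not ok:
        raise RuntimeError("boom")
    return "ok"
```

With a threshold of 2, two calls to `flaky(False)` open the circuit, further calls are rejected with `CircuitOpenError`, and advancing `now[0]` past 30 lets one test call through.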

Circuit Breaker with Fallback

class CircuitBreakerWithFallback:
    def __init__(self, failure_threshold=5, timeout=60):
        self.circuit_breaker = CircuitBreaker(failure_threshold, timeout)
        self.fallback = None
    
    def set_fallback(self, fallback_func):
        self.fallback = fallback_func
    
    def call(self, func, *args, **kwargs):
        try:
            return self.circuit_breaker.call(func, *args, **kwargs)
        except CircuitBreakerOpenError:
            if self.fallback:
                return self.fallback(*args, **kwargs)
            raise
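
A single fallback function can itself try several strategies in order, for example cached data first and a static default second. The sketch below shows one way to chain them; `first_available`, `fallback_for`, and the `cache` dictionary are hypothetical names, and a strategy signals "nothing to offer" by returning `None`.

```python
from typing import Any, Callable, Optional, Sequence


def first_available(strategies: Sequence[Callable[[], Optional[Any]]]) -> Any:
    """Return the first non-None result; raise if every strategy declines."""
    for strategy in strategies:
        try:
            value = strategy()
        except Exception:
            continue  # a broken fallback should not mask the remaining ones
        if value is not None:
            return value
    raise LookupError("no fallback produced a value")


cache = {"/prices": {"usd": 1.0}}  # hypothetical last-known-good responses


def fallback_for(url: str) -> Any:
    return first_available([
        lambda: cache.get(url),   # 1. cached data, if any
        lambda: {"stale": True},  # 2. static default
    ])
```

A function like `fallback_for` could then be registered with `set_fallback`, so the breaker stays unaware of how the substitute value is produced.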

Circuit Breaker with Metrics

class CircuitBreakerWithMetrics:
    def __init__(self, failure_threshold=5, timeout=60):
        self.circuit_breaker = CircuitBreaker(failure_threshold, timeout)
        self.metrics = {
            'total_requests': 0,
            'successful_requests': 0,
            'failed_requests': 0,
            'circuit_open_count': 0
        }
    
    def call(self, func, *args, **kwargs):
        self.metrics['total_requests'] += 1
        
        try:
            result = self.circuit_breaker.call(func, *args, **kwargs)
            self.metrics['successful_requests'] += 1
            return result
        except CircuitBreakerOpenError:
            # Counts calls rejected while the circuit is open,
            # not the number of times the circuit has opened
            self.metrics['circuit_open_count'] += 1
            raise
        except Exception:
            self.metrics['failed_requests'] += 1
            raise
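
Raw counters are easier to alert on once they are turned into rates. The helper below is a sketch that assumes the metrics dictionary shown above; `summarize` is a hypothetical name, and `circuit_open_count` is read as the number of calls rejected while the circuit was open.

```python
def summarize(metrics: dict) -> dict:
    """Derive rates from the counters kept by CircuitBreakerWithMetrics."""
    total = metrics["total_requests"]
    if total == 0:
        return {"failure_rate": 0.0, "rejection_rate": 0.0}
    # Calls that actually reached the service (rejected calls never did)
    attempted = metrics["successful_requests"] + metrics["failed_requests"]
    return {
        "failure_rate": metrics["failed_requests"] / attempted if attempted else 0.0,
        "rejection_rate": metrics["circuit_open_count"] / total,
    }
```

The failure rate is computed over attempted calls only, so a long open period does not dilute it.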

Examples

HTTP Service with Circuit Breaker

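import requests
# Assumes the classes above are saved in a local module named circuit_breaker
from circuit_breaker import CircuitBreaker, CircuitBreakerOpenError

class HTTPClient:
    def __init__(self):
        self.circuit_breaker = CircuitBreaker(failure_threshold=5, timeout=60)
        self.cache = {}  # last successful response per URL
    
    def get(self, url):
        def _get():
            response = requests.get(url, timeout=5)
            response.raise_for_status()
            return response.json()
        
        try:
            data = self.circuit_breaker.call(_get)
            self.cache[url] = data  # remember the last good response
            return data
        except CircuitBreakerOpenError:
            # Return cached data or default
            return self.get_cached_data(url)
    
    def get_cached_data(self, url):
        # Returns None when nothing has been cached for this URL
        return self.cache.get(url)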

Microservice with Circuit Breaker

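class OrderService:
    def __init__(self, payment_service, inventory_service):
        # Service clients are injected; their methods are assumed for illustration
        self.payment_service = payment_service
        self.inventory_service = inventory_service
        # AsyncCircuitBreaker is an assumed async variant whose call() awaits the
        # wrapped coroutine inside its try block. The synchronous CircuitBreaker
        # would hand back an un-awaited coroutine and never count a failure.
        self.payment_circuit = AsyncCircuitBreaker(failure_threshold=5, timeout=60)
        self.inventory_circuit = AsyncCircuitBreaker(failure_threshold=5, timeout=60)
    
    async def create_order(self, order_data):
        # Call payment service with circuit breaker
        try:
            payment = await self.payment_circuit.call(
                self.payment_service.charge,
                order_data.total
            )
        except CircuitBreakerOpenError:
            # Payment service down: Queue for later
            await self.queue_payment(order_data)
            return {'status': 'queued'}
        
        # Call inventory service with circuit breaker
        try:
            inventory = await self.inventory_circuit.call(
                self.inventory_service.reserve,
                order_data.items
            )
        except Exception as exc:
            # Inventory call rejected (circuit open) or failed: Refund payment
            await self.payment_service.refund(payment.id)
            raise ServiceUnavailableError('Inventory service unavailable') from exc
        
        # save_order is a hypothetical helper that persists the order
        order = await self.save_order(order_data, payment, inventory)
        return {'status': 'success', 'order_id': order.id}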
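
The example above awaits the breaker's `call()`, which only works if the breaker awaits the wrapped coroutine itself. A synchronous breaker would see the coroutine object returned without error and record a success before the real call ever ran. The following is a minimal sketch of an async-aware breaker; `AsyncCircuitBreaker`, `demo`, and the failing `charge` coroutine are assumptions made for illustration.

```python
import asyncio
import time


class CircuitBreakerOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class AsyncCircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60.0):
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.failures = 0
        self.opened_at = 0.0
        self.state = "CLOSED"

    async def call(self, func, *args, **kwargs):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at < self.timeout:
                raise CircuitBreakerOpenError("circuit is open")
            self.state = "HALF_OPEN"
        try:
            # Await inside the try block so the coroutine's failure is counted
            result = await func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = time.monotonic()
            raise
        self.state = "CLOSED"
        self.failures = 0
        return result


async def demo():
    breaker = AsyncCircuitBreaker(failure_threshold=2, timeout=60.0)

    async def charge(amount):
        raise ConnectionError("payment service unreachable")

    outcomes = []
    for _ in range(3):
        try:
            await breaker.call(charge, 10)
        except CircuitBreakerOpenError:
            outcomes.append("rejected")
        except ConnectionError:
            outcomes.append("failed")
    return outcomes, breaker.state
```

Running `asyncio.run(demo())` shows two real failures followed by a fast rejection once the threshold of 2 is reached.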

Common Pitfalls

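  • Too sensitive: The circuit opens on a brief blip. Fix: Raise the failure threshold, or count failures over a time window instead of in total
  • Too slow to recover: The circuit stays open long after the service is healthy. Fix: Shorten the timeout so recovery is tested sooner
  • No fallback: Users see raw errors while the circuit is open. Fix: Implement a fallback (cached data, default values)
  • Not monitoring: Nobody knows when a circuit opens. Fix: Add metrics and alerts on state changes
  • Ignoring half-open: Recovery is never tested, or full traffic resumes at once. Fix: Implement the half-open state and limit it to a few test requests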
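
One common way to make a breaker less sensitive is to open it on the failure rate over a recent time window, and only after a minimum number of calls. The class below is a sketch of that idea, not a drop-in replacement for the breaker above; `SlidingWindowFailureRate` and its parameters (`window_seconds`, `min_calls`, `max_failure_rate`, `clock`) are assumptions chosen for illustration.

```python
import time
from collections import deque


class SlidingWindowFailureRate:
    """Track call outcomes over the last `window_seconds` seconds."""

    def __init__(self, window_seconds=30.0, min_calls=10, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.min_calls = min_calls
        self.clock = clock
        self._events = deque()  # (timestamp, succeeded) pairs, oldest first

    def record(self, succeeded: bool) -> None:
        self._events.append((self.clock(), succeeded))
        self._evict()

    def _evict(self) -> None:
        # Drop outcomes that have aged out of the window
        cutoff = self.clock() - self.window_seconds
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()

    def failure_rate(self) -> float:
        self._evict()
        if not self._events:
            return 0.0
        failures = sum(1 for _, ok in self._events if not ok)
        return failures / len(self._events)

    def should_open(self, max_failure_rate=0.5) -> bool:
        # Require enough samples so one early failure cannot trip the breaker
        self._evict()
        return (len(self._events) >= self.min_calls
                and self.failure_rate() >= max_failure_rate)
```

A breaker would call `record()` after each request and consult `should_open()` in place of the plain `failures >= failure_threshold` check.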

Interview Questions

Beginner

Q: What is the circuit breaker pattern and why is it used?

A:

The circuit breaker pattern prevents cascading failures by stopping requests to a failing service.

Why used:

  1. Prevent cascading failures: One failing service doesn't bring down the entire system
  2. Allow recovery: Gives the failing service time to recover
  3. Fail fast: Returns an error immediately instead of waiting for a timeout
  4. Resource protection: Keeps callers from tying up threads and connections on a failing service, and from overwhelming it further

States:

  • Closed: Normal operation, requests pass through
  • Open: Service failing, requests blocked (fail fast)
  • Half-Open: Testing if service recovered

Example:

Service A calls Service B
Service B starts failing
Circuit breaker opens after 5 failures
Service A gets immediate error (no wait)
Service B has time to recover
After timeout, circuit breaker tests (half-open)
If successful, circuit closes (normal operation)

Intermediate

Q: Explain the circuit breaker states and transitions. How do you implement it?

A:

States:

  1. Closed (Normal)

    Requests → Circuit Breaker → Service
    Failures counted
    If failures >= threshold → Open
    
  2. Open (Failing)

    Requests → Circuit Breaker → [BLOCKED]
    Returns error immediately (fail fast)
    After timeout → Half-Open
    
  3. Half-Open (Testing)

    Request → Circuit Breaker → Service (test)
    If success → Closed
    If failure → Open
    

Implementation:

import time

class CircuitBreakerOpenError(Exception):
    pass

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60):
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.failures = 0
        self.last_failure_time = None
        self.state = 'CLOSED'
    
    def call(self, func):
        if self.state == 'OPEN':
            if time.monotonic() - self.last_failure_time > self.timeout:
                self.state = 'HALF_OPEN'
            else:
                raise CircuitBreakerOpenError()
        
        try:
            result = func()
        except Exception:
            self.failures += 1
            self.last_failure_time = time.monotonic()
            # A failed test request reopens the circuit immediately
            if self.state == 'HALF_OPEN' or self.failures >= self.failure_threshold:
                self.state = 'OPEN'
            raise
        
        # Any success closes the circuit and resets the failure count
        self.state = 'CLOSED'
        self.failures = 0
        return result

Transitions:

  • Closed → Open: Failure threshold reached
  • Open → Half-Open: Timeout elapsed
  • Half-Open → Closed: Test succeeds
  • Half-Open → Open: Test fails

Senior

Q: Design a circuit breaker system for a microservices architecture. How do you handle different failure types, implement fallbacks, and monitor circuit breaker health?

A:

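class CircuitBreakerOpenError extends Error {}
class TimeoutError extends Error {}
class ServiceUnavailableError extends Error {}

type BreakerState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

interface BreakerOptions {
  serviceName: string;
  failureThreshold: number;
  timeout: number;          // ms to stay open before testing recovery
  halfOpenMaxCalls: number; // test calls allowed while half-open
  metrics: Metrics;
}

interface BreakerHealth {
  state: BreakerState;
  failures: number;
  lastFailure: number;
}

type HealthStatus = Record<string, BreakerHealth>;

class MicroservicesCircuitBreaker {
  private breakers = new Map<string, CircuitBreaker>();
  private fallbackManager = new FallbackManager();
  private metrics = new Metrics();

  // Service clients are injected; their shape is assumed for illustration
  constructor(private services: Map<string, any>) {}

  // 1. Circuit Breaker per Service
  getCircuitBreaker(serviceName: string): CircuitBreaker {
    let breaker = this.breakers.get(serviceName);
    if (!breaker) {
      breaker = new CircuitBreaker({
        serviceName,
        failureThreshold: 5,
        timeout: 60000, // 1 minute
        halfOpenMaxCalls: 3,
        metrics: this.metrics
      });
      this.breakers.set(serviceName, breaker);
    }
    return breaker;
  }

  // 2. Call with Circuit Breaker
  async call(serviceName: string, method: string, ...args: any[]): Promise<any> {
    const breaker = this.getCircuitBreaker(serviceName);
    const service = this.services.get(serviceName);
    if (!service) {
      throw new ServiceUnavailableError(`Unknown service: ${serviceName}`);
    }

    try {
      return await breaker.execute(() => service[method](...args));
    } catch (error) {
      if (error instanceof CircuitBreakerOpenError) {
        // Circuit open: Use fallback
        return await this.fallbackManager.execute(serviceName, method, args);
      }
      throw error;
    }
  }

  getHealth(): HealthStatus {
    return this.metrics.getHealth(this.breakers);
  }
}

// 3. Advanced Circuit Breaker
class CircuitBreaker {
  state: BreakerState = 'CLOSED';
  failures = 0;
  lastFailureTime = 0;
  private halfOpenCalls = 0;

  constructor(private options: BreakerOptions) {}

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    // Check state
    if (this.state === 'OPEN') {
      if (Date.now() - this.lastFailureTime > this.options.timeout) {
        this.state = 'HALF_OPEN';
        this.halfOpenCalls = 0;
      } else {
        throw new CircuitBreakerOpenError();
      }
    }

    // Half-open: Limit test calls
    if (this.state === 'HALF_OPEN') {
      if (this.halfOpenCalls >= this.options.halfOpenMaxCalls) {
        throw new CircuitBreakerOpenError();
      }
      this.halfOpenCalls++;
    }

    try {
      const result = await this.withTimeout(fn, 5000);

      // Success
      if (this.state === 'HALF_OPEN') {
        this.state = 'CLOSED';
        this.failures = 0;
      } else {
        this.failures = Math.max(0, this.failures - 1); // Decay
      }

      return result;
    } catch (error) {
      this.recordFailure(error);
      throw error;
    }
  }

  // Reject with TimeoutError if fn does not settle within ms milliseconds
  private withTimeout<T>(fn: () => Promise<T>, ms: number): Promise<T> {
    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<never>((_, reject) => {
      timer = setTimeout(() => reject(new TimeoutError(`Timed out after ${ms} ms`)), ms);
    });
    return Promise.race([fn(), timeout]).finally(() => clearTimeout(timer));
  }

  private recordFailure(error: unknown): void {
    // Different thresholds for different error types
    const threshold = this.getThreshold(error);

    if (threshold === Infinity) {
      // 4xx: the service answered, so it is reachable; never counts as a failure
      if (this.state === 'HALF_OPEN') {
        this.state = 'CLOSED';
        this.failures = 0;
      }
      return;
    }

    this.failures++;
    this.lastFailureTime = Date.now();

    // A failed test call reopens the circuit regardless of the count
    if (this.state === 'HALF_OPEN' || this.failures >= threshold) {
      this.state = 'OPEN';
      this.options.metrics.recordCircuitOpen(this.options.serviceName);
    }
  }

  private getThreshold(error: unknown): number {
    // Timeout: Lower threshold (service slow)
    if (error instanceof TimeoutError) {
      return 3;
    }

    // Assumes HTTP client errors carry a numeric `status` property
    const status = (error as { status?: number } | null)?.status;

    // 5xx: Service error (higher threshold)
    if (status !== undefined && status >= 500) {
      return 5;
    }

    // 4xx: Client error (don't count)
    if (status !== undefined && status >= 400) {
      return Infinity; // Don't open circuit
    }

    return this.options.failureThreshold;
  }
}

// 4. Fallback Strategies
class FallbackManager {
  async execute(serviceName: string, method: string, args: any[]): Promise<any> {
    // Strategy 1: Cached data
    const cached = await this.getCached(serviceName, method, args);
    if (cached !== undefined) {
      return cached;
    }

    // Strategy 2: Default values
    const defaults = this.getDefaults(serviceName, method);
    if (defaults !== undefined) {
      return defaults;
    }

    // Strategy 3: Alternative service
    const alternative = this.getAlternative(serviceName);
    if (alternative) {
      return await alternative[method](...args);
    }

    // Strategy 4: Queue for later
    await this.queueRequest(serviceName, method, args);
    throw new ServiceUnavailableError();
  }

  // Placeholder hooks: back these with a real cache, config store, and queue
  protected async getCached(_service: string, _method: string, _args: any[]): Promise<any> {
    return undefined;
  }
  protected getDefaults(_service: string, _method: string): any {
    return undefined;
  }
  protected getAlternative(_service: string): any {
    return undefined;
  }
  protected async queueRequest(_service: string, _method: string, _args: any[]): Promise<void> {}
}

// 5. Monitoring
class Metrics {
  recordCircuitOpen(serviceName: string): void {
    // Alert when circuit opens
    this.alert({
      service: serviceName,
      event: 'circuit_open',
      timestamp: Date.now()
    });
  }

  getHealth(breakers: Map<string, CircuitBreaker>): HealthStatus {
    const status: HealthStatus = {};
    for (const [service, breaker] of breakers.entries()) {
      status[service] = {
        state: breaker.state,
        failures: breaker.failures,
        lastFailure: breaker.lastFailureTime
      };
    }
    return status;
  }

  // Placeholder: forward to a real alerting system
  private alert(event: { service: string; event: string; timestamp: number }): void {
    console.warn('ALERT', event);
  }
}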

Features:

  1. Per-service circuit breakers: Separate breaker for each service
  2. Error type handling: Different thresholds for different errors
  3. Fallback strategies: Cached data, defaults, alternative services
  4. Monitoring: Track circuit state, failures, alerts
  5. Timeout handling: Prevent hanging requests

Key Takeaways

  • Circuit breaker: Prevents cascading failures by stopping requests to failing services
  • States: Closed (normal), Open (failing), Half-Open (testing)
  • Transitions: Closed → Open (threshold), Open → Half-Open (timeout), Half-Open → Closed/Open (test result)
  • Implementation: Monitor failures, open circuit at threshold, test recovery
  • Fallback: Return cached data, defaults, or queue requests when circuit open
  • Benefits: Fail fast, allow recovery, prevent cascading failures
  • Best practices: Adjust thresholds, implement fallbacks, monitor health, handle different error types

About the author

InterviewCrafted helps you master system design with patience. We believe in curiosity-led engineering, reflective writing, and designing systems that make future changes feel calm.