Circuit Breaker Pattern

Learn the circuit breaker pattern for fault tolerance. Understand states (closed, open, half-open), implementation, and preventing cascading failures.

The circuit breaker pattern prevents cascading failures in distributed systems by stopping requests to failing services, allowing them to recover, and providing fallback mechanisms.


What is a Circuit Breaker?

A circuit breaker is a design pattern that:

  • Monitors the health of a downstream service
  • Stops sending requests while the service is failing
  • Gives the service time to recover
  • Resumes requests once the service has recovered

Analogy: Like an electrical circuit breaker, it cuts the current when the circuit is overloaded and so prevents damage.


Circuit Breaker States

1. Closed (Normal Operation)

State: Circuit is closed, requests flow through.

Requests → Circuit Breaker → Service
          (All requests pass)

Behavior:

  • Requests forwarded to service
  • Failures counted
  • If failure threshold reached → Open

2. Open (Failing)

State: Circuit is open, requests blocked.

Requests → Circuit Breaker → [BLOCKED]
          (Requests fail fast)

Behavior:

  • Requests immediately fail (no service call)
  • Returns error or fallback
  • After timeout → Half-Open

3. Half-Open (Testing)

State: Circuit is half-open, testing if service recovered.

Request → Circuit Breaker → Service (test)
          (One request allowed)

Behavior:

  • Allows one test request
  • If successful → Closed
  • If failed → Open
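
Under concurrent traffic, "allows one test request" needs a guard so that only one caller probes the service at a time. The following is a minimal sketch using a non-blocking lock; `ProbeGate` and its method names are hypothetical and chosen only for illustration.

```python
import threading


class ProbeGate:
    """Lets one caller at a time send the half-open test request."""

    def __init__(self):
        self._lock = threading.Lock()

    def try_acquire(self):
        # Non-blocking: returns True for the first caller and False for
        # everyone else until release() is called.
        return self._lock.acquire(blocking=False)

    def release(self):
        # Called by the probing caller once the test request has finished.
        self._lock.release()
```

A breaker could call `try_acquire()` when it is half-open, fail fast for callers that get `False`, and call `release()` in a `finally` block after the probe.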

State Transitions

Closed → Open: Failure threshold reached
Open → Half-Open: Timeout elapsed
Half-Open → Closed: Test request succeeds
Half-Open → Open: Test request fails
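
The four transitions above can be written as a small lookup table. This is a sketch only: the `State` and `Event` names and the `next_state` helper are assumptions made for illustration, not part of any library.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class Event(Enum):
    THRESHOLD_REACHED = "threshold_reached"  # too many failures while closed
    TIMEOUT_ELAPSED = "timeout_elapsed"      # circuit has been open long enough
    TEST_SUCCEEDED = "test_succeeded"        # half-open test request worked
    TEST_FAILED = "test_failed"              # half-open test request failed


# (current state, event) -> next state. Pairs not listed leave the state unchanged.
TRANSITIONS = {
    (State.CLOSED, Event.THRESHOLD_REACHED): State.OPEN,
    (State.OPEN, Event.TIMEOUT_ELAPSED): State.HALF_OPEN,
    (State.HALF_OPEN, Event.TEST_SUCCEEDED): State.CLOSED,
    (State.HALF_OPEN, Event.TEST_FAILED): State.OPEN,
}


def next_state(state, event):
    """Return the state that follows `event`, or `state` if no rule applies."""
    return TRANSITIONS.get((state, event), state)
```

Keeping the rules in one table makes it easy to see that there is no direct Open → Closed transition: recovery always goes through Half-Open.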

Implementation

Basic Circuit Breaker

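import time


class CircuitBreakerOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.timeout = timeout                      # seconds to stay open before testing
        self.failures = 0
        self.last_failure_time = None
        self.state = 'CLOSED'  # CLOSED, OPEN, HALF_OPEN

    def call(self, func, *args, **kwargs):
        if self.state == 'OPEN':
            # monotonic() is not affected by system clock adjustments
            if time.monotonic() - self.last_failure_time > self.timeout:
                self.state = 'HALF_OPEN'
            else:
                raise CircuitBreakerOpenError('Circuit breaker is OPEN')

        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            self.last_failure_time = time.monotonic()

            # A failed test request reopens at once; otherwise open at the threshold
            if self.state == 'HALF_OPEN' or self.failures >= self.failure_threshold:
                self.state = 'OPEN'

            raise  # re-raise with the original traceback

        # Success: close the circuit and reset the count, so that only
        # consecutive failures can trip the breaker
        self.state = 'CLOSED'
        self.failures = 0
        return result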

Circuit Breaker with Fallback

class CircuitBreakerWithFallback:
    def __init__(self, failure_threshold=5, timeout=60):
        self.circuit_breaker = CircuitBreaker(failure_threshold, timeout)
        self.fallback = None
    
    def set_fallback(self, fallback_func):
        self.fallback = fallback_func
    
    def call(self, func, *args, **kwargs):
        try:
            return self.circuit_breaker.call(func, *args, **kwargs)
        except CircuitBreakerOpenError:
            if self.fallback:
                return self.fallback(*args, **kwargs)
            raise

Circuit Breaker with Metrics

class CircuitBreakerWithMetrics:
    def __init__(self, failure_threshold=5, timeout=60):
        self.circuit_breaker = CircuitBreaker(failure_threshold, timeout)
        self.metrics = {
            'total_requests': 0,
            'successful_requests': 0,
            'failed_requests': 0,
            'circuit_open_count': 0
        }
    
    def call(self, func, *args, **kwargs):
        self.metrics['total_requests'] += 1
        
        try:
            result = self.circuit_breaker.call(func, *args, **kwargs)
            self.metrics['successful_requests'] += 1
            return result
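        except CircuitBreakerOpenError:
            # Call rejected while the circuit was open (the service was not contacted)
            self.metrics['circuit_open_count'] += 1
            raise
        except Exception:
            # Call reached the service and failed
            self.metrics['failed_requests'] += 1
            raise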
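
Raw counters are easier to alert on once they are turned into rates. The sketch below assumes a metrics dictionary with the same four keys as above; the `summarize` helper and the names of the returned fields are hypothetical.

```python
def summarize(metrics):
    """Derive rates from the counters kept by CircuitBreakerWithMetrics."""
    total = metrics['total_requests']
    # Calls that actually reached the service (rejected calls never did).
    attempted = metrics['successful_requests'] + metrics['failed_requests']
    return {
        # Share of attempted calls that failed.
        'failure_rate': metrics['failed_requests'] / attempted if attempted else 0.0,
        # Share of all calls that were rejected because the circuit was open.
        'rejection_rate': metrics['circuit_open_count'] / total if total else 0.0,
    }
```

A sustained non-zero `rejection_rate` means the circuit is spending time open, which is usually the signal worth alerting on.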

Examples

HTTP Service with Circuit Breaker

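import requests
# Assumes the CircuitBreaker code above is saved as circuit_breaker.py
from circuit_breaker import CircuitBreaker, CircuitBreakerOpenError

class HTTPClient:
    def __init__(self):
        self.circuit_breaker = CircuitBreaker(failure_threshold=5, timeout=60)
        self.cache = {}  # last successful response per URL

    def get(self, url):
        def _get():
            response = requests.get(url, timeout=5)
            response.raise_for_status()  # treat 4xx/5xx responses as failures
            return response.json()

        try:
            data = self.circuit_breaker.call(_get)
            self.cache[url] = data  # remember the last good response
            return data
        except CircuitBreakerOpenError:
            # Return cached data or default
            return self.get_cached_data(url)

    def get_cached_data(self, url):
        # Illustrative fallback: the last good response, or None if there is none
        return self.cache.get(url)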

Microservice with Circuit Breaker

class OrderService:
    # Assumes an async-aware breaker whose call() awaits the wrapped coroutine.
    # The synchronous CircuitBreaker above would return the un-awaited coroutine,
    # so errors raised during the await would never be counted as failures.
    def __init__(self, payment_service, inventory_service):
        self.payment_service = payment_service
        self.inventory_service = inventory_service
        self.payment_circuit = CircuitBreaker(failure_threshold=5, timeout=60)
        self.inventory_circuit = CircuitBreaker(failure_threshold=5, timeout=60)

    async def create_order(self, order_data):
        # Call payment service with circuit breaker
        try:
            payment = await self.payment_circuit.call(
                self.payment_service.charge,
                order_data.total
            )
        except CircuitBreakerOpenError:
            # Payment service down: Queue for later
            await self.queue_payment(order_data)
            return {'status': 'queued'}

        # Call inventory service with circuit breaker
        try:
            inventory = await self.inventory_circuit.call(
                self.inventory_service.reserve,
                order_data.items
            )
        except Exception as exc:
            # Inventory unavailable (circuit open or call failed): Refund payment
            await self.payment_service.refund(payment.id)
            raise ServiceUnavailableError('Inventory service unavailable') from exc

        # Assumes order_data carries the order's id
        return {'status': 'success', 'order_id': order_data.id}
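
The order example awaits the breaker, so the breaker itself has to await the wrapped coroutine; otherwise errors raised during the await bypass it. The following is a minimal sketch of such a variant. `AsyncCircuitBreaker`, `demo`, and `charge` are hypothetical names, and concurrent half-open probes are not limited here.

```python
import asyncio
import time


class CircuitBreakerOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class AsyncCircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60):
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.failures = 0
        self.opened_at = None
        self.state = 'CLOSED'

    async def call(self, func, *args, **kwargs):
        if self.state == 'OPEN':
            if time.monotonic() - self.opened_at < self.timeout:
                raise CircuitBreakerOpenError('Circuit breaker is OPEN')
            self.state = 'HALF_OPEN'  # timeout elapsed: let one call test the service

        try:
            # Awaiting inside the try block is the key difference: errors
            # raised while the coroutine runs are seen by the breaker.
            result = await func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.state == 'HALF_OPEN' or self.failures >= self.failure_threshold:
                self.state = 'OPEN'
                self.opened_at = time.monotonic()
            raise

        self.state = 'CLOSED'
        self.failures = 0
        return result


async def demo():
    breaker = AsyncCircuitBreaker(failure_threshold=2, timeout=60)

    async def charge(amount):
        raise ConnectionError('payment service is down')

    outcomes = []
    for _ in range(3):
        try:
            await breaker.call(charge, 100)
        except ConnectionError:
            outcomes.append('failed')
        except CircuitBreakerOpenError:
            outcomes.append('rejected')
    return outcomes
```

Running `asyncio.run(demo())` shows two real failures followed by a rejected call, because the third call never reaches the failing service.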

Common Pitfalls

  • Too sensitive: The circuit opens on a brief blip. Fix: Raise the failure threshold or count failures over a time window
  • Too slow to recover: The circuit stays open long after the service is healthy. Fix: Shorten the timeout so recovery is tested sooner
  • No fallback: Callers see raw errors while the circuit is open. Fix: Implement a fallback (cached data, default values)
  • Not monitoring: Nobody knows when a circuit opens. Fix: Add metrics and alerts on state changes
  • Ignoring half-open: Recovery is never tested, or all traffic resumes at once. Fix: Implement the half-open state properly
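
One way to address the "too sensitive" pitfall is to judge the failure rate over a recent time window rather than a raw count. This is a sketch under assumed parameters: `SlidingWindowCounter`, its default values, and the explicit `now` argument (used to keep the example deterministic) are all illustrative.

```python
from collections import deque


class SlidingWindowCounter:
    """Tracks call outcomes over the last `window_seconds`."""

    def __init__(self, window_seconds=30, min_calls=10, failure_rate=0.5):
        self.window_seconds = window_seconds
        self.min_calls = min_calls        # do not judge on too few samples
        self.failure_rate = failure_rate  # open at or above this share of failures
        self.events = deque()             # (timestamp, succeeded) pairs, oldest first

    def record(self, succeeded, now):
        self.events.append((now, succeeded))
        self._evict(now)

    def _evict(self, now):
        # Drop outcomes that have fallen out of the window.
        while self.events and now - self.events[0][0] > self.window_seconds:
            self.events.popleft()

    def should_open(self, now):
        self._evict(now)
        total = len(self.events)
        if total < self.min_calls:
            return False
        failures = sum(1 for _, ok in self.events if not ok)
        return failures / total >= self.failure_rate
```

The `min_calls` guard keeps a single failed request on a quiet service from opening the circuit, and old failures age out on their own instead of accumulating forever.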

Interview Questions

Beginner

Q: What is the circuit breaker pattern and why is it used?

A:

The circuit breaker pattern prevents cascading failures by stopping requests to failing services.

Why used:

  1. Prevent cascading failures: A failing service doesn't bring down the entire system
  2. Allow recovery: Gives the failing service time to recover
  3. Fail fast: Returns an error immediately instead of waiting for a timeout
  4. Resource protection: Avoids overwhelming the failing service with more requests

States:

  • Closed: Normal operation, requests pass through
  • Open: Service failing, requests blocked (fail fast)
  • Half-Open: Testing if service recovered

Example:

Service A calls Service B
Service B starts failing
Circuit breaker opens after 5 failures
Service A gets immediate error (no wait)
Service B has time to recover
After timeout, circuit breaker tests (half-open)
If successful, circuit closes (normal operation)
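
The walkthrough above can be replayed in code with a fake clock, so no real waiting is needed. This is a sketch only: `DemoBreaker`, `CircuitOpenError`, and `run_timeline` are hypothetical names, and the threshold of 5 and timeout of 60 seconds mirror the example.

```python
class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the service."""


class DemoBreaker:
    def __init__(self, failure_threshold, timeout, clock):
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.clock = clock      # callable returning the current time in seconds
        self.failures = 0
        self.opened_at = None
        self.state = "CLOSED"

    def call(self, func):
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.timeout:
                raise CircuitOpenError("failing fast")
            self.state = "HALF_OPEN"  # timeout elapsed: allow a test request
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = self.clock()
            raise
        self.state = "CLOSED"
        self.failures = 0
        return result


def run_timeline():
    now = [0.0]          # fake clock value, advanced by hand
    healthy = [False]    # whether Service B is currently working
    calls_to_b = [0]     # how many calls actually reached Service B
    breaker = DemoBreaker(failure_threshold=5, timeout=60, clock=lambda: now[0])

    def service_b():
        calls_to_b[0] += 1
        if not healthy[0]:
            raise ConnectionError("Service B is down")
        return "ok"

    log = []
    for _ in range(5):   # Service B fails five times: the circuit opens
        try:
            breaker.call(service_b)
        except ConnectionError:
            pass
    log.append(breaker.state)

    try:                 # Service A now gets an immediate error
        breaker.call(service_b)
    except CircuitOpenError:
        log.append("rejected")

    now[0] += 61         # the timeout passes and Service B recovers
    healthy[0] = True
    log.append(breaker.call(service_b))  # half-open test request succeeds
    log.append(breaker.state)
    return log, calls_to_b[0]
```

`run_timeline()` returns the log `["OPEN", "rejected", "ok", "CLOSED"]` and a call count of 6: five failures plus one test request, with the rejected call never reaching Service B.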

Intermediate

Q: Explain the circuit breaker states and transitions. How do you implement it?

A:

States:

  1. Closed (Normal)

    Requests → Circuit Breaker → Service
    Failures counted
    If failures >= threshold → Open
    
  2. Open (Failing)

    Requests → Circuit Breaker → [BLOCKED]
    Returns error immediately (fail fast)
    After timeout → Half-Open
    
  3. Half-Open (Testing)

    Request → Circuit Breaker → Service (test)
    If success → Closed
    If failure → Open
    

Implementation:

import time

class CircuitBreakerOpenError(Exception):
    pass

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60):
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.failures = 0
        self.last_failure_time = None
        self.state = 'CLOSED'

    def call(self, func):
        if self.state == 'OPEN':
            if time.monotonic() - self.last_failure_time > self.timeout:
                self.state = 'HALF_OPEN'
            else:
                raise CircuitBreakerOpenError()

        try:
            result = func()
        except Exception:
            self.failures += 1
            self.last_failure_time = time.monotonic()
            # A failed test request reopens at once; otherwise open at the threshold
            if self.state == 'HALF_OPEN' or self.failures >= self.failure_threshold:
                self.state = 'OPEN'
            raise

        # Success: close the circuit and reset the consecutive-failure count
        self.state = 'CLOSED'
        self.failures = 0
        return result

Transitions:

  • Closed → Open: Failure threshold reached
  • Open → Half-Open: Timeout elapsed
  • Half-Open → Closed: Test succeeds
  • Half-Open → Open: Test fails

Senior

Q: Design a circuit breaker system for a microservices architecture. How do you handle different failure types, implement fallbacks, and monitor circuit breaker health?

A:

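// Error types used throughout this sketch
class CircuitBreakerOpenError extends Error {}
class ServiceUnavailableError extends Error {}
class TimeoutError extends Error {}

type BreakerState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';
type ServiceClient = Record<string, (...args: any[]) => Promise<any>>;
type HealthStatus = Record<string, { state: BreakerState; failures: number; lastFailure: number }>;

interface BreakerOptions {
  failureThreshold: number;
  timeout: number;          // ms to stay open before testing recovery
  halfOpenMaxCalls: number; // test calls allowed while half-open
}

// Monitoring
class Metrics {
  recordCircuitOpen(serviceName: string): void {
    // Alert when circuit opens
    this.alert({
      service: serviceName,
      event: 'circuit_open',
      timestamp: Date.now()
    });
  }

  private alert(event: { service: string; event: string; timestamp: number }): void {
    console.warn('ALERT', event); // Placeholder: connect a real alerting channel here
  }
}

// Advanced Circuit Breaker (one instance per service)
class CircuitBreaker {
  state: BreakerState = 'CLOSED';
  failures = 0;
  lastFailureTime = 0;
  private halfOpenCalls = 0;

  constructor(
    private serviceName: string,
    private options: BreakerOptions,
    private metrics: Metrics
  ) {}

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    // Check state
    if (this.state === 'OPEN') {
      if (Date.now() - this.lastFailureTime > this.options.timeout) {
        this.state = 'HALF_OPEN';
        this.halfOpenCalls = 0;
      } else {
        throw new CircuitBreakerOpenError();
      }
    }

    // Half-open: Limit test calls
    if (this.state === 'HALF_OPEN') {
      if (this.halfOpenCalls >= this.options.halfOpenMaxCalls) {
        throw new CircuitBreakerOpenError();
      }
      this.halfOpenCalls++;
    }

    try {
      const result = await this.withTimeout(fn, 5000);

      // Success
      if (this.state === 'HALF_OPEN') {
        this.state = 'CLOSED';
        this.failures = 0;
      } else {
        this.failures = Math.max(0, this.failures - 1); // Decay
      }

      return result;
    } catch (error) {
      this.recordFailure(error);
      throw error;
    }
  }

  // Reject with TimeoutError if fn() has not settled within ms
  private withTimeout<T>(fn: () => Promise<T>, ms: number): Promise<T> {
    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<never>((_, reject) => {
      timer = setTimeout(() => reject(new TimeoutError(`Timed out after ${ms} ms`)), ms);
    });
    return Promise.race([fn(), timeout]).finally(() => clearTimeout(timer));
  }

  private recordFailure(error: unknown): void {
    // Different thresholds for different error types
    const threshold = this.getThreshold(error);
    if (threshold === Infinity) {
      return; // Client error: says nothing about service health, so don't count it
    }

    this.failures++;
    this.lastFailureTime = Date.now();

    // A failed test call reopens at once; otherwise open at the threshold
    if (this.state === 'HALF_OPEN' || this.failures >= threshold) {
      this.state = 'OPEN';
      this.metrics.recordCircuitOpen(this.serviceName);
    }
  }

  private getThreshold(error: unknown): number {
    // Timeout: Lower threshold (service slow)
    if (error instanceof TimeoutError) {
      return 3;
    }

    // Assumes HTTP errors carry a numeric `status` property
    const status = (error as { status?: number } | null)?.status;
    if (typeof status === 'number') {
      // 5xx: Service error (higher threshold)
      if (status >= 500) {
        return 5;
      }
      // 4xx: Client error (don't count)
      if (status >= 400) {
        return Infinity; // Don't open circuit
      }
    }

    return this.options.failureThreshold;
  }
}

// Fallback Strategies
class FallbackManager {
  // Simple in-memory stores; populating them is outside this sketch
  private cache = new Map<string, any>();
  private defaults = new Map<string, any>();
  private alternatives = new Map<string, ServiceClient>();
  private queue: Array<{ serviceName: string; method: string; args: any[] }> = [];

  async execute(serviceName: string, method: string, args: any[]): Promise<any> {
    // Strategy 1: Cached data
    const cacheKey = `${serviceName}.${method}:${JSON.stringify(args)}`;
    if (this.cache.has(cacheKey)) {
      return this.cache.get(cacheKey);
    }

    // Strategy 2: Default values
    const defaultsKey = `${serviceName}.${method}`;
    if (this.defaults.has(defaultsKey)) {
      return this.defaults.get(defaultsKey);
    }

    // Strategy 3: Alternative service
    const alternative = this.alternatives.get(serviceName);
    if (alternative) {
      return await alternative[method](...args);
    }

    // Strategy 4: Queue for later
    this.queue.push({ serviceName, method, args });
    throw new ServiceUnavailableError(`${serviceName} is unavailable`);
  }
}

class MicroservicesCircuitBreaker {
  private breakers = new Map<string, CircuitBreaker>();
  private fallbackManager = new FallbackManager();
  private metrics = new Metrics();

  // Service clients are injected; how they are built is outside this sketch
  constructor(private services: Map<string, ServiceClient>) {}

  // Circuit Breaker per Service
  getCircuitBreaker(serviceName: string): CircuitBreaker {
    let breaker = this.breakers.get(serviceName);
    if (!breaker) {
      breaker = new CircuitBreaker(serviceName, {
        failureThreshold: 5,
        timeout: 60000, // 1 minute
        halfOpenMaxCalls: 3
      }, this.metrics);
      this.breakers.set(serviceName, breaker);
    }
    return breaker;
  }

  // Call with Circuit Breaker
  async call(serviceName: string, method: string, ...args: any[]): Promise<any> {
    const breaker = this.getCircuitBreaker(serviceName);
    const service = this.services.get(serviceName);
    if (!service) {
      throw new Error(`Unknown service: ${serviceName}`);
    }

    try {
      return await breaker.execute(() => service[method](...args));
    } catch (error) {
      if (error instanceof CircuitBreakerOpenError) {
        // Circuit open: Use fallback
        return await this.fallbackManager.execute(serviceName, method, args);
      }
      throw error;
    }
  }

  // Health snapshot of every breaker, for dashboards and health checks
  getHealth(): HealthStatus {
    const status: HealthStatus = {};
    for (const [service, breaker] of this.breakers.entries()) {
      status[service] = {
        state: breaker.state,
        failures: breaker.failures,
        lastFailure: breaker.lastFailureTime
      };
    }
    return status;
  }
}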

Features:

  1. Per-service circuit breakers: Separate breaker for each service
  2. Error type handling: Different thresholds for different errors
  3. Fallback strategies: Cached data, defaults, alternative services
  4. Monitoring: Track circuit state, failures, alerts
  5. Timeout handling: Prevent hanging requests
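
Error type handling comes down to one question: does this outcome say anything about the health of the service? The Python sketch below follows the same rule of thumb as the answer above (timeouts and 5xx responses count, 4xx responses do not). `counts_as_failure` is a hypothetical helper, and treating a missing response as a failure is an added assumption.

```python
def counts_as_failure(status_code=None, timed_out=False):
    """Decide whether a call outcome should count against the circuit."""
    if timed_out:
        return True   # the service is too slow to answer
    if status_code is None:
        return True   # no response at all, e.g. connection refused (assumption)
    if 500 <= status_code <= 599:
        return True   # server-side error
    return False      # 2xx, 3xx, and 4xx: the service itself is responding
```

Keeping this decision in one function makes the policy easy to review and to adjust per service.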

Key Takeaways

  • Circuit breaker: Prevents cascading failures by stopping requests to failing services
  • States: Closed (normal), Open (failing), Half-Open (testing)
  • Transitions: Closed → Open (threshold), Open → Half-Open (timeout), Half-Open → Closed/Open (test result)
  • Implementation: Monitor failures, open circuit at threshold, test recovery
  • Fallback: Return cached data, defaults, or queue requests when circuit open
  • Benefits: Fail fast, allow recovery, prevent cascading failures
  • Best practices: Adjust thresholds, implement fallbacks, monitor health, handle different error types

About the author

InterviewCrafted helps you master system design with patience. We believe in curiosity-led engineering, reflective writing, and designing systems that make future changes feel calm.