Fault Tolerance

Learn how to design systems that continue operating correctly even when components fail.

Fault tolerance is the ability of a system to continue operating correctly even when some components fail.


Failure Modes

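Crash failures: Node halts and stops responding (the most commonly assumed failure model)

Byzantine failures: Node behaves arbitrarily, for example by sending wrong or conflicting responses (due to bugs or malicious behavior)

Omission failures: Node fails to send or receive some messages

Timing failures: Node responds outside the expected time window (too slowly or too fast)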
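
From the caller's side, crash, omission, and timing failures often look the same: no reply arrives in time. A minimal sketch of bounding a call with a timeout so that a slow or dead node surfaces as an explicit error. `withTimeout` and `TimeoutError` are hypothetical names for illustration, not part of any library.

```typescript
// Hypothetical error type used to mark a call that took too long.
class TimeoutError extends Error {
  constructor(ms: number) {
    super(`Timed out after ${ms} ms`);
    this.name = 'TimeoutError';
  }
}

// Hypothetical helper: rejects with TimeoutError if `work` has not settled within `ms`.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new TimeoutError(ms)), ms);
    work.then(
      (value) => {
        clearTimeout(timer); // reply arrived in time: cancel the timer
        resolve(value);
      },
      (error) => {
        clearTimeout(timer); // the call itself failed: pass its error through
        reject(error);
      },
    );
  });
}
```

A timeout cannot tell a crashed node from a slow one. It only bounds how long the caller waits before treating the node as failed.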


Fault Tolerance Techniques

Redundancy

Replication: Multiple copies of data/services

Active-active: All replicas handle requests

Active-passive: Standby replicas take over on failure

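class RedundantService {
  // Ordered by preference: replicas[0] is the primary, the rest are standbys
  private replicas: Service[] = [];

  async handleRequest(request: Request): Promise<Response> {
    let lastError: unknown = new Error('No replicas configured');

    // Try the primary first, then fail over to each standby in turn
    for (const replica of this.replicas) {
      try {
        return await replica.process(request);
      } catch (error) {
        lastError = error;
      }
    }

    // Every replica failed: surface the most recent error
    throw lastError;
  }
}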
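
The example above shows active-passive failover. For contrast, here is a minimal active-active sketch in which every replica takes traffic in round-robin order and a failing replica is skipped for that request. The `Replica` interface and `ActiveActivePool` class are hypothetical names for illustration.

```typescript
// Hypothetical replica type: a named handler that may fail.
interface Replica {
  name: string;
  process(request: string): Promise<string>;
}

class ActiveActivePool {
  private next = 0; // index of the replica that gets the next request first

  constructor(private readonly replicas: Replica[]) {}

  async handleRequest(request: string): Promise<string> {
    const n = this.replicas.length;
    if (n === 0) {
      throw new Error('No replicas configured');
    }

    // Rotate the starting point so load spreads across all replicas.
    const start = this.next;
    this.next = (start + 1) % n;

    let lastError: unknown;
    // Try each replica at most once, beginning at the round-robin position.
    for (let i = 0; i < n; i++) {
      const replica = this.replicas[(start + i) % n];
      try {
        return await replica.process(request);
      } catch (error) {
        lastError = error; // remember the failure and try the next replica
      }
    }
    throw lastError;
  }
}
```

In this sketch a failed replica is retried on later requests. A production pool would usually combine this with health checks so that known-bad replicas are taken out of rotation.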

Circuit Breaker

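A circuit breaker prevents cascading failures by failing fast instead of sending more requests to a service that keeps failing. It starts closed (requests pass through), opens after repeated failures (requests are rejected immediately), and after a timeout becomes half-open so that a trial request can test whether the service has recovered.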

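class CircuitBreaker {
  private state: 'closed' | 'open' | 'half-open' = 'closed';
  private failures: number = 0;
  private lastFailureTime: number = 0;

  // Defaults are illustrative values, not recommendations
  constructor(
    private readonly threshold: number = 5,   // consecutive failures before the circuit opens
    private readonly timeout: number = 30000, // ms to stay open before allowing a trial call
  ) {}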

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'open') {
      if (Date.now() - this.lastFailureTime > this.timeout) {
        this.state = 'half-open'; // Try again
      } else {
        throw new Error('Circuit breaker open');
      }
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  onSuccess(): void {
    this.failures = 0;
    this.state = 'closed';
  }

  onFailure(): void {
    this.failures++;
    this.lastFailureTime = Date.now();
    
    if (this.failures >= this.threshold) {
      this.state = 'open';
    }
  }
}

Graceful Degradation

The system keeps serving requests with reduced functionality instead of failing outright.

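class DegradableService {
  async getData(): Promise<Data> {
    try {
      return await this.primarySource.get();
    } catch (error) {
      // Primary source failed: fall back to the cache, then to a default response
      try {
        return (await this.cache.get()) ?? this.getDefaultData();
      } catch (cacheError) {
        // The cache can fail too, so the last resort must not depend on it
        return this.getDefaultData();
      }
    }
  }
}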

Examples

Database Replication

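// promoteReplica and replicateAsync are left undefined in this sketch
class FaultTolerantDatabase {
  constructor(
    private primary: Database,
    private replicas: Database[] = [],
  ) {}

  async write(data: any): Promise<void> {
    try {
      await this.primary.write(data);
    } catch (error) {
      // Primary failed: promote a replica so later writes can succeed,
      // then rethrow so the caller knows this write was not applied
      await this.promoteReplica();
      throw error;
    }

    // Async replication (assumes replicateAsync returns a Promise): do not block
    // the write, but do not let a replication error go unhandled either
    this.replicateAsync(data).catch((error) => {
      console.error('Replication failed', error);
    });
  }

  async read(): Promise<Data> {
    try {
      return await this.primary.read();
    } catch (error) {
      // Primary unavailable: read from a replica, which may be stale
      // because replication is asynchronous
      const replica = this.replicas[0];
      if (!replica) throw error;
      return await replica.read();
    }
  }
}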

Common Pitfalls

  • Single point of failure: One component's failure brings down the whole system. Fix: Add redundancy
  • No failure detection: The system does not know when components fail. Fix: Health checks, timeouts
  • Cascading failures: One failure overloads or blocks the components that depend on it. Fix: Circuit breakers, rate limiting
  • Not testing failures: Failover paths are never exercised before a real outage. Fix: Chaos engineering
  • Ignoring partial failures: One failed dependency makes the whole system fail. Fix: Graceful degradation
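
As an illustration of the failure-detection fix, here is a minimal heartbeat-based detector sketch. `HeartbeatDetector` is a hypothetical class, and the injectable `now` clock is an assumption made so the example can be tested without waiting.

```typescript
// Hypothetical heartbeat-based failure detector.
class HeartbeatDetector {
  private lastSeen = new Map<string, number>();

  constructor(
    private readonly timeoutMs: number,
    private readonly now: () => number = () => Date.now(),
  ) {}

  // Record that a heartbeat arrived from `nodeId`.
  heartbeat(nodeId: string): void {
    this.lastSeen.set(nodeId, this.now());
  }

  // A node is suspected if it never reported or its last heartbeat is too old.
  isSuspected(nodeId: string): boolean {
    const seen = this.lastSeen.get(nodeId);
    return seen === undefined || this.now() - seen > this.timeoutMs;
  }

  // All known nodes that are currently suspected.
  suspectedNodes(): string[] {
    return Array.from(this.lastSeen.keys()).filter((id) => this.isSuspected(id));
  }
}
```

The detector reports nodes as suspected rather than failed, because a missed heartbeat can mean a crash, a slow node, or a network problem.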

Interview Questions

Beginner

Q: What is fault tolerance and why is it important?

A: Fault tolerance is the ability of a system to continue operating correctly even when components fail.

Why important:

  • High availability: System stays up even with failures
  • Reliability: Users can depend on the system
  • Resilience: System recovers from failures
  • User experience: Failures don't disrupt users

Example: If one database server fails, the system should keep serving requests from the remaining servers.
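
A rough way to see why redundancy raises availability: if each of n replicas is up with probability a, and failures are independent, the chance that at least one replica is up is 1 - (1 - a)^n. The sketch below computes that value. The independence assumption is a simplification, since replicas often share power, network, or software bugs.

```typescript
// Probability that at least one of `replicas` copies is up,
// assuming each is up with probability `perReplica` and failures are independent.
function combinedAvailability(perReplica: number, replicas: number): number {
  return 1 - Math.pow(1 - perReplica, replicas);
}
```

Under these assumptions, two replicas that are each 99% available give about 99.99% combined availability.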


Intermediate

Q: How do you design a fault-tolerant distributed system?

A:

Key techniques:

  1. Redundancy: Multiple copies of critical components
  2. Failure detection: Health checks, timeouts, monitoring
  3. Automatic recovery: Failover, restart failed components
  4. Isolation: Failures don't cascade
  5. Graceful degradation: Continue with reduced functionality

Example design:

  • Load balancer with multiple backend servers
  • Database replication (primary + replicas)
  • Circuit breakers to prevent cascading failures
  • Health checks to detect failures quickly
  • Automatic failover when primary fails

Senior

Q: Design a fault-tolerant microservices architecture. How do you handle service failures, database failures, and network partitions?

A:

Architecture:

  • Service redundancy: Multiple instances of each service
  • Database replication: Primary + replicas
  • Circuit breakers: Prevent cascading failures
  • Health checks: Detect failures quickly
  • Service mesh: Handle communication resilience

Design:

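// TypeScript does not allow a class declared inside a class body, so a namespace groups the sketches
namespace FaultTolerantMicroservices {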
  // Service with redundancy
  class ResilientService {
    private instances: ServiceInstance[] = [];
    private circuitBreaker: CircuitBreaker = new CircuitBreaker();

    async handleRequest(request: Request): Promise<Response> {
      return await this.circuitBreaker.call(async () => {
        // Try healthy instances
        const healthy = this.instances.filter(i => i.isHealthy());
        if (healthy.length === 0) {
          throw new Error('No healthy instances');
        }

        // Load balance across healthy instances (random choice as a simple policy)
        const instance = healthy[Math.floor(Math.random() * healthy.length)];
        return await instance.process(request);
      });
    }

    async healthCheck(): Promise<void> {
      for (const instance of this.instances) {
        try {
          await instance.healthCheck();
          instance.markHealthy();
        } catch (error) {
          instance.markUnhealthy();
        }
      }
    }
  }

  // Database with replication
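  class ResilientDatabase {
    // promoteReplica and replicateAsync are left undefined in this sketch
    constructor(
      private primary: Database,
      private replicas: Database[] = [],
    ) {}

    async write(data: any): Promise<void> {
      try {
        await this.primary.write(data);
      } catch (error) {
        // Primary failed: promote a replica, then rethrow because this write was not applied
        await this.promoteReplica();
        throw error;
      }
      // Async replication (assumed to return a Promise): log failures instead of dropping them
      this.replicateAsync(data).catch((error) => console.error('Replication failed', error));
    }

    async read(consistency: 'strong' | 'eventual'): Promise<Data> {
      const replica = this.replicas[0];
      if (consistency === 'strong' || !replica) {
        // Only the primary is guaranteed to have the latest write
        return await this.primary.read();
      }
      // Read from replica (faster, may be slightly stale)
      return await replica.read();
    }
  }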

  // Network partition handling
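  class PartitionAwareService {
    private degradedMode = false;

    constructor(
      // Minimum number of reachable nodes needed for normal operation (typically a majority)
      private readonly quorum: number,
      // Returns the IDs of the nodes that can currently be reached, including this one
      private readonly checkConnectivity: () => Promise<string[]>,
    ) {}

    isDegraded(): boolean {
      return this.degradedMode;
    }

    async handlePartition(): Promise<void> {
      const reachable = await this.checkConnectivity();

      // Minority partition: degraded mode with limited functionality.
      // Majority partition: normal operation.
      this.degradedMode = reachable.length < this.quorum;
    }
  }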
}
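
The `quorum` value used above is not defined in the sketch. One common choice, assumed here, is a strict majority of the configured cluster size, so that at most one side of a partition can reach quorum. `majorityQuorum` and `hasQuorum` are hypothetical helper names.

```typescript
// Smallest node count that is a strict majority of the cluster.
function majorityQuorum(clusterSize: number): number {
  return Math.floor(clusterSize / 2) + 1;
}

// `reachableNodes` counts every node this node can reach, including itself.
function hasQuorum(reachableNodes: number, clusterSize: number): boolean {
  return reachableNodes >= majorityQuorum(clusterSize);
}
```

With an even cluster size, a clean half-and-half split leaves neither side with quorum, which is one reason odd cluster sizes are often preferred.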

Failure Handling:

  1. Service failures: Circuit breaker, retry with backoff, failover to backup
  2. Database failures: Read from replicas, promote replica to primary
  3. Network partitions: Continue in degraded mode, sync when partition heals
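
The "retry with backoff" step in item 1 can be sketched as follows. `retryWithBackoff` is a hypothetical helper, the default attempt count and base delay are illustrative, and the injectable `sleep` parameter is an assumption made so the example can be tested without real delays.

```typescript
// Hypothetical helper: retries `fn` with exponential backoff and full jitter.
async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxAttempts: number = 4,
  baseDelayMs: number = 100,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown = new Error('maxAttempts must be at least 1');

  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts - 1) {
        // Wait a random time in [0, baseDelayMs * 2^attempt) before the next try.
        // The randomness keeps many clients from retrying in lockstep.
        const cap = baseDelayMs * Math.pow(2, attempt);
        await sleep(Math.random() * cap);
      }
    }
  }

  // All attempts failed: surface the most recent error.
  throw lastError;
}
```

Retries are only safe for idempotent operations. Retrying a non-idempotent write can apply it more than once.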

Key Takeaways

  • Fault tolerance keeps a system operating correctly despite component failures
  • Redundancy is key: Multiple copies of critical components
  • Failure detection: Health checks, timeouts, monitoring
  • Circuit breakers prevent cascading failures
  • Graceful degradation: Continue with reduced functionality
  • Automatic recovery: Failover, restart, self-healing
  • Test failures: Use chaos engineering to test fault tolerance
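
Chaos engineering usually injects faults at the infrastructure level, but the idea can be sketched in a unit test. The hypothetical `withFaultInjection` wrapper below makes a fraction of calls fail so that fallback paths get exercised. The injectable `random` source is an assumption made so that tests are reproducible.

```typescript
// Hypothetical fault injector: wraps `fn` so that calls fail with probability `failureRate`.
function withFaultInjection<A extends unknown[], R>(
  fn: (...args: A) => R,
  failureRate: number,
  random: () => number = Math.random,
): (...args: A) => R {
  return (...args: A): R => {
    if (random() < failureRate) {
      throw new Error('Injected fault');
    }
    return fn(...args);
  };
}
```

Wrapping a dependency this way in a test environment shows whether retries, circuit breakers, and fallbacks behave as intended before a real outage does.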

About the author

InterviewCrafted helps you master system design with patience. We believe in curiosity-led engineering, reflective writing, and designing systems that make future changes feel calm.