Fault Tolerance: Concepts, Trade-offs & Failure Modes
Learn how to design systems that continue operating correctly even when components fail.
Fault tolerance is the ability of a system to continue operating correctly even when some components fail.
Failure Modes
- Crash failures: A node stops and no longer responds (most common)
- Byzantine failures: A node behaves arbitrarily (malicious or buggy)
- Omission failures: A node fails to send or receive messages
- Timing failures: A node responds outside the expected time window (too slowly or too fast)
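From the caller's side, crash, omission, and slow-response failures often look identical: no reply arrives in time. A minimal sketch of turning that silence into a detectable error is shown here. The helper name `withTimeout` and the `TimeoutError` label are assumptions made for illustration, not part of the original material.

```typescript
// Hypothetical helper: rejects if `work` has not settled within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => {
      const err = new Error(`no response within ${ms} ms`);
      err.name = "TimeoutError"; // illustrative label so callers can tell timeouts apart
      reject(err);
    }, ms);

    work.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (reason) => {
        clearTimeout(timer);
        reject(reason);
      },
    );
  });
}
```

A timeout alone cannot tell a crashed node from a slow one, and it does nothing for Byzantine failures, where a node replies on time but with wrong data.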
Fault Tolerance Techniques
Redundancy
- Replication: Multiple copies of data/services
- Active-active: All replicas handle requests
- Active-passive: Standby replicas take over on failure
```typescript
class RedundantService {
  private replicas: Service[] = [];

  async handleRequest(request: Request): Promise<Response> {
    // ...
  }
}
```
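One way a redundant service might use its replicas is simple failover: try each replica in order and return the first successful reply. The sketch below assumes that behavior. The names `ServiceReplica` and `FailoverClient` are hypothetical, and requests are plain strings to keep the example self-contained.

```typescript
// Hypothetical replica interface, defined here only for illustration.
interface ServiceReplica {
  id: string;
  handle(payload: string): Promise<string>;
}

class FailoverClient {
  private readonly replicas: ServiceReplica[];

  constructor(replicas: ServiceReplica[]) {
    this.replicas = replicas;
  }

  // Try replicas in order; return the first successful reply.
  async handleRequest(payload: string): Promise<string> {
    const errors: string[] = [];
    for (const replica of this.replicas) {
      try {
        return await replica.handle(payload);
      } catch (err) {
        // Remember why this replica failed, then move on to the next one.
        errors.push(`${replica.id}: ${String(err)}`);
      }
    }
    throw new Error(`all replicas failed (${errors.join("; ")})`);
  }
}
```

This corresponds most closely to an active-passive setup. An active-active setup would spread requests across all replicas instead of always starting with the first.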
Circuit Breaker
Prevents cascading failures by stopping requests to failing services.
```typescript
class CircuitBreaker {
  private state: 'closed' | 'open' | 'half-open' = 'closed';
  private failures: number = 0;
  private lastFailureTime: number = 0;

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'open') {
      if (Date.now() - this.lastFailureTime > this.timeout) {
        // ...
      }
    }
    // ...
  }
}
```
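The three states can be sketched end to end as follows. This is a minimal illustration, not a production implementation: the class name `SimpleCircuitBreaker`, the constructor parameters `failureThreshold` and `resetTimeoutMs`, and the injectable `now` clock are all assumptions added so the example is self-contained and testable.

```typescript
type BreakerState = "closed" | "open" | "half-open";

class SimpleCircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private lastFailureTime = 0;
  private readonly failureThreshold: number;
  private readonly resetTimeoutMs: number;
  private readonly now: () => number;

  // All three parameters are illustrative assumptions.
  constructor(failureThreshold = 3, resetTimeoutMs = 1000, now: () => number = () => Date.now()) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now;
  }

  getState(): BreakerState {
    return this.state;
  }

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === "open") {
      if (this.now() - this.lastFailureTime < this.resetTimeoutMs) {
        // Fail fast: the failing service is not called at all.
        throw new Error("circuit open: request rejected");
      }
      this.state = "half-open"; // reset timeout elapsed, allow a trial request
    }

    try {
      const result = await fn();
      this.failures = 0;
      this.state = "closed"; // a success closes the circuit again
      return result;
    } catch (err) {
      this.failures += 1;
      this.lastFailureTime = this.now();
      // A failed trial request, or too many failures in a row, opens the circuit.
      if (this.state === "half-open" || this.failures >= this.failureThreshold) {
        this.state = "open";
      }
      throw err;
    }
  }
}
```

One simplification to be aware of: concurrent callers can all pass through in the half-open state here, whereas a stricter breaker would admit only a single trial request.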
Graceful Degradation
System continues with reduced functionality.
```typescript
class DegradableService {
  async getData(): Promise<Data> {
    try {
      return await this.primarySource.get();
    } catch (error) {
      // Fallback to cache or simplified response
      return await this.cache.get() || this.getDefaultData();
    }
  }
}
```
Examples
Database Replication
```typescript
class FaultTolerantDatabase {
  private primary: Database;
  private replicas: Database[] = [];

  async write(data: any): Promise<void> {
    try {
      await this.primary.write(data);
      // Async replication to replicas
      this.replicateAsync(data);
    } catch (error) {
      // Primary failed, promote replica
      await this.promoteReplica(/* ... */);
      // ...
    }
  }
}
```
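The primary-plus-replicas idea can be tried out with an in-memory stand-in. The sketch below is built on assumptions: `MemoryNode` and `ReplicatedStore` are hypothetical names, the `alive` flag simulates a node outage, and replication is awaited (rather than fire-and-forget) purely to keep the example deterministic.

```typescript
// Hypothetical in-memory node standing in for a real database server.
class MemoryNode {
  readonly label: string;
  alive = true; // flip to false to simulate a crash
  private readonly rows = new Map<string, string>();

  constructor(label: string) {
    this.label = label;
  }

  async write(key: string, value: string): Promise<void> {
    if (!this.alive) throw new Error(`${this.label} is down`);
    this.rows.set(key, value);
  }

  async read(key: string): Promise<string | undefined> {
    if (!this.alive) throw new Error(`${this.label} is down`);
    return this.rows.get(key);
  }
}

class ReplicatedStore {
  private primary: MemoryNode;
  private replicas: MemoryNode[];

  constructor(primary: MemoryNode, replicas: MemoryNode[]) {
    this.primary = primary;
    this.replicas = replicas;
  }

  primaryLabel(): string {
    return this.primary.label;
  }

  async write(key: string, value: string): Promise<void> {
    try {
      await this.primary.write(key, value);
    } catch {
      this.promoteReplica();
      await this.primary.write(key, value); // retry once on the new primary
    }
    // Best-effort replication: a failing replica must not fail the write.
    await Promise.all(
      this.replicas.map((r) => r.write(key, value).catch(() => undefined)),
    );
  }

  private promoteReplica(): void {
    const next = this.replicas.find((r) => r.alive);
    if (!next) throw new Error("no healthy replica to promote");
    this.replicas = this.replicas.filter((r) => r !== next);
    this.primary = next;
  }
}
```

With truly asynchronous replication, a promoted replica may be missing the most recent writes, and in a real cluster promotion has to be coordinated so that two nodes never act as primary at the same time.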
Common Pitfalls
- Single point of failure: One component's failure brings down the whole system. Fix: Add redundancy
- No failure detection: Nobody knows when a component has failed. Fix: Health checks, timeouts
- Cascading failures: One failure triggers failures in dependent components. Fix: Circuit breakers, rate limiting
- Not testing failures: Behavior under failure is unknown until it happens in production. Fix: Chaos engineering
- Ignoring partial failures: A partial failure takes down the whole system. Fix: Graceful degradation
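The "no failure detection" pitfall is commonly addressed with heartbeats: each node reports in periodically, and a node that stays silent for too long is treated as suspect. A minimal sketch follows. `HeartbeatMonitor` and its method names are hypothetical, and timestamps are passed in explicitly so the logic is easy to test.

```typescript
class HeartbeatMonitor {
  private readonly lastSeen = new Map<string, number>();
  private readonly timeoutMs: number;

  constructor(timeoutMs: number) {
    this.timeoutMs = timeoutMs;
  }

  // Called whenever a heartbeat arrives from a node.
  recordHeartbeat(nodeId: string, nowMs: number): void {
    this.lastSeen.set(nodeId, nowMs);
  }

  // Nodes whose last heartbeat is older than the timeout are suspected to have failed.
  suspectedNodes(nowMs: number): string[] {
    const suspects: string[] = [];
    this.lastSeen.forEach((seenAt, nodeId) => {
      if (nowMs - seenAt > this.timeoutMs) suspects.push(nodeId);
    });
    return suspects;
  }
}
```

The result is only a suspicion: a slow network can delay heartbeats from a healthy node, so the timeout trades detection speed against false alarms.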
Interview Questions
Beginner
Q: What is fault tolerance and why is it important?
A: Fault tolerance is the ability of a system to continue operating correctly even when components fail.
Why it matters:
- High availability: System stays up even with failures
- Reliability: Users can depend on the system
- Resilience: System recovers from failures
- User experience: Failures don't disrupt users
Example: If one database server fails, the system should continue using the other servers.
Intermediate
Q: How do you design a fault-tolerant distributed system?
A:
Key techniques:
- Redundancy: Multiple copies of critical components
- Failure detection: Health checks, timeouts, monitoring
- Automatic recovery: Failover, restart failed components
- Isolation: Failures don't cascade
- Graceful degradation: Continue with reduced functionality
Example design:
- Load balancer with multiple backend servers
- Database replication (primary + replicas)
- Circuit breakers to prevent cascading failures
- Health checks to detect failures quickly
- Automatic failover when primary fails
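The load balancer in the example design can be sketched as round-robin selection that skips backends marked unhealthy. `HealthAwareBalancer` and its methods are hypothetical names. In a real deployment the health marks would be driven by periodic health checks rather than set by hand.

```typescript
class HealthAwareBalancer {
  private readonly backends: string[];
  private readonly unhealthy = new Set<string>();
  private cursor = 0;

  constructor(backends: string[]) {
    this.backends = backends.slice();
  }

  markUnhealthy(backend: string): void {
    this.unhealthy.add(backend);
  }

  markHealthy(backend: string): void {
    this.unhealthy.delete(backend);
  }

  // Round-robin over the backends, skipping any that are marked unhealthy.
  next(): string {
    for (let i = 0; i < this.backends.length; i++) {
      const candidate = this.backends[this.cursor];
      this.cursor = (this.cursor + 1) % this.backends.length;
      if (candidate !== undefined && !this.unhealthy.has(candidate)) {
        return candidate;
      }
    }
    throw new Error("no healthy backend available");
  }
}
```

For example, with backends `a`, `b`, and `c` and `b` marked unhealthy, successive calls to `next()` alternate between `a` and `c` until `b` is marked healthy again.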
Senior
Q: Design a fault-tolerant microservices architecture. How do you handle service failures, database failures, and network partitions?
A:
Architecture:
- Service redundancy: Multiple instances of each service
- Database replication: Primary + replicas
- Circuit breakers: Prevent cascading failures
- Health checks: Detect failures quickly
- Service mesh: Handle communication resilience
Design:
```typescript
// Fault-tolerant microservices: service with redundancy
class ResilientService {
  private instances: ServiceInstance[] = [];
  private circuitBreaker: CircuitBreaker;

  async handleRequest(request: Request): Promise<Response> {
    return await this.circuitBreaker.call(async () => {
      // Try healthy instances
      const healthy = this.instances.filter(i => i.isHealthy());
      // ...
    });
  }
}
```
Failure Handling:
- Service failures: Circuit breaker, retry with backoff, failover to backup
- Database failures: Read from replicas, promote replica to primary
- Network partitions: Continue in degraded mode, sync when partition heals
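"Retry with backoff" from the list above can be sketched as follows. The function name `retryWithBackoff`, its default values, and the jitter formula are assumptions chosen for illustration.

```typescript
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: retries `fn` up to `maxAttempts` times with exponential backoff.
async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxAttempts = 4,
  baseDelayMs = 100,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts - 1) {
        // The delay doubles on each attempt; random jitter keeps many clients
        // from retrying at exactly the same moment.
        const delay = baseDelayMs * 2 ** attempt * (0.5 + Math.random() / 2);
        await sleep(delay);
      }
    }
  }
  throw lastError;
}
```

Retrying is only safe when the operation is idempotent, since a request that timed out may still have been applied by the server.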
Key Takeaways
- Fault tolerance ensures the system continues operating despite failures
- Redundancy is key: Multiple copies of critical components
- Failure detection: Health checks, timeouts, monitoring
- Circuit breakers prevent cascading failures
- Graceful degradation: Continue with reduced functionality
- Automatic recovery: Failover, restart, self-healing
- Test failures: Use chaos engineering to test fault tolerance
What's next?
- Heartbeats & Health Checks - Detecting node failures
- Partition Tolerance - Handling network partitions
- Leader Election - Electing leaders when nodes fail
- Replication Lag - Handling replica failures
- Idempotency - Making operations safe to retry