Topic Overview
Fault Tolerance
Learn how to design systems that continue operating correctly even when components fail.
Fault tolerance is the ability of a system to continue operating correctly even when some components fail.
Failure Modes
Crash failures: Node halts and stops responding (most common)
Byzantine failures: Node behaves arbitrarily, e.g. sends conflicting or corrupted messages (malicious or buggy)
Omission failures: Node fails to send or receive some messages
Timing failures: Node responds outside the expected time window (too slowly or too fast)
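From the caller's side, a crashed node, a dropped message, and a very slow reply all look the same: no answer arrives in time. A minimal sketch of a deadline wrapper that turns all three into an explicit error is shown here. The name withTimeout and the error message are illustrative assumptions, not part of any library.

```typescript
// Hypothetical helper: rejects if `work` has not settled within `ms` milliseconds.
// The underlying operation is not cancelled; the caller simply stops waiting for it.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`No response within ${ms} ms`)),
      ms,
    );
    work.then(
      (value) => {
        clearTimeout(timer); // Answer arrived in time
        resolve(value);
      },
      (error) => {
        clearTimeout(timer); // Operation failed on its own before the deadline
        reject(error);
      },
    );
  });
}
```

A timeout cannot tell the caller which failure mode occurred, only that the node should be treated as suspect for this request.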
Fault Tolerance Techniques
Redundancy
Replication: Multiple copies of data/services
Active-active: All replicas handle requests
Active-passive: Standby replicas take over on failure
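// Minimal shape of a replica; the Request and Response types are assumed to exist
interface Service {
  process(request: Request): Promise<Response>;
}

// Active-passive failover: try replicas in priority order (index 0 is the primary)
class RedundantService {
  constructor(private readonly replicas: Service[]) {}

  async handleRequest(request: Request): Promise<Response> {
    let lastError: unknown = new Error('No replicas configured');
    for (const replica of this.replicas) {
      try {
        return await replica.process(request);
      } catch (error) {
        // This replica failed; fail over to the next one
        lastError = error;
      }
    }
    throw lastError; // Every replica failed
  }
}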
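The example above follows the active-passive pattern. For contrast, here is a minimal active-active sketch in which every replica takes a share of the traffic in round-robin order and a failed call falls through to the next replica. The Replica interface and the ActiveActiveService name are assumptions made for illustration.

```typescript
// Hypothetical replica interface; `process` mirrors the method name used above.
interface Replica<Req, Res> {
  process(request: Req): Promise<Res>;
}

class ActiveActiveService<Req, Res> {
  private next = 0; // Index of the replica that gets the next request first

  constructor(private readonly replicas: Replica<Req, Res>[]) {
    if (replicas.length === 0) {
      throw new Error('At least one replica is required');
    }
  }

  async handleRequest(request: Req): Promise<Res> {
    const start = this.next;
    this.next = (this.next + 1) % this.replicas.length;

    let lastError: unknown;
    // Start at the round-robin position, then try the remaining replicas in order
    for (let i = 0; i < this.replicas.length; i++) {
      const replica = this.replicas[(start + i) % this.replicas.length];
      try {
        return await replica.process(request);
      } catch (error) {
        lastError = error;
      }
    }
    throw lastError; // Every replica failed
  }
}
```

Active-active uses all capacity during normal operation, but the remaining replicas must be able to absorb the load of a failed one.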
Circuit Breaker
Prevents cascading failures by stopping requests to failing services.
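class CircuitBreaker {
  private state: 'closed' | 'open' | 'half-open' = 'closed';
  private failures = 0;
  private lastFailureTime = 0;

  constructor(
    private readonly threshold = 5, // Consecutive failures before opening (assumed default)
    private readonly timeout = 30000, // Milliseconds to stay open before a trial call (assumed default)
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'open') {
      if (Date.now() - this.lastFailureTime > this.timeout) {
        this.state = 'half-open'; // Let a trial request through
      } else {
        throw new Error('Circuit breaker open'); // Fail fast without calling the service
      }
    }
    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  onSuccess(): void {
    this.failures = 0;
    this.state = 'closed';
  }

  onFailure(): void {
    this.failures++;
    this.lastFailureTime = Date.now();
    // A failed trial call in half-open reopens the circuit immediately
    if (this.state === 'half-open' || this.failures >= this.threshold) {
      this.state = 'open';
    }
  }
}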
Graceful Degradation
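The system keeps serving requests with reduced functionality instead of failing outright.
class DegradableService {
  constructor(
    private readonly primarySource: { get(): Promise<Data> },
    private readonly cache: { get(): Promise<Data | undefined> },
    private readonly defaultData: Data, // Simplified response used as a last resort
  ) {}

  async getData(): Promise<Data> {
    try {
      return await this.primarySource.get();
    } catch (error) {
      // Fall back to the cache, then to a simplified default response
      return (await this.cache.get()) ?? this.getDefaultData();
    }
  }

  private getDefaultData(): Data {
    return this.defaultData;
  }
}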
Examples
Database Replication
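class FaultTolerantDatabase {
  constructor(
    private primary: Database,
    private replicas: Database[] = [],
  ) {}

  async write(data: any): Promise<void> {
    try {
      await this.primary.write(data);
    } catch (error) {
      // Primary failed: promote a replica, then surface the error so the caller can retry
      await this.promoteReplica();
      throw error;
    }
    // Async replication: do not block the write, but never leave a rejection unhandled
    this.replicateAsync(data).catch((err) => console.error('Replication failed', err));
  }

  async read(): Promise<Data> {
    try {
      return await this.primary.read();
    } catch (error) {
      // Read from a replica (may be slightly stale)
      if (this.replicas.length === 0) throw error;
      return await this.replicas[0].read();
    }
  }

  private async replicateAsync(data: any): Promise<void> {
    await Promise.all(this.replicas.map((replica) => replica.write(data)));
  }

  // Simplified promotion: the first replica becomes the new primary
  private async promoteReplica(): Promise<void> {
    const next = this.replicas.shift();
    if (next) this.primary = next;
  }
}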
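Which replica to promote matters when replication is asynchronous, because replicas can lag behind the primary by different amounts. One possible policy, sketched here under assumptions, is to promote the healthy replica that has applied the most of the replication log. ReplicaStatus, appliedPosition, and chooseReplicaToPromote are hypothetical names, not the API of any particular database.

```typescript
// Hypothetical status record reported by each replica.
interface ReplicaStatus {
  id: string;
  healthy: boolean;
  appliedPosition: number; // Last replication-log position this replica has applied
}

// Returns the healthy replica that is most caught up, or undefined if none is healthy.
function chooseReplicaToPromote(replicas: ReplicaStatus[]): ReplicaStatus | undefined {
  let best: ReplicaStatus | undefined;
  for (const replica of replicas) {
    if (!replica.healthy) continue;
    if (best === undefined || replica.appliedPosition > best.appliedPosition) {
      best = replica;
    }
  }
  return best;
}
```

Writes that the old primary acknowledged but had not yet replicated are still lost under this policy; choosing the most caught-up replica only minimizes that window.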
Common Pitfalls
- Single point of failure: One component brings down system. Fix: Add redundancy
- No failure detection: Don't know when components fail. Fix: Health checks, timeouts
- Cascading failures: One failure causes others. Fix: Circuit breakers, rate limiting
- Not testing failures: System untested under failure. Fix: Chaos engineering
- Ignoring partial failures: System fails completely. Fix: Graceful degradation
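The failure-detection fix above (health checks, timeouts) can be sketched as a heartbeat monitor: each node reports in periodically, and any node that has been silent for longer than a timeout is treated as suspected. HeartbeatMonitor and its method names are assumptions for illustration; the clock is injected so the logic can be tested without waiting.

```typescript
class HeartbeatMonitor {
  private lastSeen = new Map<string, number>();

  constructor(
    private readonly timeoutMs: number,
    private readonly now: () => number = () => Date.now(), // Injectable clock
  ) {}

  // Called whenever a heartbeat message from `nodeId` arrives
  recordHeartbeat(nodeId: string): void {
    this.lastSeen.set(nodeId, this.now());
  }

  // Nodes whose most recent heartbeat is older than the timeout
  suspectedNodes(): string[] {
    const cutoff = this.now() - this.timeoutMs;
    const suspects: string[] = [];
    this.lastSeen.forEach((seenAt, nodeId) => {
      if (seenAt < cutoff) suspects.push(nodeId);
    });
    return suspects;
  }
}
```

A silent node is only suspected, not known to be dead: a slow network produces the same symptom, so the timeout trades detection speed against false alarms.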
Interview Questions
Beginner
Q: What is fault tolerance and why is it important?
A: Fault tolerance is the ability of a system to continue operating correctly even when components fail.
Why important:
- High availability: System stays up even with failures
- Reliability: Users can depend on the system
- Resilience: System recovers from failures
- User experience: Failures don't disrupt users
Example: If one database server fails, system should continue using other servers.
Intermediate
Q: How do you design a fault-tolerant distributed system?
A:
Key techniques:
- Redundancy: Multiple copies of critical components
- Failure detection: Health checks, timeouts, monitoring
- Automatic recovery: Failover, restart failed components
- Isolation: Failures don't cascade
- Graceful degradation: Continue with reduced functionality
Example design:
- Load balancer with multiple backend servers
- Database replication (primary + replicas)
- Circuit breakers to prevent cascading failures
- Health checks to detect failures quickly
- Automatic failover when primary fails
Senior
Q: Design a fault-tolerant microservices architecture. How do you handle service failures, database failures, and network partitions?
A:
Architecture:
- Service redundancy: Multiple instances of each service
- Database replication: Primary + replicas
- Circuit breakers: Prevent cascading failures
- Health checks: Detect failures quickly
- Service mesh: Handle communication resilience
Design:
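// Minimal shape of a service instance (assumed for this example)
interface ServiceInstance {
  process(request: Request): Promise<Response>;
  healthCheck(): Promise<void>;
  isHealthy(): boolean;
  markHealthy(): void;
  markUnhealthy(): void;
}

// Service with redundancy
class ResilientService {
  private next = 0; // Round-robin counter

  constructor(
    private readonly instances: ServiceInstance[],
    private readonly circuitBreaker: CircuitBreaker,
  ) {}

  async handleRequest(request: Request): Promise<Response> {
    return await this.circuitBreaker.call(async () => {
      // Only consider instances that passed their last health check
      const healthy = this.instances.filter(i => i.isHealthy());
      if (healthy.length === 0) {
        throw new Error('No healthy instances');
      }
      // Load balance across healthy instances
      const instance = this.selectInstance(healthy);
      return await instance.process(request);
    });
  }

  // Intended to run periodically, e.g. from a timer
  async healthCheck(): Promise<void> {
    for (const instance of this.instances) {
      try {
        await instance.healthCheck();
        instance.markHealthy();
      } catch (error) {
        instance.markUnhealthy();
      }
    }
  }

  // Simple round-robin selection
  private selectInstance(healthy: ServiceInstance[]): ServiceInstance {
    return healthy[this.next++ % healthy.length];
  }
}

// Database with replication
class ResilientDatabase {
  private replicationLag: number = 0; // Would be updated by replication monitoring (not shown)

  constructor(
    private primary: Database,
    private replicas: Database[] = [],
  ) {}

  async write(data: any): Promise<void> {
    try {
      await this.primary.write(data);
    } catch (error) {
      // Primary failed: promote a replica, then surface the error so the caller can retry
      await this.promoteReplica();
      throw error;
    }
    // Async replication: do not block the write, but never leave a rejection unhandled
    this.replicateAsync(data).catch((err) => console.error('Replication failed', err));
  }

  async read(consistency: 'strong' | 'eventual'): Promise<Data> {
    if (consistency === 'strong' || this.replicas.length === 0) {
      return await this.primary.read();
    } else {
      // Read from replica (faster, may be slightly stale)
      return await this.replicas[0].read();
    }
  }

  private async replicateAsync(data: any): Promise<void> {
    await Promise.all(this.replicas.map((replica) => replica.write(data)));
  }

  // Simplified promotion: the first replica becomes the new primary
  private async promoteReplica(): Promise<void> {
    const next = this.replicas.shift();
    if (next) this.primary = next;
  }
}

// Network partition handling
class PartitionAwareService {
  private degradedMode = false;

  constructor(
    private readonly quorum: number, // Majority of the cluster, e.g. 3 of 5 nodes
    private readonly checkConnectivity: () => Promise<string[]>, // Reachable node IDs, including this node
  ) {}

  async handlePartition(): Promise<void> {
    const reachable = await this.checkConnectivity();
    if (reachable.length < this.quorum) {
      // Minority partition - operate in degraded mode
      this.degradedMode = true;
      // Continue with limited functionality
    } else {
      // Majority partition - normal operation
      this.degradedMode = false;
    }
  }
}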
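The quorum value used in the partition check is typically a strict majority of the cluster, so that at most one side of a partition can ever reach it. A small sketch of that arithmetic follows; majorityQuorum and hasQuorum are illustrative names.

```typescript
// Smallest number of nodes that forms a strict majority of the cluster.
function majorityQuorum(clusterSize: number): number {
  if (!Number.isInteger(clusterSize) || clusterSize < 1) {
    throw new RangeError('clusterSize must be a positive integer');
  }
  return Math.floor(clusterSize / 2) + 1;
}

// `reachable` counts the nodes this node can contact, including itself.
function hasQuorum(reachable: number, clusterSize: number): boolean {
  return reachable >= majorityQuorum(clusterSize);
}
```

For a 5-node cluster the quorum is 3, so a 3/2 split leaves exactly one side in normal operation. For a 4-node cluster the quorum is also 3, so an even 2/2 split leaves both sides degraded, which is one reason odd cluster sizes are usually preferred.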
Failure Handling:
- Service failures: Circuit breaker, retry with backoff, failover to backup
- Database failures: Read from replicas, promote replica to primary
- Network partitions: Minority partition continues in degraded mode, resync when the partition heals
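Retry with backoff, listed above for service failures, can be sketched as follows. This is a minimal version assuming exponential backoff with full jitter (a random delay between zero and the current ceiling); retryWithBackoff, RetryOptions, and the parameter names are illustrative.

```typescript
interface RetryOptions {
  maxAttempts: number; // Total attempts, including the first call
  baseDelayMs: number; // Delay ceiling after the first failure
  maxDelayMs: number; // Upper bound on the delay ceiling
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  options: RetryOptions,
): Promise<T> {
  let lastError: unknown = new Error('maxAttempts must be at least 1');
  for (let attempt = 0; attempt < options.maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      if (attempt === options.maxAttempts - 1) break; // No delay after the final attempt
      // The ceiling doubles with every failed attempt, capped at maxDelayMs
      const ceiling = Math.min(
        options.maxDelayMs,
        options.baseDelayMs * Math.pow(2, attempt),
      );
      await sleep(Math.random() * ceiling); // Full jitter spreads out retries from many clients
    }
  }
  throw lastError;
}
```

Retries are only safe for idempotent operations, and they work best combined with a circuit breaker so that a service that is clearly down is not hammered with repeated attempts.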
Key Takeaways
- Fault tolerance ensures system continues operating despite failures
- Redundancy is key: Multiple copies of critical components
- Failure detection: Health checks, timeouts, monitoring
- Circuit breakers prevent cascading failures
- Graceful degradation: Continue with reduced functionality
- Automatic recovery: Failover, restart, self-healing
- Test failures: Use chaos engineering to test fault tolerance