Fault Tolerance: Concepts, Trade-offs & Failure Modes

Learn how to design systems that continue operating correctly even when components fail.

Senior · 11 min read

Fault tolerance is the ability of a system to continue operating correctly even when some components fail.


Failure Modes

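Crash failures: A node stops and no longer responds (the most common case)

Byzantine failures: A node behaves arbitrarily, whether through malice or bugs, and may send incorrect or conflicting messages

Omission failures: A node fails to send or receive some messages

Timing failures: A node responds outside the expected time window (too slowly or too early)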
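
To make the distinctions above concrete, here is a minimal sketch of how a monitor might map probe results to a suspected failure mode. Everything in it is an assumption for illustration: the names `FailureMode`, `ProbeResult`, `Thresholds`, and `classify` are hypothetical, and the rules are a simple heuristic. In practice an observer cannot reliably tell a crashed node from a slow or partitioned one, so the result is a suspicion, not a diagnosis.

```typescript
// All names here (FailureMode, ProbeResult, Thresholds, classify) are hypothetical.
type FailureMode = 'none' | 'crash' | 'omission' | 'timing' | 'byzantine';

interface ProbeResult {
  responded: boolean;   // did a reply arrive at all?
  latencyMs?: number;   // round-trip time, when a reply arrived
  validReply?: boolean; // did the reply pass validation (checksum, schema, signature)?
}

interface Thresholds {
  minLatencyMs: number;     // replies faster than this are treated as suspicious
  maxLatencyMs: number;     // replies slower than this count as late
  crashAfterMisses: number; // consecutive misses before a crash is suspected
}

// Map recent probe results (oldest first) to the failure mode they most suggest.
function classify(history: ProbeResult[], t: Thresholds): FailureMode {
  const last = history[history.length - 1];
  if (!last) return 'none';

  // Count consecutive missed probes, starting from the most recent one.
  let misses = 0;
  for (let i = history.length - 1; i >= 0; i--) {
    const probe = history[i];
    if (!probe || probe.responded) break;
    misses++;
  }
  if (misses >= t.crashAfterMisses) return 'crash';
  if (misses > 0) return 'omission';

  // The node replied: check the content first, then the timing.
  if (last.validReply === false) return 'byzantine';
  if (
    last.latencyMs !== undefined &&
    (last.latencyMs < t.minLatencyMs || last.latencyMs > t.maxLatencyMs)
  ) {
    return 'timing';
  }
  return 'none';
}
```

The order of the checks is a design choice: missing replies are considered before content and timing because nothing else can be inspected when no reply arrives.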


Fault Tolerance Techniques

Redundancy

Replication: Multiple copies of data/services

Active-active: All replicas handle requests

Active-passive: Standby replicas take over on failure

class RedundantService {
  private replicas: Service[] = [];

  async handleRequest(request: Request): Promise<Response> {
    // ... (method body truncated in the source)
  }
}
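
The snippet above is cut off in the source, so the following is only a sketch of one way such a service could fail over between replicas, not the original implementation. The `Replica` interface and its `id` and `handle` members are assumptions made for this example. It tries replicas in order, which is closer to active-passive than active-active.

```typescript
// Hypothetical shape: the source does not define Service, so a minimal replica is assumed.
interface Replica<Req, Res> {
  id: string;
  handle(request: Req): Promise<Res>;
}

class RedundantService<Req, Res> {
  private replicas: Replica<Req, Res>[];

  constructor(replicas: Replica<Req, Res>[]) {
    this.replicas = replicas;
  }

  // Try replicas in order and return the first successful response.
  async handleRequest(request: Req): Promise<Res> {
    const errors: string[] = [];
    for (const replica of this.replicas) {
      try {
        return await replica.handle(request);
      } catch (err) {
        // Remember why this replica failed, then fall through to the next one.
        const reason = err instanceof Error ? err.message : String(err);
        errors.push(`${replica.id}: ${reason}`);
      }
    }
    throw new Error(`All replicas failed (${errors.join('; ')})`);
  }
}
```

Resending a request to another replica is only safe when the operation is idempotent; otherwise a request that partly succeeded on the first replica may be applied twice.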

Circuit Breaker

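A circuit breaker prevents cascading failures by failing fast: after repeated errors it stops sending requests to the failing service, then lets a trial request through once a cool-down period has passed.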

class CircuitBreaker {
  private state: 'closed' | 'open' | 'half-open' = 'closed';
  private failures: number = 0;
  private lastFailureTime: number = 0;

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'open') {
      // timeout (ms) is not declared in the visible part of the source
      if (Date.now() - this.lastFailureTime > this.timeout) {
        // ... (remainder truncated in the source)
      }
    }
  }
}
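
Because the snippet above is truncated in the source, here is a self-contained sketch of the closed, open, and half-open cycle it appears to describe. The constructor parameters `failureThreshold`, `resetTimeoutMs`, and the injectable `now` clock are assumptions added for this example, as is the `getState` accessor.

```typescript
type BreakerState = 'closed' | 'open' | 'half-open';

class CircuitBreaker {
  private state: BreakerState = 'closed';
  private failures = 0;
  private lastFailureTime = 0;
  private failureThreshold: number;
  private resetTimeoutMs: number;
  private now: () => number;

  // failureThreshold, resetTimeoutMs, and now are assumed parameters, not from the source.
  constructor(failureThreshold = 3, resetTimeoutMs = 30000, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now;
  }

  getState(): BreakerState {
    return this.state;
  }

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'open') {
      if (this.now() - this.lastFailureTime < this.resetTimeoutMs) {
        // Still cooling down: fail fast without touching the failing service.
        throw new Error('Circuit breaker is open');
      }
      // Cool-down elapsed: let a trial call through.
      this.state = 'half-open';
    }
    try {
      const result = await fn();
      // Any success closes the circuit and clears the failure count.
      this.failures = 0;
      this.state = 'closed';
      return result;
    } catch (err) {
      this.failures += 1;
      this.lastFailureTime = this.now();
      // A failed trial call, or too many failures in a row, opens the circuit.
      if (this.state === 'half-open' || this.failures >= this.failureThreshold) {
        this.state = 'open';
      }
      throw err;
    }
  }
}
```

This sketch does not limit concurrency in the half-open state, so several trial calls can pass at once. Production libraries usually allow a single probe and often track failure rates over a time window instead of a simple consecutive count.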

Graceful Degradation

The system keeps serving requests with reduced functionality instead of failing outright.

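class DegradableService {
  // primarySource, cache, and getDefaultData() are assumed to be defined elsewhere in the class.
  async getData(): Promise<Data> {
    try {
      return await this.primarySource.get();
    } catch (error) {
      // Primary source failed: fall back to the cache, then to a simplified default.
      // A failing cache lookup is treated as a miss so it cannot break the fallback.
      const cached = await this.cache.get().catch(() => undefined);
      return cached ?? this.getDefaultData();
    }
  }
}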

Examples

Database Replication

class FaultTolerantDatabase {
  private primary: Database;
  private replicas: Database[] = [];

  async write(data: any): Promise<void> {
    try {
      await this.primary.write(data);
      // Replicate to the replicas asynchronously (fire-and-forget)
      this.replicateAsync(data);
    } catch (error) {
      // Primary failed: promote a replica
      await this.promoteReplica(/* arguments truncated in the source */);
      // ... (remainder truncated in the source)
    }
  }
}
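
The example above ends mid-call in the source, so what follows is a hedged sketch of the same idea rather than a reconstruction. The `DatabaseNode` interface, the `InMemoryNode` stand-in, the `primaryId` accessor, and the `replicationErrors` list are all assumptions made so the example can run on its own. On a primary failure it promotes the first replica and retries the write once.

```typescript
// Hypothetical types: the source does not define Database, so a minimal node shape is assumed.
interface DatabaseNode {
  id: string;
  write(data: string): Promise<void>;
}

// In-memory stand-in used only to make the sketch runnable.
class InMemoryNode implements DatabaseNode {
  id: string;
  rows: string[] = [];
  healthy = true;

  constructor(id: string) {
    this.id = id;
  }

  async write(data: string): Promise<void> {
    if (!this.healthy) throw new Error(`${this.id} is down`);
    this.rows.push(data);
  }
}

class FaultTolerantDatabase {
  private primary: DatabaseNode;
  private replicas: DatabaseNode[];
  readonly replicationErrors: string[] = [];

  constructor(primary: DatabaseNode, replicas: DatabaseNode[]) {
    this.primary = primary;
    this.replicas = [...replicas];
  }

  primaryId(): string {
    return this.primary.id;
  }

  async write(data: string): Promise<void> {
    try {
      await this.primary.write(data);
    } catch {
      // Primary failed: promote a replica, then retry the write once on the new primary.
      this.promoteReplica();
      await this.primary.write(data);
    }
    // Replicate in the background so the caller does not wait on the replicas.
    this.replicateAsync(data);
  }

  private promoteReplica(): void {
    const next = this.replicas.shift();
    if (!next) throw new Error('No replica available to promote');
    this.primary = next;
  }

  private replicateAsync(data: string): void {
    for (const replica of this.replicas) {
      // Record replica failures instead of leaving the rejection unhandled.
      replica.write(data).catch((err: unknown) => {
        this.replicationErrors.push(`${replica.id}: ${String(err)}`);
      });
    }
  }
}
```

With asynchronous replication, the promoted replica may be missing writes the old primary had already acknowledged, so this design trades durability for write latency. It also does nothing to stop the old primary from accepting writes if it comes back, which real systems handle with leader election or fencing.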

Common Pitfalls

  • Single point of failure: One component can bring down the whole system. Fix: Add redundancy
  • No failure detection: Nobody knows when a component has failed. Fix: Health checks, timeouts
  • Cascading failures: One failure triggers others. Fix: Circuit breakers, rate limiting
  • Not testing failures: The system's behavior under failure is unknown. Fix: Chaos engineering
  • Ignoring partial failures: A partial failure becomes a complete outage. Fix: Graceful degradation
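
For the failure detection pitfall, a timeout-based heartbeat monitor is one common starting point. The sketch below is illustrative only: the `HeartbeatMonitor` class, its method names, and the injectable `now` clock are assumptions, and a fixed timeout is the simplest possible policy.

```typescript
class HeartbeatMonitor {
  private lastSeen = new Map<string, number>();
  private timeoutMs: number;
  private now: () => number;

  // now is injectable so the timing logic can be exercised without real waiting.
  constructor(timeoutMs: number, now: () => number = Date.now) {
    this.timeoutMs = timeoutMs;
    this.now = now;
  }

  // Called whenever a heartbeat arrives from a node.
  recordHeartbeat(nodeId: string): void {
    this.lastSeen.set(nodeId, this.now());
  }

  // A node is suspected if it was never seen or has been silent longer than the timeout.
  isSuspected(nodeId: string): boolean {
    const seen = this.lastSeen.get(nodeId);
    return seen === undefined || this.now() - seen > this.timeoutMs;
  }

  // All known nodes that are currently suspected, in first-seen order.
  suspectedNodes(): string[] {
    return Array.from(this.lastSeen.keys()).filter((id) => this.isSuspected(id));
  }
}
```

A short timeout detects failures quickly but produces false suspicions under load or network jitter; a long one is more accurate but slower to react. The word "suspected" is deliberate, since a silent node may be slow or partitioned and not actually down.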

Interview Questions

Beginner

Q: What is fault tolerance and why is it important?

A: Fault tolerance is the ability of a system to continue operating correctly even when components fail.

Why important:

  • High availability: System stays up even with failures
  • Reliability: Users can depend on the system
  • Resilience: System recovers from failures
  • User experience: Failures don't disrupt users

Example: If one database server fails, the system should keep serving requests from the remaining servers.


Intermediate

Q: How do you design a fault-tolerant distributed system?

A:

Key techniques:

  1. Redundancy: Multiple copies of critical components
  2. Failure detection: Health checks, timeouts, monitoring
  3. Automatic recovery: Failover, restart failed components
  4. Isolation: Failures don't cascade
  5. Graceful degradation: Continue with reduced functionality

Example design:

  • Load balancer with multiple backend servers
  • Database replication (primary + replicas)
  • Circuit breakers to prevent cascading failures
  • Health checks to detect failures quickly
  • Automatic failover when primary fails
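
As a sketch of the first and fourth items in that design, the load balancer below rotates over backends and skips any that health checks have marked unhealthy. The `Backend` shape, the `RoundRobinBalancer` class, and the `setHealth` method are hypothetical names for this example; a real balancer would update health from periodic probes.

```typescript
interface Backend {
  id: string;
  healthy: boolean;
}

class RoundRobinBalancer {
  private backends: Backend[];
  private cursor = 0;

  constructor(backends: Backend[]) {
    // Copy the entries so health changes stay local to the balancer.
    this.backends = backends.map((b) => ({ id: b.id, healthy: b.healthy }));
  }

  // Called by a health checker when a backend's status changes.
  setHealth(id: string, healthy: boolean): void {
    const backend = this.backends.find((b) => b.id === id);
    if (backend) backend.healthy = healthy;
  }

  // Return the next healthy backend in rotation, skipping unhealthy ones.
  next(): Backend {
    for (let i = 0; i < this.backends.length; i++) {
      const index = (this.cursor + i) % this.backends.length;
      const candidate = this.backends[index];
      if (candidate && candidate.healthy) {
        this.cursor = (index + 1) % this.backends.length;
        return candidate;
      }
    }
    throw new Error('No healthy backends available');
  }
}
```

Skipping unhealthy backends shifts their share of traffic onto the survivors, so capacity planning has to leave enough headroom for the remaining instances to absorb it.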

Senior

Q: Design a fault-tolerant microservices architecture. How do you handle service failures, database failures, and network partitions?

A:

Architecture:

  • Service redundancy: Multiple instances of each service
  • Database replication: Primary + replicas
  • Circuit breakers: Prevent cascading failures
  • Health checks: Detect failures quickly
  • Service mesh: Handle communication resilience

Design:

// Part of the FaultTolerantMicroservices design: a service with redundancy
class ResilientService {
  private instances: ServiceInstance[] = [];
  private circuitBreaker: CircuitBreaker;

  async handleRequest(request: Request): Promise<Response> {
    return await this.circuitBreaker.call(async () => {
      // Try healthy instances
      const healthy = this.instances.filter(i => i.isHealthy());
      // ... (remainder truncated in the source)
    });
  }
}

Failure Handling:

  1. Service failures: Circuit breaker, retry with backoff, failover to a backup instance
  2. Database failures: Read from replicas, promote a replica to primary
  3. Network partitions: Continue in degraded mode, reconcile state when the partition heals
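
"Retry with backoff" from the first item can be sketched as follows. The `retryWithBackoff` function, the `RetryOptions` fields, and the injectable `sleep` are assumptions for this example. Jitter is left out to keep the delays predictable, although adding random jitter is generally advisable so that many clients do not retry in lockstep.

```typescript
interface RetryOptions {
  maxAttempts: number; // total attempts, including the first call
  baseDelayMs: number; // delay before the second attempt
  maxDelayMs: number;  // upper bound on any single delay
  sleep?: (ms: number) => Promise<void>; // injectable for testing
}

async function retryWithBackoff<T>(
  fn: (attempt: number) => Promise<T>,
  options: RetryOptions,
): Promise<T> {
  if (options.maxAttempts < 1) throw new RangeError('maxAttempts must be at least 1');
  const sleep =
    options.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));

  let lastError: unknown;
  for (let attempt = 1; attempt <= options.maxAttempts; attempt++) {
    try {
      return await fn(attempt);
    } catch (err) {
      lastError = err;
      if (attempt === options.maxAttempts) break; // no delay after the final attempt
      // Double the delay on each retry: base, 2x base, 4x base, capped at maxDelayMs.
      const delay = Math.min(options.maxDelayMs, options.baseDelayMs * 2 ** (attempt - 1));
      await sleep(delay);
    }
  }
  throw lastError;
}
```

Retries are only safe for idempotent operations, and they multiply load on a service that is already struggling, which is why they are usually paired with a circuit breaker and a cap on attempts.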

Related Topics

  • Heartbeats & Health Checks - Detecting node failures

  • Partition Tolerance - Handling network partitions

  • Leader Election - Electing leaders when nodes fail

  • Replication Lag - Handling replica failures

  • Idempotency - Making operations safe to retry

Key Takeaways

Fault tolerance keeps a system operating correctly despite component failures

Redundancy is key: Multiple copies of critical components

Failure detection: Health checks, timeouts, monitoring

Circuit breakers prevent cascading failures

Graceful degradation: Continue with reduced functionality

Automatic recovery: Failover, restart, self-healing

Test failures: Use chaos engineering to test fault tolerance


About the author

InterviewCrafted helps you master system design with patience. We believe in curiosity-led engineering, reflective writing, and designing systems that make future changes feel calm.