Consensus Algorithms (Raft, Paxos)
Learn how distributed systems achieve consensus among nodes using Raft and Paxos algorithms.
Consensus is the problem of getting multiple nodes in a distributed system to agree on a value, even in the presence of failures. Raft and Paxos are two fundamental consensus algorithms.
What is Consensus?
Consensus algorithms ensure that:
- Agreement: All non-faulty nodes agree on the same value
- Validity: The agreed value was proposed by some node
- Termination: All non-faulty nodes eventually decide on a value
- Integrity: A node decides at most one value
Use cases: Distributed databases, configuration management, leader election, state machine replication.
Raft Algorithm
Raft is designed to be understandable while providing the same fault tolerance as Paxos.
Key Concepts
Leader-based: One leader handles all client requests and replicates to followers.
Terms: Time periods numbered sequentially. Each term has at most one leader.
Log replication: Leader appends entries to its log, replicates to followers.
Safety: Leader only commits entries that are replicated to majority.
States
- Follower: Passive, responds to leader heartbeats
- Candidate: Seeking votes to become leader
- Leader: Handles client requests, replicates log
Leader Election
```typescript
class RaftNode {
  private state: 'follower' | 'candidate' | 'leader' = 'follower';
  private currentTerm: number = 0;
  private votedFor: number | null = null;
  private log: LogEntry[] = [];
  private commitIndex: number = 0;

  async startElection(): Promise<void> {
    this.state = 'candidate';
    this.currentTerm += 1;       // start a new term
    this.votedFor = this.id;     // vote for ourselves

    // Request votes from all peers in parallel
    const votes = await Promise.all(
      this.peers.map(peer => this.requestVote(peer, this.currentTerm))
    );

    // Become leader with a majority (our own vote plus granted votes)
    if (1 + votes.filter(v => v.granted).length > (this.peers.length + 1) / 2) {
      this.state = 'leader';
    }
  }
}
```
Log Replication
```typescript
async appendEntry(entry: LogEntry): Promise<void> {
  this.log.push({ ...entry, term: this.currentTerm });

  // Replicate to followers
  const responses = await Promise.all(
    this.followers.map(follower => this.replicateLog(follower))
  );

  // Commit if a majority acknowledged (the leader counts as one vote)
  const ackCount = responses.filter(r => r.success).length;
  if (ackCount + 1 > (this.followers.length + 1) / 2) {
    this.commitIndex = this.log.length - 1;
  }
}
```
Paxos Algorithm
Paxos is the original consensus algorithm; it is more complex than Raft but equally fault-tolerant.
Phases
Phase 1 (Prepare):
- Proposer sends prepare(n) with proposal number n
- Acceptor responds with a promise not to accept proposals numbered below n, and reports any value it has already accepted
- If a majority promise, proceed to Phase 2
Phase 2 (Accept):
- Proposer sends accept(n, v), where v is the highest-numbered value reported in Phase 1, or the proposer's own value if none was accepted
- Acceptor accepts if n is at least the highest number it has promised
- If a majority accept, the value is chosen
Implementation
```typescript
class PaxosNode {
  private promisedNumber: number = 0;
  private acceptedNumber: number = 0;
  private acceptedValue: any = null;

  async prepare(proposalNumber: number): Promise<PromiseResponse> {
    if (proposalNumber > this.promisedNumber) {
      this.promisedNumber = proposalNumber;
      return {
        promised: true,
        acceptedNumber: this.acceptedNumber,
        acceptedValue: this.acceptedValue
      };
    }
    return { promised: false };
  }
}
```
Examples
Raft: Distributed Key-Value Store
```typescript
class DistributedKVStore {
  private raft: RaftNode;
  private state: Map<string, string> = new Map();

  async set(key: string, value: string): Promise<void> {
    const entry: LogEntry = {
      type: 'SET',
      key,
      value,
      term: this.raft.currentTerm
    };

    // Replicate through Raft; resolves once the entry is committed
    await this.raft.appendEntry(entry);
    this.state.set(key, value);
  }
}
```
Paxos: Configuration Management
```typescript
class ConfigManager {
  async proposeConfig(config: Config): Promise<void> {
    let proposalNumber = this.generateProposalNumber();
    let value = config;

    while (true) {
      // Phase 1: Prepare
      const promises = await Promise.all(
        this.acceptors.map(a => a.prepare(proposalNumber))
      );
      const majority = promises.filter(p => p.promised).length > this.acceptors.length / 2;
      if (!majority) {
        proposalNumber = this.generateProposalNumber(); // retry with a higher number
        continue;
      }

      // Phase 2: Accept (a full implementation must adopt the
      // highest-numbered value returned in the promises, if any)
      const accepted = await Promise.all(
        this.acceptors.map(a => a.accept(proposalNumber, value))
      );
      if (accepted.filter(r => r.success).length > this.acceptors.length / 2) {
        return; // config chosen
      }
    }
  }
}
```
Common Pitfalls
- Split-brain in Raft: A network partition can leave a stale leader from an old term running alongside the new one. Fix: Require majority votes, reject requests from stale terms, use an odd number of nodes
- Paxos complexity: Hard to understand and implement correctly. Fix: Use Raft for most cases, or use existing libraries
- Not handling concurrent proposals: Multiple proposers can conflict. Fix: Use unique proposal numbers, leader-based approach
- Ignoring log consistency: Follower logs must match leader. Fix: Check last log term/index during election
- Not committing safely: Committing before majority can cause inconsistency. Fix: Only commit after majority acknowledgment
- Term confusion: Old terms can cause issues. Fix: Always check and update terms, reject stale messages
- Not handling failures: Algorithm must work with node failures. Fix: Require majority, handle timeouts gracefully
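Several of these pitfalls reduce to the same rule: every RPC carries the sender's term, a node that sees a higher term immediately steps down to follower, and messages from lower terms are rejected. A minimal sketch of that rule (the `TermGuard` class and its method names are illustrative, not from any real Raft library):

```typescript
type State = 'follower' | 'candidate' | 'leader';

// Illustrative helper: centralizes the term checks that keep stale
// leaders and stale messages from corrupting cluster state.
class TermGuard {
  constructor(public currentTerm: number, public state: State) {}

  // Called on every incoming RPC (vote request, append, heartbeat)
  onMessage(senderTerm: number): 'accept' | 'reject' {
    if (senderTerm < this.currentTerm) {
      return 'reject';               // stale message from an old term
    }
    if (senderTerm > this.currentTerm) {
      this.currentTerm = senderTerm; // adopt the newer term
      this.state = 'follower';       // any leader or candidate steps down
    }
    return 'accept';
  }
}

const node = new TermGuard(5, 'leader');
console.log(node.onMessage(4)); // "reject" — stale term
console.log(node.onMessage(7)); // "accept" — and the node steps down
console.log(node.state, node.currentTerm); // "follower" 7
```

In real Raft implementations this check runs before any other handling of the RPC, which is why stale leaders deposed by a partition cannot commit writes after it heals.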
Interview Questions
Beginner
Q: What is consensus and why is it needed in distributed systems?
A: Consensus is the problem of getting multiple nodes to agree on a value. It's needed because:
- Coordination: Multiple nodes need to agree on shared state (e.g., database value, configuration)
- Consistency: Ensures all nodes see the same data
- Fault tolerance: System continues working even if some nodes fail
- Ordering: Agree on the order of operations (critical for state machines)
Without consensus, nodes might have different views of the system, leading to inconsistencies and conflicts.
Intermediate
Q: Compare Raft and Paxos. When would you choose each?
A:
Raft:
- Simplicity: Designed to be understandable, easier to implement
- Leader-based: Single leader handles all requests, simpler model
- Strong leader: Leader has full authority, no conflicts
- Use when: Building new systems, need understandable consensus, want easier debugging
Paxos:
- Complexity: More complex, harder to understand and implement
- No leader: Any node can propose, more flexible
- Proven: Original consensus algorithm, well-studied
- Use when: Need maximum flexibility, building on existing Paxos infrastructure
Comparison:
- Understandability: Raft wins (designed for this)
- Performance: Similar (both require majority)
- Fault tolerance: Similar (both handle minority failures)
- Flexibility: Paxos wins (no leader requirement)
Recommendation: Use Raft for most cases. Only use Paxos if you need leaderless consensus or are building on existing Paxos systems.
Senior
Q: Design a distributed database using Raft for consensus. How do you handle read operations, write operations, and ensure linearizability? How do you handle network partitions?
A:
Architecture:
- Raft cluster: 5 nodes (3 is the minimum for fault tolerance; 5 tolerates two failures instead of one)
- Leader: Handles all writes, replicates to followers
- Reads: Can go to leader (strong consistency) or followers (eventual consistency)
Write Operations:
```typescript
class DistributedDB {
  async write(key: string, value: string): Promise<void> {
    if (this.raft.state !== 'leader') {
      throw new Error('Not leader, redirect to leader');
    }

    const entry: LogEntry = {
      type: 'WRITE',
      key,
      value,
      term: this.raft.currentTerm,
      index: this.raft.log.length
    };

    // Resolves only after the entry is committed by a majority
    await this.raft.appendEntry(entry);
  }
}
```
Read Operations:
```typescript
// Option 1: Read from leader (strong consistency)
async read(key: string): Promise<string> {
  if (this.raft.state !== 'leader') {
    // Redirect to leader
    const leader = await this.findLeader();
    return await leader.read(key);
  }

  // Leader can serve reads directly (linearizable)
  return this.stateMachine.get(key);
}

// Option 2: Read from any follower (eventual consistency, faster)
async readStale(key: string): Promise<string> {
  return this.stateMachine.get(key);
}
```
Linearizability:
- Writes: Always go through leader, committed only after majority
- Reads: Read from leader for linearizability, or use read leases
- Sequence numbers: Each operation gets sequence number, maintain ordering
- Fencing tokens: Use tokens to prevent stale reads
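The read-lease idea above can be made concrete: after a heartbeat round acknowledged by a majority, the leader knows no new leader can be elected until at least one election timeout has elapsed, so purely local reads inside that window are still linearizable. A hedged sketch under those assumptions (the `LeaseReader` class and the timing values are illustrative):

```typescript
// Illustrative read lease: the leader may serve reads from local state
// only while the lease earned by a majority heartbeat round is valid.
class LeaseReader {
  private leaseExpiresAt = 0;

  // Call after a heartbeat acknowledged by a majority at time `ackTime`
  renewLease(ackTime: number, electionTimeoutMs: number): void {
    // No rival leader can be elected before ackTime + electionTimeout
    this.leaseExpiresAt = ackTime + electionTimeoutMs;
  }

  canServeLocalRead(now: number): boolean {
    return now < this.leaseExpiresAt;
  }
}

const reader = new LeaseReader();
reader.renewLease(1000, 150);                 // majority ack at t=1000ms, 150ms timeout
console.log(reader.canServeLocalRead(1100));  // true: inside the lease window
console.log(reader.canServeLocalRead(1200));  // false: lease expired, fall back to quorum read
```

Leases trade a clock-skew assumption for latency; systems that cannot bound clock drift use a quorum round per read (often called ReadIndex) instead.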
Network Partitions:
- Detection: Heartbeat timeouts, no majority responses
- Minority partition: Cannot elect leader, becomes read-only
- Majority partition: Continues operating, can elect new leader
- Merge handling:
- Compare terms, highest term wins
- Leader with higher term forces followers to update
- Resolve conflicts using last-write-wins or application logic
Implementation:
```typescript
class PartitionAwareRaft {
  async handlePartition(): Promise<void> {
    // Check whether we can still reach a majority
    const responses = await Promise.allSettled(
      this.nodes.map(n => this.sendHeartbeat(n))
    );

    const aliveCount = responses.filter(r => r.status === 'fulfilled').length;
    const quorum = Math.floor(this.nodes.length / 2) + 1;

    if (aliveCount < quorum) {
      // Minority side: step down and stop serving writes
      this.state = 'follower';
    }
  }
}
```
Optimizations:
- Batching: Batch multiple writes in single log entry
- Read replicas: Use followers for read scaling (with consistency trade-offs)
- Snapshotting: Periodically snapshot state, truncate log
- Compression: Compress log entries to reduce network traffic
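Snapshotting in particular is simple to sketch: once the state machine has applied entries up to some index, a snapshot of the applied state replaces the log prefix up to that index. A minimal illustration (the `CompactingLog` class and its fields are hypothetical names for this sketch):

```typescript
interface Entry { index: number; command: string; }

// Illustrative log compaction: record a snapshot of applied state and
// drop every log entry at or below the snapshot index.
class CompactingLog {
  private entries: Entry[] = [];
  private snapshot: object = {};
  snapshotIndex = 0;

  append(e: Entry): void { this.entries.push(e); }

  compact(appliedIndex: number, snapshotState: object): void {
    this.snapshotIndex = appliedIndex;
    this.snapshot = snapshotState;
    this.entries = this.entries.filter(e => e.index > appliedIndex);
  }

  get length(): number { return this.entries.length; }
}

const log = new CompactingLog();
[1, 2, 3, 4].forEach(i => log.append({ index: i, command: `op${i}` }));
log.compact(3, { note: 'state after applying entries 1-3' });
console.log(log.length, log.snapshotIndex); // 1 3 — one entry left, snapshot at index 3
```

A lagging follower whose next needed entry has been compacted away must be sent the snapshot itself before normal log replication can resume.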
Key Takeaways
Consensus ensures agreement among distributed nodes on a value, critical for consistency
Raft is simpler than Paxos, designed for understandability while maintaining fault tolerance
Paxos is more flexible but complex, allows any node to propose
Leader-based approach (Raft) simplifies consensus but creates single point of coordination
Majority requirement ensures fault tolerance - a cluster of n nodes works with up to ⌊(n-1)/2⌋ failures
Log replication ensures all nodes have same sequence of operations
Terms/epochs prevent stale leaders and ensure safety
Network partitions require quorum - minority partition cannot make progress
Linearizability requires reads to go through leader or use read leases
Choose Raft for most use cases due to simplicity, use Paxos only if you need leaderless consensus
Related Topics
Leader Election
How nodes elect a leader for coordination
Fault Tolerance
Handling node failures in distributed systems
Partition Tolerance
CAP theorem and handling network partitions
Two-Phase Commit (2PC)
Another consensus protocol for distributed transactions
Distributed Transactions
Maintaining ACID properties across nodes