Topic Overview
Consensus Algorithms (Raft, Paxos)
Learn how distributed systems achieve consensus among nodes using Raft and Paxos algorithms.
Consensus is the problem of getting multiple nodes in a distributed system to agree on a value, even in the presence of failures. Raft and Paxos are two fundamental consensus algorithms.
What is Consensus?
Consensus algorithms ensure that:
- Agreement: All non-faulty nodes agree on the same value
- Validity: The agreed value was proposed by some node
- Termination: All nodes eventually decide on a value
- Integrity: A node decides at most one value
Use cases: Distributed databases, configuration management, leader election, state machine replication.
Raft Algorithm
Raft is designed to be understandable while providing the same fault tolerance as Paxos.
Key Concepts
Leader-based: One leader handles all client requests and replicates to followers.
Terms: Time periods numbered sequentially. Each term has at most one leader.
Log replication: Leader appends entries to its log, replicates to followers.
Safety: Leader only commits entries that are replicated to majority.
States
- Follower: Passive, responds to leader heartbeats
- Candidate: Seeking votes to become leader
- Leader: Handles client requests, replicates log
Leader Election
class RaftNode {
private state: 'follower' | 'candidate' | 'leader' = 'follower';
private currentTerm: number = 0;
private votedFor: number | null = null;
private log: LogEntry[] = [];
private commitIndex: number = 0;
async startElection(): Promise<void> {
this.state = 'candidate';
this.currentTerm++;
this.votedFor = this.myId;
const votes = await Promise.all(
this.nodes.map(node => this.requestVote(node))
);
const voteCount = votes.filter(v => v).length;
if (voteCount > this.nodes.length / 2) {
this.becomeLeader();
}
}
async requestVote(node: Node): Promise<boolean> {
const response = await node.requestVote({
term: this.currentTerm,
candidateId: this.myId,
lastLogIndex: this.log.length,
lastLogTerm: this.log[this.log.length - 1]?.term || 0
});
if (response.term > this.currentTerm) {
this.currentTerm = response.term;
this.state = 'follower';
return false;
}
return response.voteGranted;
}
}
Log Replication
async appendEntry(entry: LogEntry): Promise<void> {
this.log.push({ ...entry, term: this.currentTerm });
// Replicate to followers
const responses = await Promise.all(
this.followers.map(follower => this.replicateLog(follower))
);
// Commit if majority acknowledged
const ackCount = responses.filter(r => r.success).length;
if (ackCount > this.followers.length / 2) {
this.commitIndex = this.log.length - 1;
this.applyToStateMachine();
}
}
Paxos Algorithm
Paxos is the original consensus algorithm, more complex but highly fault-tolerant.
Phases
Phase 1 (Prepare):
- Proposer sends prepare(n) with proposal number n
- Acceptor responds with promise not to accept proposals < n
- If majority promise, proceed
Phase 2 (Accept):
- Proposer sends accept(n, v) with value v
- Acceptor accepts if n >= any promised number
- If majority accept, value is chosen
Implementation
class PaxosNode {
private promisedNumber: number = 0;
private acceptedNumber: number = 0;
private acceptedValue: any = null;
async prepare(proposalNumber: number): Promise<PromiseResponse> {
if (proposalNumber > this.promisedNumber) {
this.promisedNumber = proposalNumber;
return {
promised: true,
acceptedNumber: this.acceptedNumber,
acceptedValue: this.acceptedValue
};
}
return { promised: false };
}
async accept(proposalNumber: number, value: any): Promise<boolean> {
if (proposalNumber >= this.promisedNumber) {
this.promisedNumber = proposalNumber;
this.acceptedNumber = proposalNumber;
this.acceptedValue = value;
return true;
}
return false;
}
}
Examples
Raft: Distributed Key-Value Store
class DistributedKVStore {
private raft: RaftNode;
private state: Map<string, string> = new Map();
async set(key: string, value: string): Promise<void> {
const entry: LogEntry = {
type: 'SET',
key,
value,
term: this.raft.currentTerm
};
await this.raft.appendEntry(entry);
// Entry will be applied to state machine once committed
}
applyEntry(entry: LogEntry): void {
if (entry.type === 'SET') {
this.state.set(entry.key, entry.value);
} else if (entry.type === 'DELETE') {
this.state.delete(entry.key);
}
}
}
Paxos: Configuration Management
class ConfigManager {
async proposeConfig(config: Config): Promise<void> {
let proposalNumber = this.generateProposalNumber();
let value = config;
while (true) {
// Phase 1: Prepare
const promises = await Promise.all(
this.acceptors.map(a => a.prepare(proposalNumber))
);
const majority = promises.filter(p => p.promised).length;
if (majority > this.acceptors.length / 2) {
// Use highest accepted value if any
const accepted = promises
.filter(p => p.acceptedValue)
.sort((a, b) => b.acceptedNumber - a.acceptedNumber)[0];
if (accepted) {
value = accepted.acceptedValue;
}
// Phase 2: Accept
const accepts = await Promise.all(
this.acceptors.map(a => a.accept(proposalNumber, value))
);
const acceptedCount = accepts.filter(a => a).length;
if (acceptedCount > this.acceptors.length / 2) {
// Value chosen!
this.config = value;
return;
}
}
// Retry with higher proposal number
proposalNumber = this.generateProposalNumber();
}
}
}
Common Pitfalls
- Split-brain in Raft: Network partition can cause multiple leaders. Fix: Require majority votes, use odd number of nodes
- Paxos complexity: Hard to understand and implement correctly. Fix: Use Raft for most cases, or use existing libraries
- Not handling concurrent proposals: Multiple proposers can conflict. Fix: Use unique proposal numbers, leader-based approach
- Ignoring log consistency: Follower logs must match leader. Fix: Check last log term/index during election
- Not committing safely: Committing before majority can cause inconsistency. Fix: Only commit after majority acknowledgment
- Term confusion: Old terms can cause issues. Fix: Always check and update terms, reject stale messages
- Not handling failures: Algorithm must work with node failures. Fix: Require majority, handle timeouts gracefully
Interview Questions
Beginner
Q: What is consensus and why is it needed in distributed systems?
A: Consensus is the problem of getting multiple nodes to agree on a value. It's needed because:
- Coordination: Multiple nodes need to agree on shared state (e.g., database value, configuration)
- Consistency: Ensures all nodes see the same data
- Fault tolerance: System continues working even if some nodes fail
- Ordering: Agree on the order of operations (critical for state machines)
Without consensus, nodes might have different views of the system, leading to inconsistencies and conflicts.
Intermediate
Q: Compare Raft and Paxos. When would you choose each?
A:
Raft:
- Simplicity: Designed to be understandable, easier to implement
- Leader-based: Single leader handles all requests, simpler model
- Strong leader: Leader has full authority, no conflicts
- Use when: Building new systems, need understandable consensus, want easier debugging
Paxos:
- Complexity: More complex, harder to understand and implement
- No leader: Any node can propose, more flexible
- Proven: Original consensus algorithm, well-studied
- Use when: Need maximum flexibility, building on existing Paxos infrastructure
Comparison:
- Understandability: Raft wins (designed for this)
- Performance: Similar (both require majority)
- Fault tolerance: Similar (both handle minority failures)
- Flexibility: Paxos wins (no leader requirement)
Recommendation: Use Raft for most cases. Only use Paxos if you need leaderless consensus or are building on existing Paxos systems.
Senior
Q: Design a distributed database using Raft for consensus. How do you handle read operations, write operations, and ensure linearizability? How do you handle network partitions?
A:
Architecture:
- Raft cluster: 5 nodes (3-node minimum, 5 for better fault tolerance)
- Leader: Handles all writes, replicates to followers
- Reads: Can go to leader (strong consistency) or followers (eventual consistency)
Write Operations:
class DistributedDB {
async write(key: string, value: string): Promise<void> {
if (this.raft.state !== 'leader') {
throw new Error('Not leader, redirect to leader');
}
const entry: LogEntry = {
type: 'WRITE',
key,
value,
term: this.raft.currentTerm,
index: this.raft.log.length
};
// Append to log
this.raft.log.push(entry);
// Replicate to followers
const responses = await Promise.all(
this.followers.map(f => this.replicateEntry(f, entry))
);
// Commit if majority acknowledged
const ackCount = responses.filter(r => r.success).length;
if (ackCount >= this.quorum) {
this.commitIndex = entry.index;
this.applyToStateMachine(entry);
return;
}
throw new Error('Failed to replicate to majority');
}
}
Read Operations:
// Option 1: Read from leader (strong consistency)
async read(key: string): Promise<string> {
if (this.raft.state !== 'leader') {
// Redirect to leader
const leader = await this.findLeader();
return await leader.read(key);
}
// Leader can serve reads directly (linearizable)
return this.stateMachine.get(key);
}
// Option 2: Read from followers (eventual consistency, faster)
async readEventually(key: string): Promise<string> {
// Can read from any node, but may see stale data
return this.stateMachine.get(key);
}
// Option 3: Lease-based reads (balance consistency and performance)
async readWithLease(key: string): Promise<string> {
// Leader grants read lease, followers can serve reads during lease
if (this.hasValidLease()) {
return this.stateMachine.get(key);
}
// Lease expired, must read from leader
return await this.read(key);
}
Linearizability:
- Writes: Always go through leader, committed only after majority
- Reads: Read from leader for linearizability, or use read leases
- Sequence numbers: Each operation gets sequence number, maintain ordering
- Fencing tokens: Use tokens to prevent stale reads
Network Partitions:
- Detection: Heartbeat timeouts, no majority responses
- Minority partition: Cannot elect leader, becomes read-only
- Majority partition: Continues operating, can elect new leader
- Merge handling:
- Compare terms, highest term wins
- Leader with higher term forces followers to update
- Resolve conflicts using last-write-wins or application logic
Implementation:
class PartitionAwareRaft {
async handlePartition(): Promise<void> {
// Check if we have majority
const responses = await Promise.allSettled(
this.nodes.map(n => this.sendHeartbeat(n))
);
const aliveCount = responses.filter(r => r.status === 'fulfilled').length;
const quorum = Math.floor(this.nodes.length / 2) + 1;
if (aliveCount < quorum) {
// Minority partition - step down, become read-only
if (this.state === 'leader') {
this.state = 'follower';
this.readOnly = true;
}
} else {
// Majority partition - can operate normally
this.readOnly = false;
}
}
async mergePartitions(otherPartition: RaftNode[]): Promise<void> {
// Compare terms
const myTerm = this.currentTerm;
const otherTerm = Math.max(...otherPartition.map(n => n.currentTerm));
if (otherTerm > myTerm) {
// Other partition has higher term, update
this.currentTerm = otherTerm;
this.state = 'follower';
await this.syncLog(otherPartition);
} else if (myTerm > otherTerm) {
// We have higher term, force others to update
await Promise.all(
otherPartition.map(n => n.syncLog([this]))
);
}
}
}
Optimizations:
- Batching: Batch multiple writes in single log entry
- Read replicas: Use followers for read scaling (with consistency trade-offs)
- Snapshotting: Periodically snapshot state, truncate log
- Compression: Compress log entries to reduce network traffic
Key Takeaways
- Consensus ensures agreement among distributed nodes on a value, critical for consistency
- Raft is simpler than Paxos, designed for understandability while maintaining fault tolerance
- Paxos is more flexible but complex, allows any node to propose
- Leader-based approach (Raft) simplifies consensus but creates single point of coordination
- Majority requirement ensures fault tolerance - system works with up to (n-1)/2 failures
- Log replication ensures all nodes have same sequence of operations
- Terms/epochs prevent stale leaders and ensure safety
- Network partitions require quorum - minority partition cannot make progress
- Linearizability requires reads to go through leader or use read leases
- Choose Raft for most use cases due to simplicity, use Paxos only if you need leaderless consensus