Topic Overview

Consensus Algorithms (Raft, Paxos)

Learn how distributed systems achieve consensus among nodes using Raft and Paxos algorithms.

Consensus is the problem of getting multiple nodes in a distributed system to agree on a value, even in the presence of failures. Raft and Paxos are two fundamental consensus algorithms.


What is Consensus?

Consensus algorithms ensure that:

  • Agreement: All non-faulty nodes agree on the same value
  • Validity: The agreed value was proposed by some node
  • Termination: All non-faulty nodes eventually decide on a value
  • Integrity: A node decides at most one value

Use cases: Distributed databases, configuration management, leader election, state machine replication.
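
In code, you can think of a consensus module as a small API that a replicated service builds on top of. A minimal sketch (the interface below is illustrative, not taken from any particular library):

interface Consensus<T> {
  // Propose a value; resolves with the value that was actually chosen,
  // which may have been proposed by another node (Validity + Agreement)
  propose(value: T): Promise<T>;

  // Invoked on every non-faulty node once a value is decided (Termination)
  onDecide(listener: (value: T) => void): void;
}

// State machine replication then runs one consensus instance per log slot:
// slot 0 decides the first command, slot 1 the second, and so on.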


Raft Algorithm

Raft is designed to be understandable while providing the same fault tolerance as Paxos.

Key Concepts

Leader-based: One leader handles all client requests and replicates to followers.

Terms: Time periods numbered sequentially. Each term has at most one leader.

Log replication: Leader appends entries to its log, replicates to followers.

Safety: The leader commits an entry only after it has been replicated to a majority of nodes; once committed, an entry is never lost.

States

  1. Follower: Passive, responds to leader heartbeats
  2. Candidate: Seeking votes to become leader
  3. Leader: Handles client requests, replicates log

Leader Election
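
The code snippets in this article reference a few types that are never spelled out (LogEntry, Node, and the vote RPC messages). The shapes below are assumptions made so the snippets have something concrete to compile against:

interface LogEntry {
  term: number;                        // term in which the entry was created
  index?: number;                      // position in the log
  type?: 'SET' | 'DELETE' | 'WRITE';   // command for the state machine
  key?: string;
  value?: string;
}

interface VoteRequest {
  term: number;
  candidateId: number;
  lastLogIndex: number;
  lastLogTerm: number;
}

interface VoteResponse {
  term: number;          // the voter's current term, so stale candidates step down
  voteGranted: boolean;
}

interface Node {
  requestVote(req: VoteRequest): Promise<VoteResponse>;
}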

class RaftNode {
  private state: 'follower' | 'candidate' | 'leader' = 'follower';
  private currentTerm: number = 0;
  private votedFor: number | null = null;
  private log: LogEntry[] = [];
  private commitIndex: number = 0;

  constructor(
    private myId: number,
    private nodes: Node[]  // the *other* nodes in the cluster
  ) {}

  async startElection(): Promise<void> {
    this.state = 'candidate';
    this.currentTerm++;
    this.votedFor = this.myId;  // vote for ourselves

    // Ask every other node for its vote in this term
    const votes = await Promise.all(
      this.nodes.map(node => this.requestVote(node))
    );

    // Our own vote plus the granted votes must form a majority of the
    // full cluster (other nodes + ourselves)
    const voteCount = 1 + votes.filter(v => v).length;
    if (this.state === 'candidate' && voteCount > (this.nodes.length + 1) / 2) {
      this.becomeLeader();
    }
  }

  async requestVote(node: Node): Promise<boolean> {
    const lastLogIndex = this.log.length - 1;
    const response = await node.requestVote({
      term: this.currentTerm,
      candidateId: this.myId,
      lastLogIndex,
      lastLogTerm: this.log[lastLogIndex]?.term ?? 0
    });

    // A response with a higher term means we are stale: step down
    if (response.term > this.currentTerm) {
      this.currentTerm = response.term;
      this.state = 'follower';
      return false;
    }

    return response.voteGranted;
  }

  private becomeLeader(): void {
    this.state = 'leader';
    // A real implementation starts sending heartbeats to followers here
  }
}

Log Replication

// Runs on the leader: append to the local log, then replicate to followers
async appendEntry(entry: LogEntry): Promise<void> {
  this.log.push({ ...entry, term: this.currentTerm });

  // Replicate the new entry to every follower
  const responses = await Promise.all(
    this.followers.map(follower => this.replicateLog(follower))
  );

  // Commit once a majority of the cluster has the entry; the leader's
  // own copy counts toward that majority
  const ackCount = 1 + responses.filter(r => r.success).length;
  if (ackCount > (this.followers.length + 1) / 2) {
    this.commitIndex = this.log.length - 1;
    this.applyToStateMachine();
  }
}
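
The method above is the leader's side of replication. On the follower side, each AppendEntries request carries the index and term of the entry that immediately precedes the new ones, and the follower rejects the request unless its own log matches at that point; this consistency check is what keeps follower logs identical to the leader's. A hedged sketch (the request shape is an assumption, following the fields described in the Raft paper):

interface AppendEntriesRequest {
  term: number;           // leader's term
  prevLogIndex: number;   // index of the entry immediately before the new ones
  prevLogTerm: number;    // term of that entry
  entries: LogEntry[];    // new entries to append (empty for heartbeats)
  leaderCommit: number;   // leader's commit index
}

// Follower-side handler (sketch of a RaftNode method)
async handleAppendEntries(req: AppendEntriesRequest): Promise<{ term: number; success: boolean }> {
  // Reject requests from stale leaders
  if (req.term < this.currentTerm) {
    return { term: this.currentTerm, success: false };
  }
  this.currentTerm = req.term;
  this.state = 'follower';

  // Consistency check: our log must have an entry at prevLogIndex with the
  // same term; if not, the leader backs up and retries with an earlier index
  const prev = this.log[req.prevLogIndex];
  if (req.prevLogIndex >= 0 && (!prev || prev.term !== req.prevLogTerm)) {
    return { term: this.currentTerm, success: false };
  }

  // Discard any conflicting (uncommitted) suffix and append the new entries
  this.log = this.log.slice(0, req.prevLogIndex + 1).concat(req.entries);

  // Advance our commit index as far as the leader says is safe
  if (req.leaderCommit > this.commitIndex) {
    this.commitIndex = Math.min(req.leaderCommit, this.log.length - 1);
  }
  return { term: this.currentTerm, success: true };
}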

Paxos Algorithm

Paxos is the original consensus algorithm; it is more complex than Raft but equally fault-tolerant and extremely well studied.

Phases

Phase 1 (Prepare):

  1. Proposer sends prepare(n) with a unique proposal number n
  2. Each acceptor responds with a promise not to accept proposals numbered below n, along with the highest-numbered proposal it has already accepted (if any)
  3. If a majority promise, the proposer proceeds to Phase 2

Phase 2 (Accept):

  1. Proposer sends accept(n, v), where v is the value of the highest-numbered proposal reported in Phase 1, or the proposer's own value if none was reported
  2. An acceptor accepts if n is at least the highest number it has promised
  3. If a majority accept, the value is chosen

Implementation

// Response to a prepare(n) request
interface PromiseResponse {
  promised: boolean;
  acceptedNumber?: number;
  acceptedValue?: any;
}

class PaxosNode {
  // Highest proposal number this acceptor has promised
  private promisedNumber: number = 0;
  // Highest proposal number / value this acceptor has accepted, if any
  private acceptedNumber: number = 0;
  private acceptedValue: any = null;

  // Phase 1: promise not to accept anything numbered below proposalNumber
  async prepare(proposalNumber: number): Promise<PromiseResponse> {
    if (proposalNumber > this.promisedNumber) {
      this.promisedNumber = proposalNumber;
      return {
        promised: true,
        // Report any previously accepted proposal so the proposer
        // can adopt its value in Phase 2
        acceptedNumber: this.acceptedNumber,
        acceptedValue: this.acceptedValue
      };
    }
    return { promised: false };
  }

  // Phase 2: accept the value unless a higher-numbered promise was made
  async accept(proposalNumber: number, value: any): Promise<boolean> {
    if (proposalNumber >= this.promisedNumber) {
      this.promisedNumber = proposalNumber;
      this.acceptedNumber = proposalNumber;
      this.acceptedValue = value;
      return true;
    }
    return false;
  }
}

Examples

Raft: Distributed Key-Value Store

class DistributedKVStore {
  private raft: RaftNode;
  private state: Map<string, string> = new Map();

  async set(key: string, value: string): Promise<void> {
    const entry: LogEntry = {
      type: 'SET',
      key,
      value,
      term: this.raft.currentTerm
    };

    await this.raft.appendEntry(entry);
    // Entry will be applied to state machine once committed
  }

  // Invoked for each committed log entry, in log order, on every replica
  applyEntry(entry: LogEntry): void {
    if (entry.type === 'SET') {
      this.state.set(entry.key, entry.value);
    } else if (entry.type === 'DELETE') {
      this.state.delete(entry.key);
    }
  }
}

Paxos: Configuration Management

// Config shape is illustrative
type Config = Record<string, string>;

class ConfigManager {
  private config: Config | null = null;
  private round = 0;

  constructor(private nodeId: number, private acceptors: PaxosNode[]) {}

  async proposeConfig(config: Config): Promise<void> {
    let proposalNumber = this.generateProposalNumber();
    let value: any = config;

    while (true) {
      // Phase 1: Prepare
      const promises = await Promise.all(
        this.acceptors.map(a => a.prepare(proposalNumber))
      );

      const majority = promises.filter(p => p.promised).length;
      if (majority > this.acceptors.length / 2) {
        // If any acceptor already accepted a value, adopt the one with the
        // highest proposal number instead of our own (Paxos safety rule)
        const accepted = promises
          .filter(p => p.promised && p.acceptedValue != null)
          .sort((a, b) => (b.acceptedNumber ?? 0) - (a.acceptedNumber ?? 0))[0];

        if (accepted) {
          value = accepted.acceptedValue;
        }

        // Phase 2: Accept
        const accepts = await Promise.all(
          this.acceptors.map(a => a.accept(proposalNumber, value))
        );

        const acceptedCount = accepts.filter(a => a).length;
        if (acceptedCount > this.acceptors.length / 2) {
          // Value chosen
          this.config = value;
          return;
        }
      }

      // Retry with a higher proposal number (a real system would back off
      // here, otherwise duelling proposers can livelock)
      proposalNumber = this.generateProposalNumber();
    }
  }

  // Proposal numbers must be unique and increasing across proposers;
  // combining a local counter with the node id is one simple scheme
  private generateProposalNumber(): number {
    this.round += 1;
    return this.round * 1000 + this.nodeId;
  }
}

Common Pitfalls

  • Split-brain in Raft: During a partition a deposed leader can linger in a stale term and serve stale reads, even though Raft allows at most one leader per term. Fix: Require majority votes and majority acknowledgments, use an odd number of nodes
  • Paxos complexity: Hard to understand and implement correctly. Fix: Use Raft for most cases, or use existing libraries
  • Not handling concurrent proposals: Multiple proposers can conflict. Fix: Use unique proposal numbers, leader-based approach
  • Ignoring log consistency: A candidate with an out-of-date log must not win an election, or committed entries could be lost. Fix: Voters compare the candidate's last log term and index against their own before granting a vote (see the sketch after this list)
  • Not committing safely: Committing before majority can cause inconsistency. Fix: Only commit after majority acknowledgment
  • Term confusion: Old terms can cause issues. Fix: Always check and update terms, reject stale messages
  • Not handling failures: Algorithm must work with node failures. Fix: Require majority, handle timeouts gracefully
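
To make the log-consistency and term pitfalls concrete, here is a hedged sketch of a voter handling a RequestVote RPC: it rejects stale terms, steps down on newer terms, and grants its vote only if the candidate's log is at least as up to date as its own. The request/response shapes follow the earlier type sketch; this is illustrative, not a complete implementation.

// Follower-side handler for a RequestVote RPC (sketch of a RaftNode method)
async handleRequestVote(req: VoteRequest): Promise<VoteResponse> {
  // Reject candidates from stale terms outright
  if (req.term < this.currentTerm) {
    return { term: this.currentTerm, voteGranted: false };
  }

  // A newer term means we are stale: adopt it and forget any earlier vote
  if (req.term > this.currentTerm) {
    this.currentTerm = req.term;
    this.votedFor = null;
    this.state = 'follower';
  }

  // Election restriction: grant the vote only if the candidate's log is at
  // least as up to date as ours (compare last log term, then last log index)
  const lastIndex = this.log.length - 1;
  const lastTerm = this.log[lastIndex]?.term ?? 0;
  const candidateUpToDate =
    req.lastLogTerm > lastTerm ||
    (req.lastLogTerm === lastTerm && req.lastLogIndex >= lastIndex);

  // Vote at most once per term
  const canVote = this.votedFor === null || this.votedFor === req.candidateId;

  if (canVote && candidateUpToDate) {
    this.votedFor = req.candidateId;
    return { term: this.currentTerm, voteGranted: true };
  }
  return { term: this.currentTerm, voteGranted: false };
}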

Interview Questions

Beginner

Q: What is consensus and why is it needed in distributed systems?

A: Consensus is the problem of getting multiple nodes to agree on a value. It's needed because:

  • Coordination: Multiple nodes need to agree on shared state (e.g., database value, configuration)
  • Consistency: Ensures all nodes see the same data
  • Fault tolerance: System continues working even if some nodes fail
  • Ordering: Agree on the order of operations (critical for state machines)

Without consensus, nodes might have different views of the system, leading to inconsistencies and conflicts.


Intermediate

Q: Compare Raft and Paxos. When would you choose each?

A:

Raft:

  • Simplicity: Designed to be understandable, easier to implement
  • Leader-based: Single leader handles all requests, simpler model
  • Strong leader: Leader has full authority, no conflicts
  • Use when: Building new systems, need understandable consensus, want easier debugging

Paxos:

  • Complexity: More complex, harder to understand and implement
  • No mandatory leader: In basic Paxos any node can propose; Multi-Paxos deployments usually still elect a distinguished proposer for efficiency
  • Proven: Original consensus algorithm, well-studied
  • Use when: Need maximum flexibility, building on existing Paxos infrastructure

Comparison:

  • Understandability: Raft wins (designed for this)
  • Performance: Similar (both require majority)
  • Fault tolerance: Similar (both handle minority failures)
  • Flexibility: Paxos wins (no leader requirement)

Recommendation: Use Raft for most cases. Only use Paxos if you need leaderless consensus or are building on existing Paxos systems.


Senior

Q: Design a distributed database using Raft for consensus. How do you handle read operations, write operations, and ensure linearizability? How do you handle network partitions?

A:

Architecture:

  • Raft cluster: 5 nodes (3-node minimum, 5 for better fault tolerance)
  • Leader: Handles all writes, replicates to followers
  • Reads: Can go to leader (strong consistency) or followers (eventual consistency)

Write Operations:

class DistributedDB {
  async write(key: string, value: string): Promise<void> {
    if (this.raft.state !== 'leader') {
      throw new Error('Not leader, redirect to leader');
    }

    const entry: LogEntry = {
      type: 'WRITE',
      key,
      value,
      term: this.raft.currentTerm,
      index: this.raft.log.length
    };

    // Append to log
    this.raft.log.push(entry);

    // Replicate to followers
    const responses = await Promise.all(
      this.followers.map(f => this.replicateEntry(f, entry))
    );

    // Commit once a majority acknowledged; the leader's own copy counts too
    const ackCount = 1 + responses.filter(r => r.success).length;
    if (ackCount >= this.quorum) {
      this.commitIndex = entry.index;
      this.applyToStateMachine(entry);
      return;
    }

    throw new Error('Failed to replicate to majority');
  }
}

Read Operations:

// Option 1: Read from leader (strong consistency)
async read(key: string): Promise<string> {
  if (this.raft.state !== 'leader') {
    // Redirect to leader
    const leader = await this.findLeader();
    return await leader.read(key);
  }

  // Leader serves the read; for strict linearizability it should first
  // confirm it is still leader (e.g. a heartbeat round / ReadIndex check)
  return this.stateMachine.get(key);
}

// Option 2: Read from followers (eventual consistency, faster)
async readEventually(key: string): Promise<string> {
  // Can read from any node, but may see stale data
  return this.stateMachine.get(key);
}

// Option 3: Lease-based reads (balance consistency and performance)
async readWithLease(key: string): Promise<string> {
  // Leader grants read lease, followers can serve reads during lease
  if (this.hasValidLease()) {
    return this.stateMachine.get(key);
  }
  // Lease expired, must read from leader
  return await this.read(key);
}

Linearizability:

  • Writes: Always go through leader, committed only after majority
  • Reads: Read from the leader (after confirming leadership, see the sketch below) for linearizability, or use read leases
  • Sequence numbers: Each operation gets sequence number, maintain ordering
  • Fencing tokens: Use tokens to prevent stale reads
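
One common way to get linearizable reads without appending a no-op to the log is the ReadIndex approach from the Raft literature: the leader records its commit index, confirms it is still leader with a round of heartbeats, waits for its state machine to catch up to that index, and only then answers. A rough sketch under those assumptions (sendHeartbeat and waitForApply are hypothetical helpers):

// Leader-side linearizable read using the ReadIndex approach (sketch)
async readLinearizable(key: string): Promise<string | undefined> {
  if (this.raft.state !== 'leader') {
    throw new Error('Not leader, redirect to leader');
  }

  // 1. Remember the commit index at the moment the read arrived
  const readIndex = this.raft.commitIndex;

  // 2. Confirm leadership: hear back from a majority, otherwise a newer
  //    leader may already have committed writes this node cannot see
  const acks = await Promise.all(
    this.followers.map(f => this.sendHeartbeat(f).then(() => true, () => false))
  );
  if (1 + acks.filter(Boolean).length < this.quorum) {
    throw new Error('Lost leadership, redirect to new leader');
  }

  // 3. Wait until the state machine has applied everything up to readIndex
  await this.waitForApply(readIndex); // hypothetical helper

  // 4. The local read now reflects every write committed before it started
  return this.stateMachine.get(key);
}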

Network Partitions:

  1. Detection: Heartbeat timeouts, no majority responses
  2. Minority partition: Cannot elect a leader or commit writes; at best it serves potentially stale reads
  3. Majority partition: Continues operating and can elect a new leader
  4. Merge handling:
    • Compare terms; the side with the higher term wins
    • A leader with a higher term forces stale nodes back to follower state and brings their logs up to date
    • Conflicting uncommitted entries on rejoining followers are overwritten by the leader's log; committed entries are never lost

Implementation:

class PartitionAwareRaft {
  async handlePartition(): Promise<void> {
    // Check whether we can still reach a majority of the cluster
    // (this.nodes is assumed to include this node itself, so its own
    // heartbeat trivially succeeds)
    const responses = await Promise.allSettled(
      this.nodes.map(n => this.sendHeartbeat(n))
    );

    const aliveCount = responses.filter(r => r.status === 'fulfilled').length;
    const quorum = Math.floor(this.nodes.length / 2) + 1;

    if (aliveCount < quorum) {
      // Minority partition - step down, become read-only
      if (this.state === 'leader') {
        this.state = 'follower';
        this.readOnly = true;
      }
    } else {
      // Majority partition - can operate normally
      this.readOnly = false;
    }
  }

  // Simplified illustration: in standard Raft, a rejoining minority is
  // repaired automatically when the current leader's AppendEntries
  // (carrying its higher term) forces those nodes back to follower
  async mergePartitions(otherPartition: RaftNode[]): Promise<void> {
    // Compare terms across the two partitions
    const myTerm = this.currentTerm;
    const otherTerm = Math.max(...otherPartition.map(n => n.currentTerm));

    if (otherTerm > myTerm) {
      // Other partition has higher term, update
      this.currentTerm = otherTerm;
      this.state = 'follower';
      await this.syncLog(otherPartition);
    } else if (myTerm > otherTerm) {
      // We have higher term, force others to update
      await Promise.all(
        otherPartition.map(n => n.syncLog([this]))
      );
    }
  }
}

Optimizations:

  • Batching: Batch multiple writes in single log entry
  • Read replicas: Use followers for read scaling (with consistency trade-offs)
  • Snapshotting: Periodically snapshot the state machine and truncate the log (see the sketch after this list)
  • Compression: Compress log entries to reduce network traffic
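
Of these, snapshotting changes the node's data structures the most, so a brief sketch may help: once the state machine has applied entries up to some index, the node serializes its state, records the last included index and term, and discards that log prefix. The shapes below are assumptions for illustration; a real implementation also tracks an index offset so later entries keep their absolute positions.

interface Snapshot {
  lastIncludedIndex: number;   // last log index the snapshot covers
  lastIncludedTerm: number;
  data: Uint8Array;            // serialized state machine
}

// Periodically called on each node (sketch); returns null if no compaction ran
function maybeSnapshot(
  node: { log: LogEntry[]; lastApplied: number; serializeState(): Uint8Array },
  minEntries: number
): Snapshot | null {
  // Only compact once enough applied entries have accumulated
  if (node.lastApplied + 1 < minEntries) return null;

  const snapshot: Snapshot = {
    lastIncludedIndex: node.lastApplied,
    lastIncludedTerm: node.log[node.lastApplied]?.term ?? 0,
    data: node.serializeState()
  };

  // Drop the prefix the snapshot already covers; entries after lastApplied
  // are kept because they still need to be replicated and applied
  node.log = node.log.slice(node.lastApplied + 1);
  return snapshot;
}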

Key Takeaways

  • Consensus ensures agreement among distributed nodes on a value, critical for consistency
  • Raft is simpler than Paxos, designed for understandability while maintaining fault tolerance
  • Paxos is more flexible but complex, allows any node to propose
  • Leader-based approach (Raft) simplifies consensus but creates single point of coordination
  • Majority requirement ensures fault tolerance - a cluster of n nodes tolerates up to ⌊(n-1)/2⌋ failures (e.g., 2 out of 5)
  • Log replication ensures all nodes have same sequence of operations
  • Terms/epochs prevent stale leaders and ensure safety
  • Network partitions require quorum - minority partition cannot make progress
  • Linearizability requires reads to go through leader or use read leases
  • Choose Raft for most use cases due to simplicity, use Paxos only if you need leaderless consensus

About the author

InterviewCrafted helps you master system design with patience. We believe in curiosity-led engineering, reflective writing, and designing systems that make future changes feel calm.