Topic Overview

Leader Election

Learn how distributed systems elect a leader to coordinate activities and ensure consistency.

Leader election is a fundamental problem in distributed systems where multiple nodes need to agree on which node should act as the coordinator or leader for a particular task.


Why Leader Election?

In distributed systems, having a single leader helps:

  • Coordinate activities: Prevent conflicts in distributed operations
  • Maintain consistency: Leader makes decisions, followers replicate
  • Simplify consensus: Easier than having all nodes agree on every decision
  • Manage resources: Leader allocates work, manages state

Examples:

  • Database primary node (writes go to leader)
  • Distributed lock manager
  • Configuration management
  • Task scheduling

Bully Algorithm

The node with the highest ID becomes the leader.

Process

  1. Node detects leader is down
  2. Sends election message to all nodes with higher ID
  3. If no response → becomes leader, announces to all
  4. If response → waits for new leader announcement

Implementation

class BullyElection {
  private nodes: Node[];
  private myId: number;
  private leaderId: number | null = null;

  async startElection(): Promise<void> {
    const higherNodes = this.nodes.filter(n => n.id > this.myId);
    
    if (higherNodes.length === 0) {
      // I'm the highest - become leader
      await this.becomeLeader();
      return;
    }

    // Send election to higher nodes
    const responses = await Promise.allSettled(
      higherNodes.map(node => this.sendElectionMessage(node))
    );

    // If any higher node responded, wait for their announcement
    const anyResponded = responses.some(r => r.status === 'fulfilled');
    if (anyResponded) {
      // Wait for leader announcement
      await this.waitForLeader();
    } else {
      // No higher node responded - I'm the leader
      await this.becomeLeader();
    }
  }

  private async becomeLeader(): Promise<void> {
    this.leaderId = this.myId;
    // Announce to all nodes
    await Promise.all(
      this.nodes.map(node => this.sendLeaderAnnouncement(node))
    );
  }
}

Pros

  • Simple to understand
  • Fast election (O(n) messages)
  • Deterministic (highest ID always wins)

Cons

  • Requires all nodes to know all other nodes
  • Higher ID nodes always win (may not be best choice)
  • Network partitions can cause multiple leaders

Ring Algorithm

Nodes arranged in logical ring, election message passed around.

Process

  1. Node detects leader is down
  2. Sends election message to next node in ring
  3. Each node adds its ID and forwards
  4. When message returns to initiator → highest ID becomes leader

Implementation

class RingElection {
  private ring: Node[]; // Ordered list
  private myIndex: number;
  private leaderId: number | null = null;

  async startElection(): Promise<void> {
    const electionMessage = {
      type: 'election',
      participants: [this.myId],
      initiator: this.myId
    };

    // Send to next node in ring
    const nextNode = this.ring[(this.myIndex + 1) % this.ring.length];
    await this.sendElectionMessage(nextNode, electionMessage);
  }

  async handleElectionMessage(message: ElectionMessage): Promise<void> {
    if (message.initiator === this.myId) {
      // Message came back to me
      const leaderId = Math.max(...message.participants);
      if (leaderId === this.myId) {
        await this.becomeLeader();
      } else {
        this.leaderId = leaderId;
      }
      return;
    }

    // Add myself and forward
    message.participants.push(this.myId);
    const nextNode = this.ring[(this.myIndex + 1) % this.ring.length];
    await this.sendElectionMessage(nextNode, message);
  }
}

Pros

  • Works with partial network knowledge
  • Deterministic result
  • Handles node failures gracefully

Cons

  • Slower (O(n) time, message travels around ring)
  • Ring must be maintained
  • Single point of failure if ring breaks

Raft Leader Election

Raft uses randomized timeouts and voting to elect leaders.

Key Concepts

  • Terms: Time periods, each has at most one leader
  • Heartbeats: Leader sends periodic heartbeats
  • Timeouts: If follower doesn't receive heartbeat, starts election
  • Voting: Nodes vote for candidate, majority wins

Process

  1. Follower doesn't receive heartbeat → becomes candidate
  2. Candidate increments term, votes for itself
  3. Sends vote requests to all other nodes
  4. If receives majority votes → becomes leader
  5. If another leader found → becomes follower

Implementation

class RaftNode {
  private state: 'follower' | 'candidate' | 'leader' = 'follower';
  private currentTerm: number = 0;
  private votedFor: number | null = null;
  private electionTimeout: number;
  private heartbeatInterval: number;

  async start(): Promise<void> {
    this.scheduleElection();
    this.startHeartbeatListener();
  }

  private scheduleElection(): void {
    // Random timeout (150-300ms)
    this.electionTimeout = setTimeout(() => {
      this.startElection();
    }, 150 + Math.random() * 150);
  }

  private async startElection(): Promise<void> {
    this.state = 'candidate';
    this.currentTerm++;
    this.votedFor = this.myId;

    // Request votes from all nodes
    const votes = await Promise.allSettled(
      this.nodes.map(node => this.requestVote(node))
    );

    const voteCount = votes.filter(v => v.status === 'fulfilled').length;
    
    if (voteCount > this.nodes.length / 2) {
      // Majority voted for me
      this.becomeLeader();
    } else {
      // Lost election, become follower
      this.state = 'follower';
      this.scheduleElection();
    }
  }

  private async requestVote(node: Node): Promise<boolean> {
    const response = await node.requestVote({
      term: this.currentTerm,
      candidateId: this.myId,
      lastLogIndex: this.log.length,
      lastLogTerm: this.log[this.log.length - 1]?.term || 0
    });

    return response.voteGranted;
  }

  private becomeLeader(): void {
    this.state = 'leader';
    this.scheduleElection(); // Cancel, will reset on heartbeat
    this.startSendingHeartbeats();
  }

  private startSendingHeartbeats(): void {
    setInterval(() => {
      this.nodes.forEach(node => {
        this.sendHeartbeat(node);
      });
    }, this.heartbeatInterval);
  }
}

Examples

ZooKeeper Leader Election

// Using ZooKeeper for leader election
public class LeaderElection {
    private ZooKeeper zk;
    private String leaderPath = "/leader";
    private String myPath;

    public void participateInElection() throws Exception {
        // Create ephemeral node
        myPath = zk.create(leaderPath + "/node-", 
                          data.getBytes(),
                          ZooDefs.Ids.OPEN_ACL_UNSAFE,
                          CreateMode.EPHEMERAL_SEQUENTIAL);

        // Watch for leader changes
        checkLeader();
    }

    private void checkLeader() throws Exception {
        List<String> children = zk.getChildren(leaderPath, false);
        Collections.sort(children);
        
        String smallest = children.get(0);
        String leaderPath = this.leaderPath + "/" + smallest;
        
        Stat stat = zk.exists(leaderPath, new Watcher() {
            @Override
            public void process(WatchedEvent event) {
                if (event.getType() == EventType.NodeDeleted) {
                    checkLeader(); // Re-check if leader dies
                }
            }
        });

        if (leaderPath.equals(myPath)) {
            becomeLeader();
        } else {
            becomeFollower();
        }
    }
}

etcd Leader Election

// Using etcd for leader election
func (n *Node) ElectLeader(ctx context.Context) error {
    // Try to acquire lock (leader key)
    lease, err := n.client.Grant(ctx, 10) // 10 second lease
    if err != nil {
        return err
    }

    // Try to put with lease (only succeeds if key doesn't exist)
    txn := n.client.Txn(ctx).
        If(clientv3.Compare(clientv3.CreateRevision("/leader"), "=", 0)).
        Then(clientv3.OpPut("/leader", n.id, clientv3.WithLease(lease.ID))).
        Else(clientv3.OpGet("/leader"))

    txnResp, err := txn.Commit()
    if err != nil {
        return err
    }

    if txnResp.Succeeded {
        // I'm the leader
        n.becomeLeader(lease.ID)
        return nil
    }

    // Someone else is leader
    n.becomeFollower()
    return nil
}

Common Pitfalls

  • Split-brain problem: Network partition causes multiple leaders. Fix: Use quorum, require majority votes
  • Not handling leader failures: System stops when leader dies. Fix: Automatic re-election, health checks
  • Stale leaders: Old leader thinks it's still leader after partition. Fix: Use terms/epochs, leader must prove it has majority
  • Election storms: Multiple nodes start election simultaneously. Fix: Randomized timeouts (Raft), priority-based (Bully)
  • No leader timeout: System waits forever for leader. Fix: Set maximum election timeout, fail fast
  • Ignoring network partitions: Assumes all nodes can communicate. Fix: Design for partition tolerance, use quorum
  • Not persisting election state: Leader info lost on restart. Fix: Store in persistent storage (ZooKeeper, etcd)

Interview Questions

Beginner

Q: What is leader election and why is it needed in distributed systems?

A: Leader election is the process of selecting one node among multiple nodes to act as the coordinator or leader. It's needed because:

  • Coordination: Prevents conflicts when multiple nodes try to perform the same task
  • Consistency: Leader makes decisions, followers replicate (simpler than all nodes agreeing)
  • Resource management: Leader allocates work, manages shared state
  • Simplified consensus: Easier to have one leader than coordinate all nodes

Examples: Database primary node (all writes go to leader), distributed lock manager, configuration coordinator.


Intermediate

Q: Compare the Bully and Ring algorithms for leader election. When would you use each?

A:

Bully Algorithm:

  • Process: Node with highest ID becomes leader
  • Messages: O(n) messages, fast
  • Requirements: All nodes know all other nodes
  • Pros: Simple, fast, deterministic
  • Cons: Higher ID always wins (may not be best), network partitions cause issues

Ring Algorithm:

  • Process: Election message travels around logical ring
  • Messages: O(n) messages, but sequential (slower)
  • Requirements: Logical ring structure
  • Pros: Works with partial knowledge, handles failures
  • Cons: Slower, ring maintenance overhead, single point of failure

When to use:

  • Bully: Small clusters, all nodes known, need fast election
  • Ring: Large clusters, partial network knowledge, ring topology exists

Better alternative: Use Raft or existing systems (ZooKeeper, etcd) that handle edge cases.


Senior

Q: Design a leader election system for a distributed database with 100 nodes across 3 data centers. How do you handle network partitions, ensure only one leader, and minimize election time?

A:

Architecture:

  • Raft-based election with quorum requirements
  • Multi-datacenter awareness (prefer leader in primary DC)
  • Health checks and automatic failover
  • Persistent election state (etcd/ZooKeeper)

Design:

class DistributedLeaderElection {
  private nodes: Map<DataCenter, Node[]>;
  private currentLeader: LeaderInfo | null = null;
  private electionService: RaftElection;

  async electLeader(): Promise<LeaderInfo> {
    // Prefer leader in primary data center
    const primaryDC = this.getPrimaryDataCenter();
    const candidates = this.getHealthyNodes(primaryDC);

    if (candidates.length === 0) {
      // Fallback to secondary DCs
      candidates.push(...this.getHealthyNodesFromSecondaryDCs());
    }

    // Use Raft with quorum
    const leader = await this.raftElection.elect(candidates, {
      quorum: Math.floor(this.getTotalNodes() / 2) + 1,
      timeout: 5000, // 5 second max election time
      retryBackoff: 'exponential'
    });

    return leader;
  }
}

Handling Network Partitions:

  1. Quorum requirement: Leader must have majority votes (51+ nodes)
  2. Split detection: Monitor connectivity, detect partitions
  3. Partition handling:
    • Majority partition: Continues with leader
    • Minority partition: Cannot elect leader, read-only mode
  4. Merge handling: When partitions merge, compare terms, highest term wins

Ensuring Single Leader:

  1. Terms/epochs: Each election increments term, old leaders rejected
  2. Leader lease: Leader must renew lease (heartbeat), expires if no renewal
  3. Fencing tokens: Operations include token, old leader's tokens rejected
  4. Quorum writes: Leader must get majority acknowledgment for writes

Minimizing Election Time:

  1. Randomized timeouts: Prevent election storms (150-300ms random)
  2. Fast failure detection: Health checks every 100ms
  3. Parallel vote requests: Request votes from all nodes simultaneously
  4. Pre-voting: Check if current leader is still alive before starting election
  5. Leader priority: Prefer nodes in primary DC, with better network

Implementation Details:

class OptimizedRaftElection {
  async electLeader(candidates: Node[]): Promise<LeaderInfo> {
    // Pre-vote: Check if current leader is alive
    const currentLeaderAlive = await this.checkLeaderHealth();
    if (currentLeaderAlive) {
      return this.currentLeader; // No election needed
    }

    // Start election with randomized timeout
    const timeout = 150 + Math.random() * 150;
    await this.sleep(timeout);

    // Request votes in parallel
    const votes = await Promise.all(
      candidates.map(node => this.requestVote(node))
    );

    const voteCount = votes.filter(v => v).length;
    const quorum = Math.floor(candidates.length / 2) + 1;

    if (voteCount >= quorum) {
      return await this.becomeLeader();
    }

    // Retry with exponential backoff
    throw new ElectionFailedError('Insufficient votes');
  }
}

Monitoring:

  • Election duration metrics
  • Leader stability (how long leader stays)
  • Partition detection and resolution time
  • False positive elections (leader was alive)

Failover Strategy:

  • Graceful: Current leader steps down, triggers election
  • Ungraceful: Health check fails, automatic election starts
  • Target: < 1 second election time, < 5 second total failover

Key Takeaways

  • Leader election coordinates distributed systems by selecting a single coordinator
  • Bully algorithm: Highest ID wins, simple but requires full network knowledge
  • Ring algorithm: Message travels ring, works with partial knowledge but slower
  • Raft algorithm: Randomized timeouts, voting, handles partitions well
  • Use existing systems (ZooKeeper, etcd) for production - they handle edge cases
  • Quorum requirement prevents split-brain (need majority votes)
  • Terms/epochs ensure old leaders are rejected after network partitions
  • Randomized timeouts prevent election storms when multiple nodes start election
  • Health checks detect leader failures quickly to minimize downtime
  • Network partitions require quorum - minority partition cannot elect leader
  • Leader lease/heartbeat ensures leader is still alive, auto-failover if not

About the author

InterviewCrafted helps you master system design with patience. We believe in curiosity-led engineering, reflective writing, and designing systems that make future changes feel calm.