Topic Overview
Leader Election
Learn how distributed systems elect a leader to coordinate activities and ensure consistency.
Leader election is a fundamental problem in distributed systems where multiple nodes need to agree on which node should act as the coordinator or leader for a particular task.
Why Leader Election?
In distributed systems, having a single leader helps:
- Coordinate activities: Prevent conflicts in distributed operations
- Maintain consistency: Leader makes decisions, followers replicate
- Simplify consensus: Easier than having all nodes agree on every decision
- Manage resources: Leader allocates work, manages state
Examples:
- Database primary node (writes go to leader)
- Distributed lock manager
- Configuration management
- Task scheduling
Bully Algorithm
The node with the highest ID becomes the leader.
Process
- Node detects leader is down
- Sends election message to all nodes with higher ID
- If no response → becomes leader, announces to all
- If response → waits for new leader announcement
Implementation
class BullyElection {
private nodes: Node[];
private myId: number;
private leaderId: number | null = null;
async startElection(): Promise<void> {
const higherNodes = this.nodes.filter(n => n.id > this.myId);
if (higherNodes.length === 0) {
// I'm the highest - become leader
await this.becomeLeader();
return;
}
// Send election to higher nodes
const responses = await Promise.allSettled(
higherNodes.map(node => this.sendElectionMessage(node))
);
// If any higher node responded, wait for their announcement
const anyResponded = responses.some(r => r.status === 'fulfilled');
if (anyResponded) {
// Wait for leader announcement
await this.waitForLeader();
} else {
// No higher node responded - I'm the leader
await this.becomeLeader();
}
}
private async becomeLeader(): Promise<void> {
this.leaderId = this.myId;
// Announce to all nodes
await Promise.all(
this.nodes.map(node => this.sendLeaderAnnouncement(node))
);
}
}
Pros
- Simple to understand
- Fast election (O(n) messages)
- Deterministic (highest ID always wins)
Cons
- Requires all nodes to know all other nodes
- Higher ID nodes always win (may not be best choice)
- Network partitions can cause multiple leaders
Ring Algorithm
Nodes arranged in logical ring, election message passed around.
Process
- Node detects leader is down
- Sends election message to next node in ring
- Each node adds its ID and forwards
- When message returns to initiator → highest ID becomes leader
Implementation
class RingElection {
private ring: Node[]; // Ordered list
private myIndex: number;
private leaderId: number | null = null;
async startElection(): Promise<void> {
const electionMessage = {
type: 'election',
participants: [this.myId],
initiator: this.myId
};
// Send to next node in ring
const nextNode = this.ring[(this.myIndex + 1) % this.ring.length];
await this.sendElectionMessage(nextNode, electionMessage);
}
async handleElectionMessage(message: ElectionMessage): Promise<void> {
if (message.initiator === this.myId) {
// Message came back to me
const leaderId = Math.max(...message.participants);
if (leaderId === this.myId) {
await this.becomeLeader();
} else {
this.leaderId = leaderId;
}
return;
}
// Add myself and forward
message.participants.push(this.myId);
const nextNode = this.ring[(this.myIndex + 1) % this.ring.length];
await this.sendElectionMessage(nextNode, message);
}
}
Pros
- Works with partial network knowledge
- Deterministic result
- Handles node failures gracefully
Cons
- Slower (O(n) time, message travels around ring)
- Ring must be maintained
- Single point of failure if ring breaks
Raft Leader Election
Raft uses randomized timeouts and voting to elect leaders.
Key Concepts
- Terms: Time periods, each has at most one leader
- Heartbeats: Leader sends periodic heartbeats
- Timeouts: If follower doesn't receive heartbeat, starts election
- Voting: Nodes vote for candidate, majority wins
Process
- Follower doesn't receive heartbeat → becomes candidate
- Candidate increments term, votes for itself
- Sends vote requests to all other nodes
- If receives majority votes → becomes leader
- If another leader found → becomes follower
Implementation
class RaftNode {
private state: 'follower' | 'candidate' | 'leader' = 'follower';
private currentTerm: number = 0;
private votedFor: number | null = null;
private electionTimeout: number;
private heartbeatInterval: number;
async start(): Promise<void> {
this.scheduleElection();
this.startHeartbeatListener();
}
private scheduleElection(): void {
// Random timeout (150-300ms)
this.electionTimeout = setTimeout(() => {
this.startElection();
}, 150 + Math.random() * 150);
}
private async startElection(): Promise<void> {
this.state = 'candidate';
this.currentTerm++;
this.votedFor = this.myId;
// Request votes from all nodes
const votes = await Promise.allSettled(
this.nodes.map(node => this.requestVote(node))
);
const voteCount = votes.filter(v => v.status === 'fulfilled').length;
if (voteCount > this.nodes.length / 2) {
// Majority voted for me
this.becomeLeader();
} else {
// Lost election, become follower
this.state = 'follower';
this.scheduleElection();
}
}
private async requestVote(node: Node): Promise<boolean> {
const response = await node.requestVote({
term: this.currentTerm,
candidateId: this.myId,
lastLogIndex: this.log.length,
lastLogTerm: this.log[this.log.length - 1]?.term || 0
});
return response.voteGranted;
}
private becomeLeader(): void {
this.state = 'leader';
this.scheduleElection(); // Cancel, will reset on heartbeat
this.startSendingHeartbeats();
}
private startSendingHeartbeats(): void {
setInterval(() => {
this.nodes.forEach(node => {
this.sendHeartbeat(node);
});
}, this.heartbeatInterval);
}
}
Examples
ZooKeeper Leader Election
// Using ZooKeeper for leader election
public class LeaderElection {
private ZooKeeper zk;
private String leaderPath = "/leader";
private String myPath;
public void participateInElection() throws Exception {
// Create ephemeral node
myPath = zk.create(leaderPath + "/node-",
data.getBytes(),
ZooDefs.Ids.OPEN_ACL_UNSAFE,
CreateMode.EPHEMERAL_SEQUENTIAL);
// Watch for leader changes
checkLeader();
}
private void checkLeader() throws Exception {
List<String> children = zk.getChildren(leaderPath, false);
Collections.sort(children);
String smallest = children.get(0);
String leaderPath = this.leaderPath + "/" + smallest;
Stat stat = zk.exists(leaderPath, new Watcher() {
@Override
public void process(WatchedEvent event) {
if (event.getType() == EventType.NodeDeleted) {
checkLeader(); // Re-check if leader dies
}
}
});
if (leaderPath.equals(myPath)) {
becomeLeader();
} else {
becomeFollower();
}
}
}
etcd Leader Election
// Using etcd for leader election
func (n *Node) ElectLeader(ctx context.Context) error {
// Try to acquire lock (leader key)
lease, err := n.client.Grant(ctx, 10) // 10 second lease
if err != nil {
return err
}
// Try to put with lease (only succeeds if key doesn't exist)
txn := n.client.Txn(ctx).
If(clientv3.Compare(clientv3.CreateRevision("/leader"), "=", 0)).
Then(clientv3.OpPut("/leader", n.id, clientv3.WithLease(lease.ID))).
Else(clientv3.OpGet("/leader"))
txnResp, err := txn.Commit()
if err != nil {
return err
}
if txnResp.Succeeded {
// I'm the leader
n.becomeLeader(lease.ID)
return nil
}
// Someone else is leader
n.becomeFollower()
return nil
}
Common Pitfalls
- Split-brain problem: Network partition causes multiple leaders. Fix: Use quorum, require majority votes
- Not handling leader failures: System stops when leader dies. Fix: Automatic re-election, health checks
- Stale leaders: Old leader thinks it's still leader after partition. Fix: Use terms/epochs, leader must prove it has majority
- Election storms: Multiple nodes start election simultaneously. Fix: Randomized timeouts (Raft), priority-based (Bully)
- No leader timeout: System waits forever for leader. Fix: Set maximum election timeout, fail fast
- Ignoring network partitions: Assumes all nodes can communicate. Fix: Design for partition tolerance, use quorum
- Not persisting election state: Leader info lost on restart. Fix: Store in persistent storage (ZooKeeper, etcd)
Interview Questions
Beginner
Q: What is leader election and why is it needed in distributed systems?
A: Leader election is the process of selecting one node among multiple nodes to act as the coordinator or leader. It's needed because:
- Coordination: Prevents conflicts when multiple nodes try to perform the same task
- Consistency: Leader makes decisions, followers replicate (simpler than all nodes agreeing)
- Resource management: Leader allocates work, manages shared state
- Simplified consensus: Easier to have one leader than coordinate all nodes
Examples: Database primary node (all writes go to leader), distributed lock manager, configuration coordinator.
Intermediate
Q: Compare the Bully and Ring algorithms for leader election. When would you use each?
A:
Bully Algorithm:
- Process: Node with highest ID becomes leader
- Messages: O(n) messages, fast
- Requirements: All nodes know all other nodes
- Pros: Simple, fast, deterministic
- Cons: Higher ID always wins (may not be best), network partitions cause issues
Ring Algorithm:
- Process: Election message travels around logical ring
- Messages: O(n) messages, but sequential (slower)
- Requirements: Logical ring structure
- Pros: Works with partial knowledge, handles failures
- Cons: Slower, ring maintenance overhead, single point of failure
When to use:
- Bully: Small clusters, all nodes known, need fast election
- Ring: Large clusters, partial network knowledge, ring topology exists
Better alternative: Use Raft or existing systems (ZooKeeper, etcd) that handle edge cases.
Senior
Q: Design a leader election system for a distributed database with 100 nodes across 3 data centers. How do you handle network partitions, ensure only one leader, and minimize election time?
A:
Architecture:
- Raft-based election with quorum requirements
- Multi-datacenter awareness (prefer leader in primary DC)
- Health checks and automatic failover
- Persistent election state (etcd/ZooKeeper)
Design:
class DistributedLeaderElection {
private nodes: Map<DataCenter, Node[]>;
private currentLeader: LeaderInfo | null = null;
private electionService: RaftElection;
async electLeader(): Promise<LeaderInfo> {
// Prefer leader in primary data center
const primaryDC = this.getPrimaryDataCenter();
const candidates = this.getHealthyNodes(primaryDC);
if (candidates.length === 0) {
// Fallback to secondary DCs
candidates.push(...this.getHealthyNodesFromSecondaryDCs());
}
// Use Raft with quorum
const leader = await this.raftElection.elect(candidates, {
quorum: Math.floor(this.getTotalNodes() / 2) + 1,
timeout: 5000, // 5 second max election time
retryBackoff: 'exponential'
});
return leader;
}
}
Handling Network Partitions:
- Quorum requirement: Leader must have majority votes (51+ nodes)
- Split detection: Monitor connectivity, detect partitions
- Partition handling:
- Majority partition: Continues with leader
- Minority partition: Cannot elect leader, read-only mode
- Merge handling: When partitions merge, compare terms, highest term wins
Ensuring Single Leader:
- Terms/epochs: Each election increments term, old leaders rejected
- Leader lease: Leader must renew lease (heartbeat), expires if no renewal
- Fencing tokens: Operations include token, old leader's tokens rejected
- Quorum writes: Leader must get majority acknowledgment for writes
Minimizing Election Time:
- Randomized timeouts: Prevent election storms (150-300ms random)
- Fast failure detection: Health checks every 100ms
- Parallel vote requests: Request votes from all nodes simultaneously
- Pre-voting: Check if current leader is still alive before starting election
- Leader priority: Prefer nodes in primary DC, with better network
Implementation Details:
class OptimizedRaftElection {
async electLeader(candidates: Node[]): Promise<LeaderInfo> {
// Pre-vote: Check if current leader is alive
const currentLeaderAlive = await this.checkLeaderHealth();
if (currentLeaderAlive) {
return this.currentLeader; // No election needed
}
// Start election with randomized timeout
const timeout = 150 + Math.random() * 150;
await this.sleep(timeout);
// Request votes in parallel
const votes = await Promise.all(
candidates.map(node => this.requestVote(node))
);
const voteCount = votes.filter(v => v).length;
const quorum = Math.floor(candidates.length / 2) + 1;
if (voteCount >= quorum) {
return await this.becomeLeader();
}
// Retry with exponential backoff
throw new ElectionFailedError('Insufficient votes');
}
}
Monitoring:
- Election duration metrics
- Leader stability (how long leader stays)
- Partition detection and resolution time
- False positive elections (leader was alive)
Failover Strategy:
- Graceful: Current leader steps down, triggers election
- Ungraceful: Health check fails, automatic election starts
- Target: < 1 second election time, < 5 second total failover
Key Takeaways
- Leader election coordinates distributed systems by selecting a single coordinator
- Bully algorithm: Highest ID wins, simple but requires full network knowledge
- Ring algorithm: Message travels ring, works with partial knowledge but slower
- Raft algorithm: Randomized timeouts, voting, handles partitions well
- Use existing systems (ZooKeeper, etcd) for production - they handle edge cases
- Quorum requirement prevents split-brain (need majority votes)
- Terms/epochs ensure old leaders are rejected after network partitions
- Randomized timeouts prevent election storms when multiple nodes start election
- Health checks detect leader failures quickly to minimize downtime
- Network partitions require quorum - minority partition cannot elect leader
- Leader lease/heartbeat ensures leader is still alive, auto-failover if not