Leader Election
How distributed systems agree on a single coordinator — the algorithms, failure modes, and trade-offs of electing a leader without a central authority.
Leader Election
Think of a city council that needs a chair. Someone has to run the meeting, break ties, and speak for the group — but any member could do it. When the current chair steps down (or disappears), the group needs a fast, fair procedure to pick a new one. They can't call a central authority; they have to agree among themselves.
That is leader election: a group of nodes in a distributed system selecting one coordinator, without any external referee.
Why a Leader at All?
Many distributed problems are easier with one node in charge:
- Serialisation — a single leader assigns a total order to writes, preventing conflicting updates.
- Coordination — assigning work, locking a resource, or deciding a config change is simpler with one decider.
- Fan-out — the leader collects client requests and replicates them, so followers need only respond to one source.
The cost is a single point of failure. Leader election is how you recover from that failure automatically.
The Fundamental Tension
Any election algorithm must satisfy two opposing properties:
| Property | Meaning |
|---|---|
| Safety | At most one leader at a time (no split-brain) |
| Liveness | A leader is eventually elected (system makes progress) |
Under network partitions, you cannot guarantee both simultaneously. Every real algorithm picks a side — usually safety — and accepts a temporary unavailability window during the election.
Algorithm 1: Bully Algorithm
The simplest election. Each node has a unique numeric ID. When a node detects the leader is gone:
1. Node N broadcasts ELECTION to all nodes with ID > N
2. Any higher-ID node responds OK and starts its own election
3. If N hears no OK within timeout → N declares itself leader, broadcasts COORDINATOR
4. If N hears OK → N waits; the higher node will eventually declare
IDs: 1 2 3 4 5(leader, crashed)
↑
3 detects crash, sends ELECTION to 4, 5
↑
4 responds OK, sends ELECTION to 5
↑
5 is dead — 4 hears nothing
4 declares COORDINATOR
Pros: simple, fast convergence — O(n²) messages worst case.
Cons: requires stable unique IDs; the highest-ID node always wins regardless of how fresh its state is — dangerous if it has a stale log.
Algorithm 2: Ring Election
Nodes are arranged in a logical ring. Each node knows its successor:
1. A node that detects leader failure sends ELECTION(myID) to its successor
2. Each node forwards the message, replacing the ID with max(received, myID)
3. When the originator receives its own message back, the ID inside is the maximum — that node becomes leader
4. A COORDINATOR message travels the ring to inform everyone
A → B → C → D → A
C starts: ELECTION(C)
D forwards: ELECTION(D) ← D is higher
A forwards: ELECTION(D)
B forwards: ELECTION(D)
C receives ELECTION(D) — D is leader, C sends COORDINATOR(D)
Pros: O(n) messages, no global broadcast.
Cons: two full ring traversals (election + coordinator); a broken link can stall the election.
Algorithm 3: Quorum-Based Election (Raft, ZAB)
Modern systems use a voting protocol: a candidate must receive votes from a majority of nodes before it can lead.
States: Follower → Candidate → Leader
Candidate → Follower (lost election)
Transitions:
[Follower] heartbeat timeout → [Candidate]
[Candidate] majority votes → [Leader]
[Candidate] sees higher term → [Follower]
[Leader] sees higher term → [Follower]
A vote is only granted if the candidate's log is at least as up-to-date as the voter's. This safety invariant means the elected leader is guaranteed to have every committed entry — you'll never elect a node that would lose data.
Term 1: A=Leader ─── heartbeats ──→ B, C
A crashes
Term 2: B times out, increments term to 2
B sends RequestVote(term=2, lastLogIndex=5) → A(dead), C
C checks: B's log ≥ C's log? Yes → votes for B
B has 2/3 votes (majority) → B becomes Leader for term 2
Pros: safe under arbitrary partitions; the winner is always the most up-to-date node.
Cons: requires a majority to be reachable — a 3-node cluster tolerates 1 failure; a 5-node cluster tolerates 2.
The Election Window
During an election, the cluster cannot serve writes. The duration depends on:
election_window ≈ failure_detection_time + election_timeout + voting_round_trips
≈ (1–10 s) + (150–500 ms) + (1–2 × RTT)
Tuning election_timeout is the main lever:
| Timeout | Effect |
|---|---|
| Too short (< 2× RTT) | False positives — slow network looks like a crash |
| Too long (> 5 s) | Long unavailability after a real failure |
| Randomised range | Prevents simultaneous candidates, reduces split votes |
etcd defaults to 200 ms heartbeat, 1 s election timeout. Kubernetes API server targeting 10 s lease renewal is much more conservative because a false failover is expensive.
Split-Brain and Fencing
The most dangerous failure: two nodes both believe they are leader. This happens when a node is isolated (network partition) but not dead — it stops receiving heartbeats and starts a new term while the real leader is still running in the other partition.
Quorum solves this structurally: only one partition can have a majority, so only one leader can hold a valid lease. But at the storage layer, you still need a fencing token — a monotonically increasing number (the term or epoch) included in every write request. Storage rejects writes from tokens older than the current maximum, blocking the stale leader even if it never learned it was deposed.
Leader (term 4) is partitioned, continues writing
New leader (term 5) is elected
Storage sees write from term 4 < current max term 5 → REJECT
Real-World Implementations
| System | Algorithm | Lease Duration | Failure Tolerance |
|---|---|---|---|
| etcd / Raft | Quorum vote + log freshness | 1–5 s | ⌊(n-1)/2⌋ nodes |
| ZooKeeper (ZAB) | Quorum vote + epoch | 2–4 s | ⌊(n-1)/2⌋ nodes |
| Kafka KRaft | Raft variant | configurable | ⌊(n-1)/2⌋ nodes |
| Consul | Raft | 1 s default | ⌊(n-1)/2⌋ nodes |
When Not to Use a Leader
A leader is a serialisation point. If you need horizontal write scalability, you want leaderless replication (Dynamo, Cassandra) where any node accepts writes and conflicts are resolved at read time via vector clocks or last-write-wins. The trade-off: you give up strong consistency and must handle conflicts explicitly.
If coordination is only needed occasionally (e.g., a distributed lock), a lease from a coordination service like etcd is cheaper than running your own election protocol inside your application.