Leader Election

Think of a city council that needs a chair. Someone has to run the meeting, break ties, and speak for the group — but any member could do it. When the current chair steps down (or disappears), the group needs a fast, fair procedure to pick a new one. They can't call a central authority; they have to agree among themselves.

That is leader election: a group of nodes in a distributed system selecting one coordinator, without any external referee.

Why a Leader at All?

Many distributed problems are easier with one node in charge:

Serialisation — a single leader assigns a total order to writes, preventing conflicting updates.
Coordination — assigning work, locking a resource, or deciding a config change is simpler with one decider.
Fan-out — the leader collects client requests and replicates them, so followers need only respond to one source.

The cost is a single point of failure. Leader election is how you recover from that failure automatically.

The Fundamental Tension

Any election algorithm must satisfy two opposing properties:

Property	Meaning
Safety	At most one leader at a time (no split-brain)
Liveness	A leader is eventually elected (system makes progress)

Under network partitions, you cannot guarantee both simultaneously. Every real algorithm picks a side — usually safety — and accepts a temporary unavailability window during the election.

Algorithm 1: Bully Algorithm

The simplest election. Each node has a unique numeric ID. When a node detects the leader is gone:

1. Node N broadcasts ELECTION to all nodes with ID > N
2. Any higher-ID node responds OK and starts its own election
3. If N hears no OK within timeout → N declares itself leader, broadcasts COORDINATOR
4. If N hears OK → N waits; the higher node will eventually declare

IDs:  1    2    3    4    5(leader, crashed)
      ↑
 3 detects crash, sends ELECTION to 4, 5
                              ↑
                   4 responds OK, sends ELECTION to 5
                                              ↑
                              5 is dead — 4 hears nothing
                              4 declares COORDINATOR

Pros: simple, fast convergence — O(n²) messages worst case.
Cons: requires stable unique IDs; the highest-ID node always wins regardless of how fresh its state is — dangerous if it has a stale log.

Algorithm 2: Ring Election

Nodes are arranged in a logical ring. Each node knows its successor:

1. A node that detects leader failure sends ELECTION(myID) to its successor
2. Each node forwards the message, replacing the ID with max(received, myID)
3. When the originator receives its own message back, the ID inside is the maximum — that node becomes leader
4. A COORDINATOR message travels the ring to inform everyone

  A → B → C → D → A
  
  C starts: ELECTION(C)
  D forwards: ELECTION(D)   ← D is higher
  A forwards: ELECTION(D)
  B forwards: ELECTION(D)
  C receives ELECTION(D) — D is leader, C sends COORDINATOR(D)

Pros: O(n) messages, no global broadcast.
Cons: two full ring traversals (election + coordinator); a broken link can stall the election.

Algorithm 3: Quorum-Based Election (Raft, ZAB)

Modern systems use a voting protocol: a candidate must receive votes from a majority of nodes before it can lead.

States: Follower → Candidate → Leader
        Candidate → Follower  (lost election)

Transitions:
  [Follower] heartbeat timeout  → [Candidate]
  [Candidate] majority votes    → [Leader]
  [Candidate] sees higher term  → [Follower]
  [Leader] sees higher term     → [Follower]

A vote is only granted if the candidate's log is at least as up-to-date as the voter's. This safety invariant means the elected leader is guaranteed to have every committed entry — you'll never elect a node that would lose data.

Term 1:  A=Leader ─── heartbeats ──→ B, C

         A crashes

Term 2:  B times out, increments term to 2
         B sends RequestVote(term=2, lastLogIndex=5) → A(dead), C
         C checks: B's log ≥ C's log? Yes → votes for B
         B has 2/3 votes (majority) → B becomes Leader for term 2

Pros: safe under arbitrary partitions; the winner is always the most up-to-date node.
Cons: requires a majority to be reachable — a 3-node cluster tolerates 1 failure; a 5-node cluster tolerates 2.

The Election Window

During an election, the cluster cannot serve writes. The duration depends on:

election_window ≈ failure_detection_time + election_timeout + voting_round_trips
                ≈      (1–10 s)          +    (150–500 ms)  +     (1–2 × RTT)

Tuning election_timeout is the main lever:

Timeout	Effect
Too short (< 2× RTT)	False positives — slow network looks like a crash
Too long (> 5 s)	Long unavailability after a real failure
Randomised range	Prevents simultaneous candidates, reduces split votes

etcd defaults to 200 ms heartbeat, 1 s election timeout. Kubernetes API server targeting 10 s lease renewal is much more conservative because a false failover is expensive.

Split-Brain and Fencing

The most dangerous failure: two nodes both believe they are leader. This happens when a node is isolated (network partition) but not dead — it stops receiving heartbeats and starts a new term while the real leader is still running in the other partition.

Quorum solves this structurally: only one partition can have a majority, so only one leader can hold a valid lease. But at the storage layer, you still need a fencing token — a monotonically increasing number (the term or epoch) included in every write request. Storage rejects writes from tokens older than the current maximum, blocking the stale leader even if it never learned it was deposed.

Leader (term 4) is partitioned, continues writing
New leader (term 5) is elected
Storage sees write from term 4 < current max term 5 → REJECT

Real-World Implementations

System	Algorithm	Lease Duration	Failure Tolerance
etcd / Raft	Quorum vote + log freshness	1–5 s	⌊(n-1)/2⌋ nodes
ZooKeeper (ZAB)	Quorum vote + epoch	2–4 s	⌊(n-1)/2⌋ nodes
Kafka KRaft	Raft variant	configurable	⌊(n-1)/2⌋ nodes
Consul	Raft	1 s default	⌊(n-1)/2⌋ nodes

When Not to Use a Leader

A leader is a serialisation point. If you need horizontal write scalability, you want leaderless replication (Dynamo, Cassandra) where any node accepts writes and conflicts are resolved at read time via vector clocks or last-write-wins. The trade-off: you give up strong consistency and must handle conflicts explicitly.

If coordination is only needed occasionally (e.g., a distributed lock), a lease from a coordination service like etcd is cheaper than running your own election protocol inside your application.