Replication
How distributed systems copy data across multiple nodes to survive failures, scale reads, and reduce latency.
Replication
Keeping one copy of your data is fast and simple — until the machine dies. Replication solves this by maintaining identical (or nearly identical) copies of data on multiple nodes, so a failure anywhere is survivable.
Why Replication Exists
Three distinct problems drive replication:
| Problem | What replication buys you |
|---|---|
| Fault tolerance | If one node dies, others keep serving requests |
| Read scaling | Route reads to replicas; only writes hit the primary |
| Latency | Place replicas close to users in different regions |
These goals pull in different directions. Maximising read scale (many async replicas) conflicts with strong consistency. Geographic distribution magnifies replication lag. No single strategy wins everywhere — you choose based on what you can tolerate losing.
The Three Replication Topologies
Single-Leader (Primary–Replica)
One node is the leader (also called primary or master). All writes go through it. The leader records each write to its write-ahead log (WAL) and streams that log to one or more replicas (followers).
Client
│ write
▼
[Leader] ──WAL stream──▶ [Replica 1]
└──▶ [Replica 2]
Reads can go to any replica, but stale reads are possible unless you route to the leader or enforce read-your-writes.
Where you see it: PostgreSQL streaming replication, MySQL binlog, MongoDB replica sets, Kafka partition leaders.
Multi-Leader
Multiple nodes accept writes simultaneously. Each leader replicates its writes to the others. This unlocks active–active geo-distribution: users in Tokyo write to the Tokyo leader; users in Frankfurt write to the Frankfurt leader; both leaders sync in the background.
The cost: write conflicts. Two leaders that both accept a write to the same row at the same instant produce a conflict that requires resolution — last-write-wins, application logic, or CRDTs. Conflict resolution is genuinely hard to get right.
Where you see it: CouchDB, some MySQL cluster configs, Google Docs (CRDT-based).
Leaderless (Quorum Replication)
No designated leader. The client (or a coordinator) sends every write to W out of N replicas and waits for acknowledgement from W of them. Reads query R replicas and take the most recent value.
The constraint W + R > N guarantees that reads and writes overlap on at least one node, ensuring the freshest value is always returned.
┌──────────────────────────────────────────────┐
│ N = 3 replicas W = 2 R = 2 W+R=4 > 3 │
└──────────────────────────────────────────────┘
Write ──▶ Replica A ✓
──▶ Replica B ✓ (W=2 acks received → write succeeds)
──▶ Replica C ✗ (slow/down)
Read ◀── Replica A → v5
◀── Replica C → v4 (R=2 responses → take max → v5)
Leaderless replication is highly available: there is no single node whose failure stops writes. The tradeoff is that out-of-date replicas must be repaired, typically via read repair (the client writes back the freshest value during reads) and anti-entropy (background reconciliation).
Where you see it: Cassandra, DynamoDB, Riak, Amazon S3 (internally).
Synchronous vs. Asynchronous Replication
The most consequential decision in any replication setup is when the leader considers a write durable.
| Synchronous | Asynchronous | |
|---|---|---|
| Write acks after | Replica confirms write | Leader writes locally |
| Data loss on leader failure | None (replica is current) | Up to lag window |
| Write latency | Leader RTT + replica RTT | Leader RTT only |
| Replica failure impact | Blocks writes (if only sync replica) | No impact |
Most systems use a semi-synchronous approach: one replica is synchronous (guaranteeing one durable copy beyond the leader), the rest are async. PostgreSQL synchronous_commit, MySQL semi-sync, and Kafka acks=all all do this.
Replication Lag
In async setups the replica log trails the leader by some milliseconds to seconds. Three consistency anomalies emerge:
Read-your-own-writes: A user writes something, then immediately reads from a replica that hasn't caught up — and sees the old value. Fix: route reads for the user's own data to the leader for a short window after each write.
Monotonic reads: User reads v5 from replica A, then v4 from a slower replica B — a value going backwards. Fix: pin each user's reads to a single replica.
Consistent-prefix reads: A causally related sequence of writes is replicated out of order, making the past look incoherent. Fix: replicate related writes through the same log partition.
Split-Brain
When the network partitions, both halves may elect a leader (in single-leader systems) or accept writes (in leaderless ones). Two leaders writing different values to the same key with no coordination is a split-brain — data is silently lost or corrupted when the partition heals.
Prevention strategies:
- STONITH ("Shoot The Other Node In The Head") — fencing: the cluster forcibly terminates the node it can't reach before promoting a new leader.
- Epoch fencing — the new leader refuses to process writes from clients that still believe in the old leader.
- Quorum writes — requiring a majority for writes makes split-brain impossible as long as the majority side is the only one that can form a quorum.
Replication at the Hardware Level
A WAL stream is a sequence of binary-encoded operations flushed to disk by fsync. When PostgreSQL streams replication, it ships the raw WAL bytes over a TCP connection. The replica writes them to its own WAL and then replays them — effectively re-running every disk write the primary made.
This means replication throughput is bounded by:
- Leader fsync rate — how fast the leader can flush to disk
- Network bandwidth — each byte written on the leader must cross the wire
- Replica apply rate — how fast the replica can replay operations
On modern NVMe storage, fsync latency is ~100 µs. A transatlantic link adds ~70 ms RTT. Synchronous replication over a geo-distributed link will floor your write throughput to ~14 writes/s — which is why geo-sync replication is rare and async geo-replication is almost universal.
Choosing a Strategy
| If you need… | Use |
|---|---|
| Simple HA, single region | Single-leader, semi-sync |
| Read scaling, tolerate stale reads | Single-leader, async replicas |
| Active–active multi-region | Multi-leader + conflict resolution |
| Leaderless availability, tunable consistency | Quorum (Cassandra/DynamoDB model) |
| Zero-RPO disaster recovery | Sync replication to a standby |