A

autochitect

Explore software architecture ◆

Replication

2026-05-22System DesignDistributed SystemsConsistency

How distributed systems copy data across multiple nodes to survive failures, scale reads, and reduce latency.

Replication

Keeping one copy of your data is fast and simple — until the machine dies. Replication solves this by maintaining identical (or nearly identical) copies of data on multiple nodes, so a failure anywhere is survivable.

Why Replication Exists

Three distinct problems drive replication:

Problem What replication buys you
Fault tolerance If one node dies, others keep serving requests
Read scaling Route reads to replicas; only writes hit the primary
Latency Place replicas close to users in different regions

These goals pull in different directions. Maximising read scale (many async replicas) conflicts with strong consistency. Geographic distribution magnifies replication lag. No single strategy wins everywhere — you choose based on what you can tolerate losing.

The Three Replication Topologies

Single-Leader (Primary–Replica)

One node is the leader (also called primary or master). All writes go through it. The leader records each write to its write-ahead log (WAL) and streams that log to one or more replicas (followers).

Client
  │  write
  ▼
[Leader] ──WAL stream──▶ [Replica 1]
                     └──▶ [Replica 2]

Reads can go to any replica, but stale reads are possible unless you route to the leader or enforce read-your-writes.

Where you see it: PostgreSQL streaming replication, MySQL binlog, MongoDB replica sets, Kafka partition leaders.

Multi-Leader

Multiple nodes accept writes simultaneously. Each leader replicates its writes to the others. This unlocks active–active geo-distribution: users in Tokyo write to the Tokyo leader; users in Frankfurt write to the Frankfurt leader; both leaders sync in the background.

The cost: write conflicts. Two leaders that both accept a write to the same row at the same instant produce a conflict that requires resolution — last-write-wins, application logic, or CRDTs. Conflict resolution is genuinely hard to get right.

Where you see it: CouchDB, some MySQL cluster configs, Google Docs (CRDT-based).

Leaderless (Quorum Replication)

No designated leader. The client (or a coordinator) sends every write to W out of N replicas and waits for acknowledgement from W of them. Reads query R replicas and take the most recent value.

The constraint W + R > N guarantees that reads and writes overlap on at least one node, ensuring the freshest value is always returned.

         ┌──────────────────────────────────────────────┐
         │  N = 3 replicas   W = 2   R = 2   W+R=4 > 3  │
         └──────────────────────────────────────────────┘

Write ──▶ Replica A  ✓
      ──▶ Replica B  ✓       (W=2 acks received → write succeeds)
      ──▶ Replica C  ✗  (slow/down)

Read  ◀── Replica A  →  v5
      ◀── Replica C  →  v4   (R=2 responses → take max → v5)

Leaderless replication is highly available: there is no single node whose failure stops writes. The tradeoff is that out-of-date replicas must be repaired, typically via read repair (the client writes back the freshest value during reads) and anti-entropy (background reconciliation).

Where you see it: Cassandra, DynamoDB, Riak, Amazon S3 (internally).

Synchronous vs. Asynchronous Replication

The most consequential decision in any replication setup is when the leader considers a write durable.

Synchronous Asynchronous
Write acks after Replica confirms write Leader writes locally
Data loss on leader failure None (replica is current) Up to lag window
Write latency Leader RTT + replica RTT Leader RTT only
Replica failure impact Blocks writes (if only sync replica) No impact

Most systems use a semi-synchronous approach: one replica is synchronous (guaranteeing one durable copy beyond the leader), the rest are async. PostgreSQL synchronous_commit, MySQL semi-sync, and Kafka acks=all all do this.

Replication Lag

In async setups the replica log trails the leader by some milliseconds to seconds. Three consistency anomalies emerge:

Read-your-own-writes: A user writes something, then immediately reads from a replica that hasn't caught up — and sees the old value. Fix: route reads for the user's own data to the leader for a short window after each write.

Monotonic reads: User reads v5 from replica A, then v4 from a slower replica B — a value going backwards. Fix: pin each user's reads to a single replica.

Consistent-prefix reads: A causally related sequence of writes is replicated out of order, making the past look incoherent. Fix: replicate related writes through the same log partition.

Split-Brain

When the network partitions, both halves may elect a leader (in single-leader systems) or accept writes (in leaderless ones). Two leaders writing different values to the same key with no coordination is a split-brain — data is silently lost or corrupted when the partition heals.

Prevention strategies:

  • STONITH ("Shoot The Other Node In The Head") — fencing: the cluster forcibly terminates the node it can't reach before promoting a new leader.
  • Epoch fencing — the new leader refuses to process writes from clients that still believe in the old leader.
  • Quorum writes — requiring a majority for writes makes split-brain impossible as long as the majority side is the only one that can form a quorum.

Replication at the Hardware Level

A WAL stream is a sequence of binary-encoded operations flushed to disk by fsync. When PostgreSQL streams replication, it ships the raw WAL bytes over a TCP connection. The replica writes them to its own WAL and then replays them — effectively re-running every disk write the primary made.

This means replication throughput is bounded by:

  • Leader fsync rate — how fast the leader can flush to disk
  • Network bandwidth — each byte written on the leader must cross the wire
  • Replica apply rate — how fast the replica can replay operations

On modern NVMe storage, fsync latency is ~100 µs. A transatlantic link adds ~70 ms RTT. Synchronous replication over a geo-distributed link will floor your write throughput to ~14 writes/s — which is why geo-sync replication is rare and async geo-replication is almost universal.

Choosing a Strategy

If you need… Use
Simple HA, single region Single-leader, semi-sync
Read scaling, tolerate stale reads Single-leader, async replicas
Active–active multi-region Multi-leader + conflict resolution
Leaderless availability, tunable consistency Quorum (Cassandra/DynamoDB model)
Zero-RPO disaster recovery Sync replication to a standby