Database Replication: Single-Leader, Multi-Leader and Leaderless Explained
Why Replication Exists
A single database server is a single point of failure. It will eventually crash, run out of disk, need a reboot for OS updates, or be in the wrong datacenter when the network partitions. Replication solves this by maintaining copies of the data on multiple machines.
But replication is not free. Every replication model makes trade-offs between:
Consistency — do all replicas see the same data at the same time?
Availability — can the system accept reads and writes if some nodes are down?
Latency — how much does replication add to write latency?
Complexity — how hard is it to handle conflicts when replicas diverge?
There are three fundamental replication architectures. Each answers these trade-offs differently.
Synchronous vs Asynchronous Replication
Before the three architectures, one foundational choice cuts across all of them: when does a write complete?
Synchronous replication: The primary waits for at least one replica to confirm it has received and written the data before returning success to the client.
Client Primary Replica
│ │ │
│──── write ───→│ │
│ │──── write ───→│
│ │←── ack ───────│
│←── success ───│ │
│ │
[Both primary and replica have the data before client receives success]Guarantee: If the primary fails immediately after the write, the replica has the data — no data loss.
Cost: Write latency = primary write time + network round-trip to replica + replica write time. If the replica is slow or unavailable, writes block.
Asynchronous replication: The primary writes locally and returns success immediately. Replication happens in the background.
Client Primary Replica
│ │ │
│──── write ───→│ │
│←── success ───│ │
│ │──── write ───→│ (happens asynchronously, after success returned)
│ │ │
[Client receives success before replica has the data]Guarantee: Lowest write latency. Primary never blocks waiting for replicas.
Cost: If the primary fails before replication completes, that data is lost. Replicas may lag seconds or minutes behind.
Semi-synchronous (common in practice): Wait for confirmation from at least one replica before returning success. If that replica is down, fall back to async. This balances durability and availability.
-- MySQL semi-synchronous replication
INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';
SET GLOBAL rpl_semi_sync_master_enabled = 1;
SET GLOBAL rpl_semi_sync_master_timeout = 1000; -- 1 second timeout before falling back to async
-- PostgreSQL synchronous replication
-- postgresql.conf:
synchronous_standby_names = 'replica1' -- wait for this replica
synchronous_commit = on -- default: wait for WAL flush on replica
-- synchronous_commit = remote_write -- wait for replica to receive WAL (faster, slight risk)
-- synchronous_commit = local -- only wait for local WAL flush (async replication)Architecture 1: Single-Leader Replication
One node is designated the leader (also called primary or master). All writes go to the leader. The leader replicates changes to followers (replicas, standbys). Reads can go to either the leader or followers.
┌─────────────────────────────────────┐
│ SINGLE LEADER │
│ │
Writes──→│ Leader (Primary) │
Reads ──→│ - Accepts all writes │
│ - Replicates to followers │
└─────────────────────────────────────┘
│ │
WAL/binlog stream WAL/binlog stream
│ │
┌────────┴───┐ ┌───────┴────┐
│ Follower 1 │ │ Follower 2 │
│ Read-only │ │ Read-only │
└────────────┘ └────────────┘Used by: PostgreSQL streaming replication, MySQL replication, MongoDB replica sets, Kafka (within a partition).
How Replication Log Works
There are three approaches to what the leader sends to followers:
Statement-based replication: The leader logs every SQL statement and sends it to followers, which re-execute it.
Leader sends: UPDATE accounts SET balance = balance * 1.1 WHERE status = 'active'
Follower re-executes the same SQLProblem: non-deterministic functions (NOW(), RAND(), UUID()) produce different results on each execution. A statement with WHERE created_at < NOW() may match different rows on the leader vs a lagging follower.
WAL-based (physical) replication (PostgreSQL): The leader sends its WAL bytes directly. Followers apply the exact same page changes.
Leader sends: WAL record "page (42, 3): bytes 120-124 changed from 0x... to 0x..."
Follower applies exact byte change to its copy of page 42Advantage: perfectly deterministic. Disadvantage: tightly coupled to storage engine internals — followers must use the same PostgreSQL version and page format.
Row-based (logical) replication: The leader logs the before and after state of each row change.
Leader sends: UPDATE row id=1: {balance: 1000} → {balance: 1100}
Follower applies the row change without re-executing SQLAdvantage: deterministic, version-independent, can replicate to different database engines. Disadvantage: large batch updates generate large replication streams (one row event per row changed).
Adding a New Follower Without Downtime
1. Take a consistent snapshot of the leader (pg_basebackup in PostgreSQL)
- This snapshot has a known WAL position (LSN)
2. Copy the snapshot to the new follower
3. Follower connects to leader and requests WAL from the snapshot LSN
4. Follower replays WAL to catch up to the present
5. Follower is now in sync — available for readsThis entire process happens without locking or downtime on the leader.
Architecture 2: Multi-Leader Replication
Multiple nodes can accept writes. Each leader replicates its changes to all other leaders and all followers.
Datacenter A Datacenter B
┌────────────────────┐ ┌────────────────────┐
│ Leader A │◄─────────►│ Leader B │
│ (accepts writes) │ │ (accepts writes) │
└─────────┬──────────┘ └──────────┬─────────┘
│ │
Followers A Followers BUsed by: MySQL Group Replication, CouchDB, Google Docs (operational transformation), some multi-region PostgreSQL setups (e.g. BDR).
The key benefit: Writes can be accepted in multiple datacenters simultaneously. Users in Europe write to the European leader; users in Asia write to the Asian leader. No cross-datacenter write latency.
The key problem: write conflicts.
Leader A and Leader B both accept writes to the same row simultaneously:
Leader A: UPDATE products SET price = 100 WHERE id = 5 (user in Europe)
Leader B: UPDATE products SET price = 150 WHERE id = 5 (user in Asia)
Both commit locally. Both replicate to the other.
Now Leader A has price=150 and Leader B has price=100 — which is correct?Conflict Resolution Strategies
Last Write Wins (LWW): Each write is timestamped. The write with the latest timestamp wins.
Simple to implement.
Problem: clock skew between servers means "latest timestamp" is unreliable. Silently discards writes with slightly earlier timestamps — data loss with no error.
Conflict-free Replicated Data Types (CRDTs): Data structures designed so concurrent updates can always be merged without conflicts.
Counters: increment operations commute —
+5and+3in any order always gives+8.Sets: add operations commute. Remove can be tricky (two-phase set: elements can only be removed if they were previously added).
Used in: Riak, Redis CRDT module, collaborative editing systems.
Application-level conflict resolution: On conflict detection, call application code to decide.
def resolve_conflict(local_value, remote_value, local_timestamp, remote_timestamp):
# Application decides: maybe merge, maybe use higher value, maybe flag for human review
if local_value == remote_value:
return local_value # not a real conflict
# Custom logic here
return max(local_value, remote_value) # example: take higher priceAvoid conflicts: Route all writes for the same record to the same leader (e.g. user ID determines which datacenter handles their data). Most conflicts never occur if the routing is consistent.
Architecture 3: Leaderless Replication
No node is designated leader. Any node can accept writes. The system uses quorums to ensure reads and writes overlap — guaranteeing that at least one node in every read has the latest data.
Node 1 Node 2 Node 3
Write ──→ ✅ ✅ ❌ (2 of 3 acknowledge = quorum met)
Read ──→ ✅ ❌ ✅ (2 of 3 respond = quorum met)Used by: Amazon Dynamo (original paper), Cassandra, Riak, Voldemort.
Quorum Reads and Writes
With n replicas, w write confirmations required, and r read replicas queried:
Quorum condition: w + r > n
Common configuration: n=3, w=2, r=2 (2+2 > 3 ✅)
→ At least one node in every read has the latest write
Strong consistency: w=n (wait for all replicas) — slow, less available
High availability: w=1, r=1 — fast but may read stale data
Balanced: w=2, r=2, n=3 — good availability, tolerates one node failure-- Cassandra quorum configuration per query:
INSERT INTO orders (id, amount) VALUES (1, 299.99)
USING CONSISTENCY QUORUM; -- wait for majority of replicas
SELECT * FROM orders WHERE id = 1
USING CONSISTENCY QUORUM; -- read from majority, compare versionsVersion vectors: Each node tracks a version number for every key. When a read returns values from multiple nodes, the client compares version vectors to determine which value is newer — or whether a conflict exists.
Read from 3 nodes:
Node 1: {balance: 900, version: [3, 2, 2]}
Node 2: {balance: 900, version: [3, 2, 2]} ← same version = consistent
Node 3: {balance: 800, version: [2, 2, 2]} ← stale, Node 1's version dominates
Return: balance=900 (from nodes 1 and 2), trigger read repair on Node 3Read Repair and Anti-Entropy
When a read discovers that some replicas have stale data, the system updates them:
Read repair: When a client reads from multiple nodes and detects a stale value, it writes the latest value back to the stale nodes. Happens synchronously during reads.
Anti-entropy process: A background process continuously compares replicas using Merkle trees (hash trees) to identify and sync divergent data, independent of reads.
Merkle tree comparison:
Node 1 tree hash: 0xABC
Node 2 tree hash: 0xABC ← same, no sync needed
Node 3 tree hash: 0xDEF ← different, drill down to find divergent subtree
Node 3 subtree for keys 1000-2000 hash: 0x123
Node 1 subtree for keys 1000-2000 hash: 0x456 ← different, sync this rangeSloppy Quorum and Hinted Handoff
What happens when enough nodes are unavailable to meet the quorum? Strict quorum would reject the write. Sloppy quorum accepts the write on any available nodes:
Normal: write to [Node 1, Node 2, Node 3]
Node 3 is unreachable → sloppy quorum writes to [Node 1, Node 2, Node 4]
Node 4 holds a "hint": "this write belongs to Node 3, deliver when it recovers"
When Node 3 recovers:
Node 4 delivers the hinted data to Node 3 (hinted handoff)Trade-off: Sloppy quorum improves write availability but breaks the w + r > n guarantee temporarily. A read during the outage may not see the latest write if it does not contact Node 4.
Comparison: When to Use Each Model
Single-Leader | Multi-Leader | Leaderless | |
|---|---|---|---|
Write conflicts | Impossible | Possible — need resolution | Possible — need resolution |
Write latency | Leader's latency | Local datacenter latency | Quorum latency |
Read scalability | Add read replicas | Add read replicas | All nodes serve reads |
Multi-datacenter | One DC for writes | Natural fit | Natural fit |
Consistency | Strong (sync) or eventual (async) | Eventual by default | Tunable via quorum |
Operational complexity | Low | High (conflict resolution) | Medium |
Examples | PostgreSQL, MySQL, MongoDB | MySQL Group Replication, CouchDB | Cassandra, DynamoDB, Riak |
In practice:
Single-leader is the right default for most applications — simple, predictable, well-understood failure modes.
Multi-leader makes sense for active-active multi-datacenter setups where write latency across datacenters is unacceptable and conflict resolution is manageable.
Leaderless is the right choice for write-heavy, globally distributed workloads where availability trumps strong consistency (time-series, IoT, user activity events).
🧭 What's Next
Post 13: Replication Lag & Consistency Guarantees — even with single-leader replication, async lag creates subtle bugs; read-your-own-writes, monotonic reads, and consistent prefix reads are the guarantees that prevent them
Related
WAL, Crash Recovery and Durability: How Databases Survive Power Failures
How does a database survive a server crash without losing data? The Write-Ahead Log is the answer. This post explains WAL, ARIES recovery, and what durability actually guarantees.
B-Tree vs LSM-Tree: Why InnoDB and RocksDB Make Different Trade-offs
InnoDB uses B-Tree. RocksDB uses LSM-Tree. The choice determines read speed, write speed, and space amplification. Understanding both unlocks every storage engine decision.
How Databases Store Data: Pages, Heap Files and Row vs Column Storage
Every query you write becomes disk I/O. Understanding how databases store data — pages, heap files, row vs column layout — explains why some queries are fast and others are not.
Comments