Database Replication: Single-Leader, Multi-Leader and Leaderless Explained

Why Replication Exists

A single database server is a single point of failure. It will eventually crash, run out of disk, need a reboot for OS updates, or be in the wrong datacenter when the network partitions. Replication solves this by maintaining copies of the data on multiple machines.

But replication is not free. Every replication model makes trade-offs between:

Consistency — do all replicas see the same data at the same time?
Availability — can the system accept reads and writes if some nodes are down?
Latency — how much does replication add to write latency?
Complexity — how hard is it to handle conflicts when replicas diverge?

There are three fundamental replication architectures. Each answers these trade-offs differently.

Synchronous vs Asynchronous Replication

Before the three architectures, one foundational choice cuts across all of them: when does a write complete?

Synchronous replication: The primary waits for at least one replica to confirm it has received and written the data before returning success to the client.

Client         Primary         Replica
  │               │               │
  │──── write ───→│               │
  │               │──── write ───→│
  │               │←── ack ───────│
  │←── success ───│               │
  │                               │
  [Both primary and replica have the data before client receives success]

Guarantee: If the primary fails immediately after the write, the replica has the data — no data loss.
Cost: Write latency = primary write time + network round-trip to replica + replica write time. If the replica is slow or unavailable, writes block.

Asynchronous replication: The primary writes locally and returns success immediately. Replication happens in the background.

Client         Primary         Replica
  │               │               │
  │──── write ───→│               │
  │←── success ───│               │
  │               │──── write ───→│  (happens asynchronously, after success returned)
  │               │               │
  [Client receives success before replica has the data]

Guarantee: Lowest write latency. Primary never blocks waiting for replicas.
Cost: If the primary fails before replication completes, that data is lost. Replicas may lag seconds or minutes behind.

Semi-synchronous (common in practice): Wait for confirmation from at least one replica before returning success. If that replica is down, fall back to async. This balances durability and availability.

-- MySQL semi-synchronous replication
INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';
SET GLOBAL rpl_semi_sync_master_enabled = 1;
SET GLOBAL rpl_semi_sync_master_timeout = 1000; -- 1 second timeout before falling back to async

-- PostgreSQL synchronous replication
-- postgresql.conf:
synchronous_standby_names = 'replica1'  -- wait for this replica
synchronous_commit = on                  -- default: wait for WAL flush on replica
-- synchronous_commit = remote_write    -- wait for replica to receive WAL (faster, slight risk)
-- synchronous_commit = local           -- only wait for local WAL flush (async replication)

Architecture 1: Single-Leader Replication

One node is designated the leader (also called primary or master). All writes go to the leader. The leader replicates changes to followers (replicas, standbys). Reads can go to either the leader or followers.

           ┌─────────────────────────────────────┐
           │            SINGLE LEADER             │
           │                                      │
  Writes──→│  Leader (Primary)                   │
  Reads ──→│  - Accepts all writes               │
           │  - Replicates to followers           │
           └─────────────────────────────────────┘
                    │                │
          WAL/binlog stream    WAL/binlog stream
                    │                │
           ┌────────┴───┐    ┌───────┴────┐
           │ Follower 1 │    │ Follower 2 │
           │ Read-only  │    │ Read-only  │
           └────────────┘    └────────────┘

Used by: PostgreSQL streaming replication, MySQL replication, MongoDB replica sets, Kafka (within a partition).

How Replication Log Works

There are three approaches to what the leader sends to followers:

Statement-based replication: The leader logs every SQL statement and sends it to followers, which re-execute it.

Leader sends: UPDATE accounts SET balance = balance * 1.1 WHERE status = 'active'
Follower re-executes the same SQL

Problem: non-deterministic functions (NOW(), RAND(), UUID()) produce different results on each execution. A statement with WHERE created_at < NOW() may match different rows on the leader vs a lagging follower.

WAL-based (physical) replication (PostgreSQL): The leader sends its WAL bytes directly. Followers apply the exact same page changes.

Leader sends: WAL record "page (42, 3): bytes 120-124 changed from 0x... to 0x..."
Follower applies exact byte change to its copy of page 42

Advantage: perfectly deterministic. Disadvantage: tightly coupled to storage engine internals — followers must use the same PostgreSQL version and page format.

Row-based (logical) replication: The leader logs the before and after state of each row change.

Leader sends: UPDATE row id=1: {balance: 1000} → {balance: 1100}
Follower applies the row change without re-executing SQL

Advantage: deterministic, version-independent, can replicate to different database engines. Disadvantage: large batch updates generate large replication streams (one row event per row changed).

Adding a New Follower Without Downtime

1. Take a consistent snapshot of the leader (pg_basebackup in PostgreSQL)
   - This snapshot has a known WAL position (LSN)
2. Copy the snapshot to the new follower
3. Follower connects to leader and requests WAL from the snapshot LSN
4. Follower replays WAL to catch up to the present
5. Follower is now in sync — available for reads

This entire process happens without locking or downtime on the leader.

Architecture 2: Multi-Leader Replication

Multiple nodes can accept writes. Each leader replicates its changes to all other leaders and all followers.

Datacenter A                     Datacenter B
┌────────────────────┐           ┌────────────────────┐
│   Leader A         │◄─────────►│   Leader B         │
│   (accepts writes) │           │   (accepts writes) │
└─────────┬──────────┘           └──────────┬─────────┘
          │                                  │
    Followers A                        Followers B

Used by: MySQL Group Replication, CouchDB, Google Docs (operational transformation), some multi-region PostgreSQL setups (e.g. BDR).

The key benefit: Writes can be accepted in multiple datacenters simultaneously. Users in Europe write to the European leader; users in Asia write to the Asian leader. No cross-datacenter write latency.

The key problem: write conflicts.

Leader A and Leader B both accept writes to the same row simultaneously:

Leader A: UPDATE products SET price = 100 WHERE id = 5  (user in Europe)
Leader B: UPDATE products SET price = 150 WHERE id = 5  (user in Asia)

Both commit locally. Both replicate to the other.
Now Leader A has price=150 and Leader B has price=100 — which is correct?

Conflict Resolution Strategies

Last Write Wins (LWW): Each write is timestamped. The write with the latest timestamp wins.

Simple to implement.
Problem: clock skew between servers means "latest timestamp" is unreliable. Silently discards writes with slightly earlier timestamps — data loss with no error.

Conflict-free Replicated Data Types (CRDTs): Data structures designed so concurrent updates can always be merged without conflicts.

Counters: increment operations commute — +5 and +3 in any order always gives +8.
Sets: add operations commute. Remove can be tricky (two-phase set: elements can only be removed if they were previously added).
Used in: Riak, Redis CRDT module, collaborative editing systems.

Application-level conflict resolution: On conflict detection, call application code to decide.

def resolve_conflict(local_value, remote_value, local_timestamp, remote_timestamp):
    # Application decides: maybe merge, maybe use higher value, maybe flag for human review
    if local_value == remote_value:
        return local_value  # not a real conflict
    # Custom logic here
    return max(local_value, remote_value)  # example: take higher price

Avoid conflicts: Route all writes for the same record to the same leader (e.g. user ID determines which datacenter handles their data). Most conflicts never occur if the routing is consistent.

Architecture 3: Leaderless Replication

No node is designated leader. Any node can accept writes. The system uses quorums to ensure reads and writes overlap — guaranteeing that at least one node in every read has the latest data.

         Node 1    Node 2    Node 3
Write ──→  ✅        ✅        ❌    (2 of 3 acknowledge = quorum met)
Read  ──→  ✅        ❌        ✅    (2 of 3 respond = quorum met)

Used by: Amazon Dynamo (original paper), Cassandra, Riak, Voldemort.

Quorum Reads and Writes

With n replicas, w write confirmations required, and r read replicas queried:

Quorum condition: w + r > n

Common configuration: n=3, w=2, r=2 (2+2 > 3 ✅)
→ At least one node in every read has the latest write

Strong consistency: w=n (wait for all replicas) — slow, less available
High availability: w=1, r=1 — fast but may read stale data
Balanced: w=2, r=2, n=3 — good availability, tolerates one node failure

-- Cassandra quorum configuration per query:
INSERT INTO orders (id, amount) VALUES (1, 299.99)
USING CONSISTENCY QUORUM;  -- wait for majority of replicas

SELECT * FROM orders WHERE id = 1
USING CONSISTENCY QUORUM;  -- read from majority, compare versions

Version vectors: Each node tracks a version number for every key. When a read returns values from multiple nodes, the client compares version vectors to determine which value is newer — or whether a conflict exists.

Read from 3 nodes:
  Node 1: {balance: 900, version: [3, 2, 2]}
  Node 2: {balance: 900, version: [3, 2, 2]}  ← same version = consistent
  Node 3: {balance: 800, version: [2, 2, 2]}  ← stale, Node 1's version dominates

Return: balance=900 (from nodes 1 and 2), trigger read repair on Node 3

Read Repair and Anti-Entropy

When a read discovers that some replicas have stale data, the system updates them:

Read repair: When a client reads from multiple nodes and detects a stale value, it writes the latest value back to the stale nodes. Happens synchronously during reads.

Anti-entropy process: A background process continuously compares replicas using Merkle trees (hash trees) to identify and sync divergent data, independent of reads.

Merkle tree comparison:
  Node 1 tree hash: 0xABC
  Node 2 tree hash: 0xABC   ← same, no sync needed
  Node 3 tree hash: 0xDEF   ← different, drill down to find divergent subtree
    Node 3 subtree for keys 1000-2000 hash: 0x123
    Node 1 subtree for keys 1000-2000 hash: 0x456 ← different, sync this range

Sloppy Quorum and Hinted Handoff

What happens when enough nodes are unavailable to meet the quorum? Strict quorum would reject the write. Sloppy quorum accepts the write on any available nodes:

Normal: write to [Node 1, Node 2, Node 3]
Node 3 is unreachable → sloppy quorum writes to [Node 1, Node 2, Node 4]
Node 4 holds a "hint": "this write belongs to Node 3, deliver when it recovers"

When Node 3 recovers:
Node 4 delivers the hinted data to Node 3 (hinted handoff)

Trade-off: Sloppy quorum improves write availability but breaks the w + r > n guarantee temporarily. A read during the outage may not see the latest write if it does not contact Node 4.

Comparison: When to Use Each Model

	Single-Leader	Multi-Leader	Leaderless
Write conflicts	Impossible	Possible — need resolution	Possible — need resolution
Write latency	Leader's latency	Local datacenter latency	Quorum latency
Read scalability	Add read replicas	Add read replicas	All nodes serve reads
Multi-datacenter	One DC for writes	Natural fit	Natural fit
Consistency	Strong (sync) or eventual (async)	Eventual by default	Tunable via quorum
Operational complexity	Low	High (conflict resolution)	Medium
Examples	PostgreSQL, MySQL, MongoDB	MySQL Group Replication, CouchDB	Cassandra, DynamoDB, Riak

In practice:

Single-leader is the right default for most applications — simple, predictable, well-understood failure modes.
Multi-leader makes sense for active-active multi-datacenter setups where write latency across datacenters is unacceptable and conflict resolution is manageable.
Leaderless is the right choice for write-heavy, globally distributed workloads where availability trumps strong consistency (time-series, IoT, user activity events).

🧭 What's Next

Post 13: Replication Lag & Consistency Guarantees — even with single-leader replication, async lag creates subtle bugs; read-your-own-writes, monotonic reads, and consistent prefix reads are the guarantees that prevent them

Why Replication Exists

Synchronous vs Asynchronous Replication

Architecture 1: Single-Leader Replication

How Replication Log Works

Adding a New Follower Without Downtime

Architecture 2: Multi-Leader Replication

Conflict Resolution Strategies

Architecture 3: Leaderless Replication

Quorum Reads and Writes

Read Repair and Anti-Entropy

Sloppy Quorum and Hinted Handoff

Comparison: When to Use Each Model

🧭 What's Next

Related

WAL, Crash Recovery and Durability: How Databases Survive Power Failures

B-Tree vs LSM-Tree: Why InnoDB and RocksDB Make Different Trade-offs

How Databases Store Data: Pages, Heap Files and Row vs Column Storage

Leave a comment

Comments