Failover, Consensus and High Availability: Raft, Paxos and Automatic Failover
The Failover Problem
The leader crashes. Every write is now blocked. You need a new leader immediately. You have three replicas — which one becomes the new leader?
This sounds simple. Pick the replica that is most up-to-date. Promote it. Done.
The problem: the old leader may not actually be dead. It may be alive but temporarily unreachable — a network partition, a long GC pause, a slow disk. If you promote a new leader while the old one is still accepting writes, you now have two leaders. Both accept writes. Neither knows about the other's writes. Your data diverges. This is the split-brain problem, and it is the reason automatic failover is hard.
Solving split-brain correctly requires consensus — a protocol that lets a group of nodes agree on exactly one value (the new leader) even when some nodes are unreachable. The two most important consensus algorithms are Paxos and Raft.
The Split-Brain Problem
Split-brain occurs when network partitions isolate nodes into two groups, each of which elects its own leader:
Normal state:
[Leader] ←→ [Replica 1] ←→ [Replica 2]
All three nodes communicate. One leader.
Network partition:
[Leader] ✗✗✗ [Replica 1] ←→ [Replica 2]
Leader is isolated. Replica 1 and 2 cannot reach it.
Without split-brain protection:
Replica 1 and 2 elect Replica 1 as new leader.
Original Leader is still alive and accepting writes.
Now TWO leaders accept writes simultaneously.
Data diverges. No automated way to reconcile.
With consensus (majority quorum):
Replica 1 and 2 have 2 of 3 votes = majority → can elect new leader.
Original Leader has 1 of 3 votes = minority → cannot continue as leader.
Only ONE leader at any time.The quorum rule: A leader can only operate if it has acknowledgment from a majority of nodes. A three-node cluster requires 2 votes. A five-node cluster requires 3 votes. The isolated minority cannot form a quorum — it cannot elect a leader or accept writes.
This is why HA clusters have an odd number of nodes: 3, 5, or 7. An even number creates a split-quorum edge case where both halves have exactly half the votes.
Manual vs Automatic Failover
Manual failover: A human operator decides when to promote a replica. Common in organizations with strict change management or where the cost of a wrong promotion (data loss) exceeds the cost of downtime.
Typical manual failover procedure:
1. Confirm the leader is actually dead (not a false alarm)
2. Identify the most up-to-date replica (highest replay LSN)
3. Promote the replica: SELECT pg_promote(); (PostgreSQL)
4. Update connection strings / DNS to point to new leader
5. Reconfigure old replicas to replicate from new leader
6. Investigate and rebuild the old leader before rejoining
Total time: 5–30 minutes depending on automation levelAutomatic failover: The cluster detects failure and elects a new leader without human intervention. Required for SLAs of 99.99% (52 minutes downtime/year) or higher.
Typical automatic failover with a consensus protocol:
1. Leader stops sending heartbeats (crash or network partition)
2. After election timeout (e.g. 150–300ms), a replica starts an election
3. Candidate requests votes from all other nodes
4. Majority votes received → new leader elected
5. New leader begins accepting writes
6. Old replicas receive heartbeat from new leader → update their leader pointer
Total time: election timeout + one round trip = typically 300ms–2 secondsRaft: Consensus Made Understandable
Raft was designed explicitly for understandability. Where Paxos is famously difficult to grasp, Raft decomposes consensus into three independent sub-problems: leader election, log replication, and safety.
Leader Election
Every node in a Raft cluster is in one of three states: follower, candidate, or leader.
State machine:
┌──────────────────────────────────────────────────────┐
│ │
│ Follower ──────election timeout──────→ Candidate │
│ ↑ │ │
│ │ │ │
│ receives wins majority vote │ │
│ heartbeat (becomes leader) │ │
│ from leader ↓ │
│ └────────────────────────────────── Leader │
│ │ │
│ discovers higher │ │
│ Candidate ←──────────term leader──────────┘ │
│ │ (steps down) │
│ └──────loses election or timeout──→ Follower │
└──────────────────────────────────────────────────────┘Terms: Raft divides time into terms — monotonically increasing integers. Each term begins with an election. If a candidate wins the election, it serves as leader for the rest of the term. If no candidate wins (split vote), the term ends with no leader and a new election begins.
Terms act as a logical clock. A node that receives a message from a higher term immediately updates its term and converts to follower — this is how stale leaders (isolated leaders who missed an election) discover they have been replaced.
Election process:
1. Follower's election timer expires (no heartbeat received)
2. Follower increments its term (term 4 → term 5)
3. Follower becomes Candidate, votes for itself
4. Candidate sends RequestVote RPC to all other nodes:
{term: 5, candidateId: node2, lastLogIndex: 42, lastLogTerm: 4}
5. Each node responds YES if:
- It has not already voted in term 5
- Candidate's log is at least as up-to-date as its own
(lastLogTerm > my log term, OR same term and lastLogIndex >= my log index)
6. Candidate receives majority → becomes Leader for term 5
7. Leader immediately sends heartbeat to all nodes to assert authorityWhy the log up-to-date check matters: Raft only elects leaders that have all committed entries. If a replica is missing recent committed writes, it cannot become leader — those writes would be lost. This is the "election restriction" that ensures Raft never loses committed data.
Log Replication
Once elected, the leader handles all client writes. It appends each write to its log, then replicates the log entry to followers:
Log replication sequence:
Client: write "balance=900"
│
▼
Leader appends to its log:
Index 43: {term:5, command: "balance=900"}
Leader sends AppendEntries RPC to all followers:
{term:5, leaderId:node2, prevLogIndex:42, prevLogTerm:4,
entries: [{index:43, term:5, command:"balance=900"}],
leaderCommit: 42}
Followers append to their logs, respond OK.
When majority has responded OK:
Leader marks entry 43 as COMMITTED
Leader applies entry 43 to its state machine
Leader sends response to client: SUCCESS
Leader includes commitIndex in next heartbeat
Followers apply committed entries to their state machinesSafety guarantee: An entry is committed only after a majority has written it to their logs. If the leader crashes, any node with a majority can still reconstruct all committed entries.
Handling Leader Crashes Mid-Replication
Scenario: Leader sends entry 43 to some followers, then crashes before commit
Node 1 (new leader): has entry 43 ✅
Node 2 (follower): has entry 43 ✅
Node 3 (follower): missing 43 ❌
Node 1 wins election (has entry 43, matches majority).
Node 1 becomes new leader.
Node 1 replicates entry 43 to Node 3 as part of normal log replication.
Entry 43 is now on majority → committed.
No data loss.What if the old leader had an entry that was NOT on a majority?
Node 1 (old leader): has entry 43 (not replicated before crash)
Node 2: does NOT have entry 43
Node 3: does NOT have entry 43
Node 2 wins election (entry 43 was never on majority — not committed).
Node 2 becomes new leader.
Node 2 overwrites Node 1's entry 43 with its own entries when Node 1 rejoins.
Entry 43 is lost — BUT it was never committed, so no committed data is lost.This is correct behavior: uncommitted entries may be lost on leader crash, but committed entries are always preserved.
Paxos: The Original Consensus Algorithm
Paxos was described by Leslie Lamport in 1989 (published 1998). It is provably correct and forms the theoretical basis for most consensus implementations. It is also famously difficult to understand and implement.
Single-decree Paxos (the basic form) reaches agreement on a single value in two phases:
Phase 1 — Prepare:
Proposer sends Prepare(n) to majority of acceptors
n = proposal number (must be unique and higher than any seen)
Acceptors respond with:
- Promise not to accept proposals with number < n
- The highest-numbered proposal they have already accepted (if any)
Phase 2 — Accept:
If proposer receives promises from majority:
If any acceptor returned an already-accepted value v,
proposer must use v (cannot choose its own value)
Otherwise proposer chooses its own value
Proposer sends Accept(n, v) to majority
Acceptors accept unless they promised a higher number
Value is chosen when majority accepts itMulti-Paxos extends single-decree Paxos to a replicated log (the practical need). A stable leader skips Phase 1 for subsequent proposals — only doing Phase 2. This optimization makes Multi-Paxos efficient for continuous log replication.
Paxos vs Raft:
Raft | Paxos (Multi-Paxos) | |
|---|---|---|
Understandability | Designed for clarity | Notoriously difficult |
Leader election | Explicit, term-based | Implicit (Phase 1 doubles as election) |
Log replication | Explicit leader-to-follower | Proposer drives each entry |
Safety proof | Simpler to verify | More complex |
Implementation complexity | Lower | Higher |
Real-world use | etcd, CockroachDB, TiKV, Consul | Google Chubby, Apache Zookeeper (ZAB, a Paxos variant), Google Spanner |
In practice, most new systems choose Raft. Paxos implementations tend to be in older or Google-scale systems where the engineering resources to implement it correctly exist.
Tools That Implement Consensus
You almost never implement Raft or Paxos yourself. You use a system that implements it:
etcd: Raft-based distributed key-value store. Used as the configuration backend for Kubernetes, and as the leader election and distributed lock service for many databases. PostgreSQL Patroni uses etcd for HA cluster coordination.
Consul (HashiCorp): Raft-based service mesh and key-value store. Used for leader election in many database HA setups.
Zookeeper: ZAB protocol (Paxos variant). Used for coordination in Kafka, HBase, and older Hadoop ecosystem databases.
PostgreSQL Patroni: Uses etcd/Consul/ZooKeeper as the consensus backend. Patroni itself handles PostgreSQL-specific failover logic (promoting a replica, reconfiguring replication) while delegating consensus to etcd.
# Patroni configuration (patroni.yml)
etcd:
hosts: etcd1:2379,etcd2:2379,etcd3:2379
bootstrap:
dcs:
ttl: 30 # leader lease TTL in seconds
loop_wait: 10 # seconds between Patroni checks
retry_timeout: 10
maximum_lag_on_failover: 1048576 # 1 MB: only elect replicas within 1MB of leader
postgresql:
parameters:
synchronous_commit: "on"
synchronous_standby_names: "ANY 1 (*)" # semi-sync: wait for 1 replicaHigh Availability SLA Mathematics
"High availability" means different things at different nines:
SLA Downtime per year Downtime per month Typical requirement
─────────────────────────────────────────────────────────────────────────
99% 87.6 hours 7.3 hours Single server, no HA
99.9% 8.76 hours 43.8 minutes Basic HA (manual failover)
99.99% 52.6 minutes 4.4 minutes Automatic failover < 1 min
99.999% 5.26 minutes 26 seconds Multi-region, zero-downtime
99.9999% 31.5 seconds 2.6 seconds Extremely rare, very expensiveAchieving 99.99%: Automatic failover must complete in under ~4 minutes per month on average. With Raft-based failover completing in 300ms–2 seconds, this is achievable with a 3-node cluster, provided:
Election timeout is tuned (150–300ms)
Network between nodes is reliable (< 10ms latency)
Leader health checks are frequent (every 100–500ms)
Connection pool retries on failover (application-level)
Achieving 99.999%: Requires zero-downtime operations — no single point where all writes are blocked. Typically requires:
Multi-region deployment with active-active replication
Automatic DNS failover or Anycast routing
Connection pooler that handles mid-transaction failover (PgBouncer, ProxySQL)
Database that supports online schema changes (no table-level locks)
Failure Scenarios and What Happens
Scenario 1: Leader crashes (no network partition)
→ Replicas stop receiving heartbeats
→ Election timeout fires (150–300ms)
→ Most up-to-date replica wins election
→ New leader elected in ~500ms–2s
→ Writes resume
→ Impact: brief write unavailability
Scenario 2: Leader network partition (leader alive, unreachable)
→ Same as Scenario 1 for the majority partition
→ Old leader: still alive, cannot reach majority → steps down
(Raft: old leader's term is lower than new leader's → rejects own writes)
→ No split-brain
→ Impact: brief write unavailability during election
Scenario 3: Replica crashes
→ Leader continues with remaining replicas
→ If sync replication and the crashed replica was the sync standby:
PostgreSQL falls back to async or blocks (depending on config)
→ Impact: potential write latency increase if sync replica was required
Scenario 4: All replicas crash, leader survives
→ Leader has no replicas to replicate to
→ Writes continue (async mode) or block (sync mode requiring N replicas)
→ High data loss risk if leader then crashes before replicas recover
→ Impact: durability degraded
Scenario 5: Majority of nodes crash (e.g. 2 of 3)
→ Remaining minority cannot form quorum
→ NO writes accepted (safety over availability — correct behavior)
→ Impact: full write unavailability until majority recoversChapter 4 Complete
You have finished the Replication & High Availability chapter:
✅ Post 12: Database Replication — Single-Leader, Multi-Leader & Leaderless
✅ Post 13: Replication Lag & Consistency Guarantees
✅ Post 14: Failover, Consensus & High Availability
Chapter 5 moves from keeping one database cluster available to the harder problem: scaling a database beyond what a single cluster can handle — partitioning, sharding, and the distributed systems trade-offs that come with it.
🧭 What's Next — Chapter 5: Scaling & Distributed Systems
Post 15: Vertical vs Horizontal Scaling — most databases start vertical and die horizontal; knowing when to stop scaling up and start scaling out is one of the most valuable senior engineering skills
Related
Replication Lag and Consistency: Read-Your-Own-Writes, Monotonic Reads and More
Replication lag creates bugs that only appear under load. Read-your-own-writes, monotonic reads, and consistent prefix reads are the guarantees that prevent them.
Database Replication: Single-Leader, Multi-Leader and Leaderless Explained
Replication keeps databases available when servers fail. Single-leader, multi-leader, and leaderless each trade consistency differently — this post explains all three clearly.
WAL, Crash Recovery and Durability: How Databases Survive Power Failures
How does a database survive a server crash without losing data? The Write-Ahead Log is the answer. This post explains WAL, ARIES recovery, and what durability actually guarantees.
Comments