Failover, Consensus and High Availability: Raft, Paxos and Automatic Failover

The Failover Problem

The leader crashes. Every write is now blocked. You need a new leader immediately. You have three replicas — which one becomes the new leader?

This sounds simple. Pick the replica that is most up-to-date. Promote it. Done.

The problem: the old leader may not actually be dead. It may be alive but temporarily unreachable — a network partition, a long GC pause, a slow disk. If you promote a new leader while the old one is still accepting writes, you now have two leaders. Both accept writes. Neither knows about the other's writes. Your data diverges. This is the split-brain problem, and it is the reason automatic failover is hard.

Solving split-brain correctly requires consensus — a protocol that lets a group of nodes agree on exactly one value (the new leader) even when some nodes are unreachable. The two most important consensus algorithms are Paxos and Raft.

The Split-Brain Problem

Split-brain occurs when network partitions isolate nodes into two groups, each of which elects its own leader:

Normal state:
  [Leader] ←→ [Replica 1] ←→ [Replica 2]
  All three nodes communicate. One leader.

Network partition:
  [Leader] ✗✗✗ [Replica 1] ←→ [Replica 2]
  Leader is isolated. Replica 1 and 2 cannot reach it.

Without split-brain protection:
  Replica 1 and 2 elect Replica 1 as new leader.
  Original Leader is still alive and accepting writes.
  Now TWO leaders accept writes simultaneously.
  Data diverges. No automated way to reconcile.

With consensus (majority quorum):
  Replica 1 and 2 have 2 of 3 votes = majority → can elect new leader.
  Original Leader has 1 of 3 votes = minority → cannot continue as leader.
  Only ONE leader at any time.

The quorum rule: A leader can only operate if it has acknowledgment from a majority of nodes. A three-node cluster requires 2 votes. A five-node cluster requires 3 votes. The isolated minority cannot form a quorum — it cannot elect a leader or accept writes.

This is why HA clusters have an odd number of nodes: 3, 5, or 7. An even number creates a split-quorum edge case where both halves have exactly half the votes.

Manual vs Automatic Failover

Manual failover: A human operator decides when to promote a replica. Common in organizations with strict change management or where the cost of a wrong promotion (data loss) exceeds the cost of downtime.

Typical manual failover procedure:
1. Confirm the leader is actually dead (not a false alarm)
2. Identify the most up-to-date replica (highest replay LSN)
3. Promote the replica: SELECT pg_promote(); (PostgreSQL)
4. Update connection strings / DNS to point to new leader
5. Reconfigure old replicas to replicate from new leader
6. Investigate and rebuild the old leader before rejoining

Total time: 5–30 minutes depending on automation level

Automatic failover: The cluster detects failure and elects a new leader without human intervention. Required for SLAs of 99.99% (52 minutes downtime/year) or higher.

Typical automatic failover with a consensus protocol:
1. Leader stops sending heartbeats (crash or network partition)
2. After election timeout (e.g. 150–300ms), a replica starts an election
3. Candidate requests votes from all other nodes
4. Majority votes received → new leader elected
5. New leader begins accepting writes
6. Old replicas receive heartbeat from new leader → update their leader pointer

Total time: election timeout + one round trip = typically 300ms–2 seconds

Raft: Consensus Made Understandable

Raft was designed explicitly for understandability. Where Paxos is famously difficult to grasp, Raft decomposes consensus into three independent sub-problems: leader election, log replication, and safety.

Leader Election

Every node in a Raft cluster is in one of three states: follower, candidate, or leader.

State machine:

  ┌──────────────────────────────────────────────────────┐
  │                                                      │
  │   Follower ──────election timeout──────→ Candidate   │
  │      ↑                                      │        │
  │      │                                      │        │
  │   receives              wins majority vote  │        │
  │   heartbeat             (becomes leader)    │        │
  │   from leader                               ↓        │
  │      └────────────────────────────────── Leader      │
  │                                              │        │
  │                         discovers higher     │        │
  │   Candidate ←──────────term leader──────────┘        │
  │      │                 (steps down)                   │
  │      └──────loses election or timeout──→ Follower    │
  └──────────────────────────────────────────────────────┘

Terms: Raft divides time into terms — monotonically increasing integers. Each term begins with an election. If a candidate wins the election, it serves as leader for the rest of the term. If no candidate wins (split vote), the term ends with no leader and a new election begins.

Terms act as a logical clock. A node that receives a message from a higher term immediately updates its term and converts to follower — this is how stale leaders (isolated leaders who missed an election) discover they have been replaced.

Election process:

1. Follower's election timer expires (no heartbeat received)
2. Follower increments its term (term 4 → term 5)
3. Follower becomes Candidate, votes for itself
4. Candidate sends RequestVote RPC to all other nodes:
   {term: 5, candidateId: node2, lastLogIndex: 42, lastLogTerm: 4}

5. Each node responds YES if:
   - It has not already voted in term 5
   - Candidate's log is at least as up-to-date as its own
     (lastLogTerm > my log term, OR same term and lastLogIndex >= my log index)

6. Candidate receives majority → becomes Leader for term 5
7. Leader immediately sends heartbeat to all nodes to assert authority

Why the log up-to-date check matters: Raft only elects leaders that have all committed entries. If a replica is missing recent committed writes, it cannot become leader — those writes would be lost. This is the "election restriction" that ensures Raft never loses committed data.

Log Replication

Once elected, the leader handles all client writes. It appends each write to its log, then replicates the log entry to followers:

Log replication sequence:

Client: write "balance=900"
  │
  ▼
Leader appends to its log:
  Index 43: {term:5, command: "balance=900"}

Leader sends AppendEntries RPC to all followers:
  {term:5, leaderId:node2, prevLogIndex:42, prevLogTerm:4,
   entries: [{index:43, term:5, command:"balance=900"}],
   leaderCommit: 42}

Followers append to their logs, respond OK.

When majority has responded OK:
  Leader marks entry 43 as COMMITTED
  Leader applies entry 43 to its state machine
  Leader sends response to client: SUCCESS
  Leader includes commitIndex in next heartbeat
  Followers apply committed entries to their state machines

Safety guarantee: An entry is committed only after a majority has written it to their logs. If the leader crashes, any node with a majority can still reconstruct all committed entries.

Handling Leader Crashes Mid-Replication

Scenario: Leader sends entry 43 to some followers, then crashes before commit

Node 1 (new leader): has entry 43  ✅
Node 2 (follower):   has entry 43  ✅
Node 3 (follower):   missing 43   ❌

Node 1 wins election (has entry 43, matches majority).
Node 1 becomes new leader.
Node 1 replicates entry 43 to Node 3 as part of normal log replication.
Entry 43 is now on majority → committed.
No data loss.

What if the old leader had an entry that was NOT on a majority?

Node 1 (old leader): has entry 43 (not replicated before crash)
Node 2:              does NOT have entry 43
Node 3:              does NOT have entry 43

Node 2 wins election (entry 43 was never on majority — not committed).
Node 2 becomes new leader.
Node 2 overwrites Node 1's entry 43 with its own entries when Node 1 rejoins.
Entry 43 is lost — BUT it was never committed, so no committed data is lost.

This is correct behavior: uncommitted entries may be lost on leader crash, but committed entries are always preserved.

Paxos: The Original Consensus Algorithm

Paxos was described by Leslie Lamport in 1989 (published 1998). It is provably correct and forms the theoretical basis for most consensus implementations. It is also famously difficult to understand and implement.

Single-decree Paxos (the basic form) reaches agreement on a single value in two phases:

Phase 1 — Prepare:
  Proposer sends Prepare(n) to majority of acceptors
  n = proposal number (must be unique and higher than any seen)
  
  Acceptors respond with:
  - Promise not to accept proposals with number < n
  - The highest-numbered proposal they have already accepted (if any)

Phase 2 — Accept:
  If proposer receives promises from majority:
    If any acceptor returned an already-accepted value v,
      proposer must use v (cannot choose its own value)
    Otherwise proposer chooses its own value
  
  Proposer sends Accept(n, v) to majority
  Acceptors accept unless they promised a higher number
  
  Value is chosen when majority accepts it

Multi-Paxos extends single-decree Paxos to a replicated log (the practical need). A stable leader skips Phase 1 for subsequent proposals — only doing Phase 2. This optimization makes Multi-Paxos efficient for continuous log replication.

Paxos vs Raft:

	Raft	Paxos (Multi-Paxos)
Understandability	Designed for clarity	Notoriously difficult
Leader election	Explicit, term-based	Implicit (Phase 1 doubles as election)
Log replication	Explicit leader-to-follower	Proposer drives each entry
Safety proof	Simpler to verify	More complex
Implementation complexity	Lower	Higher
Real-world use	etcd, CockroachDB, TiKV, Consul	Google Chubby, Apache Zookeeper (ZAB, a Paxos variant), Google Spanner

In practice, most new systems choose Raft. Paxos implementations tend to be in older or Google-scale systems where the engineering resources to implement it correctly exist.

Tools That Implement Consensus

You almost never implement Raft or Paxos yourself. You use a system that implements it:

etcd: Raft-based distributed key-value store. Used as the configuration backend for Kubernetes, and as the leader election and distributed lock service for many databases. PostgreSQL Patroni uses etcd for HA cluster coordination.

Consul (HashiCorp): Raft-based service mesh and key-value store. Used for leader election in many database HA setups.

Zookeeper: ZAB protocol (Paxos variant). Used for coordination in Kafka, HBase, and older Hadoop ecosystem databases.

PostgreSQL Patroni: Uses etcd/Consul/ZooKeeper as the consensus backend. Patroni itself handles PostgreSQL-specific failover logic (promoting a replica, reconfiguring replication) while delegating consensus to etcd.

# Patroni configuration (patroni.yml)
etcd:
  hosts: etcd1:2379,etcd2:2379,etcd3:2379

bootstrap:
  dcs:
    ttl: 30                    # leader lease TTL in seconds
    loop_wait: 10              # seconds between Patroni checks
    retry_timeout: 10
    maximum_lag_on_failover: 1048576  # 1 MB: only elect replicas within 1MB of leader

postgresql:
  parameters:
    synchronous_commit: "on"
    synchronous_standby_names: "ANY 1 (*)"  # semi-sync: wait for 1 replica

High Availability SLA Mathematics

"High availability" means different things at different nines:

SLA        Downtime per year    Downtime per month    Typical requirement
─────────────────────────────────────────────────────────────────────────
99%        87.6 hours           7.3 hours             Single server, no HA
99.9%      8.76 hours           43.8 minutes          Basic HA (manual failover)
99.99%     52.6 minutes         4.4 minutes           Automatic failover < 1 min
99.999%    5.26 minutes         26 seconds            Multi-region, zero-downtime
99.9999%   31.5 seconds         2.6 seconds           Extremely rare, very expensive

Achieving 99.99%: Automatic failover must complete in under ~4 minutes per month on average. With Raft-based failover completing in 300ms–2 seconds, this is achievable with a 3-node cluster, provided:

Election timeout is tuned (150–300ms)
Network between nodes is reliable (< 10ms latency)
Leader health checks are frequent (every 100–500ms)
Connection pool retries on failover (application-level)

Achieving 99.999%: Requires zero-downtime operations — no single point where all writes are blocked. Typically requires:

Multi-region deployment with active-active replication
Automatic DNS failover or Anycast routing
Connection pooler that handles mid-transaction failover (PgBouncer, ProxySQL)
Database that supports online schema changes (no table-level locks)

Failure Scenarios and What Happens

Scenario 1: Leader crashes (no network partition)
  → Replicas stop receiving heartbeats
  → Election timeout fires (150–300ms)
  → Most up-to-date replica wins election
  → New leader elected in ~500ms–2s
  → Writes resume
  → Impact: brief write unavailability

Scenario 2: Leader network partition (leader alive, unreachable)
  → Same as Scenario 1 for the majority partition
  → Old leader: still alive, cannot reach majority → steps down
     (Raft: old leader's term is lower than new leader's → rejects own writes)
  → No split-brain
  → Impact: brief write unavailability during election

Scenario 3: Replica crashes
  → Leader continues with remaining replicas
  → If sync replication and the crashed replica was the sync standby:
     PostgreSQL falls back to async or blocks (depending on config)
  → Impact: potential write latency increase if sync replica was required

Scenario 4: All replicas crash, leader survives
  → Leader has no replicas to replicate to
  → Writes continue (async mode) or block (sync mode requiring N replicas)
  → High data loss risk if leader then crashes before replicas recover
  → Impact: durability degraded

Scenario 5: Majority of nodes crash (e.g. 2 of 3)
  → Remaining minority cannot form quorum
  → NO writes accepted (safety over availability — correct behavior)
  → Impact: full write unavailability until majority recovers

Chapter 4 Complete

You have finished the Replication & High Availability chapter:

✅ Post 12: Database Replication — Single-Leader, Multi-Leader & Leaderless
✅ Post 13: Replication Lag & Consistency Guarantees
✅ Post 14: Failover, Consensus & High Availability

Chapter 5 moves from keeping one database cluster available to the harder problem: scaling a database beyond what a single cluster can handle — partitioning, sharding, and the distributed systems trade-offs that come with it.

🧭 What's Next — Chapter 5: Scaling & Distributed Systems

Post 15: Vertical vs Horizontal Scaling — most databases start vertical and die horizontal; knowing when to stop scaling up and start scaling out is one of the most valuable senior engineering skills

The Failover Problem

The Split-Brain Problem

Manual vs Automatic Failover

Raft: Consensus Made Understandable

Leader Election

Log Replication

Handling Leader Crashes Mid-Replication

Paxos: The Original Consensus Algorithm

Tools That Implement Consensus

High Availability SLA Mathematics

Failure Scenarios and What Happens

Chapter 4 Complete

🧭 What's Next — Chapter 5: Scaling & Distributed Systems

Related

Replication Lag and Consistency: Read-Your-Own-Writes, Monotonic Reads and More

Database Replication: Single-Leader, Multi-Leader and Leaderless Explained

WAL, Crash Recovery and Durability: How Databases Survive Power Failures

Leave a comment

Comments