WAL, Crash Recovery and Durability: How Databases Survive Power Failures
The Problem: Writes Are Not Atomic at the Hardware Level
In Post 05 we said that committed transactions are durable — they survive crashes. In Post 09 we said that writes go to the buffer pool first and are flushed to disk asynchronously. These two facts seem to contradict each other.
If a committed transaction's data lives in the buffer pool (memory) and has not been flushed to disk yet — and the server crashes — how does the database survive?
The answer is the Write-Ahead Log (WAL). Understanding WAL means understanding the mechanism behind the D in ACID, how databases achieve crash recovery in seconds not hours, and why certain configuration choices (fsync=off, large checkpoints) are dangerous rather than just risky.
The Core Principle: Write the Log Before the Data
The WAL rule is simple and absolute:
Before any change is applied to a data page, a log record describing that change must be written to the WAL and flushed to durable storage.
This means:
Data pages (heap pages, B-Tree pages) can be in memory without being on disk
The WAL record for every change to those pages must be on disk before the change is even applied
On crash recovery, the database replays the WAL to reconstruct any changes that did not make it to the data pages
Timeline of a committed write:
T=1 Client sends UPDATE accounts SET balance=900 WHERE id=1
T=2 Database creates WAL record: "Update page 42, slot 3: balance 1000→900, txn_id=500"
T=3 WAL record written to WAL buffer (memory)
T=4 WAL buffer flushed to WAL file on disk (fsync) ← durable
T=5 Data page 42 updated in buffer pool (memory) ← not yet on disk
T=6 Database sends "COMMIT OK" to client
T=7 Server crashes
T=8 Server restarts
T=9 Database reads WAL: sees committed txn_id=500 with un-applied change
T=10 Database re-applies the change to page 42 from WAL
T=11 Data is consistent — balance is 900 as committedThe key: at T=4, the WAL record is on durable storage. Even though the data page (T=5) never made it to disk before the crash (T=7), the WAL has everything needed to reconstruct it.
WAL Structure
The WAL is a sequential, append-only log file on disk. Each record describes one change to one page.
WAL File Structure:
┌─────────────────────────────────────────────────────────┐
│ WAL Record 1 │
│ LSN: 0/1000 (Log Sequence Number — monotonically inc) │
│ TxnID: 498 │
│ Type: INSERT │
│ Target: page (0, 5), slot 3 │
│ Before image: (empty) │
│ After image: id=1, name='Alice', balance=1000 │
├─────────────────────────────────────────────────────────┤
│ WAL Record 2 │
│ LSN: 0/1200 │
│ TxnID: 499 │
│ Type: UPDATE │
│ Target: page (0, 5), slot 3 │
│ Before image: balance=1000 │
│ After image: balance=900 │
├─────────────────────────────────────────────────────────┤
│ WAL Record 3 │
│ LSN: 0/1400 │
│ TxnID: 499 │
│ Type: COMMIT │
├─────────────────────────────────────────────────────────┤
│ ... │
└─────────────────────────────────────────────────────────┘LSN (Log Sequence Number): A monotonically increasing identifier for each WAL record. Every data page header stores the LSN of the last WAL record that modified it. During recovery, the database compares the page's LSN against the WAL to determine what needs to be replayed.
Every data page carries the LSN of the last change that touched it:
Page 42 header:
page_lsn = 0/1200 ← last WAL record applied to this page
WAL has records up to LSN 0/2400
→ Page 42 is behind — records from 0/1200 to 0/2400 need to be replayed for page 42Checkpoints: Limiting Recovery Time
Without checkpoints, crash recovery would require replaying the entire WAL history — potentially terabytes of log going back months. Checkpoints bound the recovery time.
A checkpoint forces all dirty pages in the buffer pool to be flushed to disk, then writes a checkpoint record to the WAL:
Checkpoint sequence:
1. Write "BEGIN CHECKPOINT" record to WAL
2. Flush all dirty buffer pool pages to disk (pages now match WAL)
3. Write "END CHECKPOINT" record to WAL with:
- LSN of this checkpoint
- List of transactions active at checkpoint time
4. WAL records before the checkpoint LSN can now be archived or deleted
(data pages are already on disk — those WAL records are no longer needed for recovery)
On crash recovery:
→ Find the last checkpoint record in WAL
→ Only replay WAL records AFTER the checkpoint LSN
→ Recovery time = time to replay WAL since last checkpointCheckpoint frequency trade-off:
Frequent checkpoints:
+ Faster crash recovery (less WAL to replay)
+ Less WAL storage needed
- More I/O during normal operation (more dirty page flushes)
Infrequent checkpoints:
+ Less I/O overhead during normal operation
- Slower crash recovery
- More WAL storage neededIn PostgreSQL, checkpoint_timeout (default 5 minutes) and max_wal_size (default 1 GB) control checkpoint frequency. If max_wal_size is exceeded, a checkpoint is forced regardless of time.
-- Check checkpoint frequency and duration
SELECT
checkpoints_timed,
checkpoints_req, -- forced by WAL size limit (should be low)
checkpoint_write_time / 1000.0 AS write_sec,
checkpoint_sync_time / 1000.0 AS sync_sec,
buffers_checkpoint,
buffers_clean,
buffers_backend -- dirty pages written by backends (bad — means checkpoint too slow)
FROM pg_stat_bgwriter;
-- High checkpoints_req = checkpoints happening too often = increase max_wal_size
-- High buffers_backend = checkpoints not keeping up = increase checkpoint_completion_targetARIES: The Recovery Algorithm
ARIES (Algorithm for Recovery and Isolation Exploiting Semantics) is the standard crash recovery algorithm, used by PostgreSQL, InnoDB, DB2, and most serious databases. Published by IBM in 1992, it remains the foundation of modern database recovery.
ARIES recovery has three phases:
Phase 1: Analysis
Scan the WAL from the last checkpoint forward. Build two tables:
Transaction table: which transactions were active at crash time (committed, aborted, or in-progress)
Dirty page table: which data pages had changes in the WAL that may not have reached disk
Analysis result:
Transaction Table:
TxnID=498 status=COMMITTED (saw COMMIT record in WAL)
TxnID=499 status=COMMITTED (saw COMMIT record in WAL)
TxnID=500 status=IN-PROGRESS (no COMMIT or ABORT in WAL — was active when crash occurred)
Dirty Page Table:
page (0,5) recLSN=0/1000 (earliest WAL record that modified this page after last checkpoint)
page (3,2) recLSN=0/1800Phase 2: Redo
Replay all WAL records from the earliest dirty page LSN forward — even for transactions that will ultimately be rolled back. This restores the database to the exact state it was in at the moment of crash.
Redo phase: replay every change in WAL order
LSN 0/1000: TxnID=498 INSERT page(0,5) → apply
LSN 0/1200: TxnID=499 UPDATE page(0,5) → apply
LSN 0/1400: TxnID=499 COMMIT → mark as committed
LSN 0/1600: TxnID=500 UPDATE page(3,2) → apply (even though txn will be rolled back)
↑ Redo everything, committed or notWhy redo uncommitted transactions? Because some of their changes may have already been flushed to disk before the crash (buffer pool eviction). Redo restores a consistent on-disk state. Undo handles cleanup next.
Phase 3: Undo
Roll back all transactions that were in-progress at crash time (TxnID=500 in our example). The database reads their WAL records in reverse order and applies the "before image" to undo each change.
Undo phase: roll back TxnID=500
LSN 0/1600: TxnID=500 UPDATE page(3,2) balance 750→800
→ Apply before image: balance 800→750
→ Write "CLR" (Compensation Log Record) to WAL for this undo
CLR marks that this undo is durable — if we crash during undo, we do not redo the undo
TxnID=500 is now fully rolled back — as if it never happened.Compensation Log Records (CLRs): Undo operations are themselves logged. If the database crashes during the undo phase, it does not re-undo already-undone changes — it picks up from where undo left off. This makes ARIES idempotent: you can run recovery multiple times and always reach the same correct state.
Full ARIES recovery timeline:
┌─────────────────────────────────────────────────────────────┐
│ Last Checkpoint Crash Restart │
│ │ │ │ │
│ ─────●───────────────● │ │
│ ↑ ↑ │ │
│ Checkpoint LSN Server dies │ │
│ │ │
│ Phase 1: Analysis │
│ Scan WAL from checkpoint → crash │
│ Build txn table + dirty page table │
│ │ │
│ Phase 2: Redo │
│ Replay all WAL records from │
│ earliest dirty page LSN │
│ │ │
│ Phase 3: Undo │
│ Roll back in-progress transactions │
│ │ │
│ Database is consistent ✅ │
└─────────────────────────────────────────────────────────────┘fsync: The Hardware Contract
fsync() is the system call that tells the OS: "flush your write buffers to the physical disk and do not return until the data is on durable storage."
Without fsync, a "write" to disk may only reach the OS page cache — volatile memory that is lost in a power failure. The OS lies and says the write succeeded. fsync forces the truth.
Without fsync:
Application writes data → OS page cache (memory) → "success" returned
Power failure → OS page cache lost → data gone
With fsync:
Application writes data → OS page cache → fsync() → physical disk → "success" returned
Power failure → data on physical disk → survives
The WAL uses fsync for every COMMIT:
WAL record written → WAL buffer → fsync → physical disk → COMMIT OK returned to client
This is why COMMIT has latency: it waits for a physical disk write (1–10ms on HDD, 0.1–1ms on SSD/NVMe)fsync=off in PostgreSQL: Disables fsync entirely. Writes return without waiting for disk. Performance increases dramatically. Durability guarantee is completely removed — a power failure can corrupt the database in an unrecoverable way (not just lose recent changes — corrupt it, requiring a full restore from backup).
Never use fsync=off in production. It is appropriate only for ephemeral data (test databases, throwaway load testing environments) where data loss is acceptable.
Battery-backed write cache: Enterprise SSDs and RAID controllers with battery backup make fsync faster — the controller acknowledges the write when data reaches its battery-backed RAM, not the physical flash. This is safe (the battery holds data through power failure) and common in production database servers.
PostgreSQL WAL vs InnoDB Redo Log
Both PostgreSQL and InnoDB use WAL / redo logging, but with different implementations:
PostgreSQL WAL:
Physical WAL: records describe changes to specific page bytes (before/after images)
WAL records used for: crash recovery, replication (streaming replication sends WAL to replicas), point-in-time recovery (PITR)
WAL files: 16 MB segments in
pg_wal/directory, recycled after archiving
InnoDB Redo Log:
Logical-physical redo log: records describe operations at a higher level ("insert row X into page Y")
Separate from binlog (MySQL's statement-level replication log)
Circular redo log files (
ib_logfile0,ib_logfile1) — size controlled byinnodb_log_file_sizeUndo log: separate from redo log, stored in system tablespace, used for MVCC and rollback
MySQL write path:
Client write
│
▼
InnoDB Buffer Pool (memory)
│
├─→ InnoDB Redo Log (WAL equivalent) → disk (for durability)
│
└─→ MySQL Binlog → disk (for replication)
Two separate logs — both must be written for a durable, replicable commit.
innodb_flush_log_at_trx_commit=1 (default): fsync redo log at every commit (fully durable)
innodb_flush_log_at_trx_commit=2: write to OS cache at commit, fsync every second (faster, less durable)
sync_binlog=1: fsync binlog at every commit (required for safe replication)Point-in-Time Recovery (PITR)
WAL enables not just crash recovery but point-in-time recovery — restoring a database to any moment in the past, not just the last backup.
PITR setup:
1. Take a base backup (full copy of data directory) at T=Monday 00:00
2. Archive WAL segments continuously to S3 or similar storage
3. Database corrupted at T=Wednesday 14:23 (e.g. accidental DELETE)
Recovery:
Restore base backup from Monday 00:00
Replay archived WAL up to Wednesday 14:22 (one minute before the mistake)
Database is at the state it was in at Wednesday 14:22 — mistake never happened-- PostgreSQL PITR configuration (postgresql.conf):
archive_mode = on
archive_command = 'aws s3 cp %p s3://my-wal-archive/%f'
-- %p = WAL segment path, %f = WAL segment filename
-- Restore command (recovery.conf / postgresql.conf in PG12+):
restore_command = 'aws s3 cp s3://my-wal-archive/%f %p'
recovery_target_time = '2025-03-19 14:22:00'
-- Database replays WAL until this timestamp, then stopsPITR is the foundation of managed database backup in AWS RDS, Google Cloud SQL, and similar services — they continuously archive WAL and let you restore to any second within the retention window.
WAL and Replication
In PostgreSQL, streaming replication uses the same WAL:
Primary server Replica server
────────────── ──────────────
Client write
│
▼
WAL record written ─────────────→ Replica receives WAL record
│ (streaming) Replica applies to its own data pages
▼ Replica is a real-time copy of primary
Data page updated in buffer pool This is why PostgreSQL replicas are sometimes called "WAL replicas" — they are continuously replaying the primary's WAL. The replication lag is the delay between the primary writing a WAL record and the replica applying it.
-- Check replication lag on primary
SELECT
client_addr,
state,
sent_lsn,
write_lsn,
flush_lsn,
replay_lsn,
write_lag,
flush_lag,
replay_lag
FROM pg_stat_replication;The Durability Guarantee in One Sentence
A committed transaction is durable if and only if its WAL record has been flushed to durable storage before the COMMIT acknowledgment is returned to the client.
Everything else — buffer pool, checkpoint timing, background writers — is an optimization. The WAL flush is the guarantee.
Chapter 3 Complete
You have finished the Storage Engine Internals chapter:
✅ Post 09: Pages, Heap Files & Row vs Column Storage
✅ Post 10: B-Tree vs LSM-Tree: InnoDB vs RocksDB
✅ Post 11: WAL, Crash Recovery & Durability
Chapter 4 moves from how data is stored on one machine to how databases stay available when machines fail — replication, consistency guarantees, and the distributed systems problems that make high availability genuinely hard.
🧭 What's Next — Chapter 4: Replication & High Availability
Post 12: Database Replication — single-leader, multi-leader, and leaderless replication each make fundamentally different consistency trade-offs; this post explains all three clearly
Related
B-Tree vs LSM-Tree: Why InnoDB and RocksDB Make Different Trade-offs
InnoDB uses B-Tree. RocksDB uses LSM-Tree. The choice determines read speed, write speed, and space amplification. Understanding both unlocks every storage engine decision.
How Databases Store Data: Pages, Heap Files and Row vs Column Storage
Every query you write becomes disk I/O. Understanding how databases store data — pages, heap files, row vs column layout — explains why some queries are fast and others are not.
Schema Design: Normalization vs Denormalization and When to Choose Each
Normalization keeps data consistent. Denormalization makes reads fast. Every real schema is a trade-off between them — this post teaches you how to make that call deliberately.
Comments