How Databases Store Data: Pages, Heap Files and Row vs Column Storage
Why Storage Internals Matter
Most developers treat the database as a black box: send SQL, get rows. That abstraction holds until it does not — until a query that "should be fast" takes 30 seconds, until a table with only 10,000 rows consumes 2 GB of disk, until VACUUM takes hours and the database freezes.
These problems all have physical explanations. The database stores data on disk in a specific structure. Queries read that structure in specific patterns. Understanding those patterns tells you:
Why an index lookup on a fragmented table is slower than expected
Why a query fetching 3 columns from a 100-column table reads far more data than it should
Why a table accumulates dead space that never gets reclaimed without VACUUM
Why column-oriented databases are 10× faster than row-oriented for analytics
None of this requires understanding kernel internals or hardware architecture. You only need to understand one abstraction: the page.
The Page: The Fundamental Unit of Storage
Databases do not read or write individual rows. They read and write pages — fixed-size blocks of disk space. Every read, every write, every cache operation happens at page granularity.
PostgreSQL: 8 KB pages (configurable at compile time, rarely changed) MySQL InnoDB: 16 KB pages (configurable via innodb_page_size) SQL Server: 8 KB pages Oracle: 2–32 KB pages (default 8 KB)
Why fixed-size pages? Because the operating system and disk hardware also operate in fixed-size blocks (typically 4 KB sectors). Fixed-size pages align with hardware, simplify buffer pool management, and make space reclamation predictable.
Disk Layout:
┌──────────┬──────────┬──────────┬──────────┬──────────┐
│ Page 0 │ Page 1 │ Page 2 │ Page 3 │ Page 4 │
│ 8 KB │ 8 KB │ 8 KB │ 8 KB │ 8 KB │
└──────────┴──────────┴──────────┴──────────┴──────────┘
↑
One read syscall fetches exactly one page (8 KB)
even if you only needed one row (100 bytes)The implication: When you read one row, you read an entire page. If that row is surrounded by 79 other rows you do not need, you still paid the I/O cost for all 80. This is why reducing the number of pages a query must touch is the most direct path to making it faster.
The Heap File: Where Table Data Lives
A table's data is stored in a heap file — a collection of pages, stored in no particular order. "Heap" here does not mean the data structure; it means an unordered pile.
Each page in a heap file is a slotted page with three regions:
┌─────────────────────────────────────────────────────────┐
│ PAGE (8 KB) │
├─────────────────────────────────────────────────────────┤
│ Page Header (24 bytes) │
│ - Page LSN (last WAL record for this page) │
│ - Free space pointers │
│ - Checksum │
├─────────────────────────────────────────────────────────┤
│ Item ID Array (slot array, grows downward →) │
│ [slot 1 offset | len] [slot 2 offset | len] ... │
├──────────────────────────────────┬──────────────────────┤
│ │ Free Space │
│ │ (grows toward │
│ │ each other) │
│ ← Tuple Data (grows upward) │ │
├──────────────────────────────────┴──────────────────────┤
│ Tuple N │ Tuple 3 │ Tuple 2 │ Tuple 1 │
│ (newest) │ │ │ (oldest) │
└─────────────────────────────────────────────────────────┘Item ID Array (slot array): Fixed-size entries at the top of the page, each pointing to the offset and length of a tuple. This indirection layer is critical — it allows tuples to be moved within a page (for compaction) without updating any external pointers. External references (indexes, foreign keys) point to (page_number, slot_number), not to a raw byte offset.
Tuple data: Rows grow upward from the bottom of the page. Free space sits between the slot array and the tuples.
A row's physical address is a ctid in PostgreSQL — (page_number, slot_number):
SELECT ctid, id, name FROM users LIMIT 5;
-- ctid | id | name
-- (0,1) | 1 | Alice ← page 0, slot 1
-- (0,2) | 2 | Bob ← page 0, slot 2
-- (0,3) | 3 | Carol ← page 0, slot 3
-- (1,1) | 4 | Dave ← page 1, slot 1 (row didn't fit in page 0)
-- (1,2) | 5 | Eve ← page 1, slot 2Tuple Layout: What a Row Looks Like on Disk
Each tuple (row) stored on a page has a header followed by the actual column data:
Tuple structure in PostgreSQL:
┌──────────────────────────────────────────────────────┐
│ Tuple Header (23 bytes) │
│ - xmin: transaction ID that inserted this row │
│ - xmax: transaction ID that deleted this row │
│ - infomask: flags (has nulls? has varlen? deleted?) │
│ - ctid: pointer to newer version (for MVCC updates) │
├──────────────────────────────────────────────────────┤
│ Null bitmap (1 bit per column, only if NULLs exist) │
├──────────────────────────────────────────────────────┤
│ Column 1 data │ Column 2 data │ Column 3 data │
│ (fixed-length │ (fixed-length) │ (pointer to │
│ stored inline)│ │ TOAST table │
│ │ │ if too large) │
└──────────────────────────────────────────────────────┘xmin and xmax are the MVCC version markers. When you UPDATE a row, PostgreSQL does not modify the existing tuple — it inserts a new version with a new xmin and marks the old version with an xmax (the deleting transaction's ID). This is how MVCC provides snapshot isolation: old versions remain visible to transactions that started before the update.
The UPDATE cost: Every UPDATE in PostgreSQL is actually an INSERT + a soft delete of the old version. This produces dead tuples — old row versions that are no longer visible to any active transaction but still occupy disk space.
Dead Tuples and VACUUM
Dead tuples are the storage tax of MVCC. They accumulate over time as rows are updated and deleted. A table that is updated frequently can grow to 10× its logical size if VACUUM does not run.
Initial state: 3 live tuples
┌────────────────────────────────┐
│ (0,1) Alice, balance=1000 │ live
│ (0,2) Bob, balance=500 │ live
│ (0,3) Carol, balance=2000 │ live
└────────────────────────────────┘
After UPDATE accounts SET balance = 1500 WHERE name = 'Alice':
┌────────────────────────────────┐
│ (0,1) Alice, balance=1000 DEAD │ ← old version, xmax set
│ (0,2) Bob, balance=500 │ live
│ (0,3) Carol, balance=2000 │ live
│ (0,4) Alice, balance=1500 │ ← new version, xmin set
└────────────────────────────────┘
After 1000 more updates to Alice:
Page fills with dead tuples. Sequential scan reads all of them.
Index points to ctid of new version — but must skip dead versions.VACUUM reclaims dead tuple space by marking their slots as reusable (not immediately returning space to OS). VACUUM FULL rewrites the entire table, returning space to OS — but takes an exclusive lock.
-- Check dead tuple ratio per table
SELECT
relname,
n_live_tup,
n_dead_tup,
ROUND(n_dead_tup::NUMERIC / NULLIF(n_live_tup + n_dead_tup, 0) * 100, 1) AS dead_pct,
last_vacuum,
last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;
-- If dead_pct > 10–20%, autovacuum may be too slow — tune:
-- autovacuum_vacuum_scale_factor = 0.01 (trigger at 1% dead, not 20%)
-- autovacuum_vacuum_cost_delay = 2ms (run faster, more I/O)TOAST: Storing Large Values
PostgreSQL pages are 8 KB. What happens when a single value is larger — a 1 MB JSON blob, a 500 KB text field, a binary attachment?
TOAST (The Oversized-Attribute Storage Technique) handles this automatically:
Values under 2 KB: stored inline in the tuple
Values 2 KB – ~1 GB: compressed and/or moved to a separate TOAST table
Values over 1 GB: not supported (use large objects or external storage)
Main table page:
┌──────────────────────────────────────────────────┐
│ id=1 | name='Alice' | bio=<TOAST pointer> | ... │
└──────────────────────────────────────────────────┘
│
▼
TOAST table (pg_toast_NNNN):
┌──────────────────────────────────────────────────┐
│ chunk_id=1, chunk_seq=0 | chunk_data=<1KB chunk> │
│ chunk_id=1, chunk_seq=1 | chunk_data=<1KB chunk> │
│ chunk_id=1, chunk_seq=2 | chunk_data=<rest> │
└──────────────────────────────────────────────────┘Performance implication: Queries that do not select the TOAST column pay no I/O cost for it. SELECT id, name FROM users never touches the TOAST table even if bio has 500 KB values. SELECT * or SELECT bio does.
This is another reason SELECT * is an anti-pattern — it may silently trigger TOAST decompression for every row.
The Buffer Pool: The Page Cache
Reading from disk is slow. Reading from memory is fast. The buffer pool (shared memory buffer cache) is the database's in-memory cache of recently accessed pages.
Query execution path:
1. Query needs page (3, 0) from table orders
2. Check buffer pool: is page (3, 0) already in memory?
├─ YES (cache hit) → read directly from memory (~100ns)
└─ NO (cache miss) → read page from disk (~1ms SSD, ~10ms HDD)
load into buffer pool
return data to queryPostgreSQL buffer pool: controlled by shared_buffers (default 128 MB, recommended 25% of RAM for dedicated DB servers). A buffer pool that is too small causes frequent cache misses — every query hits the disk.
-- Check buffer pool hit rate (should be > 99% for a well-tuned server)
SELECT
SUM(heap_blks_hit) AS cache_hits,
SUM(heap_blks_read) AS disk_reads,
ROUND(
SUM(heap_blks_hit)::NUMERIC
/ NULLIF(SUM(heap_blks_hit) + SUM(heap_blks_read), 0) * 100, 2
) AS hit_rate_pct
FROM pg_statio_user_tables;
-- If hit_rate_pct < 99%: increase shared_buffers or add RAMDirty pages: When a query modifies a page in the buffer pool, the page is marked "dirty" — it has been changed in memory but not yet written to disk. The background writer and checkpointer flush dirty pages to disk periodically. Until then, the modified data exists only in memory — protected by WAL (covered in Post 11).
Row-Oriented vs Column-Oriented Storage
This is the most important storage trade-off for anyone working at scale.
Row-oriented storage (PostgreSQL, MySQL, SQLite): all columns of a row are stored together on disk. Reading a row requires touching one small region of one page.
Row-oriented layout:
Page 0:
[id=1, name='Alice', age=30, email='a@x.com', balance=1000, country='US', ...]
[id=2, name='Bob', age=25, email='b@x.com', balance=500, country='UK', ...]
[id=3, name='Carol', age=35, email='c@x.com', balance=2000, country='US', ...]Column-oriented storage (Redshift, BigQuery, Snowflake, ClickHouse, DuckDB): all values of a column are stored together. Reading a column scans one contiguous region; columns you do not need are never touched.
Column-oriented layout:
id column: [1, 2, 3, 4, 5, ...]
name column: ['Alice', 'Bob', 'Carol', 'Dave', 'Eve', ...]
age column: [30, 25, 35, 28, 42, ...]
balance column: [1000, 500, 2000, 750, 3500, ...]The performance difference is dramatic for analytics:
-- Query: total balance by country across 100 million rows, fetching 2 of 50 columns
SELECT country, SUM(balance) FROM users GROUP BY country;
-- Row-oriented: reads ALL 50 columns for every row (100M × full row width)
-- Reads ~50 GB of data even though only 2 columns are needed
-- Column-oriented: reads ONLY the country and balance columns
-- Reads ~1.6 GB (country + balance columns only, compressed)
-- ~30× less I/O → ~30× fasterWhy does this also compress better? A column of the same data type contains similar values — ages clustered between 18–80, countries with high repetition (US appears millions of times). Run-length encoding and dictionary encoding compress these columns 10–50×, further reducing I/O.
Row-Oriented | Column-Oriented | |
|---|---|---|
Best for | OLTP — fetch a few complete rows | OLAP — aggregate across many rows |
Point lookup (by PK) | Fast — all data in one page region | Slow — must read from multiple column files |
Aggregate query | Slow — reads all columns, filters many | Fast — reads only needed columns |
UPDATE one row | Fast — one write to one page region | Slow — must update every column file |
Compression | Moderate | Excellent (similar values per column) |
Examples | PostgreSQL, MySQL, SQLite | Redshift, BigQuery, Snowflake, DuckDB |
PostgreSQL with columnar extension: pg_mooncake and the citus_columnar access method add columnar storage to PostgreSQL for analytics workloads, letting you mix row and column tables in the same database.
Putting It Together: What Happens When You Run a Query
SELECT name, balance FROM users WHERE id = 42;Parser validates SQL, builds parse tree
Planner checks for index on
id→ finds primary key B-Tree indexExecutor uses index to find
ctid = (page 0, slot 42)Buffer pool check: is page 0 in memory?
Cache hit: read tuple from memory
Cache miss: read page 0 from disk (8 KB) into buffer pool, then read tuple
Tuple visibility check: is this tuple visible to my transaction snapshot? (xmin/xmax check)
TOAST check: are
nameorbalanceTOASTed? (No, they are small — inline)Return
nameandbalanceto client
Total data read from disk (cache miss): 8 KB for one page — even though the row is 50 bytes. Everything else on that page was loaded unnecessarily. This is why high-selectivity queries on well-indexed tables still benefit from keeping rows compact (narrow tables, no unnecessary columns).
🧭 What's Next
Post 10: B-Tree vs LSM-Tree — InnoDB and RocksDB make fundamentally different storage engine trade-offs; understanding them explains every read/write performance characteristic you will encounter in production
Related
Schema Design: Normalization vs Denormalization and When to Choose Each
Normalization keeps data consistent. Denormalization makes reads fast. Every real schema is a trade-off between them — this post teaches you how to make that call deliberately.
Query Optimization: How to Read EXPLAIN Plans and Fix Slow Queries
A slow query in production is a fire. EXPLAIN plans show exactly where the database wastes time — if you know how to read them. This post teaches you how, with real examples.
Advanced SQL: Window Functions, CTEs and Recursive Queries
Window functions and CTEs turn 20-line subquery nightmares into clean SQL. Master ROW_NUMBER, RANK, LAG/LEAD, recursive CTEs, and the patterns that appear in real codebases.
Comments