B-Tree vs LSM-Tree: Why InnoDB and RocksDB Make Different Trade-offs

The Same Problem, Two Opposite Answers

The core problem of a storage engine is simple: given a stream of reads and writes from the database, how do you organize data on disk to serve both as efficiently as possible?

B-Tree and LSM-Tree answer this question in opposite ways:

B-Tree: Organize data in a sorted, balanced tree on disk. Reads are fast because data is always in sorted order and lookups are O(log n). Writes update the tree in-place — fast for small writes, but each write may require random I/O to wherever the target page lives on disk.

LSM-Tree (Log-Structured Merge-Tree): Never update data in-place. Buffer all writes in memory, then flush to disk sequentially in sorted batches. Writes are always sequential (fast). Reads must search across multiple sorted structures — slightly more work.

Neither is universally better. The choice is a deliberate trade-off between write performance, read performance, and storage efficiency. Understanding both tells you why MySQL and PostgreSQL behave differently under write-heavy load, why RocksDB powers Cassandra, MyRocks, and TiKV, and when to choose one over the other.

Part 1: B-Tree Storage Engine

How a B-Tree Works on Disk

A B-Tree is a self-balancing, sorted tree stored on disk. Each node is one disk page. Internal nodes contain keys and child pointers. Leaf nodes contain keys and either the actual data (clustered index) or pointers to data rows (non-clustered index).

B-Tree for a primary key index on orders.id:

                    ┌──────────────┐
                    │  [50 | 100]  │  ← root node (internal)
                    └──────────────┘
                    /       |       \
          ┌────────┐   ┌─────────┐   ┌──────────┐
          │[10|25] │   │[60|80]  │   │[110|150] │  ← internal nodes
          └────────┘   └─────────┘   └──────────┘
          /  |   \
  ┌──────┐ ┌──────┐ ┌──────┐
  │1→row │ │11→row│ │26→row│  ← leaf nodes (keys + row data or pointers)
  │2→row │ │12→row│ │27→row│
  │...   │ │...   │ │...   │
  └──────┘ └──────┘ └──────┘
  Leaf nodes linked → sequential range scan without going back up the tree

Lookup: Start at root, compare key, follow child pointer, repeat until leaf. O(log n) disk reads — typically 3–5 for a table with millions of rows.

Range scan: Find the leftmost leaf via lookup, then follow the linked list of leaf nodes. Efficient sequential I/O.

Insert/Update: Find the correct leaf page, insert the row. If the page is full, split it into two pages and propagate a new key up to the parent. Splits can cascade up the tree.

InnoDB: Clustered B-Tree Index

MySQL InnoDB stores the primary key index as a clustered index — the actual row data lives in the B-Tree leaf nodes, sorted by primary key. There is no separate heap file.

InnoDB clustered index (PRIMARY KEY on id):

Leaf node:
┌────────────────────────────────────────────────────┐
│  id=1  │ name='Alice' │ balance=1000 │ email='...' │  ← full row data
│  id=2  │ name='Bob'   │ balance=500  │ email='...' │
│  id=3  │ name='Carol' │ balance=2000 │ email='...' │
└────────────────────────────────────────────────────┘

Benefit: Primary key lookups fetch the complete row in one tree traversal. No separate heap read.

Cost — secondary index double lookup: Secondary indexes in InnoDB store the primary key value (not a page pointer) in their leaf nodes. A query using a secondary index must:

Search the secondary index → find primary key value
Search the clustered index with that primary key → fetch the row

This double lookup is called a bookmark lookup. For every row returned by a secondary index, InnoDB performs two B-Tree traversals.

-- Query using secondary index on email:
SELECT * FROM users WHERE email = 'alice@example.com';

-- Step 1: search idx_email → find id=42
-- Step 2: search clustered index with id=42 → fetch full row
-- Two B-Tree traversals total

-- Covered query avoids step 2:
SELECT id, email FROM users WHERE email = 'alice@example.com';
-- If index includes (email, id), step 2 is skipped — id is already in the index

Write amplification in B-Tree:

The worst case for B-Tree writes is random inserts into a large table — each insert may land in a different leaf page scattered across disk, causing random I/O. This is the B-Tree's weakness: write amplification from random I/O at scale.

100,000 random inserts into a 10M-row B-Tree:
Each insert → find leaf page (3-5 disk reads) → write updated page (1 disk write)
Potential page splits: add 1-2 more writes
Total: ~5-7 disk operations per insert → ~500,000–700,000 disk ops for 100K inserts

Part 2: LSM-Tree Storage Engine

The Core Insight: Make All Writes Sequential

Disk I/O is asymmetric — sequential reads and writes are far faster than random ones (even on SSDs, sequential I/O is 2–10× faster due to prefetching and write coalescing). LSM-Tree is designed to exploit this asymmetry.

The insight: instead of updating data in-place (random I/O), buffer all writes in memory and flush them to disk sequentially in sorted batches. Never overwrite old data. Use background compaction to merge and clean up.

The LSM-Tree Structure

Write path:

New writes
    │
    ▼
┌──────────────────────┐
│   MemTable (memory)  │  ← sorted, in-memory write buffer (typically 64–256 MB)
│   (Red-Black Tree    │     All writes go here first: O(log n) insert
│    or Skip List)     │
└──────────────────────┘
    │ When MemTable is full, flush to disk as an immutable SSTable
    ▼
┌──────────────────────┐
│   L0 SSTable files   │  ← Level 0: freshly flushed, may overlap in key range
│   (sorted, immutable)│
└──────────────────────┘
    │ Background compaction merges and sorts SSTables into deeper levels
    ▼
┌──────────────────────┐
│   L1 SSTables        │  ← Level 1: non-overlapping key ranges, 10× larger than L0
└──────────────────────┘
    ▼
┌──────────────────────┐
│   L2 SSTables        │  ← Level 2: 10× larger than L1
└──────────────────────┘
    ▼  ...

SSTable (Sorted String Table): An immutable, sorted file on disk. Once written, never modified. Updates are new entries; deletes are tombstone markers (a special entry saying "this key is deleted").

The Write Path

Write key=42, value=updated_row:

1. Write to WAL (for crash recovery) — sequential, fast
2. Insert into MemTable — in-memory sorted structure, O(log n)
3. Return success to client — write is done

When MemTable fills:
4. Flush MemTable to a new L0 SSTable file — sequential write
5. MemTable is cleared, new writes go to a fresh MemTable

Background (asynchronous):
6. Compact L0 SSTables into L1 (merge-sort multiple files into one)
7. Compact L1 into L2 as L1 fills, and so on

Write latency: Steps 1–2 are in-memory. The actual disk I/O (step 4) happens asynchronously. This is why LSM-Tree engines achieve extremely high write throughput — the write path is almost entirely memory operations followed by a sequential flush.

The Read Path

Reading is more complex — the key might exist in the MemTable, in any L0 SSTable, or in any deeper level:

Read key=42:

1. Check MemTable → not found
2. Check L0 SSTables (newest first, may overlap) → not found
3. Check L1 SSTable for this key range → not found
4. Check L2 SSTable for this key range → found! Return value.

Worst case: must check every level before finding the key (or confirming it does not exist)

Bloom filters: To avoid reading from SSTables that do not contain a key, each SSTable has a Bloom filter — a probabilistic data structure that answers "does this key exist in this SSTable?" in O(1) with zero false negatives (if Bloom filter says no, the key is definitely not there) and a small false positive rate (~1%).

Without Bloom filters: read might check 10 SSTables = 10 disk reads
With Bloom filters: eliminates most SSTables = typically 1–2 disk reads

The Three Amplification Factors

Storage engine performance is characterized by three amplification factors. B-Tree and LSM-Tree sit at opposite ends of each:

Write Amplification

How many bytes are written to disk for every byte of user data written?

B-Tree: Each write updates a page (8 KB) even if only one row (100 bytes) changed. Ratio ≈ 80×. Plus, page splits duplicate data across levels temporarily.

LSM-Tree: Each byte is written to WAL once, flushed in MemTable once, then rewritten during each compaction level it passes through. Ratio ≈ 10–30× depending on compaction strategy — significantly lower than B-Tree for random writes.

Read Amplification

How many disk reads are required to serve one read request?

B-Tree: O(log n) — typically 3–5 reads for the tree traversal. Predictable. Well-cached in buffer pool for hot data.

LSM-Tree: Must potentially check MemTable + all SSTable levels. Without Bloom filters: O(levels). With Bloom filters: typically 1–2 disk reads for point lookups. But range scans across multiple SSTables can be slower than B-Tree range scans.

Space Amplification

How much disk space is used relative to the logical data size?

B-Tree: Page fragmentation from splits and deletions wastes space. Dead space is reclaimed by VACUUM-equivalent operations. Typically 1.2–1.5× space amplification.

LSM-Tree: Data exists in multiple levels simultaneously during compaction. Old versions of keys persist until compacted away. Tombstones occupy space until the compaction that removes them. Typically 1.1–2× space amplification (better than B-Tree when write-heavy, worse during compaction spikes).

Summary:

               Write Amp    Read Amp    Space Amp    Best For
B-Tree         High         Low         Medium       Read-heavy, mixed OLTP
LSM-Tree       Medium       Higher      Low-Med      Write-heavy, time-series, key-value

Compaction Strategies

Compaction is the background process that merges SSTables across levels. Different compaction strategies trade off between write amplification, space amplification, and read amplification.

Leveled Compaction (RocksDB default, LevelDB)

Each level has a size limit. When a level exceeds its limit, SSTables are merged into the next level. Within each level (except L0), key ranges do not overlap — reading any key requires at most one SSTable per level.

L0: [1-100] [50-200] [150-300]    ← may overlap (freshly flushed)
L1: [1-50] [51-100] [101-150]     ← non-overlapping (after compaction)
L2: [1-25] [26-50] ...            ← non-overlapping, 10× larger than L1

Write amplification: moderate (data is rewritten multiple times across levels)
Read amplification: low (at most one SSTable per level per read)
Space amplification: low (old data compacted away quickly)

Best for: Read-heavy workloads, general-purpose databases (Cassandra uses this, RocksDB default).

Size-Tiered Compaction (Cassandra default, HBase)

When N SSTables of similar size accumulate, merge them into one larger SSTable. No strict level structure.

Tier 1 (small): [file1] [file2] [file3] [file4] → merge → Tier 2 (medium)
Tier 2 (medium): [file5] [file6] [file7] → merge → Tier 3 (large)

Write amplification: low (each byte written fewer times than leveled)
Read amplification: high (must check all SSTables of all tiers)
Space amplification: high (old versions persist until tier merge)

Best for: Write-heavy workloads, time-series data, append-only patterns.

Real-World Engines

InnoDB (MySQL): B-Tree clustered index. Excellent read performance, good for mixed OLTP. Write performance degrades under high random write load due to page splits and random I/O.

RocksDB: LSM-Tree with leveled compaction. Used as the storage backend for Cassandra (via CassandraDB's Stratio fork), MyRocks (MySQL + RocksDB), TiKV (TiDB's storage layer), and CockroachDB. Excellent write throughput, good for time-series and event data.

LevelDB: Google's original LSM-Tree implementation. Simpler than RocksDB; used in LevelDB-based embedded databases.

PostgreSQL: B-Tree heap-based storage (not clustered). Excellent for general OLTP. Extension ecosystem adds columnar storage (citus_columnar) and other access methods.

WiredTiger (MongoDB's default): B-Tree with MVCC. Replaced MMAPv1 in MongoDB 3.2. Supports both B-Tree and LSM-Tree storage engines (configurable per collection).

How to Choose

Your workload is primarily:

Random point lookups + range scans + moderate writes?
  └─ B-Tree (InnoDB, PostgreSQL, WiredTiger)
     Predictable read latency, excellent for OLTP

High-volume sequential or random writes?
  └─ LSM-Tree (RocksDB, Cassandra, LevelDB)
     Write throughput 2-10× higher than B-Tree under write pressure

Time-series data (append-heavy, ordered by timestamp)?
  └─ LSM-Tree (writes are naturally sequential by time)
     Or dedicated time-series engine (InfluxDB, TimescaleDB)

Mixed OLTP with occasional analytics?
  └─ B-Tree for OLTP tables
     Materialized views or a separate columnar engine for analytics

Write-heavy with occasional full range scans?
  └─ LSM-Tree with size-tiered compaction for maximum write throughput
     Accept higher read amplification; mitigate with Bloom filters

Space is constrained and write rate is very high?
  └─ LSM-Tree with leveled compaction
     Lower space amplification than size-tiered

🧭 What's Next

Post 11: WAL & Crash Recovery — both B-Tree and LSM-Tree rely on a Write-Ahead Log to survive crashes; this post explains exactly how WAL works, what ARIES recovery does, and what "durable" actually means at the hardware level