Database Indexes: B-Tree, Hash, Composite — When and Why to Use Them
The Query That Changed Everything
You have a users table with 50 million rows. You run:
```sql
SELECT * FROM users WHERE email = 'alice@example.com';
```

Without an index, the database reads every single row — all 50 million — compares the email, and returns the one match. On a modern SSD, this takes several seconds. On a spinning disk, longer. As the table grows, so does the wait.
With an index on email, the same query returns in under a millisecond. The database jumps directly to the matching row without reading anything else.
That is the power of an index. It is also why adding a single CREATE INDEX statement is often the highest-leverage performance optimization available — and why understanding how indexes work is not optional for anyone who writes production SQL.
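The speed-up is visible directly in the query plan. Here is a minimal sketch using Python's built-in sqlite3 module (the table and data are illustrative; PostgreSQL's EXPLAIN shows the same shift from a sequential scan to an index scan):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO users (email, name) VALUES (?, ?)",
    [(f"user{i}@example.com", f"User {i}") for i in range(10_000)],
)

def plan(sql):
    # EXPLAIN QUERY PLAN is SQLite's version of EXPLAIN
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT * FROM users WHERE email = 'user42@example.com'"
before = plan(query)   # no index yet: the plan is a full table scan
conn.execute("CREATE INDEX idx_users_email ON users (email)")
after = plan(query)    # now: a direct search via idx_users_email
print(before)
print(after)
```

The first plan scans the whole table; the second seeks straight to the row through the index.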
What an Index Actually Is
An index is a separate data structure maintained by the database alongside the table. It stores a subset of the table's columns — the indexed columns — in a structure optimized for fast lookup, alongside a pointer back to the full row.
```
Table: users (heap storage, unordered)
  Row 1:   id=1, email='charlie@...', name='Charlie', ...
  Row 2:   id=2, email='alice@...',   name='Alice', ...
  Row 3:   id=3, email='bob@...',     name='Bob', ...
  ...
  Row 50M: ...

Index on email (B-Tree, ordered):
  'alice@...'   → pointer to Row 2
  'bob@...'     → pointer to Row 3
  'charlie@...' → pointer to Row 1
  ...
```

The index is kept in sorted order. Finding alice@... in a sorted structure of 50 million entries takes about 26 comparisons (log₂ 50,000,000 ≈ 25.6). Finding it in an unsorted heap requires up to 50 million comparisons.
The trade-off: every index consumes disk space and must be updated on every INSERT, UPDATE, and DELETE. You are trading write performance and storage for read performance.
B-Tree Index: The Default
The B-Tree ("balanced tree") is the default index type in PostgreSQL, MySQL, and virtually every other relational database. When you write CREATE INDEX without specifying an index type, you get a B-Tree.
How a B-Tree Works
A B-Tree is a self-balancing tree where every node can hold multiple keys and multiple child pointers. The "balanced" part means every leaf node is at the same depth — lookups always take the same number of steps regardless of which key you search for.
```
B-Tree index on age column:

                [30 | 60]
              /     |     \
       [15|25]   [40|50]   [70|80]
      /   |   \  /  |  \   /  |  \
  [10] [20] [27][35][45][55][65][75][90]
    ↑    ↑    ↑   ↑   ↑   ↑   ↑   ↑   ↑
    leaf nodes with pointers to actual rows
```

Search for age = 45:
1. Start at root: 45 > 30 and 45 < 60 → go to the middle child
2. At [40|50]: 45 > 40 and 45 < 50 → go to the middle child
3. At leaf [45]: found. Follow the pointer to the row.
Three node visits regardless of table size. Because each real node holds hundreds of keys, a B-Tree of height 4–5 is sufficient even for a table with billions of rows.
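The search walk above can be sketched in a few lines of Python. This is a toy, not a real B-Tree (no inserts, no rebalancing, no disk pages) — just the descent over the nodes drawn in the diagram:

```python
from bisect import bisect_left

class Node:
    def __init__(self, keys, children=None):
        self.keys = keys                # sorted keys within this node
        self.children = children or []  # empty list for leaf nodes

def search(node, key, depth=1):
    """Return (found, nodes_visited): binary-search the node, then descend."""
    i = bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return True, depth
    if not node.children:
        return False, depth
    return search(node.children[i], key, depth + 1)

# The tree from the diagram: root [30|60], three internal nodes, nine leaves.
leaves = [Node([k]) for k in (10, 20, 27, 35, 45, 55, 65, 75, 90)]
root = Node([30, 60], [
    Node([15, 25], leaves[0:3]),
    Node([40, 50], leaves[3:6]),
    Node([70, 80], leaves[6:9]),
])

print(search(root, 45))  # found after visiting 3 nodes, as in the walkthrough
```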
What B-Tree Supports
B-Tree indexes support all comparison operators — this makes them useful for a wide range of query patterns:
```sql
-- Equality
WHERE email = 'alice@example.com'   -- ✅ B-Tree

-- Range queries (B-Tree stores sorted data)
WHERE age BETWEEN 25 AND 40         -- ✅ B-Tree
WHERE created_at >= '2025-01-01'    -- ✅ B-Tree
WHERE price < 100                   -- ✅ B-Tree

-- Prefix matching (LIKE with leading constant)
WHERE name LIKE 'Ali%'              -- ✅ B-Tree (prefix known)
WHERE name LIKE '%ice'              -- ❌ B-Tree cannot help (prefix unknown)

-- Sorting (B-Tree is already sorted)
ORDER BY last_name ASC              -- ✅ B-Tree can serve the sort without sorting
ORDER BY last_name DESC             -- ✅ B-Tree traversed in reverse

-- IS NULL (PostgreSQL stores NULLs in B-Tree; MySQL InnoDB also indexes NULLs)
WHERE deleted_at IS NULL            -- ✅ B-Tree (PostgreSQL)
```

Hash Index: O(1) Equality Only
A Hash index computes a hash of the indexed value and stores it in a hash table alongside a pointer to the row. Lookups are O(1) on average: one hash computation, one bucket lookup.
```
Hash index on email:
HASH('alice@...')   = 0x4A2F → pointer to row
HASH('bob@...')     = 0x81C3 → pointer to row
HASH('charlie@...') = 0x2E91 → pointer to row
```

Hash indexes are faster than B-Tree for equality lookups — O(1) vs O(log n). The restriction: they support only equality comparisons (=). No ranges, no sorting, no prefix matching.
```sql
WHERE email = 'alice@example.com'  -- ✅ Hash is perfect here
WHERE email LIKE 'alice%'          -- ❌ Hash cannot help
WHERE age > 25                     -- ❌ Hash cannot help
ORDER BY name                      -- ❌ Hash has no ordering
```

In PostgreSQL: Hash indexes exist as a first-class index type and are crash-safe as of PostgreSQL 10. Use them when your queries are exclusively equality lookups on a column.
In MySQL/InnoDB: Hash indexes are used internally by the Adaptive Hash Index (AHI) — InnoDB automatically creates hash indexes in memory for frequently accessed B-Tree pages. You do not create them manually.
In Redis and Memcached: The entire storage model is a hash table — every key lookup is O(1) equality.
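The mechanics can be sketched with a toy hash index in Python (Python's built-in hash() and a fixed bucket count stand in for a real implementation). Equality costs one hash plus one short bucket scan; because keys land in hash order rather than sort order, ranges and prefixes have nothing to work with:

```python
class HashIndex:
    def __init__(self, n_buckets=64):
        self.buckets = [[] for _ in range(n_buckets)]
        self.n_buckets = n_buckets

    def insert(self, key, row_ptr):
        # hash the key once, append (key, pointer) to its bucket
        self.buckets[hash(key) % self.n_buckets].append((key, row_ptr))

    def lookup(self, key):
        # O(1) average: one hash, then scan one tiny bucket for collisions
        bucket = self.buckets[hash(key) % self.n_buckets]
        return [ptr for k, ptr in bucket if k == key]

idx = HashIndex()
idx.insert("alice@example.com", 2)
idx.insert("bob@example.com", 3)

print(idx.lookup("alice@example.com"))  # one bucket probe finds the row pointer
# A predicate like email > 'a' would have to scan every bucket: no ordering.
```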
Composite Indexes: Column Order Matters
A composite index covers multiple columns. The order of columns in the index definition is critical — it determines which queries the index can serve.
```sql
CREATE INDEX idx_orders_customer_date
    ON orders (customer_id, created_at);
```

This index stores rows sorted first by customer_id, then by created_at within each customer. Think of it as a phone book sorted by last name, then first name within each last name.
Queries this index can serve:
```sql
-- ✅ Uses index: leading column (customer_id) is in WHERE
WHERE customer_id = 42

-- ✅ Uses index: both columns present
WHERE customer_id = 42 AND created_at >= '2025-01-01'

-- ✅ Uses index: equality on customer_id, then a B-Tree range scan on created_at
WHERE customer_id = 42 AND created_at BETWEEN '2025-01-01' AND '2025-03-31'

-- ❌ Does NOT use index: leading column (customer_id) missing
WHERE created_at >= '2025-01-01'
```

The leftmost prefix rule: A composite index (A, B, C) can serve queries on (A), (A, B), or (A, B, C) — but not (B), (C), or (B, C) alone. The index is only usable if the query filters from the leftmost column.
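You can watch the leftmost prefix rule in action with SQLite's EXPLAIN QUERY PLAN (the schema is illustrative; SQLite, PostgreSQL, and MySQL all follow the same rule for composite B-Tree indexes):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INT, created_at TEXT, amount REAL)"
)
conn.execute("CREATE INDEX idx_orders_customer_date ON orders (customer_id, created_at)")

def plan(where):
    rows = conn.execute("EXPLAIN QUERY PLAN SELECT * FROM orders WHERE " + where)
    return " ".join(r[-1] for r in rows)

leading = plan("customer_id = 42")             # leading column → index search
trailing = plan("created_at >= '2025-01-01'")  # trailing column alone → scan
print(leading)
print(trailing)
```

The first query seeks through the index; the second cannot, because customer_id, the leftmost column, is missing.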
Column Order Strategy
Put the most selective column first — the column that eliminates the most rows per value.
```sql
-- Scenario: 1M orders, 10K customers, 365 possible dates
-- orders.status has 4 values: ~250K rows per value (low selectivity)
-- orders.customer_id has 10K values: ~100 rows per value (high selectivity)

-- ✅ Good: high-selectivity column first
CREATE INDEX idx ON orders (customer_id, status);
-- Finding customer_id=42 reduces to ~100 rows immediately, then filter by status

-- ❌ Less good: low-selectivity column first
CREATE INDEX idx ON orders (status, customer_id);
-- Finding status='completed' still returns ~250K rows to scan
```

Exception: if your most common query filters on a low-selectivity column as an equality condition and a high-selectivity column as a range, put the equality column first regardless of selectivity.
```sql
-- Common query: find all completed orders for a date range
WHERE status = 'completed' AND created_at >= '2025-01-01'

-- ✅ Put the equality column first, then the range column
CREATE INDEX idx ON orders (status, created_at);
-- status='completed' reduces rows, then a B-Tree range scan on created_at
```

Covering Indexes: The Index That Answers Everything
A covering index contains all the columns a query needs — so the database never needs to touch the actual table rows. The index itself is the answer.
```sql
-- Query: find order amounts for a customer, sorted by date
SELECT amount, created_at
FROM orders
WHERE customer_id = 42
ORDER BY created_at DESC;

-- Covering index: includes all columns needed by this query
CREATE INDEX idx_covering
    ON orders (customer_id, created_at, amount);
--             ↑ WHERE      ↑ ORDER BY  ↑ SELECT

-- The database reads only the index, never touches the heap.
-- Faster: less I/O, better cache utilization.
```

When PostgreSQL uses a covering index, you will see Index Only Scan in the EXPLAIN output — the most efficient scan type. Without a covering index, it uses Index Scan (reads the index, then fetches each matching row from the heap).
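The same idea can be sketched in SQLite, whose planner prints "USING COVERING INDEX" where PostgreSQL prints Index Only Scan (the schema is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INT, "
    "created_at TEXT, amount REAL, notes TEXT)"
)
conn.execute("CREATE INDEX idx_covering ON orders (customer_id, created_at, amount)")

rows = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT amount, created_at FROM orders "
    "WHERE customer_id = 42 ORDER BY created_at DESC"
)
detail = " ".join(r[-1] for r in rows)
print(detail)  # the plan names the covering index; the table itself is never read
```

Note that the ORDER BY ... DESC also comes for free: the planner walks the index backwards instead of sorting.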
Partial Indexes: Index Only What You Query
A partial index indexes only the rows that satisfy a condition. Smaller index, faster lookups, less write overhead.
```sql
-- Only index unshipped orders — shipped orders are rarely queried by status
CREATE INDEX idx_unshipped
    ON orders (customer_id, created_at)
    WHERE shipped_at IS NULL;

-- Only index active users — deleted users are never queried in the app
CREATE INDEX idx_active_users
    ON users (email)
    WHERE deleted_at IS NULL;

-- Only index failed payments (monitoring dashboard)
CREATE INDEX idx_failed_payments
    ON payments (created_at, amount)
    WHERE status = 'failed';
```

If 95% of orders are shipped and only 5% are pending, an index on all orders is 20× larger than it needs to be for the query that only looks at pending orders. The partial index is smaller, fits in memory better, and stays more cache-friendly.
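SQLite also supports partial indexes, so the behavior is easy to demonstrate (illustrative schema). A partial index is used only when the query's WHERE clause implies the index's WHERE clause:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INT, "
    "created_at TEXT, shipped_at TEXT)"
)
conn.execute(
    "CREATE INDEX idx_unshipped ON orders (customer_id, created_at) "
    "WHERE shipped_at IS NULL"
)

def plan(sql):
    return " ".join(r[-1] for r in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Query condition implies shipped_at IS NULL → the small partial index is usable
p1 = plan("SELECT * FROM orders WHERE customer_id = 42 AND shipped_at IS NULL")
# No shipped_at condition → the partial index cannot answer this query
p2 = plan("SELECT * FROM orders WHERE customer_id = 42")
print(p1)
print(p2)
```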
Expression Indexes: Index Computed Values
An expression index stores the result of a function or expression rather than the raw column value. Use it when your WHERE clause applies a function to a column.
```sql
-- Query: case-insensitive email lookup
WHERE LOWER(email) = LOWER('Alice@Example.com')
-- Without an expression index: full table scan (the function on the column prevents index use)

-- With an expression index: index the computed value
CREATE INDEX idx_email_lower ON users (LOWER(email));

-- Now this query uses the index:
WHERE LOWER(email) = 'alice@example.com'

-- Other useful expression indexes:
CREATE INDEX idx_year ON orders (EXTRACT(YEAR FROM created_at));
CREATE INDEX idx_name_prefix ON users (LEFT(last_name, 3));
```

When Indexes Hurt: The Write Penalty
Every index you add makes writes slower. On every INSERT, UPDATE, or DELETE, the database must update every index on the table in addition to the table itself.
```
INSERT into a table with 5 indexes:
1. Write the new row to the heap
2. Update index 1 (B-Tree insert, potentially rebalance)
3. Update index 2
4. Update index 3
5. Update index 4
6. Update index 5

vs. INSERT into a table with 0 indexes:
1. Write the new row to the heap
```

For write-heavy tables (logging, event streams, audit trails), too many indexes can make writes significantly slower and increase lock contention.
The index audit checklist:
```sql
-- PostgreSQL: find indexes that are never used
SELECT
    schemaname,
    relname       AS tablename,
    indexrelname  AS indexname,
    idx_scan      AS times_used,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
-- Large, never-used indexes are pure write overhead — consider dropping them
```

The Index Decision Framework
```
Is the column used in WHERE, JOIN ON, or ORDER BY frequently?
├─ No  → Do NOT index (write overhead with no read benefit)
└─ Yes → Continue

What type of query?
├─ Equality only (=)     → Hash index (faster) or B-Tree
├─ Range (<, >, BETWEEN) → B-Tree only
├─ Prefix LIKE ('abc%')  → B-Tree
├─ Full-text search      → Full-text index (GIN on tsvector)
└─ Geospatial            → GiST index

Multiple columns in WHERE?
└─ Yes → Composite index, leftmost = most selective equality column

Query needs only indexed columns?
└─ Yes → Add extra columns to make it a covering index

Only a subset of rows is ever queried?
└─ Yes → Partial index with WHERE condition

Function applied to column in WHERE?
└─ Yes → Expression index on the function result

Table is write-heavy (>50% writes)?
├─ Be conservative — each index adds write overhead
├─ Use partial indexes to minimize index size
└─ Audit unused indexes regularly and drop them
```

The 5 Most Common Index Mistakes
Mistake 1 — Indexing every column "just in case." Each unused index wastes disk space and slows every write. Index based on actual query patterns, not intuition.
Mistake 2 — Wrong column order in composite index. An index on (status, customer_id) does not help a query filtering only on customer_id. The leftmost prefix rule is non-negotiable.
Mistake 3 — Applying a function to an indexed column in WHERE. WHERE YEAR(created_at) = 2025 cannot use an index on created_at. Rewrite as a range: WHERE created_at >= '2025-01-01' AND created_at < '2026-01-01'.
Mistake 4 — Not using covering indexes on hot paths. If your most-executed query fetches three columns, adding those three columns to the index eliminates heap fetches entirely. One EXPLAIN check can reveal this opportunity.
Mistake 5 — Never auditing indexes. Schema changes and evolving query patterns leave orphaned indexes behind — indexes that consume space and slow writes but serve no active query. Audit pg_stat_user_indexes quarterly.
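Mistake 3 is easy to verify yourself. In this SQLite sketch, strftime plays the role of YEAR (the schema is illustrative); the function call hides created_at from the planner, while the range form exposes it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, created_at TEXT, amount REAL)")
conn.execute("CREATE INDEX idx_created ON orders (created_at)")

def plan(where):
    rows = conn.execute("EXPLAIN QUERY PLAN SELECT * FROM orders WHERE " + where)
    return " ".join(r[-1] for r in rows)

# Function applied to the column → the planner cannot use idx_created
func_form = plan("strftime('%Y', created_at) = '2025'")
# Same predicate rewritten as a range → a B-Tree range scan on idx_created
range_form = plan("created_at >= '2025-01-01' AND created_at < '2026-01-01'")
print(func_form)
print(range_form)
```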
🧭 What's Next
Post 05: Transactions & ACID — race conditions in databases are silent and devastating; learn all four isolation levels and the specific concurrency bugs each one prevents
Related
SQL Fundamentals: SELECT, JOIN, GROUP BY and the Queries That Trip Everyone Up
SQL is easy to start and hard to master. This post covers every JOIN type, GROUP BY traps, subqueries, and the mental model that makes complex queries feel obvious.
Relational Model & Normalization: How to Design Tables That Last
Bad schema design creates bugs invisible until production. Learn the relational model from first principles, then normalize to 3NF/BCNF with real examples and trade-offs.
What is a Database & DBMS: Why Not Just Use Files?
Most developers use databases without knowing why they exist. Learn what a DBMS really does, why flat files fail at scale, and the trade-offs that make databases essential.