Database Partitioning and Sharding: Range, Hash and Consistent Hashing
Partitioning vs Sharding: The Distinction
Both terms describe splitting data into smaller pieces, but they operate at different levels:
Partitioning typically refers to splitting data within a single database server into logical segments. The database engine manages the partitions transparently — queries still use the same connection, same SQL, same schema. PostgreSQL's declarative partitioning is the clearest example.
Sharding refers to splitting data across multiple separate database servers (shards). Each shard is an independent database. The application (or a middleware layer) must route queries to the correct shard. Cross-shard queries require application-level coordination.
In practice, the terms are used interchangeably. This post covers both, starting with single-server partitioning and building up to multi-server sharding.
Single-Server Partitioning (PostgreSQL Declarative Partitioning)
PostgreSQL supports table partitioning natively. The database splits a logical table into multiple physical partitions based on a partition key, while presenting a single table interface to queries.
-- Range partitioning: orders split by month
CREATE TABLE orders (
id BIGSERIAL,
customer_id INTEGER NOT NULL,
amount NUMERIC(10,2) NOT NULL,
created_at TIMESTAMPTZ NOT NULL
) PARTITION BY RANGE (created_at);
-- Create monthly partitions
CREATE TABLE orders_2025_01 PARTITION OF orders
FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
CREATE TABLE orders_2025_02 PARTITION OF orders
FOR VALUES FROM ('2025-02-01') TO ('2025-03-01');
-- ... and so on
-- Query: PostgreSQL automatically routes to correct partition(s)
SELECT * FROM orders WHERE created_at >= '2025-01-01' AND created_at < '2025-02-01';
-- Only scans orders_2025_01 — partition pruning eliminates other partitions
-- Drop old data: instant (drop partition = drop table)
DROP TABLE orders_2024_01; -- instant, no DELETE neededBenefits of single-server partitioning:
Partition pruning: queries with partition key in WHERE skip irrelevant partitions
Faster maintenance:
DROP TABLEon a partition is instant vsDELETEon millions of rowsVacuum efficiency: VACUUM runs per partition, not the entire table
Index size: indexes are per partition — smaller, faster
List partitioning — partition by discrete values:
CREATE TABLE users (
id BIGSERIAL,
region VARCHAR(10) NOT NULL,
name VARCHAR(100)
) PARTITION BY LIST (region);
CREATE TABLE users_us PARTITION OF users FOR VALUES IN ('us-east', 'us-west');
CREATE TABLE users_eu PARTITION OF users FOR VALUES IN ('eu-west', 'eu-central');
CREATE TABLE users_apac PARTITION OF users FOR VALUES IN ('ap-southeast', 'ap-northeast');Hash partitioning — distribute evenly across N partitions:
CREATE TABLE events (
id BIGSERIAL,
user_id INTEGER NOT NULL,
payload JSONB
) PARTITION BY HASH (user_id);
CREATE TABLE events_0 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE events_1 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE events_2 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE events_3 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 3);Multi-Server Sharding: The Three Strategies
When data must be distributed across multiple physical servers, you need a sharding strategy. The strategy determines how data is assigned to shards and how the application routes queries.
Strategy 1: Range Sharding
Assign each shard a contiguous range of the shard key.
Shard 0: customer_id 1 – 1,000,000
Shard 1: customer_id 1,000,001 – 2,000,000
Shard 2: customer_id 2,000,001 – 3,000,000
Shard 3: customer_id 3,000,001 – ∞Routing logic:
def get_shard(customer_id: int) -> int:
if customer_id <= 1_000_000: return 0
if customer_id <= 2_000_000: return 1
if customer_id <= 3_000_000: return 2
return 3
shard_connections = {0: db0, 1: db1, 2: db2, 3: db3}
conn = shard_connections[get_shard(customer_id)]The hot spot problem:
Range sharding creates a critical weakness: write hot spots. New customers get sequential IDs — all new writes go to the last shard. Shard 3 receives 100% of new customer writes while shards 0–2 are idle.
Time → New customers streaming in (IDs 3,000,001, 3,000,002, 3,000,003...)
Shard 0: 0% writes ← cold
Shard 1: 0% writes ← cold
Shard 2: 0% writes ← cold
Shard 3: 100% writes ← hot (all new IDs land here)When range sharding works well: When queries frequently access ranges of the shard key. Time-series data sharded by timestamp is natural — most queries ask for "last 7 days" which lands on one or two shards (sequential access pattern = no cross-shard scatter).
orders sharded by month:
Shard 0: Jan 2025 orders
Shard 1: Feb 2025 orders
Shard 2: Mar 2025 orders
Query: "all orders this month" → hits Shard 2 only ✅
Query: "all orders last 90 days" → hits Shards 0, 1, 2 ✅
Query: "all orders by customer 42 ever" → hits ALL shards ❌ (scatter-gather)Strategy 2: Hash Sharding
Compute a hash of the shard key modulo the number of shards. Each shard receives a uniform, pseudo-random distribution of keys.
shard = hash(customer_id) % num_shards
customer_id=1: hash=0xA3B2 → 0xA3B2 % 4 = 2 → Shard 2
customer_id=2: hash=0x7F91 → 0x7F91 % 4 = 1 → Shard 1
customer_id=3: hash=0xC401 → 0xC401 % 4 = 1 → Shard 1
customer_id=4: hash=0x2D88 → 0x2D88 % 4 = 0 → Shard 0Routing logic:
import hashlib
def get_shard(customer_id: int, num_shards: int) -> int:
key = str(customer_id).encode()
hash_val = int(hashlib.md5(key).hexdigest(), 16)
return hash_val % num_shards
conn = shard_connections[get_shard(customer_id, num_shards=4)]Benefits:
Even data distribution — no hot spots
Simple routing logic
The resharding problem:
Hash sharding has a fatal flaw when you need to add shards. When num_shards changes from 4 to 5, almost every key moves to a different shard:
With 4 shards: customer_id=42 → hash % 4 = 2 → Shard 2
With 5 shards: customer_id=42 → hash % 5 = 4 → Shard 4
~80% of all data must be migrated when adding one shard.
During migration: which shard do you read from?
The answer requires careful dual-read logic and is extremely painful.Strategy 3: Consistent Hashing
Consistent hashing solves the resharding problem by mapping both data keys and shard nodes onto a circular hash ring. Adding or removing a shard only moves a fraction of the data.
The hash ring (0 to 2^32):
0
│
┌────┴────┐
│ ↑ │
2^32┤ ├ 2^30
│ Ring │
│ │
└────┬────┘
│
2^31
Nodes placed at positions on the ring:
Shard A: position 0x1000 (hash("shard-A"))
Shard B: position 0x5000 (hash("shard-B"))
Shard C: position 0x9000 (hash("shard-C"))
Shard D: position 0xD000 (hash("shard-D"))
Data routing:
hash(customer_id) = 0x3500 → walk clockwise → first node is Shard B at 0x5000
hash(customer_id) = 0x7000 → walk clockwise → first node is Shard C at 0x9000Adding a shard:
Add Shard E at position 0x3000 (between Shard A at 0x1000 and Shard B at 0x5000):
Before: keys in range [0x1000, 0x5000) → Shard B
After: keys in range [0x1000, 0x3000) → Shard E ← moved
keys in range [0x3000, 0x5000) → Shard B ← unchanged
Only keys in [0x1000, 0x3000) move — roughly 1/N of all data,
where N is the total number of shards. All other keys are unaffected.Virtual nodes (vnodes): A single physical shard is assigned multiple positions on the ring (virtual nodes). This improves distribution uniformity — instead of 4 large arcs (one per shard), each shard owns many small arcs scattered around the ring.
Without vnodes (4 shards, 4 positions):
Shard A: 1 arc of 25% of ring
Shard B: 1 arc of 25% of ring
Problem: uneven distribution if hashes cluster
With vnodes (4 shards, 150 positions each = 600 total):
Shard A: 150 small arcs scattered across ring
Shard B: 150 small arcs scattered across ring
Result: near-uniform distribution across all shards
When a new shard is added:
Takes a fraction of vnodes from each existing shard
Data migration is distributed across all existing shards
→ No single shard bears the full migration burdenUsed by: Cassandra, DynamoDB, Riak, Amazon S3.
Cross-Shard Query Problem
Sharding eliminates the problem of too much data on one server. It creates a new problem: queries that need data from multiple shards.
Query: "Top 10 customers by total spend (all time)"
Without sharding:
SELECT customer_id, SUM(amount) FROM orders
GROUP BY customer_id ORDER BY SUM(amount) DESC LIMIT 10;
← One query, one result
With sharding (4 shards):
Step 1: Send query to all 4 shards (scatter)
Shard 0: top 10 from its data
Shard 1: top 10 from its data
Shard 2: top 10 from its data
Shard 3: top 10 from its data
Step 2: Application merges 40 results, re-ranks, returns top 10 (gather)
Problem: "top 10 from shard 0" may not include the globally top customer
because that customer's data spans multiple shards.
Need ALL customers from ALL shards, not just top 10 per shard.
Full scatter-gather: query all shards, merge in application
→ 4× the data transfer, 4× the query workThe shard key rule: Choose a shard key such that the most common queries touch exactly one shard. If your application queries primarily by customer_id, shard by customer_id. If you frequently query by order_date, shard by order_date.
There is always a query pattern that crosses shards. Design for the 90% case, accept that the 10% case requires scatter-gather or a separate denormalized read model.
Resharding: The Painful Inevitable
Every sharding strategy eventually requires adding shards as data grows. Resharding is one of the most operationally dangerous database operations.
The safest resharding approach — double-write + backfill:
Phase 1: Add new shards, start double-writing
New writes: go to BOTH old and new shard layout
Old data: still only on old shards
Phase 2: Backfill old data to new layout
Background job reads from old layout, writes to new layout
Runs slowly to minimize production impact
Phase 3: Verify consistency
Compare record counts, checksums between old and new layouts
Phase 4: Switch reads to new layout
Update routing logic to read from new layout
Monitor for inconsistencies
Phase 5: Stop double-writing
Writes now go only to new layout
Old layout is stale
Phase 6: Decommission old layout
After verification period, delete old data
Total duration: days to weeks for large datasets
Risk window: Phase 4 — reads switch before full backfill is completeTools that manage resharding:
Vitess (YouTube): MySQL sharding middleware with transparent resharding
Citus (PostgreSQL): distributed PostgreSQL with shard rebalancing
PlanetScale: MySQL-compatible with online schema changes and resharding
Choosing a Shard Key
The shard key decision is the most consequential design choice in a sharded system. A poor shard key cannot be changed without a full resharding operation.
Good shard key properties:
✅ High cardinality: many distinct values (user_id: millions of values)
Low cardinality causes hot shards (status: only 'active'/'inactive' = 2 shards max)
✅ Uniform distribution: values spread evenly across the key space
Sequential IDs with range sharding → hot spot on latest shard
✅ Queries mostly filter by shard key: minimizes cross-shard scatter
Shard by customer_id if most queries are "all orders for customer X"
✅ Immutable: the shard key value never changes for a row
If customer_id changes, the row must migrate to a different shard
(This essentially never happens with surrogate keys)
❌ Avoid timestamp as sole shard key (range sharding):
All current writes go to the "today" shard → write hot spot
❌ Avoid low-cardinality columns (status, region with few values):
Too few shards, uneven distribution
❌ Avoid columns that frequently appear in joins:
Joining by non-shard-key requires cross-shard scatter🧭 What's Next
Post 17: CAP Theorem & PACELC — sharding forces you to confront the fundamental trade-offs of distributed systems; CAP and PACELC are the frameworks that make those trade-offs explicit and the decisions around them deliberate
Related
Distributed Transactions: Two-Phase Commit, Saga and When to Use Each
Distributed transactions are hard. 2PC gives safety at the cost of availability. Saga gives availability at the cost of complexity. This post explains both and when each is right.
CAP Theorem and PACELC: What They Mean for Real Database Decisions
CAP theorem is misunderstood constantly. PACELC is more useful but almost unknown. This post explains both with concrete database examples and the decisions they actually inform.
Failover, Consensus and High Availability: Raft, Paxos and Automatic Failover
Automatic failover sounds simple — consensus makes it hard. This post explains leader election, Raft vs Paxos trade-offs, and what high availability means in distributed databases.
Comments