Schema Design: Normalization vs Denormalization and When to Choose Each
The Schema That Works Until It Does Not
Most schemas start normalized. A users table, an orders table, a products table — clean, consistent, no duplicated data. Then the product grows. A dashboard query that joins six tables starts taking 4 seconds. The engineering team adds a denormalized summary table. Then another. Then a cache layer. Six months later, nobody is sure which source of truth is authoritative.
This is not inevitable. It is the result of denormalizing reactively — after performance problems appear — rather than proactively, with a clear understanding of what is being traded.
This post gives you the framework to make schema decisions deliberately: when to normalize, when to denormalize, and how to handle the patterns — JSONB, materialized views, soft deletes, polymorphic associations — that appear in every production codebase and are rarely discussed in textbooks.
The Normalization–Denormalization Spectrum
Schema design exists on a spectrum. Neither extreme is correct:
Full Normalization Full Denormalization
────────────────────────────────────────────────────────────────────
✅ Zero data duplication ✅ Reads are fastest
✅ Updates touch one place ✅ No JOINs needed
✅ Constraints are easily enforced ✅ Simple query structure
❌ Many JOINs for every read ❌ Updates touch many rows
❌ Reads are slower under load ❌ Inconsistency risk on writes
❌ Complex queries for simple questions ❌ Schema changes are painful
Every real production schema sits somewhere in the middle.
The question is not "should I normalize?" but "where on this spectrum
does each part of my schema belong, and why?"The answer depends on two variables: read/write ratio and consistency requirements.
When to Normalize
Normalize when the cost of inconsistency is high and the read/write pattern does not demand extreme read performance.
Signal 1 — Data that changes. If a customer changes their email address, you want to update it in one place. If email is denormalized into 12 tables, an update requires 12 writes — miss one and you have stale data silently floating through your system.
Signal 2 — Data with referential integrity requirements. Foreign keys enforce that an order can only reference an existing customer. Denormalized data cannot have foreign keys — the constraint simply does not apply when the data is duplicated.
Signal 3 — Write-heavy workloads. Each denormalization adds write overhead. A table that receives thousands of inserts per second cannot afford to maintain multiple derived copies of the same data synchronously.
Signal 4 — Unpredictable query patterns. If you do not know in advance how data will be queried — common in early-stage products — normalization preserves flexibility. Denormalized schemas optimize for known query patterns; change the queries and the denormalization becomes a liability.
-- Well-normalized schema for an e-commerce system
CREATE TABLE customers (
id SERIAL PRIMARY KEY,
email VARCHAR(255) UNIQUE NOT NULL,
name VARCHAR(100) NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE TABLE orders (
id SERIAL PRIMARY KEY,
customer_id INTEGER NOT NULL REFERENCES customers(id),
status VARCHAR(20) NOT NULL DEFAULT 'pending',
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE TABLE order_items (
order_id INTEGER NOT NULL REFERENCES orders(id),
product_id INTEGER NOT NULL REFERENCES products(id),
quantity INTEGER NOT NULL CHECK (quantity > 0),
unit_price NUMERIC(10,2) NOT NULL, -- snapshot price at time of order
PRIMARY KEY (order_id, product_id)
);
-- unit_price is intentionally denormalized here:
-- product price can change after an order is placed.
-- We need the price AT THE TIME OF THE ORDER — not today's price.
-- This is a deliberate denormalization for correctness, not performance.When to Denormalize
Denormalize when a specific, well-understood query pattern is causing measurable performance problems and the consistency trade-off is acceptable.
The key word is measurable. Do not denormalize speculatively. Identify the slow query, confirm normalization is the cause, and then denormalize that specific access pattern.
Signal 1 — A read path is provably too slow. EXPLAIN ANALYZE shows a query joining 4–5 large tables with sequential scans. Indexes cannot help because too many tables are involved and the result set is large. Denormalization eliminates the JOINs entirely.
Signal 2 — Read/write ratio is heavily skewed toward reads. A product catalog updated once a day but read millions of times per hour. The cost of maintaining denormalized copies on writes is amortized over enormous read savings.
Signal 3 — The derived value is expensive to compute. Aggregate values — total order count per customer, total revenue per product — are expensive to compute from normalized tables at query time. Storing them denormalized and updating them incrementally is cheaper.
Signal 4 — The data is immutable or append-only. Logs, events, audit trails do not change after being written. Denormalizing into them has no consistency risk — there are no updates to maintain.
Denormalization Patterns
Pattern 1: Storing Aggregates
Instead of computing COUNT(*) or SUM() at query time, maintain the aggregate as a column on the parent table and update it incrementally.
-- Add denormalized counters to customers
ALTER TABLE customers
ADD COLUMN order_count INTEGER NOT NULL DEFAULT 0,
ADD COLUMN total_spent NUMERIC(12,2) NOT NULL DEFAULT 0;
-- Keep them in sync with a trigger
CREATE OR REPLACE FUNCTION update_customer_stats()
RETURNS TRIGGER AS $$
BEGIN
IF TG_OP = 'INSERT' THEN
UPDATE customers
SET order_count = order_count + 1,
total_spent = total_spent + NEW.amount
WHERE id = NEW.customer_id;
ELSIF TG_OP = 'DELETE' THEN
UPDATE customers
SET order_count = order_count - 1,
total_spent = total_spent - OLD.amount
WHERE id = OLD.customer_id;
ELSIF TG_OP = 'UPDATE' AND OLD.amount != NEW.amount THEN
UPDATE customers
SET total_spent = total_spent + (NEW.amount - OLD.amount)
WHERE id = NEW.customer_id;
END IF;
RETURN NULL;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER trg_customer_stats
AFTER INSERT OR UPDATE OR DELETE ON orders
FOR EACH ROW EXECUTE FUNCTION update_customer_stats();Now SELECT order_count, total_spent FROM customers WHERE id = 42 is an O(1) primary key lookup instead of an aggregate query over all orders.
Trade-off: The trigger adds overhead to every order write. If orders are written at high volume, this can become a bottleneck. Consider updating counts asynchronously (via an event queue) instead of synchronously in the trigger.
Pattern 2: Duplicating Columns for Read Performance
Copy a column from a related table into the querying table to eliminate a JOIN on the hot path.
-- Hot path: show order history with customer name
-- Without denormalization: JOIN required
SELECT o.id, o.amount, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.customer_id = 42;
-- With denormalization: no JOIN
ALTER TABLE orders ADD COLUMN customer_name VARCHAR(100);
-- Populate on insert
INSERT INTO orders (customer_id, customer_name, amount)
SELECT 42, name, 299.99 FROM customers WHERE id = 42;
-- Problem: if customer changes their name, orders show stale name
-- Acceptable for: historical records (order was placed by "Alice")
-- Not acceptable for: current state queries (customer's current name)This pattern is appropriate for immutable relationships — the customer name at the time of the order is a historical fact, not a current lookup.
Pattern 3: Summary Tables
A separate table that pre-computes aggregates over a large base table. Rebuilt periodically (nightly, hourly) rather than maintained in real-time.
-- Summary table: daily revenue per product category
CREATE TABLE daily_revenue_summary (
date DATE NOT NULL,
category_id INTEGER NOT NULL REFERENCES categories(id),
order_count INTEGER NOT NULL,
revenue NUMERIC(15,2) NOT NULL,
PRIMARY KEY (date, category_id)
);
-- Refresh job (run nightly via pg_cron or an external scheduler)
INSERT INTO daily_revenue_summary (date, category_id, order_count, revenue)
SELECT
o.created_at::DATE,
p.category_id,
COUNT(DISTINCT o.id),
SUM(oi.unit_price * oi.quantity)
FROM orders o
JOIN order_items oi ON o.id = oi.order_id
JOIN products p ON oi.product_id = p.id
WHERE o.created_at::DATE = CURRENT_DATE - 1
AND o.status = 'completed'
GROUP BY 1, 2
ON CONFLICT (date, category_id) DO UPDATE SET
order_count = EXCLUDED.order_count,
revenue = EXCLUDED.revenue;Dashboard queries hit the summary table instead of the multi-table join. The data is one day stale — acceptable for analytics, unacceptable for real-time operations.
Materialized Views: Denormalization With Automatic Refresh
A materialized view stores the result of a query on disk, like a cache that the database manages. Unlike a summary table, the database knows how to refresh it.
-- Create a materialized view for customer order statistics
CREATE MATERIALIZED VIEW customer_order_stats AS
SELECT
c.id AS customer_id,
c.name,
c.email,
COUNT(o.id) AS order_count,
COALESCE(SUM(o.amount), 0) AS total_spent,
MAX(o.created_at) AS last_order_at,
AVG(o.amount) AS avg_order_value
FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id
GROUP BY c.id, c.name, c.email
WITH DATA; -- populate immediately
-- Create an index on the materialized view for fast lookups
CREATE INDEX ON customer_order_stats (customer_id);
CREATE INDEX ON customer_order_stats (total_spent DESC);
-- Query the materialized view: instant, no JOIN
SELECT * FROM customer_order_stats
WHERE total_spent > 1000
ORDER BY total_spent DESC;
-- Refresh when data changes (full refresh — locks the view briefly)
REFRESH MATERIALIZED VIEW customer_order_stats;
-- Concurrent refresh: allows reads during refresh (requires unique index)
CREATE UNIQUE INDEX ON customer_order_stats (customer_id);
REFRESH MATERIALIZED VIEW CONCURRENTLY customer_order_stats;Materialized view vs summary table:
Materialized View | Summary Table | |
|---|---|---|
Definition | SQL query stored in DB | Regular table with INSERT logic |
Refresh |
| Custom job / trigger |
Concurrent reads during refresh | Yes (with CONCURRENTLY) | Yes always |
Incremental refresh | No (full refresh only in PostgreSQL) | Yes (manual) |
Index support | Yes | Yes |
Complexity | Low | Higher |
Use materialized views for moderately complex aggregations refreshed on a schedule. Use summary tables when you need incremental updates or more control over the refresh logic.
JSONB: Semi-Structured Data in a Relational Table
PostgreSQL's JSONB type stores JSON data in a binary format that supports indexing. It bridges the gap between relational and document models.
When JSONB makes sense:
Attributes that vary per row and cannot be normalized (product metadata, user preferences, event payloads)
Rapid iteration where the schema is not yet known
Storing external API responses that have inconsistent structure
-- Products with highly variable attributes per category
CREATE TABLE products (
id SERIAL PRIMARY KEY,
name VARCHAR(200) NOT NULL,
category_id INTEGER NOT NULL,
price NUMERIC(10,2) NOT NULL,
attributes JSONB -- flexible per-category metadata
);
-- Electronics: {"ram": "16GB", "storage": "512GB", "screen": "14 inch"}
-- Clothing: {"size": "M", "color": "blue", "material": "cotton"}
INSERT INTO products (name, category_id, price, attributes)
VALUES
('MacBook Pro', 1, 2499.00, '{"ram": "16GB", "storage": "512GB"}'),
('Blue T-Shirt', 2, 29.00, '{"size": "M", "color": "blue"}');
-- Query JSONB fields
SELECT name, attributes->>'ram' AS ram
FROM products
WHERE category_id = 1
AND (attributes->>'ram') = '16GB';
-- Index a specific JSONB field (btree on extracted value)
CREATE INDEX idx_product_ram
ON products ((attributes->>'ram'))
WHERE category_id = 1;
-- GIN index: indexes ALL keys and values (supports @>, ?, ?| operators)
CREATE INDEX idx_product_attrs_gin ON products USING GIN (attributes);
-- Find products with a specific key
SELECT * FROM products WHERE attributes ? 'ram';
-- Find products with specific key-value pairs
SELECT * FROM products WHERE attributes @> '{"color": "blue"}';JSONB operators:
->returns JSON:attributes->'ram'→"16GB"(JSON string)->>returns text:attributes->>'ram'→16GB(text)#>path:attributes#>'{specs,ram}'→ nested value@>contains:attributes @> '{"color": "blue"}'?key exists:attributes ? 'ram'
When NOT to use JSONB: If you query the same fields consistently, normalize them into real columns. A real column with a B-Tree index is faster and cleaner than a JSONB field with a GIN index. JSONB is for genuinely variable or unknown-at-schema-time attributes.
Soft Deletes: The Pattern Everybody Uses Wrong
A soft delete marks a row as deleted without removing it from the database.
-- Common implementation
ALTER TABLE users ADD COLUMN deleted_at TIMESTAMPTZ DEFAULT NULL;
-- Soft delete
UPDATE users SET deleted_at = NOW() WHERE id = 42;
-- "Active" users query
SELECT * FROM users WHERE deleted_at IS NULL;The problems:
Problem 1 — Every query needs the filter. Forget WHERE deleted_at IS NULL once and deleted users appear in your UI. In ORMs, this is solved with default scopes, but raw SQL queries are always at risk.
Problem 2 — UNIQUE constraints break. A user is soft-deleted, then a new user registers with the same email. The UNIQUE constraint on email fires — the old (deleted) row is still there.
-- Fix: partial unique index that excludes deleted rows
CREATE UNIQUE INDEX idx_users_email_active
ON users (email)
WHERE deleted_at IS NULL;
-- Now two rows with the same email can exist: one deleted, one activeProblem 3 — Indexes become bloated. The partial index above helps, but all queries still scan deleted rows unless the index explicitly excludes them.
-- Add deleted_at to partial indexes throughout your schema
CREATE INDEX idx_orders_customer
ON orders (customer_id, created_at)
WHERE deleted_at IS NULL;Problem 4 — Table grows forever. Soft-deleted rows accumulate. Use a background job to hard-delete rows that have been soft-deleted for more than 90 days, and archive them to a separate deleted_users table first if audit trail is needed.
Alternative: use a separate archived table
-- On delete: move row to archive table
INSERT INTO users_archive SELECT *, NOW() AS deleted_at FROM users WHERE id = 42;
DELETE FROM users WHERE id = 42;
-- Pros: main table stays clean, indexes stay lean, UNIQUE constraints work
-- Cons: need to query both tables for historical dataPolymorphic Associations: The Pattern to Avoid
A polymorphic association is a foreign key that can reference multiple tables — comments that belong to either a post or a product, for example.
-- Common anti-pattern: type + id polymorphic association
CREATE TABLE comments (
id SERIAL PRIMARY KEY,
body TEXT NOT NULL,
commentable_type VARCHAR(50), -- 'Post' or 'Product'
commentable_id INTEGER -- id in the referenced table
-- No foreign key possible — which table does this reference?
);Why this is problematic:
No foreign key constraint possible — database cannot enforce referential integrity
No join possible without dynamic SQL
commentable_id = 5is ambiguous — is it post 5 or product 5?Adding a new commentable type requires application code changes, not just schema changes
Better pattern 1: separate join tables
CREATE TABLE post_comments (
comment_id INTEGER REFERENCES comments(id),
post_id INTEGER REFERENCES posts(id),
PRIMARY KEY (comment_id, post_id)
);
CREATE TABLE product_comments (
comment_id INTEGER REFERENCES comments(id),
product_id INTEGER REFERENCES products(id),
PRIMARY KEY (comment_id, product_id)
);
-- Full referential integrity. Clean joins. Explicit schema.Better pattern 2: nullable foreign keys
CREATE TABLE comments (
id SERIAL PRIMARY KEY,
body TEXT NOT NULL,
post_id INTEGER REFERENCES posts(id),
product_id INTEGER REFERENCES products(id),
-- Enforce exactly one is set:
CONSTRAINT exactly_one_parent CHECK (
(post_id IS NOT NULL)::INT + (product_id IS NOT NULL)::INT = 1
)
);The Schema Design Decision Framework
NEW COLUMN OR TABLE — ask these questions before deciding:
1. How often does this data change?
├─ Frequently (names, prices, statuses) → Normalize — update one place
└─ Rarely or never (order snapshot price) → Can denormalize safely
2. Is it derived from other data (count, sum, average)?
├─ Low query frequency → Compute at query time (normalize)
├─ High query frequency, updated infrequently → Materialized view
└─ High query frequency, updated frequently → Trigger-maintained counter
3. Does it have variable structure per row?
├─ Yes, truly variable → JSONB
└─ No, consistent structure → Real columns (faster, indexable)
4. Do queries on this data join 3+ large tables?
├─ No → Stay normalized
└─ Yes, on a hot path → Measure first; denormalize if confirmed slow
5. Is "deleted" a state, not an action?
├─ Need full audit trail → Soft delete + archive job
└─ No audit requirement → Hard delete (simplest, cleanest)
6. Does a foreign key reference multiple tables?
└─ Always → Separate join tables or nullable FKs; never polymorphic type+idChapter 2 Complete
You have finished the SQL Deep Dive chapter:
✅ Post 06: Advanced SQL — Window Functions, CTEs & Recursive Queries
✅ Post 07: Query Optimization — EXPLAIN Plans & Slow Query Fixes
✅ Post 08: Schema Design — Normalization vs Denormalization
Chapter 1 gave you the fundamentals. Chapter 2 gave you the tools to write SQL that performs well and schemas that hold up under real-world load. Chapter 3 goes a level deeper — inside the database engine itself: how data is physically stored, why some writes are faster than others, and how the database recovers from crashes without losing a single committed transaction.
🧭 What's Next — Chapter 3: Storage Engine Internals
Post 09: How Databases Store Data — pages, heap files, and the difference between row-oriented and column-oriented storage that determines whether your query takes milliseconds or minutes
Related
Query Optimization: How to Read EXPLAIN Plans and Fix Slow Queries
A slow query in production is a fire. EXPLAIN plans show exactly where the database wastes time — if you know how to read them. This post teaches you how, with real examples.
Advanced SQL: Window Functions, CTEs and Recursive Queries
Window functions and CTEs turn 20-line subquery nightmares into clean SQL. Master ROW_NUMBER, RANK, LAG/LEAD, recursive CTEs, and the patterns that appear in real codebases.
Transactions & ACID: Isolation Levels, Locking and the Bugs They Prevent
Race conditions in databases are silent and devastating. This post explains ACID, all four isolation levels, and the concurrency bugs each level prevents — with examples.
Comments