Schema Design: Normalization vs Denormalization and When to Choose Each

The Schema That Works Until It Does Not

Most schemas start normalized. A users table, an orders table, a products table — clean, consistent, no duplicated data. Then the product grows. A dashboard query that joins six tables starts taking 4 seconds. The engineering team adds a denormalized summary table. Then another. Then a cache layer. Six months later, nobody is sure which source of truth is authoritative.

This is not inevitable. It is the result of denormalizing reactively — after performance problems appear — rather than proactively, with a clear understanding of what is being traded.

This post gives you the framework to make schema decisions deliberately: when to normalize, when to denormalize, and how to handle the patterns — JSONB, materialized views, soft deletes, polymorphic associations — that appear in every production codebase and are rarely discussed in textbooks.

The Normalization–Denormalization Spectrum

Schema design exists on a spectrum. Neither extreme is correct:

Full Normalization                              Full Denormalization
────────────────────────────────────────────────────────────────────
✅ Zero data duplication                    ✅ Reads are fastest
✅ Updates touch one place                  ✅ No JOINs needed
✅ Constraints are easily enforced          ✅ Simple query structure
❌ Many JOINs for every read               ❌ Updates touch many rows
❌ Reads are slower under load             ❌ Inconsistency risk on writes
❌ Complex queries for simple questions    ❌ Schema changes are painful

Every real production schema sits somewhere in the middle.
The question is not "should I normalize?" but "where on this spectrum
does each part of my schema belong, and why?"

The answer depends on two variables: read/write ratio and consistency requirements.

When to Normalize

Normalize when the cost of inconsistency is high and the read/write pattern does not demand extreme read performance.

Signal 1 — Data that changes. If a customer changes their email address, you want to update it in one place. If email is denormalized into 12 tables, an update requires 12 writes — miss one and you have stale data silently floating through your system.

Signal 2 — Data with referential integrity requirements. Foreign keys enforce that an order can only reference an existing customer. Denormalized data cannot have foreign keys — the constraint simply does not apply when the data is duplicated.

Signal 3 — Write-heavy workloads. Each denormalization adds write overhead. A table that receives thousands of inserts per second cannot afford to maintain multiple derived copies of the same data synchronously.

Signal 4 — Unpredictable query patterns. If you do not know in advance how data will be queried — common in early-stage products — normalization preserves flexibility. Denormalized schemas optimize for known query patterns; change the queries and the denormalization becomes a liability.

-- Well-normalized schema for an e-commerce system
CREATE TABLE customers (
    id         SERIAL PRIMARY KEY,
    email      VARCHAR(255) UNIQUE NOT NULL,
    name       VARCHAR(100) NOT NULL,
    created_at TIMESTAMPTZ  NOT NULL DEFAULT NOW()
);

CREATE TABLE orders (
    id          SERIAL PRIMARY KEY,
    customer_id INTEGER      NOT NULL REFERENCES customers(id),
    status      VARCHAR(20)  NOT NULL DEFAULT 'pending',
    created_at  TIMESTAMPTZ  NOT NULL DEFAULT NOW()
);

CREATE TABLE order_items (
    order_id    INTEGER      NOT NULL REFERENCES orders(id),
    product_id  INTEGER      NOT NULL REFERENCES products(id),
    quantity    INTEGER      NOT NULL CHECK (quantity > 0),
    unit_price  NUMERIC(10,2) NOT NULL,  -- snapshot price at time of order
    PRIMARY KEY (order_id, product_id)
);

-- unit_price is intentionally denormalized here:
-- product price can change after an order is placed.
-- We need the price AT THE TIME OF THE ORDER — not today's price.
-- This is a deliberate denormalization for correctness, not performance.

When to Denormalize

Denormalize when a specific, well-understood query pattern is causing measurable performance problems and the consistency trade-off is acceptable.

The key word is measurable. Do not denormalize speculatively. Identify the slow query, confirm normalization is the cause, and then denormalize that specific access pattern.

Signal 1 — A read path is provably too slow. EXPLAIN ANALYZE shows a query joining 4–5 large tables with sequential scans. Indexes cannot help because too many tables are involved and the result set is large. Denormalization eliminates the JOINs entirely.

Signal 2 — Read/write ratio is heavily skewed toward reads. A product catalog updated once a day but read millions of times per hour. The cost of maintaining denormalized copies on writes is amortized over enormous read savings.

Signal 3 — The derived value is expensive to compute. Aggregate values — total order count per customer, total revenue per product — are expensive to compute from normalized tables at query time. Storing them denormalized and updating them incrementally is cheaper.

Signal 4 — The data is immutable or append-only. Logs, events, audit trails do not change after being written. Denormalizing into them has no consistency risk — there are no updates to maintain.

Denormalization Patterns

Pattern 1: Storing Aggregates

Instead of computing COUNT(*) or SUM() at query time, maintain the aggregate as a column on the parent table and update it incrementally.

-- Add denormalized counters to customers
ALTER TABLE customers
    ADD COLUMN order_count    INTEGER NOT NULL DEFAULT 0,
    ADD COLUMN total_spent    NUMERIC(12,2) NOT NULL DEFAULT 0;

-- Keep them in sync with a trigger
CREATE OR REPLACE FUNCTION update_customer_stats()
RETURNS TRIGGER AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE customers
        SET order_count = order_count + 1,
            total_spent = total_spent + NEW.amount
        WHERE id = NEW.customer_id;
    ELSIF TG_OP = 'DELETE' THEN
        UPDATE customers
        SET order_count = order_count - 1,
            total_spent = total_spent - OLD.amount
        WHERE id = OLD.customer_id;
    ELSIF TG_OP = 'UPDATE' AND OLD.amount != NEW.amount THEN
        UPDATE customers
        SET total_spent = total_spent + (NEW.amount - OLD.amount)
        WHERE id = NEW.customer_id;
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_customer_stats
AFTER INSERT OR UPDATE OR DELETE ON orders
FOR EACH ROW EXECUTE FUNCTION update_customer_stats();

Now SELECT order_count, total_spent FROM customers WHERE id = 42 is an O(1) primary key lookup instead of an aggregate query over all orders.

Trade-off: The trigger adds overhead to every order write. If orders are written at high volume, this can become a bottleneck. Consider updating counts asynchronously (via an event queue) instead of synchronously in the trigger.

Pattern 2: Duplicating Columns for Read Performance

Copy a column from a related table into the querying table to eliminate a JOIN on the hot path.

-- Hot path: show order history with customer name
-- Without denormalization: JOIN required
SELECT o.id, o.amount, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.customer_id = 42;

-- With denormalization: no JOIN
ALTER TABLE orders ADD COLUMN customer_name VARCHAR(100);

-- Populate on insert
INSERT INTO orders (customer_id, customer_name, amount)
SELECT 42, name, 299.99 FROM customers WHERE id = 42;

-- Problem: if customer changes their name, orders show stale name
-- Acceptable for: historical records (order was placed by "Alice")
-- Not acceptable for: current state queries (customer's current name)

This pattern is appropriate for immutable relationships — the customer name at the time of the order is a historical fact, not a current lookup.

Pattern 3: Summary Tables

A separate table that pre-computes aggregates over a large base table. Rebuilt periodically (nightly, hourly) rather than maintained in real-time.

-- Summary table: daily revenue per product category
CREATE TABLE daily_revenue_summary (
    date        DATE        NOT NULL,
    category_id INTEGER     NOT NULL REFERENCES categories(id),
    order_count INTEGER     NOT NULL,
    revenue     NUMERIC(15,2) NOT NULL,
    PRIMARY KEY (date, category_id)
);

-- Refresh job (run nightly via pg_cron or an external scheduler)
INSERT INTO daily_revenue_summary (date, category_id, order_count, revenue)
SELECT
    o.created_at::DATE,
    p.category_id,
    COUNT(DISTINCT o.id),
    SUM(oi.unit_price * oi.quantity)
FROM orders o
JOIN order_items oi ON o.id = oi.order_id
JOIN products p    ON oi.product_id = p.id
WHERE o.created_at::DATE = CURRENT_DATE - 1
  AND o.status = 'completed'
GROUP BY 1, 2
ON CONFLICT (date, category_id) DO UPDATE SET
    order_count = EXCLUDED.order_count,
    revenue     = EXCLUDED.revenue;

Dashboard queries hit the summary table instead of the multi-table join. The data is one day stale — acceptable for analytics, unacceptable for real-time operations.

Materialized Views: Denormalization With Automatic Refresh

A materialized view stores the result of a query on disk, like a cache that the database manages. Unlike a summary table, the database knows how to refresh it.

-- Create a materialized view for customer order statistics
CREATE MATERIALIZED VIEW customer_order_stats AS
SELECT
    c.id                            AS customer_id,
    c.name,
    c.email,
    COUNT(o.id)                     AS order_count,
    COALESCE(SUM(o.amount), 0)      AS total_spent,
    MAX(o.created_at)               AS last_order_at,
    AVG(o.amount)                   AS avg_order_value
FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id
GROUP BY c.id, c.name, c.email
WITH DATA;  -- populate immediately

-- Create an index on the materialized view for fast lookups
CREATE INDEX ON customer_order_stats (customer_id);
CREATE INDEX ON customer_order_stats (total_spent DESC);

-- Query the materialized view: instant, no JOIN
SELECT * FROM customer_order_stats
WHERE total_spent > 1000
ORDER BY total_spent DESC;

-- Refresh when data changes (full refresh — locks the view briefly)
REFRESH MATERIALIZED VIEW customer_order_stats;

-- Concurrent refresh: allows reads during refresh (requires unique index)
CREATE UNIQUE INDEX ON customer_order_stats (customer_id);
REFRESH MATERIALIZED VIEW CONCURRENTLY customer_order_stats;

Materialized view vs summary table:

	Materialized View	Summary Table
Definition	SQL query stored in DB	Regular table with INSERT logic
Refresh	`REFRESH MATERIALIZED VIEW`	Custom job / trigger
Concurrent reads during refresh	Yes (with CONCURRENTLY)	Yes always
Incremental refresh	No (full refresh only in PostgreSQL)	Yes (manual)
Index support	Yes	Yes
Complexity	Low	Higher

Use materialized views for moderately complex aggregations refreshed on a schedule. Use summary tables when you need incremental updates or more control over the refresh logic.

JSONB: Semi-Structured Data in a Relational Table

PostgreSQL's JSONB type stores JSON data in a binary format that supports indexing. It bridges the gap between relational and document models.

When JSONB makes sense:

Attributes that vary per row and cannot be normalized (product metadata, user preferences, event payloads)
Rapid iteration where the schema is not yet known
Storing external API responses that have inconsistent structure

-- Products with highly variable attributes per category
CREATE TABLE products (
    id          SERIAL PRIMARY KEY,
    name        VARCHAR(200) NOT NULL,
    category_id INTEGER NOT NULL,
    price       NUMERIC(10,2) NOT NULL,
    attributes  JSONB          -- flexible per-category metadata
);

-- Electronics: {"ram": "16GB", "storage": "512GB", "screen": "14 inch"}
-- Clothing: {"size": "M", "color": "blue", "material": "cotton"}
INSERT INTO products (name, category_id, price, attributes)
VALUES
    ('MacBook Pro', 1, 2499.00, '{"ram": "16GB", "storage": "512GB"}'),
    ('Blue T-Shirt', 2,   29.00, '{"size": "M", "color": "blue"}');

-- Query JSONB fields
SELECT name, attributes->>'ram' AS ram
FROM products
WHERE category_id = 1
  AND (attributes->>'ram') = '16GB';

-- Index a specific JSONB field (btree on extracted value)
CREATE INDEX idx_product_ram
ON products ((attributes->>'ram'))
WHERE category_id = 1;

-- GIN index: indexes ALL keys and values (supports @>, ?, ?| operators)
CREATE INDEX idx_product_attrs_gin ON products USING GIN (attributes);

-- Find products with a specific key
SELECT * FROM products WHERE attributes ? 'ram';

-- Find products with specific key-value pairs
SELECT * FROM products WHERE attributes @> '{"color": "blue"}';

JSONB operators:

-> returns JSON: attributes->'ram' → "16GB" (JSON string)
->> returns text: attributes->>'ram' → 16GB (text)
#> path: attributes#>'{specs,ram}' → nested value
@> contains: attributes @> '{"color": "blue"}'
? key exists: attributes ? 'ram'

When NOT to use JSONB: If you query the same fields consistently, normalize them into real columns. A real column with a B-Tree index is faster and cleaner than a JSONB field with a GIN index. JSONB is for genuinely variable or unknown-at-schema-time attributes.

Soft Deletes: The Pattern Everybody Uses Wrong

A soft delete marks a row as deleted without removing it from the database.

-- Common implementation
ALTER TABLE users ADD COLUMN deleted_at TIMESTAMPTZ DEFAULT NULL;

-- Soft delete
UPDATE users SET deleted_at = NOW() WHERE id = 42;

-- "Active" users query
SELECT * FROM users WHERE deleted_at IS NULL;

The problems:

Problem 1 — Every query needs the filter. Forget WHERE deleted_at IS NULL once and deleted users appear in your UI. In ORMs, this is solved with default scopes, but raw SQL queries are always at risk.

Problem 2 — UNIQUE constraints break. A user is soft-deleted, then a new user registers with the same email. The UNIQUE constraint on email fires — the old (deleted) row is still there.

-- Fix: partial unique index that excludes deleted rows
CREATE UNIQUE INDEX idx_users_email_active
ON users (email)
WHERE deleted_at IS NULL;
-- Now two rows with the same email can exist: one deleted, one active

Problem 3 — Indexes become bloated. The partial index above helps, but all queries still scan deleted rows unless the index explicitly excludes them.

-- Add deleted_at to partial indexes throughout your schema
CREATE INDEX idx_orders_customer
ON orders (customer_id, created_at)
WHERE deleted_at IS NULL;

Problem 4 — Table grows forever. Soft-deleted rows accumulate. Use a background job to hard-delete rows that have been soft-deleted for more than 90 days, and archive them to a separate deleted_users table first if audit trail is needed.

Alternative: use a separate archived table

-- On delete: move row to archive table
INSERT INTO users_archive SELECT *, NOW() AS deleted_at FROM users WHERE id = 42;
DELETE FROM users WHERE id = 42;

-- Pros: main table stays clean, indexes stay lean, UNIQUE constraints work
-- Cons: need to query both tables for historical data

Polymorphic Associations: The Pattern to Avoid

A polymorphic association is a foreign key that can reference multiple tables — comments that belong to either a post or a product, for example.

-- Common anti-pattern: type + id polymorphic association
CREATE TABLE comments (
    id            SERIAL PRIMARY KEY,
    body          TEXT NOT NULL,
    commentable_type VARCHAR(50),  -- 'Post' or 'Product'
    commentable_id   INTEGER       -- id in the referenced table
    -- No foreign key possible — which table does this reference?
);

Why this is problematic:

No foreign key constraint possible — database cannot enforce referential integrity
No join possible without dynamic SQL
commentable_id = 5 is ambiguous — is it post 5 or product 5?
Adding a new commentable type requires application code changes, not just schema changes

Better pattern 1: separate join tables

CREATE TABLE post_comments (
    comment_id INTEGER REFERENCES comments(id),
    post_id    INTEGER REFERENCES posts(id),
    PRIMARY KEY (comment_id, post_id)
);

CREATE TABLE product_comments (
    comment_id INTEGER REFERENCES comments(id),
    product_id INTEGER REFERENCES products(id),
    PRIMARY KEY (comment_id, product_id)
);
-- Full referential integrity. Clean joins. Explicit schema.

Better pattern 2: nullable foreign keys

CREATE TABLE comments (
    id         SERIAL PRIMARY KEY,
    body       TEXT NOT NULL,
    post_id    INTEGER REFERENCES posts(id),
    product_id INTEGER REFERENCES products(id),
    -- Enforce exactly one is set:
    CONSTRAINT exactly_one_parent CHECK (
        (post_id IS NOT NULL)::INT + (product_id IS NOT NULL)::INT = 1
    )
);

The Schema Design Decision Framework

NEW COLUMN OR TABLE — ask these questions before deciding:

1. How often does this data change?
   ├─ Frequently (names, prices, statuses) → Normalize — update one place
   └─ Rarely or never (order snapshot price) → Can denormalize safely

2. Is it derived from other data (count, sum, average)?
   ├─ Low query frequency → Compute at query time (normalize)
   ├─ High query frequency, updated infrequently → Materialized view
   └─ High query frequency, updated frequently → Trigger-maintained counter

3. Does it have variable structure per row?
   ├─ Yes, truly variable → JSONB
   └─ No, consistent structure → Real columns (faster, indexable)

4. Do queries on this data join 3+ large tables?
   ├─ No → Stay normalized
   └─ Yes, on a hot path → Measure first; denormalize if confirmed slow

5. Is "deleted" a state, not an action?
   ├─ Need full audit trail → Soft delete + archive job
   └─ No audit requirement → Hard delete (simplest, cleanest)

6. Does a foreign key reference multiple tables?
   └─ Always → Separate join tables or nullable FKs; never polymorphic type+id

Chapter 2 Complete

You have finished the SQL Deep Dive chapter:

✅ Post 06: Advanced SQL — Window Functions, CTEs & Recursive Queries
✅ Post 07: Query Optimization — EXPLAIN Plans & Slow Query Fixes
✅ Post 08: Schema Design — Normalization vs Denormalization

Chapter 1 gave you the fundamentals. Chapter 2 gave you the tools to write SQL that performs well and schemas that hold up under real-world load. Chapter 3 goes a level deeper — inside the database engine itself: how data is physically stored, why some writes are faster than others, and how the database recovers from crashes without losing a single committed transaction.

🧭 What's Next — Chapter 3: Storage Engine Internals

Post 09: How Databases Store Data — pages, heap files, and the difference between row-oriented and column-oriented storage that determines whether your query takes milliseconds or minutes