Relational Model & Normalization: How to Design Tables That Last

The Schema You Will Regret

Every developer has a schema they regret. It usually starts simple — a users table, an orders table, maybe a products table. Then the requirements change. New columns get added. Some columns are sometimes null and sometimes not. A column that stores one value for 80% of rows suddenly needs to store multiple values. You add a metadata JSON blob to avoid an ALTER TABLE. Six months later, data is duplicated across three tables, a simple query takes five JOINs, and nobody is sure which copy of a customer's email address is authoritative.

This is not bad luck. It is the predictable consequence of designing a schema without understanding the relational model. This post teaches you the model — and the normalization rules that follow from it — so you can design schemas that stay correct as your application evolves.

The Relational Model

The relational model was proposed by Edgar F. Codd in 1970. Its core idea is deceptively simple: represent all data as relations (tables), and describe relationships between data using values, not pointers.

Before relational databases, hierarchical and network databases navigated data by following physical pointers — memory addresses stored directly in records. If the physical layout changed, every query broke. Codd's insight was that queries should describe what data to retrieve, not how to navigate to it. The database engine figures out the navigation.

Three concepts form the foundation:

Relation (Table) — a set of tuples (rows), each with the same attributes (columns). A relation has no duplicate rows and no inherent ordering — it is a mathematical set, not a list.

Tuple (Row) — a single record in a relation. Each attribute holds exactly one atomic value — no arrays, no nested objects, no sets within a cell.

Attribute (Column) — a named property with a defined domain (data type + constraints). Every value in a column belongs to that domain.

Relation: orders

┌──────────┬────────────┬───────────┬──────────────┐
│ order_id │ customer_id│  amount   │  created_at  │
├──────────┼────────────┼───────────┼──────────────┤
│    1     │    42      │  129.99   │ 2025-03-01   │
│    2     │    17      │   49.00   │ 2025-03-02   │
│    3     │    42      │  219.50   │ 2025-03-04   │
└──────────┴────────────┴───────────┴──────────────┘

- order_id: primary key (unique identifier for each row)
- customer_id: foreign key (references the customers table)
- amount: domain = positive decimal number
- created_at: domain = valid timestamp

Keys: The Language of Relationships

Keys are how the relational model connects tables without pointers. Understanding the four key types is prerequisite for everything that follows.

Superkey — any set of attributes that uniquely identifies a row. {order_id} is a superkey. So is {order_id, customer_id} — it uniquely identifies rows, just with more attributes than necessary.

Candidate Key — a minimal superkey. Remove any attribute and it no longer uniquely identifies rows. {order_id} is a candidate key. {order_id, customer_id} is not — you can remove customer_id and still have unique identification.

Primary Key — the candidate key chosen as the official row identifier. Every table should have one. Values must be unique and NOT NULL.

sql

CREATE TABLE orders (
    order_id    SERIAL       PRIMARY KEY,   -- primary key
    customer_id INTEGER      NOT NULL,
    amount      NUMERIC(10,2) NOT NULL CHECK (amount > 0),
    created_at  TIMESTAMPTZ  NOT NULL DEFAULT NOW()
);

Foreign Key — an attribute (or set of attributes) in one table that references the primary key of another table. Enforces referential integrity — you cannot have an order referencing a customer that does not exist.

sql

CREATE TABLE orders (
    order_id    SERIAL       PRIMARY KEY,
    customer_id INTEGER      NOT NULL REFERENCES customers(id),  -- foreign key
    ...
);

Composite Key — a primary key made of multiple columns. Used when no single column uniquely identifies a row.

sql

CREATE TABLE order_items (
    order_id   INTEGER  NOT NULL REFERENCES orders(order_id),
    product_id INTEGER  NOT NULL REFERENCES products(product_id),
    quantity   INTEGER  NOT NULL CHECK (quantity > 0),
    unit_price NUMERIC(10,2) NOT NULL,
    PRIMARY KEY (order_id, product_id)   -- composite primary key
);

Functional Dependencies: The Theory Behind Normalization

Normalization is not a set of arbitrary rules. It is derived from a single theoretical concept: functional dependencies.

A functional dependency A → B means: for every pair of rows in the table, if they have the same value of A, they must have the same value of B. In other words, A determines B.

In an orders table:
  order_id → customer_id    ✅ Each order belongs to exactly one customer
  order_id → amount         ✅ Each order has exactly one amount
  customer_id → order_id    ❌ A customer can have many orders

In an order_items table:
  (order_id, product_id) → quantity    ✅ Each line item has one quantity
  product_id → product_name            ✅ Each product has one name
  order_id → product_name              ❌ An order can have many products

The normalization forms are rules that eliminate specific anomalies caused by problematic functional dependencies. Each form removes one type of problem.

First Normal Form (1NF): Atomic Values

Rule: Every cell contains exactly one atomic value. No repeating groups, no arrays, no comma-separated lists.

Violation:

┌──────────┬───────────────┬──────────────────────────────┐
│ order_id │ customer_name │ product_ids                  │
├──────────┼───────────────┼──────────────────────────────┤
│    1     │ Alice         │ 101, 102, 103                │  ← array in one cell
│    2     │ Bob           │ 104                          │
└──────────┴───────────────┴──────────────────────────────┘

Why it is a problem: You cannot query WHERE product_ids = 102 without string parsing. You cannot add a foreign key constraint. Adding a 4th product requires modifying the cell value, not inserting a new row.

Fix — decompose into atomic rows:

┌──────────┬───────────────┬────────────┐
│ order_id │ customer_name │ product_id │
├──────────┼───────────────┼────────────┤
│    1     │ Alice         │    101     │
│    1     │ Alice         │    102     │
│    1     │ Alice         │    103     │
│    2     │ Bob           │    104     │
└──────────┴───────────────┴────────────┘

Now every cell is atomic. Each row represents one relationship between one order and one product.

Second Normal Form (2NF): No Partial Dependencies

Rule: Must be in 1NF. Every non-key attribute must depend on the whole primary key, not just part of it. (Only relevant when the primary key is composite.)

Violation:

order_items (order_id, product_id, quantity, product_name, product_price)
Primary key: (order_id, product_id)

Functional dependencies:
  (order_id, product_id) → quantity      ✅ full dependency
  product_id → product_name              ❌ partial dependency (only part of PK)
  product_id → product_price             ❌ partial dependency

Why it is a problem:

Update anomaly: If Apple updates the iPhone price, you must update every row in order_items that contains an iPhone — miss one and you have inconsistent data.
Delete anomaly: If you delete all orders containing product 101, you lose the product name and price.
Insert anomaly: You cannot add a new product until someone orders it.

Fix — separate product data into its own table:

sql

-- Products: product_id determines product_name and price
CREATE TABLE products (
    product_id    SERIAL PRIMARY KEY,
    product_name  VARCHAR(200) NOT NULL,
    product_price NUMERIC(10,2) NOT NULL
);

-- Order items: only stores what is specific to this line item
CREATE TABLE order_items (
    order_id    INTEGER REFERENCES orders(order_id),
    product_id  INTEGER REFERENCES products(product_id),
    quantity    INTEGER NOT NULL,
    PRIMARY KEY (order_id, product_id)
);

Now product_name and product_price exist in exactly one place. Updating an iPhone price means updating one row in products.

Third Normal Form (3NF): No Transitive Dependencies

Rule: Must be in 2NF. No non-key attribute should depend on another non-key attribute. Every non-key attribute must depend directly on the primary key, not transitively through another non-key attribute.

Violation:

employees (employee_id, name, department_id, department_name, department_budget)
Primary key: employee_id

Functional dependencies:
  employee_id → department_id      ✅ direct dependency on PK
  department_id → department_name  ❌ transitive dependency (non-key → non-key)
  department_id → department_budget ❌ transitive dependency

employee_id → department_id → department_name — the employee ID determines the department ID, which in turn determines the department name. department_name depends on employee_id only transitively, through department_id.

Why it is a problem:

Update anomaly: Renaming the Engineering department requires updating every employee row in that department.
Delete anomaly: Deleting all employees in a department loses the department name.
Insert anomaly: You cannot create a department until it has at least one employee.

Fix:

sql

CREATE TABLE departments (
    department_id     SERIAL PRIMARY KEY,
    department_name   VARCHAR(100) NOT NULL,
    department_budget NUMERIC(15,2)
);

CREATE TABLE employees (
    employee_id   SERIAL PRIMARY KEY,
    name          VARCHAR(100) NOT NULL,
    department_id INTEGER REFERENCES departments(department_id)
    -- department_name and budget are gone — they live in departments
);

Boyce-Codd Normal Form (BCNF): Tighter Than 3NF

Rule: Must be in 3NF. For every functional dependency A → B, A must be a superkey. No non-superkey attribute determines anything.

3NF allows one edge case that BCNF does not: a non-key attribute can determine part of the primary key if that part is also a candidate key. BCNF eliminates this.

Violation example:

course_registrations (student, course, teacher)
Candidate keys: (student, course), (student, teacher)

Functional dependency: teacher → course
(A teacher teaches exactly one course, but a course can have multiple teachers)

teacher is NOT a superkey, yet teacher → course exists. Violates BCNF.

Fix — decompose:

sql

-- Teacher determines course (teacher is the key here)
CREATE TABLE teacher_courses (
    teacher VARCHAR(100) PRIMARY KEY,
    course  VARCHAR(100) NOT NULL
);

-- Student registration links to teacher (who implies the course)
CREATE TABLE student_teachers (
    student VARCHAR(100),
    teacher VARCHAR(100) REFERENCES teacher_courses(teacher),
    PRIMARY KEY (student, teacher)
);

In practice, BCNF decomposition occasionally makes some queries harder to express. Most production schemas aim for 3NF and selectively denormalize for performance.

The Normalization Summary

Normal Form	Eliminates	Rule
1NF	Repeating groups, non-atomic values	Every cell is atomic
2NF	Partial dependencies	Every non-key attribute depends on the whole PK
3NF	Transitive dependencies	Every non-key attribute depends directly on the PK
BCNF	Remaining anomalies from 3NF edge cases	Every determinant is a superkey

Each form is a strict superset of the previous — a table in 3NF is also in 2NF and 1NF. You cannot skip levels.

When to Deliberately Denormalize

Normalization is not always the right answer in production. Sometimes you intentionally violate normal forms for performance:

Scenario 1 — Avoid expensive JOINs on hot paths. If a query is executed millions of times per day and requires JOINing four tables, denormalizing by duplicating some columns can eliminate the JOINs and reduce latency significantly.

sql

-- Normalized: requires JOIN to get customer_email
SELECT o.id, c.email
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.id = 12345;

-- Denormalized: customer_email stored directly on orders (duplicate data)
SELECT id, customer_email
FROM orders
WHERE id = 12345;
-- Fast, but: updating a customer email now requires updating all their orders

Scenario 2 — Read-heavy analytics tables. OLAP tables are often heavily denormalized (star schema or flat wide tables) because analytics queries scan millions of rows and JOINs at that scale are expensive.

Scenario 3 — Document stores. MongoDB and similar databases are deliberately denormalized by design. Embedding related data within a document trades update complexity for read performance.

The rule: normalize first, denormalize with intention. Denormalization is a deliberate performance optimization, not a shortcut to avoid thinking about schema design. Every denormalization decision creates a maintenance burden — make sure the performance gain justifies it.

Practical Schema Design Checklist

Before finalizing any schema, run through these checks:

✅ Every table has a primary key
✅ Primary keys are stable — they will not need to change (use surrogate keys like SERIAL/UUID)
✅ Foreign keys are declared (not just implied in application code)
✅ NOT NULL is applied to every column that should never be empty
✅ CHECK constraints enforce domain rules (positive amounts, valid status values)
✅ No comma-separated values or JSON arrays where a related table should exist (1NF)
✅ No column that depends on part of a composite key (2NF)
✅ No column that depends on another non-key column (3NF)
✅ Every denormalization is documented with a comment explaining the trade-off

🧭 What's Next

Post 03: SQL Fundamentals — every JOIN type, GROUP BY traps, subqueries, and the mental model that makes complex queries feel obvious rather than mysterious

The Schema You Will Regret

The Relational Model

Keys: The Language of Relationships

Functional Dependencies: The Theory Behind Normalization

First Normal Form (1NF): Atomic Values

Second Normal Form (2NF): No Partial Dependencies

Third Normal Form (3NF): No Transitive Dependencies

Boyce-Codd Normal Form (BCNF): Tighter Than 3NF

The Normalization Summary

When to Deliberately Denormalize

Practical Schema Design Checklist

🧭 What's Next

Related

Database Replication: Single-Leader, Multi-Leader and Leaderless Explained

WAL, Crash Recovery and Durability: How Databases Survive Power Failures

B-Tree vs LSM-Tree: Why InnoDB and RocksDB Make Different Trade-offs

Leave a comment

Comments