SQL Fundamentals: SELECT, JOIN, GROUP BY and the Queries That Trip Everyone Up

SQL Is Not What You Think It Is

Most developers treat SQL like a scripting language — write a query, get some data, move on. That mental model works for simple cases and fails badly for complex ones.

SQL is a declarative language. You describe the result set you want. The database engine decides how to produce it. This means the order you write clauses in has nothing to do with the order they are executed — and that gap between write order and execution order is where most SQL confusion comes from.

Logical execution order of a SELECT statement:

1. FROM         — identify the source tables
2. JOIN         — combine tables
3. WHERE        — filter rows (before grouping)
4. GROUP BY     — group rows into buckets
5. HAVING       — filter groups (after grouping)
6. SELECT       — compute output columns
7. DISTINCT     — remove duplicates
8. ORDER BY     — sort the result
9. LIMIT/OFFSET — paginate

Write order: SELECT ... FROM ... WHERE ... GROUP BY ... HAVING ... ORDER BY ... LIMIT Execute order: FROM → JOIN → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT

This single fact explains most GROUP BY errors, most "column not found" errors in HAVING clauses, and most confusion about why aliases work in ORDER BY but not in WHERE.

SELECT and FROM: The Basics Done Right

sql

-- Select specific columns (always prefer this over SELECT *)
SELECT
    order_id,
    customer_id,
    amount,
    created_at
FROM orders;

-- SELECT * in production code is an anti-pattern:
-- 1. Fetches columns you don't need (wastes I/O and network)
-- 2. Breaks silently when columns are added/removed
-- 3. Makes query intent unclear to the next developer

-- Column aliases
SELECT
    order_id                            AS id,
    amount * 1.1                        AS amount_with_tax,
    DATE_TRUNC('month', created_at)     AS month
FROM orders;

-- Expression in SELECT — the database computes this per row
SELECT
    product_id,
    unit_price * quantity               AS line_total,
    UPPER(product_name)                 AS name_upper,
    CASE
        WHEN amount > 1000 THEN 'high'
        WHEN amount > 100  THEN 'medium'
        ELSE 'low'
    END                                 AS value_tier
FROM order_items
JOIN products USING (product_id);

WHERE: Filtering Rows

WHERE filters rows before any grouping. Every condition is evaluated per row.

sql

-- Basic comparisons
SELECT * FROM orders
WHERE amount > 100
  AND status = 'completed'
  AND created_at >= '2025-01-01';

-- IN: shorthand for multiple OR conditions
SELECT * FROM orders
WHERE status IN ('completed', 'shipped', 'processing');

-- BETWEEN: inclusive on both ends
SELECT * FROM orders
WHERE created_at BETWEEN '2025-01-01' AND '2025-03-31';

-- LIKE: pattern matching (% = any sequence, _ = one character)
SELECT * FROM customers
WHERE email LIKE '%@gmail.com';     -- ends with @gmail.com
WHERE name  LIKE 'Al_ce';           -- Alice, Alyce, etc.

-- IS NULL: never use = NULL, it always returns false
SELECT * FROM orders
WHERE shipped_at IS NULL;           -- orders not yet shipped
WHERE shipped_at IS NOT NULL;       -- orders that have been shipped

-- NOT: negate any condition
SELECT * FROM products
WHERE product_id NOT IN (SELECT product_id FROM discontinued);

The NULL trap: NULL = NULL evaluates to NULL, not TRUE. Always use IS NULL / IS NOT NULL for null checks. Any arithmetic or comparison involving NULL returns NULL — NULL + 1 = NULL, NULL > 0 = NULL.

JOIN: The Heart of Relational Queries

JOIN is where most developers get confused — not because the syntax is complex, but because the mental model of what each JOIN does is unclear. The key: always think of a JOIN as producing a new set of rows by combining two tables based on a condition.

INNER JOIN — Only Matching Rows

Returns rows where the join condition is satisfied in both tables. Non-matching rows are excluded from both sides.

sql

-- Orders with their customer information
-- Customers with no orders: NOT included
-- Orders with no valid customer: NOT included
SELECT
    o.order_id,
    o.amount,
    c.name     AS customer_name,
    c.email
FROM orders o
INNER JOIN customers c ON o.customer_id = c.id;
-- INNER is optional: JOIN alone means INNER JOIN

LEFT JOIN — All Left Rows, Matching Right Rows

Returns all rows from the left table, with matched rows from the right. Non-matching right rows appear as NULL.

sql

-- ALL customers, with their orders if they have any
-- Customers with no orders: included (order columns are NULL)
SELECT
    c.id,
    c.name,
    o.order_id,     -- NULL if customer has no orders
    o.amount        -- NULL if customer has no orders
FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id;

-- Find customers with NO orders (NULL trick)
SELECT c.id, c.name
FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id
WHERE o.order_id IS NULL;   -- only rows where right side is NULL = no match

RIGHT JOIN — All Right Rows, Matching Left Rows

Mirror of LEFT JOIN. Returns all rows from the right table. Rarely used in practice — you can always rewrite a RIGHT JOIN as a LEFT JOIN by swapping the table order. Most style guides prefer LEFT JOIN for consistency.

sql

-- Equivalent queries:
SELECT * FROM a RIGHT JOIN b ON a.id = b.a_id;
SELECT * FROM b LEFT JOIN a ON a.id = b.a_id;  -- same result, preferred style

FULL OUTER JOIN — All Rows from Both Sides

Returns all rows from both tables. Rows with no match on either side appear with NULLs for the other side's columns.

sql

-- All customers AND all orders — match where possible
SELECT
    c.name,
    o.order_id,
    o.amount
FROM customers c
FULL OUTER JOIN orders o ON c.id = o.customer_id;
-- Customers with no orders: name filled, order columns NULL
-- Orders with no valid customer: order columns filled, name NULL

CROSS JOIN — Every Combination

Returns the Cartesian product — every row from the left combined with every row from the right. If left has 100 rows and right has 200 rows, the result has 20,000 rows. Use intentionally and carefully.

sql

-- Generate a grid of all size/color combinations for a product configurator
SELECT
    s.size,
    c.color
FROM sizes s
CROSS JOIN colors c;
-- If sizes has 5 rows and colors has 8, result has 40 rows

SELF JOIN — A Table Joined to Itself

A table joined to itself. Used for hierarchical data (employees and their managers) or comparing rows within the same table.

sql

-- Each employee with their manager's name
SELECT
    e.name        AS employee,
    m.name        AS manager
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.id;
-- CEO has no manager → manager_id is NULL → LEFT JOIN keeps the row, manager name is NULL

JOIN Visual Reference

Table A    Table B

  A ∩ B           → INNER JOIN  (intersection only)
  A ∪ (A ∩ B)     → LEFT JOIN   (all of A + matching B)
  B ∪ (A ∩ B)     → RIGHT JOIN  (all of B + matching A)
  A ∪ B           → FULL JOIN   (everything from both)
  A × B           → CROSS JOIN  (cartesian product)

GROUP BY and Aggregate Functions

GROUP BY collapses multiple rows that share the same value(s) into a single row, and aggregate functions compute a single value from each group.

sql

-- Total revenue and order count per customer
SELECT
    customer_id,
    COUNT(*)            AS order_count,
    SUM(amount)         AS total_spent,
    AVG(amount)         AS avg_order_value,
    MIN(amount)         AS smallest_order,
    MAX(amount)         AS largest_order
FROM orders
GROUP BY customer_id;

The GROUP BY rule: Every column in SELECT must either be in GROUP BY or wrapped in an aggregate function. This is the most common GROUP BY error.

sql

-- ❌ WRONG: name is not in GROUP BY and not aggregated
SELECT customer_id, name, COUNT(*)
FROM orders
GROUP BY customer_id;

-- ✅ CORRECT option 1: add name to GROUP BY
SELECT customer_id, name, COUNT(*)
FROM orders
GROUP BY customer_id, name;

-- ✅ CORRECT option 2: aggregate name (e.g. if all rows in a group have the same name)
SELECT customer_id, MAX(name) AS name, COUNT(*)
FROM orders
GROUP BY customer_id;

HAVING: Filtering Groups

WHERE filters rows before grouping. HAVING filters groups after aggregation.

sql

-- Customers who have spent more than 1000 total
SELECT
    customer_id,
    SUM(amount) AS total_spent
FROM orders
GROUP BY customer_id
HAVING SUM(amount) > 1000;   -- can't use WHERE here — amount isn't aggregated yet

-- Customers with more than 5 orders AND total spend > 500
SELECT
    customer_id,
    COUNT(*)    AS order_count,
    SUM(amount) AS total_spent
FROM orders
WHERE status = 'completed'          -- filter individual rows first
GROUP BY customer_id
HAVING COUNT(*) > 5                 -- then filter groups
   AND SUM(amount) > 500;

WHERE vs HAVING: Use WHERE to filter rows (before aggregation). Use HAVING to filter groups (after aggregation). If you can express a filter in WHERE, always do — it reduces the number of rows that need to be grouped.

Subqueries: Queries Inside Queries

A subquery is a SELECT statement nested inside another statement. There are three types.

Scalar Subquery — Returns One Value

sql

-- Orders where the amount is above the overall average
SELECT order_id, amount
FROM orders
WHERE amount > (SELECT AVG(amount) FROM orders);
--              └── returns one number, used like a constant

Row/Table Subquery — Returns Multiple Rows

sql

-- Customers who have placed at least one order over $500
SELECT id, name
FROM customers
WHERE id IN (
    SELECT DISTINCT customer_id
    FROM orders
    WHERE amount > 500
);

-- Same result, often faster with EXISTS (stops at first match):
SELECT id, name
FROM customers c
WHERE EXISTS (
    SELECT 1
    FROM orders o
    WHERE o.customer_id = c.id
      AND o.amount > 500
);

Correlated Subquery — References the Outer Query

The subquery executes once per row of the outer query. Powerful but potentially slow — use with care.

sql

-- Each order with the customer's total spend (correlated subquery)
SELECT
    o.order_id,
    o.amount,
    (SELECT SUM(amount)
     FROM orders o2
     WHERE o2.customer_id = o.customer_id) AS customer_total
FROM orders o;

-- Better rewritten as a window function (one pass, not n passes):
SELECT
    order_id,
    amount,
    SUM(amount) OVER (PARTITION BY customer_id) AS customer_total
FROM orders;

Common Table Expressions (CTEs)

CTEs (WITH clauses) name a subquery so you can reference it by name. They make complex queries readable without changing the execution semantics.

sql

-- Find the top 10 customers by total spend, with their most recent order date
WITH customer_stats AS (
    SELECT
        customer_id,
        COUNT(*)        AS order_count,
        SUM(amount)     AS total_spent,
        MAX(created_at) AS last_order_at
    FROM orders
    WHERE status = 'completed'
    GROUP BY customer_id
)
SELECT
    c.name,
    cs.order_count,
    cs.total_spent,
    cs.last_order_at
FROM customer_stats cs
JOIN customers c ON cs.customer_id = c.id
ORDER BY cs.total_spent DESC
LIMIT 10;

CTEs can reference each other — define them in dependency order:

sql

WITH
monthly_revenue AS (
    SELECT
        DATE_TRUNC('month', created_at) AS month,
        SUM(amount) AS revenue
    FROM orders
    GROUP BY 1
),
revenue_with_growth AS (
    SELECT
        month,
        revenue,
        LAG(revenue) OVER (ORDER BY month) AS prev_month_revenue,
        revenue - LAG(revenue) OVER (ORDER BY month) AS growth
    FROM monthly_revenue
)
SELECT * FROM revenue_with_growth
ORDER BY month;

ORDER BY, LIMIT and Pagination

sql

-- Sort by multiple columns
SELECT * FROM orders
ORDER BY customer_id ASC, created_at DESC;

-- LIMIT and OFFSET for pagination
SELECT * FROM orders
ORDER BY created_at DESC
LIMIT 20 OFFSET 40;   -- page 3, 20 items per page

-- OFFSET pagination problem: slow at high offsets
-- OFFSET 10000 forces the database to scan and discard 10000 rows first
-- Use keyset pagination instead for large datasets:

-- Keyset pagination: use the last seen value as a cursor
SELECT * FROM orders
WHERE created_at < '2025-03-01 12:34:56'   -- last seen timestamp
ORDER BY created_at DESC
LIMIT 20;
-- Fast regardless of how deep you paginate — uses index on created_at

The 5 Most Common SQL Mistakes

Mistake 1 — Using OR with IN lists:

sql

-- Slow: OR prevents index use on some databases
WHERE status = 'completed' OR status = 'shipped' OR status = 'processing'

-- Fast: IN uses the same index efficiently
WHERE status IN ('completed', 'shipped', 'processing')

Mistake 2 — Function on indexed column in WHERE:

sql

-- ❌ Index on created_at is NOT used — function prevents it
WHERE YEAR(created_at) = 2025

-- ✅ Index is used — range condition on the raw column
WHERE created_at >= '2025-01-01' AND created_at < '2026-01-01'

Mistake 3 — Implicit type conversion:

sql

-- ❌ customer_id is INTEGER but compared to string — implicit cast, index skipped
WHERE customer_id = '42'

-- ✅ Same type — index used
WHERE customer_id = 42

Mistake 4 — COUNT(*) vs COUNT(column):

sql

COUNT(*)         -- counts all rows including NULLs
COUNT(column)    -- counts non-NULL values only — different result if NULLs exist
COUNT(DISTINCT column) -- counts unique non-NULL values

Mistake 5 — OFFSET pagination at scale:

sql

-- ❌ Scans and discards 100,000 rows before returning 20
LIMIT 20 OFFSET 100000

-- ✅ Uses index to jump directly to the cursor position
WHERE id > 100000 ORDER BY id LIMIT 20

Quick Reference: When to Use What

Need all rows from one table + matching from another?  → LEFT JOIN
Need only rows that match in both tables?              → INNER JOIN
Need every combination of two tables?                 → CROSS JOIN
Need to compare a table to itself?                    → SELF JOIN
Need to filter before grouping?                       → WHERE
Need to filter after grouping?                        → HAVING
Need a value from another query?                      → Subquery or CTE
Need to check existence without fetching data?        → EXISTS
Need to paginate efficiently at scale?                → Keyset pagination

🧭 What's Next

Post 04: Database Indexes — indexes are the biggest performance lever in your database; learn how B-Tree and Hash indexes work internally, when to add them, and when they silently make things worse