What is a Database & DBMS: Why Not Just Use Files?
The Question Nobody Asks
Most developers start using databases on day one of their first project. They install PostgreSQL or MySQL, wire up an ORM, and start writing queries. The database just works — until it doesn't.
When it doesn't work, the questions start. Why is this query slow? Why did two simultaneous writes corrupt each other? Why did a crash leave the data in an inconsistent state? Why does the database use so much disk space just to store what seems like simple data?
These questions all have the same root: most developers use databases without understanding what a database actually is or what problem it was designed to solve. This post fixes that. Understanding the "why" of databases changes how you use them — and how you debug them when they break.
What Is a Database?
A database is an organized collection of structured data, stored and accessed electronically. That definition is technically correct and almost completely useless. A text file full of comma-separated values is also an organized collection of structured data. So is a spreadsheet. So is a folder of JSON files.
The definition that actually matters is this: a database is structured data plus the guarantees you need to work with it safely.
Those guarantees are what justify the complexity. A flat file has none of them. A database has all of them. The question is: what are those guarantees, and why can't you get them from a file?
Why Not Just Use Files?
Let's build the case from first principles. Imagine you are building a simple banking application. You decide to store accounts in a CSV file:
```
account_id,name,balance
1,Alice,5000
2,Bob,3000
3,Carol,8000
```

This works fine when one person uses the application. The moment it becomes a real system, you run into five problems that files cannot solve.
Problem 1: Concurrent Access
Two users try to transfer money at the same time. Your application reads the file, modifies the balance in memory, and writes the file back. Both users read the file before either writes it back.
```
Thread A reads:  Alice=5000, Bob=3000
Thread B reads:  Alice=5000, Carol=8000
Thread A writes: Alice=4000, Bob=4000   (transferred 1000 from Alice to Bob)
Thread B writes: Alice=4000, Carol=9000 (transferred 1000 from Alice to Carol)

Final state: Alice=4000, Bob=4000, Carol=9000
Alice lost 2000 but appears to have only lost 1000.
```

This is a race condition — the classic lost update problem. Files have no mechanism for locking or coordinating concurrent access. Two processes can read and write the file simultaneously, and the result is undefined.
A database solves this with transactions and locking. When Thread A begins transferring money, it acquires locks on the relevant rows. Thread B must wait until Thread A finishes before it can read or write those rows. No data is corrupted.
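A minimal sketch of the same transfer done safely, using SQLite (which ships with Python) as a stand-in for a full DBMS. The table and account data are illustrative; `BEGIN IMMEDIATE` is SQLite's way of taking the write lock up front, so a concurrent transfer must wait rather than read a stale balance.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode, so the
# explicit BEGIN/COMMIT/ROLLBACK below control the transaction directly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, name TEXT, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?, ?)",
                 [(1, "Alice", 5000), (2, "Bob", 3000)])

def transfer(conn, from_id, to_id, amount):
    # BEGIN IMMEDIATE acquires the write lock before any read,
    # preventing the lost-update interleaving shown above.
    conn.execute("BEGIN IMMEDIATE")
    try:
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, from_id))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, to_id))
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")  # undo both updates — never leave a half-done transfer
        raise

transfer(conn, 1, 2, 1000)
print(conn.execute("SELECT name, balance FROM accounts ORDER BY id").fetchall())
# → [('Alice', 4000), ('Bob', 4000)]
```

Either both balances change or neither does — the half-finished states from the file version simply cannot be observed.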
Problem 2: Crash Recovery
Your server loses power mid-write. You were in the middle of deducting 1000 from Alice and adding it to Bob. The deduction was written to disk. The addition was not.
```
Before crash:
Alice=5000, Bob=3000

After crash:
Alice=4000, Bob=3000   ← 1000 vanished
```

With a file, you have no way to recover. The write was partial. You cannot roll it back — there is no record of what the original values were or what the intended operation was.
A database solves this with the Write-Ahead Log (WAL). Before modifying any data, the database writes the intended change to a log. If a crash occurs mid-write, the database replays the log on restart to bring data to a consistent state. Nothing is ever half-done.
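The ordering is the whole trick, and a toy version fits in a few lines. This is a sketch, not how any real engine lays out its WAL (PostgreSQL, for instance, logs page-level changes); the "database" here is just a dict, and the function names are invented for illustration.

```python
import json
import os

def apply_change(db, log_path, key, new_value):
    # Step 1: append the intended change to the log and force it to disk.
    with open(log_path, "a") as f:
        f.write(json.dumps({"key": key, "value": new_value}) + "\n")
        f.flush()
        os.fsync(f.fileno())   # durable on disk BEFORE the data is touched
    # Step 2: only now apply it. A crash here loses nothing: replay recovers it.
    db[key] = new_value

def recover(db, log_path):
    # On restart, replay every logged change. Re-applying is safe because
    # each record stores the final value, so replay is idempotent.
    if os.path.exists(log_path):
        with open(log_path) as f:
            for line in f:
                rec = json.loads(line)
                db[rec["key"]] = rec["value"]
```

If the process dies between the `fsync` and the dict update, `recover()` replays the log and the change is not lost — which is exactly the guarantee the file-based version could not give.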
Problem 3: Search Performance
Your users file grows to 10 million rows, each holding an account ID, name, and email. A user searches for their account by email. Your application reads every line of the CSV, compares the email column, and returns the match.
```python
# O(n) — reads the entire file for every search
def find_user(target_email):
    with open('users.csv') as f:
        for line in f:
            if line.rstrip('\n').split(',')[2] == target_email:
                return line
    return None
```

At 10 million rows, this takes seconds. At 100 million rows, it times out. Files have no query optimization. Every search is a full scan.
A database solves this with indexes — data structures (typically B-Trees) that allow O(log n) lookups instead of O(n) scans. A query against an indexed column on 100 million rows takes milliseconds, not minutes.
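The essence of an index can be shown with a sorted list standing in for a B-Tree: a structure sorted on the search column, mapping each key to the location of its row. The emails and offsets below are made up; `bisect` gives the O(log n) binary search that a real B-Tree generalizes to disk pages.

```python
import bisect

# A tiny "index" on the email column: (email, row_offset) pairs kept sorted.
rows = [("alice@example.com", 0), ("bob@example.com", 1), ("carol@example.com", 2)]
index = sorted(rows)

def lookup(index, email):
    # Binary search instead of a full scan: O(log n) comparisons.
    i = bisect.bisect_left(index, (email,))
    if i < len(index) and index[i][0] == email:
        return index[i][1]   # row offset — one more read fetches the actual row
    return None

print(lookup(index, "bob@example.com"))  # → 1
```

On 100 million rows, binary search needs about 27 comparisons where the scan needs 100 million — the same gap that separates a millisecond query from a timeout.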
Problem 4: Data Integrity
Nothing stops you from writing nonsense into a CSV file:
```
account_id,name,balance
1,Alice,5000
2,,3000                 ← name is empty
3,Carol,-99999          ← negative balance
4,Dave,not_a_number     ← balance is a string
```

Files cannot enforce constraints. A database can:
```sql
CREATE TABLE accounts (
    account_id SERIAL PRIMARY KEY,
    name       VARCHAR(100) NOT NULL,              -- cannot be missing
    balance    NUMERIC(15,2) CHECK (balance >= 0), -- must be non-negative
    email      VARCHAR(255) UNIQUE                 -- must be globally unique
);
```

Any attempt to insert invalid data is rejected before it reaches storage. The database is the last line of defense — it enforces the rules even when application code has a bug.
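You can watch the rejection happen with SQLite (bundled with Python). The schema below mirrors the `CREATE TABLE` above, simplified to SQLite's type system; the rows are the same illustrative data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE accounts (
        account_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        balance    NUMERIC CHECK (balance >= 0)
    )
""")

conn.execute("INSERT INTO accounts VALUES (1, 'Alice', 5000)")   # accepted
try:
    # Carol's negative balance violates the CHECK constraint.
    conn.execute("INSERT INTO accounts VALUES (3, 'Carol', -99999)")
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)
```

The bad row never reaches storage; the application gets an error it can handle instead of silently corrupted data.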
Problem 5: Multi-Table Queries
Your application grows. Accounts now have transactions. Each transaction references an account. With files, you have accounts.csv and transactions.csv. To find all transactions for Alice over the last 30 days, you:
1. Read all of accounts.csv to find Alice's account ID
2. Read all of transactions.csv to filter by account ID and date
3. Join the two results in memory
```python
import csv
from datetime import date, timedelta

# O(n × m) — reads both files entirely, joins in memory
def read_csv(path):
    with open(path) as f:
        return list(csv.DictReader(f))

accounts = read_csv('accounts.csv')
transactions = read_csv('transactions.csv')
thirty_days_ago = (date.today() - timedelta(days=30)).isoformat()

alice_id = next(a['account_id'] for a in accounts if a['name'] == 'Alice')
result = [t for t in transactions
          if t['account_id'] == alice_id
          and t['date'] >= thirty_days_ago]
```

At scale, this is unworkable. A database executes this as a single optimized query:
```sql
SELECT t.*
FROM transactions t
JOIN accounts a ON t.account_id = a.id
WHERE a.name = 'Alice'
  AND t.date >= NOW() - INTERVAL '30 days';
```

The query planner selects the most efficient execution path — using indexes, choosing join algorithms, estimating row counts. You describe what you want; the database decides how to get it.
What Is a DBMS?
A Database Management System (DBMS) is the software that sits between your application and the raw data. It is the engine that provides all five guarantees above — and many more.
The components of a DBMS:
```
       Your Application
              │
              ▼
┌─────────────────────────────────────┐
│               DBMS                  │
│                                     │
│  Query Parser & Optimizer           │ ← Translates SQL into an execution plan
│  Transaction Manager                │ ← Manages ACID guarantees
│  Lock Manager                       │ ← Coordinates concurrent access
│  Buffer Pool Manager                │ ← Caches disk pages in memory
│  Storage Engine                     │ ← Reads/writes data on disk
│  Recovery Manager                   │ ← Handles crashes via the WAL
│                                     │
└─────────────────────────────────────┘
              │
              ▼
       Storage (Disk / SSD)
```

When you run `SELECT * FROM accounts WHERE id = 1`, this is what happens:
1. The parser validates the SQL syntax and builds a parse tree
2. The optimizer evaluates multiple execution plans and picks the cheapest one (using indexes if available)
3. The executor runs the plan, calling the storage engine to fetch data
4. The buffer pool checks whether the needed disk page is already in memory (a cache hit) — if not, it reads from disk
5. The result is returned to your application
Every step is managed by the DBMS. Your application sees only the result.
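Of these components, the buffer pool is the easiest to sketch: a bounded cache of disk pages with least-recently-used eviction. This is a toy, not any engine's actual implementation — `read_page_from_disk` stands in for the storage engine, and page contents here are just strings.

```python
from collections import OrderedDict

class BufferPool:
    """A toy buffer pool: keeps at most `capacity` pages in memory."""

    def __init__(self, capacity, read_page_from_disk):
        self.capacity = capacity
        self.read_page = read_page_from_disk
        self.pages = OrderedDict()   # page_id -> page, oldest first

    def get(self, page_id):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)    # cache hit: mark recently used
            return self.pages[page_id]
        page = self.read_page(page_id)          # cache miss: go to disk
        self.pages[page_id] = page
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)      # evict the least recently used page
        return page
```

A hot page (a table's root index page, say) stays pinned in memory across thousands of queries, while cold pages are read once and evicted — which is why a second run of the same query is usually far faster than the first.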
Types of Databases
Not every problem needs the same solution. Databases have evolved into several families, each optimized for different workloads:
Relational (RDBMS) — data stored in tables with rows and columns, relationships enforced via foreign keys, queried with SQL. Strong ACID guarantees. Best for structured data with complex relationships. Examples: PostgreSQL, MySQL, SQLite, Oracle, SQL Server
Document — data stored as JSON-like documents. Flexible schema — each document can have different fields. Good for hierarchical data that maps naturally to objects in code. Examples: MongoDB, CouchDB, Firestore
Key-Value — data stored as simple key → value pairs. Extremely fast reads and writes. No query language — you fetch by key only. Best for caching, sessions, and simple lookups. Examples: Redis, DynamoDB, etcd
Wide-Column — data stored in tables but with dynamic columns per row. Optimized for write-heavy workloads at massive scale. Eventual consistency by default. Examples: Cassandra, HBase, Google Bigtable
Graph — data stored as nodes and edges. Native support for traversing relationships. Best for social networks, recommendation engines, and fraud detection. Examples: Neo4j, Amazon Neptune
Time-Series — data stored with a timestamp as the primary dimension. Optimized for append-heavy writes and time-range queries. Best for metrics, monitoring, and IoT. Examples: InfluxDB, TimescaleDB, Prometheus
The key insight: these are not competing products — they are different tools. PostgreSQL and Redis are not alternatives; they solve different problems. A production system often uses several databases simultaneously.
OLTP vs OLAP: Two Different Workloads
Within relational databases, there is a critical distinction that shapes every design decision:
OLTP (Online Transaction Processing) — high-volume, low-latency operations on individual rows. This is your application database — inserts, updates, lookups by primary key. Thousands of small transactions per second.
```sql
-- Typical OLTP query: fetch one row by primary key
SELECT * FROM orders WHERE id = 12345;

-- Typical OLTP write: update one row
UPDATE accounts SET balance = balance - 100 WHERE id = 42;
```

OLAP (Online Analytical Processing) — low-volume, long-running operations that scan large portions of the data to compute aggregates. This is your analytics database — sums, counts, and averages across millions of rows.
```sql
-- Typical OLAP query: aggregate across millions of rows
SELECT
    DATE_TRUNC('month', created_at) AS month,
    country,
    SUM(amount) AS total_revenue,
    COUNT(DISTINCT user_id) AS unique_customers
FROM orders
WHERE created_at >= '2025-01-01'
GROUP BY 1, 2
ORDER BY 1, 3 DESC;
```

Running OLAP queries on an OLTP database is one of the most common causes of production performance crises. A single long-running analytical query can lock tables and block thousands of short transactional queries.
The solution is separation: OLTP data lives in PostgreSQL or MySQL, tuned for row-level operations. Analytical data is replicated to a columnar data warehouse (BigQuery, Redshift, Snowflake), tuned for full-table scans and aggregations.
The Database Selection Framework
When starting a new project or evaluating a change, these are the questions to ask:
```
What is the primary access pattern?
├─ Fetch by primary key, low latency  → Key-Value (Redis, DynamoDB)
├─ Complex queries, relationships     → Relational (PostgreSQL)
├─ Hierarchical/nested documents      → Document (MongoDB)
├─ Write-heavy, massive scale         → Wide-Column (Cassandra)
├─ Relationship traversal             → Graph (Neo4j)
└─ Time-series metrics                → Time-Series (InfluxDB)

Do you need ACID transactions?
├─ Yes, across multiple entities      → Relational
└─ No, or eventual consistency is OK  → NoSQL options

How stable is the schema?
├─ Fixed, well-defined schema         → Relational
└─ Flexible, evolving per document    → Document

What is the scale requirement?
├─ Single machine, moderate traffic   → Relational (PostgreSQL handles more than you think)
└─ Horizontal scale, massive writes   → NoSQL (Cassandra, DynamoDB)
```

The most common mistake is reaching for a distributed NoSQL database when PostgreSQL on a single large machine would handle the load — with far less operational complexity. PostgreSQL can comfortably handle tens of thousands of queries per second and terabytes of data. The threshold where you genuinely need horizontal scaling is higher than most teams assume.
What's Next in This Chapter
This post answered the foundational question — what a database is and why files are not enough. The next four posts in this chapter build on this foundation:
Post 02: The relational model — tables, keys, and normalization from 1NF to BCNF
Post 03: SQL fundamentals — every JOIN type, GROUP BY, and the queries that trip everyone up
Post 04: Indexes — how B-Tree and Hash indexes work and when to use each
Post 05: Transactions — ACID properties, isolation levels, and the concurrency bugs each prevents
By the end of Chapter 1, you will have the conceptual foundation to understand everything that follows — storage internals, replication, sharding, and the distributed systems problems that make database engineering genuinely hard.