What is a Database & DBMS: Why Not Just Use Files?
The Question Nobody Asks
Most developers start using databases on day one of their first project. They install PostgreSQL or MySQL, wire up an ORM, and start writing queries. The database just works — until it doesn't.
When it doesn't work, the questions start. Why is this query slow? Why did two simultaneous writes corrupt each other? Why did a crash leave the data in an inconsistent state? Why does the database use so much disk space just to store what seems like simple data?
These questions all have the same root: most developers use databases without understanding what a database actually is or what problem it was designed to solve. This post fixes that. Understanding the "why" of databases changes how you use them — and how you debug them when they break.
What Is a Database?
A database is an organized collection of structured data, stored and accessed electronically. That definition is technically correct and almost completely useless. A text file full of comma-separated values is also an organized collection of structured data. So is a spreadsheet. So is a folder of JSON files.
The definition that actually matters is this: a database is structured data plus the guarantees you need to work with it safely.
Those guarantees are what justify the complexity. A flat file has none of them. A database has all of them. The question is: what are those guarantees, and why can't you get them from a file?
Why Not Just Use Files?
Let's build the case from first principles. Imagine you are building a simple banking application. You decide to store accounts in a CSV file:
```
account_id,name,balance
1,Alice,5000
2,Bob,3000
3,Carol,8000
```

This works fine when one person uses the application. The moment it becomes a real system, you run into five problems that files cannot solve.
Problem 1: Concurrent Access
Two users try to transfer money at the same time. Your application reads the file, modifies the balance in memory, and writes the file back. Both users read the file before either writes it back.
```
Thread A reads:  Alice=5000, Bob=3000
Thread B reads:  Alice=5000, Carol=8000
Thread A writes: Alice=4000, Bob=4000   (transferred 1000 from Alice to Bob)
Thread B writes: Alice=4000, Carol=9000 (transferred 1000 from Alice to Carol)

Final state: Alice=4000, Bob=4000, Carol=9000
Alice lost 2000 but appears to have only lost 1000.
```

This is a race condition — the classic lost update problem. Files have no mechanism for locking or coordinating concurrent access. Two processes can read and write the file simultaneously, and the result is undefined.
A database solves this with transactions and locking. When Thread A begins transferring money, it acquires locks on the relevant rows. Thread B must wait until Thread A finishes before it can read or write those rows. No data is corrupted.
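A minimal sketch of the same transfer done safely, using SQLite (which ships with Python) as a stand-in for a full DBMS. The table and account data are illustrative; `BEGIN IMMEDIATE` is SQLite's way of taking the write lock up front, so a concurrent transfer must wait rather than read a stale balance.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode, so the
# explicit BEGIN/COMMIT/ROLLBACK below control the transaction directly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, name TEXT, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?, ?)",
                 [(1, "Alice", 5000), (2, "Bob", 3000)])

def transfer(conn, from_id, to_id, amount):
    # BEGIN IMMEDIATE acquires the write lock before any read,
    # preventing the lost-update interleaving shown above.
    conn.execute("BEGIN IMMEDIATE")
    try:
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, from_id))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, to_id))
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")  # undo both updates — never leave a half-done transfer
        raise

transfer(conn, 1, 2, 1000)
print(conn.execute("SELECT name, balance FROM accounts ORDER BY id").fetchall())
# → [('Alice', 4000), ('Bob', 4000)]
```

Either both balances change or neither does — the half-finished states from the file version simply cannot be observed.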
Problem 2: Crash Recovery
Your server loses power mid-write. You were in the middle of deducting 1000 from Alice and adding it to Bob. The deduction was written to disk. The addition was not.
```
Before crash:
Alice=5000, Bob=3000

After crash:
Alice=4000, Bob=3000   ← 1000 vanished
```

With a file, you have no way to recover. The write was partial. You cannot roll it back — there is no record of what the original values were or what the intended operation was.
A database solves this with the Write-Ahead Log (WAL). Before modifying any data, the database writes the intended change to a log. If a crash occurs mid-write, the database replays the log on restart to bring data to a consistent state. Nothing is ever half-done.
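The ordering is the whole trick, and a toy version fits in a few lines. This is a sketch, not how any real engine lays out its WAL (PostgreSQL, for instance, logs page-level changes); the "database" here is just a dict, and the function names are invented for illustration.

```python
import json
import os

def apply_change(db, log_path, key, new_value):
    # Step 1: append the intended change to the log and force it to disk.
    with open(log_path, "a") as f:
        f.write(json.dumps({"key": key, "value": new_value}) + "\n")
        f.flush()
        os.fsync(f.fileno())   # durable on disk BEFORE the data is touched
    # Step 2: only now apply it. A crash here loses nothing: replay recovers it.
    db[key] = new_value

def recover(db, log_path):
    # On restart, replay every logged change. Re-applying is safe because
    # each record stores the final value, so replay is idempotent.
    if os.path.exists(log_path):
        with open(log_path) as f:
            for line in f:
                rec = json.loads(line)
                db[rec["key"]] = rec["value"]
```

If the process dies between the `fsync` and the dict update, `recover()` replays the log and the change is not lost — which is exactly the guarantee the file-based version could not give.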
Problem 3: Search Performance
Your users file grows to 10 million rows, each holding an account ID, name, and email. A user searches for their account by email. Your application reads every line of the CSV, compares the email column, and returns the match.
```python
# O(n) — reads the entire file for every search
def find_user(target_email):
    with open('users.csv') as f:
        for line in f:
            if line.rstrip('\n').split(',')[2] == target_email:
                return line
    return None
```

At 10 million rows, this takes seconds. At 100 million rows, it times out. Files have no query optimization. Every search is a full scan.
A database solves this with indexes — data structures (typically B-Trees) that allow O(log n) lookups instead of O(n) scans. A query against an indexed column on 100 million rows takes milliseconds, not minutes.
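The essence of an index can be shown with a sorted list standing in for a B-Tree: a structure sorted on the search column, mapping each key to the location of its row. The emails and offsets below are made up; `bisect` gives the O(log n) binary search that a real B-Tree generalizes to disk pages.

```python
import bisect

# A tiny "index" on the email column: (email, row_offset) pairs kept sorted.
rows = [("alice@example.com", 0), ("bob@example.com", 1), ("carol@example.com", 2)]
index = sorted(rows)

def lookup(index, email):
    # Binary search instead of a full scan: O(log n) comparisons.
    i = bisect.bisect_left(index, (email,))
    if i < len(index) and index[i][0] == email:
        return index[i][1]   # row offset — one more read fetches the actual row
    return None

print(lookup(index, "bob@example.com"))  # → 1
```

On 100 million rows, binary search needs about 27 comparisons where the scan needs 100 million — the same gap that separates a millisecond query from a timeout.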
Problem 4: Data Integrity
Nothing stops you from writing nonsense into a CSV file:
```
account_id,name,balance
1,Alice,5000
2,,3000                 ← name is empty
3,Carol,-99999          ← negative balance
4,Dave,not_a_number     ← balance is a string
```

Files cannot enforce constraints. A database can:
```sql
CREATE TABLE accounts (
    account_id SERIAL PRIMARY KEY,
    name       VARCHAR(100) NOT NULL,              -- cannot be missing
    balance    NUMERIC(15,2) CHECK (balance >= 0), -- must be non-negative
    email      VARCHAR(255) UNIQUE                 -- must be globally unique
);
```

Any attempt to insert invalid data is rejected before it reaches storage. The database is the last line of defense — it enforces the rules even when application code has a bug.
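You can watch the rejection happen with SQLite (bundled with Python). The schema below mirrors the `CREATE TABLE` above, simplified to SQLite's type system; the rows are the same illustrative data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE accounts (
        account_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        balance    NUMERIC CHECK (balance >= 0)
    )
""")

conn.execute("INSERT INTO accounts VALUES (1, 'Alice', 5000)")   # accepted
try:
    # Carol's negative balance violates the CHECK constraint.
    conn.execute("INSERT INTO accounts VALUES (3, 'Carol', -99999)")
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)
```

The bad row never reaches storage; the application gets an error it can handle instead of silently corrupted data.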
Problem 5: Multi-Table Queries
Your application grows. Accounts now have transactions. Each transaction references an account. With files, you have accounts.csv and transactions.csv. To find all transactions for Alice over the last 30 days, you:
1. Read all of accounts.csv to find Alice's account ID
2. Read all of transactions.csv to filter by account ID and date
3. Join the two results in memory
```python
import csv
from datetime import date, timedelta

# O(n × m) — reads both files entirely, joins in memory
def read_csv(path):
    with open(path) as f:
        return list(csv.DictReader(f))

accounts = read_csv('accounts.csv')
transactions = read_csv('transactions.csv')
thirty_days_ago = (date.today() - timedelta(days=30)).isoformat()

alice_id = next(a['account_id'] for a in accounts if a['name'] == 'Alice')
result = [t for t in transactions
          if t['account_id'] == alice_id
          and t['date'] >= thirty_days_ago]
```

At scale, this is unworkable. A database executes this as a single optimized query:
```sql
SELECT t.*
FROM transactions t
JOIN accounts a ON t.account_id = a.id
WHERE a.name = 'Alice'
  AND t.date >= NOW() - INTERVAL '30 days';
```

The query planner selects the most efficient execution path — using indexes, choosing join algorithms, estimating row counts. You describe what you want; the database decides how to get it.
What Is a DBMS?
A Database Management System (DBMS) is the software that sits between your application and the raw data. It is the engine that provides all five guarantees above — and many more.
The components of a DBMS:
```
       Your Application
              │
              ▼
┌─────────────────────────────────────┐
│               DBMS                  │
│                                     │
│  Query Parser & Optimizer           │ ← Translates SQL into an execution plan
│  Transaction Manager                │ ← Manages ACID guarantees
│  Lock Manager                       │ ← Coordinates concurrent access
│  Buffer Pool Manager                │ ← Caches disk pages in memory
│  Storage Engine                     │ ← Reads/writes data on disk
│  Recovery Manager                   │ ← Handles crashes via the WAL
│                                     │
└─────────────────────────────────────┘
              │
              ▼
       Storage (Disk / SSD)
```

When you run `SELECT * FROM accounts WHERE id = 1`, this is what happens:
1. The parser validates the SQL syntax and builds a parse tree
2. The optimizer evaluates multiple execution plans and picks the cheapest one (using indexes if available)
3. The executor runs the plan, calling the storage engine to fetch data
4. The buffer pool checks whether the needed disk page is already in memory (a cache hit) — if not, it reads from disk
5. The result is returned to your application
Every step is managed by the DBMS. Your application sees only the result.
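Of these components, the buffer pool is the easiest to sketch: a bounded cache of disk pages with least-recently-used eviction. This is a toy, not any engine's actual implementation — `read_page_from_disk` stands in for the storage engine, and page contents here are just strings.

```python
from collections import OrderedDict

class BufferPool:
    """A toy buffer pool: keeps at most `capacity` pages in memory."""

    def __init__(self, capacity, read_page_from_disk):
        self.capacity = capacity
        self.read_page = read_page_from_disk
        self.pages = OrderedDict()   # page_id -> page, oldest first

    def get(self, page_id):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)    # cache hit: mark recently used
            return self.pages[page_id]
        page = self.read_page(page_id)          # cache miss: go to disk
        self.pages[page_id] = page
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)      # evict the least recently used page
        return page
```

A hot page (a table's root index page, say) stays pinned in memory across thousands of queries, while cold pages are read once and evicted — which is why a second run of the same query is usually far faster than the first.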
Types of Databases
Not every problem needs the same solution. Databases have evolved into several families, each optimized for different workloads:
Relational (RDBMS) — data stored in tables with rows and columns, relationships enforced via foreign keys, queried with SQL. Strong ACID guarantees. Best for structured data with complex relationships. Examples: PostgreSQL, MySQL, SQLite, Oracle, SQL Server
Document — data stored as JSON-like documents. Flexible schema — each document can have different fields. Good for hierarchical data that maps naturally to objects in code. Examples: MongoDB, CouchDB, Firestore
Key-Value — data stored as simple key → value pairs. Extremely fast reads and writes. No query language — you fetch by key only. Best for caching, sessions, and simple lookups. Examples: Redis, DynamoDB, etcd
Wide-Column — data stored in tables but with dynamic columns per row. Optimized for write-heavy workloads at massive scale. Eventual consistency by default. Examples: Cassandra, HBase, Google Bigtable
Graph — data stored as nodes and edges. Native support for traversing relationships. Best for social networks, recommendation engines, and fraud detection. Examples: Neo4j, Amazon Neptune
Time-Series — data stored with a timestamp as the primary dimension. Optimized for append-heavy writes and time-range queries. Best for metrics, monitoring, and IoT. Examples: InfluxDB, TimescaleDB, Prometheus
The key insight: these are not competing products — they are different tools. PostgreSQL and Redis are not alternatives; they solve different problems. A production system often uses several databases simultaneously.
OLTP vs OLAP: Two Different Workloads
Within relational databases, there is a critical distinction that shapes every design decision:
OLTP (Online Transaction Processing) — high-volume, low-latency operations on individual rows. This is your application database — inserts, updates, lookups by primary key. Thousands of small transactions per second.
```sql
-- Typical OLTP query: fetch one row by primary key
SELECT * FROM orders WHERE id = 12345;

-- Typical OLTP write: update one row
UPDATE accounts SET balance = balance - 100 WHERE id = 42;
```

OLAP (Online Analytical Processing) — low-volume, long-running operations that scan large portions of the data to compute aggregates. This is your analytics database — sums, counts, and averages across millions of rows.
```sql
-- Typical OLAP query: aggregate across millions of rows
SELECT
    DATE_TRUNC('month', created_at) AS month,
    country,
    SUM(amount) AS total_revenue,
    COUNT(DISTINCT user_id) AS unique_customers
FROM orders
WHERE created_at >= '2025-01-01'
GROUP BY 1, 2
ORDER BY 1, 3 DESC;
```

Running OLAP queries on an OLTP database is one of the most common causes of production performance crises. A single long-running analytical query can lock tables and block thousands of short transactional queries.
The solution is separation: OLTP data lives in PostgreSQL or MySQL, tuned for row-level operations. Analytical data is replicated to a columnar data warehouse (BigQuery, Redshift, Snowflake), tuned for full-table scans and aggregations.
The Database Selection Framework
When starting a new project or evaluating a change, these are the questions to ask:
```
What is the primary access pattern?
├─ Fetch by primary key, low latency  → Key-Value (Redis, DynamoDB)
├─ Complex queries, relationships     → Relational (PostgreSQL)
├─ Hierarchical/nested documents      → Document (MongoDB)
├─ Write-heavy, massive scale         → Wide-Column (Cassandra)
├─ Relationship traversal             → Graph (Neo4j)
└─ Time-series metrics                → Time-Series (InfluxDB)

Do you need ACID transactions?
├─ Yes, across multiple entities      → Relational
└─ No, or eventual consistency is OK  → NoSQL options

How stable is the schema?
├─ Fixed, well-defined schema         → Relational
└─ Flexible, evolving per document    → Document

What is the scale requirement?
├─ Single machine, moderate traffic   → Relational (PostgreSQL handles more than you think)
└─ Horizontal scale, massive writes   → NoSQL (Cassandra, DynamoDB)
```

The most common mistake is reaching for a distributed NoSQL database when PostgreSQL on a single large machine would handle the load — with far less operational complexity. PostgreSQL can comfortably handle tens of thousands of queries per second and terabytes of data. The threshold where you genuinely need horizontal scaling is higher than most teams assume.
What's Next in This Chapter
This post answered the foundational question — what a database is and why files are not enough. The next four posts in this chapter build on this foundation:
Post 02: The relational model — tables, keys, and normalization from 1NF to BCNF
Post 03: SQL fundamentals — every JOIN type, GROUP BY, and the queries that trip everyone up
Post 04: Indexes — how B-Tree and Hash indexes work and when to use each
Post 05: Transactions — ACID properties, isolation levels, and the concurrency bugs each prevents
By the end of Chapter 1, you will have the conceptual foundation to understand everything that follows — storage internals, replication, sharding, and the distributed systems problems that make database engineering genuinely hard.