NoSQL Explained: Document, Key-Value, Wide-Column and When NOT to Use SQL
The Name Is Misleading
"NoSQL" does not mean "no SQL" or "anti-SQL." It originated as shorthand for "Not Only SQL" — a recognition that relational databases are excellent tools for many problems and wrong tools for others.
The NoSQL movement emerged from companies like Google, Amazon, Facebook, and LinkedIn hitting walls that relational databases could not break through: petabyte-scale datasets, millions of writes per second, global distribution with sub-millisecond latency. They built new storage systems optimized for their specific workloads — and those systems became the NoSQL landscape.
NoSQL databases are not universally better than relational databases. They are better at specific things: document storage, key-value access, write-heavy time-series, graph traversal. For everything else — complex queries, multi-table joins, strong ACID transactions, ad-hoc analytics — a relational database remains superior.
The skill is matching the storage system to the workload. This post maps the four NoSQL families to the problems they solve.
The Four NoSQL Families
Family 1: Key-Value Stores
The simplest NoSQL model: every record is a key-value pair. Lookup by key is O(1). No schema, no query language, no joins. Just get and set.
Data model:
Key: "user:42:session"
Value: {"user_id": 42, "token": "abc123", "expires_at": 1720000000}
Operations:
GET user:42:session → returns the value
SET user:42:session value → stores the value
DEL user:42:session → deletes the key
TTL user:42:session → returns remaining time-to-liveWhat key-value stores are good at:
Session storage (user:42:session → session data)
Caching (product:7:price → cached price)
Rate limiting counters (ratelimit:ip:192.168.1.1 → count)
Feature flags (feature:dark_mode:user_42 → true/false)
Distributed locks
What they are not good at:
Queries by value (find all sessions expiring in the next hour)
Relationships between keys
Anything requiring a schema or constraints
Examples: Redis, Memcached, DynamoDB (in key-value mode), etcd, Amazon ElastiCache
Family 2: Document Stores
Documents are self-contained, schema-flexible records — typically JSON or BSON. Each document can have a different structure. Related data is embedded within the document rather than spread across normalized tables.
Data model — a single order document:
{
"_id": ObjectId("507f1f77bcf86cd799439011"),
"customer": {
"id": 42,
"name": "Alice Smith",
"email": "alice@example.com"
},
"items": [
{"product_id": 7, "name": "Laptop", "qty": 1, "price": 999.00},
{"product_id": 14, "name": "Mouse", "qty": 2, "price": 29.99}
],
"total": 1058.98,
"status": "shipped",
"shipping_address": {
"street": "123 Main St",
"city": "New York",
"country": "US"
},
"created_at": ISODate("2025-03-15T10:23:00Z")
}What document stores are good at:
Hierarchical data that maps naturally to objects in application code
Variable-structure data (product attributes differ per category)
Rapid development when schema is evolving
Content management (articles, user profiles, catalogs)
Applications where the natural access pattern is "fetch one entity with all its data"
What they are not good at:
Complex multi-document joins (you can do them, but they are expensive)
Strong cross-document ACID transactions (improving, but not native)
Highly relational data where normalization is required for correctness
Examples: MongoDB, CouchDB, Firestore, Amazon DocumentDB, Couchbase
Family 3: Wide-Column Stores
Data is stored in tables with rows and columns, but unlike relational databases, columns are dynamic — each row can have a different set of columns. Designed for massive write throughput and horizontal scaling.
Data model (Cassandra):
Table: user_activity
Partition key: user_id
Clustering key: timestamp (descending)
user_id │ timestamp │ event_type │ metadata
─────────┼─────────────────────┼─────────────┼──────────────────────
42 │ 2025-03-15 10:23:00 │ page_view │ {url: "/products/7"}
42 │ 2025-03-15 10:22:45 │ search │ {query: "laptop"}
42 │ 2025-03-15 10:20:00 │ login │ {ip: "1.2.3.4"}
17 │ 2025-03-15 10:25:00 │ purchase │ {order_id: "ABC"}What wide-column stores are good at:
Write-heavy workloads (millions of writes per second)
Time-series data (events, metrics, IoT sensor readings)
Queries with a known access pattern — "all events for user 42 in the last 7 days"
Massive scale across many nodes (petabytes, globally distributed)
What they are not good at:
Ad-hoc queries (must know access pattern at schema design time)
Multi-row transactions
Secondary indexes at scale (expensive in Cassandra)
Any query that does not include the partition key
Examples: Apache Cassandra, Google Bigtable, HBase, Amazon Keyspaces, ScyllaDB
Family 4: Graph Databases
Data is stored as nodes (entities) and edges (relationships). Traversing relationships is a first-class operation — not a JOIN, but a native graph traversal that is O(edges) rather than O(rows).
Data model (property graph):
Nodes:
Person {id: 1, name: "Alice"}
Person {id: 2, name: "Bob"}
Person {id: 3, name: "Carol"}
Product {id: 101, name: "Laptop"}
Edges:
Alice KNOWS Bob {since: 2020}
Bob KNOWS Carol {since: 2019}
Alice PURCHASED Product:101 {date: "2025-03-01"}
Query (Cypher — Neo4j's query language):
MATCH (alice:Person {name: "Alice"})-[:KNOWS*1..3]->(friend:Person)
RETURN friend.name
→ Returns everyone within 3 degrees of Alice
MATCH (p:Person)-[:PURCHASED]->(prod:Product)
WHERE prod.name = "Laptop"
RETURN p.name
→ Returns everyone who bought a LaptopWhat graph databases are good at:
Social networks (followers, friends of friends)
Recommendation engines ("people who bought X also bought Y")
Fraud detection (connected accounts, shared payment methods)
Knowledge graphs
Access control (hierarchical permissions, role inheritance)
What they are not good at:
Bulk data processing (not optimized for table scans)
Simple key-value or document access patterns
Write-heavy workloads without complex relationships
Examples: Neo4j, Amazon Neptune, ArangoDB, TigerGraph, Dgraph
SQL vs NoSQL: The Decision Matrix
Workload characteristic → Recommended approach
─────────────────────────────────────────────────────────────────────
Complex queries, ad-hoc joins, aggregations
→ Relational (PostgreSQL, MySQL)
Strong ACID transactions across multiple entities
→ Relational, or NewSQL (CockroachDB, Spanner)
Hierarchical data, variable schema per record
→ Document (MongoDB, Firestore)
Key-based access, caching, sessions, counters
→ Key-Value (Redis, DynamoDB)
Write-heavy time-series, IoT, event streams
→ Wide-Column (Cassandra, Bigtable)
Relationship traversal, social graph, fraud detection
→ Graph (Neo4j, Neptune)
Simple lookups + write throughput + eventual consistency OK
→ DynamoDB (key-value + document hybrid)
Full-text search
→ Elasticsearch, OpenSearch (specialized search engine)
Analytics, OLAP, columnar aggregations
→ Columnar data warehouse (BigQuery, Redshift, Snowflake)The Schema Trade-off: Flexibility vs Safety
The most visible difference between relational and document databases is schema enforcement.
Relational databases enforce schema at write time: The database rejects any row that violates the table's column definitions, types, or constraints. Bad data never enters the database.
Document databases enforce schema at read time (by default): Any document can be written regardless of shape. The application is responsible for validating structure. Invalid data can silently enter the database and only cause errors when read.
// MongoDB: both writes succeed — no schema validation by default
db.users.insertOne({ name: "Alice", email: "alice@example.com", age: 30 });
db.users.insertOne({ name: "Bob", email: "not-an-email", age: "thirty" });
// ↑ invalid email ↑ wrong type — still inserts
// MongoDB schema validation (JSON Schema — adds relational-style enforcement):
db.createCollection("users", {
validator: {
$jsonSchema: {
bsonType: "object",
required: ["name", "email"],
properties: {
name: { bsonType: "string" },
email: { bsonType: "string", pattern: "^.+@.+\\..+$" },
age: { bsonType: "int", minimum: 0, maximum: 150 }
}
}
}
});
// Now invalid documents are rejected at write timeSchema flexibility is a feature for rapid prototyping and genuinely variable data. It is a liability for data that should have consistent structure. Modern document databases support optional schema validation for when you need both.
Polyglot Persistence: Using Multiple Databases
Most production systems use multiple databases — each chosen for its strengths.
Typical production architecture:
PostgreSQL (primary store)
→ User accounts, orders, payments, inventory
→ Strong ACID, complex queries, consistent schema
Redis (caching + sessions)
→ Session tokens (TTL-based expiry)
→ Product price cache (invalidated on update)
→ Rate limiting counters
→ Leaderboards (sorted sets)
MongoDB (flexible content)
→ Product catalog (variable attributes per category)
→ User-generated content (reviews, comments)
→ CMS content (articles with variable structure)
Cassandra (event stream)
→ User activity events (millions/second)
→ IoT sensor readings
→ Time-series metrics
Elasticsearch (search)
→ Full-text product search
→ Log aggregation and search
Neo4j (recommendations)
→ "Customers who bought X also bought Y"
→ Fraud detection (connected account networks)Polyglot persistence trade-offs:
Each database has its own operational overhead (backup, monitoring, scaling)
Data synchronization between stores introduces eventual consistency
More systems = more failure modes = more oncall incidents
The pragmatic rule: Start with PostgreSQL. Add Redis when you need caching or sessions. Add a specialized database only when PostgreSQL genuinely cannot serve the workload — not when you think it might struggle someday.
When to Reach for NoSQL: The Honest Assessment
The honest reasons to use NoSQL over PostgreSQL:
Legitimate reasons:
Write throughput genuinely exceeds what PostgreSQL can handle (>100K writes/sec per table)
Data is truly schemaless or highly variable per record (product catalog with 50+ different attribute sets)
Horizontal scaling is required and NewSQL is too expensive or complex
The access pattern is exclusively key-based with no need for SQL queries
You need sub-millisecond cached reads at massive scale
Illegitimate reasons (often given, usually wrong):
"We might need to scale eventually" → PostgreSQL handles more than you think
"JSON is more flexible" → PostgreSQL has excellent JSONB support
"NoSQL is faster" → depends entirely on workload; PostgreSQL with indexes often wins
"Everyone is using it" → cargo-culting a technology choice
"We don't want to think about schema" → schema-free means application bugs silently corrupt data
🧭 What's Next
Post 20: MongoDB Deep Dive — easy to start, hard to optimize; embedding vs referencing, the full aggregation pipeline, and the index strategies that make the difference between fast and painfully slow
Related
Distributed Transactions: Two-Phase Commit, Saga and When to Use Each
Distributed transactions are hard. 2PC gives safety at the cost of availability. Saga gives availability at the cost of complexity. This post explains both and when each is right.
CAP Theorem and PACELC: What They Mean for Real Database Decisions
CAP theorem is misunderstood constantly. PACELC is more useful but almost unknown. This post explains both with concrete database examples and the decisions they actually inform.
Database Partitioning and Sharding: Range, Hash and Consistent Hashing
Sharding splits data across multiple nodes and creates a new class of problems. This post covers range, hash, and consistent hashing with trade-offs that matter in production.
Comments