NoSQL Explained: Document, Key-Value, Wide-Column and When NOT to Use SQL

The Name Is Misleading

"NoSQL" does not mean "no SQL" or "anti-SQL." It originated as shorthand for "Not Only SQL" — a recognition that relational databases are excellent tools for many problems and wrong tools for others.

The NoSQL movement emerged from companies like Google, Amazon, Facebook, and LinkedIn hitting walls that relational databases could not break through: petabyte-scale datasets, millions of writes per second, global distribution with sub-millisecond latency. They built new storage systems optimized for their specific workloads — and those systems became the NoSQL landscape.

NoSQL databases are not universally better than relational databases. They are better at specific things: document storage, key-value access, write-heavy time-series, graph traversal. For everything else — complex queries, multi-table joins, strong ACID transactions, ad-hoc analytics — a relational database remains superior.

The skill is matching the storage system to the workload. This post maps the four NoSQL families to the problems they solve.

The Four NoSQL Families

Family 1: Key-Value Stores

The simplest NoSQL model: every record is a key-value pair. Lookup by key is O(1). No schema, no query language, no joins. Just get and set.

Data model:
  Key: "user:42:session"
  Value: {"user_id": 42, "token": "abc123", "expires_at": 1720000000}

Operations:
  GET user:42:session        → returns the value
  SET user:42:session value  → stores the value
  DEL user:42:session        → deletes the key
  TTL user:42:session        → returns remaining time-to-live

What key-value stores are good at:

Session storage (user:42:session → session data)
Caching (product:7:price → cached price)
Rate limiting counters (ratelimit:ip:192.168.1.1 → count)
Feature flags (feature:dark_mode:user_42 → true/false)
Distributed locks

What they are not good at:

Queries by value (find all sessions expiring in the next hour)
Relationships between keys
Anything requiring a schema or constraints

Examples: Redis, Memcached, DynamoDB (in key-value mode), etcd, Amazon ElastiCache

Family 2: Document Stores

Documents are self-contained, schema-flexible records — typically JSON or BSON. Each document can have a different structure. Related data is embedded within the document rather than spread across normalized tables.

Data model — a single order document:
{
  "_id": ObjectId("507f1f77bcf86cd799439011"),
  "customer": {
    "id": 42,
    "name": "Alice Smith",
    "email": "alice@example.com"
  },
  "items": [
    {"product_id": 7,  "name": "Laptop",  "qty": 1, "price": 999.00},
    {"product_id": 14, "name": "Mouse",   "qty": 2, "price": 29.99}
  ],
  "total": 1058.98,
  "status": "shipped",
  "shipping_address": {
    "street": "123 Main St",
    "city": "New York",
    "country": "US"
  },
  "created_at": ISODate("2025-03-15T10:23:00Z")
}

What document stores are good at:

Hierarchical data that maps naturally to objects in application code
Variable-structure data (product attributes differ per category)
Rapid development when schema is evolving
Content management (articles, user profiles, catalogs)
Applications where the natural access pattern is "fetch one entity with all its data"

What they are not good at:

Complex multi-document joins (you can do them, but they are expensive)
Strong cross-document ACID transactions (improving, but not native)
Highly relational data where normalization is required for correctness

Examples: MongoDB, CouchDB, Firestore, Amazon DocumentDB, Couchbase

Family 3: Wide-Column Stores

Data is stored in tables with rows and columns, but unlike relational databases, columns are dynamic — each row can have a different set of columns. Designed for massive write throughput and horizontal scaling.

Data model (Cassandra):

Table: user_activity
Partition key: user_id
Clustering key: timestamp (descending)

user_id  │ timestamp           │ event_type  │ metadata
─────────┼─────────────────────┼─────────────┼──────────────────────
42       │ 2025-03-15 10:23:00 │ page_view   │ {url: "/products/7"}
42       │ 2025-03-15 10:22:45 │ search      │ {query: "laptop"}
42       │ 2025-03-15 10:20:00 │ login       │ {ip: "1.2.3.4"}
17       │ 2025-03-15 10:25:00 │ purchase    │ {order_id: "ABC"}

What wide-column stores are good at:

Write-heavy workloads (millions of writes per second)
Time-series data (events, metrics, IoT sensor readings)
Queries with a known access pattern — "all events for user 42 in the last 7 days"
Massive scale across many nodes (petabytes, globally distributed)

What they are not good at:

Ad-hoc queries (must know access pattern at schema design time)
Multi-row transactions
Secondary indexes at scale (expensive in Cassandra)
Any query that does not include the partition key

Examples: Apache Cassandra, Google Bigtable, HBase, Amazon Keyspaces, ScyllaDB

Family 4: Graph Databases

Data is stored as nodes (entities) and edges (relationships). Traversing relationships is a first-class operation — not a JOIN, but a native graph traversal that is O(edges) rather than O(rows).

Data model (property graph):

Nodes:
  Person {id: 1, name: "Alice"}
  Person {id: 2, name: "Bob"}
  Person {id: 3, name: "Carol"}
  Product {id: 101, name: "Laptop"}

Edges:
  Alice KNOWS Bob {since: 2020}
  Bob KNOWS Carol {since: 2019}
  Alice PURCHASED Product:101 {date: "2025-03-01"}

Query (Cypher — Neo4j's query language):
  MATCH (alice:Person {name: "Alice"})-[:KNOWS*1..3]->(friend:Person)
  RETURN friend.name
  → Returns everyone within 3 degrees of Alice

  MATCH (p:Person)-[:PURCHASED]->(prod:Product)
  WHERE prod.name = "Laptop"
  RETURN p.name
  → Returns everyone who bought a Laptop

What graph databases are good at:

Social networks (followers, friends of friends)
Recommendation engines ("people who bought X also bought Y")
Fraud detection (connected accounts, shared payment methods)
Knowledge graphs
Access control (hierarchical permissions, role inheritance)

What they are not good at:

Bulk data processing (not optimized for table scans)
Simple key-value or document access patterns
Write-heavy workloads without complex relationships

Examples: Neo4j, Amazon Neptune, ArangoDB, TigerGraph, Dgraph

SQL vs NoSQL: The Decision Matrix

Workload characteristic → Recommended approach
─────────────────────────────────────────────────────────────────────

Complex queries, ad-hoc joins, aggregations
  → Relational (PostgreSQL, MySQL)

Strong ACID transactions across multiple entities
  → Relational, or NewSQL (CockroachDB, Spanner)

Hierarchical data, variable schema per record
  → Document (MongoDB, Firestore)

Key-based access, caching, sessions, counters
  → Key-Value (Redis, DynamoDB)

Write-heavy time-series, IoT, event streams
  → Wide-Column (Cassandra, Bigtable)

Relationship traversal, social graph, fraud detection
  → Graph (Neo4j, Neptune)

Simple lookups + write throughput + eventual consistency OK
  → DynamoDB (key-value + document hybrid)

Full-text search
  → Elasticsearch, OpenSearch (specialized search engine)

Analytics, OLAP, columnar aggregations
  → Columnar data warehouse (BigQuery, Redshift, Snowflake)

The Schema Trade-off: Flexibility vs Safety

The most visible difference between relational and document databases is schema enforcement.

Relational databases enforce schema at write time: The database rejects any row that violates the table's column definitions, types, or constraints. Bad data never enters the database.

Document databases enforce schema at read time (by default): Any document can be written regardless of shape. The application is responsible for validating structure. Invalid data can silently enter the database and only cause errors when read.

// MongoDB: both writes succeed — no schema validation by default
db.users.insertOne({ name: "Alice", email: "alice@example.com", age: 30 });
db.users.insertOne({ name: "Bob", email: "not-an-email", age: "thirty" });
//                                  ↑ invalid email    ↑ wrong type — still inserts

// MongoDB schema validation (JSON Schema — adds relational-style enforcement):
db.createCollection("users", {
  validator: {
    $jsonSchema: {
      bsonType: "object",
      required: ["name", "email"],
      properties: {
        name:  { bsonType: "string" },
        email: { bsonType: "string", pattern: "^.+@.+\\..+$" },
        age:   { bsonType: "int", minimum: 0, maximum: 150 }
      }
    }
  }
});
// Now invalid documents are rejected at write time

Schema flexibility is a feature for rapid prototyping and genuinely variable data. It is a liability for data that should have consistent structure. Modern document databases support optional schema validation for when you need both.

Polyglot Persistence: Using Multiple Databases

Most production systems use multiple databases — each chosen for its strengths.

Typical production architecture:

PostgreSQL (primary store)
  → User accounts, orders, payments, inventory
  → Strong ACID, complex queries, consistent schema

Redis (caching + sessions)
  → Session tokens (TTL-based expiry)
  → Product price cache (invalidated on update)
  → Rate limiting counters
  → Leaderboards (sorted sets)

MongoDB (flexible content)
  → Product catalog (variable attributes per category)
  → User-generated content (reviews, comments)
  → CMS content (articles with variable structure)

Cassandra (event stream)
  → User activity events (millions/second)
  → IoT sensor readings
  → Time-series metrics

Elasticsearch (search)
  → Full-text product search
  → Log aggregation and search

Neo4j (recommendations)
  → "Customers who bought X also bought Y"
  → Fraud detection (connected account networks)

Polyglot persistence trade-offs:

Each database has its own operational overhead (backup, monitoring, scaling)
Data synchronization between stores introduces eventual consistency
More systems = more failure modes = more oncall incidents

The pragmatic rule: Start with PostgreSQL. Add Redis when you need caching or sessions. Add a specialized database only when PostgreSQL genuinely cannot serve the workload — not when you think it might struggle someday.

When to Reach for NoSQL: The Honest Assessment

The honest reasons to use NoSQL over PostgreSQL:

Legitimate reasons:

Write throughput genuinely exceeds what PostgreSQL can handle (>100K writes/sec per table)
Data is truly schemaless or highly variable per record (product catalog with 50+ different attribute sets)
Horizontal scaling is required and NewSQL is too expensive or complex
The access pattern is exclusively key-based with no need for SQL queries
You need sub-millisecond cached reads at massive scale

Illegitimate reasons (often given, usually wrong):

"We might need to scale eventually" → PostgreSQL handles more than you think
"JSON is more flexible" → PostgreSQL has excellent JSONB support
"NoSQL is faster" → depends entirely on workload; PostgreSQL with indexes often wins
"Everyone is using it" → cargo-culting a technology choice
"We don't want to think about schema" → schema-free means application bugs silently corrupt data

🧭 What's Next

Post 20: MongoDB Deep Dive — easy to start, hard to optimize; embedding vs referencing, the full aggregation pipeline, and the index strategies that make the difference between fast and painfully slow

The Name Is Misleading

The Four NoSQL Families

Family 1: Key-Value Stores

Family 2: Document Stores

Family 3: Wide-Column Stores

Family 4: Graph Databases

SQL vs NoSQL: The Decision Matrix

The Schema Trade-off: Flexibility vs Safety

Polyglot Persistence: Using Multiple Databases

When to Reach for NoSQL: The Honest Assessment

🧭 What's Next

Related

Distributed Transactions: Two-Phase Commit, Saga and When to Use Each

CAP Theorem and PACELC: What They Mean for Real Database Decisions

Database Partitioning and Sharding: Range, Hash and Consistent Hashing

Leave a comment

Comments