Replication Lag and Consistency: Read-Your-Own-Writes, Monotonic Reads and More

The Bug That Only Appears in Production

Your user posts a comment. They immediately refresh the page. The comment is gone — then reappears a second later. Your user thinks the site is broken.

What happened: the write went to the leader. The read went to a follower. The follower had not yet received the write. The data was there — just not on the replica that served the read.

This is replication lag — the delay between a write being committed on the leader and that write becoming visible on a replica. In development, replication lag is typically zero or near-zero. In production under load, it can be milliseconds, seconds, or minutes.

Replication lag is not a bug in the database. It is a fundamental consequence of asynchronous replication. The question is not how to eliminate it but how to build applications that remain correct in its presence. The answers are consistency guarantees — contracts between the database and the application about what can and cannot be observed.

What Causes Replication Lag

Replication lag accumulates when the replica cannot process WAL/binlog records as fast as the leader produces them:

Sources of replication lag:

1. Network latency between leader and replica
   → Cross-datacenter replication: typically 10–100ms baseline lag

2. Replica I/O bottleneck
   → Replica disk is slower than leader or under separate write load
   → Replica receiving replication stream and serving reads simultaneously

3. Replica CPU bottleneck
   → Leader applies writes across many parallel connections
   → Replica applies them in a single serial stream (MySQL default)
   → One long-running transaction on leader blocks all subsequent replication

4. Logical replication overhead
   → Row-based replication of bulk operations (UPDATE 1M rows) produces
     1M row-level events, which may take minutes to apply on replica

5. Lock contention on replica
   → Long-running read queries on replica can block replication apply

-- PostgreSQL: check replication lag per replica
SELECT
    client_addr,
    state,
    sent_lsn,
    replay_lsn,
    write_lag,
    flush_lag,
    replay_lag   -- time between leader commit and replica apply
FROM pg_stat_replication;

-- MySQL: check replication lag
SHOW REPLICA STATUS\G
-- Look for: Seconds_Behind_Source (seconds of lag)
--           Replica_IO_Running: Yes
--           Replica_SQL_Running: Yes

The Four Consistency Guarantees

These are the standard consistency guarantees that databases and applications can implement to remain correct under replication lag.

Guarantee 1: Read-Your-Own-Writes (Read-Your-Own-Writes Consistency)

The problem: A user writes data, then reads it back. If the read goes to a replica that has not yet received the write, the user sees their own write disappear.

Timeline without read-your-own-writes guarantee:

T=1  User submits profile update: name → "Alice Smith"
T=2  Write committed on Leader
T=3  User refreshes profile page
T=4  Read goes to Replica (lag = 2 seconds behind leader)
T=5  Replica returns: name = "Alice Johnson"  ← old value
T=6  User sees their change was not saved — panics

[The change IS saved. The replica just hasn't received it yet.]

The guarantee: A user always reads their own writes. If you wrote it, you can read it back.

Implementation strategies:

Strategy 1 — Route reads for recently written data to the leader:

After writing, tag the user's session with "wrote at timestamp T"
For reads within X seconds of the write: route to leader
After X seconds: route to replica

  if user.last_write_at > time.now() - 60:
      read from LEADER
  else:
      read from REPLICA

Trade-off: leader handles more reads, but only for recently active writers.

────────────────────────────────────────────────────────

Strategy 2 — Track write LSN, read from replica only when caught up:

After write commits, record the LSN of that commit:
  session.last_write_lsn = current_lsn  -- e.g. 0/1A2B3C

Before reading from replica, check replica's replay LSN:
  if replica.replay_lsn >= session.last_write_lsn:
      read from this replica  -- it has our write
  else:
      wait, retry with another replica, or route to leader

PostgreSQL makes this easy:
  SELECT pg_current_wal_lsn();           -- leader's current LSN
  SELECT pg_last_wal_replay_lsn();       -- replica's applied LSN

────────────────────────────────────────────────────────

Strategy 3 — For data the user always owns, always read from leader:

User profile, account settings, billing info → always read from leader
Product listings, public content → can read from replica

Simple, predictable, no session state needed.
Works when "your own data" is a small fraction of all reads.

In practice: Most applications implement strategy 3 for critical user-facing data and strategy 1 or 2 for everything else. The key insight: not all data needs read-your-own-writes consistency — only data that the user just modified.

Guarantee 2: Monotonic Reads

The problem: A user makes two reads in sequence. The second read returns older data than the first. The user appears to see data "go back in time."

Timeline without monotonic reads:

T=1  User reads comments on post #5
T=2  Read goes to Replica 1 (lag = 0s): returns 10 comments
T=3  User scrolls down, browser fetches more
T=4  Read goes to Replica 2 (lag = 5s): returns 7 comments
     ← Replica 2 is behind Replica 1

User sees: 10 comments → 7 comments. Comments vanished.
This is confusing and feels like a bug.

The guarantee: If a user has seen a value at time T, they will never see an older value for the same data in subsequent reads within the same session.

Implementation:

Strategy — Sticky sessions (session affinity):

Assign each user session to a specific replica.
All reads for that session go to the same replica.
The same replica may be behind, but it never goes backward.

  session_id → replica_id mapping (stored in Redis, sticky cookie, etc.)
  
  replica_for_user = hash(user_id) % num_replicas
  always_read_from(replica_for_user)

Trade-off:
  + Simple to implement
  + Guarantees monotonic reads for a session
  - If the assigned replica fails, the user must be rerouted
    (and may see a temporarily older state on the new replica)
  - Uneven load distribution if some users are more active

────────────────────────────────────────────────────────

Alternative — Track read LSN in session:

After each read, store the LSN of the replica that served it:
  session.read_lsn = replica.replay_lsn

Before the next read:
  only read from replicas where replay_lsn >= session.read_lsn

This ensures reads only go to replicas that are at least as
up-to-date as the last replica the user read from.

Guarantee 3: Consistent Prefix Reads

The problem: In a distributed system with multiple writers or partitions, reads may see writes in a different order than they were applied, violating causality.

Timeline without consistent prefix reads:

Reality:
  T=1  Alice writes: "How are you?"
  T=2  Bob writes: "I'm fine, thanks!"  (Bob read Alice's message)

What Replica A sees (receives Bob's write first due to network):
  T=1  "I'm fine, thanks!"  ← looks like Bob spoke first
  T=2  "How are you?"       ← then Alice asked

A third user reading from Replica A sees Bob answer a question
that has not been asked yet. The conversation makes no sense.

The guarantee: If a sequence of writes happened in a certain order, reads will see them in that order — causally related writes are never seen out of order.

Implementation:

Strategy 1 — Single leader with sequential writes:
  If all causally related writes go through the same leader,
  the leader assigns them LSNs in order.
  Replicas apply WAL in order → consistent prefix preserved.
  
  Works naturally in single-leader replication.

────────────────────────────────────────────────────────

Strategy 2 — Logical timestamps / vector clocks:
  Each write carries a logical timestamp.
  Readers only see writes up to the timestamp they have already seen.
  
  Alice's message: {id: 1, ts: 100, body: "How are you?"}
  Bob's message:   {id: 2, ts: 101, body: "I'm fine!", reply_to: {id: 1, ts: 100}}

  A reader tracks: "I have seen up to ts=100"
  Bob's message (ts=101) is not shown until Alice's (ts=100) is visible.

────────────────────────────────────────────────────────

Strategy 3 — Causality tracking in the application:
  The application explicitly records causal dependencies.
  Before showing B, verify A is visible.
  
  This is how Google Spanner, CockroachDB, and similar systems
  implement external consistency — they use synchronized clocks
  (TrueTime in Spanner) to assign timestamps that respect causality
  globally across datacenters.

Guarantee 4: Bounded Staleness (Monotonic Writes with Lag Bound)

The problem: Async replicas may be arbitrarily far behind. An application reading from replicas may be reading data that is minutes or hours old without knowing it.

The guarantee: Reads never return data older than some bound X (e.g. 30 seconds). If a replica is more than X behind the leader, reads are routed elsewhere or blocked.

Implementation:

Each replica exposes its current lag:
  replica.lag_seconds = (leader_lsn - replica_replay_lsn) / estimated_lsn_rate

Load balancer only routes reads to replicas where lag < threshold:
  if replica.lag_seconds <= 30:
      route_read(replica)
  else:
      route_read(leader)   -- fallback to leader

PostgreSQL: pg_stat_replication.replay_lag provides this directly.

────────────────────────────────────────────────────────

In practice — AWS RDS read replicas:
  ReplicaLag CloudWatch metric monitors lag per replica
  Application logic: if ReplicaLag > 30s → read from primary
  Aurora Global Database: typically < 1 second lag across regions

Combining Guarantees in Practice

Real applications need multiple guarantees simultaneously. Here is how a typical read-heavy web application implements them:

class DatabaseRouter:
    MAX_LAG_SECONDS = 30

    def get_connection(self, user_id, query_type, session):
        """
        Route reads based on consistency requirements.
        """
        if query_type == "user_own_data":
            # Read-your-own-writes: check if user recently wrote
            if session.get("last_write_age_seconds", 999) < 60:
                return self.leader_connection()

        if query_type == "feed_or_listing":
            # Monotonic reads: use sticky replica for this session
            replica = self.get_session_replica(session, user_id)
            if replica and replica.lag_seconds < self.MAX_LAG_SECONDS:
                return replica.connection()

        # Bounded staleness: only use replicas within lag threshold
        available = [r for r in self.replicas
                     if r.lag_seconds < self.MAX_LAG_SECONDS]

        if available:
            return min(available, key=lambda r: r.lag_seconds).connection()

        # All replicas too far behind — fall back to leader
        return self.leader_connection()

    def after_write(self, session, lsn):
        session["last_write_lsn"] = lsn
        session["last_write_at"] = time.time()
        session["last_write_age_seconds"] = 0

The Consistency Hierarchy

These guarantees form a hierarchy — stronger guarantees subsume weaker ones:

Weakest                                                    Strongest
───────────────────────────────────────────────────────────────────→
  Eventual     Consistent    Monotonic    Read-Your-    Lineariz-
  Consistency  Prefix Reads  Reads        Own-Writes    ability
  
  (anything    (causal       (no time     (your writes  (perfect
  eventually   order         reversal     visible to    real-time
  converges)   preserved)    within       you)          ordering)
                             session)

Most NoSQL      Cassandra     Common web    Most web      Spanner,
default         with          applications  applications  CockroachDB
                lightweight                              strong mode
                transactions

Linearizability (the strongest) means the database behaves as if there is only one copy of the data, and every operation appears to take effect instantaneously at some point between its invocation and completion. This is what SERIALIZABLE isolation + synchronous replication provides — and it is expensive.

For most applications, read-your-own-writes + monotonic reads is the sweet spot: strong enough that users never observe confusing behavior, cheap enough that replica reads remain viable.

Practical Checklist

Before deploying a read-replica setup:

✅ Monitored replication lag per replica (alert if > threshold)
✅ Reads for user's own data routed to leader (or LSN-tracked replica)
✅ Session sticky to a single replica (monotonic reads)
✅ Replicas excluded from pool when lag exceeds SLA bound
✅ Application handles replica unavailability (fallback to leader)
✅ Never send writes to a replica (silent no-op or error depending on config)
✅ Tested behavior under simulated lag (inject artificial delay in dev)

🧭 What's Next

Post 14: Failover, Consensus & High Availability — when the leader fails, something must elect a new one; learn how Raft and Paxos solve leader election and why automatic failover is harder than it sounds

The Bug That Only Appears in Production

What Causes Replication Lag

The Four Consistency Guarantees

Guarantee 1: Read-Your-Own-Writes (Read-Your-Own-Writes Consistency)

Guarantee 2: Monotonic Reads

Guarantee 3: Consistent Prefix Reads

Guarantee 4: Bounded Staleness (Monotonic Writes with Lag Bound)

Combining Guarantees in Practice

The Consistency Hierarchy

Practical Checklist

🧭 What's Next

Related

Database Replication: Single-Leader, Multi-Leader and Leaderless Explained

WAL, Crash Recovery and Durability: How Databases Survive Power Failures

B-Tree vs LSM-Tree: Why InnoDB and RocksDB Make Different Trade-offs

Leave a comment

Comments