Distributed Transactions: Two-Phase Commit, Saga and When to Use Each
The Problem: ACID Across Multiple Nodes
Within a single database, ACID transactions are straightforward — the database engine handles atomicity, isolation, and durability internally. When a business operation spans multiple databases or services, ACID guarantees do not extend across the boundary.
Single-database transaction: trivial
─────────────────────────────────────
BEGIN;
UPDATE inventory SET stock = stock - 1 WHERE product_id = 7;
INSERT INTO orders (customer_id, product_id, amount) VALUES (42, 7, 99.99);
UPDATE accounts SET balance = balance - 99.99 WHERE customer_id = 42;
COMMIT;
-- All three succeed or all three roll back. Atomic. ✅
Multi-database operation: hard
─────────────────────────────────────
Step 1: UPDATE inventory.stock (Inventory DB)
Step 2: INSERT INTO orders (Orders DB)
Step 3: UPDATE accounts.balance (Payments DB)
If Step 2 succeeds and Step 3 fails:
- Inventory is decremented ✅
- Order is created ✅
- Payment is not taken ❌
No single COMMIT can span three separate databases.
You need a protocol.Two protocols dominate: Two-Phase Commit (2PC) for synchronous safety, and Saga for asynchronous availability.
Two-Phase Commit (2PC)
2PC is a distributed coordination protocol that achieves atomic commitment across multiple participants. Either all participants commit or all roll back — guaranteed.
The Protocol
A coordinator (transaction manager) orchestrates multiple participants (database nodes or services).
Phase 1 — Prepare (Voting):
Coordinator → all participants: "Prepare to commit transaction T"
Each participant:
1. Acquires all locks needed for the transaction
2. Writes all changes to a temporary log (not yet committed)
3. Ensures it CAN commit (has resources, constraints satisfied)
4. Responds to coordinator:
- VOTE YES: "I am prepared, I will commit if you tell me to"
- VOTE NO: "I cannot commit, please abort"
Timeline:
Coordinator ──Prepare──→ Participant A → writes to WAL, acquires locks → YES
Coordinator ──Prepare──→ Participant B → writes to WAL, acquires locks → YES
Coordinator ──Prepare──→ Participant C → constraint violation found → NOPhase 2 — Commit or Abort:
If ALL participants voted YES:
Coordinator writes COMMIT decision to its own durable log
Coordinator ──COMMIT──→ all participants
Each participant: applies changes, releases locks, acknowledges
Transaction is committed ✅
If ANY participant voted NO:
Coordinator writes ABORT decision to its own durable log
Coordinator ──ABORT──→ all participants
Each participant: discards prepared changes, releases locks
Transaction is rolled back ❌
Full timeline:
T=1 Coordinator: BEGIN distributed transaction
T=2 Coordinator → A, B, C: PREPARE
T=3 A, B, C: acquire locks, write to WAL, send VOTE YES
T=4 Coordinator: all YES → write COMMIT to durable log
T=5 Coordinator → A, B, C: COMMIT
T=6 A, B, C: apply changes, release locks, send ACK
T=7 Coordinator: transaction completeThe Blocking Problem
2PC has a fundamental flaw: if the coordinator crashes after Phase 1 but before Phase 2, participants are stuck holding locks indefinitely.
Failure scenario:
T=1 All participants send VOTE YES
T=2 Coordinator writes COMMIT to its own log
T=3 Coordinator crashes ← right here
T=4 Participants are in PREPARED state, holding all locks
T=5 Participants cannot proceed (don't know if COMMIT or ABORT)
T=6 Participants cannot release locks (might need to commit)
Result: System is BLOCKED until coordinator recovers
- All locks held by this transaction are unavailable
- Any other transaction touching the same rows is blocked
- Recovery requires coordinator to restart and re-send Phase 2
- If coordinator's disk fails: manual intervention requiredThis is why 2PC is called a blocking protocol — it can block indefinitely if the coordinator fails. The alternative, Three-Phase Commit (3PC), adds a pre-commit phase to eliminate most blocking scenarios but is rarely used in practice due to its complexity and susceptibility to network partitions.
When 2PC Makes Sense
Despite the blocking problem, 2PC is appropriate when:
Strong atomicity is non-negotiable. Financial systems, payment processing, inventory management — where a partial failure means data corruption.
Participants are few and trusted. 2PC latency grows linearly with the number of participants and network round-trips. 2–5 participants across a fast internal network is reasonable; 20 participants across a wide-area network is not.
Participants are long-lived services, not ephemeral microservices. The coordinator must be able to reach every participant during recovery.
2PC in practice:
-- PostgreSQL distributed transaction via PREPARE TRANSACTION
-- (PostgreSQL's implementation of the 2PC prepare phase)
-- On Coordinator (application code):
BEGIN;
-- Execute operations on multiple databases via dblink or foreign data wrappers
UPDATE inventory SET stock = stock - 1 WHERE product_id = 7;
-- Phase 1: Prepare
PREPARE TRANSACTION 'txn-order-12345';
-- Transaction is now in prepared state — survives restart
-- Check prepared transactions (should be empty in normal operation)
SELECT * FROM pg_prepared_xacts;
-- Phase 2: Commit (once all participants prepared)
COMMIT PREPARED 'txn-order-12345';
-- or if abort needed:
ROLLBACK PREPARED 'txn-order-12345';2PC in distributed databases:
CockroachDB, Google Spanner, and YugabyteDB implement 2PC internally and transparently — you write normal SQL, the database handles distributed coordination. This is what "NewSQL" means: SQL semantics across distributed nodes without manual 2PC management.
The Saga Pattern
Saga decomposes a distributed transaction into a sequence of local transactions, each on a single service or database. If a step fails, compensating transactions undo the work done by preceding steps.
Saga trades atomicity for availability. Instead of blocking on coordinator health, each step completes independently. The system remains available even if individual services are slow or temporarily down.
Choreography-Based Saga
Each service listens for events and publishes events when done. No central coordinator.
E-commerce order saga (choreography):
OrderService InventoryService PaymentService NotificationService
│ │ │ │
Create order │ │ │
Publish: OrderCreated │ │ │
│ ─────────────────────→│ │ │
│ Reserve stock │ │
│ Publish: StockReserved │ │
│ │ ──────────────────→│ │
│ │ Charge payment │
│ │ Publish: PaymentCharged │
│ │ │ ─────────────────────→│
│ │ │ Send confirmation
│ │ │ email
If PaymentService fails:
Publish: PaymentFailed
InventoryService listens → Release stock (compensating transaction)
OrderService listens → Cancel order (compensating transaction)Advantages:
Loose coupling — services do not call each other directly
Services can be developed and deployed independently
No coordinator → no blocking
Disadvantages:
Difficult to track overall transaction state (what step are we on?)
Hard to debug — transaction logic is spread across multiple services
Cyclic dependencies can emerge as event handlers grow complex
Orchestration-Based Saga
A central saga orchestrator explicitly calls each service in sequence and manages compensation on failure.
OrderSaga Orchestrator:
Step 1: Call InventoryService.reserveStock(productId=7, qty=1)
├─ Success → proceed to Step 2
└─ Failure → (nothing to compensate yet) → fail saga
Step 2: Call PaymentService.charge(customerId=42, amount=99.99)
├─ Success → proceed to Step 3
└─ Failure → compensate: InventoryService.releaseStock(productId=7, qty=1)
→ fail saga
Step 3: Call OrderService.confirmOrder(orderId=12345)
├─ Success → saga complete ✅
└─ Failure → compensate: PaymentService.refund(customerId=42, amount=99.99)
InventoryService.releaseStock(productId=7, qty=1)
→ fail saga
Compensation order is always reverse of execution order:
Normal: Step 1 → Step 2 → Step 3
Compensate: Step 3's compensation → Step 2's compensation → Step 1's compensationclass OrderSaga:
def execute(self, customer_id: int, product_id: int, amount: float):
completed_steps = []
try:
# Step 1: Reserve inventory
reservation = inventory_service.reserve_stock(product_id, qty=1)
completed_steps.append(('inventory', reservation.id))
# Step 2: Charge payment
payment = payment_service.charge(customer_id, amount)
completed_steps.append(('payment', payment.id))
# Step 3: Confirm order
order = order_service.confirm_order(customer_id, product_id, amount)
completed_steps.append(('order', order.id))
return order
except Exception as e:
# Compensate in reverse order
self._compensate(completed_steps)
raise SagaFailedError(f"Saga failed at step: {e}") from e
def _compensate(self, completed_steps: list):
for service, resource_id in reversed(completed_steps):
try:
if service == 'inventory':
inventory_service.release_stock(resource_id)
elif service == 'payment':
payment_service.refund(resource_id)
elif service == 'order':
order_service.cancel_order(resource_id)
except Exception as comp_error:
# Compensation itself failed — log for manual intervention
logger.critical(f"Compensation failed for {service}/{resource_id}: {comp_error}")
alert_oncall(f"Manual intervention required: {service}/{resource_id}")Advantages:
Central visibility into saga state — easy to monitor and debug
Explicit control flow — easy to reason about what happens on failure
Easier to add new steps or modify existing ones
Disadvantages:
Orchestrator is a central component — must be highly available
Risk of coupling services to the orchestrator's knowledge of their APIs
Idempotency: The Critical Requirement
Both 2PC and Saga require that each operation be idempotent — calling it multiple times produces the same result as calling it once.
Why? Network failures may cause duplicate calls. A timeout does not mean the operation failed — it means you do not know whether it succeeded. Retrying without idempotency causes double-charges, double-shipments, and duplicate records.
# Non-idempotent: charging twice means two charges
def charge_payment(customer_id: int, amount: float):
payment = Payment.create(customer_id=customer_id, amount=amount)
stripe.charge(payment.id, amount)
return payment
# Idempotent: same idempotency_key = no duplicate charge
def charge_payment(customer_id: int, amount: float, idempotency_key: str):
# Check if we already processed this exact request
existing = Payment.find_by_idempotency_key(idempotency_key)
if existing:
return existing # return the already-created payment, no new charge
payment = Payment.create(
customer_id=customer_id,
amount=amount,
idempotency_key=idempotency_key
)
stripe.charge(payment.id, amount, idempotency_key=idempotency_key)
return payment
# Usage: idempotency_key = stable ID derived from the saga's context
charge_payment(42, 99.99, idempotency_key=f"order-{order_id}-payment")Compensating transactions must also be idempotent. If the refund call times out and is retried, the second refund call must detect the first refund already happened and return without issuing a second refund.
2PC vs Saga: The Decision
2PC vs Saga:
2PC Saga
─────────────────────────────────────────────────────────────────
Atomicity True atomicity Eventual consistency
All or nothing at commit Compensations may fail
Blocking Yes — coordinator failure No — each step independent
can block indefinitely
Latency Higher — requires 2 Lower — each step async
synchronous round-trips
Availability Lower — blocked by Higher — no coordinator
coordinator health dependency per step
Failure handling Automatic rollback Requires compensating
(coordinator manages) transactions (you manage)
Cross-service Tight coupling required Loose coupling via events
coupling (coordinator knows all) or explicit calls
Debugging Simpler — one atomic unit Complex — distributed state
Best for Few trusted services, Many microservices,
financial atomicity, long-running workflows,
same infrastructure cross-org boundariesThe practical rule:
Same infrastructure (same company's databases): 2PC via a NewSQL database (CockroachDB, Spanner) that handles it transparently, or an XA transaction manager.
Across microservices or organizations: Saga — 2PC requires tight coupling and direct database access that breaks service encapsulation.
Long-running workflows (minutes to days): Saga always — no coordinator should hold locks for hours.
Chapter 5 Complete
You have finished the Scaling & Distributed Systems chapter:
✅ Post 15: Vertical vs Horizontal Scaling
✅ Post 16: Database Partitioning & Sharding
✅ Post 17: CAP Theorem & PACELC
✅ Post 18: Distributed Transactions: 2PC & Saga
Chapter 6 moves from relational databases to NoSQL — the document, key-value, and wide-column stores that trade SQL semantics for scale, flexibility, and write throughput.
🧭 What's Next — Chapter 6: NoSQL Databases
Post 19: NoSQL Explained — document, key-value, wide-column, and when NOT to use SQL; understanding which NoSQL family fits which problem is the difference between a fast system and a slow one
Related
CAP Theorem and PACELC: What They Mean for Real Database Decisions
CAP theorem is misunderstood constantly. PACELC is more useful but almost unknown. This post explains both with concrete database examples and the decisions they actually inform.
Database Partitioning and Sharding: Range, Hash and Consistent Hashing
Sharding splits data across multiple nodes and creates a new class of problems. This post covers range, hash, and consistent hashing with trade-offs that matter in production.
Failover, Consensus and High Availability: Raft, Paxos and Automatic Failover
Automatic failover sounds simple — consensus makes it hard. This post explains leader election, Raft vs Paxos trade-offs, and what high availability means in distributed databases.
Comments