Topic Overview

Distributed Transactions: Concepts, Trade-offs & Failure Modes

Learn how to maintain ACID properties across multiple nodes in distributed systems.

Senior · 12 min read

Distributed transactions aim to preserve ACID properties (Atomicity, Consistency, Isolation, Durability) when one logical operation spans multiple nodes. This is hard because nodes and network links can fail independently.


The Challenge

ACID in distributed systems:

  • Atomicity: All nodes commit or all abort
  • Consistency: System remains in valid state
  • Isolation: Concurrent transactions don't interfere
  • Durability: Committed changes persist

Problem: Network partitions, node failures, and latency make this difficult.


Two-Phase Commit (2PC)

Coordinator orchestrates commit across participants.

Phase 1: Prepare

  1. Coordinator sends "prepare" to all participants
  2. Participants that can commit write a prepare record to their durable log
  3. Participants vote "yes" (ready) or "no" (abort)

Phase 2: Commit/Abort

  1. If all vote "yes": Coordinator sends "commit"
  2. If any vote "no": Coordinator sends "abort"
  3. Participants commit/abort and acknowledge
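
The two phases above can be sketched as a single coordinator function. This is a minimal illustration, not a production protocol: the `Participant` interface and its method names are assumptions made for this example, and the durable logging that a real coordinator needs is only marked with a comment.

```typescript
// Hypothetical participant interface; the names are illustrative only.
type Vote = "yes" | "no";
type Decision = "commit" | "abort";

interface Participant {
  prepare(txId: string): Promise<Vote>; // should durably log before voting "yes"
  commit(txId: string): Promise<void>;
  abort(txId: string): Promise<void>;
}

async function twoPhaseCommit(txId: string, participants: Participant[]): Promise<Decision> {
  // Phase 1: collect votes; a rejected promise counts as a "no" vote.
  const votes = await Promise.all(
    participants.map((p) => p.prepare(txId).catch((): Vote => "no")),
  );
  const decision: Decision = votes.every((v) => v === "yes") ? "commit" : "abort";

  // A real coordinator would durably log `decision` here, before phase 2.

  // Phase 2: broadcast the decision and wait for acknowledgements.
  await Promise.all(
    participants.map((p) => (decision === "commit" ? p.commit(txId) : p.abort(txId))),
  );
  return decision;
}
```

A single "no" vote (or an unreachable participant) is enough to abort the whole transaction, which is what gives 2PC its all-or-nothing behavior.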

Problems with 2PC

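  • Blocking: If the coordinator fails after participants vote "yes", they stay in doubt and hold their locks until it recovers
  • Single point of failure: Progress depends on the coordinator being available
  • Not partition-tolerant: A commit requires every participant to be reachable; one unreachable node stalls the transaction or forces an abort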
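
To see where the blocking comes from, consider what a participant can decide on its own after a restart, based only on its local log. The sketch below assumes a simplified log with three record kinds; the type and function names are made up for this example.

```typescript
// Simplified local log records for one transaction (illustrative names).
type LogRecord = "prepared" | "committed" | "aborted";
type Recovery = "commit" | "abort" | "in-doubt";

function recoverParticipant(log: LogRecord[]): Recovery {
  if (log.some((r) => r === "committed")) return "commit";
  if (log.some((r) => r === "aborted")) return "abort";
  // Voted "yes" but never learned the outcome: the participant cannot decide
  // alone and must wait for the coordinator. This is the blocking case.
  if (log.some((r) => r === "prepared")) return "in-doubt";
  // Never voted "yes", so aborting unilaterally is safe.
  return "abort";
}
```

Only the "in-doubt" case blocks: the participant promised to commit if asked, so it can neither commit nor abort until it hears the decision.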

Three-Phase Commit (3PC)

Adds "pre-commit" phase to reduce blocking.

Phases

  1. CanCommit: Coordinator asks if participants can commit
  2. PreCommit: If all yes, coordinator sends pre-commit (participants ready but not committed)
  3. DoCommit: Coordinator sends commit, participants commit

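Benefit: If the coordinator fails after sending pre-commit, participants in the pre-commit state can time out and commit safely, because reaching pre-commit means every participant voted yes.

Caveat: This assumes bounded message delays and no network partition. If a partition separates participants, some may commit while others abort.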
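
The timeout behavior that makes 3PC less blocking can be written as a small decision function. This is a sketch of the textbook rules under the assumption of reliable failure detection; the state names are chosen for this example.

```typescript
// Participant states in a simplified 3PC model (illustrative names).
type ParticipantState = "initial" | "ready" | "preCommitted" | "committed" | "aborted";

// What a participant does on its own when the coordinator goes silent.
function onCoordinatorTimeout(state: ParticipantState): ParticipantState {
  switch (state) {
    case "initial": // never voted: aborting is safe
    case "ready": // voted yes in CanCommit but saw no PreCommit: abort
      return "aborted";
    case "preCommitted": // PreCommit implies every participant voted yes: commit
      return "committed";
    default:
      return state; // already decided; nothing to do
  }
}
```

Compare this with 2PC, where a participant that has voted "yes" has no safe default and must wait.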


Saga Pattern

Alternative to distributed transactions using compensating transactions.

Choreography

There is no central coordinator: each service reacts to the outcome of the previous step, and knows what to do next and how to compensate.

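class SagaChoreography {
  async executeOrder(order: Order): Promise<void> {
    // Undo actions for the steps that have completed so far
    const undo: Array<() => Promise<void>> = [];
    try {
      await this.reserveInventory(order);
      undo.push(() => this.releaseInventory(order));
      await this.chargePayment(order);
      undo.push(() => this.refundPayment(order));
      await this.shipOrder(order);
    } catch (error) {
      // Compensate only the completed steps, in reverse order
      for (const compensate of undo.reverse()) await compensate();
      throw error;
    }
  }
}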
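
The listing above keeps every step in one class for brevity. In a choreographed saga the steps usually live in separate services that react to each other's events. The following sketch illustrates that shape with an in-memory event bus standing in for a message broker; the event names and the `wireServices` helper are assumptions for this example, not part of any real library.

```typescript
// Minimal in-memory event bus standing in for a real message broker (assumption).
type OrderEvent = { type: string; orderId: string };
type Handler = (event: OrderEvent) => void;

class EventBus {
  private handlers: { [type: string]: Handler[] } = {};
  readonly history: string[] = [];

  on(type: string, handler: Handler): void {
    (this.handlers[type] = this.handlers[type] || []).push(handler);
  }

  emit(event: OrderEvent): void {
    this.history.push(event.type);
    for (const handler of this.handlers[event.type] || []) handler(event);
  }
}

// Each service subscribes to the events it cares about; nobody coordinates.
function wireServices(bus: EventBus, paymentSucceeds: boolean): void {
  // Inventory service: reserve on order creation, release if payment fails.
  bus.on("OrderCreated", (e) => bus.emit({ type: "InventoryReserved", orderId: e.orderId }));
  bus.on("PaymentFailed", (e) => bus.emit({ type: "InventoryReleased", orderId: e.orderId }));

  // Payment service: charge once inventory is reserved.
  bus.on("InventoryReserved", (e) =>
    bus.emit({ type: paymentSucceeds ? "PaymentCharged" : "PaymentFailed", orderId: e.orderId }),
  );

  // Shipping service: ship once payment is charged.
  bus.on("PaymentCharged", (e) => bus.emit({ type: "OrderShipped", orderId: e.orderId }));
}
```

Emitting `{ type: "OrderCreated", orderId: "o1" }` after `wireServices(bus, false)` leaves a history ending in `PaymentFailed` and `InventoryReleased`: the inventory service compensates on its own, without being told to by a coordinator.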

Orchestration

Orchestrator coordinates the saga.

class SagaOrchestrator {
  async executeOrder(order: Order): Promise<void> {
    const steps = [
      { action: () => this.reserveInventory(order), compensate: () => this.releaseInventory(order) },
      { action: () => this.chargePayment(order), compensate: () => this.refundPayment(order) },
      // ...
    ];
    // ... run each action in order; if one fails, run compensate for the completed steps in reverse
  }
}
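
The core of an orchestrator is a loop over action/compensate pairs. The sketch below shows one way to write it as a standalone function; `SagaStep` and `runSaga` are names chosen for this example, and it assumes compensations themselves do not fail.

```typescript
interface SagaStep {
  name: string;
  action: () => Promise<void>;
  compensate: () => Promise<void>;
}

// Runs steps in order. If one fails, compensates the completed steps in
// reverse order and rethrows the original error.
async function runSaga(steps: SagaStep[]): Promise<void> {
  const completed: SagaStep[] = [];
  try {
    for (const step of steps) {
      await step.action();
      completed.push(step);
    }
  } catch (error) {
    for (const step of completed.reverse()) {
      await step.compensate();
    }
    throw error;
  }
}
```

The failed step itself is not compensated, only the steps that finished before it. A real orchestrator would also need a policy for compensations that fail, such as retrying or alerting.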

Examples

E-commerce Order Processing

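class OrderSaga {
  async processOrder(order: Order): Promise<void> {
    let completed = 0; // number of steps that finished successfully
    try {
      // Step 1: Reserve inventory
      await this.inventoryService.reserve(order.items);
      completed = 1;

      // Step 2: Charge payment
      await this.paymentService.charge(order.payment);
      completed = 2;

      // Step 3: Create shipment
      await this.shippingService.createShipment(order);
    } catch (error) {
      // If any step fails, compensate previous steps
      await this.compensate(order, completed);
      throw error;
    }
  }

  // Assumption: failedStep is the number of steps completed before the failure;
  // release() and refund() are hypothetical method names.
  async compensate(order: Order, failedStep: number): Promise<void> {
    if (failedStep >= 2) await this.paymentService.refund(order.payment);
    if (failedStep >= 1) await this.inventoryService.release(order.items);
  }
}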

Common Pitfalls

  • Using 2PC for everything: Participants hold locks until the coordinator decides. Fix: Use Saga for long-running transactions
  • Not handling coordinator failure: Participants block forever. Fix: Use timeouts with a recovery protocol, or 3PC
  • Saga compensation not idempotent: Retries can cause issues. Fix: Make compensations idempotent
  • Not considering network partitions: 2PC requires all nodes reachable. Fix: Use eventual consistency patterns
  • Ignoring latency: 2PC has high latency (multiple round trips). Fix: Use async patterns where possible
  • Not logging state: Can't recover from failures. Fix: Log all state transitions
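
One common way to make a compensation idempotent is to attach an idempotency key to each request and ignore keys that were already applied. The sketch below uses an in-memory set where a real service would use a durable store; `RefundService` and its fields are hypothetical names for this example.

```typescript
// In-memory stand-in for a durable record of processed keys (assumption).
class RefundService {
  private processed = new Set<string>();
  refundedTotal = 0;

  // Safe to call repeatedly: a key that was already applied is a no-op.
  // Returns true if the refund was applied, false if it was a duplicate.
  refund(idempotencyKey: string, amount: number): boolean {
    if (this.processed.has(idempotencyKey)) return false;
    this.processed.add(idempotencyKey);
    this.refundedTotal += amount;
    return true;
  }
}
```

If the saga retries `refund("order-42-refund", 50)` after a timeout, the customer is still refunded only once.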

Interview Questions

Beginner

Q: What is a distributed transaction and why is it challenging?

A: A distributed transaction spans multiple nodes/services and must maintain ACID properties across all of them.

Challenges:

  • Network failures: Messages can be lost, nodes unreachable
  • Node failures: Nodes can crash at any time
  • Latency: Multiple round trips increase latency
  • Partitions: Network partitions can split the system
  • Consistency: Hard to ensure all nodes agree on commit/abort

Example: E-commerce order processing - must reserve inventory, charge payment, and create shipment atomically across different services.


Intermediate

Q: Compare Two-Phase Commit (2PC) and Saga pattern. When would you use each?

A:

Two-Phase Commit (2PC):

  • ACID transactions: Strong consistency, all-or-nothing
  • Blocking: Participants block if coordinator fails
  • Synchronous: All nodes must respond
  • Use when: Need strong consistency, short transactions, all nodes must agree

Saga Pattern:

  • Eventual consistency: Each step commits independently, so other transactions can see intermediate states (no isolation)
  • Non-blocking: Services can continue even if one fails
  • Compensating transactions: Rollback via compensation
  • Use when: Long-running transactions, services can operate independently, eventual consistency acceptable

Comparison:

  • Consistency: 2PC (strong) vs Saga (eventual)
  • Latency: 2PC (higher; locks held across multiple round trips) vs Saga (no global locks, but steps run sequentially)
  • Failure handling: 2PC (blocks) vs Saga (continues)
  • Complexity: 2PC (simpler) vs Saga (more complex compensation logic)

Recommendation: Use 2PC for short, critical transactions. Use Saga for long-running, multi-step processes.


Senior

Q: Design a distributed transaction system for a microservices e-commerce platform. Orders involve inventory, payment, and shipping services. How do you ensure consistency, handle failures, and maintain performance?

A:

Architecture Decision:

  • Use Saga pattern (not 2PC) because:
    • Long-running process (inventory → payment → shipping)
    • Services can operate independently
    • Need high availability (can't block on coordinator)

Design:

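class OrderSagaOrchestrator {
  private steps: SagaStep[] = [];
  private state: SagaState = 'pending';

  async executeOrder(order: Order): Promise<void> {
    this.state = 'executing';

    this.steps = [
      {
        name: 'reserve-inventory',
        execute: () => this.inventoryService.reserve(order.items),
        // Assumption: the compensating call is garbled in the listing; release() is a hypothetical name
        compensate: () => this.inventoryService.release(order.items),
      },
      // ... payment and shipping steps follow the same shape
    ];
    // ... execute steps in order; on failure, compensate completed steps in reverse
  }
}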
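
For the orchestrator to survive its own crash, it has to persist progress after every step and skip finished steps on restart. The following sketch shows that idea with an in-memory store standing in for a database; `SagaStore`, `InMemorySagaStore`, and `runOrResume` are names invented for this example.

```typescript
type StepStatus = "pending" | "done";

interface SagaRecord {
  sagaId: string;
  steps: { name: string; status: StepStatus }[];
}

// Stand-in for a durable store such as a database table (assumption).
interface SagaStore {
  load(sagaId: string): SagaRecord | undefined;
  save(record: SagaRecord): void;
}

class InMemorySagaStore implements SagaStore {
  private records = new Map<string, string>();
  load(sagaId: string): SagaRecord | undefined {
    const raw = this.records.get(sagaId);
    return raw === undefined ? undefined : (JSON.parse(raw) as SagaRecord);
  }
  save(record: SagaRecord): void {
    this.records.set(record.sagaId, JSON.stringify(record));
  }
}

// Runs or resumes a saga, saving progress after each step so that a restarted
// orchestrator skips the steps that already completed.
async function runOrResume(
  store: SagaStore,
  sagaId: string,
  actions: { name: string; run: () => Promise<void> }[],
): Promise<void> {
  const existing = store.load(sagaId);
  const record: SagaRecord = existing || {
    sagaId,
    steps: actions.map((a) => ({ name: a.name, status: "pending" as StepStatus })),
  };
  for (let i = 0; i < actions.length; i++) {
    if (record.steps[i].status === "done") continue; // finished before the crash
    await actions[i].run();
    record.steps[i].status = "done";
    store.save(record); // persist the transition before moving on
  }
}
```

A crash between `run()` and `save()` means the step runs again on resume, which is one reason every step needs to be idempotent.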

Failure Handling:

  1. Service unavailable: Retry with exponential backoff, timeout after max retries
  2. Partial failure: Compensate completed steps
  3. Orchestrator failure: Store state, resume on restart
  4. Network partition: Services continue independently, resolve conflicts when partition heals
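
The retry policy in item 1 can be sketched as a small helper. The function name, parameter names, and default values here are assumptions for illustration; production code would typically add jitter and a cap on the delay.

```typescript
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Calls `fn`, retrying up to `maxRetries` times after the first attempt and
// doubling the delay before each retry.
async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 100,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt >= maxRetries) throw error; // give up after max retries
      await sleep(baseDelayMs * 2 ** attempt); // 100 ms, 200 ms, 400 ms, ...
    }
  }
}
```

When the helper finally gives up, the saga treats the step as failed and moves on to compensation. Retrying is only safe if the wrapped operation is idempotent.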

Consistency:

  • Eventual consistency: Each service commits independently
  • Compensation: Rollback via compensating transactions
  • Idempotency: All operations must be idempotent (safe to retry)

Performance:

  • Async execution: Don't block on each step
  • Parallel steps: Execute independent steps in parallel
  • Caching: Cache service responses
  • Batching: Batch multiple orders if possible
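
Running independent steps in parallel changes the failure handling: if one fails, the ones that succeeded still need compensating. The sketch below assumes the steps truly do not depend on each other; `ParallelStep` and `runParallel` are names made up for this example.

```typescript
interface ParallelStep {
  name: string;
  action: () => Promise<void>;
  compensate: () => Promise<void>;
}

// Runs independent steps concurrently. If any step fails, compensates the
// steps that succeeded and rethrows the first failure.
async function runParallel(steps: ParallelStep[]): Promise<void> {
  const results = await Promise.all(
    steps.map((step) =>
      step.action().then(
        () => ({ step, ok: true, error: undefined as unknown }),
        (error: unknown) => ({ step, ok: false, error }),
      ),
    ),
  );
  const failed = results.filter((r) => !r.ok);
  if (failed.length === 0) return;
  await Promise.all(results.filter((r) => r.ok).map((r) => r.step.compensate()));
  throw failed[0].error;
}
```

Each action's outcome is captured rather than letting `Promise.all` reject early, so the function knows exactly which steps to undo.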

Monitoring:

  • Track saga execution time
  • Monitor compensation rate
  • Alert on compensation failures
  • Track step success/failure rates

Key Takeaways

  • Distributed transactions are hard: Network failures, partitions, and latency complicate ACID
  • 2PC provides strong consistency but blocks on coordinator failure
  • 3PC reduces blocking but still requires all nodes reachable
  • Saga pattern uses compensating transactions for eventual consistency
  • Choose based on requirements: Strong consistency (2PC) vs Availability (Saga)
  • Idempotency is critical: All operations must be safe to retry
  • Log state transitions: Essential for recovery from failures
  • Compensation logic: Must handle partial failures gracefully


About the author

InterviewCrafted helps you master system design with patience. We believe in curiosity-led engineering, reflective writing, and designing systems that make future changes feel calm.