Topic Overview

Distributed Transactions

Learn how to maintain ACID properties across multiple nodes in distributed systems.

Distributed transactions ensure ACID properties (Atomicity, Consistency, Isolation, Durability) across multiple nodes, which is challenging in distributed systems.


The Challenge

ACID in distributed systems:

  • Atomicity: All nodes commit or all abort
  • Consistency: System remains in valid state
  • Isolation: Concurrent transactions don't interfere
  • Durability: Committed changes persist

Problem: Network partitions, node failures, and latency make this difficult.


Two-Phase Commit (2PC)

Coordinator orchestrates commit across participants.

Phase 1: Prepare

  1. Coordinator sends "prepare" to all participants
  2. Participants vote "yes" (ready) or "no" (abort)
  3. Participants write to log (prepare record)

Phase 2: Commit/Abort

  1. If all vote "yes": Coordinator sends "commit"
  2. If any vote "no": Coordinator sends "abort"
  3. Participants commit/abort and acknowledge
class TwoPhaseCommit {
  async execute(participants: Node[]): Promise<boolean> {
    // Phase 1: Prepare
    const votes = await Promise.all(
      participants.map(p => p.prepare())
    );

    if (votes.every(v => v === 'yes')) {
      // Phase 2: Commit
      await Promise.all(participants.map(p => p.commit()));
      return true;
    } else {
      // Phase 2: Abort
      await Promise.all(participants.map(p => p.abort()));
      return false;
    }
  }
}

Problems with 2PC

  • Blocking: If coordinator fails, participants block
  • Single point of failure: Coordinator is critical
  • Not partition-tolerant: Requires all nodes to be reachable

Three-Phase Commit (3PC)

Adds "pre-commit" phase to reduce blocking.

Phases

  1. CanCommit: Coordinator asks if participants can commit
  2. PreCommit: If all yes, coordinator sends pre-commit (participants ready but not committed)
  3. DoCommit: Coordinator sends commit, participants commit

Benefit: If coordinator fails in phase 2, participants can safely commit (they're in pre-commit state).


Saga Pattern

Alternative to distributed transactions using compensating transactions.

Choreography

Each service knows what to do next and how to compensate.

class SagaChoreography {
  async executeOrder(order: Order): Promise<void> {
    try {
      await this.reserveInventory(order);
      await this.chargePayment(order);
      await this.shipOrder(order);
    } catch (error) {
      // Compensate in reverse order
      await this.cancelShipment(order);
      await this.refundPayment(order);
      await this.releaseInventory(order);
    }
  }
}

Orchestration

Orchestrator coordinates the saga.

class SagaOrchestrator {
  async executeOrder(order: Order): Promise<void> {
    const steps = [
      { action: () => this.reserveInventory(order), compensate: () => this.releaseInventory(order) },
      { action: () => this.chargePayment(order), compensate: () => this.refundPayment(order) },
      { action: () => this.shipOrder(order), compensate: () => this.cancelShipment(order) }
    ];

    const completed: Step[] = [];

    try {
      for (const step of steps) {
        await step.action();
        completed.push(step);
      }
    } catch (error) {
      // Compensate in reverse
      for (const step of completed.reverse()) {
        await step.compensate();
      }
      throw error;
    }
  }
}

Examples

E-commerce Order Processing

class OrderSaga {
  async processOrder(order: Order): Promise<void> {
    // Step 1: Reserve inventory
    await this.inventoryService.reserve(order.items);
    
    // Step 2: Charge payment
    await this.paymentService.charge(order.payment);
    
    // Step 3: Create shipment
    await this.shippingService.createShipment(order);
    
    // If any step fails, compensate previous steps
  }

  async compensate(order: Order, failedStep: number): Promise<void> {
    if (failedStep >= 3) await this.shippingService.cancel(order.shipmentId);
    if (failedStep >= 2) await this.paymentService.refund(order.paymentId);
    if (failedStep >= 1) await this.inventoryService.release(order.items);
  }
}

Common Pitfalls

  • Using 2PC for everything: Too blocking, use Saga for long-running transactions
  • Not handling coordinator failure: Participants block forever. Fix: Use 3PC or timeouts
  • Saga compensation not idempotent: Retries can cause issues. Fix: Make compensations idempotent
  • Not considering network partitions: 2PC requires all nodes reachable. Fix: Use eventual consistency patterns
  • Ignoring latency: 2PC has high latency (multiple round trips). Fix: Use async patterns where possible
  • Not logging state: Can't recover from failures. Fix: Log all state transitions

Interview Questions

Beginner

Q: What is a distributed transaction and why is it challenging?

A: A distributed transaction spans multiple nodes/services and must maintain ACID properties across all of them.

Challenges:

  • Network failures: Messages can be lost, nodes unreachable
  • Node failures: Nodes can crash at any time
  • Latency: Multiple round trips increase latency
  • Partitions: Network partitions can split the system
  • Consistency: Hard to ensure all nodes agree on commit/abort

Example: E-commerce order processing - must reserve inventory, charge payment, and create shipment atomically across different services.


Intermediate

Q: Compare Two-Phase Commit (2PC) and Saga pattern. When would you use each?

A:

Two-Phase Commit (2PC):

  • ACID transactions: Strong consistency, all-or-nothing
  • Blocking: Participants block if coordinator fails
  • Synchronous: All nodes must respond
  • Use when: Need strong consistency, short transactions, all nodes must agree

Saga Pattern:

  • Eventual consistency: Each step commits independently
  • Non-blocking: Services can continue even if one fails
  • Compensating transactions: Rollback via compensation
  • Use when: Long-running transactions, services can operate independently, eventual consistency acceptable

Comparison:

  • Consistency: 2PC (strong) vs Saga (eventual)
  • Latency: 2PC (higher, multiple rounds) vs Saga (lower, sequential)
  • Failure handling: 2PC (blocks) vs Saga (continues)
  • Complexity: 2PC (simpler) vs Saga (more complex compensation logic)

Recommendation: Use 2PC for short, critical transactions. Use Saga for long-running, multi-step processes.


Senior

Q: Design a distributed transaction system for a microservices e-commerce platform. Orders involve inventory, payment, and shipping services. How do you ensure consistency, handle failures, and maintain performance?

A:

Architecture Decision:

  • Use Saga pattern (not 2PC) because:
    • Long-running process (inventory → payment → shipping)
    • Services can operate independently
    • Need high availability (can't block on coordinator)

Design:

class OrderSagaOrchestrator {
  private steps: SagaStep[] = [];
  private state: SagaState = 'pending';

  async executeOrder(order: Order): Promise<void> {
    this.state = 'executing';
    
    const steps = [
      {
        name: 'reserve-inventory',
        execute: () => this.inventoryService.reserve(order.items),
        compensate: () => this.inventoryService.release(order.items),
        timeout: 5000
      },
      {
        name: 'charge-payment',
        execute: () => this.paymentService.charge(order.payment),
        compensate: () => this.paymentService.refund(order.paymentId),
        timeout: 10000
      },
      {
        name: 'create-shipment',
        execute: () => this.shippingService.createShipment(order),
        compensate: () => this.shippingService.cancel(order.shipmentId),
        timeout: 5000
      }
    ];

    const completed: SagaStep[] = [];

    try {
      for (const step of steps) {
        // Execute with timeout and retry
        await this.executeWithRetry(step);
        completed.push(step);
        this.logStep(order.id, step.name, 'completed');
      }

      this.state = 'completed';
      await this.notifyCompletion(order);
    } catch (error) {
      this.state = 'compensating';
      await this.compensate(order, completed);
      throw error;
    }
  }

  async executeWithRetry(step: SagaStep): Promise<void> {
    const maxRetries = 3;
    let lastError: Error;

    for (let i = 0; i < maxRetries; i++) {
      try {
        await Promise.race([
          step.execute(),
          this.timeout(step.timeout)
        ]);
        return;
      } catch (error) {
        lastError = error;
        if (i < maxRetries - 1) {
          await this.sleep(1000 * (i + 1)); // Exponential backoff
        }
      }
    }

    throw lastError!;
  }

  async compensate(order: Order, completed: SagaStep[]): Promise<void> {
    // Compensate in reverse order
    for (const step of completed.reverse()) {
      try {
        await step.compensate();
        this.logStep(order.id, step.name, 'compensated');
      } catch (error) {
        // Log compensation failure, may need manual intervention
        this.logCompensationFailure(order.id, step.name, error);
        // Continue compensating other steps
      }
    }

    this.state = 'compensated';
  }

  // Idempotency: Check if step already executed
  async executeStepSafely(step: SagaStep, orderId: string): Promise<void> {
    const state = await this.getStepState(orderId, step.name);
    
    if (state === 'completed') {
      return; // Already done
    }

    if (state === 'compensated') {
      throw new Error('Step was compensated, cannot re-execute');
    }

    await step.execute();
    await this.saveStepState(orderId, step.name, 'completed');
  }
}

Failure Handling:

  1. Service unavailable: Retry with exponential backoff, timeout after max retries
  2. Partial failure: Compensate completed steps
  3. Orchestrator failure: Store state, resume on restart
  4. Network partition: Services continue independently, resolve conflicts when partition heals

Consistency:

  • Eventual consistency: Each service commits independently
  • Compensation: Rollback via compensating transactions
  • Idempotency: All operations must be idempotent (safe to retry)

Performance:

  • Async execution: Don't block on each step
  • Parallel steps: Execute independent steps in parallel
  • Caching: Cache service responses
  • Batching: Batch multiple orders if possible

Monitoring:

  • Track saga execution time
  • Monitor compensation rate
  • Alert on compensation failures
  • Track step success/failure rates

Key Takeaways

  • Distributed transactions are hard: Network failures, partitions, and latency complicate ACID
  • 2PC provides strong consistency but blocks on coordinator failure
  • 3PC reduces blocking but still requires all nodes reachable
  • Saga pattern uses compensating transactions for eventual consistency
  • Choose based on requirements: Strong consistency (2PC) vs Availability (Saga)
  • Idempotency is critical: All operations must be safe to retry
  • Log state transitions: Essential for recovery from failures
  • Compensation logic: Must handle partial failures gracefully

About the author

InterviewCrafted helps you master system design with patience. We believe in curiosity-led engineering, reflective writing, and designing systems that make future changes feel calm.