Distributed Transactions: Concepts, Trade-offs & Failure Modes
Distributed transactions aim to preserve ACID properties (Atomicity, Consistency, Isolation, Durability) across multiple nodes. This is hard because nodes fail independently and communicate over an unreliable network.
The Challenge
ACID in distributed systems:
- Atomicity: All nodes commit or all abort
- Consistency: System remains in valid state
- Isolation: Concurrent transactions don't interfere
- Durability: Committed changes persist
Problem: Network partitions, node failures, and latency make this difficult.
Two-Phase Commit (2PC)
Coordinator orchestrates commit across participants.
Phase 1: Prepare
- Coordinator sends "prepare" to all participants
- Each participant that can commit writes a prepare record to its log, then votes "yes" (ready)
- A participant that cannot commit votes "no" (abort)
Phase 2: Commit/Abort
- If all vote "yes": Coordinator sends "commit"
- If any vote "no": Coordinator sends "abort"
- Participants commit/abort and acknowledge
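The two phases can be sketched as a small in-memory simulation. This is a minimal illustration, not a production protocol: `InMemoryParticipant`, `twoPhaseCommit`, and the in-memory `log` array (standing in for a durable log) are hypothetical names, and the sketch assumes no crashes or lost messages.

```typescript
type Vote = "yes" | "no";
type Decision = "commit" | "abort";

interface Participant {
  prepare(txId: string): Vote;                      // Phase 1
  finish(txId: string, decision: Decision): void;   // Phase 2
}

// Hypothetical participant; `log` stands in for a durable write-ahead log.
class InMemoryParticipant implements Participant {
  readonly log: string[] = [];
  constructor(private readonly canCommit: boolean) {}

  prepare(txId: string): Vote {
    if (!this.canCommit) return "no";
    this.log.push(`prepared ${txId}`); // log the prepare record before voting yes
    return "yes";
  }

  finish(txId: string, decision: Decision): void {
    this.log.push(`${decision} ${txId}`); // apply the decision and acknowledge
  }
}

function twoPhaseCommit(txId: string, participants: Participant[]): Decision {
  // Phase 1: collect a vote from every participant.
  const votes = participants.map((p) => p.prepare(txId));
  // Commit only if the vote is unanimous.
  const decision: Decision = votes.every((v) => v === "yes") ? "commit" : "abort";
  // Phase 2: broadcast the decision to everyone.
  for (const p of participants) p.finish(txId, decision);
  return decision;
}
```

A real coordinator would also write its decision to durable storage before starting phase 2, so that it can finish the broadcast after a restart.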
Problems with 2PC
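- Blocking: If the coordinator fails after participants have voted "yes", they must hold their locks and wait until it recovers
- Single point of failure: Only the coordinator knows the final decision
- Not partition-tolerant: Every participant must be reachable for the transaction to commit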
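The blocking problem is easiest to see from the participant's side. The sketch below shows what a participant could safely do when it stops hearing from the coordinator. The state and action names are hypothetical and chosen only for illustration.

```typescript
type ParticipantState = "initial" | "prepared" | "committed" | "aborted";
type TimeoutAction = "abort-unilaterally" | "block-and-wait" | "none";

// What a 2PC participant may do when the coordinator goes silent.
function onCoordinatorTimeout(state: ParticipantState): TimeoutAction {
  switch (state) {
    case "initial":
      // Has not voted yet, so the coordinator cannot have decided to commit.
      return "abort-unilaterally";
    case "prepared":
      // Voted yes: the outcome is unknown, so locks must be held until
      // the coordinator (or its log) reveals the decision.
      return "block-and-wait";
    case "committed":
    case "aborted":
      // The transaction is already finished locally.
      return "none";
  }
}
```

Only the `prepared` state blocks, which is why a coordinator crash between the two phases is the problematic case.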
Three-Phase Commit (3PC)
Adds "pre-commit" phase to reduce blocking.
Phases
- CanCommit: Coordinator asks if participants can commit
- PreCommit: If all yes, coordinator sends pre-commit (participants ready but not committed)
- DoCommit: Coordinator sends commit, participants commit
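Benefit: If the coordinator fails after the PreCommit phase, participants in the pre-commit state can time out and commit on their own, because receiving pre-commit implies that every participant voted yes. This holds only if there are no network partitions and message delays are bounded.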
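The timeout rule that makes 3PC non-blocking can be written as a small state transition. This is a sketch of the textbook rule under the assumption of no partitions and bounded delays. The state names are hypothetical.

```typescript
type State3PC = "initial" | "ready" | "pre-committed" | "committed" | "aborted";

// Where a 3PC participant moves when it times out waiting for the coordinator.
function onTimeout3PC(state: State3PC): State3PC {
  switch (state) {
    case "initial": // never voted
    case "ready":   // voted yes to CanCommit, but no PreCommit arrived
      return "aborted";
    case "pre-committed":
      // PreCommit is only sent after every participant voted yes,
      // so committing is safe under the stated assumptions.
      return "committed";
    default:
      // Already finished: nothing changes.
      return state;
  }
}
```

Under a partition, one group of participants could be in `ready` and another in `pre-committed`, and this rule would then produce conflicting outcomes. That is why 3PC is still not partition-tolerant.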
Saga Pattern
Alternative to distributed transactions using compensating transactions.
Choreography
Each service knows what to do next and how to compensate.
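    class SagaChoreography {
      async executeOrder(order: Order): Promise<void> {
        try {
          await this.reserveInventory(order);
          await this.chargePayment(order);
          await this.shipOrder(order);
        } catch (error) {
          // Compensate in reverse order. Each compensation must be a safe
          // no-op if its step never ran, because the failure may have
          // happened before that step.
          await this.cancelShipment(order);
          await this.refundPayment(order);
          await this.releaseInventory(order);
          throw error;
        }
      }
    }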
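The listing above sequences every step from a single class. As a hedged illustration of the event-driven form of choreography, where no central component exists and each service only reacts to events, here is a minimal in-memory sketch. The `EventBus`, the event names, and the two service functions are all hypothetical. A real system would use a message broker.

```typescript
type SagaEvent =
  | { type: "OrderPlaced"; orderId: string }
  | { type: "InventoryReserved"; orderId: string }
  | { type: "PaymentCharged"; orderId: string }
  | { type: "PaymentFailed"; orderId: string }
  | { type: "InventoryReleased"; orderId: string };

type Handler = (event: SagaEvent) => void;

// Synchronous in-memory bus; `history` records every published event type.
class EventBus {
  readonly history: string[] = [];
  private handlers: Handler[] = [];

  subscribe(handler: Handler): void {
    this.handlers.push(handler);
  }

  publish(event: SagaEvent): void {
    this.history.push(event.type);
    for (const handler of this.handlers) handler(event);
  }
}

// Inventory reserves stock when an order is placed and releases it
// (the compensation) when payment fails.
function registerInventoryService(bus: EventBus): void {
  bus.subscribe((event) => {
    if (event.type === "OrderPlaced") {
      bus.publish({ type: "InventoryReserved", orderId: event.orderId });
    } else if (event.type === "PaymentFailed") {
      bus.publish({ type: "InventoryReleased", orderId: event.orderId });
    }
  });
}

// Payment reacts to a reservation; `succeed` simulates the charge outcome.
function registerPaymentService(bus: EventBus, succeed: boolean): void {
  bus.subscribe((event) => {
    if (event.type !== "InventoryReserved") return;
    if (succeed) {
      bus.publish({ type: "PaymentCharged", orderId: event.orderId });
    } else {
      bus.publish({ type: "PaymentFailed", orderId: event.orderId });
    }
  });
}
```

Each service knows only which events it consumes and emits, so the saga's overall flow emerges from the subscriptions rather than from one coordinating method.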
Orchestration
Orchestrator coordinates the saga.
    class SagaOrchestrator {
      async executeOrder(order: Order): Promise<void> {
        const steps = [
          { action: () => this.reserveInventory(order), compensate: () => this.releaseInventory(order) },
          { action: () => this.chargePayment(order), compensate: () => this.refundPayment(order) },
          // ... remaining steps and the execution loop are truncated in the source
        ];
      }
    }
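A complete orchestrator loop might look like the following. This is a minimal sketch under stated assumptions, not the original listing: `SagaStep` and `runSaga` are hypothetical names, and the sketch assumes compensations themselves do not fail.

```typescript
interface SagaStep {
  name: string;
  action: () => Promise<void>;
  compensate: () => Promise<void>;
}

// Runs steps in order. On the first failure, undoes the steps that
// completed, most recent first, and reports the outcome.
async function runSaga(steps: SagaStep[]): Promise<"completed" | "compensated"> {
  const done: SagaStep[] = [];
  for (const step of steps) {
    try {
      await step.action();
      done.push(step);
    } catch {
      // The failed step is not in `done`, so only finished work is undone.
      for (const finished of done.reverse()) {
        await finished.compensate();
      }
      return "compensated";
    }
  }
  return "completed";
}
```

Tracking completed steps explicitly avoids compensating work that never ran, at the cost of keeping that list somewhere durable in a real deployment.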
Examples
E-commerce Order Processing
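    class OrderSaga {
      async processOrder(order: Order): Promise<void> {
        let completedSteps = 0;
        try {
          // Step 1: Reserve inventory
          await this.inventoryService.reserve(order.items);
          completedSteps = 1;

          // Step 2: Charge payment
          await this.paymentService.charge(order.payment);
          completedSteps = 2;

          // Step 3: Create shipment
          await this.shippingService.createShipment(order);
        } catch (error) {
          // If any step fails, compensate previous steps
          await this.compensate(order, completedSteps + 1);
          throw error;
        }
      }

      // The source listing is truncated here. This body is an assumed completion:
      // failedStep is taken to be the 1-based number of the step that threw, and
      // refund() and release() are assumed counterparts of charge() and reserve().
      async compensate(order: Order, failedStep: number): Promise<void> {
        if (failedStep > 2) await this.paymentService.refund(order.payment);
        if (failedStep > 1) await this.inventoryService.release(order.items);
      }
    }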
Common Pitfalls
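- Using 2PC for everything: Participants hold locks until the coordinator decides, so long-running transactions block other work. Fix: Use Saga for long-running transactions
- Not handling coordinator failure: Participants that voted "yes" can block indefinitely. Fix: Persist the coordinator's decision so it can resume after a restart, time out participants that have not yet voted, or use 3PC
- Saga compensation not idempotent: A retried compensation can apply a refund or release twice. Fix: Make compensations idempotent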
- Not considering network partitions: 2PC requires all nodes reachable. Fix: Use eventual consistency patterns
- Ignoring latency: 2PC has high latency (multiple round trips). Fix: Use async patterns where possible
- Not logging state: Can't recover from failures. Fix: Log all state transitions
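To illustrate the last pitfall, the sketch below records each saga state transition and uses the record to work out what still needs compensating after a crash. `SagaLog` and its methods are hypothetical, and the in-memory array stands in for durable storage.

```typescript
type StepStatus = "started" | "completed" | "compensated";

interface LogEntry {
  sagaId: string;
  step: string;
  status: StepStatus;
}

class SagaLog {
  // In-memory stand-in for an append-only durable log.
  private entries: LogEntry[] = [];

  record(sagaId: string, step: string, status: StepStatus): void {
    this.entries.push({ sagaId, step, status });
  }

  // Steps that completed and have not been compensated, most recent first.
  pendingCompensation(sagaId: string): string[] {
    const completed: string[] = [];
    for (const entry of this.entries) {
      if (entry.sagaId !== sagaId) continue;
      if (entry.status === "completed") {
        completed.push(entry.step);
      } else if (entry.status === "compensated") {
        const index = completed.lastIndexOf(entry.step);
        if (index >= 0) completed.splice(index, 1);
      }
    }
    return completed.reverse();
  }
}
```

A step that is logged as `started` but never `completed` is in doubt: it may or may not have taken effect, which is one reason the operations themselves need to be idempotent.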
Interview Questions
Beginner
Q: What is a distributed transaction and why is it challenging?
A: A distributed transaction spans multiple nodes/services and must maintain ACID properties across all of them.
Challenges:
- Network failures: Messages can be lost, nodes unreachable
- Node failures: Nodes can crash at any time
- Latency: Multiple round trips increase latency
- Partitions: Network partitions can split the system
- Consistency: Hard to ensure all nodes agree on commit/abort
Example: E-commerce order processing - must reserve inventory, charge payment, and create shipment atomically across different services.
Intermediate
Q: Compare Two-Phase Commit (2PC) and Saga pattern. When would you use each?
A:
Two-Phase Commit (2PC):
- ACID transactions: Strong consistency, all-or-nothing
- Blocking: Participants block if coordinator fails
- Synchronous: All nodes must respond
- Use when: Need strong consistency, short transactions, all nodes must agree
Saga Pattern:
- Eventual consistency: Each step commits independently
- Non-blocking: Services can continue even if one fails
- Compensating transactions: Rollback via compensation
- Use when: Long-running transactions, services can operate independently, eventual consistency acceptable
Comparison:
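- Consistency: 2PC (strong, atomic) vs Saga (eventual; intermediate states are visible to other transactions)
- Latency: 2PC (two coordinator round trips with locks held throughout) vs Saga (steps run one after another, but each commits locally and releases its locks immediately)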
- Failure handling: 2PC (blocks) vs Saga (continues)
- Complexity: 2PC (simpler) vs Saga (more complex compensation logic)
Recommendation: Use 2PC for short, critical transactions. Use Saga for long-running, multi-step processes.
Senior
Q: Design a distributed transaction system for a microservices e-commerce platform. Orders involve inventory, payment, and shipping services. How do you ensure consistency, handle failures, and maintain performance?
A:
Architecture Decision:
- Use Saga pattern (not 2PC) because:
  - Long-running process (inventory → payment → shipping)
  - Services can operate independently
  - Need high availability (can't block on coordinator)
Design:
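    class OrderSagaOrchestrator {
      private steps: SagaStep[] = [];
      private state: SagaState = 'pending';

      async executeOrder(order: Order): Promise<void> {
        this.state = 'executing';

        const steps = [
          {
            name: 'reserve-inventory',
            execute: () => this.inventoryService.reserve(order.items),
            // Method name lost in the source; release() is an assumed counterpart of reserve()
            compensate: () => this.inventoryService.release(order.items),
          },
          // ... remaining steps and the execution loop are truncated in the source
        ];
      }
    }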
Failure Handling:
- Service unavailable: Retry with exponential backoff, timeout after max retries
- Partial failure: Compensate completed steps
- Orchestrator failure: Store state, resume on restart
- Network partition: Services continue independently, resolve conflicts when partition heals
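The "retry with exponential backoff" item can be sketched as follows. `backoffDelays` and `retryWithBackoff` are hypothetical helpers, the doubling factor is an assumption, and jitter is left out to keep the example short. The `sleep` parameter is injectable so the logic can be exercised without real waiting.

```typescript
// Delay before each retry: base, 2x base, 4x base, ... capped at capMs.
function backoffDelays(baseMs: number, maxRetries: number, capMs: number): number[] {
  return Array.from({ length: maxRetries }, (_, i) => Math.min(capMs, baseMs * 2 ** i));
}

// Tries `op` once, then once more after each delay. Rethrows the last
// error when every attempt has failed.
async function retryWithBackoff<T>(
  op: () => Promise<T>,
  delays: number[],
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  try {
    return await op();
  } catch (error) {
    lastError = error;
  }
  for (const delay of delays) {
    await sleep(delay);
    try {
      return await op();
    } catch (error) {
      lastError = error;
    }
  }
  throw lastError;
}
```

When the retries are exhausted, the caller (for example the orchestrator) would treat the step as failed and start compensation.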
Consistency:
- Eventual consistency: Each service commits independently
- Compensation: Rollback via compensating transactions
- Idempotency: All operations must be idempotent (safe to retry)
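One common way to make an operation safe to retry is an idempotency key: the caller sends the same key on every retry and the service applies the effect at most once. The sketch below assumes a hypothetical `PaymentLedger` with an in-memory map. A real service would keep the keys in durable storage alongside the effect.

```typescript
class PaymentLedger {
  // Idempotency key -> amount already refunded for that key.
  private processed = new Map<string, number>();
  private refundedTotal = 0;

  // Applies the refund once per key; a replay returns the recorded
  // result without touching the balance again.
  refund(idempotencyKey: string, amount: number): number {
    const previous = this.processed.get(idempotencyKey);
    if (previous !== undefined) return previous;
    this.refundedTotal += amount;
    this.processed.set(idempotencyKey, amount);
    return amount;
  }

  total(): number {
    return this.refundedTotal;
  }
}
```

For a saga, a natural key is the saga identifier combined with the step name, so a retried compensation maps to the same key every time.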
Performance:
- Async execution: Don't block on each step
- Parallel steps: Execute independent steps in parallel
- Caching: Cache service responses
- Batching: Batch multiple orders if possible
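For the "parallel steps" item, a minimal sketch is shown below. It assumes the steps really are independent of one another, which is a design decision the text does not spell out. `ParallelStep` and `runIndependentSteps` are hypothetical names.

```typescript
interface ParallelStep {
  name: string;
  execute: () => Promise<void>;
  compensate: () => Promise<void>;
}

// Runs all steps concurrently. If any step fails, compensates the
// ones that succeeded and returns false.
async function runIndependentSteps(steps: ParallelStep[]): Promise<boolean> {
  const outcomes = await Promise.all(
    steps.map(async (step) => {
      try {
        await step.execute();
        return { step, ok: true };
      } catch {
        return { step, ok: false };
      }
    }),
  );
  const succeeded = outcomes.filter((o) => o.ok).map((o) => o.step);
  if (succeeded.length === steps.length) return true;
  for (const step of succeeded) {
    await step.compensate();
  }
  return false;
}
```

Each step's error is caught individually so that one failure does not hide which of the other steps finished and therefore need undoing.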
Monitoring:
- Track saga execution time
- Monitor compensation rate
- Alert on compensation failures
- Track step success/failure rates
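The monitoring items could be collected by something as small as the sketch below. `SagaMetrics` and its outcome labels are hypothetical. In practice these counters would be exported to whatever metrics system is already in use.

```typescript
type SagaOutcome = "completed" | "compensated" | "compensation-failed";

class SagaMetrics {
  private durationsMs: number[] = [];
  private counts: Record<SagaOutcome, number> = {
    "completed": 0,
    "compensated": 0,
    "compensation-failed": 0,
  };

  record(durationMs: number, outcome: SagaOutcome): void {
    this.durationsMs.push(durationMs);
    this.counts[outcome] += 1;
  }

  // Share of sagas that had to roll back (successfully or not).
  compensationRate(): number {
    const total = this.durationsMs.length;
    if (total === 0) return 0;
    return (this.counts["compensated"] + this.counts["compensation-failed"]) / total;
  }

  averageDurationMs(): number {
    const total = this.durationsMs.length;
    if (total === 0) return 0;
    return this.durationsMs.reduce((sum, d) => sum + d, 0) / total;
  }

  // A failed compensation leaves data inconsistent, so it always alerts.
  shouldAlert(): boolean {
    return this.counts["compensation-failed"] > 0;
  }
}
```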
Related Topics
- Two-Phase Commit (2PC): Coordinator-based atomic commit protocol
- Three-Phase Commit (3PC): Non-blocking alternative to 2PC
- Idempotency: Making operations safe to retry
- Fault Tolerance: Handling failures in distributed systems
- Partition Tolerance: CAP theorem and network partitions
Key Takeaways
- Distributed transactions are hard: Network failures, partitions, and latency complicate ACID
- 2PC provides strong consistency but blocks on coordinator failure
- 3PC reduces blocking but still requires all nodes reachable
- Saga pattern uses compensating transactions for eventual consistency
- Choose based on requirements: Strong consistency (2PC) vs Availability (Saga)
- Idempotency is critical: All operations must be safe to retry
- Log state transitions: Essential for recovery from failures
- Compensation logic: Must handle partial failures gracefully