Topic Overview
Partition Tolerance: Concepts, Trade-offs & Failure Modes
Learn how distributed systems handle network partitions and maintain availability.
Partition tolerance is the ability of a distributed system to continue operating despite network partitions that split the system into isolated groups.
What is a Network Partition?
A network partition occurs when network failures cause nodes to be unable to communicate, splitting the system into disconnected groups.
Example: Data center A can't reach data center B, but both continue operating independently.
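To make the definition concrete, here is a minimal sketch that treats a cluster as a list of node IDs plus the links that still work, and computes which groups of nodes can reach each other. The names `Link` and `findPartitions` are illustrative assumptions, not part of any library.

```typescript
// A working network link between two nodes (hypothetical type for illustration).
type Link = [string, string];

// Returns the groups of nodes that can still reach each other.
// Assumes every link refers to node IDs that appear in `nodes`.
function findPartitions(nodes: string[], workingLinks: Link[]): string[][] {
  const neighbors: Record<string, string[]> = {};
  for (const n of nodes) neighbors[n] = [];
  for (const [a, b] of workingLinks) {
    neighbors[a].push(b);
    neighbors[b].push(a);
  }

  const seen = new Set<string>();
  const groups: string[][] = [];
  for (const start of nodes) {
    if (seen.has(start)) continue;
    // Breadth-first search collects everything reachable from `start`.
    const group: string[] = [];
    const queue: string[] = [start];
    seen.add(start);
    while (queue.length > 0) {
      const current = queue.shift() as string;
      group.push(current);
      for (const next of neighbors[current]) {
        if (!seen.has(next)) {
          seen.add(next);
          queue.push(next);
        }
      }
    }
    groups.push(group);
  }
  return groups;
}

// Data center A (a1, a2) has lost every link to data center B (b1, b2).
const partitions = findPartitions(
  ["a1", "a2", "b1", "b2"],
  [["a1", "a2"], ["b1", "b2"]],
);
console.log(partitions.length); // 2
```

More than one group in the result means the cluster is partitioned; a single group means every node can reach every other node, possibly through intermediaries.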
CAP Theorem
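The CAP theorem says a distributed system cannot guarantee all three of the following at the same time:
- Consistency: Every read returns the most recent write, so all nodes appear to hold the same data
- Availability: Every request to a non-failed node receives a response
- Partition tolerance: The system keeps operating when the network drops or delays messages between nodes
Because partitions cannot be ruled out, the real choice arises during a partition:
- CP: Choose consistency (reject or block operations that cannot be kept consistent)
- AP: Choose availability (keep accepting reads and writes, let replicas diverge temporarily)
- CA: Not achievable in a distributed system, since partitions will happen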
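The choice can be sketched as a small decision function. This is a simplified model under stated assumptions: the names `Mode`, `WriteOutcome`, and `decideWrite` are hypothetical, and a real system would make this decision per operation inside its replication protocol.

```typescript
type Mode = "CP" | "AP";
type WriteOutcome = "accepted" | "rejected" | "accepted-may-conflict";

// Decides what a node does with an incoming write.
// `partitioned`: the node knows it cannot reach some peers.
// `canReachMajority`: the node, counting itself, still reaches more than half the cluster.
function decideWrite(mode: Mode, partitioned: boolean, canReachMajority: boolean): WriteOutcome {
  if (!partitioned) return "accepted"; // healthy network: both modes behave the same
  if (mode === "CP") {
    // Consistency first: only the majority side may accept writes.
    return canReachMajority ? "accepted" : "rejected";
  }
  // Availability first: accept everywhere, reconcile after the partition heals.
  return "accepted-may-conflict";
}

console.log(decideWrite("CP", true, false)); // "rejected"
console.log(decideWrite("AP", true, false)); // "accepted-may-conflict"
```

The two modes only differ once a partition exists, which is why CAP is better read as a choice made during failures than as a permanent label.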
Handling Partitions
CP Systems (Consistency + Partition Tolerance)
Behavior: Block operations during partition to maintain consistency.
Example: ZooKeeper, or a traditional SQL database configured for strong consistency.
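```typescript
// Sketch of a CP write path: refuse writes unless a majority is reachable.
abstract class CPSystem {
  // Assumed helpers and state; the original listing does not show them.
  protected abstract totalNodes: number;
  protected abstract countReachableNodes(): Promise<number>;
  protected abstract writeToMajority(data: any): Promise<void>;

  async write(data: any): Promise<void> {
    // Check if we have quorum
    if (!(await this.hasQuorum())) {
      throw new Error('No quorum, cannot write');
    }

    // Write to majority
    await this.writeToMajority(data);
  }

  async hasQuorum(): Promise<boolean> {
    // The original listing is cut off after `const reachable`;
    // the majority check below is an assumed completion.
    const reachable = await this.countReachableNodes();
    return reachable >= Math.floor(this.totalNodes / 2) + 1;
  }
}
```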
AP Systems (Availability + Partition Tolerance)
Behavior: Continue operating, accept eventual consistency.
Example: Dynamo, Cassandra, CouchDB.
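```typescript
type Data = unknown; // placeholder; the original does not define Data

// Storage and replication helpers are assumed; the original does not show them.
abstract class APSystem {
  protected abstract localWrite(data: any): Promise<void>;
  protected abstract localRead(key: string): Promise<Data>;
  protected abstract queueForReplication(data: any): void;

  async write(data: any): Promise<void> {
    // Always allow writes (availability)
    await this.localWrite(data);

    // Replicate when partition heals
    this.queueForReplication(data);
  }

  async read(key: string): Promise<Data> {
    // Read from local (may be stale)
    return await this.localRead(key);
  }
}
```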
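The listing does not show what `queueForReplication` does. One plausible approach, sketched here as an assumption rather than the author's implementation, is to keep a per-peer backlog of missed writes and flush it once the peer is reachable again. `PendingWrite` and `ReplicationQueue` are hypothetical names.

```typescript
type PendingWrite = { key: string; value: string };

class ReplicationQueue {
  // Peer ID -> writes that peer has missed while unreachable.
  private pending: Record<string, PendingWrite[]> = {};

  // Record a write that `peer` could not receive.
  enqueue(peer: string, write: PendingWrite): void {
    const backlog = this.pending[peer] ?? [];
    backlog.push(write);
    this.pending[peer] = backlog;
  }

  // Called when `peer` is reachable again: deliver missed writes in order
  // and return how many were sent.
  flush(peer: string, send: (write: PendingWrite) => void): number {
    const backlog = this.pending[peer] ?? [];
    for (const write of backlog) send(write);
    delete this.pending[peer];
    return backlog.length;
  }
}

const queue = new ReplicationQueue();
queue.enqueue("node-b", { key: "cart", value: "book" });
queue.enqueue("node-b", { key: "theme", value: "dark" });
const sent = queue.flush("node-b", (w) => console.log("replicating", w.key));
console.log(sent); // 2
```

Replaying the backlog only delivers the missed writes. If the peer accepted its own writes to the same keys during the partition, a conflict resolution step is still needed.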
Examples
Dynamo-style Partition Handling
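```typescript
// Helper methods and the string node IDs are assumed; the original does not show them.
abstract class PartitionTolerantStore {
  protected abstract getAvailableNodes(): Promise<string[]>;
  protected abstract writeToNodes(key: string, value: any, nodes: string[]): Promise<void>;

  async write(key: string, value: any): Promise<void> {
    // Write to local and available nodes
    const available = await this.getAvailableNodes();

    // Write to a majority of the reachable nodes: floor(n / 2) + 1
    const writeCount = Math.floor(available.length / 2) + 1;
    await this.writeToNodes(key, value, available.slice(0, writeCount));
    // The original listing is truncated after this point.
  }
}
```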
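Dynamo-style stores are usually described with three numbers: N replicas per key, W acknowledgements required for a write, and R replicas consulted for a read. When R + W > N, every read set shares at least one replica with the most recent write set. The helper names below are illustrative only.

```typescript
// Smallest number of nodes that forms a majority of n.
function majority(n: number): number {
  return Math.floor(n / 2) + 1;
}

// True when any read quorum of size r must intersect any write quorum of size w
// among n replicas.
function quorumsOverlap(n: number, r: number, w: number): boolean {
  return r + w > n;
}

console.log(majority(3)); // 2
console.log(majority(4)); // 3
console.log(quorumsOverlap(3, 2, 2)); // true
console.log(quorumsOverlap(3, 1, 1)); // false
```

With N = 3, choosing R = 1 and W = 1 maximizes availability but allows a read to miss the latest write, while R = 2 and W = 2 gives overlapping quorums at the cost of needing two reachable replicas for every operation.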
Common Pitfalls
- Assuming no partitions: Partitions will happen, so the design must account for them
- Choosing the wrong CAP trade-off: Picking CP or AP without checking it against the system's requirements
- Not handling conflicts: AP systems need a conflict resolution strategy
- Ignoring split-brain: Each side of a partition may end up with its own leader
- Not testing partitions: Partition scenarios must be exercised in tests, not assumed to work
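The last two pitfalls can be explored with a small in-memory harness. The sketch below is a toy model, not a real test framework: `SimulatedNetwork` and `mayLead` are hypothetical names, and the rule "a leader must reach a majority" is one common guard against split-brain.

```typescript
class SimulatedNetwork {
  private blocked = new Set<string>();

  // Order-independent key for a pair of nodes.
  private static key(a: string, b: string): string {
    return [a, b].sort().join("|");
  }

  // Cut every link between the two sides.
  partition(sideA: string[], sideB: string[]): void {
    for (const a of sideA) {
      for (const b of sideB) this.blocked.add(SimulatedNetwork.key(a, b));
    }
  }

  // Restore all links.
  heal(): void {
    this.blocked.clear();
  }

  canReach(a: string, b: string): boolean {
    return a === b || !this.blocked.has(SimulatedNetwork.key(a, b));
  }
}

// A node may act as leader only while it reaches a majority, itself included.
function mayLead(node: string, cluster: string[], net: SimulatedNetwork): boolean {
  const reachable = cluster.filter((peer) => net.canReach(node, peer)).length;
  return reachable >= Math.floor(cluster.length / 2) + 1;
}

const cluster = ["n1", "n2", "n3", "n4", "n5"];
const net = new SimulatedNetwork();
net.partition(["n1", "n2"], ["n3", "n4", "n5"]);
console.log(mayLead("n1", cluster, net)); // false: n1 reaches 2 of 5
console.log(mayLead("n3", cluster, net)); // true: n3 reaches 3 of 5
net.heal();
console.log(mayLead("n1", cluster, net)); // true
```

Because two disjoint groups cannot both contain a majority, at most one side of any partition passes the `mayLead` check.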
Interview Questions
Beginner
Q: What is partition tolerance in the context of CAP theorem?
A: Partition tolerance means the system continues operating despite network partitions that split nodes into isolated groups.
CAP theorem: A system cannot guarantee consistency, availability, and partition tolerance all at once, which leaves:
- CP: Consistency + Partition tolerance (block during a partition)
- AP: Availability + Partition tolerance (continue, accept inconsistency)
- CA: Not possible (partitions will happen in distributed systems)
Example: During a partition, a CP system blocks writes to maintain consistency. An AP system allows writes but may have conflicts to resolve later.
Intermediate
Q: How does a system handle network partitions? Compare CP and AP approaches.
A:
CP Approach (Consistency + Partition Tolerance):
- Behavior: Block operations if no quorum
- Trade-off: Sacrifice availability for consistency
- Use when: Strong consistency required (financial systems)
- Example: ZooKeeper, traditional SQL databases configured for strong consistency
AP Approach (Availability + Partition Tolerance):
- Behavior: Continue operating, allow writes
- Trade-off: Sacrifice consistency for availability
- Use when: High availability required (social media, content delivery)
- Example: Dynamo, Cassandra, CouchDB
During a partition:
- CP: The minority side blocks; the majority side continues
- AP: Both sides continue and resolve conflicts when the partition heals
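For the AP case, the work after a partition heals starts with finding out where the two sides diverged. The sketch below assumes each side is a plain key-value map; `Replica`, `findConflicts`, and `mergeNonConflicting` are hypothetical names for illustration.

```typescript
type Replica = Record<string, string>;

// Keys present on both sides with different values: these need resolution.
function findConflicts(a: Replica, b: Replica): string[] {
  return Object.keys(a).filter((key) => key in b && a[key] !== b[key]);
}

// Union of both sides, leaving out the conflicting keys for separate handling.
function mergeNonConflicting(a: Replica, b: Replica): Replica {
  const merged: Replica = { ...a, ...b };
  for (const key of findConflicts(a, b)) delete merged[key];
  return merged;
}

// Both sides accepted writes while they could not reach each other.
const west: Replica = { cart: "book", theme: "dark" };
const east: Replica = { cart: "laptop", lang: "en" };

console.log(findConflicts(west, east)); // [ 'cart' ]
console.log(mergeNonConflicting(west, east)); // { theme: 'dark', lang: 'en' }
```

Comparing values alone cannot tell which write is newer or whether the two writes were concurrent. That is the gap that timestamps, vector clocks, or CRDTs are meant to fill.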
Senior
Q: Design a partition-tolerant distributed database. How do you handle writes during partitions, resolve conflicts, and ensure data consistency when partitions heal?
A:
Design: AP System with Conflict Resolution
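```typescript
// Placeholder shapes; the original listing does not define these types.
interface Node { id: string }
type VectorClock = Record<string, number>;

abstract class PartitionTolerantDatabase {
  private nodes: Node[] = [];
  private quorum = 0; // assumed initial value
  protected abstract getAvailableNodes(): Promise<Node[]>; // assumed helper

  async write(key: string, value: any, vectorClock: VectorClock): Promise<void> {
    // Always allow writes (availability)
    const available = await this.getAvailableNodes();

    if (available.length === 0) {
      throw new Error('No nodes available');
    }
    // The original listing is truncated here; the rest of the class is not shown.
  }
}
```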
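The design passes a vector clock with every write. As a minimal sketch of how such clocks can be compared, assuming a clock is a map from node ID to counter, the function below reports whether one version happened before another or whether the two are concurrent. `compareClocks` and `tick` are illustrative names.

```typescript
type VectorClock = Record<string, number>;
type Ordering = "before" | "after" | "equal" | "concurrent";

// Compares two clocks entry by entry; a missing entry counts as 0.
function compareClocks(a: VectorClock, b: VectorClock): Ordering {
  let aAhead = false;
  let bAhead = false;
  for (const id of Object.keys({ ...a, ...b })) {
    const av = a[id] ?? 0;
    const bv = b[id] ?? 0;
    if (av > bv) aAhead = true;
    if (bv > av) bAhead = true;
  }
  if (aAhead && bAhead) return "concurrent"; // neither version saw the other
  if (aAhead) return "after";
  if (bAhead) return "before";
  return "equal";
}

// Returns a copy of `clock` with the counter for `nodeId` advanced by one.
function tick(clock: VectorClock, nodeId: string): VectorClock {
  return { ...clock, [nodeId]: (clock[nodeId] ?? 0) + 1 };
}

const v1 = tick({}, "n1");     // { n1: 1 }
const v2 = tick(v1, "n1");     // { n1: 2 }, written after seeing v1
const other = tick({}, "n2");  // { n2: 1 }, written on the other side of a partition

console.log(compareClocks(v1, v2));    // "before"
console.log(compareClocks(v2, other)); // "concurrent"
```

A "before" or "after" result means one version can safely replace the other. A "concurrent" result is a real conflict that one of the resolution strategies has to handle.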
Conflict Resolution Strategies:
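- Last-write-wins: Simple, but silently discards one of two concurrent writes and depends on reasonably synchronized clocks
- Vector clocks: Detect concurrent writes; the application must then resolve them
- CRDTs: Merge automatically, but only for data types designed for it (counters, sets, and similar)
- Application-level: Let the application decide how to merge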
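As an example of the CRDT strategy, here is a minimal sketch of a grow-only counter (G-Counter). Each node increments only its own slot, and merging takes the per-node maximum, so replicas converge regardless of merge order. The function names are illustrative.

```typescript
// Node ID -> number of increments performed on that node.
type GCounter = Record<string, number>;

// Returns a copy of the counter with `nodeId`'s slot increased.
function incrementCounter(counter: GCounter, nodeId: string, by: number = 1): GCounter {
  return { ...counter, [nodeId]: (counter[nodeId] ?? 0) + by };
}

// The counter's value is the sum of all slots.
function counterValue(counter: GCounter): number {
  return Object.keys(counter).reduce((sum, id) => sum + counter[id], 0);
}

// Merge by taking the maximum of each slot.
function mergeCounters(a: GCounter, b: GCounter): GCounter {
  const merged: GCounter = { ...a };
  for (const id of Object.keys(b)) {
    merged[id] = Math.max(merged[id] ?? 0, b[id]);
  }
  return merged;
}

// Two replicas count page views independently during a partition.
const left = incrementCounter({}, "n1", 2);  // { n1: 2 }
const right = incrementCounter({}, "n2", 3); // { n2: 3 }

const healed = mergeCounters(left, right);
console.log(counterValue(healed)); // 5
```

Merging is commutative and idempotent here, so it does not matter which replica merges first or how many times the same state is merged. The trade-off is that this structure only supports increments; other data types need different CRDT designs.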
Key Takeaways
- Partition tolerance is required in distributed systems (partitions will happen)
- CAP theorem: During a partition, choose consistency (CP) or availability (AP); CA is not an option
- CP systems: Block during a partition to maintain consistency
- AP systems: Continue operating, resolve conflicts later
- Conflict resolution: AP systems need strategies (LWW, vector clocks, CRDTs)
- Quorum: Majority-based operations let the majority side keep working during a partition
- Design for partitions: Don't assume perfect network connectivity
Related Topics
- Consensus Algorithms (Raft, Paxos): Handling partitions in consensus
- Leader Election: Elections during network partitions
- Distributed Transactions: Transaction protocols and partitions
- Clock Synchronization (NTP, Lamport): Ordering events across partitions
- Gossip Protocol: Information dissemination during partitions