Topic Overview

Partition Tolerance

Learn how distributed systems handle network partitions and maintain availability.

Partition tolerance is the ability of a distributed system to continue operating despite network partitions that split the system into isolated groups.


What is a Network Partition?

A network partition occurs when network failures cause nodes to be unable to communicate, splitting the system into disconnected groups.

Example: Data center A can't reach data center B, but both continue operating independently.


CAP Theorem

You can only guarantee 2 of 3:

  • Consistency: All nodes see same data
  • Availability: System responds to requests
  • Partition tolerance: System continues despite partitions

During partition:

  • CP: Choose consistency (block writes, maintain consistency)
  • AP: Choose availability (allow writes, sacrifice consistency)
  • CA: Not possible in distributed systems (partitions will happen)

Handling Partitions

CP Systems (Consistency + Partition Tolerance)

Behavior: Block operations during partition to maintain consistency.

Example: Traditional databases with strong consistency.

class CPSystem {
  async write(data: any): Promise<void> {
    // Check if we have quorum
    if (!await this.hasQuorum()) {
      throw new Error('No quorum, cannot write');
    }

    // Write to majority
    await this.writeToMajority(data);
  }

  async hasQuorum(): Promise<boolean> {
    const reachable = await this.checkConnectivity();
    return reachable.length > this.nodes.length / 2;
  }
}

AP Systems (Availability + Partition Tolerance)

Behavior: Continue operating, accept eventual consistency.

Example: Dynamo, Cassandra, CouchDB.

class APSystem {
  async write(data: any): Promise<void> {
    // Always allow writes (availability)
    await this.localWrite(data);
    
    // Replicate when partition heals
    this.queueForReplication(data);
  }

  async read(key: string): Promise<Data> {
    // Read from local (may be stale)
    return await this.localRead(key);
  }

  async partitionHeals(): Promise<void> {
    // Sync data when partition heals
    await this.reconcileConflicts();
  }
}

Examples

Dynamo-style Partition Handling

class PartitionTolerantStore {
  async write(key: string, value: any): Promise<void> {
    // Write to local and available nodes
    const available = await this.getAvailableNodes();
    
    // Write to N nodes (quorum)
    const writeCount = Math.ceil(available.length / 2) + 1;
    await this.writeToNodes(key, value, available.slice(0, writeCount));
  }

  async read(key: string): Promise<Data> {
    // Read from available nodes
    const available = await this.getAvailableNodes();
    const readCount = Math.ceil(available.length / 2) + 1;
    
    const results = await this.readFromNodes(key, available.slice(0, readCount));
    
    // Return most recent (vector clocks or timestamps)
    return this.resolveConflicts(results);
  }
}

Common Pitfalls

  • Assuming no partitions: Partitions will happen, must design for them
  • Choosing wrong CAP trade-off: Not matching system requirements
  • Not handling conflicts: AP systems need conflict resolution
  • Ignoring split-brain: Multiple leaders during partition
  • Not testing partitions: Must test partition scenarios

Interview Questions

Beginner

Q: What is partition tolerance in the context of CAP theorem?

A: Partition tolerance means the system continues operating despite network partitions that split nodes into isolated groups.

CAP theorem: You can only guarantee 2 of 3:

  • CP: Consistency + Partition tolerance (block during partition)
  • AP: Availability + Partition tolerance (continue, accept inconsistency)
  • CA: Not possible (partitions will happen in distributed systems)

Example: During partition, CP system blocks writes to maintain consistency. AP system allows writes but may have conflicts to resolve later.


Intermediate

Q: How does a system handle network partitions? Compare CP and AP approaches.

A:

CP Approach (Consistency + Partition Tolerance):

  • Behavior: Block operations if no quorum
  • Trade-off: Sacrifice availability for consistency
  • Use when: Strong consistency required (financial systems)
  • Example: Traditional SQL databases, Zookeeper

AP Approach (Availability + Partition Tolerance):

  • Behavior: Continue operating, allow writes
  • Trade-off: Sacrifice consistency for availability
  • Use when: High availability required (social media, content delivery)
  • Example: Dynamo, Cassandra, CouchDB

During partition:

  • CP: Minority partition blocks, majority continues
  • AP: Both partitions continue, resolve conflicts when partition heals

Senior

Q: Design a partition-tolerant distributed database. How do you handle writes during partitions, resolve conflicts, and ensure data consistency when partitions heal?

A:

Design: AP System with Conflict Resolution

class PartitionTolerantDatabase {
  private nodes: Node[] = [];
  private quorum: number;

  async write(key: string, value: any, vectorClock: VectorClock): Promise<void> {
    // Always allow writes (availability)
    const available = await this.getAvailableNodes();
    
    if (available.length === 0) {
      throw new Error('No nodes available');
    }

    // Write to local first
    await this.localWrite(key, { value, vectorClock, timestamp: Date.now() });

    // Replicate to available nodes (best effort)
    const writeCount = Math.min(this.quorum, available.length);
    await Promise.allSettled(
      available.slice(0, writeCount).map(node => 
        node.replicate(key, { value, vectorClock })
      )
    );

    // Queue for replication when partition heals
    if (available.length < this.nodes.length) {
      await this.queueForReplication(key, { value, vectorClock });
    }
  }

  async read(key: string): Promise<Data> {
    const available = await this.getAvailableNodes();
    const readCount = Math.min(this.quorum, available.length);

    // Read from multiple nodes
    const results = await Promise.all(
      available.slice(0, readCount).map(node => node.read(key))
    );

    // Resolve conflicts
    return this.resolveConflicts(results);
  }

  resolveConflicts(versions: Version[]): Data {
    // Strategy 1: Last-write-wins (LWW)
    const latest = versions.sort((a, b) => b.timestamp - a.timestamp)[0];

    // Strategy 2: Vector clock comparison
    const concurrent = this.findConcurrent(versions);
    if (concurrent.length > 1) {
      // Multiple concurrent writes - need application-level resolution
      return this.applicationResolve(concurrent);
    }

    return latest.value;
  }

  async partitionHeals(): Promise<void> {
    // Sync data between previously partitioned nodes
    const allNodes = await this.discoverAllNodes();
    
    for (const node of allNodes) {
      // Exchange version vectors
      const theirVersions = await node.getVersionVector();
      const myVersions = this.getVersionVector();

      // Find missing or conflicting data
      const toSync = this.compareVersions(myVersions, theirVersions);

      // Sync data
      for (const { key, version } of toSync) {
        if (version.conflict) {
          // Resolve conflict
          const resolved = await this.resolveConflict(key, version);
          await this.replicate(key, resolved);
        } else {
          // Missing data, replicate
          await this.replicate(key, version);
        }
      }
    }
  }
}

Conflict Resolution Strategies:

  1. Last-write-wins: Simple, but may lose data
  2. Vector clocks: Detect concurrent writes, require application resolution
  3. CRDTs: Automatic conflict resolution for certain data types
  4. Application-level: Let application decide how to merge

Key Takeaways

  • Partition tolerance is required in distributed systems (partitions will happen)
  • CAP theorem: Choose 2 of 3 (CP or AP, not CA)
  • CP systems: Block during partition to maintain consistency
  • AP systems: Continue operating, resolve conflicts later
  • Conflict resolution: AP systems need strategies (LWW, vector clocks, CRDTs)
  • Quorum: Majority-based operations handle partitions
  • Design for partitions: Don't assume perfect network connectivity

About the author

InterviewCrafted helps you master system design with patience. We believe in curiosity-led engineering, reflective writing, and designing systems that make future changes feel calm.