← Back to databases

Database Topic

Data Replication

Master database replication strategies for high availability and performance. Essential for distributed systems and database interviews.

Data Replication

Why This Matters

Think of data replication like having backup copies of important documents. If your original document is lost or damaged, you have copies. Database replication works the same way—if your primary database fails, you have replicas that can take over. Also, you can distribute read load across replicas, improving performance.

This matters because databases are single points of failure. If your database goes down, your entire application goes down. Replication provides high availability—if the primary fails, a replica can take over. It also improves performance—you can distribute read queries across replicas, reducing load on the primary.

In interviews, when someone asks "How would you design a highly available system?", they're testing whether you understand replication. Do you know the difference between synchronous and asynchronous replication? Do you understand replication lag? Most engineers don't. They just enable replication and wonder why data is inconsistent.

What Engineers Usually Get Wrong

Most engineers think "replication means automatic failover." But replication just copies data—it doesn't automatically failover. You need additional mechanisms (health checks, failover logic) to detect failures and switch to replicas. Also, failover isn't instant—it takes time to detect failures and switch, during which the system might be unavailable.

Engineers also don't understand replication lag. With asynchronous replication, replicas are always slightly behind the primary (seconds to minutes). If you read from a replica immediately after writing to the primary, you might not see your changes. This is why "read your writes" consistency is hard with replication.

How This Breaks Systems in the Real World

A service was using asynchronous replication. Users would write data to the primary, then immediately read from a replica. But the replica hadn't received the update yet (replication lag). Users saw stale data and thought their writes failed. This created confusion and support tickets. The fix? Read from the primary immediately after writes, or use synchronous replication for critical operations. But synchronous replication has overhead—it slows down writes.

Another story: A service was using replication for high availability, but didn't have automatic failover. When the primary database failed, the service was down for 30 minutes while the team manually failed over to a replica. During this time, the service was completely unavailable. The fix? Implement automatic failover with health checks. Detect failures quickly and switch to replicas automatically.


Why Replicate?

High Availability

If one server fails, others can continue serving requests.

Primary Server (fails)
Secondary Server (takes over)

Performance

Distribute read load across multiple servers.

Read Request → Load Balancer → Replica 1, 2, or 3
Write Request → Primary Server

Geographic Distribution

Place data closer to users.

US Users → US Replica
EU Users → EU Replica
Asia Users → Asia Replica

Disaster Recovery

Backup copies in different locations.


Replication Models

Master-Slave (Primary-Replica)

One primary server accepts writes, replicas handle reads.

Primary (Writes + Reads)
    ↓ (replication)
Replica 1 (Reads only)
Replica 2 (Reads only)
Replica 3 (Reads only)

Characteristics:

  • Simple to understand
  • Single point of failure (primary)
  • Asynchronous or synchronous replication
  • Replicas can lag behind primary

Use case: Read scaling, backup

Master-Master (Multi-Master)

Multiple servers can accept writes.

Master 1 (Writes + Reads) ←→ Master 2 (Writes + Reads)
    ↓                           ↓
Replica 1                    Replica 2

Characteristics:

  • No single point of failure
  • More complex conflict resolution
  • Can have write conflicts
  • Requires careful design

Use case: Multi-region deployments, high availability


Replication Methods

Statement-Based Replication (SBR)

Replicate SQL statements.

-- Primary executes:
UPDATE users SET balance = balance - 100 WHERE id = 1;

-- Replica executes same statement:
UPDATE users SET balance = balance - 100 WHERE id = 1;

Pros:

  • Compact (only SQL statements)
  • Easy to understand

Cons:

  • Non-deterministic functions (NOW(), RAND())
  • Stored procedures/triggers can cause issues

Row-Based Replication (RBR)

Replicate changed rows.

Primary: Row (id=1, balance=900) changed to (id=1, balance=800)
Replica: Apply same row change

Pros:

  • More reliable (exact data changes)
  • Handles non-deterministic functions

Cons:

  • Larger log size
  • More network bandwidth

Mixed Replication

Uses statement-based by default, switches to row-based when needed.


Synchronization Modes

Synchronous Replication

Primary waits for replica to confirm before committing.

1. Primary writes to log
2. Primary sends to replica
3. Replica writes to log
4. Replica confirms to primary
5. Primary commits transaction

Pros:

  • No data loss if primary fails
  • Strong consistency

Cons:

  • Slower writes (waits for replica)
  • Replica failure blocks primary

Use case: Financial systems, critical data

Asynchronous Replication

Primary doesn't wait for replica confirmation.

1. Primary writes to log
2. Primary commits transaction
3. Primary sends to replica (async)
4. Replica applies changes

Pros:

  • Fast writes
  • Primary not blocked by replica issues

Cons:

  • Possible data loss if primary fails
  • Replica lag (eventual consistency)

Use case: Most web applications, read scaling

Semi-Synchronous Replication

Primary waits for at least one replica (not all).

Primary waits for at least 1 of N replicas to confirm

Balance: Between sync and async


Replication Topologies

Single Replica

Primary → Replica

Simple, but single point of failure for replica.

Chain Replication

Primary → Replica 1 → Replica 2 → Replica 3

Reduces load on primary, but longer replication chain.

Star Topology

        Primary
       /   |   \
Replica1 Replica2 Replica3

All replicas get updates directly from primary.

Tree Topology

        Primary
       /       \
Replica1      Replica2
    |             |
Replica3      Replica4

Hierarchical structure, good for geographic distribution.


Consistency Models

Strong Consistency

All replicas have same data at all times.

Write to Primary → Immediately visible on all replicas

Challenge: Requires synchronous replication, slower

Eventual Consistency

Replicas will converge to same state, but may differ temporarily.

Write to Primary → Eventually visible on all replicas
Read from Replica → May see slightly stale data

Challenge: Application must handle stale reads

Read Your Writes

User always sees their own writes, even if reading from replica.

User writes → Primary
User reads → Primary (or replica with user's writes)

Implementation: Route user's reads to primary or wait for replica to catch up


Common Patterns

Read Replicas

Use replicas for read-heavy workloads.

# Write to primary
db.write(user_data)  # Goes to primary

# Read from replica
user = replica_db.read(user_id)  # Goes to replica

Failover

Automatically promote replica to primary if primary fails.

1. Monitor primary health
2. Detect failure
3. Promote replica to primary
4. Update DNS/load balancer
5. Application reconnects to new primary

Geographic Replication

Replicate to different regions.

US Primary → EU Replica → Asia Replica

Benefits:

  • Lower latency for local users
  • Disaster recovery
  • Compliance (data residency)

Database-Specific Examples

PostgreSQL Streaming Replication

-- On primary: Enable WAL archiving
wal_level = replica
max_wal_senders = 3

-- On replica: Configure replication
primary_conninfo = 'host=primary.example.com port=5432 user=replicator'

MySQL Replication

-- On primary: Enable binary logging
[mysqld]
log-bin=mysql-bin
server-id=1

-- On replica: Configure replication
CHANGE MASTER TO
  MASTER_HOST='primary.example.com',
  MASTER_USER='replicator',
  MASTER_PASSWORD='password',
  MASTER_LOG_FILE='mysql-bin.000001',
  MASTER_LOG_POS=154;

MongoDB Replica Set

// Initialize replica set
rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "primary:27017" },
    { _id: 1, host: "replica1:27017" },
    { _id: 2, host: "replica2:27017" }
  ]
});

Monitoring Replication

Key Metrics

  • Replication lag: How far behind replicas are
  • Throughput: Replication rate (bytes/second)
  • Errors: Replication failures
  • Health: Replica status (up/down)

Monitoring Queries

-- PostgreSQL: Check replication lag
SELECT 
    client_addr,
    state,
    pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) as lag_bytes
FROM pg_stat_replication;

-- MySQL: Check replication status
SHOW SLAVE STATUS\G

Best Practices

  1. Monitor lag: Set up alerts for replication lag
  2. Test failover: Regularly test failover procedures
  3. Backup replicas: Use replicas for backups to reduce primary load
  4. Geographic distribution: Place replicas close to users
  5. Consistent backups: Backup from replicas, not primary
  6. Handle stale reads: Design application to tolerate eventual consistency

Common Challenges

Replication Lag

Replicas fall behind primary, causing stale reads.

Solutions:

  • Monitor and alert on lag
  • Route critical reads to primary
  • Use read-your-writes consistency

Split-Brain

Multiple primaries think they're the primary.

Solutions:

  • Use quorum-based election
  • Proper failover procedures
  • Network partitioning handling

Write Conflicts

In multi-master setups, simultaneous writes to same data.

Solutions:

  • Conflict resolution strategies
  • Application-level conflict handling
  • Last-write-wins or merge strategies

Trade-offs

Synchronous Replication

  • Pros: No data loss, strong consistency
  • Cons: Slower writes, primary blocked by replica issues

Asynchronous Replication

  • Pros: Fast writes, primary not blocked
  • Cons: Possible data loss, eventual consistency

Multiple Replicas

  • Pros: High availability, read scaling
  • Cons: More complexity, more resources, consistency challenges

Interview Questions

1. Beginner Question

Q: What is database replication, and why is it important?

A: Database replication is the process of copying and maintaining database data across multiple servers.

Why important:

  • High availability: If one server fails, others can continue serving
  • Performance: Distribute read load across multiple servers
  • Geographic distribution: Place data closer to users
  • Disaster recovery: Backup copies in different locations

Example:

Primary Server (handles writes)
    ↓ (replication)
Replica 1 (handles reads)
Replica 2 (handles reads)
Replica 3 (handles reads)

Benefits:

  • Reads can be distributed across replicas (read scaling)
  • If primary fails, a replica can be promoted (high availability)
  • Replicas can be in different regions (lower latency)

2. Intermediate Question

Q: Explain the difference between synchronous and asynchronous replication. When would you use each?

A:

Synchronous replication:

  • Primary waits for replica to confirm before committing
  • Pros: No data loss if primary fails, strong consistency
  • Cons: Slower writes (waits for replica), primary blocked if replica fails
  • Use when: Financial systems, critical data where consistency > performance

Asynchronous replication:

  • Primary doesn't wait for replica confirmation
  • Pros: Fast writes, primary not blocked by replica issues
  • Cons: Possible data loss if primary fails, eventual consistency (replica lag)
  • Use when: Most web applications, read scaling, performance > strong consistency

Example:

-- Synchronous: Primary waits
Primary: Write → Send to Replica → Wait for confirmation → Commit
-- If replica is slow, write is slow

-- Asynchronous: Primary doesn't wait
Primary: WriteCommit → Send to Replica (async)
-- Write is fast, but replica may lag behind

Trade-off: Consistency vs. Performance. Choose based on requirements.

3. Senior-Level System Question

Q: Design a globally distributed database system for a social media platform with 1B users. The system needs to support: low latency worldwide, high availability, and handle 10M writes/second. How would you implement replication?

A:

Multi-region replication strategy:

  1. Regional primaries with cross-region replication:

    US Primary → US Replicas (3)
         ↓ (async replication)
    EU Primary → EU Replicas (3)
         ↓ (async replication)
    Asia Primary → Asia Replicas (3)
    
  2. Sharding by user_id:

    # Route users to their region
    def get_user_shard(user_id):
        region = get_user_region(user_id)  # Based on signup location
        return f"{region}_primary"
    
  3. Replication topology:

    • Within region: Synchronous replication (low latency, strong consistency)
    • Cross-region: Asynchronous replication (high latency, eventual consistency)
  4. Read strategy:

    def read_user_data(user_id):
        user_region = get_user_region(user_id)
        
        # Read from local replica (low latency)
        replica = get_replica(user_region)
        return replica.query_user(user_id)
    
  5. Write strategy:

    def write_user_data(user_id, data):
        user_region = get_user_region(user_id)
        primary = get_primary(user_region)
        
        # Write to local primary (synchronous within region)
        primary.write_user(user_id, data)
        
        # Replicate to other regions (asynchronous)
        async_replicate_to_other_regions(user_id, data)
    
  6. Conflict resolution:

    # Last-write-wins with vector clocks
    def resolve_conflict(user_data_us, user_data_eu):
        if user_data_us.timestamp > user_data_eu.timestamp:
            return user_data_us
        else:
            return user_data_eu
    
  7. Failover strategy:

    # Automatic failover within region
    def handle_primary_failure(region):
        # Promote replica to primary
        new_primary = elect_replica(region)
        
        # Update DNS/load balancer
        update_routing(region, new_primary)
        
        # Notify other regions
        notify_regions_of_failover(region, new_primary)
    
  8. Monitoring:

    • Replication lag (alert if > 5 seconds)
    • Primary health (heartbeat monitoring)
    • Cross-region latency
    • Conflict rates

Trade-offs:

  • Consistency: Eventual consistency across regions (acceptable for social media)
  • Latency: Low latency within region, higher across regions
  • Complexity: Multi-region replication adds operational complexity
  • Cost: Multiple replicas and cross-region bandwidth

Failure Stories You'll Recognize

The Replication Lag: A service was using asynchronous replication. Users would write data to the primary, then immediately read from a replica. But the replica hadn't received the update yet (replication lag). Users saw stale data and thought their writes failed. This created confusion and support tickets. The fix? Read from the primary immediately after writes, or use synchronous replication for critical operations. But synchronous replication has overhead—it slows down writes.

The Manual Failover: A service was using replication for high availability, but didn't have automatic failover. When the primary database failed, the service was down for 30 minutes while the team manually failed over to a replica. During this time, the service was completely unavailable. The fix? Implement automatic failover with health checks. Detect failures quickly and switch to replicas automatically.

The Split-Brain: A service was using multi-master replication. Two masters were both accepting writes. When the network partitioned, each master thought it was the only one. Both accepted writes, creating conflicting data. When the network recovered, the databases had inconsistent data. The fix? Use a single master for writes, or implement conflict resolution for multi-master setups.

What Interviewers Are Really Testing

They want to hear you talk about replication as a trade-off between consistency and performance. Junior engineers say "replication provides high availability." Senior engineers say "replication provides high availability and can improve read performance. But async replication has lag, and sync replication slows writes. Choose based on your requirements. Monitor replication lag and implement automatic failover."

When they ask "How would you design a highly available system?", they're testing:

  • Do you understand replication strategies?

  • Do you know the trade-offs between sync and async replication?

  • Can you handle replication lag and failover?

  • Replication provides high availability—if one server fails, others can continue

  • Master-slave (primary-replica) is most common—primary handles writes, replicas handle reads

  • Synchronous replication ensures no data loss but slows writes

  • Asynchronous replication is faster but may lose data if primary fails

  • Read replicas distribute read load and improve performance

  • Geographic replication places data closer to users for lower latency

  • Replication lag is inevitable with async replication—design for eventual consistency

  • Failover requires automatic promotion of replica to primary

  • Monitor replication lag and set up alerts for high lag

  • Choose based on requirements—consistency vs. performance trade-offs

  • Multi-master replication is complex but provides no single point of failure

  • Replication topologies (chain, star, tree) have different trade-offs

How InterviewCrafted Will Teach This

We'll teach this through production failures, not definitions. Instead of memorizing "replication copies data," you'll learn through scenarios like "why did users see stale data immediately after writing?"

You'll see how replication affects availability, performance, and consistency. When an interviewer asks "how would you design a highly available system?", you'll think about replication strategies, lag, and failover—not just "use replication."

  • Transactions - Replication must maintain transaction consistency. Understanding transactions helps understand replication guarantees.
  • ACID Properties - Replication affects durability and consistency. Understanding ACID helps understand replication trade-offs.
  • Isolation Levels - Isolation levels affect replication lag and consistency. Understanding isolation helps design replication strategies.
  • Partitioning - Partitioned tables can be replicated independently. Understanding partitioning helps design replicated systems.
  • NoSQL Basics - Many NoSQL databases use replication for availability. Understanding NoSQL helps understand replication patterns.

Key Takeaways

Replication provides high availability—if one server fails, others can continue

Master-slave (primary-replica) is most common—primary handles writes, replicas handle reads

Synchronous replication ensures no data loss but slows writes

Asynchronous replication is faster but may lose data if primary fails

Read replicas distribute read load and improve performance

Geographic replication places data closer to users for lower latency

Replication lag is inevitable with async replication—design for eventual consistency

Failover requires automatic promotion of replica to primary

Monitor replication lag and set up alerts for high lag

Choose based on requirements—consistency vs. performance trade-offs

Multi-master replication is complex but provides no single point of failure

Replication topologies (chain, star, tree) have different trade-offs

Keep exploring

Database concepts build on each other. Explore related topics to deepen your understanding of how data systems work.