Database Topic
Data Replication
Master database replication strategies for high availability and performance. Essential for distributed systems and database interviews.
Data Replication
Why This Matters
Think of data replication like having backup copies of important documents. If your original document is lost or damaged, you have copies. Database replication works the same way—if your primary database fails, you have replicas that can take over. Also, you can distribute read load across replicas, improving performance.
This matters because databases are single points of failure. If your database goes down, your entire application goes down. Replication provides high availability—if the primary fails, a replica can take over. It also improves performance—you can distribute read queries across replicas, reducing load on the primary.
In interviews, when someone asks "How would you design a highly available system?", they're testing whether you understand replication. Do you know the difference between synchronous and asynchronous replication? Do you understand replication lag? Most engineers don't. They just enable replication and wonder why data is inconsistent.
What Engineers Usually Get Wrong
Most engineers think "replication means automatic failover." But replication just copies data—it doesn't automatically failover. You need additional mechanisms (health checks, failover logic) to detect failures and switch to replicas. Also, failover isn't instant—it takes time to detect failures and switch, during which the system might be unavailable.
Engineers also don't understand replication lag. With asynchronous replication, replicas are always slightly behind the primary (seconds to minutes). If you read from a replica immediately after writing to the primary, you might not see your changes. This is why "read your writes" consistency is hard with replication.
How This Breaks Systems in the Real World
A service was using asynchronous replication. Users would write data to the primary, then immediately read from a replica. But the replica hadn't received the update yet (replication lag). Users saw stale data and thought their writes failed. This created confusion and support tickets. The fix? Read from the primary immediately after writes, or use synchronous replication for critical operations. But synchronous replication has overhead—it slows down writes.
Another story: A service was using replication for high availability, but didn't have automatic failover. When the primary database failed, the service was down for 30 minutes while the team manually failed over to a replica. During this time, the service was completely unavailable. The fix? Implement automatic failover with health checks. Detect failures quickly and switch to replicas automatically.
Why Replicate?
High Availability
If one server fails, others can continue serving requests.
Primary Server (fails)
↓
Secondary Server (takes over)
Performance
Distribute read load across multiple servers.
Read Request → Load Balancer → Replica 1, 2, or 3
Write Request → Primary Server
Geographic Distribution
Place data closer to users.
US Users → US Replica
EU Users → EU Replica
Asia Users → Asia Replica
Disaster Recovery
Backup copies in different locations.
Replication Models
Master-Slave (Primary-Replica)
One primary server accepts writes, replicas handle reads.
Primary (Writes + Reads)
↓ (replication)
Replica 1 (Reads only)
Replica 2 (Reads only)
Replica 3 (Reads only)
Characteristics:
- Simple to understand
- Single point of failure (primary)
- Asynchronous or synchronous replication
- Replicas can lag behind primary
Use case: Read scaling, backup
Master-Master (Multi-Master)
Multiple servers can accept writes.
Master 1 (Writes + Reads) ←→ Master 2 (Writes + Reads)
↓ ↓
Replica 1 Replica 2
Characteristics:
- No single point of failure
- More complex conflict resolution
- Can have write conflicts
- Requires careful design
Use case: Multi-region deployments, high availability
Replication Methods
Statement-Based Replication (SBR)
Replicate SQL statements.
-- Primary executes:
UPDATE users SET balance = balance - 100 WHERE id = 1;
-- Replica executes same statement:
UPDATE users SET balance = balance - 100 WHERE id = 1;
Pros:
- Compact (only SQL statements)
- Easy to understand
Cons:
- Non-deterministic functions (NOW(), RAND())
- Stored procedures/triggers can cause issues
Row-Based Replication (RBR)
Replicate changed rows.
Primary: Row (id=1, balance=900) changed to (id=1, balance=800)
Replica: Apply same row change
Pros:
- More reliable (exact data changes)
- Handles non-deterministic functions
Cons:
- Larger log size
- More network bandwidth
Mixed Replication
Uses statement-based by default, switches to row-based when needed.
Synchronization Modes
Synchronous Replication
Primary waits for replica to confirm before committing.
1. Primary writes to log
2. Primary sends to replica
3. Replica writes to log
4. Replica confirms to primary
5. Primary commits transaction
Pros:
- No data loss if primary fails
- Strong consistency
Cons:
- Slower writes (waits for replica)
- Replica failure blocks primary
Use case: Financial systems, critical data
Asynchronous Replication
Primary doesn't wait for replica confirmation.
1. Primary writes to log
2. Primary commits transaction
3. Primary sends to replica (async)
4. Replica applies changes
Pros:
- Fast writes
- Primary not blocked by replica issues
Cons:
- Possible data loss if primary fails
- Replica lag (eventual consistency)
Use case: Most web applications, read scaling
Semi-Synchronous Replication
Primary waits for at least one replica (not all).
Primary waits for at least 1 of N replicas to confirm
Balance: Between sync and async
Replication Topologies
Single Replica
Primary → Replica
Simple, but single point of failure for replica.
Chain Replication
Primary → Replica 1 → Replica 2 → Replica 3
Reduces load on primary, but longer replication chain.
Star Topology
Primary
/ | \
Replica1 Replica2 Replica3
All replicas get updates directly from primary.
Tree Topology
Primary
/ \
Replica1 Replica2
| |
Replica3 Replica4
Hierarchical structure, good for geographic distribution.
Consistency Models
Strong Consistency
All replicas have same data at all times.
Write to Primary → Immediately visible on all replicas
Challenge: Requires synchronous replication, slower
Eventual Consistency
Replicas will converge to same state, but may differ temporarily.
Write to Primary → Eventually visible on all replicas
Read from Replica → May see slightly stale data
Challenge: Application must handle stale reads
Read Your Writes
User always sees their own writes, even if reading from replica.
User writes → Primary
User reads → Primary (or replica with user's writes)
Implementation: Route user's reads to primary or wait for replica to catch up
Common Patterns
Read Replicas
Use replicas for read-heavy workloads.
# Write to primary
db.write(user_data) # Goes to primary
# Read from replica
user = replica_db.read(user_id) # Goes to replica
Failover
Automatically promote replica to primary if primary fails.
1. Monitor primary health
2. Detect failure
3. Promote replica to primary
4. Update DNS/load balancer
5. Application reconnects to new primary
Geographic Replication
Replicate to different regions.
US Primary → EU Replica → Asia Replica
Benefits:
- Lower latency for local users
- Disaster recovery
- Compliance (data residency)
Database-Specific Examples
PostgreSQL Streaming Replication
-- On primary: Enable WAL archiving
wal_level = replica
max_wal_senders = 3
-- On replica: Configure replication
primary_conninfo = 'host=primary.example.com port=5432 user=replicator'
MySQL Replication
-- On primary: Enable binary logging
[mysqld]
log-bin=mysql-bin
server-id=1
-- On replica: Configure replication
CHANGE MASTER TO
MASTER_HOST='primary.example.com',
MASTER_USER='replicator',
MASTER_PASSWORD='password',
MASTER_LOG_FILE='mysql-bin.000001',
MASTER_LOG_POS=154;
MongoDB Replica Set
// Initialize replica set
rs.initiate({
_id: "rs0",
members: [
{ _id: 0, host: "primary:27017" },
{ _id: 1, host: "replica1:27017" },
{ _id: 2, host: "replica2:27017" }
]
});
Monitoring Replication
Key Metrics
- Replication lag: How far behind replicas are
- Throughput: Replication rate (bytes/second)
- Errors: Replication failures
- Health: Replica status (up/down)
Monitoring Queries
-- PostgreSQL: Check replication lag
SELECT
client_addr,
state,
pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) as lag_bytes
FROM pg_stat_replication;
-- MySQL: Check replication status
SHOW SLAVE STATUS\G
Best Practices
- Monitor lag: Set up alerts for replication lag
- Test failover: Regularly test failover procedures
- Backup replicas: Use replicas for backups to reduce primary load
- Geographic distribution: Place replicas close to users
- Consistent backups: Backup from replicas, not primary
- Handle stale reads: Design application to tolerate eventual consistency
Common Challenges
Replication Lag
Replicas fall behind primary, causing stale reads.
Solutions:
- Monitor and alert on lag
- Route critical reads to primary
- Use read-your-writes consistency
Split-Brain
Multiple primaries think they're the primary.
Solutions:
- Use quorum-based election
- Proper failover procedures
- Network partitioning handling
Write Conflicts
In multi-master setups, simultaneous writes to same data.
Solutions:
- Conflict resolution strategies
- Application-level conflict handling
- Last-write-wins or merge strategies
Trade-offs
Synchronous Replication
- Pros: No data loss, strong consistency
- Cons: Slower writes, primary blocked by replica issues
Asynchronous Replication
- Pros: Fast writes, primary not blocked
- Cons: Possible data loss, eventual consistency
Multiple Replicas
- Pros: High availability, read scaling
- Cons: More complexity, more resources, consistency challenges
Interview Questions
1. Beginner Question
Q: What is database replication, and why is it important?
A: Database replication is the process of copying and maintaining database data across multiple servers.
Why important:
- High availability: If one server fails, others can continue serving
- Performance: Distribute read load across multiple servers
- Geographic distribution: Place data closer to users
- Disaster recovery: Backup copies in different locations
Example:
Primary Server (handles writes)
↓ (replication)
Replica 1 (handles reads)
Replica 2 (handles reads)
Replica 3 (handles reads)
Benefits:
- Reads can be distributed across replicas (read scaling)
- If primary fails, a replica can be promoted (high availability)
- Replicas can be in different regions (lower latency)
2. Intermediate Question
Q: Explain the difference between synchronous and asynchronous replication. When would you use each?
A:
Synchronous replication:
- Primary waits for replica to confirm before committing
- Pros: No data loss if primary fails, strong consistency
- Cons: Slower writes (waits for replica), primary blocked if replica fails
- Use when: Financial systems, critical data where consistency > performance
Asynchronous replication:
- Primary doesn't wait for replica confirmation
- Pros: Fast writes, primary not blocked by replica issues
- Cons: Possible data loss if primary fails, eventual consistency (replica lag)
- Use when: Most web applications, read scaling, performance > strong consistency
Example:
-- Synchronous: Primary waits
Primary: Write → Send to Replica → Wait for confirmation → Commit
-- If replica is slow, write is slow
-- Asynchronous: Primary doesn't wait
Primary: Write → Commit → Send to Replica (async)
-- Write is fast, but replica may lag behind
Trade-off: Consistency vs. Performance. Choose based on requirements.
3. Senior-Level System Question
Q: Design a globally distributed database system for a social media platform with 1B users. The system needs to support: low latency worldwide, high availability, and handle 10M writes/second. How would you implement replication?
A:
Multi-region replication strategy:
-
Regional primaries with cross-region replication:
US Primary → US Replicas (3) ↓ (async replication) EU Primary → EU Replicas (3) ↓ (async replication) Asia Primary → Asia Replicas (3) -
Sharding by user_id:
# Route users to their region def get_user_shard(user_id): region = get_user_region(user_id) # Based on signup location return f"{region}_primary" -
Replication topology:
- Within region: Synchronous replication (low latency, strong consistency)
- Cross-region: Asynchronous replication (high latency, eventual consistency)
-
Read strategy:
def read_user_data(user_id): user_region = get_user_region(user_id) # Read from local replica (low latency) replica = get_replica(user_region) return replica.query_user(user_id) -
Write strategy:
def write_user_data(user_id, data): user_region = get_user_region(user_id) primary = get_primary(user_region) # Write to local primary (synchronous within region) primary.write_user(user_id, data) # Replicate to other regions (asynchronous) async_replicate_to_other_regions(user_id, data) -
Conflict resolution:
# Last-write-wins with vector clocks def resolve_conflict(user_data_us, user_data_eu): if user_data_us.timestamp > user_data_eu.timestamp: return user_data_us else: return user_data_eu -
Failover strategy:
# Automatic failover within region def handle_primary_failure(region): # Promote replica to primary new_primary = elect_replica(region) # Update DNS/load balancer update_routing(region, new_primary) # Notify other regions notify_regions_of_failover(region, new_primary) -
Monitoring:
- Replication lag (alert if > 5 seconds)
- Primary health (heartbeat monitoring)
- Cross-region latency
- Conflict rates
Trade-offs:
- Consistency: Eventual consistency across regions (acceptable for social media)
- Latency: Low latency within region, higher across regions
- Complexity: Multi-region replication adds operational complexity
- Cost: Multiple replicas and cross-region bandwidth
Failure Stories You'll Recognize
The Replication Lag: A service was using asynchronous replication. Users would write data to the primary, then immediately read from a replica. But the replica hadn't received the update yet (replication lag). Users saw stale data and thought their writes failed. This created confusion and support tickets. The fix? Read from the primary immediately after writes, or use synchronous replication for critical operations. But synchronous replication has overhead—it slows down writes.
The Manual Failover: A service was using replication for high availability, but didn't have automatic failover. When the primary database failed, the service was down for 30 minutes while the team manually failed over to a replica. During this time, the service was completely unavailable. The fix? Implement automatic failover with health checks. Detect failures quickly and switch to replicas automatically.
The Split-Brain: A service was using multi-master replication. Two masters were both accepting writes. When the network partitioned, each master thought it was the only one. Both accepted writes, creating conflicting data. When the network recovered, the databases had inconsistent data. The fix? Use a single master for writes, or implement conflict resolution for multi-master setups.
What Interviewers Are Really Testing
They want to hear you talk about replication as a trade-off between consistency and performance. Junior engineers say "replication provides high availability." Senior engineers say "replication provides high availability and can improve read performance. But async replication has lag, and sync replication slows writes. Choose based on your requirements. Monitor replication lag and implement automatic failover."
When they ask "How would you design a highly available system?", they're testing:
-
Do you understand replication strategies?
-
Do you know the trade-offs between sync and async replication?
-
Can you handle replication lag and failover?
-
Replication provides high availability—if one server fails, others can continue
-
Master-slave (primary-replica) is most common—primary handles writes, replicas handle reads
-
Synchronous replication ensures no data loss but slows writes
-
Asynchronous replication is faster but may lose data if primary fails
-
Read replicas distribute read load and improve performance
-
Geographic replication places data closer to users for lower latency
-
Replication lag is inevitable with async replication—design for eventual consistency
-
Failover requires automatic promotion of replica to primary
-
Monitor replication lag and set up alerts for high lag
-
Choose based on requirements—consistency vs. performance trade-offs
-
Multi-master replication is complex but provides no single point of failure
-
Replication topologies (chain, star, tree) have different trade-offs
How InterviewCrafted Will Teach This
We'll teach this through production failures, not definitions. Instead of memorizing "replication copies data," you'll learn through scenarios like "why did users see stale data immediately after writing?"
You'll see how replication affects availability, performance, and consistency. When an interviewer asks "how would you design a highly available system?", you'll think about replication strategies, lag, and failover—not just "use replication."
- Transactions - Replication must maintain transaction consistency. Understanding transactions helps understand replication guarantees.
- ACID Properties - Replication affects durability and consistency. Understanding ACID helps understand replication trade-offs.
- Isolation Levels - Isolation levels affect replication lag and consistency. Understanding isolation helps design replication strategies.
- Partitioning - Partitioned tables can be replicated independently. Understanding partitioning helps design replicated systems.
- NoSQL Basics - Many NoSQL databases use replication for availability. Understanding NoSQL helps understand replication patterns.
Key Takeaways
Replication provides high availability—if one server fails, others can continue
Master-slave (primary-replica) is most common—primary handles writes, replicas handle reads
Synchronous replication ensures no data loss but slows writes
Asynchronous replication is faster but may lose data if primary fails
Read replicas distribute read load and improve performance
Geographic replication places data closer to users for lower latency
Replication lag is inevitable with async replication—design for eventual consistency
Failover requires automatic promotion of replica to primary
Monitor replication lag and set up alerts for high lag
Choose based on requirements—consistency vs. performance trade-offs
Multi-master replication is complex but provides no single point of failure
Replication topologies (chain, star, tree) have different trade-offs
Related Topics
Transactions
Replication must maintain transaction consistency. Understanding transactions helps understand replication guarantees.
ACID Properties
Replication affects durability and consistency. Understanding ACID helps understand replication trade-offs.
Isolation Levels
Isolation levels affect replication lag and consistency. Understanding isolation helps design replication strategies.
Partitioning
Partitioned tables can be replicated independently. Understanding partitioning helps design replicated systems.
NoSQL Basics
Many NoSQL databases use replication for availability. Understanding NoSQL helps understand replication patterns.
Keep exploring
Database concepts build on each other. Explore related topics to deepen your understanding of how data systems work.