Replication Lag: Concepts, Trade-offs & Failure Modes
Learn about replication lag in distributed databases and how to handle it.
Replication lag is the delay between when data is written to the primary and when it's available on replicas.
What is Replication Lag?
Definition: The time between a write being committed on the primary and that same write being applied on a replica.
Causes:
- Network latency
- Replica processing time
- High write load
- Network congestion
Impact
- Stale reads: Reading from replica may return old data
- Inconsistency: Different replicas may have different data
- User experience: Users may not see their own writes immediately
Measuring Lag
```typescript
class ReplicationMonitor {
  async measureLag(): Promise<number> {
    // Get primary's last committed timestamp
    // primaryTime = primary ...
    // ...
  }
}
```
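Because the example above is cut off, here is a minimal, self-contained sketch of the same idea. The `PrimaryNode` and `ReplicaNode` interfaces and the `getLastCommittedTime` method are assumptions made for illustration, not the API of any particular database. `getLastAppliedTime` borrows the name used in the Monotonic Reads example. Both timestamps are assumed to come from the primary's clock (the commit time of the last write each node has), which avoids comparing clocks across machines.

```typescript
// Hypothetical interfaces: each node reports a timestamp in milliseconds.
interface PrimaryNode {
  // Commit time of the most recent write on the primary.
  getLastCommittedTime(): Promise<number>;
}

interface ReplicaNode {
  // Primary commit time of the most recent write this replica has applied.
  getLastAppliedTime(): Promise<number>;
}

class ReplicationMonitor {
  private primary: PrimaryNode;
  private replica: ReplicaNode;

  constructor(primary: PrimaryNode, replica: ReplicaNode) {
    this.primary = primary;
    this.replica = replica;
  }

  // Returns the lag in milliseconds; never negative.
  async measureLag(): Promise<number> {
    const primaryTime = await this.primary.getLastCommittedTime();
    const replicaTime = await this.replica.getLastAppliedTime();
    return Math.max(0, primaryTime - replicaTime);
  }
}
```

The two timestamps are sampled one after the other, so the result is an approximation. Real databases usually expose their own lag metrics, which are preferable when available.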
Handling Lag
Read Your Writes
Route a user's reads to the primary for a short window after that user writes, so they always see their own changes.
```typescript
class ReadYourWrites {
  private userWrites: Map<string, number> = new Map();

  async write(userId: string, data: any): Promise<void> {
    await this.primary.write(data);
    this.userWrites.set(userId, Date.now());
  }

  async read(userId: string, key: string): Promise<Data> {
    // ...
  }
}
```
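The `read` side of the example above is cut off, so the following is a hedged sketch of how it might look rather than the original code. The `Store` interface, the `stickyMs` window (5 seconds here), and the injectable `now` clock are all assumptions introduced for illustration.

```typescript
type Data = string;

// Hypothetical storage interface shared by the primary and the replica.
interface Store {
  get(key: string): Promise<Data | undefined>;
  set(key: string, value: Data): Promise<void>;
}

class ReadYourWrites {
  // Time of each user's most recent write, in milliseconds.
  private userWrites = new Map<string, number>();
  private primary: Store;
  private replica: Store;
  private stickyMs: number;
  private now: () => number;

  constructor(
    primary: Store,
    replica: Store,
    stickyMs: number = 5000, // assumed window; tune to the observed lag
    now: () => number = Date.now,
  ) {
    this.primary = primary;
    this.replica = replica;
    this.stickyMs = stickyMs;
    this.now = now;
  }

  async write(userId: string, key: string, value: Data): Promise<void> {
    await this.primary.set(key, value);
    this.userWrites.set(userId, this.now());
  }

  async read(userId: string, key: string): Promise<Data | undefined> {
    const lastWrite = this.userWrites.get(userId);
    const wroteRecently =
      lastWrite !== undefined && this.now() - lastWrite < this.stickyMs;
    // Recent writers read from the primary; everyone else uses the replica.
    return wroteRecently ? this.primary.get(key) : this.replica.get(key);
  }
}
```

The window should be longer than the lag you expect to see. A fixed window is only a heuristic: if lag exceeds it, the user can still read stale data from the replica.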
Monotonic Reads
Ensure a user's reads never go backwards in time: once a user has seen a value, later reads must not return an older one.
```typescript
class MonotonicReads {
  private userReadTimestamps: Map<string, number> = new Map();

  async read(userId: string, key: string): Promise<Data> {
    const lastRead = this.userReadTimestamps.get(userId) || 0;
    const replicaTime = await this.replica.getLastAppliedTime();

    // Only read if replica is ahead of last read
    if (replicaTime > lastRead) {
      // ...
    }
    // ...
  }
}
```
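The example above stops at the freshness check. One way to finish the idea, sketched here under assumptions, is to try each replica in turn and fall back to the primary when none is fresh enough. The `DbNode` interface and the multi-replica constructor are hypothetical. The comparison uses `>=` rather than `>` so that a user can keep reading from a replica that has not changed since their last read.

```typescript
type Data = string;

// Hypothetical node interface; getLastAppliedTime mirrors the name used above.
interface DbNode {
  getLastAppliedTime(): Promise<number>;
  get(key: string): Promise<Data | undefined>;
}

class MonotonicReads {
  // Freshest applied-time each user has read at so far.
  private userReadTimestamps = new Map<string, number>();
  private primary: DbNode;
  private replicas: DbNode[];

  constructor(primary: DbNode, replicas: DbNode[]) {
    this.primary = primary;
    this.replicas = replicas;
  }

  async read(userId: string, key: string): Promise<Data | undefined> {
    const lastRead = this.userReadTimestamps.get(userId) ?? 0;

    for (const replica of this.replicas) {
      const applied = await replica.getLastAppliedTime();
      // Use this replica only if it is at least as fresh as the last read.
      if (applied >= lastRead) {
        this.userReadTimestamps.set(userId, applied);
        return replica.get(key);
      }
    }

    // No replica is fresh enough, so fall back to the primary.
    const applied = await this.primary.getLastAppliedTime();
    this.userReadTimestamps.set(userId, Math.max(applied, lastRead));
    return this.primary.get(key);
  }
}
```

A common alternative that needs no per-user timestamps is to pin each user to a single replica, for example by hashing the user ID.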
Examples
Database Replication
```typescript
class ReplicatedDatabase {
  async handleLag(): Promise<void> {
    const lag = await this.measureLag();

    if (lag > 1000) { // 1 second
      // Route critical reads to primary
      this.usePrimaryForCritical = true;
    } else {
      // Can use replicas
      this.usePrimaryForCritical = false;
    }
  }
}
```
Common Pitfalls
- Ignoring lag: Users see stale data. Fix: Monitor lag and route reads based on it
- Not routing critical reads: Important reads are served by a stale replica. Fix: Route them to the primary
- No lag monitoring: You don't know when lag is high. Fix: Measure lag continuously and alert on it
- Assuming zero lag: With asynchronous replication, replicas always have some lag. Fix: Design for lag
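As one illustration of continuous monitoring, the sketch below raises an alert only after lag has stayed above a threshold for several consecutive samples, so a single spike does not page anyone. The `LagAlert` class, the 1000 ms threshold (matching the 1 second used in the routing example), and the three-sample rule are assumptions chosen for the example.

```typescript
// Tracks consecutive high-lag samples and reports when lag has stayed high.
class LagAlert {
  private consecutiveHigh = 0;
  private thresholdMs: number;
  private requiredSamples: number;

  constructor(thresholdMs: number = 1000, requiredSamples: number = 3) {
    this.thresholdMs = thresholdMs;
    this.requiredSamples = requiredSamples;
  }

  // Record one lag sample; returns true while an alert should be firing.
  record(lagMs: number): boolean {
    // Any sample at or below the threshold resets the streak.
    this.consecutiveHigh =
      lagMs > this.thresholdMs ? this.consecutiveHigh + 1 : 0;
    return this.consecutiveHigh >= this.requiredSamples;
  }
}
```

A caller would feed `record` with the output of a periodic lag measurement and notify an operator, or switch critical reads to the primary, when it returns `true`.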
Interview Questions
Beginner
Q: What is replication lag and why does it matter?
A: Replication lag is the delay between when data is written to the primary database and when it's available on replicas.
Why it matters:
- Stale reads: Reading from replica may return old data
- Inconsistency: Users may not see their own writes
- User experience: Confusing when data doesn't appear immediately
Example: A user updates their profile, but when they refresh the page the old data appears, because the read was served by a replica that had not yet applied the update.
Intermediate
Q: How do you handle replication lag in a read replica setup?
A:
Strategies:
- Read your writes: Route user's reads to primary after their writes
- Monotonic reads: Ensure user always sees newer data
- Lag-aware routing: Route critical reads to primary if lag is high
- Wait for replication: Wait for replica to catch up before reading
Implementation:
- Track user's last write time
- If recent write, read from primary
- Otherwise, read from replica
- Monitor lag and adjust routing
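The "wait for replication" strategy can be sketched as a bounded polling loop. This is a minimal illustration under assumptions: the `ReplicaStatus` interface and its `getAppliedPosition` method are hypothetical, the write is assumed to return a log position to wait for, and `sleep` is passed in so the caller controls how waiting happens.

```typescript
// Hypothetical replica interface: reports the highest log position applied.
interface ReplicaStatus {
  getAppliedPosition(): Promise<number>;
}

// Polls until the replica has applied `targetPosition` or attempts run out.
// Returns true if the replica caught up, false if the caller should fall
// back to reading from the primary.
async function waitForReplica(
  replica: ReplicaStatus,
  targetPosition: number,
  sleep: (ms: number) => Promise<void>,
  maxAttempts: number = 10,
  pollMs: number = 50,
): Promise<boolean> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    if ((await replica.getAppliedPosition()) >= targetPosition) {
      return true;
    }
    await sleep(pollMs);
  }
  return false;
}
```

In a real service, `sleep` could be `(ms) => new Promise((resolve) => setTimeout(resolve, ms))`. The bound on attempts matters: without it, a stalled replica would block reads indefinitely.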
Senior
Q: Design a system that handles replication lag for a social media platform. Users post content that must be visible to followers. How do you ensure consistency while maintaining performance?
A:
Design:
```typescript
class SocialMediaReplication {
  async postContent(userId: string, content: Content): Promise<void> {
    // Write to primary
    await this.primary.write(content);

    // Track user's last write
    this.userWrites.set(userId, Date.now());

    // Async replication to replicas
    this.replicateAsync(content);
  }

  async getFeed(userId: string): Promise<Content[]> {
    // ...
  }
}
```
Optimizations:
- Caching: Cache recent posts to reduce replica load
- Fan-out: Write to follower timelines on post (not on read)
- Eventual consistency: Accept that followers may see posts slightly later
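To make the fan-out optimization concrete, here is a small in-memory sketch of fan-out on write. The `Post` shape and the `TimelineFanout` class are assumptions for illustration. In a real system the timelines would live in a cache or datastore and the fan-out would typically run asynchronously, for example through a queue.

```typescript
interface Post {
  id: string;
  authorId: string;
  text: string;
}

class TimelineFanout {
  // authorId -> set of follower IDs
  private followers = new Map<string, Set<string>>();
  // userId -> that user's timeline, newest post first
  private timelines = new Map<string, Post[]>();

  follow(followerId: string, authorId: string): void {
    const set = this.followers.get(authorId) ?? new Set<string>();
    set.add(followerId);
    this.followers.set(authorId, set);
  }

  // Copy the post into the author's and every follower's timeline at write time.
  publish(post: Post): void {
    const recipients = new Set<string>([
      post.authorId,
      ...(this.followers.get(post.authorId) ?? []),
    ]);
    for (const userId of recipients) {
      const timeline = this.timelines.get(userId) ?? [];
      timeline.unshift(post);
      this.timelines.set(userId, timeline);
    }
  }

  // Reads are a simple lookup; no per-read query across followed authors.
  getTimeline(userId: string, limit: number = 20): Post[] {
    return (this.timelines.get(userId) ?? []).slice(0, limit);
  }
}
```

Writing the post into the author's own timeline gives the author an immediate view of their post, while followers may see it slightly later if the fan-out runs asynchronously. The trade-off is write cost: an author with many followers triggers many timeline writes per post.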
Related Topics
- Partition Tolerance: Handling lag during partitions
- Distributed Transactions: Transaction consistency with lag
- Fault Tolerance: Handling replica failures
- Heartbeats & Health Checks: Monitoring replica health
- Clock Synchronization (NTP, Lamport): Timestamping for lag measurement
Key Takeaways
- Replication lag is inevitable: Network and processing delays cause lag
- Monitor lag: Continuously measure and alert on high lag
- Read-your-writes: Route user's reads to primary after their writes
- Lag-aware routing: Adjust read routing based on lag
- Trade-offs: Consistency vs performance (primary vs replica reads)
- Design for lag: Assume replicas are always slightly behind