Topic Overview

Replication Lag: Concepts, Trade-offs & Failure Modes

Learn about replication lag in distributed databases and how to handle it.

Intermediate · 8 min read

Replication lag is the delay between when data is written to the primary and when it's available on replicas.


What is Replication Lag?

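Definition: The time between a write committing on the primary and that same write being applied on a replica.

Causes:

  • Network latency: Every change must travel from the primary to each replica
  • Replica processing time: The replica has to apply each change, often while also serving reads
  • High write load: Changes arrive faster than replicas can apply them
  • Network congestion: Replication traffic queues behind other traffic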

Impact

  • Stale reads: Reading from a replica may return old data
  • Inconsistency: Different replicas may return different data for the same query at the same moment
  • User experience: Users may not see their own writes immediately


Measuring Lag

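class ReplicationMonitor {
  async measureLag(): Promise<number> {
    // Get primary's last committed timestamp (getLastCommittedTime is an assumed helper)
    const primaryTime = await this.primary.getLastCommittedTime();
    // The original listing is cut off here; assumed completion: how far the replica is behind, in ms
    return Math.max(0, primaryTime - (await this.replica.getLastAppliedTime()));
  }
}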
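
One common way to obtain a lag figure is a heartbeat record: the primary periodically writes its current time into a row that replicates like any other data, and a monitor compares the replica's copy of that row with the current time. The sketch below illustrates the arithmetic only. `HeartbeatStore` and `heartbeatLagMs` are hypothetical names, the stores are in-memory stand-ins for real databases, and the clocks involved are assumed to be synchronized closely enough for the difference to be meaningful.

```typescript
// Hypothetical in-memory stand-in for a table holding a single heartbeat row.
class HeartbeatStore {
  private heartbeatMs: number | null = null;

  writeHeartbeat(nowMs: number): void {
    this.heartbeatMs = nowMs;
  }

  readHeartbeat(): number | null {
    return this.heartbeatMs;
  }
}

// Lag = current time minus the newest primary heartbeat the replica has applied.
// Returns null when the replica has not applied any heartbeat yet.
function heartbeatLagMs(replica: HeartbeatStore, nowMs: number): number | null {
  const applied = replica.readHeartbeat();
  if (applied === null) return null;
  // Clamp at zero: small clock skew can make the raw difference negative.
  return Math.max(0, nowMs - applied);
}

// Simulated timeline in milliseconds. Replication is simulated by writing
// the primary's heartbeat value into the replica's store by hand.
const primaryHb = new HeartbeatStore();
const replicaHb = new HeartbeatStore();
primaryHb.writeHeartbeat(1000); // primary writes a heartbeat at t=1000
replicaHb.writeHeartbeat(1000); // the replica applies it
primaryHb.writeHeartbeat(2000); // the next heartbeat has not replicated yet
const measuredLag = heartbeatLagMs(replicaHb, 2400); // measured at t=2400
console.log(measuredLag); // 1400
```

A heartbeat keeps producing a useful number even when the application itself is idle, because the primary always has a recent write to compare against.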

Handling Lag

Read Your Writes

Route a user's reads to the primary for a short window after that user writes, so they always see their own changes.

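class ReadYourWrites {
  // Time of each user's most recent write, in ms since the epoch
  private userWrites: Map<string, number> = new Map();

  async write(userId: string, data: any): Promise<void> {
    await this.primary.write(data);
    this.userWrites.set(userId, Date.now());
  }

  async read(userId: string, key: string): Promise<Data> {
    // The original listing is cut off at this signature; the body is an assumed
    // completion. RECENT_WRITE_MS is a hypothetical window of a few seconds.
    const lastWrite = this.userWrites.get(userId) ?? 0;
    const wroteRecently = Date.now() - lastWrite < RECENT_WRITE_MS;
    return wroteRecently ? this.primary.read(key) : this.replica.read(key);
  }
}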
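
The routing decision can be isolated from the database calls, which makes it easy to test. The sketch below is one way to do that, not a definitive implementation: `ReadYourWritesRouter`, `chooseTarget`, and the 5 second window are illustrative assumptions, and the clock is passed in so the behavior is deterministic.

```typescript
type Target = "primary" | "replica";

// Hypothetical router: pins a user's reads to the primary for `windowMs`
// after that user's most recent write.
class ReadYourWritesRouter {
  private lastWriteMs = new Map<string, number>();
  private windowMs: number;

  constructor(windowMs: number) {
    this.windowMs = windowMs;
  }

  recordWrite(userId: string, nowMs: number): void {
    this.lastWriteMs.set(userId, nowMs);
  }

  chooseTarget(userId: string, nowMs: number): Target {
    const last = this.lastWriteMs.get(userId);
    if (last !== undefined && nowMs - last < this.windowMs) {
      return "primary"; // recent write: a replica may not have applied it yet
    }
    this.lastWriteMs.delete(userId); // entry expired or absent; keep the map small
    return "replica";
  }
}

const router = new ReadYourWritesRouter(5000); // assumed 5 second window
router.recordWrite("alice", 10000);
const rightAfterWrite = router.chooseTarget("alice", 10200); // "primary"
const otherUser = router.chooseTarget("bob", 10200);         // "replica"
const afterWindow = router.chooseTarget("alice", 15000);     // "replica"
```

The window should be comfortably longer than the lag normally observed; if lag exceeds the window, the user can still see stale data.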

Monotonic Reads

Ensure a user never sees data older than what they have already read, even when successive reads hit different replicas.

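class MonotonicReads {
  private userReadTimestamps: Map<string, number> = new Map();

  async read(userId: string, key: string): Promise<Data> {
    const lastRead = this.userReadTimestamps.get(userId) || 0;
    const replicaTime = await this.replica.getLastAppliedTime();

    // Only read from the replica if it is at least as fresh as this user's last read
    if (replicaTime >= lastRead) {
      // The original listing is cut off after the condition; the rest is an assumed completion
      this.userReadTimestamps.set(userId, replicaTime);
      return this.replica.read(key);
    }
    // Assumed fallback: read from the primary and record its position (hypothetical helper)
    this.userReadTimestamps.set(userId, await this.primary.getLastCommittedTime());
    return this.primary.read(key);
  }
}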
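
With several replicas, the same freshness check can be used to pick any replica that is not behind what the user has already seen. The sketch below is a minimal illustration under assumptions: `ReplicaInfo`, `MonotonicReadRouter`, and `pick` are hypothetical names, and each replica is assumed to report the timestamp of the newest primary write it has applied.

```typescript
interface ReplicaInfo {
  name: string;
  appliedMs: number; // timestamp of the newest primary write this replica has applied
}

class MonotonicReadRouter {
  // Freshness of the data each user was last served.
  private lastSeenMs = new Map<string, number>();

  // Returns the name of a replica at least as fresh as anything this user
  // has already read, or "primary" when no replica qualifies.
  pick(userId: string, replicas: ReplicaInfo[], primaryMs: number): string {
    const floor = this.lastSeenMs.get(userId) ?? 0;
    const candidate = replicas.find((r) => r.appliedMs >= floor);
    if (candidate === undefined) {
      // Record the primary's position so later reads cannot go backwards.
      this.lastSeenMs.set(userId, primaryMs);
      return "primary";
    }
    this.lastSeenMs.set(userId, candidate.appliedMs);
    return candidate.name;
  }
}

const mono = new MonotonicReadRouter();
const fresh: ReplicaInfo = { name: "replica-a", appliedMs: 2000 };
const stale: ReplicaInfo = { name: "replica-b", appliedMs: 1000 };

const firstRead = mono.pick("alice", [fresh, stale], 2500);  // "replica-a"
const secondRead = mono.pick("alice", [stale, fresh], 2500); // "replica-a": replica-b is older than what alice saw
const thirdRead = mono.pick("alice", [stale], 2500);         // "primary": no replica is fresh enough
```

A simpler alternative with a similar effect is to pin each user to one replica, for example by hashing the user ID, though that guarantee is lost when the pinned replica fails and the user is moved.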

Examples

Database Replication

class ReplicatedDatabase {
  async handleLag(): Promise<void> {
    const lag = await this.measureLag();

    if (lag > 1000) { // 1 second
      // Route critical reads to primary
      this.usePrimaryForCritical = true;
    } else {
      // Can use replicas
      this.usePrimaryForCritical = false;
    }
  }
}
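
The `usePrimaryForCritical` flag only has an effect if the read path consults it. A minimal sketch of that decision as a pure function follows; `routeRead`, `ReadTarget`, and the idea of tagging each read as critical or not are assumptions made for illustration.

```typescript
const LAG_THRESHOLD_MS = 1000; // the same 1 second threshold as in handleLag

type ReadTarget = "primary" | "replica";

// Hypothetical read-path helper: only critical reads are diverted,
// and only while measured lag is above the threshold.
function routeRead(currentLagMs: number, critical: boolean): ReadTarget {
  if (critical && currentLagMs > LAG_THRESHOLD_MS) {
    return "primary";
  }
  return "replica";
}

const criticalWhileLagging = routeRead(1500, true);  // "primary"
const casualWhileLagging = routeRead(1500, false);   // "replica"
const criticalWhileHealthy = routeRead(200, true);   // "replica"
```

Keeping non-critical reads on replicas even when lag is high protects the primary from absorbing the full read load at the moment it is already under pressure.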

Common Pitfalls

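  • Ignoring lag: Users see stale data and nothing compensates for it. Fix: Measure lag and handle it with read-your-writes or lag-aware routing
  • Not routing critical reads: Important reads go to a stale replica. Fix: Route critical reads to the primary
  • No lag monitoring: You don't know when lag is high. Fix: Measure lag continuously and alert on a threshold
  • Assuming zero lag: Asynchronously updated replicas are always at least slightly behind. Fix: Design for lag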
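
For the monitoring pitfall, an alert rule can be as small as a counter over lag samples. The sketch below is one possible rule, not a recommendation: `LagAlert`, the 1000 ms threshold, and the "three consecutive samples" policy are all assumptions chosen for illustration.

```typescript
// Hypothetical alert rule: fire only after `requiredBreaches` consecutive
// samples exceed the threshold, so a single spike does not raise an alert.
class LagAlert {
  private consecutive = 0;
  private thresholdMs: number;
  private requiredBreaches: number;

  constructor(thresholdMs: number, requiredBreaches: number) {
    this.thresholdMs = thresholdMs;
    this.requiredBreaches = requiredBreaches;
  }

  // Feed one lag sample; returns true while the alert is firing.
  observe(sampleMs: number): boolean {
    this.consecutive = sampleMs > this.thresholdMs ? this.consecutive + 1 : 0;
    return this.consecutive >= this.requiredBreaches;
  }
}

const lagAlert = new LagAlert(1000, 3);
const samples = [200, 1500, 300, 1200, 1800, 2500, 400];
const firing = samples.map((s) => lagAlert.observe(s));
// firing: [false, false, false, false, false, true, false]
```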

Interview Questions

Beginner

Q: What is replication lag and why does it matter?

A: Replication lag is the delay between when data is written to the primary database and when it's available on replicas.

Why it matters:

  • Stale reads: Reading from a replica may return old data
  • Inconsistency: Different replicas may return different data, and users may not see their own writes
  • User experience: It is confusing when data a user just saved does not appear immediately

Example: A user updates their profile, but after a refresh the old data appears because the read went to a replica that has not yet applied the update.


Intermediate

Q: How do you handle replication lag in a read replica setup?

A:

Strategies:

  1. Read your writes: Route a user's reads to the primary for a short period after that user's writes
  2. Monotonic reads: Ensure a user never sees older data than they have already seen
  3. Lag-aware routing: Route critical reads to the primary while lag is high
  4. Wait for replication: Wait for the replica to catch up to the write before reading from it

Implementation:

  • Track each user's last write time
  • If the user wrote recently, read from the primary
  • Otherwise, read from a replica
  • Monitor lag and adjust routing
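
Strategy 4 (wait for replication) is the only one above that needs a loop and a timeout. A minimal sketch follows, assuming the write returns some monotonically increasing position and the replica can report the position it has applied. `waitForReplica`, its parameters, and the polling approach are illustrative assumptions; some databases offer a built-in way to wait for a position, which would be preferable where available.

```typescript
// Hypothetical helper: polls until the replica has applied `targetPos`
// or roughly `timeoutMs` of waiting has passed. Resolves to true if the
// replica caught up, false on timeout.
async function waitForReplica(
  getAppliedPos: () => Promise<number>,
  targetPos: number,
  timeoutMs: number,
  pollMs: number = 50,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<boolean> {
  let waitedMs = 0; // counts time spent sleeping, so the timeout is approximate
  while (true) {
    if ((await getAppliedPos()) >= targetPos) return true;
    if (waitedMs >= timeoutMs) return false;
    await sleep(pollMs);
    waitedMs += pollMs;
  }
}

// Usage sketch: read from the replica if it catches up in time,
// otherwise fall back to the primary.
async function readAfterWrite<T>(
  writePos: number,
  getAppliedPos: () => Promise<number>,
  readReplica: () => Promise<T>,
  readPrimary: () => Promise<T>,
): Promise<T> {
  const caughtUp = await waitForReplica(getAppliedPos, writePos, 200);
  return caughtUp ? readReplica() : readPrimary();
}
```

The timeout matters: without it, a stalled replica would block the read indefinitely instead of degrading to a primary read.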

Senior

Q: Design a system that handles replication lag for a social media platform. Users post content that must be visible to followers. How do you ensure consistency while maintaining performance?

A:

Design:

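class SocialMediaReplication {
  async postContent(userId: string, content: Content): Promise<void> {
    // Write to primary
    await this.primary.write(content);

    // Track user's last write
    this.userWrites.set(userId, Date.now());

    // Async replication to replicas
    this.replicateAsync(content);
  }

  async getFeed(userId: string): Promise<Content[]> {
    // The original listing is cut off at this signature; the body is an assumed
    // completion that applies read-your-writes to the feed.
    // RECENT_WRITE_MS and readFeed are hypothetical names.
    const lastWrite = this.userWrites.get(userId) ?? 0;
    const wroteRecently = Date.now() - lastWrite < RECENT_WRITE_MS;
    const source = wroteRecently ? this.primary : this.replica;
    return source.readFeed(userId);
  }
}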

Optimizations:

  • Caching: Cache recent posts to reduce replica load
  • Fan-out: Write to follower timelines on post (not on read)
  • Eventual consistency: Accept that followers may see posts slightly later
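
The fan-out optimization can be sketched in a few lines. This is a minimal in-memory illustration, not a production design: `TimelineStore`, `Post`, and the method names are hypothetical, and a real system would persist timelines and handle authors with very large follower counts differently.

```typescript
interface Post {
  id: string;
  authorId: string;
  text: string;
}

// Hypothetical in-memory store for follower relationships and timelines.
class TimelineStore {
  private followers = new Map<string, Set<string>>();
  private timelines = new Map<string, Post[]>();

  follow(followerId: string, authorId: string): void {
    const set = this.followers.get(authorId) ?? new Set<string>();
    set.add(followerId);
    this.followers.set(authorId, set);
  }

  // Fan-out on write: copy the post into every follower's timeline at post time.
  publish(post: Post): void {
    const followerIds = this.followers.get(post.authorId) ?? new Set<string>();
    followerIds.forEach((followerId) => {
      const timeline = this.timelines.get(followerId) ?? [];
      timeline.unshift(post); // newest first
      this.timelines.set(followerId, timeline);
    });
  }

  // Reading is a single lookup; no merge across followed authors is needed.
  timelineFor(userId: string): Post[] {
    return this.timelines.get(userId) ?? [];
  }
}

const store = new TimelineStore();
store.follow("bob", "alice");
store.follow("carol", "alice");
store.publish({ id: "p1", authorId: "alice", text: "hello" });
const bobIds = store.timelineFor("bob").map((p) => p.id);     // ["p1"]
const aliceIds = store.timelineFor("alice").map((p) => p.id); // [] since alice does not follow herself
```

The cost moves from read time to write time: one post by an author with many followers becomes many timeline writes, each of which is itself subject to replication lag before followers see it.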

Related Topics

  • Partition Tolerance - Handling lag during partitions
  • Distributed Transactions - Transaction consistency with lag
  • Fault Tolerance - Handling replica failures
  • Heartbeats & Health Checks - Monitoring replica health
  • Clock Synchronization (NTP, Lamport) - Timestamping for lag measurement

Key Takeaways

  • Replication lag is inevitable: With asynchronous replication, network and processing delays always leave replicas somewhat behind
  • Monitor lag: Continuously measure and alert on high lag
  • Read-your-writes: Route a user's reads to the primary after their writes
  • Lag-aware routing: Adjust read routing based on measured lag
  • Trade-offs: Consistency vs performance (primary reads are fresh but add load to the primary; replica reads scale but may be stale)
  • Design for lag: Assume replicas are always slightly behind


About the author

InterviewCrafted helps you master system design with patience. We believe in curiosity-led engineering, reflective writing, and designing systems that make future changes feel calm.