Heartbeats & Health Checks
Learn how to monitor node health and detect failures in distributed systems.
Heartbeats and health checks are mechanisms to detect node failures and monitor system health in distributed systems.
Heartbeats
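A heartbeat is a periodic message that a node sends to a peer or monitor to indicate that it is alive. If no heartbeat arrives within a timeout, the receiver treats the node as failed.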
Implementation
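    class HeartbeatSender {
      // sendHeartbeat() and this.interval are assumed to be defined elsewhere in the class.
      async start(): Promise<void> {
        // Send one heartbeat every `interval` milliseconds.
        setInterval(async () => {
          await this.sendHeartbeat();
        }, this.interval);
      }
    }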
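The sender covers only one side of the protocol. A minimal sketch of the receiving side might look like the following, assuming a hypothetical `HeartbeatMonitor` class (not part of the original example) that records the last heartbeat per node and flags nodes that have been silent longer than a timeout. Timestamps are passed in explicitly so the logic can be exercised without real timers.

```typescript
// Receiver-side sketch: remember when each node was last heard from and
// report nodes that have been silent for longer than the timeout.
class HeartbeatMonitor {
  private lastSeen: Record<string, number> = {};
  private timeoutMs: number;

  constructor(timeoutMs: number) {
    this.timeoutMs = timeoutMs;
  }

  // Call when a heartbeat arrives. nowMs is a parameter so tests can inject time.
  recordHeartbeat(nodeId: string, nowMs: number = Date.now()): void {
    this.lastSeen[nodeId] = nowMs;
  }

  // Nodes whose last heartbeat is older than timeoutMs are suspected failed.
  suspectedNodes(nowMs: number = Date.now()): string[] {
    return Object.keys(this.lastSeen).filter(
      (nodeId) => nowMs - this.lastSeen[nodeId] > this.timeoutMs
    );
  }
}

// Example: heartbeats every 1 s, timeout of 3 s.
const monitor = new HeartbeatMonitor(3000);
monitor.recordHeartbeat('node-a', 0);
monitor.recordHeartbeat('node-b', 2500);
// At t = 4000 ms, node-a has been silent for 4 s and node-b for 1.5 s.
const suspects = monitor.suspectedNodes(4000);
```

A suspected node is not necessarily dead; it may only be slow or partitioned, which is why the timeout is usually several heartbeat intervals long.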
Health Checks
Health checks are endpoints that report a node's health status, typically polled by a load balancer or an orchestrator.
Liveness vs Readiness
Liveness: Is the process running?
Readiness: Can the process handle requests?
    class HealthCheckService {
      async livenessCheck(): Promise<HealthStatus> {
        // Check if process is alive
        return {
          status: 'healthy',
          timestamp: Date.now()
        };
      }

      async readinessCheck(): Promise<HealthStatus> {
        // Check if ready to serve requests
        const checks = {
          database: await this.checkDatabase(),
          cache: await this.checkCache(),
          // ... (excerpt ends here)
        };
        // ...
      }
    }
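One way the readiness logic could be filled in is sketched below. It is an illustration under stated assumptions, not the original implementation: `CheckFn`, `ReadinessReport`, `withTimeout`, and this standalone `readinessCheck` function are hypothetical names. The sketch runs all dependency checks in parallel and treats a check that throws or exceeds its timeout as failed, so a hung dependency cannot stall the health endpoint.

```typescript
type CheckFn = () => Promise<boolean>;

interface ReadinessReport {
  ready: boolean;
  checks: Record<string, boolean>;
  timestamp: number;
}

// Resolves to false if the check throws or does not finish within timeoutMs.
function withTimeout(check: CheckFn, timeoutMs: number): Promise<boolean> {
  return new Promise<boolean>((resolve) => {
    const timer = setTimeout(() => resolve(false), timeoutMs);
    Promise.resolve()
      .then(check)
      .then((ok) => resolve(ok), () => resolve(false))
      .then(() => clearTimeout(timer));
  });
}

// Runs all dependency checks in parallel; the node is ready only if all pass.
async function readinessCheck(
  deps: Record<string, CheckFn>,
  timeoutMs: number
): Promise<ReadinessReport> {
  const names = Object.keys(deps);
  const results = await Promise.all(
    names.map((depName) => withTimeout(deps[depName], timeoutMs))
  );
  const checks: Record<string, boolean> = {};
  names.forEach((depName, i) => {
    checks[depName] = results[i];
  });
  return {
    ready: results.every((ok) => ok),
    checks,
    timestamp: Date.now()
  };
}
```

Returning the per-dependency results alongside the overall flag makes it easier to see which dependency caused a node to be marked not ready.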
Examples
Kubernetes Health Checks
    // Liveness probe
    app.get('/health/live', (req, res) => {
      res.json({ status: 'alive' });
    });

    // Readiness probe
    app.get('/health/ready', async (req, res) => {
      const ready = await checkReadiness();
      // 'not ready' is an assumed value for the failing case.
      res.status(ready ? 200 : 503).json({ status: ready ? 'ready' : 'not ready' });
    });
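Kubernetes does not act on a single probe result. Each probe has `failureThreshold` and `successThreshold` settings that require several consecutive results before the state changes. The sketch below illustrates that idea with a hypothetical `ProbeTracker` class. It is a simplification for explanation, not Kubernetes' actual implementation, and it assumes the tracker starts in the passing state.

```typescript
type ProbeState = 'passing' | 'failing';

// Tracks consecutive probe results and only changes state after enough
// results in a row disagree with the current state.
class ProbeTracker {
  private state: ProbeState = 'passing'; // assumed initial state
  private streak = 0; // consecutive results that disagree with the current state
  private failureThreshold: number;
  private successThreshold: number;

  constructor(failureThreshold: number, successThreshold: number) {
    this.failureThreshold = failureThreshold;
    this.successThreshold = successThreshold;
  }

  // Feed one probe result; returns the (possibly updated) state.
  record(success: boolean): ProbeState {
    const agrees = (this.state === 'passing') === success;
    if (agrees) {
      this.streak = 0;
      return this.state;
    }
    this.streak += 1;
    const needed =
      this.state === 'passing' ? this.failureThreshold : this.successThreshold;
    if (this.streak >= needed) {
      this.state = success ? 'passing' : 'failing';
      this.streak = 0;
    }
    return this.state;
  }
}
```

With a failure threshold of 3, one slow response does not restart a container or pull it out of rotation, which is the same trade-off as choosing a heartbeat timeout of several intervals.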
Leader Election with Heartbeats
    class LeaderWithHeartbeat {
      async becomeLeader(): Promise<void> {
        this.isLeader = true;

        // Send periodic heartbeats
        setInterval(async () => {
          if (this.isLeader) {
            await this.sendHeartbeat();
          }
        }, this.heartbeatInterval);
      }

      // async checkLeader(): Promise<void> ... (excerpt ends here)
    }
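The follower side of this pattern can be sketched as follows. `LeaderWatcher` is a hypothetical class, not the original `checkLeader` method, and the term number is an added assumption borrowed from Raft-style protocols so that heartbeats from a stale leader are ignored. A follower that sees no valid heartbeat within the timeout would start a new election.

```typescript
// Follower-side view of the leader: if no heartbeat arrives within
// timeoutMs, the follower should consider the leader failed.
class LeaderWatcher {
  private leaderId: string | null = null;
  private term = 0;
  private lastHeartbeatMs = 0;
  private timeoutMs: number;

  constructor(timeoutMs: number) {
    this.timeoutMs = timeoutMs;
  }

  // Accept a heartbeat unless it comes from an older term (a stale leader).
  onHeartbeat(leaderId: string, term: number, nowMs: number): boolean {
    if (term < this.term) return false;
    this.leaderId = leaderId;
    this.term = term;
    this.lastHeartbeatMs = nowMs;
    return true;
  }

  // True when there is no known leader or the leader has been silent too long.
  leaderTimedOut(nowMs: number): boolean {
    return (
      this.leaderId === null || nowMs - this.lastHeartbeatMs > this.timeoutMs
    );
  }

  currentLeader(): string | null {
    return this.leaderId;
  }
}
```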
Common Pitfalls
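- Too frequent heartbeats: Wastes network bandwidth and CPU. Fix: Choose the interval by balancing overhead against the detection time you need
- Not handling network delays: A delayed or dropped heartbeat causes false positives. Fix: Use a timeout greater than 3x the heartbeat interval
- Single health check: Checking one component may miss failures elsewhere. Fix: Check multiple components (database, cache, other dependencies)
- Not failing fast: Unhealthy nodes continue serving requests. Fix: Remove them from the load balancer as soon as checks fail
- No graceful shutdown: Health checks fail during shutdown while traffic is still routed to the node. Fix: Implement graceful shutdown (fail readiness first, finish in-flight requests, then exit)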
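The graceful-shutdown fix in the last item can be sketched as a small state tracker. `ShutdownGate` is a hypothetical helper, and how it is wired into a server (signal handlers, middleware) is left out. Once shutdown begins, readiness reports false so the load balancer stops routing new traffic, and the process exits only after in-flight requests have finished.

```typescript
// Tracks in-flight requests and flips readiness to false once shutdown starts.
class ShutdownGate {
  private shuttingDown = false;
  private inFlight = 0;

  // The readiness endpoint should return 503 when this is false.
  isReady(): boolean {
    return !this.shuttingDown;
  }

  // Call at the start of each request. Requests that arrive while draining are
  // still counted, because the load balancer may take a moment to react.
  begin(): void {
    this.inFlight += 1;
  }

  // Call when a request finishes.
  end(): void {
    if (this.inFlight > 0) this.inFlight -= 1;
  }

  // Call when a termination signal is received.
  beginShutdown(): void {
    this.shuttingDown = true;
  }

  // Safe to exit once shutdown has begun and no requests are in flight.
  canExit(): boolean {
    return this.shuttingDown && this.inFlight === 0;
  }
}
```

In practice a hard deadline is usually added as well, so that a stuck request cannot block the exit forever.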
Interview Questions
Beginner
Q: What are heartbeats and health checks used for?
A:
Heartbeats: Periodic messages to indicate a node is alive. Used for failure detection.
Health checks: Endpoints that report node health. Used to determine if node can handle requests.
Purpose:
- Failure detection: Know when nodes fail
- Load balancing: Route traffic only to healthy nodes
- Auto-recovery: Restart or replace failed nodes
- Monitoring: Track system health
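To make the load-balancing purpose concrete, here is a minimal sketch of round-robin routing that skips unhealthy nodes. `HealthAwareBalancer` is a hypothetical class written for illustration; a real load balancer would update the health flags from heartbeat or health-check results.

```typescript
// Round-robin selection that only returns nodes currently marked healthy.
class HealthAwareBalancer {
  private nodes: string[];
  private healthy: Record<string, boolean> = {};
  private cursor = 0;

  constructor(nodes: string[]) {
    this.nodes = nodes.slice();
    for (const nodeId of nodes) this.healthy[nodeId] = true;
  }

  // Update a node's health, for example from a health-check result.
  setHealth(nodeId: string, isHealthy: boolean): void {
    if (this.healthy[nodeId] !== undefined) this.healthy[nodeId] = isHealthy;
  }

  // Returns the next healthy node, or null if no node is healthy.
  next(): string | null {
    for (let i = 0; i < this.nodes.length; i++) {
      const candidate = this.nodes[(this.cursor + i) % this.nodes.length];
      if (this.healthy[candidate]) {
        this.cursor = (this.cursor + i + 1) % this.nodes.length;
        return candidate;
      }
    }
    return null;
  }
}

const balancer = new HealthAwareBalancer(['a', 'b', 'c']);
balancer.setHealth('b', false); // b failed its health check
const firstPick = balancer.next();
const secondPick = balancer.next(); // b is skipped
```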
Intermediate
Q: How do you implement health checks for a microservice? What's the difference between liveness and readiness?
A:
Liveness: Is the process running?
- Simple check: Process is alive
- Use: Kubernetes restarts the container if the check fails
Readiness: Can the process handle requests?
- Comprehensive check: Database, cache, dependencies all working
- Use: Load balancer routes traffic only if ready
Implementation:
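    // Liveness: Simple
    app.get('/health/live', (req, res) => {
      res.json({ status: 'alive' });
    });

    // Readiness: Check dependencies
    app.get('/health/ready', async (req, res) => {
      const db = await checkDatabase();
      const cache = await checkCache();

      if (db && cache) {
        // Assumed completion, following the 200/503 convention for readiness probes.
        res.status(200).json({ status: 'ready' });
      } else {
        res.status(503).json({ status: 'not ready' });
      }
    });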
Senior
Q: Design a failure detection system for a distributed system with 1000+ nodes. How do you detect failures quickly while minimizing network overhead?
A:
Design:
    // ScalableFailureDetection: hierarchical heartbeats.
    // Nodes report to a cluster leader; cluster leaders report to a regional coordinator.
    class HierarchicalHeartbeat {
      private clusters: Cluster[] = [];

      async heartbeat(): Promise<void> {
        // Nodes heartbeat to cluster leader
        await this.clusterLeader.heartbeat(this.myId);

        // Cluster leaders heartbeat to regional coordinator
        if (this.isClusterLeader) {
          await this.regionalCoordinator.heartbeat(this.clusterId);
        }
        // ... (excerpt ends here)
      }
    }
Optimizations:
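- Hierarchical: Reduce network messages by aggregating at each level (nodes → cluster → region)
- Gossip: Each node contacts a few random peers per round, so per-node load stays constant and updates spread in O(log n) rounds, instead of every node heartbeating to every other node
- Adaptive: Adjust heartbeat frequency based on the observed failure rate
- Sampling: Check a subset of nodes each round and rotate the subset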
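The gossip optimization can be sketched with two small functions. This is an illustrative sketch, not the original design: `HeartbeatTable`, `mergeTables`, and `pickPeers` are hypothetical names. Each node keeps a heartbeat counter per known member, increments its own counter every round, and sends its table to a few randomly chosen peers. Receivers keep the highest counter they have seen for each member, and a member whose counter stops increasing for long enough is suspected failed.

```typescript
// Maps member id to the highest heartbeat counter seen for that member.
type HeartbeatTable = Record<string, number>;

// Merge a received table into the local one, keeping the highest counter per member.
function mergeTables(local: HeartbeatTable, remote: HeartbeatTable): HeartbeatTable {
  const merged: HeartbeatTable = { ...local };
  for (const id of Object.keys(remote)) {
    const current = merged[id];
    if (current === undefined || remote[id] > current) {
      merged[id] = remote[id];
    }
  }
  return merged;
}

// Pick up to `fanout` distinct peers to gossip to in this round.
// `random` is injectable so the selection can be made deterministic in tests.
function pickPeers(
  peers: string[],
  fanout: number,
  random: () => number = Math.random
): string[] {
  const pool = peers.slice();
  const chosen: string[] = [];
  while (chosen.length < fanout && pool.length > 0) {
    const index = Math.floor(random() * pool.length);
    chosen.push(pool.splice(index, 1)[0]);
  }
  return chosen;
}

const mergedTable = mergeTables({ a: 5, b: 2 }, { b: 7, c: 1 });
```

Because each node sends only `fanout` messages per round, total traffic grows linearly with cluster size rather than quadratically as with all-to-all heartbeats.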
Related Topics
- Fault Tolerance: Using health checks for fault tolerance
- Leader Election: Detecting leader failures with heartbeats
- Gossip Protocol: Alternative to heartbeats for membership
- Distributed Logging: Logging health check events
- Replication Lag: Health checks for replica status
Key Takeaways
- Heartbeats signal that a node is alive and are used for failure detection
- Health checks report node status (liveness vs readiness)
- Liveness: Is the process running? (restart if it fails)
- Readiness: Can it handle requests? (route traffic only if ready)
- Failure detection: Use a timeout of more than 3x the heartbeat interval
- Minimize overhead: Use hierarchical or gossip-based approaches
- Quick detection: Balance heartbeat frequency against network overhead