Heartbeats & Health Checks
Learn how to monitor node health and detect failures in distributed systems.
Heartbeats and health checks are mechanisms to detect node failures and monitor system health in distributed systems.
Heartbeats
Periodic messages sent to indicate a node is alive.
Implementation
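class HeartbeatSender {
  // Assumes coordinator, myId, interval, and getStatus() are defined elsewhere
  async start(): Promise<void> {
    setInterval(async () => {
      try {
        await this.sendHeartbeat();
      } catch (err) {
        // A failed send must not become an unhandled rejection; retry on the next tick
        console.error('Heartbeat send failed:', err);
      }
    }, this.interval);
  }

  async sendHeartbeat(): Promise<void> {
    await this.coordinator.heartbeat({
      nodeId: this.myId,
      timestamp: Date.now(),
      status: this.getStatus()
    });
  }
}

class HeartbeatReceiver {
  private lastHeartbeat: Map<string, number> = new Map();

  async receiveHeartbeat(heartbeat: Heartbeat): Promise<void> {
    // Record arrival time on the receiver's clock to avoid cross-node clock skew
    this.lastHeartbeat.set(heartbeat.nodeId, Date.now());
  }

  async checkFailures(): Promise<void> {
    const now = Date.now();
    const timeout = this.heartbeatInterval * 3; // 3x interval
    for (const [nodeId, lastSeen] of this.lastHeartbeat) {
      if (now - lastSeen > timeout) {
        await this.markFailed(nodeId);
        // Stop tracking the node so it is not reported again on every check
        this.lastHeartbeat.delete(nodeId);
      }
    }
  }
}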
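To make the timeout rule concrete, here is a minimal, self-contained sketch of the receiver side. The names `TimeoutFailureDetector`, `recordHeartbeat`, and `failedNodes`, and the injectable `now` clock, are assumptions made for illustration and are not part of the classes above. The clock is injected only so the behavior can be exercised without waiting in real time.

```typescript
// Hypothetical sketch: timeout-based failure detection with an injectable clock.
type Clock = () => number;

class TimeoutFailureDetector {
  private lastSeen = new Map<string, number>();

  constructor(
    private readonly heartbeatIntervalMs: number,
    private readonly now: Clock = () => Date.now(),
    private readonly missedBeforeFailure = 3 // 3x interval, as in the text
  ) {}

  // Record the arrival time using the receiver's clock.
  recordHeartbeat(nodeId: string): void {
    this.lastSeen.set(nodeId, this.now());
  }

  // Return the nodes whose last heartbeat is older than the timeout.
  failedNodes(): string[] {
    const timeoutMs = this.heartbeatIntervalMs * this.missedBeforeFailure;
    const current = this.now();
    const failed: string[] = [];
    for (const [nodeId, seenAt] of this.lastSeen) {
      if (current - seenAt > timeoutMs) failed.push(nodeId);
    }
    return failed;
  }
}

// Usage with a fake clock (all times in milliseconds)
let fakeTime = 0;
const detector = new TimeoutFailureDetector(1000, () => fakeTime);
detector.recordHeartbeat("node-a");
detector.recordHeartbeat("node-b");
fakeTime = 2500;
detector.recordHeartbeat("node-b"); // node-b keeps reporting, node-a goes silent
fakeTime = 3500;
console.log(detector.failedNodes()); // ["node-a"]
```

With a 1000 ms interval the timeout is 3000 ms, so node-a (last seen at 0) is reported at 3500 while node-b (last seen at 2500) is not.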
Health Checks
Endpoints that report node health status.
Liveness vs Readiness
Liveness: Is the process running?
Readiness: Can the process handle requests?
class HealthCheckService {
  async livenessCheck(): Promise<HealthStatus> {
    // Check if process is alive
    return {
      status: 'healthy',
      timestamp: Date.now()
    };
  }

  async readinessCheck(): Promise<HealthStatus> {
    // Check if ready to serve requests
    const checks = {
      database: await this.checkDatabase(),
      cache: await this.checkCache(),
      dependencies: await this.checkDependencies()
    };
    const healthy = Object.values(checks).every(c => c === 'healthy');
    return {
      status: healthy ? 'ready' : 'not-ready',
      checks,
      timestamp: Date.now()
    };
  }
}
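A readiness check is only useful if it answers promptly, so a dependency check that hangs should count as unhealthy rather than stall the endpoint. The sketch below shows one way to do that, under the assumption that each check is an async function returning 'healthy' or 'unhealthy'. The names `withTimeout`, `checkReadiness`, and `ReadinessReport` are hypothetical and not part of the service above.

```typescript
type CheckStatus = "healthy" | "unhealthy";
type Check = () => Promise<CheckStatus>;

interface ReadinessReport {
  status: "ready" | "not-ready";
  checks: Record<string, CheckStatus>;
}

// Resolve to "unhealthy" if the check rejects or does not settle within `ms`.
function withTimeout(check: Promise<CheckStatus>, ms: number): Promise<CheckStatus> {
  return new Promise<CheckStatus>(resolve => {
    const timer = setTimeout(() => resolve("unhealthy"), ms);
    check.then(
      status => { clearTimeout(timer); resolve(status); },
      () => { clearTimeout(timer); resolve("unhealthy"); }
    );
  });
}

// Run all checks in parallel so total latency is bounded by roughly `timeoutMs`.
async function checkReadiness(
  checks: Record<string, Check>,
  timeoutMs: number
): Promise<ReadinessReport> {
  const entries = Object.entries(checks);
  const statuses = await Promise.all(
    entries.map(([, check]) => withTimeout(Promise.resolve().then(check), timeoutMs))
  );
  const result: Record<string, CheckStatus> = {};
  entries.forEach(([name], i) => { result[name] = statuses[i]; });
  const ready = statuses.every(s => s === "healthy");
  return { status: ready ? "ready" : "not-ready", checks: result };
}

// Usage: the cache check never settles, so it is reported as unhealthy after 50 ms
checkReadiness(
  {
    database: async () => "healthy",
    cache: () => new Promise<CheckStatus>(() => {}),
  },
  50
).then(report => console.log(report.status, report.checks));
```

Running the checks in parallel also keeps the endpoint's response time close to the slowest single check instead of the sum of all checks.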
Examples
Kubernetes Health Checks
// Liveness probe
app.get('/health/live', (req, res) => {
  res.json({ status: 'alive' });
});

// Readiness probe
app.get('/health/ready', async (req, res) => {
  const ready = await checkReadiness();
  res.status(ready ? 200 : 503).json({ status: ready ? 'ready' : 'not-ready' });
});
Leader Election with Heartbeats
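class LeaderWithHeartbeat {
  private heartbeatTimer?: ReturnType<typeof setInterval>;

  async becomeLeader(): Promise<void> {
    this.isLeader = true;
    // Send periodic heartbeats; keep the timer handle so it can be cleared later
    this.heartbeatTimer = setInterval(async () => {
      if (this.isLeader) {
        try {
          await this.sendHeartbeat();
        } catch (err) {
          console.error('Leader heartbeat failed:', err);
        }
      }
    }, this.heartbeatInterval);
  }

  stepDown(): void {
    // Stop sending heartbeats once this node is no longer the leader
    this.isLeader = false;
    if (this.heartbeatTimer) clearInterval(this.heartbeatTimer);
  }

  async checkLeader(): Promise<void> {
    const lastHeartbeat = await this.getLastLeaderHeartbeat();
    // 3x interval, so one delayed or dropped heartbeat does not trigger an election
    const timeout = this.heartbeatInterval * 3;
    if (Date.now() - lastHeartbeat > timeout) {
      // Leader failed, start election
      await this.startElection();
    }
  }
}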
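If every follower uses the same timeout, they all notice a dead leader at the same moment and start competing elections. A common remedy, used for example by Raft, is to add per-node random jitter to the timeout. The sketch below illustrates the idea; `LeaderMonitor`, its method names, and the injectable `now` and `random` functions are assumptions for illustration, and the base timeout of three intervals follows the guidance in Common Pitfalls.

```typescript
// Hypothetical sketch: follower-side leader monitoring with a jittered timeout.
class LeaderMonitor {
  private lastLeaderHeartbeat: number;
  private readonly electionTimeoutMs: number;

  constructor(
    heartbeatIntervalMs: number,
    private readonly now: () => number = () => Date.now(),
    random: () => number = Math.random
  ) {
    // Base timeout of 3 intervals plus up to one extra interval of jitter,
    // so followers are unlikely to time out at the same instant.
    const base = heartbeatIntervalMs * 3;
    this.electionTimeoutMs = base + Math.floor(random() * heartbeatIntervalMs);
    this.lastLeaderHeartbeat = this.now();
  }

  // Call whenever a heartbeat from the current leader arrives.
  onLeaderHeartbeat(): void {
    this.lastLeaderHeartbeat = this.now();
  }

  // True once the leader has been silent for longer than this node's timeout.
  shouldStartElection(): boolean {
    return this.now() - this.lastLeaderHeartbeat > this.electionTimeoutMs;
  }
}

// Usage with a fake clock and fixed "random" values (timeouts: 3000 ms and 3500 ms)
let clockMs = 0;
const followerA = new LeaderMonitor(1000, () => clockMs, () => 0);
const followerB = new LeaderMonitor(1000, () => clockMs, () => 0.5);
clockMs = 3200;
console.log(followerA.shouldStartElection()); // true
console.log(followerB.shouldStartElection()); // false
```

Here follower A starts an election 300 ms before follower B would, which gives A a chance to win before B's timeout expires.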
Common Pitfalls
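- Too-frequent heartbeats: Adds network and CPU overhead. Fix: Balance heartbeat frequency against the detection time you need
- Not handling network delays: A single late or dropped heartbeat causes a false positive. Fix: Use a timeout of several heartbeat intervals (for example, 3x)
- Single health check: May miss issues. Fix: Check multiple components (database, cache, dependencies)
- Not failing fast: Unhealthy nodes continue serving. Fix: Remove them from the load balancer as soon as readiness fails
- No graceful shutdown: Traffic is still routed to a node that is shutting down. Fix: Fail the readiness check first, drain in-flight requests, then exit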
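The last two pitfalls can be handled together: on shutdown, a node first reports not-ready so the load balancer stops sending traffic, then exits only after in-flight requests finish. The following is a minimal sketch of that bookkeeping. `ShutdownCoordinator` and its methods are hypothetical names, and wiring `beginShutdown()` to a signal such as SIGTERM is left to the surrounding application.

```typescript
// Hypothetical sketch: readiness-aware graceful shutdown bookkeeping.
class ShutdownCoordinator {
  private draining = false;
  private inFlight = 0;

  // Call when a request starts and when it finishes.
  requestStarted(): void {
    this.inFlight += 1;
  }

  requestFinished(): void {
    this.inFlight = Math.max(0, this.inFlight - 1);
  }

  // Call when shutdown is requested, for example from a SIGTERM handler.
  beginShutdown(): void {
    this.draining = true;
  }

  // Status code for the readiness endpoint: 503 tells the load balancer to stop routing here.
  readinessStatusCode(): number {
    return this.draining ? 503 : 200;
  }

  // The process may exit once it is draining and no requests remain.
  canExit(): boolean {
    return this.draining && this.inFlight === 0;
  }
}

// Usage
const shutdown = new ShutdownCoordinator();
shutdown.requestStarted();
shutdown.beginShutdown();
console.log(shutdown.readinessStatusCode()); // 503
console.log(shutdown.canExit()); // false, one request is still in flight
shutdown.requestFinished();
console.log(shutdown.canExit()); // true
```

In practice a node would also wait a short grace period after failing readiness, since the load balancer only notices the change on its next probe.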
Interview Questions
Beginner
Q: What are heartbeats and health checks used for?
A:
Heartbeats: Periodic messages to indicate a node is alive. Used for failure detection.
Health checks: Endpoints that report node health. Used to determine if node can handle requests.
Purpose:
- Failure detection: Know when nodes fail
- Load balancing: Route traffic only to healthy nodes
- Auto-recovery: Restart or replace failed nodes
- Monitoring: Track system health
Intermediate
Q: How do you implement health checks for a microservice? What's the difference between liveness and readiness?
A:
Liveness: Is the process running?
- Simple check: Process is alive
- Use: Kubernetes restarts the container if the check fails
Readiness: Can the process handle requests?
- Comprehensive check: Database, cache, dependencies all working
- Use: Load balancer routes traffic only if ready
Implementation:
// Liveness: Simple
app.get('/health/live', (req, res) => {
  res.json({ status: 'alive' });
});

// Readiness: Check dependencies
// (checkDatabase and checkCache are assumed to resolve to booleans here)
app.get('/health/ready', async (req, res) => {
  const db = await checkDatabase();
  const cache = await checkCache();
  if (db && cache) {
    res.json({ status: 'ready' });
  } else {
    res.status(503).json({ status: 'not-ready' });
  }
});
Senior
Q: Design a failure detection system for a distributed system with 1000+ nodes. How do you detect failures quickly while minimizing network overhead?
A:
Design:
// Three complementary building blocks for scalable failure detection

// Hierarchical heartbeats
class HierarchicalHeartbeat {
  private clusters: Cluster[] = [];

  async heartbeat(): Promise<void> {
    // Nodes heartbeat to cluster leader
    await this.clusterLeader.heartbeat(this.myId);
    // Cluster leaders heartbeat to regional coordinator
    if (this.isClusterLeader) {
      await this.regionalCoordinator.heartbeat(this.clusterId);
    }
  }
}

// Gossip-based failure detection
class GossipFailureDetection {
  async gossip(): Promise<void> {
    const peer = this.selectRandomPeer();
    // Exchange membership and last-seen times
    const myView = this.getMembershipView();
    const peerView = await peer.getMembershipView();
    // Merge views
    this.mergeViews(myView, peerView);
    // Detect failures
    this.detectFailures();
  }

  detectFailures(): void {
    const now = Date.now();
    const timeout = this.failureTimeout;
    for (const [nodeId, lastSeen] of this.membership) {
      if (now - lastSeen > timeout) {
        // Increase suspicion
        const suspicion = (this.suspicionLevels.get(nodeId) || 0) + 1;
        this.suspicionLevels.set(nodeId, suspicion);
        if (suspicion > this.threshold) {
          this.markFailed(nodeId);
        }
      } else {
        // Node was seen recently, so clear any accumulated suspicion
        this.suspicionLevels.delete(nodeId);
      }
    }
  }
}

// Adaptive heartbeats
class AdaptiveHeartbeat {
  async adjustInterval(): Promise<void> {
    const failureRate = this.getRecentFailureRate();
    if (failureRate > 0.1) {
      // High failure rate, check more frequently
      this.interval = 1000; // 1 second
    } else {
      // Low failure rate, can check less frequently
      this.interval = 5000; // 5 seconds
    }
  }
}
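The design above leaves `mergeViews` unspecified. One common approach in gossip-style failure detectors is to have each node increment its own heartbeat counter and, on merge, keep the highest counter seen for every node. The sketch below assumes that counter-based representation rather than wall-clock timestamps, which differ between machines. The `MembershipView` type and this standalone `mergeViews` signature are illustrative assumptions.

```typescript
// Hypothetical sketch: merging two gossip membership views.
// Each view maps nodeId -> highest heartbeat counter observed for that node.
type MembershipView = Map<string, number>;

function mergeViews(local: MembershipView, remote: MembershipView): MembershipView {
  const merged: MembershipView = new Map(local);
  for (const [nodeId, counter] of remote) {
    const known = merged.get(nodeId);
    // Keep the freshest information: a higher counter means a more recent heartbeat.
    if (known === undefined || counter > known) {
      merged.set(nodeId, counter);
    }
  }
  return merged;
}

// Usage
const localView: MembershipView = new Map([["a", 5], ["b", 2]]);
const remoteView: MembershipView = new Map([["b", 4], ["c", 1]]);
const mergedView = mergeViews(localView, remoteView);
console.log([...mergedView.entries()]); // [["a", 5], ["b", 4], ["c", 1]]
```

A node would then record the local time at which each counter last increased and suspect nodes whose counter has not advanced within the failure timeout.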
Optimizations:
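- Hierarchical: Reduce the heartbeats any single coordinator handles (nodes → cluster → region)
- Gossip: Each node contacts a constant number of peers per round, and updates reach all nodes in roughly O(log n) rounds, avoiding O(n²) all-to-all heartbeats
- Adaptive: Adjust frequency based on failure rate
- Sampling: Check a rotating subset of nodes each round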
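The trade-offs in this list can be put into rough numbers. The functions below are back-of-the-envelope estimates, not measurements. Their names are hypothetical, the hierarchy is assumed to have two levels with equal-sized clusters, and the gossip estimate assumes push-style gossip in which each informed node forwards to `fanout` peers per round.

```typescript
// Heartbeat messages per interval when every node monitors every other node.
function allToAllMessages(nodes: number): number {
  return nodes * (nodes - 1);
}

// Messages per gossip round when each node contacts `fanout` random peers.
function gossipMessagesPerRound(nodes: number, fanout: number): number {
  return nodes * fanout;
}

// Lower bound on push-gossip rounds for one update to reach all nodes:
// the informed set can grow by at most a factor of (fanout + 1) per round.
function minGossipRounds(nodes: number, fanout: number): number {
  return Math.ceil(Math.log(nodes) / Math.log(fanout + 1));
}

// Heartbeats received per interval at each level of a two-level hierarchy.
function hierarchicalFanIn(nodes: number, clusterSize: number) {
  const clusters = Math.ceil(nodes / clusterSize);
  return {
    clusters,
    heartbeatsPerLeader: clusterSize - 1, // from the other members of the cluster
    heartbeatsPerCoordinator: clusters,   // one per cluster leader
  };
}

// Usage for a 1000-node system
console.log(allToAllMessages(1000));          // 999000
console.log(gossipMessagesPerRound(1000, 1)); // 1000
console.log(minGossipRounds(1000, 1));        // 10
console.log(hierarchicalFanIn(1000, 50));
// { clusters: 20, heartbeatsPerLeader: 49, heartbeatsPerCoordinator: 20 }
```

Under these assumptions no single node in the hierarchy receives more than 49 heartbeats per interval, compared with 1000 for a single central coordinator.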
Key Takeaways
- Heartbeats indicate nodes are alive, used for failure detection
- Health checks report node status (liveness vs readiness)
- Liveness: Process running (restart if fails)
- Readiness: Can handle requests (route traffic if ready)
- Failure detection: Use timeouts (3x heartbeat interval)
- Minimize overhead: Use hierarchical or gossip-based approaches
- Quick detection: Balance frequency with network overhead
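The balance between detection speed and overhead in the last takeaway can be estimated with simple arithmetic. The sketch below assumes a single coordinator that receives every heartbeat and runs its failure check periodically, and it ignores network delay. The function name and parameters are hypothetical.

```typescript
interface HeartbeatTradeoff {
  messagesPerSecond: number;    // load on the coordinator
  worstCaseDetectionMs: number; // time from failure to detection, at most
}

function heartbeatTradeoff(
  nodes: number,
  intervalMs: number,
  timeoutMultiplier: number,
  checkPeriodMs: number
): HeartbeatTradeoff {
  return {
    messagesPerSecond: nodes * (1000 / intervalMs),
    // Worst case: the node dies right after a heartbeat, the timeout then has to
    // elapse in full, and the next periodic check runs one full period later.
    worstCaseDetectionMs: intervalMs * timeoutMultiplier + checkPeriodMs,
  };
}

// Usage: 1000 nodes, 3x timeout, failure check every second
console.log(heartbeatTradeoff(1000, 1000, 3, 1000));
// { messagesPerSecond: 1000, worstCaseDetectionMs: 4000 }
console.log(heartbeatTradeoff(1000, 5000, 3, 1000));
// { messagesPerSecond: 200, worstCaseDetectionMs: 16000 }
```

Moving from a 1-second to a 5-second interval cuts coordinator load by a factor of five but raises worst-case detection time from 4 to 16 seconds.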