Heartbeats & Health Checks

Learn how to monitor node health and detect failures in distributed systems.

Intermediate · 8 min read

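Heartbeats and health checks are complementary ways to detect node failures in a distributed system. A heartbeat is pushed by the node to show that it is alive, while a health check is an endpoint that a monitor, load balancer, or orchestrator polls to learn whether the node is alive and ready for traffic.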


Heartbeats

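A heartbeat is a periodic message that a node sends to indicate it is alive. When heartbeats stop arriving for longer than a timeout, the receiver treats the node as failed.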

Implementation

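abstract class HeartbeatSender {
  constructor(private readonly interval: number) {} // milliseconds between heartbeats

  protected abstract sendHeartbeat(): Promise<void>; // transport is left to the subclass

  async start(): Promise<void> {
    setInterval(() => {
      // Log failures so a failed send does not become an unhandled rejection
      this.sendHeartbeat().catch((err) => console.error('heartbeat failed', err));
    }, this.interval);
  }
}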
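Sending heartbeats is only half of failure detection: something has to notice when they stop. The following is a minimal sketch of the receiving side, assuming heartbeats carry a node id. The names `HeartbeatMonitor`, `recordHeartbeat`, and `suspectedNodes` are hypothetical, and the clock is injected so the timeout logic can be exercised without waiting.

```typescript
// Hypothetical receiver-side companion to a heartbeat sender.
// It remembers when each node was last heard from and reports the
// nodes that have been silent for longer than the timeout.
class HeartbeatMonitor {
  private readonly lastSeen = new Map<string, number>();
  private readonly timeoutMs: number;
  private readonly now: () => number;

  constructor(timeoutMs: number, now: () => number = () => Date.now()) {
    this.timeoutMs = timeoutMs;
    this.now = now; // injectable clock (assumption, for testability)
  }

  // Call whenever a heartbeat arrives from nodeId.
  recordHeartbeat(nodeId: string): void {
    this.lastSeen.set(nodeId, this.now());
  }

  // Ids of nodes whose last heartbeat is older than the timeout.
  suspectedNodes(): string[] {
    const cutoff = this.now() - this.timeoutMs;
    const suspects: string[] = [];
    this.lastSeen.forEach((seenAt, nodeId) => {
      if (seenAt < cutoff) suspects.push(nodeId);
    });
    return suspects;
  }
}
```

With a 1000 ms heartbeat interval and a 3000 ms timeout, a node is reported only after roughly three consecutive heartbeats have gone missing, which tolerates an occasional delayed or dropped message.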

Health Checks

A health check is an endpoint that reports a node's health status when queried.

Liveness vs Readiness

Liveness: Is the process running?

Readiness: Can the process handle requests?

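interface HealthStatus {
  status: 'healthy' | 'unhealthy';
  timestamp: number;
  checks?: Record<string, boolean>;
}

abstract class HealthCheckService {
  async livenessCheck(): Promise<HealthStatus> {
    // Check if process is alive: being able to answer is the signal
    return { status: 'healthy', timestamp: Date.now() };
  }

  async readinessCheck(): Promise<HealthStatus> {
    // Check if ready to serve requests
    const checks = {
      database: await this.checkDatabase(),
      cache: await this.checkCache()
    };
    // Assumed completion: ready only if every dependency check passed
    const ready = checks.database && checks.cache;
    return { status: ready ? 'healthy' : 'unhealthy', timestamp: Date.now(), checks };
  }

  // Dependency probes are left to the application
  protected abstract checkDatabase(): Promise<boolean>;
  protected abstract checkCache(): Promise<boolean>;
}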
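A readiness check is only useful if it answers promptly: a dependency that hangs should count as a failed check rather than stall the probe. The sketch below shows one possible way to bound each dependency check with a timeout and run the checks concurrently. `Check`, `withTimeout`, and `readiness` are hypothetical helpers, not part of any library.

```typescript
// A dependency probe that resolves to true when the dependency is usable.
type Check = () => Promise<boolean>;

// Resolves to false if the check rejects or does not finish within timeoutMs.
function withTimeout(check: Check, timeoutMs: number): Promise<boolean> {
  return new Promise<boolean>((resolve) => {
    const timer = setTimeout(() => resolve(false), timeoutMs);
    check().then(
      (ok) => { clearTimeout(timer); resolve(ok); },
      () => { clearTimeout(timer); resolve(false); },
    );
  });
}

// Runs all named checks concurrently; the node is ready only if all pass.
async function readiness(
  checks: Record<string, Check>,
  timeoutMs: number,
): Promise<{ ready: boolean; results: Record<string, boolean> }> {
  const names = Object.keys(checks);
  const outcomes = await Promise.all(
    names.map((name) => withTimeout(checks[name], timeoutMs)),
  );
  const results: Record<string, boolean> = {};
  names.forEach((name, i) => { results[name] = outcomes[i]; });
  return { ready: outcomes.every((ok) => ok), results };
}
```

Because the checks run concurrently, the whole readiness call takes at most about `timeoutMs`, regardless of how many dependencies are checked. The per-check results can be returned in the response body to show which dependency failed.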

Examples

Kubernetes Health Checks

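import express from 'express';

const app = express();

// Placeholder: replace with real dependency checks
async function checkReadiness(): Promise<boolean> { return true; }

// Liveness probe
app.get('/health/live', (req, res) => {
  res.json({ status: 'alive' });
});

// Readiness probe: 503 tells Kubernetes to stop routing traffic to this pod
app.get('/health/ready', async (req, res) => {
  const ready = await checkReadiness();
  res.status(ready ? 200 : 503).json({ status: ready ? 'ready' : 'not ready' });
});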

Leader Election with Heartbeats

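abstract class LeaderWithHeartbeat {
  private isLeader = false;
  protected lastLeaderHeartbeat = Date.now(); // set when a leader heartbeat arrives
  constructor(private heartbeatInterval: number, private leaderTimeout: number) {} // both in ms

  protected abstract sendHeartbeat(): Promise<void>;
  protected abstract startElection(): Promise<void>;

  async becomeLeader(): Promise<void> {
    this.isLeader = true;
    // Send periodic heartbeats
    setInterval(async () => {
      if (this.isLeader) {
        await this.sendHeartbeat();
      }
    }, this.heartbeatInterval);
  }

  async checkLeader(): Promise<void> {
    // Assumed body; the original listing is cut off at this method.
    // A follower starts an election after leaderTimeout ms of silence.
    if (!this.isLeader && Date.now() - this.lastLeaderHeartbeat > this.leaderTimeout) {
      await this.startElection();
    }
  }
}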

Common Pitfalls

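  • Too-frequent heartbeats: Add network and CPU overhead. Fix: Pick the longest interval that still meets your detection-time target
  • Not handling network delays: One late or dropped heartbeat causes a false positive. Fix: Use a timeout > 3x the interval, so several consecutive heartbeats must be missed
  • Single health check: May miss failures in other components. Fix: Check every dependency the node needs to serve requests
  • Not failing fast: Unhealthy nodes continue serving. Fix: Fail the readiness check so the load balancer removes the node
  • No graceful shutdown: Health checks start failing only once the node is already going down, so in-flight requests are dropped. Fix: On shutdown, fail readiness first, drain in-flight requests, then exit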
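The last two pitfalls can be addressed together by letting the readiness endpoint fail as soon as shutdown begins. Below is a minimal, framework-agnostic sketch of that idea. `ReadinessGate` and its methods are hypothetical names; wiring it to a signal handler and an HTTP route is left out.

```typescript
// Hypothetical helper that decides what the health endpoints should answer.
class ReadinessGate {
  private shuttingDown = false;

  // Call when the process receives its shutdown signal (for example SIGTERM).
  beginShutdown(): void {
    this.shuttingDown = true;
  }

  // HTTP status for the readiness endpoint: 503 once shutdown has begun
  // or when a dependency is unhealthy, 200 otherwise.
  readinessStatus(dependenciesHealthy: boolean): number {
    return !this.shuttingDown && dependenciesHealthy ? 200 : 503;
  }

  // Liveness stays 200 during shutdown so that a process which is
  // deliberately draining is not treated as crashed.
  livenessStatus(): number {
    return 200;
  }
}
```

A typical sequence under these assumptions: the signal handler calls `beginShutdown()`, the process keeps serving long enough for the load balancer to observe the 503, in-flight requests finish, and only then does the process exit.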

Interview Questions

Beginner

Q: What are heartbeats and health checks used for?

A:

Heartbeats: Periodic messages a node sends to indicate it is alive. Used for failure detection.

Health checks: Endpoints that report a node's health. Used to determine whether the node can handle requests.

Purpose:

  • Failure detection: Know when nodes fail
  • Load balancing: Route traffic only to healthy nodes
  • Auto-recovery: Restart or replace failed nodes
  • Monitoring: Track system health

Intermediate

Q: How do you implement health checks for a microservice? What's the difference between liveness and readiness?

A:

Liveness: Is the process running?

  • Simple check: The process is up and able to respond
  • Use: Kubernetes restarts the container if the check fails

Readiness: Can the process handle requests?

  • Comprehensive check: Database, cache, and other dependencies are all working
  • Use: The load balancer routes traffic to the node only while it is ready

Implementation:

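// `app` is an Express application; checkDatabase and checkCache are
// application-defined and resolve to true when the dependency is reachable.

// Liveness: Simple
app.get('/health/live', (req, res) => {
  res.json({ status: 'alive' });
});

// Readiness: Check dependencies
app.get('/health/ready', async (req, res) => {
  const db = await checkDatabase();
  const cache = await checkCache();

  if (db && cache) {
    res.status(200).json({ status: 'ready' });
  } else {
    // 503 signals that traffic should not be routed here yet
    res.status(503).json({ status: 'not ready', db, cache });
  }
});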

Senior

Q: Design a failure detection system for a distributed system with 1000+ nodes. How do you detect failures quickly while minimizing network overhead?

A:

Design:

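// Hierarchical heartbeats: nodes report to a cluster leader, leaders to a regional coordinator
interface HeartbeatTarget {
  heartbeat(senderId: string): Promise<void>;
}

class HierarchicalHeartbeat {
  constructor(
    private readonly myId: string,
    private readonly clusterId: string,
    private readonly isClusterLeader: boolean,
    private readonly clusterLeader: HeartbeatTarget,
    private readonly regionalCoordinator: HeartbeatTarget,
  ) {}

  async heartbeat(): Promise<void> {
    // Nodes heartbeat to cluster leader
    await this.clusterLeader.heartbeat(this.myId);

    // Cluster leaders heartbeat to regional coordinator
    if (this.isClusterLeader) {
      await this.regionalCoordinator.heartbeat(this.clusterId);
    }
  }
}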

Optimizations:

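  • Hierarchical: Reduce network messages by aggregating at each level (nodes → cluster → region)
  • Gossip: Each node exchanges state with a few random peers per round, so per-node load stays constant and an update reaches all n nodes in O(log n) rounds, instead of one monitor handling O(n) heartbeats
  • Adaptive: Adjust heartbeat frequency based on the observed failure rate
  • Sampling: Check a subset of nodes each round and rotate the subset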
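To make the gossip option more concrete, here is a minimal sketch of gossip-style failure detection using heartbeat counters. It assumes each node keeps a table of counters, bumps its own counter every round, and sends the table to a randomly chosen peer. Peer selection and networking are omitted, and `GossipMembership`, `tick`, `merge`, and `suspected` are hypothetical names.

```typescript
interface MemberState {
  heartbeat: number;     // counter incremented only by the member itself
  lastUpdatedAt: number; // local time when this counter was last seen to rise
}

class GossipMembership {
  private readonly members = new Map<string, MemberState>();
  private readonly selfId: string;
  private readonly failAfterMs: number;

  constructor(selfId: string, failAfterMs: number, now: number) {
    this.selfId = selfId;
    this.failAfterMs = failAfterMs;
    this.members.set(selfId, { heartbeat: 0, lastUpdatedAt: now });
  }

  // Once per gossip round: bump our own counter and return the table
  // that would be sent to a randomly chosen peer.
  tick(now: number): Record<string, number> {
    const self = this.members.get(this.selfId)!;
    self.heartbeat += 1;
    self.lastUpdatedAt = now;
    const digest: Record<string, number> = {};
    this.members.forEach((state, id) => { digest[id] = state.heartbeat; });
    return digest;
  }

  // Merge a peer's table, keeping the higher counter for every member.
  merge(digest: Record<string, number>, now: number): void {
    Object.keys(digest).forEach((id) => {
      const known = this.members.get(id);
      if (!known || digest[id] > known.heartbeat) {
        this.members.set(id, { heartbeat: digest[id], lastUpdatedAt: now });
      }
    });
  }

  // Members whose counter has not advanced within failAfterMs.
  suspected(now: number): string[] {
    const out: string[] = [];
    this.members.forEach((state, id) => {
      if (id !== this.selfId && now - state.lastUpdatedAt > this.failAfterMs) out.push(id);
    });
    return out;
  }
}
```

A node learns about members it never contacts directly, because counters travel inside other nodes' tables. A failed node is detected when its counter stops rising everywhere, without any central monitor receiving a heartbeat from every node.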

Related Topics

  • Fault Tolerance - Using health checks for fault tolerance
  • Leader Election - Detecting leader failures with heartbeats
  • Gossip Protocol - Alternative to heartbeats for membership
  • Distributed Logging - Logging health check events
  • Replication Lag - Health checks for replica status

Key Takeaways

  • Heartbeats indicate nodes are alive, used for failure detection
  • Health checks report node status (liveness vs readiness)
  • Liveness: Process running (restart if fails)
  • Readiness: Can handle requests (route traffic if ready)
  • Failure detection: Use timeouts (3x heartbeat interval)
  • Minimize overhead: Use hierarchical or gossip-based approaches
  • Quick detection: Balance frequency with network overhead


About the author

InterviewCrafted helps you master system design with patience. We believe in curiosity-led engineering, reflective writing, and designing systems that make future changes feel calm.