Heartbeats & Health Checks

Learn how to monitor node health and detect failures in distributed systems.

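Heartbeats and health checks are two complementary ways to detect node failures in distributed systems. With heartbeats, a node periodically pushes a message to signal that it is alive. With health checks, another component (an orchestrator or load balancer) polls an endpoint to ask whether the node is alive and able to serve requests.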


Heartbeats

Periodic messages sent to indicate a node is alive.

Implementation

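// Shape of the message sent on every beat.
interface Heartbeat {
  nodeId: string;
  timestamp: number;
  status: string;
}

class HeartbeatSender {
  private timer?: ReturnType<typeof setInterval>;

  constructor(
    private readonly coordinator: { heartbeat(hb: Heartbeat): Promise<void> },
    private readonly myId: string,
    private readonly interval: number, // milliseconds between beats
    private readonly getStatus: () => string
  ) {}

  start(): void {
    // Catch send errors so a failed beat does not become an unhandled rejection.
    this.timer = setInterval(() => {
      this.sendHeartbeat().catch(err => console.error('heartbeat failed:', err));
    }, this.interval);
  }

  stop(): void {
    if (this.timer !== undefined) clearInterval(this.timer);
  }

  async sendHeartbeat(): Promise<void> {
    await this.coordinator.heartbeat({
      nodeId: this.myId,
      timestamp: Date.now(),
      status: this.getStatus()
    });
  }
}

class HeartbeatReceiver {
  private lastHeartbeat: Map<string, number> = new Map();

  constructor(
    private readonly heartbeatInterval: number, // must match the senders' interval
    private readonly markFailed: (nodeId: string) => Promise<void>
  ) {}

  async receiveHeartbeat(heartbeat: Heartbeat): Promise<void> {
    // Record the receiver's own clock rather than heartbeat.timestamp,
    // so clock skew between nodes cannot distort the timeout.
    this.lastHeartbeat.set(heartbeat.nodeId, Date.now());
  }

  // Call periodically, for example once per heartbeat interval.
  async checkFailures(): Promise<void> {
    const now = Date.now();
    const timeout = this.heartbeatInterval * 3; // 3x interval: tolerates a few late or lost beats

    for (const [nodeId, lastSeen] of this.lastHeartbeat) {
      if (now - lastSeen > timeout) {
        // Stop tracking the node so it is reported once, not on every sweep.
        this.lastHeartbeat.delete(nodeId);
        await this.markFailed(nodeId);
      }
    }
  }
}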
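
The timeout multiplier determines how long a failure can go unnoticed. As a rough sketch (the function names and the `checkPeriodMs` parameter are illustrative assumptions, not part of the classes above, and network delay is ignored), the worst case is approximately the timeout plus one sweep period of `checkFailures`:

```typescript
// Hypothetical helpers for reasoning about heartbeat timeouts.

/** True when no heartbeat has arrived within `missedBeats` intervals. */
function isTimedOut(
  lastSeenMs: number,
  nowMs: number,
  intervalMs: number,
  missedBeats: number = 3
): boolean {
  return nowMs - lastSeenMs > intervalMs * missedBeats;
}

/**
 * Approximate worst-case time between a crash and its detection:
 * the full timeout must elapse, and then the next periodic sweep
 * has to run.
 */
function worstCaseDetectionMs(
  intervalMs: number,
  missedBeats: number,
  checkPeriodMs: number
): number {
  return intervalMs * missedBeats + checkPeriodMs;
}
```

With a 1-second interval, a 3x timeout, and a sweep every 500 ms, a crash is noticed within roughly 3.5 seconds. Shortening the interval reduces that figure at the cost of more messages.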

Health Checks

Endpoints that report node health status.

Liveness vs Readiness

Liveness: Is the process running?

Readiness: Can the process handle requests?

class HealthCheckService {
  async livenessCheck(): Promise<HealthStatus> {
    // If this code runs, the process is alive and its event loop is responsive.
    // Keep dependency checks out of liveness: a database outage should not trigger restarts.
    return {
      status: 'healthy',
      timestamp: Date.now()
    };
  }

  async readinessCheck(): Promise<HealthStatus> {
    // Run the dependency checks in parallel so the probe stays fast.
    const [database, cache, dependencies] = await Promise.all([
      this.checkDatabase(),
      this.checkCache(),
      this.checkDependencies()
    ]);
    const checks = { database, cache, dependencies };

    const healthy = Object.values(checks).every(c => c === 'healthy');

    return {
      status: healthy ? 'ready' : 'not-ready',
      checks,
      timestamp: Date.now()
    };
  }
}
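
When readiness fails, it helps to know which component caused it. The sketch below condenses a set of check results into a status, an HTTP code, and the list of failing components. The names `CheckResult`, `ReadinessSummary`, and `summarizeReadiness` are assumptions made for illustration and do not come from the service above.

```typescript
// Hypothetical helper: turn per-component results into a readiness summary.
type CheckResult = 'healthy' | 'unhealthy';

interface ReadinessSummary {
  status: 'ready' | 'not-ready';
  httpStatus: 200 | 503;
  failing: string[]; // names of the components that are not healthy
}

function summarizeReadiness(checks: Record<string, CheckResult>): ReadinessSummary {
  const failing = Object.keys(checks).filter(name => checks[name] !== 'healthy');
  const ready = failing.length === 0;
  return {
    status: ready ? 'ready' : 'not-ready',
    httpStatus: ready ? 200 : 503,
    failing
  };
}
```

For example, `summarizeReadiness({ database: 'healthy', cache: 'unhealthy' })` reports `not-ready` with HTTP 503 and names `cache` as the failing component.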

Examples

Kubernetes Health Checks

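// Liveness probe: no dependency checks, just confirm the process responds
app.get('/health/live', (req, res) => {
  res.json({ status: 'alive' });
});

// Readiness probe: 503 tells Kubernetes to stop routing traffic to this pod
app.get('/health/ready', async (req, res) => {
  try {
    const ready = await checkReadiness();
    res.status(ready ? 200 : 503).json({ status: ready ? 'ready' : 'not-ready' });
  } catch (err) {
    // A check that throws counts as not ready
    res.status(503).json({ status: 'not-ready' });
  }
});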
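
Kubernetes does not act on a single failed probe. Each probe has a `failureThreshold` setting (3 by default) that sets how many consecutive failures are needed before the check counts as failed. The class below is a simplified model of that rule for illustration, not Kubernetes code; `ProbeTracker` and `record` are hypothetical names.

```typescript
// Simplified model of a prober applying a consecutive-failure threshold.
class ProbeTracker {
  private consecutiveFailures = 0;

  constructor(private readonly failureThreshold: number = 3) {}

  /** Record one probe result; returns true once the threshold is reached. */
  record(success: boolean): boolean {
    // Any success resets the streak; only consecutive failures count.
    this.consecutiveFailures = success ? 0 : this.consecutiveFailures + 1;
    return this.consecutiveFailures >= this.failureThreshold;
  }
}
```

A threshold above 1 absorbs isolated slow responses, so a brief pause does not cause a restart or removal from service.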

Leader Election with Heartbeats

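class LeaderWithHeartbeat {
  private isLeader = false;
  private heartbeatTimer?: ReturnType<typeof setInterval>;

  async becomeLeader(): Promise<void> {
    this.isLeader = true;

    // Send periodic heartbeats while this node is the leader
    this.heartbeatTimer = setInterval(() => {
      this.sendHeartbeat().catch(err => console.error('leader heartbeat failed:', err));
    }, this.heartbeatInterval);
  }

  stepDown(): void {
    // Stop the timer so a former leader does not keep announcing itself
    this.isLeader = false;
    if (this.heartbeatTimer !== undefined) clearInterval(this.heartbeatTimer);
  }

  // Followers call this periodically
  async checkLeader(): Promise<void> {
    if (this.isLeader) return;

    const lastHeartbeat = await this.getLastLeaderHeartbeat();
    const timeout = this.heartbeatInterval * 3; // same 3x rule as the receiver above

    if (Date.now() - lastHeartbeat > timeout) {
      // Leader failed, start election
      await this.startElection();
    }
  }
}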
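
If every follower uses the same fixed timeout, they tend to notice a dead leader at the same moment and start competing elections. A common remedy, used by Raft among others, is to give each node a randomized timeout. The sketch below is illustrative only: `randomizedTimeout`, its multiplier defaults, and the injected `random` source are assumptions, not part of the class above.

```typescript
/**
 * Pick a timeout uniformly between minMultiplier and maxMultiplier
 * heartbeat intervals. `random` is injectable so the result can be
 * made deterministic in tests.
 */
function randomizedTimeout(
  heartbeatIntervalMs: number,
  minMultiplier: number = 3,
  maxMultiplier: number = 6,
  random: () => number = Math.random
): number {
  const min = heartbeatIntervalMs * minMultiplier;
  const max = heartbeatIntervalMs * maxMultiplier;
  return Math.floor(min + random() * (max - min));
}
```

Each follower would call this once per election cycle, so the node with the shortest timeout usually starts its election before the others notice the failure.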

Common Pitfalls

  • Too-frequent heartbeats: Add network and CPU overhead. Fix: Balance frequency against the detection time you need
  • Not handling network delays: Late heartbeats cause false positives. Fix: Use a timeout of at least 3x the interval
  • Single health check: One shallow check can miss a failed dependency. Fix: Check multiple components
  • Not failing fast: Unhealthy nodes continue serving. Fix: Fail the readiness check so the load balancer removes the node
  • No graceful shutdown: Requests are dropped while the node shuts down. Fix: Implement graceful shutdown (fail readiness first, finish in-flight requests, then exit)
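
The graceful-shutdown pitfall can be sketched as a small gate that the readiness handler and the request path both consult. This is a minimal illustration under assumed names (`ShutdownGate` and its methods are hypothetical); a real service would also wire it to its signal handler and HTTP server.

```typescript
// Hypothetical gate coordinating readiness and request draining.
class ShutdownGate {
  private shuttingDown = false;
  private inFlight = 0;

  /** The readiness handler should answer 503 when this is false. */
  isReady(): boolean {
    return !this.shuttingDown;
  }

  requestStarted(): void {
    this.inFlight++;
  }

  requestFinished(): void {
    this.inFlight = Math.max(0, this.inFlight - 1);
  }

  /** Call when a termination signal arrives: stop advertising readiness. */
  beginShutdown(): void {
    this.shuttingDown = true;
  }

  /** True once shutdown has begun and all in-flight requests are done. */
  canExit(): boolean {
    return this.shuttingDown && this.inFlight === 0;
  }
}
```

The ordering matters: readiness fails first so the load balancer stops sending new requests, and the process exits only after the requests already in progress have finished.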

Interview Questions

Beginner

Q: What are heartbeats and health checks used for?

A:

Heartbeats: Periodic messages that a node sends to indicate it is alive. Used for failure detection.

Health checks: Endpoints that report a node's health. Used to determine whether the node can handle requests.

Purpose:

  • Failure detection: Know when nodes fail
  • Load balancing: Route traffic only to healthy nodes
  • Auto-recovery: Restart or replace failed nodes
  • Monitoring: Track system health
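
As an illustration of the load-balancing use, the sketch below rotates through a node list and skips any node marked unhealthy. `NodeInfo` and `HealthyRoundRobin` are hypothetical names, and the `healthy` flag is assumed to be kept up to date by health checks running elsewhere.

```typescript
interface NodeInfo {
  id: string;
  healthy: boolean; // assumed to be updated by a separate health checker
}

class HealthyRoundRobin {
  private cursor = 0;

  constructor(private readonly nodes: NodeInfo[]) {}

  /** Returns the next healthy node, or undefined when none are healthy. */
  next(): NodeInfo | undefined {
    for (let i = 0; i < this.nodes.length; i++) {
      const candidate = this.nodes[(this.cursor + i) % this.nodes.length];
      if (candidate.healthy) {
        // Resume after the chosen node on the next call.
        this.cursor = (this.cursor + i + 1) % this.nodes.length;
        return candidate;
      }
    }
    return undefined;
  }
}
```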

Intermediate

Q: How do you implement health checks for a microservice? What's the difference between liveness and readiness?

A:

Liveness: Is the process running?

  • Simple check: The process is alive and responding
  • Use: Kubernetes restarts the container if the check fails

Readiness: Can the process handle requests?

  • Comprehensive check: Database, cache, and other dependencies are all working
  • Use: The load balancer routes traffic to the node only while it reports ready

Implementation:

// Liveness: Simple
app.get('/health/live', (req, res) => {
  res.json({ status: 'alive' });
});

// Readiness: Check dependencies
app.get('/health/ready', async (req, res) => {
  try {
    // Check both dependencies in parallel
    const [db, cache] = await Promise.all([checkDatabase(), checkCache()]);

    if (db && cache) {
      res.json({ status: 'ready' });
    } else {
      res.status(503).json({ status: 'not-ready' });
    }
  } catch (err) {
    // A check that throws counts as not ready
    res.status(503).json({ status: 'not-ready' });
  }
});

Senior

Q: Design a failure detection system for a distributed system with 1000+ nodes. How do you detect failures quickly while minimizing network overhead?

A:

Design:

// Three complementary techniques, shown as separate classes because
// TypeScript does not allow class declarations nested inside a class body.
// Fields and helper methods are omitted for brevity.

// Hierarchical heartbeats
class HierarchicalHeartbeat {
  private clusters: Cluster[] = [];

  async heartbeat(): Promise<void> {
    // Nodes heartbeat to cluster leader
    await this.clusterLeader.heartbeat(this.myId);

    // Cluster leaders heartbeat to regional coordinator
    if (this.isClusterLeader) {
      await this.regionalCoordinator.heartbeat(this.clusterId);
    }
  }
}

// Gossip-based failure detection
class GossipFailureDetection {
  async gossip(): Promise<void> {
    const peer = this.selectRandomPeer();

    // Exchange membership and last-seen times
    const myView = this.getMembershipView();
    const peerView = await peer.getMembershipView();

    // Merge views, keeping the freshest last-seen time per node
    this.mergeViews(myView, peerView);

    // Detect failures
    this.detectFailures();
  }

  detectFailures(): void {
    const now = Date.now();
    const timeout = this.failureTimeout;

    for (const [nodeId, lastSeen] of this.membership) {
      if (now - lastSeen > timeout) {
        // Each round without fresh news raises suspicion
        const suspicion = (this.suspicionLevels.get(nodeId) || 0) + 1;
        this.suspicionLevels.set(nodeId, suspicion);

        if (suspicion > this.threshold) {
          this.markFailed(nodeId);
        }
      } else {
        // Fresh news about the node clears any earlier suspicion
        this.suspicionLevels.delete(nodeId);
      }
    }
  }
}

// Adaptive heartbeats
class AdaptiveHeartbeat {
  async adjustInterval(): Promise<void> {
    const failureRate = this.getRecentFailureRate();

    if (failureRate > 0.1) {
      // High failure rate, check more frequently
      this.interval = 1000; // 1 second
    } else {
      // Low failure rate, can check less frequently
      this.interval = 5000; // 5 seconds
    }
  }
}
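
The `mergeViews` step in the gossip design is left undefined above. One plausible reading is to keep the most recent last-seen time for every node. The sketch below assumes exactly that, and it also assumes timestamps are comparable across nodes (roughly synchronized clocks). Real gossip protocols often exchange per-node heartbeat counters instead to avoid that assumption. `MembershipView` is a hypothetical type name.

```typescript
// nodeId -> last-seen time in milliseconds
type MembershipView = Map<string, number>;

/** Returns a new view holding the freshest last-seen time for each node. */
function mergeViews(local: MembershipView, remote: MembershipView): MembershipView {
  const merged: MembershipView = new Map(local);
  remote.forEach((remoteSeen, nodeId) => {
    const localSeen = merged.get(nodeId);
    if (localSeen === undefined || remoteSeen > localSeen) {
      merged.set(nodeId, remoteSeen);
    }
  });
  return merged;
}
```

Because the merge only ever moves a timestamp forward, repeated exchanges converge on the same view regardless of the order in which peers talk to each other.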

Optimizations:

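  • Hierarchical: Reduce the fan-in at any single coordinator (nodes → cluster → region)
  • Gossip: Each node sends a constant number of messages per round, and updates reach all n nodes in O(log n) rounds, so no central coordinator has to receive O(n) heartbeats
  • Adaptive: Adjust frequency based on failure rate
  • Sampling: Check a subset of nodes each round and rotate through them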
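
To make the hierarchical saving concrete, the sketch below estimates how many heartbeats per second arrive at the busiest receiver. The function names and the two-level layout are assumptions for illustration, and a cluster leader's load is approximated as one message per cluster member.

```typescript
/** Heartbeats per second arriving at one receiver from `senders` nodes. */
function inboundRate(senders: number, intervalMs: number): number {
  return senders * (1000 / intervalMs);
}

/** Peak inbound rate at any single receiver in a two-level hierarchy. */
function hierarchicalPeakRate(
  nodes: number,
  clusterSize: number,
  intervalMs: number
): number {
  const clusters = Math.ceil(nodes / clusterSize);
  const leaderLoad = inboundRate(clusterSize, intervalMs); // members -> cluster leader
  const regionLoad = inboundRate(clusters, intervalMs);    // leaders -> regional coordinator
  return Math.max(leaderLoad, regionLoad);
}
```

For 1000 nodes heartbeating once per second, a single coordinator receives about 1000 messages per second. With clusters of 50, no receiver handles more than about 50 per second. The total number of messages does not drop; the load is spread across more receivers.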

Key Takeaways

  • Heartbeats indicate nodes are alive, used for failure detection
  • Health checks report node status (liveness vs readiness)
  • Liveness: Process running (restart if fails)
  • Readiness: Can handle requests (route traffic if ready)
  • Failure detection: Use timeouts (3x heartbeat interval)
  • Minimize overhead: Use hierarchical or gossip-based approaches
  • Quick detection: Balance frequency with network overhead

About the author

InterviewCrafted helps you master system design with patience. We believe in curiosity-led engineering, reflective writing, and designing systems that make future changes feel calm.