Health Checks & Monitoring: Liveness, Readiness & Alerts
Implement health checks: liveness vs readiness, dependency checks, alerting, and avoiding false positives.
Why Engineers Care About This
Health checks tell you if your service is working. Load balancers use health checks to route traffic away from unhealthy servers. Orchestrators use health checks to restart failed containers. Monitoring systems use health checks to detect outages. Without health checks, you don't know if services are healthy until users complain.
If load balancers route traffic to dead servers, failed containers stay down, or outages surface only when users complain, you have a health check problem. These failures compound: dead servers that still receive traffic return errors, containers that are never restarted cause downtime, and undetected outages keep affecting users. Good health checks catch problems early and enable automatic recovery.
In interviews, when someone asks "How would you monitor this service?", they're really asking: "Do you understand health checks? Do you know the difference between liveness and readiness? Do you understand that health checks enable automatic recovery?" Most engineers don't. They implement simple health checks (return 200 OK) without checking dependencies or understanding liveness vs readiness.
Core Intuitions You Must Build
- Liveness probes check whether the service is running; readiness probes check whether it is ready. A liveness probe answers "is the service alive?" If it fails, the orchestrator restarts the container. A readiness probe answers "is the service ready to serve traffic?" If it fails, the service is taken out of rotation but not restarted, because it is alive and simply not ready yet. Use both: liveness for restart logic, readiness for traffic routing.
- Health checks should be fast and lightweight. Load balancers and orchestrators call them every few seconds. A probe that synchronously queries a database or an external API becomes a bottleneck, and when that dependency is slow the probe times out and reports a healthy service as unhealthy (a false negative). Aim for responses under 100 ms by reading in-memory state rather than calling external dependencies on every probe.
- Dependency health checks are separate from service health checks. Service health checks (liveness, readiness) should be fast and report the service's own state. Dependency checks (database, cache, external APIs) can be slower and belong elsewhere: expose them on a separate dependency health endpoint, or include only the critical ones in the readiness probe when needed. Keep them out of the liveness probe, where a slow dependency would trigger needless restarts.
- Health check endpoints should be public and unauthenticated. They are called by infrastructure (load balancers, orchestrators, monitoring) that may not be able to present credentials, so requiring authentication breaks that automation. Because the endpoint is public, keep sensitive information out of its responses.
- Health check monitoring enables proactive problem detection. Because health checks run frequently, their results are a near real-time signal of service health. Track success rates, response times, and failure patterns, and alert on failures or degraded health so problems are caught before users are affected. Implementing health checks is not enough; monitor them.
- Health checks enable graceful degradation. When a dependency fails, a service can degrade gracefully by serving cached data or disabling features. Health checks can reflect this: readiness fails when a critical dependency is down while liveness keeps passing, because the service is alive but not ready. Traffic is then routed away automatically while the service recovers, without a restart.
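The split between liveness, readiness, and degraded-but-serving can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: `ServiceState`, its field names, and the "critical" versus "optional" grouping of dependencies are assumptions made for the example, and the two functions would still need to be wired to whatever probe paths your infrastructure calls.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceState:
    """In-memory state the probes read; nothing here performs I/O."""
    started: bool = False
    fatal_error: bool = False  # hypothetical flag for an unrecoverable in-process fault
    # Hypothetical dependency names mapped to their last known status.
    critical_ok: dict = field(default_factory=dict)
    optional_ok: dict = field(default_factory=dict)


def liveness(state: ServiceState) -> tuple[int, dict]:
    """Fails only when the process itself is broken; dependencies are ignored."""
    if state.fatal_error:
        return 503, {"status": "dead"}
    return 200, {"status": "alive"}


def readiness(state: ServiceState) -> tuple[int, dict]:
    """Fails during startup or when a critical dependency is down."""
    if not state.started:
        return 503, {"status": "starting"}
    failed = sorted(name for name, ok in state.critical_ok.items() if not ok)
    if failed:
        return 503, {"status": "not_ready", "failed": failed}
    degraded = sorted(name for name, ok in state.optional_ok.items() if not ok)
    if degraded:
        # Optional dependency down: keep serving, but report the degradation.
        return 200, {"status": "degraded", "degraded": degraded}
    return 200, {"status": "ready"}
```

With this shape, a failed cache (optional) leaves the service in rotation, while a failed database (critical) removes it from rotation without ever failing liveness.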
Subtopics (Taught Through Real Scenarios)
Liveness vs Readiness Probes
What people usually get wrong:
Engineers often implement a single health check endpoint that returns 200 OK. But liveness and readiness serve different purposes: liveness asks whether the service is alive (should it be restarted?), and readiness asks whether it is ready (should it receive traffic?). Use liveness for restart logic and readiness for traffic routing. A single check for both causes problems, such as a service that is alive but not yet ready being restarted unnecessarily.
How this breaks systems in the real world:
A service had a single health check that tested database connectivity. During startup the database connection was not yet established, so the health check failed. The orchestrator concluded the service was dead and restarted it, and because each fresh start failed the same check, the service entered a restart loop. The fix? Use separate liveness and readiness probes: liveness checks only that the service is alive (no database check), and readiness checks that the service is ready (including the database). Now the service is no longer restarted during startup. But the real lesson is: liveness and readiness serve different purposes. Use both.
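To make the restart loop concrete, here is a small Python sketch of how an orchestrator might act on probe results. The `Action` names and both decision functions are hypothetical, and the model is deliberately simplified: real orchestrators apply failure thresholds and startup grace periods, and a real restart would also reset the startup timeline.

```python
from enum import Enum


class Action(Enum):
    RESTART = "restart"
    REMOVE_FROM_ROTATION = "remove_from_rotation"
    ROUTE_TRAFFIC = "route_traffic"


def decide(liveness_ok: bool, readiness_ok: bool) -> Action:
    """Separate probes: only a liveness failure triggers a restart."""
    if not liveness_ok:
        return Action.RESTART
    if not readiness_ok:
        return Action.REMOVE_FROM_ROTATION
    return Action.ROUTE_TRAFFIC


def single_check_decide(health_ok: bool) -> Action:
    """The anti-pattern: one check drives both restarts and routing."""
    return Action.ROUTE_TRAFFIC if health_ok else Action.RESTART


# Startup timeline: the database connection is established by the third probe.
db_connected_timeline = [False, False, True]

# Single check that tests the database: the service is restarted while it starts.
with_single = [single_check_decide(db) for db in db_connected_timeline]

# Split probes: the process is alive throughout, it is just not ready yet.
with_split = [decide(liveness_ok=True, readiness_ok=db) for db in db_connected_timeline]
```

In the split version the service is held out of rotation for the first two probes and then receives traffic, with no restart at any point.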
What interviewers are really listening for:
They want to hear you talk about liveness vs readiness probes and their different purposes. Junior engineers say "just return 200 OK for health checks." Senior engineers say "liveness checks if service is alive (for restart logic), readiness checks if service is ready (for traffic routing)—use both, don't use a single health check for both." They're testing whether you understand that health checks serve different purposes.
Fast and Lightweight Health Checks
What people usually get wrong:
Engineers often check dependencies (database, cache, external APIs) directly in health checks. But health checks are called every few seconds, and a dependency round trip can take 100-500 ms. Slow health checks become a bottleneck and can time out, reporting a healthy service as unhealthy (a false negative). Keep health checks fast (under 100 ms) by reading in-memory state instead of calling external dependencies.
How this breaks systems in the real world:
A service checked database connectivity in its health check, which took 200 ms because it ran a database query. When the database slowed down, the check exceeded its 500 ms timeout. Load balancers marked the service as unhealthy and stopped sending it traffic, even though the service itself was fine and only the database was slow. The fix? Make the health check fast by reading in-memory state only, and check the database separately, in a readiness probe or a dependency health endpoint. But the real lesson is: health checks should be fast. Don't check slow dependencies in health checks.
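One way to keep a probe fast while still knowing about a dependency is to run the slow check in the background and let the probe read the cached result. The sketch below assumes that approach; `CachedDependencyCheck`, its parameters, and the idea of treating a stale result as unhealthy are illustrative choices rather than anything prescribed above.

```python
import time
from typing import Callable, Optional


class CachedDependencyCheck:
    """Stores the last result of a slow check so probes only read memory."""

    def __init__(
        self,
        check: Callable[[], bool],
        max_age_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._check = check          # the slow call, e.g. a database ping
        self._max_age_s = max_age_s  # how long a cached result stays valid
        self._clock = clock          # injectable so tests can control time
        self._last_ok = False
        self._last_run: Optional[float] = None

    def refresh(self) -> None:
        """Run the slow check; intended to be called from a background loop."""
        try:
            ok = bool(self._check())
        except Exception:
            ok = False  # a check that raises counts as a failed check
        self._last_ok = ok
        self._last_run = self._clock()

    def is_healthy(self) -> bool:
        """Fast path for the probe handler: no I/O, only a staleness test."""
        if self._last_run is None:
            return False  # never checked yet
        fresh = (self._clock() - self._last_run) <= self._max_age_s
        return self._last_ok and fresh
```

A background thread or scheduler would call `refresh()` every few seconds, while the probe handler calls only `is_healthy()`, so a slow database delays the refresh loop instead of the probe response.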
What interviewers are really listening for:
They want to hear you talk about fast health checks, avoiding dependency checks, and false negatives. Junior engineers say "just check database in health checks." Senior engineers say "health checks should be fast (less than 100ms)—check in-memory state, not external dependencies. Slow health checks cause false negatives and become bottlenecks." They're testing whether you understand that health checks are about speed, not comprehensiveness.
Health Check Monitoring
What people usually get wrong:
Engineers often implement health checks but don't monitor them. But health checks provide real-time health status—monitoring them helps you catch problems early. Track health check success rates, response times, and failure patterns. Alert on health check failures or degraded health. This enables proactive problem detection, before users are affected.
How this breaks systems in the real world:
A service had health checks but no monitoring. Health checks were failing (service was unhealthy), but no one knew. Users started experiencing errors, and the team had to investigate. By then, the problem had been happening for hours, affecting many users. The fix? Monitor health checks—track success rates, response times, and alert on failures. Now problems are caught early, before users are affected. But the real lesson is: health checks require monitoring. Without monitoring, you're blind to health issues.
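The metrics mentioned here (success rate, response time, failure patterns) can be tracked with very little code. The following Python sketch is one possible shape, not a recommended monitoring stack: `ProbeMonitor`, the window size, and both thresholds are assumed values chosen for illustration, and a production setup would normally export these numbers to an existing monitoring system.

```python
from collections import deque


class ProbeMonitor:
    """Tracks recent probe results and decides when to raise an alert."""

    def __init__(
        self,
        window: int = 20,
        min_success_rate: float = 0.9,
        max_consecutive_failures: int = 3,
    ) -> None:
        self._results = deque(maxlen=window)  # (ok, latency_ms) pairs
        self._min_success_rate = min_success_rate
        self._max_consecutive_failures = max_consecutive_failures
        self._consecutive_failures = 0

    def record(self, ok: bool, latency_ms: float) -> None:
        """Store one probe result; the oldest result drops out of the window."""
        self._results.append((ok, latency_ms))
        self._consecutive_failures = 0 if ok else self._consecutive_failures + 1

    def success_rate(self) -> float:
        if not self._results:
            return 1.0  # no data yet, so nothing to alert on
        return sum(1 for ok, _ in self._results if ok) / len(self._results)

    def avg_latency_ms(self) -> float:
        if not self._results:
            return 0.0
        return sum(latency for _, latency in self._results) / len(self._results)

    def should_alert(self) -> bool:
        """Alert on a run of failures or on a degraded success rate."""
        return (
            self._consecutive_failures >= self._max_consecutive_failures
            or self.success_rate() < self._min_success_rate
        )
```

Requiring several consecutive failures, rather than alerting on the first one, is a common way to avoid paging on a single transient timeout.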
What interviewers are really listening for:
They want to hear you talk about health check monitoring, metrics to track, and alerting. Junior engineers say "health checks just work." Senior engineers say "monitor health checks—track success rates, response times, and alert on failures or degraded health to catch problems early." They're testing whether you understand that health checks are about observability, not just "returning status."
Related Topics
- Error Handling & Logging: Handling health check failures
- System Design: Designing systems with health checks
- API Design: Designing health check endpoints
Key Takeaways
- Liveness probes check whether the service is running and readiness probes check whether it is ready; use both, for different purposes
- Health checks should be fast and lightweight: read in-memory state rather than calling external dependencies
- Dependency health checks are separate: keep slow dependency checks out of the service's own health checks
- Health check endpoints should be public and unauthenticated, because infrastructure calls them
- Health check monitoring enables proactive problem detection: track success rates and alert on failures
- Health checks enable graceful degradation: readiness can fail when critical dependencies are down while liveness keeps passing
- Good health checks catch problems early and enable automatic recovery