Real Engineering Stories

The Misconfigured Load Balancer That Created a Single Point of Failure

A production incident where a misconfigured load balancer sent all traffic to a single backend server, causing it to crash and take down the service. Learn about load balancer configuration, health checks, and redundancy.

Intermediate · 20 min read

This is a story about how a simple configuration mistake in a load balancer created a single point of failure that took down our entire service. It's also about why infrastructure configuration matters as much as application code, and how we learned to test our infrastructure changes.


Context

We were running a REST API service that handled user authentication, profile management, and data queries. The system served about 5M requests per day, with traffic distributed across multiple backend servers behind a load balancer.

Original Architecture:

graph TB
    Client[Client] --> LB[Load Balancer<br/>AWS ALB]
    LB --> API1[API Server 1]
    LB --> API2[API Server 2]
    LB --> API3[API Server 3]
    API1 --> DB[(Database)]
    API2 --> DB
    API3 --> DB

Technology Choices:

  • Load Balancer: AWS Application Load Balancer (ALB)
  • Backend: Node.js API servers (3 instances)
  • Database: PostgreSQL with read replicas
  • Health Check: HTTP GET /health endpoint
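
The health check endpoint itself is deliberately lightweight. As a minimal sketch, assuming an Express app on Node.js (the port and response body here are illustrative, not our exact implementation):

    // Minimal liveness endpoint -- returns 200 without touching any downstream
    // dependency, so it only fails when the process itself is in trouble.
    import express from "express";

    const app = express();

    app.get("/health", (_req, res) => {
      res.status(200).json({ status: "ok" });
    });

    app.listen(3000);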

Assumptions Made:

  • Load balancer would distribute traffic evenly
  • Health checks would properly detect unhealthy servers
  • Multiple backend servers provided redundancy

The Incident

Timeline:

  • 10:00 AM: Infrastructure team updated load balancer health check configuration
  • 10:05 AM: Health check path changed from /health to /api/health (typo in config)
  • 10:06 AM: Load balancer marked all servers as unhealthy (404 on health check)
  • 10:07 AM: Load balancer routing algorithm changed to "least connections"
  • 10:08 AM: All traffic routed to API Server 1 (only server that had a connection)
  • 10:10 AM: API Server 1 CPU usage spiked to 100%
  • 10:12 AM: API Server 1 crashed due to memory exhaustion
  • 10:13 AM: Service completely down (no healthy backends)
  • 10:15 AM: On-call engineer paged
  • 10:20 AM: Identified health check misconfiguration
  • 10:25 AM: Health check path corrected
  • 10:30 AM: All servers marked healthy, traffic restored
  • 10:35 AM: Service fully recovered

Symptoms

What We Saw:

  • Error Rate: Jumped from 0.1% to 100% in 3 minutes
  • Response Time: All requests timing out
  • Server Load: API Server 1 at 100% CPU, others at 0%
  • Health Check Status: All servers showing as unhealthy in load balancer
  • User Impact: Complete service outage for 20 minutes, ~200K requests failed

How We Detected It:

  • Alert fired when error rate exceeded 10%
  • Dashboard showed all servers as unhealthy
  • Load balancer metrics showed traffic only to one server

Monitoring Gaps:

  • No alert for load balancer health check failures
  • No alert for uneven traffic distribution
  • No alert for single server handling all traffic

Root Cause Analysis

Primary Cause: Misconfigured health check path in load balancer.

What Happened:

  1. Health check path changed from /health to /api/health (typo: an unintended /api prefix)
  2. Load balancer tried to check /api/health but servers only had /health endpoint
  3. All health checks returned 404, marking all servers as unhealthy
  4. Load balancer routing algorithm switched to "least connections" mode
  5. API Server 1 had one existing connection, so all new traffic routed there
  6. API Server 1 received 100% of traffic (5M requests/day, or ~58 req/sec on average)
  7. Server couldn't handle the load and crashed
  8. With no healthy backends, service went completely down
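
To make the health-check part of that chain concrete, here is a conceptual sketch of how a checker with a failure threshold reacts to the bad path. This is illustrative TypeScript, not how the ALB is actually implemented, and the interval, threshold, and hostname are made up:

    const HEALTH_PATH = "/api/health";   // the misconfigured path from the incident
    const UNHEALTHY_THRESHOLD = 2;       // consecutive failures before marking unhealthy
    const INTERVAL_MS = 30_000;

    let consecutiveFailures = 0;
    let healthy = true;

    async function checkTarget(baseUrl: string): Promise<void> {
      let ok = false;
      try {
        const res = await fetch(`${baseUrl}${HEALTH_PATH}`);
        ok = res.ok; // a 404 (path missing on the backend) means ok === false
      } catch {
        ok = false;  // network errors count as failures too
      }
      if (ok) {
        consecutiveFailures = 0;
        healthy = true;
      } else {
        consecutiveFailures += 1;
        if (consecutiveFailures >= UNHEALTHY_THRESHOLD && healthy) {
          healthy = false;
          console.warn(`Target marked unhealthy after ${consecutiveFailures} failed checks`);
        }
      }
    }

    // Every target fails the same way, so every target ends up unhealthy.
    setInterval(() => checkTarget("http://api-server-1:3000"), INTERVAL_MS);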

Why It Was So Bad:

  • No health check validation: Configuration change wasn't tested
  • No traffic distribution monitoring: We didn't notice uneven routing
  • No circuit breaker: Load balancer kept sending traffic to the failing server
  • Single point of failure: One server crash took down entire service

Contributing Factors:

  • Manual configuration change (not automated)
  • No staging environment for infrastructure changes
  • No health check endpoint validation
  • Load balancer routing algorithm not understood

Fix & Mitigation

Immediate Fixes (During Incident):

  1. Corrected health check path: Changed from /api/health to /health
  2. Manually marked servers healthy: Temporarily bypassed health checks
  3. Restarted API Server 1: Brought crashed server back online

Long-Term Improvements:

  1. Health Check Validation:

    • Added automated tests for health check endpoints
    • Added staging environment for infrastructure changes
    • Added health check configuration validation
  2. Traffic Distribution Monitoring:

    • Added alert for uneven traffic distribution (alert if one server handles > 60% traffic)
    • Added dashboard for load balancer routing metrics
    • Added per-server request rate monitoring
  3. Infrastructure as Code:

    • Moved load balancer config to Terraform (a config-as-code sketch follows this list)
    • Added infrastructure change review process
    • Added automated infrastructure testing
  4. Process Improvements:

    • Required staging environment testing for all infrastructure changes
    • Added runbook for load balancer incidents
    • Added infrastructure change approval process
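
As a flavor of what the config-as-code change looks like, here is a minimal sketch using the AWS CDK in TypeScript rather than our actual Terraform; the construct names, port, and threshold values are illustrative. The point is that the health check path lives in reviewed, version-controlled code instead of a console form:

    import { Stack, StackProps, Duration } from "aws-cdk-lib";
    import { Construct } from "constructs";
    import * as ec2 from "aws-cdk-lib/aws-ec2";
    import * as elbv2 from "aws-cdk-lib/aws-elasticloadbalancingv2";

    export class ApiLoadBalancerStack extends Stack {
      constructor(scope: Construct, id: string, props?: StackProps) {
        super(scope, id, props);

        const vpc = new ec2.Vpc(this, "Vpc", { maxAzs: 2 });

        // Health check path defined once, in code, reviewed like any other change.
        const targetGroup = new elbv2.ApplicationTargetGroup(this, "ApiTargets", {
          vpc,
          port: 3000,
          protocol: elbv2.ApplicationProtocol.HTTP,
          healthCheck: {
            path: "/health",
            interval: Duration.seconds(30),
            timeout: Duration.seconds(5),
            healthyThresholdCount: 2,
            unhealthyThresholdCount: 2,
          },
        });

        const alb = new elbv2.ApplicationLoadBalancer(this, "ApiAlb", {
          vpc,
          internetFacing: true,
        });
        alb.addListener("Http", { port: 80, defaultTargetGroups: [targetGroup] });
      }
    }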

Architecture After Fix

graph TB
    Client[Client] --> LB[Load Balancer<br/>AWS ALB<br/>Health: /health]
    LB --> API1[API Server 1<br/>Health Check]
    LB --> API2[API Server 2<br/>Health Check]
    LB --> API3[API Server 3<br/>Health Check]
    API1 --> DB[(Database)]
    API2 --> DB
    API3 --> DB
    LB --> Monitor[Traffic Distribution<br/>Monitoring]

Key Changes:

  • Validated health check endpoints
  • Traffic distribution monitoring
  • Infrastructure as code
  • Staging environment for infrastructure changes

Key Lessons

  1. Infrastructure configuration matters: A small typo in config can cause complete outages. Test infrastructure changes like code changes.

  2. Monitor traffic distribution: Uneven traffic distribution is a red flag. Alert if one server handles more than 60% of traffic.

  3. Health checks must be correct: Health check failures can cascade into complete outages. Validate health check endpoints.

  4. Use infrastructure as code: Manual configuration changes are error-prone. Use Terraform, CloudFormation, or similar tools.

  5. Test in staging first: Always test infrastructure changes in staging before production.


Interview Takeaways

Common Questions:

  • "How do you configure load balancers?"
  • "What happens when a load balancer health check fails?"
  • "How do you ensure even traffic distribution?"

What Interviewers Are Looking For:

  • Understanding of load balancer configuration
  • Knowledge of health check mechanisms
  • Awareness of traffic distribution patterns
  • Experience with infrastructure failures

What a Senior Engineer Would Do Differently

From the Start:

  1. Validate health checks: Test health check endpoints before deploying
  2. Monitor traffic distribution: Alert on uneven routing
  3. Use infrastructure as code: Automate configuration, reduce human error
  4. Test in staging: Always test infrastructure changes in staging first
  5. Add redundancy: Multiple load balancers, multiple availability zones

The Real Lesson: Infrastructure is code. Treat configuration changes with the same rigor as application code changes. Test, review, and monitor everything.


FAQs

Q: How do you configure load balancer health checks?

A: Health checks should hit a lightweight endpoint (like /health) that returns 200 OK when the server is healthy. Configure appropriate timeout, interval, and failure threshold values. Always test health check endpoints before deploying.

Q: What happens when all backend servers fail health checks?

A: The load balancer marks all servers as unhealthy and stops routing traffic. This causes a complete service outage. Always have multiple healthy backends and monitor health check status.

Q: How do you ensure even traffic distribution?

A: Monitor request rates per server and alert if distribution is uneven. Use appropriate load balancing algorithms (round-robin, least connections, etc.). Avoid sticky sessions unless necessary.
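
As a rough sketch of the skew alert we added (the per-server request counts below are hypothetical; in practice they would come from your metrics system):

    interface ServerRequestCount {
      serverId: string;
      requests: number;
    }

    const SKEW_THRESHOLD = 0.6; // alert if one server handles more than 60% of traffic

    function findSkewedServer(counts: ServerRequestCount[]): ServerRequestCount | null {
      const total = counts.reduce((sum, c) => sum + c.requests, 0);
      if (total === 0) return null;
      return counts.find((c) => c.requests / total > SKEW_THRESHOLD) ?? null;
    }

    // The incident's traffic pattern would have tripped this immediately.
    const skewed = findSkewedServer([
      { serverId: "api-1", requests: 10_400 },
      { serverId: "api-2", requests: 12 },
      { serverId: "api-3", requests: 9 },
    ]);
    if (skewed) {
      console.warn(`Traffic skew: ${skewed.serverId} is handling most of the requests`);
    }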

Q: Should you use multiple load balancers?

A: For high availability, yes. Use multiple load balancers across availability zones. For smaller services, a single load balancer with multiple backends might be sufficient, but monitor closely.

Q: How do you test infrastructure changes?

A: Use staging environments that mirror production. Test health checks, traffic distribution, and failover scenarios. Use infrastructure as code for reproducible testing.
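
A minimal pre-deploy check along these lines would have caught this incident's typo in staging; the staging hostnames below are hypothetical, and the health check path is assumed to be shared with the load balancer config:

    const HEALTH_CHECK_PATH = "/health"; // the same value the load balancer config uses
    const STAGING_BACKENDS = [
      "http://staging-api-1:3000",
      "http://staging-api-2:3000",
    ];

    async function validateHealthCheckPath(): Promise<void> {
      for (const backend of STAGING_BACKENDS) {
        const res = await fetch(`${backend}${HEALTH_CHECK_PATH}`);
        if (res.status !== 200) {
          // A typo like /api/health fails here, before it ever reaches production.
          throw new Error(`${backend}${HEALTH_CHECK_PATH} returned ${res.status}`);
        }
      }
      console.log("Health check path is valid on all staging backends");
    }

    validateHealthCheckPath().catch((err) => {
      console.error(err);
      process.exit(1);
    });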

Q: What's the difference between health checks and readiness probes?

A: Health (liveness) checks tell the load balancer whether a server process is alive at all; readiness probes tell it whether the server can actually handle requests right now (it might still be starting up or waiting on a dependency such as the database). A server can be alive but not yet ready, so both matter for proper load balancing.
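
A sketch of how the two can look in the same Express app (the database flag and startup delay below are stand-ins for real dependency checks):

    import express from "express";

    const app = express();
    let dbConnected = false; // flipped once the (hypothetical) DB pool is ready

    // Liveness: the process is up. Keep it dependency-free.
    app.get("/health", (_req, res) => res.sendStatus(200));

    // Readiness: the server can actually serve requests right now.
    app.get("/ready", (_req, res) => res.sendStatus(dbConnected ? 200 : 503));

    app.listen(3000, () => {
      // Stand-in for awaiting the real database connection on startup.
      setTimeout(() => { dbConnected = true; }, 5_000);
    });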

Q: How do you prevent single points of failure in load balancing?

A: Use multiple load balancers, multiple availability zones, multiple backend servers, and proper health checks. Monitor traffic distribution and alert on anomalies. Design for failure, not just success.

Keep exploring

Real engineering stories work best when combined with practice. Explore more stories or apply what you've learned in our system design practice platform.