Real Engineering Stories

The Misconfigured Load Balancer That Created a Single Point of Failure

A production incident where a misconfigured load balancer sent all traffic to a single backend server, causing it to crash and take down the service. Learn about load balancer configuration, health checks, and redundancy.

Intermediate · 20 min read

This is a story about how a simple configuration mistake in a load balancer created a single point of failure that took down our entire service. It's also about why infrastructure configuration matters as much as application code, and how we learned to test our infrastructure changes.


Context

We were running a REST API service that handled user authentication, profile management, and data queries. The system served about 5M requests per day, with traffic distributed across multiple backend servers behind a load balancer.

Original Architecture:

graph TB
    Client[Client] --> LB[Load Balancer<br/>AWS ALB]
    LB --> API1[API Server 1]
    LB --> API2[API Server 2]
    LB --> API3[API Server 3]
    API1 --> DB[(Database)]
    API2 --> DB
    API3 --> DB

Technology Choices:

  • Load Balancer: AWS Application Load Balancer (ALB)
  • Backend: Node.js API servers (3 instances)
  • Database: PostgreSQL with read replicas
  • Health Check: HTTP GET /health endpoint
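
The health check endpoint itself is deliberately lightweight. As a minimal sketch, assuming an Express app on Node.js (the port and response body here are illustrative, not our exact implementation):

    // Minimal liveness endpoint -- returns 200 without touching any downstream
    // dependency, so it only fails when the process itself is in trouble.
    import express from "express";

    const app = express();

    app.get("/health", (_req, res) => {
      res.status(200).json({ status: "ok" });
    });

    app.listen(3000);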

Assumptions Made:

  • Load balancer would distribute traffic evenly
  • Health checks would properly detect unhealthy servers
  • Multiple backend servers provided redundancy

The Incident

Timeline:

  • 10:00 AM: Infrastructure team updated load balancer health check configuration
  • 10:05 AM: Health check path changed from /health to /api/health (typo in config)
  • 10:06 AM: Load balancer marked all servers as unhealthy (404 on health check)
  • 10:07 AM: Load balancer routing algorithm changed to "least connections"
  • 10:08 AM: All traffic routed to API Server 1 (only server that had a connection)
  • 10:10 AM: API Server 1 CPU usage spiked to 100%
  • 10:12 AM: API Server 1 crashed due to memory exhaustion
  • 10:13 AM: Service completely down (no healthy backends)
  • 10:15 AM: On-call engineer paged
  • 10:20 AM: Identified health check misconfiguration
  • 10:25 AM: Health check path corrected
  • 10:30 AM: All servers marked healthy, traffic restored
  • 10:35 AM: Service fully recovered

Symptoms

What We Saw:

  • Error Rate: Jumped from 0.1% to 100% in 3 minutes
  • Response Time: All requests timing out
  • Server Load: API Server 1 at 100% CPU, others at 0%
  • Health Check Status: All servers showing as unhealthy in load balancer
  • User Impact: Complete service outage for 20 minutes, ~200K requests failed

How We Detected It:

  • Alert fired when error rate exceeded 10%
  • Dashboard showed all servers as unhealthy
  • Load balancer metrics showed traffic only to one server

Monitoring Gaps:

  • No alert for load balancer health check failures
  • No alert for uneven traffic distribution
  • No alert for single server handling all traffic

Root Cause Analysis

Primary Cause: Misconfigured health check path in load balancer.

What Happened:

  1. Health check path changed from /health to /api/health (typo: an unintended /api prefix)
  2. Load balancer tried to check /api/health but servers only had /health endpoint
  3. All health checks returned 404, marking all servers as unhealthy
  4. Load balancer routing algorithm switched to "least connections" mode
  5. API Server 1 had one existing connection, so all new traffic routed there
  6. API Server 1 received 100% of traffic (5M requests/day, or ~58 req/sec on average)
  7. Server couldn't handle the load and crashed
  8. With no healthy backends, service went completely down
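
To make the health-check part of that chain concrete, here is a conceptual sketch of how a checker with a failure threshold reacts to the bad path. This is illustrative TypeScript, not how the ALB is actually implemented, and the interval, threshold, and hostname are made up:

    const HEALTH_PATH = "/api/health";   // the misconfigured path from the incident
    const UNHEALTHY_THRESHOLD = 2;       // consecutive failures before marking unhealthy
    const INTERVAL_MS = 30_000;

    let consecutiveFailures = 0;
    let healthy = true;

    async function checkTarget(baseUrl: string): Promise<void> {
      let ok = false;
      try {
        const res = await fetch(`${baseUrl}${HEALTH_PATH}`);
        ok = res.ok; // a 404 (path missing on the backend) means ok === false
      } catch {
        ok = false;  // network errors count as failures too
      }
      if (ok) {
        consecutiveFailures = 0;
        healthy = true;
      } else {
        consecutiveFailures += 1;
        if (consecutiveFailures >= UNHEALTHY_THRESHOLD && healthy) {
          healthy = false;
          console.warn(`Target marked unhealthy after ${consecutiveFailures} failed checks`);
        }
      }
    }

    // Every target fails the same way, so every target ends up unhealthy.
    setInterval(() => checkTarget("http://api-server-1:3000"), INTERVAL_MS);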

Why It Was So Bad:

  • No health check validation: Configuration change wasn't tested
  • No traffic distribution monitoring: We didn't notice uneven routing
  • No circuit breaker: Load balancer kept sending traffic to the failing server
  • Single point of failure: One server crash took down entire service

Contributing Factors:

  • Manual configuration change (not automated)
  • No staging environment for infrastructure changes
  • No health check endpoint validation
  • Load balancer routing algorithm not understood

Fix & Mitigation

Immediate Fixes (During Incident):

  1. Corrected health check path: Changed from /api/health to /health
  2. Manually marked servers healthy: Temporarily bypassed health checks
  3. Restarted API Server 1: Brought crashed server back online

Long-Term Improvements:

  1. Health Check Validation:

    • Added automated tests for health check endpoints
    • Added staging environment for infrastructure changes
    • Added health check configuration validation
  2. Traffic Distribution Monitoring:

    • Added alert for uneven traffic distribution (alert if one server handles > 60% traffic)
    • Added dashboard for load balancer routing metrics
    • Added per-server request rate monitoring
  3. Infrastructure as Code:

    • Moved load balancer config to Terraform (a config-as-code sketch follows this list)
    • Added infrastructure change review process
    • Added automated infrastructure testing
  4. Process Improvements:

    • Required staging environment testing for all infrastructure changes
    • Added runbook for load balancer incidents
    • Added infrastructure change approval process
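
As a flavor of what the config-as-code change looks like, here is a minimal sketch using the AWS CDK in TypeScript rather than our actual Terraform; the construct names, port, and threshold values are illustrative. The point is that the health check path lives in reviewed, version-controlled code instead of a console form:

    import { Stack, StackProps, Duration } from "aws-cdk-lib";
    import { Construct } from "constructs";
    import * as ec2 from "aws-cdk-lib/aws-ec2";
    import * as elbv2 from "aws-cdk-lib/aws-elasticloadbalancingv2";

    export class ApiLoadBalancerStack extends Stack {
      constructor(scope: Construct, id: string, props?: StackProps) {
        super(scope, id, props);

        const vpc = new ec2.Vpc(this, "Vpc", { maxAzs: 2 });

        // Health check path defined once, in code, reviewed like any other change.
        const targetGroup = new elbv2.ApplicationTargetGroup(this, "ApiTargets", {
          vpc,
          port: 3000,
          protocol: elbv2.ApplicationProtocol.HTTP,
          healthCheck: {
            path: "/health",
            interval: Duration.seconds(30),
            timeout: Duration.seconds(5),
            healthyThresholdCount: 2,
            unhealthyThresholdCount: 2,
          },
        });

        const alb = new elbv2.ApplicationLoadBalancer(this, "ApiAlb", {
          vpc,
          internetFacing: true,
        });
        alb.addListener("Http", { port: 80, defaultTargetGroups: [targetGroup] });
      }
    }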

Architecture After Fix

graph TB
    Client[Client] --> LB[Load Balancer<br/>AWS ALB<br/>Health: /health]
    LB --> API1[API Server 1<br/>Health Check]
    LB --> API2[API Server 2<br/>Health Check]
    LB --> API3[API Server 3<br/>Health Check]
    API1 --> DB[(Database)]
    API2 --> DB
    API3 --> DB
    LB --> Monitor[Traffic Distribution<br/>Monitoring]

Key Changes:

  • Validated health check endpoints
  • Traffic distribution monitoring
  • Infrastructure as code
  • Staging environment for infrastructure changes

Key Lessons

  1. Infrastructure configuration matters: A small typo in config can cause complete outages. Test infrastructure changes like code changes.

  2. Monitor traffic distribution: Uneven traffic distribution is a red flag. Alert if one server handles more than 60% of traffic.

  3. Health checks must be correct: Health check failures can cascade into complete outages. Validate health check endpoints.

  4. Use infrastructure as code: Manual configuration changes are error-prone. Use Terraform, CloudFormation, or similar tools.

  5. Test in staging first: Always test infrastructure changes in staging before production.


Interview Takeaways

Common Questions:

  • "How do you configure load balancers?"
  • "What happens when a load balancer health check fails?"
  • "How do you ensure even traffic distribution?"

What Interviewers Are Looking For:

  • Understanding of load balancer configuration
  • Knowledge of health check mechanisms
  • Awareness of traffic distribution patterns
  • Experience with infrastructure failures

What a Senior Engineer Would Do Differently

From the Start:

  1. Validate health checks: Test health check endpoints before deploying
  2. Monitor traffic distribution: Alert on uneven routing
  3. Use infrastructure as code: Automate configuration, reduce human error
  4. Test in staging: Always test infrastructure changes in staging first
  5. Add redundancy: Multiple load balancers, multiple availability zones

The Real Lesson: Infrastructure is code. Treat configuration changes with the same rigor as application code changes. Test, review, and monitor everything.


FAQs

Q: How do you configure load balancer health checks?

A: Health checks should hit a lightweight endpoint (like /health) that returns 200 OK when the server is healthy. Configure appropriate timeout, interval, and failure threshold values. Always test health check endpoints before deploying.

Q: What happens when all backend servers fail health checks?

A: The load balancer marks all servers as unhealthy and stops routing traffic. This causes a complete service outage. Always have multiple healthy backends and monitor health check status.

Q: How do you ensure even traffic distribution?

A: Monitor request rates per server and alert if distribution is uneven. Use appropriate load balancing algorithms (round-robin, least connections, etc.). Avoid sticky sessions unless necessary.
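
As a rough sketch of the skew alert we added (the per-server request counts below are hypothetical; in practice they would come from your metrics system):

    interface ServerRequestCount {
      serverId: string;
      requests: number;
    }

    const SKEW_THRESHOLD = 0.6; // alert if one server handles more than 60% of traffic

    function findSkewedServer(counts: ServerRequestCount[]): ServerRequestCount | null {
      const total = counts.reduce((sum, c) => sum + c.requests, 0);
      if (total === 0) return null;
      return counts.find((c) => c.requests / total > SKEW_THRESHOLD) ?? null;
    }

    // The incident's traffic pattern would have tripped this immediately.
    const skewed = findSkewedServer([
      { serverId: "api-1", requests: 10_400 },
      { serverId: "api-2", requests: 12 },
      { serverId: "api-3", requests: 9 },
    ]);
    if (skewed) {
      console.warn(`Traffic skew: ${skewed.serverId} is handling most of the requests`);
    }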

Q: Should you use multiple load balancers?

A: For high availability, yes. Use multiple load balancers across availability zones. For smaller services, a single load balancer with multiple backends might be sufficient, but monitor closely.

Q: How do you test infrastructure changes?

A: Use staging environments that mirror production. Test health checks, traffic distribution, and failover scenarios. Use infrastructure as code for reproducible testing.
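
A minimal pre-deploy check along these lines would have caught this incident's typo in staging; the staging hostnames below are hypothetical, and the health check path is assumed to be shared with the load balancer config:

    const HEALTH_CHECK_PATH = "/health"; // the same value the load balancer config uses
    const STAGING_BACKENDS = [
      "http://staging-api-1:3000",
      "http://staging-api-2:3000",
    ];

    async function validateHealthCheckPath(): Promise<void> {
      for (const backend of STAGING_BACKENDS) {
        const res = await fetch(`${backend}${HEALTH_CHECK_PATH}`);
        if (res.status !== 200) {
          // A typo like /api/health fails here, before it ever reaches production.
          throw new Error(`${backend}${HEALTH_CHECK_PATH} returned ${res.status}`);
        }
      }
      console.log("Health check path is valid on all staging backends");
    }

    validateHealthCheckPath().catch((err) => {
      console.error(err);
      process.exit(1);
    });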

Q: What's the difference between health checks and readiness probes?

A: Health (liveness) checks tell the load balancer whether a server process is alive at all; readiness probes tell it whether the server can actually handle requests right now (it might still be starting up or waiting on a dependency such as the database). A server can be alive but not yet ready, so both matter for proper load balancing.
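
A sketch of how the two can look in the same Express app (the database flag and startup delay below are stand-ins for real dependency checks):

    import express from "express";

    const app = express();
    let dbConnected = false; // flipped once the (hypothetical) DB pool is ready

    // Liveness: the process is up. Keep it dependency-free.
    app.get("/health", (_req, res) => res.sendStatus(200));

    // Readiness: the server can actually serve requests right now.
    app.get("/ready", (_req, res) => res.sendStatus(dbConnected ? 200 : 503));

    app.listen(3000, () => {
      // Stand-in for awaiting the real database connection on startup.
      setTimeout(() => { dbConnected = true; }, 5_000);
    });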

Q: How do you prevent single points of failure in load balancing?

A: Use multiple load balancers, multiple availability zones, multiple backend servers, and proper health checks. Monitor traffic distribution and alert on anomalies. Design for failure, not just success.

Keep exploring

Real engineering stories work best when combined with practice. Explore more stories or apply what you've learned in our system design practice platform.