This is a story about the day we learned that our "high availability" database setup wasn't. The primary crashed. The replica was supposed to take over. It didn't. We discovered that failover was a theory we'd never actually tested.
Context
Setup:
- Primary PostgreSQL + 2 replicas (sync replication to replica 1, async to replica 2)
- Automatic failover via Patroni/etcd
- Last failover test: 18 months ago (manual, during maintenance)
- Replication lag monitoring: Alert at 10 seconds (we'd never hit it)
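The failover-relevant part of this topology can be sketched as a Patroni configuration fragment. This is illustrative, not our production file; the key names follow Patroni's DCS settings (`synchronous_mode`, `maximum_lag_on_failover`), but verify them against your Patroni version's documentation:

```yaml
# Illustrative Patroni DCS settings for this topology (values are examples).
scope: app-db
bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    synchronous_mode: true          # replica 1 acts as a synchronous standby
    maximum_lag_on_failover: 1048576  # bytes of WAL lag beyond which Patroni
                                      # refuses to promote a replica
```

The `maximum_lag_on_failover` ceiling is exactly the gate that later blocked promotion during the incident.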
Assumptions Made:
- Failover would work when needed
- Replicas were always in sync
- The failover automation was correct (we'd set it up once)
The Incident
- 2:14 AM: Primary database OOM-killed (memory leak in a long-running query)
- 2:14 AM: Patroni attempted failover. Replica 1 promotion failed: replication lag was 45 seconds
- 2:15 AM: Tried Replica 2. Promotion failed: the sync_check step found missing transactions
- 2:20 AM: On-call paged. No database. Full outage
- 2:45 AM: Primary restarted manually: roughly 30 minutes of crash recovery, then the replicas had to catch up
- 3:44 AM: Primary back online. A 90-minute outage with zero automatic failover
Root Cause Analysis
Why failover failed:
- Replica 1: Under heavy read load, replication had fallen 45 seconds behind. Patroni refused to promote it (data-loss risk)
- Replica 2: An async replica that had never been validated for promotion; the failover script assumed a sync replica
- Failover script bug: A hardcoded assumption about replica roles. The script had never been run in production
- No regular testing: Failover hadn't been exercised in 18 months, and the configuration had drifted
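Patroni's promotion decision is driven by WAL positions (LSNs), not wall-clock seconds: it compares how many bytes of WAL the replica still has to replay against its configured ceiling. A minimal sketch of that arithmetic (the helper names are ours, and the 1 MiB default mirrors Patroni's documented `maximum_lag_on_failover` default):

```python
def parse_lsn(lsn: str) -> int:
    """Convert a PostgreSQL LSN like '16/B374D848' to an absolute byte offset."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)

def lag_bytes(primary_lsn: str, replica_lsn: str) -> int:
    """Bytes of WAL the replica has not yet replayed."""
    return parse_lsn(primary_lsn) - parse_lsn(replica_lsn)

def safe_to_promote(primary_lsn: str, replica_lsn: str,
                    max_lag: int = 1_048_576) -> bool:
    """Promotion gate: refuse if the replica is further behind than max_lag."""
    return lag_bytes(primary_lsn, replica_lsn) <= max_lag
```

Forty-five seconds of lag under heavy write load translates to far more than a megabyte of unreplayed WAL, which is why the gate (correctly) refused to promote.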
Fix & Mitigation
- Monthly failover drills: Kill the primary on purpose (chaos-style) and verify that failover completes
- Replication lag alerting: Warn at 1 second of lag; page at 5 seconds
- Failover runbook: Documented manual failover steps. Tested quarterly
- Replica capacity: Replicas now have headroom for replication during read spikes
- Automated validation: Health check that verifies failover path weekly
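The weekly validation can be a small script against Patroni's REST API (`GET /cluster`). A sketch under assumptions: the member field names (`role`, `state`, `lag`) and lag units follow Patroni's cluster endpoint but vary by version, so check yours; the function and threshold names are ours.

```python
import json
from urllib.request import urlopen

# Same ceiling as the promotion gate: a replica further behind than this
# could not be promoted, so the failover path is effectively broken.
MAX_LAG = 1_048_576

def failover_path_ok(cluster: dict, max_lag: int = MAX_LAG) -> list[str]:
    """Return a list of problems; an empty list means failover looks viable."""
    problems = []
    replicas = [m for m in cluster.get("members", []) if m.get("role") != "leader"]
    if not replicas:
        problems.append("no replicas registered")
    for m in replicas:
        if m.get("state") not in ("running", "streaming"):
            problems.append(f"{m.get('name')}: state is {m.get('state')}")
        elif m.get("lag", 0) > max_lag:
            problems.append(f"{m.get('name')}: lag {m['lag']} exceeds {max_lag}")
    return problems

def check_cluster(url: str = "http://localhost:8008/cluster") -> list[str]:
    """Fetch cluster state from the Patroni REST API and validate it."""
    with urlopen(url) as resp:
        return failover_path_ok(json.load(resp))
```

Wiring `check_cluster` into a weekly cron job that pages on a non-empty result would have caught both the lagging sync replica and the never-promotable async replica long before 2 AM.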
Key Lessons
- If you haven't tested failover, you don't have failover; you have hope
- Replication lag is a silent killer: monitor it like your life depends on it
- Async replicas are for read scaling, not failover, unless you've tested the promotion and accepted the data loss
- HA systems rot: config drifts, code changes, assumptions break. Test constantly.