This is a story about how a single DNS record change, copy-pasted to the wrong environment, routed production traffic to staging for 45 minutes before the change was reverted, with full recovery only at the 65-minute mark. It's about change management, DNS propagation, and why "it's just a config change" can be the most dangerous phrase in operations.
Context
We were migrating our API to a new region. The plan: update DNS to point to the new servers. The change was tested in staging first, and the production change was scheduled for a maintenance window.
What Went Wrong
- Engineer had both staging and production DNS configs open
- Applied the staging DNS change to production by mistake
- Production traffic started hitting staging infrastructure
- Staging database began receiving production writes
The Incident
- T+0: DNS change applied to production (wrong config).
- T+5 min: First production requests hit staging. Staging DB starts receiving production data.
- T+15 min: Users report 'wrong data' and 'missing features'.
- T+25 min: On-call identifies the DNS misconfiguration and initiates rollback.
- T+45 min: DNS reverted. Propagation takes 20 more minutes.
- T+65 min: All traffic restored to the correct environment.
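The gap between the revert and full recovery is mostly a function of the record's TTL. A resolver that cached the bad answer just before the revert can keep serving it for up to one full TTL. The sketch below works through that arithmetic. The story does not state the actual TTL, so the values here are illustrative assumptions; 1200 seconds is included only because it happens to reproduce the 20-minute tail in the timeline.

```python
def worst_case_recovery_minutes(revert_applied_at_min: float, ttl_seconds: int) -> float:
    """Rough upper bound on when TTL-respecting resolvers stop serving the bad record.

    A resolver may have cached the bad answer immediately before the revert,
    so it can keep returning it for one full TTL after the revert is applied.
    """
    return revert_applied_at_min + ttl_seconds / 60


# Revert applied at T+45 min (from the timeline); TTL values are assumptions.
for ttl in (60, 300, 1200):
    recovered_at = worst_case_recovery_minutes(45, ttl)
    print(f"TTL {ttl:>4}s -> worst-case recovery at T+{recovered_at:.0f} min")
```

Under these assumptions, lowering the TTL well ahead of a planned change shortens the rollback tail, at the cost of more queries to the authoritative servers while the low TTL is in effect.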
Key Lessons
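- DNS propagation is slow and unpredictable: resolvers can keep serving a cached record until its TTL expires, so a rollback does not take effect the moment it is applied. Here the revert needed 20 more minutes to propagate.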
- Change management: Require approval for production DNS. Use different accounts for staging vs prod.
- Environment isolation: Staging and production should be obviously different—different domains, different credentials.
- Runbooks: Document exactly which config goes where. One wrong click can be catastrophic.
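One way to make "which config goes where" checkable by a machine rather than by memory is a pre-apply guard that refuses any change whose target does not belong to the environment it is being applied to. This is a minimal sketch, not the team's actual tooling: the `DnsChange` type, the `validate_change` function, and the `example.net` domain suffixes are all hypothetical, and the sketch assumes staging and production targets live under distinct domains, as the isolation lesson suggests.

```python
from dataclasses import dataclass

# Hypothetical convention: each environment's targets live under its own suffix.
ALLOWED_TARGET_SUFFIX = {
    "staging": ".staging.example.net",
    "production": ".prod.example.net",
}


@dataclass(frozen=True)
class DnsChange:
    environment: str  # environment the change is being applied to
    record_name: str  # record being modified
    target: str       # hostname the record will point at


def validate_change(change: DnsChange) -> None:
    """Raise ValueError if the target does not belong to the change's environment."""
    suffix = ALLOWED_TARGET_SUFFIX.get(change.environment)
    if suffix is None:
        raise ValueError(f"unknown environment: {change.environment!r}")
    # Tolerate a trailing dot on fully qualified names before comparing.
    if not change.target.rstrip(".").endswith(suffix):
        raise ValueError(
            f"{change.record_name}: target {change.target!r} is not a "
            f"{change.environment} host (expected suffix {suffix!r})"
        )


# A staging target pasted into a production change is rejected before apply.
bad = DnsChange("production", "api.example.net", "lb.staging.example.net")
try:
    validate_change(bad)
except ValueError as err:
    print(f"blocked: {err}")
```

A check like this only helps if it runs in the path every change must take, for example as a required step before the approval for production DNS is granted.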