
Real Engineering Stories

The DNS Change That Pointed Production at Staging

A DNS configuration mistake during a migration pointed production traffic at the staging database, causing data-corruption scares and a frantic rollback that took over an hour to fully take effect.

Medium · 22 min read

This is a story about how a single DNS record change—copy-pasted to the wrong environment—routed production traffic to staging for 45 minutes. It's about change management, DNS propagation, and why "it's just a config change" can be the most dangerous phrase in operations.


Context

We were migrating our API to a new region. The plan was simple: update DNS to point at the new servers. Staging was migrated and tested first; the production change was scheduled for a maintenance window.
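A standard precaution for a cutover like this is to lower the record's TTL well before the maintenance window, so that a bad change ages out of resolver caches quickly, then raise it again once the migration is verified. A hypothetical BIND-style zone snippet (the names and addresses are illustrative, not from the incident):

```
; Days before the migration window: drop TTL from 1h to 60s so any
; bad change ages out of resolver caches within about a minute.
api.example.com.    60    IN    A    203.0.113.10

; After the cutover is verified stable: raise the TTL back.
api.example.com.    3600  IN    A    198.51.100.20
```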

What Went Wrong:

  • Engineer had both staging and production DNS configs open
  • Applied the staging DNS change to production by mistake
  • Production traffic started hitting staging infrastructure
  • Staging database began receiving production writes

The Incident

  • T+0 — DNS change applied to production (wrong config)
  • T+5 min — First production requests hit staging; staging DB receiving production writes
  • T+15 min — Users reported "wrong data" and "missing features"
  • T+25 min — On-call identified the DNS misconfiguration and initiated rollback
  • T+45 min — DNS reverted; propagation took 20 more minutes
  • T+65 min — All traffic restored to the correct environment

Key Lessons

  1. DNS propagation is slow and unpredictable—TTL matters, rollback takes time.
  2. Change management: Require approval for production DNS. Use different accounts for staging vs prod.
  3. Environment isolation: Staging and production should be obviously different—different domains, different credentials.
  4. Runbooks: Document exactly which config goes where. One wrong click can be catastrophic.
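The TTL arithmetic behind lesson 1 can be made concrete: a rollback only takes full effect once every resolver's cached copy of the bad record expires, so the exposure window is roughly the time until the revert plus one TTL. A small illustrative helper (the function and its numbers are mine, not from the incident tooling):

```python
def worst_case_rollback_seconds(record_ttl: int, revert_delay: int) -> int:
    """Estimate how long stale answers can persist after a DNS revert.

    record_ttl: TTL (seconds) on the record while the bad change was live.
    revert_delay: seconds between the bad change and the revert landing.

    A resolver that cached the bad answer just before the revert will keep
    serving it for up to one full TTL afterward, so the total exposure is
    the revert delay plus one TTL.
    """
    return revert_delay + record_ttl

# In this incident the revert landed at T+45 min; with a ~20-minute TTL,
# stale answers could persist until roughly T+65 min -- which matches
# the observed timeline.
print(worst_case_rollback_seconds(20 * 60, 45 * 60) / 60)  # → 65.0
```

The practical takeaway is that the rollback clock starts counting from the TTL in effect when the mistake was made, which is why lowering TTLs before risky changes (not after) matters.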
