Real Engineering Stories

The DNS Change That Pointed Production at Staging

Q: Why didn't a 5-minute TTL make rollback fast?

A TTL is the authoritative server's instruction for how long a record may be cached; it is not a guarantee. Recursive resolvers, corporate proxies, and mobile OS DNS caches often hold records longer. During our incident, ~40% of traffic still hit staging **20 minutes** after Route 53 was corrected.

Q: How do you prevent applying staging config to production?

Use **separate AWS accounts**, Terraform workspace guards that fail CI on mismatch, a mandatory second approver for DNS changes, and visually distinct target hostnames. Never rely on "I'll double-check the tab."
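A workspace guard of this kind can be sketched in a few lines. This is a minimal illustration, not the guard we actually ran: the account IDs are placeholders, and `check_environment` is a hypothetical helper. In CI, the workspace value could come from `terraform workspace show` and the account ID from `aws sts get-caller-identity`.

```python
# Hypothetical mapping of environment name to AWS account ID (placeholder values).
EXPECTED_ACCOUNTS = {
    "staging": "111111111111",
    "production": "222222222222",
}


def check_environment(workspace: str, account_id: str, target_env: str) -> list[str]:
    """Return a list of mismatch errors; an empty list means the apply may proceed."""
    errors = []
    # The selected Terraform workspace must be the environment we intend to change.
    if workspace != target_env:
        errors.append(f"workspace {workspace!r} does not match target {target_env!r}")
    # The active credentials must belong to that environment's account.
    expected = EXPECTED_ACCOUNTS.get(target_env)
    if expected is None:
        errors.append(f"unknown target environment {target_env!r}")
    elif account_id != expected:
        errors.append(f"account {account_id} is not the {target_env} account {expected}")
    return errors
```

A CI step would call `check_environment` before `terraform apply` and exit non-zero if the returned list is not empty, so a mismatch blocks the change rather than relying on the engineer noticing it.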

Q: What's the difference between DNS misconfiguration and a load balancer misconfiguration?

DNS sends clients to the wrong **IP/hostname** globally (slow to unwind). Load balancer misconfiguration routes traffic that already arrived at the edge (often faster to fix). Both cause outages; DNS mistakes add cache propagation delay—see [The Misconfigured Load Balancer](/real-engineering-stories/misconfigured-load-balancer).

Q: How do you monitor DNS health?

Run **synthetic checks** from multiple regions that resolve your public names and compare results to an allowlist of expected endpoints. Alert on any Route 53 change in production zones.
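A single probe of such a synthetic check might look like the sketch below. It uses only the Python standard library; the allowlist contents are placeholder documentation addresses, and `unexpected_addresses` is a hypothetical helper. Because load balancer IPs can change over time, a real allowlist would likely need to be refreshed by resolving the expected load balancer hostname rather than being hard-coded.

```python
import socket

# Hypothetical allowlist: public name -> IPs it is allowed to resolve to.
ALLOWED = {
    "api.internal.example.com": {"203.0.113.10", "203.0.113.11"},
}


def resolve_ipv4(hostname: str) -> set[str]:
    """Resolve a hostname to its set of IPv4 addresses using the system resolver."""
    infos = socket.getaddrinfo(
        hostname, 443, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr is (ip, port).
    return {info[4][0] for info in infos}


def unexpected_addresses(hostname: str, allowed: set[str], resolver=resolve_ipv4) -> set[str]:
    """Return resolved addresses that are not on the allowlist (empty set means healthy)."""
    return resolver(hostname) - allowed
```

Running the same probe from several regions and alerting when any of them returns a non-empty set would catch a record that points at the wrong environment, independent of whether the change went through the normal pipeline.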

Q: When should you lower DNS TTL?

**48 hours before** a planned migration or cutover, drop TTL to 60 seconds. After traffic is stable for 24–48 hours, raise it again to reduce query load.
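The timing can be worked out explicitly. A resolver that honors TTLs will have dropped the old record within one old TTL of the change, so the lowered TTL only takes full effect after that interval has passed. The sketch below compares that theoretical minimum with the 48-hour lead recommended above; the function names, the safety factor, and the cutover date are illustrative assumptions.

```python
from datetime import datetime, timedelta


def latest_ttl_drop(cutover: datetime, old_ttl_seconds: int, safety_factor: int = 2) -> datetime:
    """Latest moment to lower the TTL so TTL-honoring caches have expired by cutover.

    safety_factor is an assumed margin, not a standard value.
    """
    return cutover - timedelta(seconds=old_ttl_seconds * safety_factor)


def recommended_ttl_drop(cutover: datetime, lead: timedelta = timedelta(hours=48)) -> datetime:
    """The conservative lead time, leaving room for caches that hold records too long."""
    return cutover - lead


# Hypothetical cutover: a Saturday at 09:00, with the 300-second TTL from this story.
cutover = datetime(2024, 1, 6, 9, 0)
print(latest_ttl_drop(cutover, 300))   # theoretical minimum with a 2x margin
print(recommended_ttl_drop(cutover))   # the 48-hour recommendation
```

The gap between the two values is the margin for resolvers and client caches that do not respect the published TTL.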

Q: What do you do with writes that landed in the wrong environment?

**Quarantine first**—export affected rows, analyze schema diffs, reconcile manually or discard. Auto-merging into production risks duplicate charges and corrupted audit trails.
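The export step can be as simple as selecting rows by incident window and writing them out for review. The sketch below assumes each row is a dictionary with a `created_at` timestamp; that column name, the helper names, and the row shape are hypothetical and not taken from our schema.

```python
import csv
import io
from datetime import datetime


def quarantine_rows(rows: list[dict], window_start: datetime, window_end: datetime) -> list[dict]:
    """Select rows written during the incident window; nothing is modified or merged."""
    return [r for r in rows if window_start <= r["created_at"] <= window_end]


def export_csv(rows: list[dict], fieldnames: list[str]) -> str:
    """Serialize the quarantined rows to CSV text for offline analysis."""
    buf = io.StringIO()
    # extrasaction="ignore" skips columns that are not in fieldnames instead of raising.
    writer = csv.DictWriter(buf, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Keeping the export read-only means the reconcile-or-discard decision is made by a person looking at the data, not by the script.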

A staging DNS change applied to production misrouted live traffic for 45 minutes, causing 8,200 stray writes and taking 65 minutes to recover.

Medium · 22 min read

This is a story about how a single DNS A-record change—copy-pasted from a staging Terraform plan into production—routed live traffic to staging for 45 minutes. Roughly 540,000 production requests hit staging infrastructure; ~8,200 writes landed in the staging PostgreSQL cluster before we caught it. The blast radius was not just "wrong servers"—customers saw missing features, stale catalog data, and orders that never appeared in production billing.

The lesson we took away: DNS is a distributed switch with no undo button. TTL and resolver caches mean rollback is measured in minutes, not seconds, and "it's just a config change" is the most dangerous phrase in operations.

Related reading on this site: For how DNS resolution and TTLs affect failover speed, see DNS Resolution Flow. For safe rollout patterns that reduce blast radius, read Deployment Strategies. For a sibling routing mistake at the load-balancer layer, see The Misconfigured Load Balancer. For environment separation during migrations, compare The Strangler Pattern in Production.

Context

We were migrating our B2B API from us-east-1 to eu-west-1. The plan: validate new EU servers in staging, then cut production DNS during a Saturday maintenance window. Traffic flowed through Route 53 → regional ALB → eight API pods → PostgreSQL primary with two read replicas. Peak traffic was ~12,000 requests/minute on weekdays.

Original Architecture:

Staging and production looked almost identical on paper—same hostname pattern (api.internal.example.com vs api-staging.internal.example.com), same Terraform modules, same dashboard colors. The only guardrail was discipline.

Figure: Route 53 routing production traffic incorrectly to the staging ALB and staging database instead of production infrastructure. One wrong A record sent ~540K production requests to staging, and 8,200 writes into the wrong database.

Technology Choices:

DNS: AWS Route 53, TTL 300 seconds (5 minutes) on the production A record
API: Kubernetes on EKS, identical Helm charts for staging and prod
Database: PostgreSQL 14; staging was a single-node instance with anonymized seed data
Change process: Terraform apply with peer review—no separate approval gate for DNS
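The missing DNS approval gate is the kind of check that can be automated against the plan itself. The sketch below scans the JSON form of a Terraform plan (as produced by `terraform show -json` on a saved plan file) for Route 53 resources with a real change. It is an illustration under assumptions, not part of our pipeline at the time: `dns_changes` is a hypothetical helper and the set of resource types is a starting point.

```python
# Resource types treated as DNS changes; extend as needed (assumed list).
DNS_TYPES = {"aws_route53_record", "aws_route53_zone"}

# Plan actions that do not modify anything.
HARMLESS_ACTIONS = {"no-op", "read"}


def dns_changes(plan: dict) -> list[str]:
    """Return addresses of DNS resources that the plan would create, update, or delete."""
    flagged = []
    for rc in plan.get("resource_changes", []):
        actions = set(rc.get("change", {}).get("actions", []))
        if rc.get("type") in DNS_TYPES and actions - HARMLESS_ACTIONS:
            flagged.append(rc["address"])
    return flagged
```

A CI job could load the plan with `json.load`, call `dns_changes`, and require an explicit second approval whenever the result is not empty.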

Assumptions Made:

Engineers would apply changes to the correct AWS account (staging vs production)
Staging and production credentials were different enough to block cross-writes (they were not—shared IAM role on CI)
A 5-minute TTL meant "fast rollback" (ignored resolver caching and mobile app DNS caches)
Staging could never receive production traffic because hostnames differed (true until someone changed the wrong record)

The Incident

Saturday 09:00

Maintenance window opens. Engineer has staging and production Terraform plans side by side