Design Thinking
Cost-Aware Architecture
Design systems with cloud cost in mind. Over-engineering vs under-engineering, cost of reliability, and when cheaper infrastructure is the wrong choice.
Cost is one of the most underrated topics in system design interviews—and in production. Senior engineers treat cost as a first-class constraint: they design with it in mind, know when to spend more for reliability, and avoid both over-engineering (waste) and under-engineering (expensive failures).
Designing Systems with Cloud Cost in Mind
Why Cost Matters
- Unit economics: At scale, small inefficiencies multiply. 10% waste at $1M/month is $100K/month, or $1.2M/year.
- Runway: Startups have limited budgets. Cost overruns can kill a company.
- Margins: In competitive markets, cost efficiency = margin = ability to invest.
- Sustainability: Waste has environmental impact. Efficient systems are greener.
Cloud Cost Drivers
| Resource | What Drives Cost | Levers |
|---|---|---|
| Compute | Instance hours, type | Right-sizing, spot/preemptible, auto-scaling |
| Storage | GB-months, tier | Lifecycle policies, compression, cold storage |
| Network | Data transfer, cross-AZ/region | Reduce cross-AZ, use CDN, compress |
| Database | Instance + storage | Reserved instances, read replicas vs scale-up |
| Bandwidth | Egress | CDN, regional deployment, compression |
Senior Approach
- Identify cost drivers for the system (compute, storage, network, DB)
- Estimate at scale: "At 10M users, storage = X, bandwidth = Y"
- Design for cost: Right-size, use spot where possible, minimize cross-AZ
- Monitor: Cost per user, cost per request, cost trends
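The estimation step above can be sketched as a back-of-envelope calculator. The rates and the 10M-user inputs below are illustrative assumptions, not quoted cloud prices:

```python
# Back-of-envelope monthly cost model. All rates are assumptions.
PRICES = {
    "compute_per_instance_hour": 0.10,  # assumed on-demand rate
    "storage_per_gb_month": 0.023,      # assumed object-storage rate
    "egress_per_gb": 0.09,              # assumed internet-egress rate
}

def monthly_cost(users, storage_gb_per_user, egress_gb_per_user, instances):
    """Estimate total monthly cost and cost per user at a target scale."""
    compute = instances * 730 * PRICES["compute_per_instance_hour"]  # ~730 h/month
    storage = users * storage_gb_per_user * PRICES["storage_per_gb_month"]
    egress = users * egress_gb_per_user * PRICES["egress_per_gb"]
    total = compute + storage + egress
    return {"total": total, "per_user": total / users}

# "At 10M users, storage = X, bandwidth = Y" as a concrete estimate:
estimate = monthly_cost(users=10_000_000, storage_gb_per_user=0.05,
                        egress_gb_per_user=0.2, instances=40)
```

Even a crude model like this surfaces which term dominates, which tells you where optimization effort pays off.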
Over-Engineering vs Under-Engineering (Cost Lens)
Over-Engineering = Wasting Money
- Unused capacity: Provisioned for 10x peak "just in case"
- Premature scale: Microservices, multi-region, Kafka when monolith + single DB would do
- Gold-plating: Perfect solution when "good enough" saves 80% cost
- Complexity cost: More services = more monitoring, more ops, more debugging time
Under-Engineering = Expensive Failures
- Single point of failure: One outage can cost more than redundancy
- No autoscaling: Manual scaling, slow response to traffic spikes
- Cheap storage, expensive queries: Wrong DB choice = high compute cost
- No caching: Hitting DB for every request when cache would cut cost 10x
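The caching point in the last bullet can be made concrete with a rough per-request model. Both unit costs below are assumptions chosen only to illustrate the shape of the trade-off:

```python
# Rough model of the "no caching" failure mode. Unit costs are assumptions.
DB_COST_PER_REQUEST = 0.00001       # assumed amortized DB cost per query
CACHE_COST_PER_REQUEST = 0.0000001  # assumed amortized cache cost per hit

def monthly_read_cost(requests, cache_hit_rate=0.0):
    hits = requests * cache_hit_rate
    misses = requests - hits
    return hits * CACHE_COST_PER_REQUEST + misses * DB_COST_PER_REQUEST

no_cache = monthly_read_cost(1_000_000_000)  # every read hits the DB
with_cache = monthly_read_cost(1_000_000_000, cache_hit_rate=0.9)
# At a 90% hit rate, the read bill drops by roughly an order of magnitude.
```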
The Balance
Spend on:
- Reliability for revenue-critical paths
- Observability (you can't fix what you can't see)
- Core infrastructure that's hard to change later
Save on:
- Non-critical paths (degraded is OK)
- Over-provisioning ("we might need it")
- Premature optimization
Real Numbers: AWS Example
- Multi-AZ RDS vs single-AZ: ~2x cost. Worth it for production DB.
- ElastiCache for hot data: Can reduce DB load 10x, often pays for itself.
- Reserved instances vs on-demand: 30–60% savings for steady load.
- Spot instances for batch: 70–90% savings. Risk: interruption.
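The reserved-vs-on-demand numbers above hide a break-even condition: a reserved commitment costs the same whether or not the instance is busy. A quick check, using an assumed hourly rate and a 40% discount (within the 30–60% range):

```python
HOURS_PER_MONTH = 730
ON_DEMAND_HOURLY = 0.10   # assumed rate for one instance

def monthly_on_demand(utilization):
    """On-demand cost if the instance runs `utilization` of the month."""
    return ON_DEMAND_HOURLY * HOURS_PER_MONTH * utilization

def monthly_reserved(discount=0.40):
    """Reserved cost is fixed regardless of utilization."""
    return ON_DEMAND_HOURLY * HOURS_PER_MONTH * (1 - discount)

# Break-even: reserved wins once utilization exceeds (1 - discount).
# At a 40% discount, the instance must run >60% of the time to justify
# the commitment; below that, on-demand (or spot) is cheaper.
```

This is why reserved capacity suits baseline load but not bursty or experimental workloads.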
Cost of Reliability
What Reliability Costs
| Reliability Pattern | Cost | When Worth It |
|---|---|---|
| Multi-AZ | ~2x for DB, compute | Production, revenue-critical |
| Multi-region | 3–5x | Global users, compliance |
| Backups | Storage + transfer | Always for critical data |
| Redundant queues | 2x | When message loss is unacceptable |
| Circuit breakers | Dev time | When cascading failure is risk |
| Chaos engineering | Dev time + risk | When failure modes are complex |
When to Spend on Reliability
- Revenue-impacting: Downtime = lost sales
- Compliance: SOC 2 and HIPAA availability requirements push toward redundancy and backups
- User trust: Banking, healthcare, auth
- Hard to fix later: Data model, infra choices
When to Accept Less
- Internal tools: Short outage may be OK
- Non-critical features: Degraded is acceptable
- Early-stage product: Speed > perfection
- Batch jobs: Can retry, not real-time
Senior Insight
"The cost of one hour of downtime for our payment system is $X. Multi-AZ costs $Y/month. If we have one outage per year, we break even. We have 2–3. Multi-AZ pays for itself." — Quantify the trade-off.
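The quoted reasoning reduces to simple arithmetic. The figures below are hypothetical stand-ins for the $X and $Y in the quote:

```python
# Hypothetical figures: $50K revenue lost per hour of downtime, and an
# extra $2K/month for Multi-AZ on the payment database.
DOWNTIME_COST_PER_HOUR = 50_000
MULTI_AZ_EXTRA_PER_MONTH = 2_000

def break_even_outage_hours_per_year():
    """Hours of avoided downtime per year that pay for the Multi-AZ premium."""
    return (MULTI_AZ_EXTRA_PER_MONTH * 12) / DOWNTIME_COST_PER_HOUR

# 24,000 / 50,000 = 0.48 hours: avoiding under 30 minutes of downtime a
# year covers the cost. With 2-3 outages a year, the case is clear-cut.
```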
When Cheaper Infrastructure Is the Wrong Choice
False Economy
Cheap VMs, no managed DB: You save on RDS, spend 2x on eng time for backups, failover, scaling. Total cost of ownership (TCO) is higher.
Single region: You avoid the 3–5x cost of going multi-region. But one regional outage loses customers and revenue, and a single incident can exceed years of savings.
No CDN: You save on CloudFront. Your origin gets hammered, you scale up instances. Bandwidth + compute often exceeds CDN cost.
Cheap object storage, expensive egress: S3 Standard is cheap. Egress is expensive. If you serve lots of data, CDN + lower-tier storage can reduce TCO.
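The first trap above, unmanaged databases on cheap VMs, can be quantified with a TCO comparison. Every number below is a hypothetical illustration (loaded engineer cost, VM and managed-service prices):

```python
ENG_COST_PER_HOUR = 100        # hypothetical loaded engineering cost

self_managed_db = {
    "infra_per_month": 200,    # cheap VMs + storage
    "ops_hours_per_month": 20, # backups, patching, failover, upgrades
}
managed_db = {
    "infra_per_month": 600,    # managed-service premium
    "ops_hours_per_month": 2,  # mostly configuration
}

def tco_per_month(option):
    """Total cost of ownership: infrastructure plus engineering time."""
    return option["infra_per_month"] + option["ops_hours_per_month"] * ENG_COST_PER_HOUR

# self-managed: 200 + 20*100 = $2,200/month
# managed:      600 +  2*100 = $800/month
```

The "cheap" option is nearly 3x more expensive once engineering time is priced in, and that is before counting incident risk.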
When "Expensive" Is Cheaper
- Managed services: RDS, ElastiCache, managed Kafka. Higher hourly cost, lower TCO (no ops burden).
- Right-sized instances: One larger instance can be cheaper than many small ones (less management overhead).
- Reserved capacity: Commit for 1–3 years, save 30–60%. Worth it for baseline load.
- Correct architecture: A well-designed system can cost less than a "cheap" one that doesn't scale.
Senior Decision Framework
- TCO, not just unit cost: Include ops, incident response, engineering time
- Cost of failure: One outage can exceed years of savings
- Lock-in vs flexibility: Vendor lock-in can be costly later
- Scale assumptions: What's cheap at 1K users may be wrong at 1M
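The last point, scale assumptions, often shows up as a cost crossover between architectures. The two cost curves below are hypothetical, but the shape is typical of pay-per-use versus provisioned capacity:

```python
def serverless_cost(users):
    """Pay-per-use: no fixed cost, higher marginal cost (assumed $0.002/user)."""
    return 0.002 * users

def provisioned_cost(users):
    """Provisioned cluster: fixed baseline, tiny marginal cost (assumed)."""
    return 500 + 0.0001 * users

# At 1K users:  serverless ~$2    vs provisioned ~$500   -> serverless wins.
# At 1M users:  serverless ~$2000 vs provisioned ~$600   -> provisioned wins.
# Crossover where 0.002u = 500 + 0.0001u, i.e. around 263K users.
```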
Thinking Aloud Like a Senior Engineer
Problem: "Design a system to serve 10M images. Budget-conscious."
My first instinct: "S3 + CloudFront. Standard approach."
Cost check: 10M images, 200KB avg = 2TB storage. S3 Standard: ~$46/month. But egress: if each image viewed 10x/month = 20TB egress. At $0.09/GB = $1,800/month. Egress dominates.
Mitigation: CloudFront in front of S3. Caching at edge. If 80% cache hit, egress from origin = 4TB. Plus CloudFront cost. Might still be $500–800/month total. But way better than $1,800.
Storage tier: Most images old, rarely accessed. Lifecycle to S3 IA or Glacier after 90 days. Storage cost drops 50%+.
Image size: Can we serve WebP? Smaller files = less bandwidth. 30% size reduction = 30% cost reduction.
Decision: S3 + CloudFront, lifecycle policies, WebP/optimization. Design for cost from the start.
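The walkthrough above can be reproduced as a script. The rates mirror the figures in the text (~$0.023/GB-month storage, ~$0.09/GB egress) but should be treated as assumptions, not current pricing:

```python
IMAGES = 10_000_000
AVG_SIZE_GB = 200 / 1_000_000          # 200 KB per image
VIEWS_PER_IMAGE_PER_MONTH = 10

STORAGE_RATE = 0.023                   # $/GB-month, S3 Standard-like
EGRESS_RATE = 0.09                     # $/GB, internet egress

storage_gb = IMAGES * AVG_SIZE_GB                             # 2,000 GB (2 TB)
storage_cost = storage_gb * STORAGE_RATE                      # ~$46/month

egress_gb = IMAGES * VIEWS_PER_IMAGE_PER_MONTH * AVG_SIZE_GB  # 20,000 GB
egress_no_cdn = egress_gb * EGRESS_RATE                       # ~$1,800/month

CACHE_HIT_RATE = 0.80
origin_egress_gb = egress_gb * (1 - CACHE_HIT_RATE)           # 4,000 GB from origin

WEBP_REDUCTION = 0.30                  # assumed size saving from WebP
webp_egress_no_cdn = egress_no_cdn * (1 - WEBP_REDUCTION)     # ~$1,260/month
```

Notice that storage is a rounding error next to egress, which is exactly why the optimizations target bandwidth, not storage.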
Best Practices
- Estimate cost at target scale before building
- Monitor cost per user/request as a core metric
- Right-size: Don't over-provision "to be safe"
- Use managed services when TCO is lower
- Spend on reliability where cost of failure > cost of redundancy
Summary
Cost-aware architecture means:
- Treat cost as a constraint from the start
- Balance over-engineering (waste) vs under-engineering (expensive failures)
- Quantify cost of reliability and when it pays off
- Avoid false economy: cheaper infra that increases TCO or risk
FAQs
Q: How do I bring up cost in an interview?
A: "Given we're cost-conscious, I'd use X instead of Y because..." or "At this scale, the main cost drivers would be... I'd optimize for..."
Q: When should we optimize for cost vs speed of development?
A: Early stage: speed. Post-PMF, scaling: cost. When runway or margin is a concern: cost earlier.
Q: What's the biggest cost mistake teams make?
A: Ignoring egress and data transfer. Storage is cheap; moving data is expensive. Design to minimize cross-AZ and cross-region traffic.