Design Thinking
Cost-Aware Architecture
Design systems with cloud cost in mind. Over-engineering vs under-engineering, cost of reliability, and when cheaper infrastructure is the wrong choice.
Cost is one of the most underrated topics in system design interviews—and in production. Senior engineers treat cost as a first-class constraint: they design with it in mind, know when to spend more for reliability, and avoid both over-engineering (waste) and under-engineering (expensive failures).
Designing Systems with Cloud Cost in Mind
Why Cost Matters
- Unit economics: At scale, small inefficiencies multiply. 10% waste at $1M/month is $100K/month, or $1.2M/year.
- Runway: Startups have limited budgets. Cost overruns can kill a company.
- Margins: In competitive markets, cost efficiency = margin = ability to invest.
- Sustainability: Waste has environmental impact. Efficient systems are greener.
Cloud Cost Drivers
| Resource | What Drives Cost | Levers |
|---|---|---|
| Compute | Instance hours, type | Right-sizing, spot/preemptible, auto-scaling |
| Storage | GB-months, tier | Lifecycle policies, compression, cold storage |
| Network | Data transfer, cross-AZ/region | Reduce cross-AZ, use CDN, compress |
| Database | Instance + storage | Reserved instances, read replicas vs scale-up |
| Bandwidth | Egress | CDN, regional deployment, compression |
Senior Approach
- Identify cost drivers for the system (compute, storage, network, DB)
- Estimate at scale: "At 10M users, storage = X, bandwidth = Y"
- Design for cost: Right-size, use spot where possible, minimize cross-AZ
- Monitor: Cost per user, cost per request, cost trends
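The estimation step above can be sketched as a back-of-envelope calculator. The rates and the 10M-user inputs below are illustrative assumptions, not quoted cloud prices:

```python
# Back-of-envelope monthly cost model. All rates are assumptions.
PRICES = {
    "compute_per_instance_hour": 0.10,  # assumed on-demand rate
    "storage_per_gb_month": 0.023,      # assumed object-storage rate
    "egress_per_gb": 0.09,              # assumed internet-egress rate
}

def monthly_cost(users, storage_gb_per_user, egress_gb_per_user, instances):
    """Estimate total monthly cost and cost per user at a target scale."""
    compute = instances * 730 * PRICES["compute_per_instance_hour"]  # ~730 h/month
    storage = users * storage_gb_per_user * PRICES["storage_per_gb_month"]
    egress = users * egress_gb_per_user * PRICES["egress_per_gb"]
    total = compute + storage + egress
    return {"total": total, "per_user": total / users}

# "At 10M users, storage = X, bandwidth = Y" as a concrete estimate:
estimate = monthly_cost(users=10_000_000, storage_gb_per_user=0.05,
                        egress_gb_per_user=0.2, instances=40)
```

Even a crude model like this surfaces which term dominates, which tells you where optimization effort pays off.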
Over-Engineering vs Under-Engineering (Cost Lens)
Over-Engineering = Wasting Money
- Unused capacity: Provisioned for 10x peak "just in case"
- Premature scale: Microservices, multi-region, Kafka when monolith + single DB would do
- Gold-plating: Perfect solution when "good enough" saves 80% cost
- Complexity cost: More services = more monitoring, more ops, more debugging time
Under-Engineering = Expensive Failures
- Single point of failure: One outage can cost more than redundancy
- No autoscaling: Manual scaling, slow response to traffic spikes
- Cheap storage, expensive queries: Wrong DB choice = high compute cost
- No caching: Hitting DB for every request when cache would cut cost 10x
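The caching point in the last bullet can be made concrete with a rough per-request model. Both unit costs below are assumptions chosen only to illustrate the shape of the trade-off:

```python
# Rough model of the "no caching" failure mode. Unit costs are assumptions.
DB_COST_PER_REQUEST = 0.00001       # assumed amortized DB cost per query
CACHE_COST_PER_REQUEST = 0.0000001  # assumed amortized cache cost per hit

def monthly_read_cost(requests, cache_hit_rate=0.0):
    hits = requests * cache_hit_rate
    misses = requests - hits
    return hits * CACHE_COST_PER_REQUEST + misses * DB_COST_PER_REQUEST

no_cache = monthly_read_cost(1_000_000_000)  # every read hits the DB
with_cache = monthly_read_cost(1_000_000_000, cache_hit_rate=0.9)
# At a 90% hit rate, the read bill drops by roughly an order of magnitude.
```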
The Balance
Spend on:
- Reliability for revenue-critical paths
- Observability (you can't fix what you can't see)
- Core infrastructure that's hard to change later
Save on:
- Non-critical paths (degraded is OK)
- Over-provisioning ("we might need it")
- Premature optimization
Real Numbers: AWS Example
- Multi-AZ RDS vs single-AZ: ~2x cost. Worth it for production DB.
- ElastiCache for hot data: Can reduce DB load 10x, often pays for itself.
- Reserved instances vs on-demand: 30–60% savings for steady load.
- Spot instances for batch: 70–90% savings. Risk: interruption.
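The reserved-vs-on-demand numbers above hide a break-even condition: a reserved commitment costs the same whether or not the instance is busy. A quick check, using an assumed hourly rate and a 40% discount (within the 30–60% range):

```python
HOURS_PER_MONTH = 730
ON_DEMAND_HOURLY = 0.10   # assumed rate for one instance

def monthly_on_demand(utilization):
    """On-demand cost if the instance runs `utilization` of the month."""
    return ON_DEMAND_HOURLY * HOURS_PER_MONTH * utilization

def monthly_reserved(discount=0.40):
    """Reserved cost is fixed regardless of utilization."""
    return ON_DEMAND_HOURLY * HOURS_PER_MONTH * (1 - discount)

# Break-even: reserved wins once utilization exceeds (1 - discount).
# At a 40% discount, the instance must run >60% of the time to justify
# the commitment; below that, on-demand (or spot) is cheaper.
```

This is why reserved capacity suits baseline load but not bursty or experimental workloads.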
Cost of Reliability
What Reliability Costs
| Reliability Pattern | Cost | When Worth It |
|---|---|---|
| Multi-AZ | ~2x for DB, compute | Production, revenue-critical |
| Multi-region | 3–5x | Global users, compliance |
| Backups | Storage + transfer | Always for critical data |
| Redundant queues | 2x | When message loss is unacceptable |
| Circuit breakers | Dev time | When cascading failure is risk |
| Chaos engineering | Dev time + risk | When failure modes are complex |
When to Spend on Reliability
- Revenue-impacting: Downtime = lost sales
- Compliance: SOC 2 and HIPAA availability requirements push toward redundancy and backups
- User trust: Banking, healthcare, auth
- Hard to fix later: Data model, infra choices
When to Accept Less
- Internal tools: Short outage may be OK
- Non-critical features: Degraded is acceptable
- Early-stage product: Speed > perfection
- Batch jobs: Can retry, not real-time
Senior Insight
"The cost of one hour of downtime for our payment system is $X. Multi-AZ costs $Y/month. If we have one outage per year, we break even. We have 2–3. Multi-AZ pays for itself." — Quantify the trade-off.
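The quoted reasoning reduces to simple arithmetic. The figures below are hypothetical stand-ins for the $X and $Y in the quote:

```python
# Hypothetical figures: $50K revenue lost per hour of downtime, and an
# extra $2K/month for Multi-AZ on the payment database.
DOWNTIME_COST_PER_HOUR = 50_000
MULTI_AZ_EXTRA_PER_MONTH = 2_000

def break_even_outage_hours_per_year():
    """Hours of avoided downtime per year that pay for the Multi-AZ premium."""
    return (MULTI_AZ_EXTRA_PER_MONTH * 12) / DOWNTIME_COST_PER_HOUR

# 24,000 / 50,000 = 0.48 hours: avoiding under 30 minutes of downtime a
# year covers the cost. With 2-3 outages a year, the case is clear-cut.
```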
When Cheaper Infrastructure Is the Wrong Choice
False Economy
Cheap VMs, no managed DB: You save on RDS, spend 2x on eng time for backups, failover, scaling. Total cost of ownership (TCO) is higher.
Single region: You avoid the 3–5x cost of going multi-region. But one regional outage loses customers and revenue, and a single incident can exceed years of savings.
No CDN: You save on CloudFront. Your origin gets hammered, you scale up instances. Bandwidth + compute often exceeds CDN cost.
Cheap object storage, expensive egress: S3 Standard is cheap. Egress is expensive. If you serve lots of data, CDN + lower-tier storage can reduce TCO.
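The first trap above, unmanaged databases on cheap VMs, can be quantified with a TCO comparison. Every number below is a hypothetical illustration (loaded engineer cost, VM and managed-service prices):

```python
ENG_COST_PER_HOUR = 100        # hypothetical loaded engineering cost

self_managed_db = {
    "infra_per_month": 200,    # cheap VMs + storage
    "ops_hours_per_month": 20, # backups, patching, failover, upgrades
}
managed_db = {
    "infra_per_month": 600,    # managed-service premium
    "ops_hours_per_month": 2,  # mostly configuration
}

def tco_per_month(option):
    """Total cost of ownership: infrastructure plus engineering time."""
    return option["infra_per_month"] + option["ops_hours_per_month"] * ENG_COST_PER_HOUR

# self-managed: 200 + 20*100 = $2,200/month
# managed:      600 +  2*100 = $800/month
```

The "cheap" option is nearly 3x more expensive once engineering time is priced in, and that is before counting incident risk.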
When "Expensive" Is Cheaper
- Managed services: RDS, ElastiCache, managed Kafka. Higher hourly cost, lower TCO (no ops burden).
- Right-sized instances: One larger instance can be cheaper than many small ones (less management overhead).
- Reserved capacity: Commit for 1–3 years, save 30–60%. Worth it for baseline load.
- Correct architecture: A well-designed system can cost less than a "cheap" one that doesn't scale.
Senior Decision Framework
- TCO, not just unit cost: Include ops, incident response, engineering time
- Cost of failure: One outage can exceed years of savings
- Lock-in vs flexibility: Vendor lock-in can be costly later
- Scale assumptions: What's cheap at 1K users may be wrong at 1M
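The last point, scale assumptions, often shows up as a cost crossover between architectures. The two cost curves below are hypothetical, but the shape is typical of pay-per-use versus provisioned capacity:

```python
def serverless_cost(users):
    """Pay-per-use: no fixed cost, higher marginal cost (assumed $0.002/user)."""
    return 0.002 * users

def provisioned_cost(users):
    """Provisioned cluster: fixed baseline, tiny marginal cost (assumed)."""
    return 500 + 0.0001 * users

# At 1K users:  serverless ~$2    vs provisioned ~$500   -> serverless wins.
# At 1M users:  serverless ~$2000 vs provisioned ~$600   -> provisioned wins.
# Crossover where 0.002u = 500 + 0.0001u, i.e. around 263K users.
```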
Thinking Aloud Like a Senior Engineer
Problem: "Design a system to serve 10M images. Budget-conscious."
My first instinct: "S3 + CloudFront. Standard approach."
Cost check: 10M images, 200KB avg = 2TB storage. S3 Standard: ~$46/month. But egress: if each image viewed 10x/month = 20TB egress. At $0.09/GB = $1,800/month. Egress dominates.
Mitigation: CloudFront in front of S3. Caching at edge. If 80% cache hit, egress from origin = 4TB. Plus CloudFront cost. Might still be $500–800/month total. But way better than $1,800.
Storage tier: Most images old, rarely accessed. Lifecycle to S3 IA or Glacier after 90 days. Storage cost drops 50%+.
Image size: Can we serve WebP? Smaller files = less bandwidth. 30% size reduction = 30% cost reduction.
Decision: S3 + CloudFront, lifecycle policies, WebP/optimization. Design for cost from the start.
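The walkthrough above can be reproduced as a script. The rates mirror the figures in the text (~$0.023/GB-month storage, ~$0.09/GB egress) but should be treated as assumptions, not current pricing:

```python
IMAGES = 10_000_000
AVG_SIZE_GB = 200 / 1_000_000          # 200 KB per image
VIEWS_PER_IMAGE_PER_MONTH = 10

STORAGE_RATE = 0.023                   # $/GB-month, S3 Standard-like
EGRESS_RATE = 0.09                     # $/GB, internet egress

storage_gb = IMAGES * AVG_SIZE_GB                             # 2,000 GB (2 TB)
storage_cost = storage_gb * STORAGE_RATE                      # ~$46/month

egress_gb = IMAGES * VIEWS_PER_IMAGE_PER_MONTH * AVG_SIZE_GB  # 20,000 GB
egress_no_cdn = egress_gb * EGRESS_RATE                       # ~$1,800/month

CACHE_HIT_RATE = 0.80
origin_egress_gb = egress_gb * (1 - CACHE_HIT_RATE)           # 4,000 GB from origin

WEBP_REDUCTION = 0.30                  # assumed size saving from WebP
webp_egress_no_cdn = egress_no_cdn * (1 - WEBP_REDUCTION)     # ~$1,260/month
```

Notice that storage is a rounding error next to egress, which is exactly why the optimizations target bandwidth, not storage.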
Best Practices
- Estimate cost at target scale before building
- Monitor cost per user/request as a core metric
- Right-size: Don't over-provision "to be safe"
- Use managed services when TCO is lower
- Spend on reliability where cost of failure > cost of redundancy
Summary
Cost-aware architecture means:
- Treat cost as a constraint from the start
- Balance over-engineering (waste) vs under-engineering (expensive failures)
- Quantify cost of reliability and when it pays off
- Avoid false economy: cheaper infra that increases TCO or risk
FAQs
Q: How do I bring up cost in an interview?
A: "Given we're cost-conscious, I'd use X instead of Y because..." or "At this scale, the main cost drivers would be... I'd optimize for..."
Q: When should we optimize for cost vs speed of development?
A: Early stage: speed. Post-PMF, scaling: cost. When runway or margin is a concern: cost earlier.
Q: What's the biggest cost mistake teams make?
A: Ignoring egress and data transfer. Storage is cheap; moving data is expensive. Design to minimize cross-AZ and cross-region traffic.