
Design Thinking

Cost-Aware Architecture

Design systems with cloud cost in mind. Over-engineering vs under-engineering, cost of reliability, and when cheaper infrastructure is the wrong choice.

Advanced · 20 min read

Cost is one of the most underrated topics in system design interviews—and in production. Senior engineers treat cost as a first-class constraint: they design with it in mind, know when to spend more for reliability, and avoid both over-engineering (waste) and under-engineering (expensive failures).


Designing Systems with Cloud Cost in Mind

Why Cost Matters

  • Unit economics: At scale, small inefficiencies multiply. 10% waste at $1M/month = $100K/month, or $1.2M/year.
  • Runway: Startups have limited budgets. Cost overruns can kill a company.
  • Margins: In competitive markets, cost efficiency = margin = ability to invest.
  • Sustainability: Waste has environmental impact. Efficient systems are greener.

Cloud Cost Drivers

Resource | What Drives Cost | Levers
Compute | Instance hours, type | Right-sizing, spot/preemptible, auto-scaling
Storage | GB-months, tier | Lifecycle policies, compression, cold storage
Network | Data transfer, cross-AZ/region | Reduce cross-AZ, use CDN, compress
Database | Instance + storage | Reserved instances, read replicas vs scale-up
Bandwidth | Egress | CDN, regional deployment, compression

Senior Approach

  1. Identify cost drivers for the system (compute, storage, network, DB)
  2. Estimate at scale: "At 10M users, storage = X, bandwidth = Y"
  3. Design for cost: Right-size, use spot where possible, minimize cross-AZ
  4. Monitor: Cost per user, cost per request, cost trends
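The four steps above can be sketched as a small back-of-the-envelope model. This is only an illustration: the `UnitPrices` defaults, the per-user usage figures, and the `users_per_instance` capacity are all hypothetical assumptions, not provider quotes, and should be replaced with real numbers for your system.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitPrices:
    # Hypothetical USD rates for illustration; substitute your provider's prices.
    storage_per_gb_month: float = 0.023
    egress_per_gb: float = 0.09
    instance_hour: float = 0.10


def estimate_monthly_cost(users, storage_gb_per_user, egress_gb_per_user,
                          users_per_instance, prices=UnitPrices(),
                          hours_per_month=730):
    """Return a per-driver monthly cost breakdown plus total and cost per user."""
    # Round up: a partially used instance is still billed in full.
    instances = math.ceil(users / users_per_instance)
    costs = {
        "storage": users * storage_gb_per_user * prices.storage_per_gb_month,
        "egress": users * egress_gb_per_user * prices.egress_per_gb,
        "compute": instances * hours_per_month * prices.instance_hour,
    }
    total = sum(costs.values())
    return {**costs, "total": total, "per_user": total / users}


if __name__ == "__main__":
    # Assumed workload: 10M users, 0.1 GB stored and 0.5 GB served per user.
    estimate = estimate_monthly_cost(10_000_000, 0.1, 0.5, 50_000)
    for name, usd in estimate.items():
        print(f"{name:>8}: ${usd:,.4f}")
```

With these assumed inputs egress dwarfs storage and compute, which is the kind of signal step 2 is meant to surface before anything is built. Tracking `per_user` over time gives the metric from step 4.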

Over-Engineering vs Under-Engineering (Cost Lens)

Over-Engineering = Wasting Money

  • Unused capacity: Provisioned for 10x peak "just in case"
  • Premature scale: Microservices, multi-region, Kafka when monolith + single DB would do
  • Gold-plating: Perfect solution when "good enough" saves 80% cost
  • Complexity cost: More services = more monitoring, more ops, more debugging time

Under-Engineering = Expensive Failures

  • Single point of failure: One outage can cost more than redundancy
  • No autoscaling: Manual scaling, slow response to traffic spikes
  • Cheap storage, expensive queries: Wrong DB choice = high compute cost
  • No caching: Hitting DB for every request when cache would cut cost 10x
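To see where a figure like "10x" in the caching bullet can come from, here is a minimal sketch. It assumes a read-only workload and a fixed per-replica capacity; both the function name `replicas_needed` and the traffic numbers are hypothetical.

```python
import math


def replicas_needed(read_qps, cache_hit_rate, qps_per_replica):
    """Database replicas required to absorb the reads that miss the cache."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    db_qps = read_qps * (1.0 - cache_hit_rate)  # only cache misses reach the DB
    # Keep at least one replica even if every read is served from cache.
    return max(1, math.ceil(db_qps / qps_per_replica))


if __name__ == "__main__":
    # Assumed workload: 50,000 reads/s, each replica handles 5,000 reads/s.
    for hit_rate in (0.0, 0.5, 0.9):
        print(hit_rate, replicas_needed(50_000, hit_rate, 5_000))
```

Under these assumptions a 90% hit rate cuts the database tier from ten replicas to one. The cache itself is not free, so the real saving is the replica cost avoided minus the cache cost.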

The Balance

Spend on:

  • Reliability for revenue-critical paths
  • Observability (you can't fix what you can't see)
  • Core infrastructure that's hard to change later

Save on:

  • Non-critical paths (degraded is OK)
  • Over-provisioning ("we might need it")
  • Premature optimization

Real Numbers: AWS Example

  • Multi-AZ RDS vs single-AZ: ~2x cost. Worth it for production DB.
  • ElastiCache for hot data: Can reduce DB load 10x, often pays for itself.
  • Reserved instances vs on-demand: 30–60% savings for steady load.
  • Spot instances for batch: 70–90% savings. Risk: interruption.
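The reserved and spot discounts above combine naturally: reserve the steady baseline, pay on-demand for bursts, and run interruptible batch work on spot. The sketch below assumes a 40% reserved discount and a 70% spot discount (values picked from within the ranges quoted above), plus a hypothetical hourly rate and workload split.

```python
def blended_compute_cost(baseline_hours, burst_hours, batch_hours,
                         on_demand_rate, reserved_discount=0.40,
                         spot_discount=0.70):
    """Return (all_on_demand_cost, blended_cost) for one month of instance hours."""
    total_hours = baseline_hours + burst_hours + batch_hours
    all_on_demand = total_hours * on_demand_rate
    blended = (
        baseline_hours * on_demand_rate * (1 - reserved_discount)  # reserved
        + burst_hours * on_demand_rate                             # on-demand
        + batch_hours * on_demand_rate * (1 - spot_discount)       # spot
    )
    return all_on_demand, blended


if __name__ == "__main__":
    # Assumed month: 10 always-on instances (7,300 h), 1,000 burst h, 2,000 batch h.
    on_demand, blended = blended_compute_cost(7_300, 1_000, 2_000, 0.10)
    print(f"on-demand only: ${on_demand:,.2f}, blended: ${blended:,.2f}")
    print(f"saving: {1 - blended / on_demand:.0%}")
```

The model ignores spot interruptions, which in practice add retry cost to batch jobs, so treat the blended figure as an upper bound on the saving.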

Cost of Reliability

What Reliability Costs

Reliability Pattern | Cost | When Worth It
Multi-AZ | ~2x for DB, compute | Production, revenue-critical
Multi-region | 3–5x | Global users, compliance
Backups | Storage + transfer | Always for critical data
Redundant queues | 2x | When message loss is unacceptable
Circuit breakers | Dev time | When cascading failure is a risk
Chaos engineering | Dev time + risk | When failure modes are complex

When to Spend on Reliability

  • Revenue-impacting: Downtime = lost sales
  • Compliance: Frameworks such as SOC 2 and HIPAA expect availability and recovery controls, which usually means backups and redundancy
  • User trust: Banking, healthcare, auth
  • Hard to fix later: Data model, infra choices

When to Accept Less

  • Internal tools: Short outage may be OK
  • Non-critical features: Degraded is acceptable
  • Early-stage product: Speed > perfection
  • Batch jobs: Can retry, not real-time

Senior Insight

"The cost of one hour of downtime for our payment system is $X. Multi-AZ costs $Y/month. If we have one outage per year, we break even. We have 2–3. Multi-AZ pays for itself." — Quantify the trade-off.

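The break-even reasoning in the quote can be written down directly. This is a minimal sketch: the function names and all dollar figures are placeholders standing in for $X and $Y, and it assumes redundancy fully avoids the downtime of each outage.

```python
def break_even_outages_per_year(downtime_cost_per_hour, hours_per_outage,
                                redundancy_cost_per_month):
    """Outages per year at which redundancy pays for itself."""
    annual_redundancy_cost = redundancy_cost_per_month * 12
    cost_avoided_per_outage = downtime_cost_per_hour * hours_per_outage
    return annual_redundancy_cost / cost_avoided_per_outage


def net_annual_benefit(downtime_cost_per_hour, hours_per_outage,
                       redundancy_cost_per_month, outages_per_year):
    """Downtime cost avoided minus what the redundancy costs, per year."""
    avoided = downtime_cost_per_hour * hours_per_outage * outages_per_year
    return avoided - redundancy_cost_per_month * 12


if __name__ == "__main__":
    # Placeholder figures: $50,000/hour downtime, 1-hour outages, $2,000/month.
    print(break_even_outages_per_year(50_000, 1, 2_000))
    print(net_annual_benefit(50_000, 1, 2_000, outages_per_year=2))
```

A positive net benefit means the redundancy pays for itself at the observed outage rate; a break-even figure well below your actual outage frequency makes the same point.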

When Cheaper Infrastructure Is the Wrong Choice

False Economy

Cheap VMs, no managed DB: You save on RDS, spend 2x on eng time for backups, failover, scaling. Total cost of ownership (TCO) is higher.

Single region: You save 50% on infra. One regional outage loses customers and revenue. One incident can exceed years of savings.

No CDN: You save on CloudFront. Your origin gets hammered, you scale up instances. Bandwidth + compute often exceeds CDN cost.

Cheap object storage, expensive egress: S3 Standard is cheap. Egress is expensive. If you serve lots of data, CDN + lower-tier storage can reduce TCO.

When "Expensive" Is Cheaper

  • Managed services: RDS, ElastiCache, managed Kafka. Higher hourly cost, lower TCO (no ops burden).
  • Right-sized instances: One larger instance can be cheaper than many small ones (less management overhead).
  • Reserved capacity: Commit for 1–3 years, save 30–60%. Worth it for baseline load.
  • Correct architecture: A well-designed system can cost less than a "cheap" one that doesn't scale.

Senior Decision Framework

  1. TCO, not just unit cost: Include ops, incident response, engineering time
  2. Cost of failure: One outage can exceed years of savings
  3. Lock-in vs flexibility: Vendor lock-in can be costly later
  4. Scale assumptions: What's cheap at 1K users may be wrong at 1M
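The first two points of the framework can be made concrete by pricing in engineering time and expected incidents alongside the infrastructure bill. The sketch below is illustrative only: `monthly_tco` is a hypothetical helper, and the hours, rates, and incident probabilities are invented inputs for comparing a self-managed database with a managed one.

```python
def monthly_tco(infra_cost, ops_hours, hourly_eng_rate,
                expected_incidents, cost_per_incident):
    """Monthly total cost of ownership: infrastructure + ops time + expected incident cost."""
    ops_cost = ops_hours * hourly_eng_rate
    incident_cost = expected_incidents * cost_per_incident
    return infra_cost + ops_cost + incident_cost


if __name__ == "__main__":
    # Assumed inputs: engineers cost $100/hour, an incident costs $5,000.
    self_managed = monthly_tco(infra_cost=300, ops_hours=20, hourly_eng_rate=100,
                               expected_incidents=0.10, cost_per_incident=5_000)
    managed = monthly_tco(infra_cost=700, ops_hours=2, hourly_eng_rate=100,
                          expected_incidents=0.02, cost_per_incident=5_000)
    print(f"self-managed: ${self_managed:,.0f}/month")
    print(f"managed:      ${managed:,.0f}/month")
```

With these made-up inputs the option with the higher infrastructure bill has the lower TCO. The conclusion flips if ops hours or incident risk are small, which is why the inputs need to be estimated rather than assumed.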

Thinking Aloud Like a Senior Engineer

Problem: "Design a system to serve 10M images. Budget-conscious."

My first instinct: "S3 + CloudFront. Standard approach."

Cost check: 10M images, 200KB avg = 2TB storage. S3 Standard: ~$46/month. But egress: if each image viewed 10x/month = 20TB egress. At $0.09/GB = $1,800/month. Egress dominates.

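Mitigation: CloudFront in front of S3. Caching at edge. If 80% cache hit, origin fetches drop to about 4TB. CloudFront still bills for data delivered to viewers, so the total depends on its per-GB rate and any committed-use discount. Might still be $500–800/month total. But way better than $1,800.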

Storage tier: Most images old, rarely accessed. Lifecycle to S3 IA or Glacier after 90 days. Storage cost drops 50%+.

Image size: Can we serve WebP? Smaller files = less bandwidth. 30% size reduction = roughly 30% lower bandwidth cost, plus less storage.

Decision: S3 + CloudFront, lifecycle policies, WebP/optimization. Design for cost from the start.
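The arithmetic in this walkthrough can be reproduced with a short model. It is a sketch under assumptions: the default rates are the ones quoted above ($0.023/GB-month storage, $0.09/GB egress), the $0.0125/GB-month cold-tier rate and the 80% cold fraction are assumed values, decimal units are used (1 GB = 1,000,000 KB), and CDN pricing is left out because it depends on the rate you actually pay. `delivery_rate` is a hypothetical parameter you can set to that rate.

```python
def image_serving_cost(num_images, avg_kb, views_per_image,
                       storage_rate=0.023, delivery_rate=0.09,
                       size_reduction=0.0, cold_fraction=0.0,
                       cold_storage_rate=0.0125):
    """Monthly storage and delivery cost in USD for an image library."""
    # Smaller files (e.g. after WebP conversion) shrink storage and bandwidth alike.
    stored_gb = num_images * avg_kb * (1 - size_reduction) / 1_000_000
    delivered_gb = stored_gb * views_per_image
    # Split storage between the standard tier and a cheaper cold tier.
    storage = stored_gb * ((1 - cold_fraction) * storage_rate
                           + cold_fraction * cold_storage_rate)
    delivery = delivered_gb * delivery_rate
    return {"storage": storage, "delivery": delivery, "total": storage + delivery}


if __name__ == "__main__":
    baseline = image_serving_cost(10_000_000, 200, 10)
    optimized = image_serving_cost(10_000_000, 200, 10,
                                   size_reduction=0.30, cold_fraction=0.80)
    print("baseline: ", baseline)
    print("optimized:", optimized)
```

The baseline call matches the cost check above: about $46 for storage and $1,800 for delivery. Rerunning it with your own delivery rate shows quickly whether delivery or storage is the lever worth pulling.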


Best Practices

  1. Estimate cost at target scale before building
  2. Monitor cost per user/request as a core metric
  3. Right-size: Don't over-provision "to be safe"
  4. Use managed services when TCO is lower
  5. Spend on reliability where cost of failure > cost of redundancy

Summary

Cost-aware architecture means:

  • Treat cost as a constraint from the start
  • Balance over-engineering (waste) vs under-engineering (expensive failures)
  • Quantify cost of reliability and when it pays off
  • Avoid false economy: cheaper infra that increases TCO or risk

FAQs

Q: How do I bring up cost in an interview?

A: "Given we're cost-conscious, I'd use X instead of Y because..." or "At this scale, the main cost drivers would be... I'd optimize for..."

Q: When should we optimize for cost vs speed of development?

A: Early stage: speed. Post-PMF, scaling: cost. When runway or margin is a concern: cost earlier.

Q: What's the biggest cost mistake teams make?

A: Ignoring egress and data transfer. Storage is cheap; moving data is expensive. Design to minimize cross-AZ and cross-region traffic.
