Failure-First Design Thinking

Design for failure before scale. Learn failure modes, blast radius analysis, graceful degradation vs hard failure, and SLOs/SLIs/error budgets in system design.

Advanced · 24 min read

Senior engineers design for failure before they design for scale. While juniors focus on the happy path, seniors ask: "What breaks first? Who does it affect? How do we degrade gracefully?" This mindset—failure-first design thinking—is a strong signal in system design interviews and in production.


Designing for Failure Before Scale

Why Failure First?

  • Failure is inevitable: Disks die, networks partition, services crash. Design assumes it.
  • Scale amplifies failure: At 1M users, a 0.1% failure rate is 1,000 unhappy users. At 100M, it's 100,000.
  • Recovery time matters: Can you fail fast, detect quickly, and recover automatically?
  • User impact matters: Does the whole system go down, or do users see degraded but working service?

The Failure-First Mindset

Before drawing boxes and arrows:

  1. List what can fail (disk, network, service, dependency, region)
  2. For each, ask: What's the blast radius? Who is affected?
  3. Define: What must never fail? What can degrade?
  4. Design: Isolation, fallbacks, and detection

Failure Modes & Blast Radius Analysis

Common Failure Modes

| Component | Failure Mode | Example |
|---|---|---|
| Database | Unavailable, slow, corrupted | Primary dies, replication lag |
| Cache | Miss, stale, unavailable | Redis OOM, network partition |
| Queue | Backlog, lost messages, slow consumer | Kafka broker down, consumer crash |
| External API | Timeout, 5xx, rate limit | Payment provider outage |
| Network | Partition, latency spike, packet loss | AZ outage, cross-region latency |
| Disk | Full, slow I/O, corruption | EBS failure, disk full |

Blast Radius

Blast radius = How many users, requests, or systems are affected when this component fails?

Example: Single Redis instance

  • Blast radius: All users whose requests hit that cache
  • Mitigation: Cache is a performance optimization; fall back to DB. Blast radius = increased latency for all, not total failure.
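The Redis mitigation above is a read-through lookup where the cache is an optimization, not a dependency. A minimal sketch (the `cache`/`db` objects are hypothetical stand-ins for your clients):

```python
import logging

def get_user(user_id, cache, db):
    """Read-through lookup: cache is an optimization, not a dependency.

    If the cache is unreachable we degrade to a direct DB read, so the
    blast radius of a cache outage is added latency, not failed requests.
    """
    try:
        cached = cache.get(user_id)
        if cached is not None:
            return cached
    except ConnectionError:
        logging.warning("cache unavailable; falling back to DB")

    user = db.get(user_id)  # the source of truth
    try:
        cache.set(user_id, user)
    except ConnectionError:
        pass  # best effort: a failed cache write must not fail the request
    return user
```

Note the second `try`: the cache write-back is also best-effort, so a flapping cache cannot turn a successful DB read into a user-visible error.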

Example: Single database primary

  • Blast radius: All write traffic, possibly all reads if no replicas
  • Mitigation: Failover to replica. Blast radius = brief write unavailability during failover.

Example: Single-region deployment

  • Blast radius: Entire user base if region goes down
  • Mitigation: Multi-region. Blast radius = users in one region during regional outage.

Senior Interview Move

Explicitly state: "Let me analyze failure modes. For the database, the main failure mode is primary outage. Blast radius: all writes. Mitigation: synchronous replica + automated failover. For the cache, failure mode is Redis down. Blast radius: latency for all reads. Mitigation: fallback to DB—we accept higher latency, not data loss."


Graceful Degradation vs Hard Failure

Graceful Degradation

System continues to operate with reduced functionality when a component fails.

Examples:

  • Recommendations down: Show popular items instead of personalized
  • Search down: Show recent items or category browse
  • Image CDN down: Serve lower quality or placeholder
  • Real-time presence down: Show "last seen" instead of "online"
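All four examples follow the same shape: wrap the primary call so a failure returns a cheaper, always-available answer. A generic sketch (function and data names are illustrative):

```python
def with_fallback(primary, fallback):
    """Wrap a call so failures degrade gracefully instead of erroring.

    The fallback should be cheap and always available (precomputed data,
    a static placeholder), so the degraded path cannot itself fail.
    """
    def call(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except Exception:
            return fallback(*args, **kwargs)
    return call

# Recommendations down -> show popular items instead of personalized ones
def personalized(user_id):
    raise TimeoutError("recommender unavailable")

POPULAR = ["item-1", "item-2", "item-3"]
recommend = with_fallback(personalized, lambda user_id: POPULAR)
```

The same wrapper covers search → category browse and presence → "last seen"; the design choice is picking a fallback whose failure modes don't overlap with the primary's.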

Hard Failure

System returns an error when a component fails. No fallback.

When to choose hard failure:

  • Correctness critical: Payments, medical data, compliance
  • No sensible fallback: Auth failure = can't let user in
  • Simplicity: Fallback adds complexity; hard fail is clearer

Decision Framework

| Scenario | Graceful Degradation | Hard Failure |
|---|---|---|
| Non-critical feature | ✓ | |
| Revenue-impacting | ✓ (with monitoring) | |
| Data correctness | | ✓ |
| Security/auth | | ✓ |
| Simple system | | ✓ (fewer code paths) |

SLOs, SLIs, and Error Budgets in Design

Definitions

  • SLI (Service Level Indicator): A measurable value (e.g., latency p99, error rate, availability)
  • SLO (Service Level Objective): Target for the SLI (e.g., 99.9% availability, p99 < 200ms)
  • Error budget: Allowed failure (e.g., 0.1% = 43 min/month downtime)
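The error-budget arithmetic is worth making explicit; the 43 min/month figure falls out of the SLO directly (a sketch assuming a 30-day month):

```python
def error_budget_minutes(slo_availability, days=30):
    """Downtime allowed per period for a given availability SLO."""
    total_minutes = days * 24 * 60  # 43,200 for a 30-day month
    return total_minutes * (1 - slo_availability)

# 99.9% over a 30-day month -> 43.2 minutes of allowed downtime
# 99.99%                     ->  4.32 minutes
budget = error_budget_minutes(0.999)
```

The jump from three nines to four nines shrinks the budget by 10x, which is why each extra nine tends to demand a qualitatively different architecture, not just tuning.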

How SLOs Shape Design

If your SLO is 99.9% availability:

  • Single point of failure? One 4-hour outage blows the budget.
  • Need: Redundancy, failover, multi-AZ.

If your SLO is p99 latency < 200ms:

  • Cache misses fall through to DB reads (50ms+). At scale, those misses land in the tail, and p99 spikes.
  • Need: Caching strategy, connection pooling, query optimization.
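The link between cache hit rate and p99 can be sketched with a back-of-envelope two-mode model (all numbers here are illustrative):

```python
def p99_estimate(cache_hit_rate, cache_ms, db_ms):
    """Rough p99 under a two-mode latency model: cache hit vs. DB read.

    If more than 1% of requests miss the cache, the 99th-percentile
    request is a DB read, so p99 jumps from cache latency to DB latency.
    """
    miss_rate = 1 - cache_hit_rate
    return db_ms if miss_rate > 0.01 else cache_ms

# 95% hit rate: p99 is a DB read.  99.5% hit rate: p99 stays on the cache path.
```

The takeaway: a p99 < 200ms SLO is really a constraint on the miss rate, which is why "add a cache" alone isn't an answer without a target hit rate.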

If your error budget is 0.1%:

  • You can deploy 10 times with 0.01% failure each, or 1 deploy with 0.1% failure.
  • Error budget justifies risk-taking: "We have budget; we can ship this change."

Senior Insight

SLOs force you to make trade-offs explicit. "We can't have 99.99% availability with a single DB. We need multi-AZ + automated failover. That adds cost and complexity. Is 99.9% acceptable?" The business decision drives the architecture.


Case Study: Production Incident That Could Have Been Designed For

Scenario

A popular app stored all sessions in a single Redis instance. Redis ran out of memory and crashed, logging out every user. The team had to scale Redis and restore from persistence, leaving millions of users logged out for 30 minutes.

Failure-First Design Would Have

  1. Identified: Session store is single point of failure. Blast radius: all users.
  2. Designed: Redis Cluster or multi-instance with failover. Or: session in DB with cache—Redis down = slower session lookup, not total loss.
  3. SLO: If availability SLO is 99.9%, single Redis violates it.
  4. Mitigation: Eviction policy, memory limits, alerting. Graceful degradation: "Session expired, please log in again" with minimal disruption.

Lesson

The incident wasn't unpredictable. It was a known failure mode (Redis OOM) with high blast radius. Failure-first design would have addressed it before production.


Thinking Aloud Like a Senior Engineer

Problem: "Design a notification system. 100M users, email + push."

My first instinct: "API, queue, workers, email service, push service. Done."

But failure-first: What breaks? Queue backs up. Email provider is down. Push service rate-limits us. Workers crash.

Queue backup: Blast radius = delayed notifications. Mitigation: Dead letter queue, alerting, scale workers. Graceful: Notifications are eventually delivered. Acceptable.

Email provider down: Blast radius = no email for anyone. Mitigation: Fallback provider (e.g., SendGrid + SES). Or: Queue for retry. Graceful: Push still works; email delayed.

Push service rate limit: Blast radius = some users don't get push. Mitigation: Multiple push providers, batching, priority queuing. Graceful: Critical notifications first; rest delayed.

Workers crash: Messages stay in queue. Another worker picks up. At-least-once delivery. Idempotent handlers. No data loss.
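At-least-once delivery plus idempotent handlers can be sketched as follows (the in-memory `processed` set stands in for what would be a durable dedup store in production):

```python
sent = []            # stand-in for the real delivery side effect
processed = set()    # in production: a durable dedup store keyed by message id

def send_notification(user, body):
    sent.append((user, body))

def handle(message):
    """Idempotent handler: safe under at-least-once redelivery.

    If a worker crashes mid-message, the queue redelivers it; the dedup
    check turns the retry into a no-op instead of a duplicate send.
    """
    if message["id"] in processed:
        return "duplicate-skipped"
    send_notification(message["user"], message["body"])
    processed.add(message["id"])
    return "delivered"
```

One honest caveat: a crash between the send and the dedup record still allows one duplicate, which is why the handler's side effect itself should tolerate repeats (e.g., the push provider deduplicates by message id).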

SLO: 99.9% of notifications delivered within 5 minutes. Error budget allows 0.1% late or lost. We design for that.


How a Senior Engineer Thinks About Failure

  1. List failure modes for each component before designing
  2. Quantify blast radius: Users, requests, revenue
  3. Define degradation for each failure: What do users see?
  4. Connect to SLOs: Does this design meet our objectives?
  5. Design detection: How do we know when something has failed?

Best Practices

  1. Assume failure: Every component will fail; design for it
  2. Minimize blast radius: Isolate failures with boundaries
  3. Prefer graceful degradation for non-critical paths
  4. Use hard failure when correctness or security is at stake
  5. Define SLOs early: They drive redundancy and fallback design

Summary

Failure-first design thinking means:

  • Design for failure before scale
  • Identify failure modes and blast radius for key components
  • Choose graceful degradation or hard failure based on criticality
  • Use SLOs and error budgets to drive architectural decisions
  • Detect and recover automatically where possible

FAQs

Q: When is graceful degradation not worth it?

A: When the fallback is confusing (showing wrong data), when implementation cost is high for low-value feature, or when hard failure is simpler and acceptable.

Q: How do I introduce SLOs if we don't have them?

A: Start with one critical SLI (e.g., availability or latency). Measure current baseline. Set a target. Use it to justify one architectural change. Expand from there.

Q: What's the most common failure mode teams miss?

A: Cascading failure: one component fails, others overload and fail. Design for backpressure, timeouts, and circuit breakers to prevent cascade.
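A minimal circuit breaker, one of the standard cascade-prevention tools, can be sketched like this (thresholds and timings are illustrative):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: stop hammering a failing dependency.

    After `max_failures` consecutive failures the circuit opens and calls
    fail fast for `reset_after` seconds, shedding load so the struggling
    dependency can recover instead of dragging its callers down with it.
    """
    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```

Failing fast is the point: an open circuit converts slow, resource-holding timeouts into immediate errors, which is what keeps one overloaded dependency from exhausting threads and connections upstream.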
