Failure-First Design Thinking
Design for failure before scale. Learn failure modes, blast radius analysis, graceful degradation vs hard failure, and SLOs/SLIs/error budgets in system design.
Senior engineers design for failure before they design for scale. While juniors focus on the happy path, seniors ask: "What breaks first? Who does it affect? How do we degrade gracefully?" This mindset—failure-first design thinking—is a strong signal in system design interviews and in production.
Designing for Failure Before Scale
Why Failure First?
- Failure is inevitable: Disks die, networks partition, services crash. Design assumes it.
- Scale amplifies failure: At 1M users, a 0.1% failure rate is 1,000 unhappy users. At 100M, it's 100,000.
- Recovery time matters: Can you fail fast, detect quickly, and recover automatically?
- User impact matters: Does the whole system go down, or do users see degraded but working service?
The Failure-First Mindset
Before drawing boxes and arrows:
- List what can fail (disk, network, service, dependency, region)
- For each, ask: What's the blast radius? Who is affected?
- Define: What must never fail? What can degrade?
- Design: Isolation, fallbacks, and detection
Failure Modes & Blast Radius Analysis
Common Failure Modes
| Component | Failure Mode | Example |
|---|---|---|
| Database | Unavailable, slow, corrupted | Primary dies, replication lag |
| Cache | Miss, stale, unavailable | Redis OOM, network partition |
| Queue | Backlog, lost messages, slow consumer | Kafka broker down, consumer crash |
| External API | Timeout, 5xx, rate limit | Payment provider outage |
| Network | Partition, latency spike, packet loss | AZ outage, cross-region latency |
| Disk | Full, slow I/O, corruption | EBS failure, disk full |
Blast Radius
Blast radius = How many users, requests, or systems are affected when this component fails?
Example: Single Redis instance
- Blast radius: All users whose requests hit that cache
- Mitigation: Cache is a performance optimization; fall back to DB. Blast radius = increased latency for all, not total failure.
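A minimal sketch of this fallback, with hypothetical `cache` and `db` objects standing in for Redis and the database:

```python
def get_user(user_id, cache, db):
    """Read-through cache: a cache outage degrades to slower DB reads."""
    try:
        cached = cache.get(user_id)
        if cached is not None:
            return cached
    except ConnectionError:
        pass  # Cache down: blast radius is added latency, not unavailability
    value = db.get(user_id)  # Fall back to the authoritative source
    try:
        cache.set(user_id, value)  # Best-effort repopulation
    except ConnectionError:
        pass
    return value
```

The key design choice: the cache is never the source of truth, so its failure can never be a correctness failure.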
Example: Single database primary
- Blast radius: All write traffic, possibly all reads if no replicas
- Mitigation: Failover to replica. Blast radius = brief write unavailability during failover.
Example: Single-region deployment
- Blast radius: Entire user base if region goes down
- Mitigation: Multi-region. Blast radius = users in one region during regional outage.
Senior Interview Move
Explicitly state: "Let me analyze failure modes. For the database, the main failure mode is primary outage. Blast radius: all writes. Mitigation: synchronous replica + automated failover. For the cache, failure mode is Redis down. Blast radius: latency for all reads. Mitigation: fallback to DB—we accept higher latency, not data loss."
Graceful Degradation vs Hard Failure
Graceful Degradation
System continues to operate with reduced functionality when a component fails.
Examples:
- Recommendations down: Show popular items instead of personalized
- Search down: Show recent items or category browse
- Image CDN down: Serve lower quality or placeholder
- Real-time presence down: Show "last seen" instead of "online"
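These fallbacks all share one shape: try the rich path, degrade to the cheap one. A minimal sketch, with hypothetical `personalized_recs` and `popular_items` functions standing in for real services:

```python
def with_fallback(primary, fallback):
    """Run the primary path; on any failure, degrade to the fallback."""
    try:
        return primary()
    except Exception:
        return fallback()  # Reduced functionality beats an error page

def personalized_recs(user_id):
    raise TimeoutError("recommendation service down")  # Simulated outage

def popular_items():
    return ["item-1", "item-2", "item-3"]  # Precomputed, always available

items = with_fallback(lambda: personalized_recs("u42"), popular_items)
```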
Hard Failure
System returns an error when a component fails. No fallback.
When to choose hard failure:
- Correctness critical: Payments, medical data, compliance
- No sensible fallback: Auth failure = can't let user in
- Simplicity: Fallback adds complexity; hard fail is clearer
Decision Framework
| Scenario | Graceful Degradation | Hard Failure |
|---|---|---|
| Non-critical feature | ✓ | |
| Revenue-impacting | ✓ (with monitoring) | |
| Data correctness | | ✓ |
| Security/auth | | ✓ |
| Simple system | | ✓ (fewer code paths) |
SLOs, SLIs, and Error Budgets in Design
Definitions
- SLI (Service Level Indicator): A measurable value (e.g., latency p99, error rate, availability)
- SLO (Service Level Objective): Target for the SLI (e.g., 99.9% availability, p99 < 200ms)
- Error budget: Allowed failure (e.g., a 99.9% SLO leaves 0.1% ≈ 43 minutes of downtime per month)
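The error budget arithmetic is simple enough to sketch (assuming a 30-day month):

```python
def error_budget_minutes(slo, period_minutes=30 * 24 * 60):
    """Allowed downtime per period for an availability SLO.

    slo is a fraction, e.g. 0.999 for 99.9%.
    """
    return (1 - slo) * period_minutes

# 99.9% over a 30-day month -> about 43.2 minutes of allowed downtime
budget = error_budget_minutes(0.999)
```

Running the same calculation for 99.99% gives about 4.3 minutes per month, which is why each extra nine demands a step change in redundancy.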
How SLOs Shape Design
If your SLO is 99.9% availability:
- Single point of failure? One 4-hour outage blows the budget.
- Need: Redundancy, failover, multi-AZ.
If your SLO is p99 latency < 200ms:
- Cache misses cause DB reads = 50ms+. At scale, p99 can spike.
- Need: Caching strategy, connection pooling, query optimization.
If your error budget is 0.1%:
- You can deploy 10 times with 0.01% failure each, or 1 deploy with 0.1% failure.
- Error budget justifies risk-taking: "We have budget; we can ship this change."
Senior Insight
SLOs force you to make trade-offs explicit. "We can't have 99.99% availability with a single DB. We need multi-AZ + automated failover. That adds cost and complexity. Is 99.9% acceptable?" The business decision drives the architecture.
Case Study: Production Incident That Could Have Been Designed For
Scenario
A popular app had a single Redis instance for session storage. Redis ran out of memory and crashed, logging out every user. The team had to scale Redis and restore from persistence: 30 minutes of forced logout for millions of users.
Failure-First Design Would Have
- Identified: Session store is single point of failure. Blast radius: all users.
- Designed: Redis Cluster or multi-instance with failover. Or: session in DB with cache—Redis down = slower session lookup, not total loss.
- SLO: If availability SLO is 99.9%, single Redis violates it.
- Mitigation: Eviction policy, memory limits, alerting. Graceful degradation: "Session expired, please log in again" with minimal disruption.
Lesson
The incident wasn't unpredictable. It was a known failure mode (Redis OOM) with high blast radius. Failure-first design would have addressed it before production.
Thinking Aloud Like a Senior Engineer
Problem: "Design a notification system. 100M users, email + push."
My first instinct: "API, queue, workers, email service, push service. Done."
But failure-first: What breaks? Queue backs up. Email provider is down. Push service rate-limits us. Workers crash.
Queue backup: Blast radius = delayed notifications. Mitigation: Dead letter queue, alerting, scale workers. Graceful: Notifications are eventually delivered. Acceptable.
Email provider down: Blast radius = no email for anyone. Mitigation: Fallback provider (e.g., SendGrid + SES). Or: Queue for retry. Graceful: Push still works; email delayed.
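A provider-fallback chain can be sketched as below; `providers` is a hypothetical list of send functions (e.g., one wrapping SendGrid, one wrapping SES), not an API from the text:

```python
def send_email(message, providers):
    """Try providers in order; each fallback shrinks the blast radius."""
    errors = []
    for send in providers:
        try:
            return send(message)
        except Exception as exc:
            errors.append(exc)  # Record the failure and try the next provider
    # Hard failure only when every provider fails; upstream queues for retry
    raise RuntimeError(f"all providers failed: {errors}")
```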
Push service rate limit: Blast radius = some users don't get push. Mitigation: Multiple push providers, batching, priority queuing. Graceful: Critical notifications first; rest delayed.
Workers crash: Messages stay in queue. Another worker picks up. At-least-once delivery. Idempotent handlers. No data loss.
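An idempotent handler can be sketched like this; the in-memory `processed` set is a stand-in for a persistent deduplication store keyed by message ID:

```python
processed = set()  # In production: a durable store, not process memory

def handle(message):
    """At-least-once delivery means duplicates; idempotency makes them safe."""
    if message["id"] in processed:
        return "skipped"  # Duplicate redelivery after a worker crash
    # ... apply the side effect (send the notification) exactly once ...
    processed.add(message["id"])
    return "delivered"
```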
SLO: 99.9% of notifications delivered within 5 minutes. Error budget allows 0.1% late or lost. We design for that.
How a Senior Engineer Thinks About Failure
- List failure modes for each component before designing
- Quantify blast radius: Users, requests, revenue
- Define degradation for each failure: What do users see?
- Connect to SLOs: Does this design meet our objectives?
- Design detection: How do we know when something has failed?
Best Practices
- Assume failure: Every component will fail; design for it
- Minimize blast radius: Isolate failures with boundaries
- Prefer graceful degradation for non-critical paths
- Use hard failure when correctness or security is at stake
- Define SLOs early: They drive redundancy and fallback design
Summary
Failure-first design thinking means:
- Design for failure before scale
- Identify failure modes and blast radius for key components
- Choose graceful degradation or hard failure based on criticality
- Use SLOs and error budgets to drive architectural decisions
- Detect and recover automatically where possible
FAQs
Q: When is graceful degradation not worth it?
A: When the fallback is confusing (showing wrong data), when implementation cost is high for low-value feature, or when hard failure is simpler and acceptable.
Q: How do I introduce SLOs if we don't have them?
A: Start with one critical SLI (e.g., availability or latency). Measure current baseline. Set a target. Use it to justify one architectural change. Expand from there.
Q: What's the most common failure mode teams miss?
A: Cascading failure: one component fails, others overload and fail. Design for backpressure, timeouts, and circuit breakers to prevent cascade.
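A minimal circuit breaker sketch, to make the cascade-prevention idea concrete (thresholds and timings are illustrative, not prescriptive):

```python
import time

class CircuitBreaker:
    """Open after `threshold` consecutive failures; fail fast while open."""

    def __init__(self, threshold=3, reset_after=30.0):
        self.threshold = threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # Half-open: allow one trial call through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()  # Trip: stop hammering the dependency
            raise
        self.failures = 0  # Any success resets the count
        return result
```

Failing fast while the circuit is open is what stops an overloaded dependency from dragging its callers down with it.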