Failure-First Design Thinking
Design for failure before scale. Learn failure modes, blast radius analysis, graceful degradation vs hard failure, and SLOs/SLIs/error budgets in system design.
Senior engineers design for failure before they design for scale. While juniors focus on the happy path, seniors ask: "What breaks first? Who does it affect? How do we degrade gracefully?" This mindset—failure-first design thinking—is a strong signal in system design interviews and in production.
Designing for Failure Before Scale
Why Failure First?
- Failure is inevitable: Disks die, networks partition, services crash. Design assumes it.
- Scale amplifies failure: At 1M users, a 0.1% failure rate is 1,000 unhappy users. At 100M, it's 100,000.
- Recovery time matters: Can you fail fast, detect quickly, and recover automatically?
- User impact matters: Does the whole system go down, or do users see degraded but working service?
The Failure-First Mindset
Before drawing boxes and arrows:
- List what can fail (disk, network, service, dependency, region)
- For each, ask: What's the blast radius? Who is affected?
- Define: What must never fail? What can degrade?
- Design: Isolation, fallbacks, and detection
Failure Modes & Blast Radius Analysis
Common Failure Modes
| Component | Failure Mode | Example |
|---|---|---|
| Database | Unavailable, slow, corrupted | Primary dies, replication lag |
| Cache | Miss, stale, unavailable | Redis OOM, network partition |
| Queue | Backlog, lost messages, slow consumer | Kafka broker down, consumer crash |
| External API | Timeout, 5xx, rate limit | Payment provider outage |
| Network | Partition, latency spike, packet loss | AZ outage, cross-region latency |
| Disk | Full, slow I/O, corruption | EBS failure, disk full |
Blast Radius
Blast radius = How many users, requests, or systems are affected when this component fails?
Example: Single Redis instance
- Blast radius: All users whose requests hit that cache
- Mitigation: Cache is a performance optimization; fall back to DB. Blast radius = increased latency for all, not total failure.
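A minimal sketch of this fallback, with hypothetical `cache` and `db` objects standing in for Redis and the database:

```python
def get_user(user_id, cache, db):
    """Read-through cache: a cache outage degrades to slower DB reads."""
    try:
        cached = cache.get(user_id)
        if cached is not None:
            return cached
    except ConnectionError:
        pass  # Cache down: blast radius is added latency, not unavailability
    value = db.get(user_id)  # Fall back to the authoritative source
    try:
        cache.set(user_id, value)  # Best-effort repopulation
    except ConnectionError:
        pass
    return value
```

The key design choice: the cache is never the source of truth, so its failure can never be a correctness failure.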
Example: Single database primary
- Blast radius: All write traffic, possibly all reads if no replicas
- Mitigation: Failover to replica. Blast radius = brief write unavailability during failover.
Example: Single-region deployment
- Blast radius: Entire user base if region goes down
- Mitigation: Multi-region. Blast radius = users in one region during regional outage.
Senior Interview Move
Explicitly state: "Let me analyze failure modes. For the database, the main failure mode is primary outage. Blast radius: all writes. Mitigation: synchronous replica + automated failover. For the cache, failure mode is Redis down. Blast radius: latency for all reads. Mitigation: fallback to DB—we accept higher latency, not data loss."
Graceful Degradation vs Hard Failure
Graceful Degradation
System continues to operate with reduced functionality when a component fails.
Examples:
- Recommendations down: Show popular items instead of personalized
- Search down: Show recent items or category browse
- Image CDN down: Serve lower quality or placeholder
- Real-time presence down: Show "last seen" instead of "online"
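These fallbacks all share one shape: try the rich path, degrade to the cheap one. A minimal sketch, with hypothetical `personalized_recs` and `popular_items` functions standing in for real services:

```python
def with_fallback(primary, fallback):
    """Run the primary path; on any failure, degrade to the fallback."""
    try:
        return primary()
    except Exception:
        return fallback()  # Reduced functionality beats an error page

def personalized_recs(user_id):
    raise TimeoutError("recommendation service down")  # Simulated outage

def popular_items():
    return ["item-1", "item-2", "item-3"]  # Precomputed, always available

items = with_fallback(lambda: personalized_recs("u42"), popular_items)
```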
Hard Failure
System returns an error when a component fails. No fallback.
When to choose hard failure:
- Correctness critical: Payments, medical data, compliance
- No sensible fallback: Auth failure = can't let user in
- Simplicity: Fallback adds complexity; hard fail is clearer
Decision Framework
| Scenario | Graceful Degradation | Hard Failure |
|---|---|---|
| Non-critical feature | ✓ | |
| Revenue-impacting | ✓ (with monitoring) | |
| Data correctness | | ✓ |
| Security/auth | | ✓ |
| Simple system | | ✓ (fewer code paths) |
SLOs, SLIs, and Error Budgets in Design
Definitions
- SLI (Service Level Indicator): A measurable value (e.g., latency p99, error rate, availability)
- SLO (Service Level Objective): Target for the SLI (e.g., 99.9% availability, p99 < 200ms)
- Error budget: Allowed failure (e.g., a 99.9% SLO leaves 0.1% ≈ 43 minutes of downtime per month)
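The error budget arithmetic is simple enough to sketch (assuming a 30-day month):

```python
def error_budget_minutes(slo, period_minutes=30 * 24 * 60):
    """Allowed downtime per period for an availability SLO.

    slo is a fraction, e.g. 0.999 for 99.9%.
    """
    return (1 - slo) * period_minutes

# 99.9% over a 30-day month -> about 43.2 minutes of allowed downtime
budget = error_budget_minutes(0.999)
```

Running the same calculation for 99.99% gives about 4.3 minutes per month, which is why each extra nine demands a step change in redundancy.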
How SLOs Shape Design
If your SLO is 99.9% availability:
- Single point of failure? One 4-hour outage blows the budget.
- Need: Redundancy, failover, multi-AZ.
If your SLO is p99 latency < 200ms:
- Cache misses cause DB reads = 50ms+. At scale, p99 can spike.
- Need: Caching strategy, connection pooling, query optimization.
If your error budget is 0.1%:
- You can deploy 10 times with 0.01% failure each, or 1 deploy with 0.1% failure.
- Error budget justifies risk-taking: "We have budget; we can ship this change."
Senior Insight
SLOs force you to make trade-offs explicit. "We can't have 99.99% availability with a single DB. We need multi-AZ + automated failover. That adds cost and complexity. Is 99.9% acceptable?" The business decision drives the architecture.
Case Study: Production Incident That Could Have Been Designed For
Scenario
A popular app had a single Redis instance for session storage. Redis ran out of memory and crashed, logging out every user. The team had to scale Redis and restore from persistence: 30 minutes of forced logout for millions of users.
Failure-First Design Would Have
- Identified: Session store is single point of failure. Blast radius: all users.
- Designed: Redis Cluster or multi-instance with failover. Or: session in DB with cache—Redis down = slower session lookup, not total loss.
- SLO: If availability SLO is 99.9%, single Redis violates it.
- Mitigation: Eviction policy, memory limits, alerting. Graceful degradation: "Session expired, please log in again" with minimal disruption.
Lesson
The incident wasn't unpredictable. It was a known failure mode (Redis OOM) with high blast radius. Failure-first design would have addressed it before production.
Thinking Aloud Like a Senior Engineer
Problem: "Design a notification system. 100M users, email + push."
My first instinct: "API, queue, workers, email service, push service. Done."
But failure-first: What breaks? Queue backs up. Email provider is down. Push service rate-limits us. Workers crash.
Queue backup: Blast radius = delayed notifications. Mitigation: Dead letter queue, alerting, scale workers. Graceful: Notifications are eventually delivered. Acceptable.
Email provider down: Blast radius = no email for anyone. Mitigation: Fallback provider (e.g., SendGrid + SES). Or: Queue for retry. Graceful: Push still works; email delayed.
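A provider-fallback chain can be sketched as below; `providers` is a hypothetical list of send functions (e.g., one wrapping SendGrid, one wrapping SES), not an API from the text:

```python
def send_email(message, providers):
    """Try providers in order; each fallback shrinks the blast radius."""
    errors = []
    for send in providers:
        try:
            return send(message)
        except Exception as exc:
            errors.append(exc)  # Record the failure and try the next provider
    # Hard failure only when every provider fails; upstream queues for retry
    raise RuntimeError(f"all providers failed: {errors}")
```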
Push service rate limit: Blast radius = some users don't get push. Mitigation: Multiple push providers, batching, priority queuing. Graceful: Critical notifications first; rest delayed.
Workers crash: Messages stay in queue. Another worker picks up. At-least-once delivery. Idempotent handlers. No data loss.
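An idempotent handler can be sketched like this; the in-memory `processed` set is a stand-in for a persistent deduplication store keyed by message ID:

```python
processed = set()  # In production: a durable store, not process memory

def handle(message):
    """At-least-once delivery means duplicates; idempotency makes them safe."""
    if message["id"] in processed:
        return "skipped"  # Duplicate redelivery after a worker crash
    # ... apply the side effect (send the notification) exactly once ...
    processed.add(message["id"])
    return "delivered"
```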
SLO: 99.9% of notifications delivered within 5 minutes. Error budget allows 0.1% late or lost. We design for that.
How a Senior Engineer Thinks About Failure
- List failure modes for each component before designing
- Quantify blast radius: Users, requests, revenue
- Define degradation for each failure: What do users see?
- Connect to SLOs: Does this design meet our objectives?
- Design detection: How do we know when something has failed?
Best Practices
- Assume failure: Every component will fail; design for it
- Minimize blast radius: Isolate failures with boundaries
- Prefer graceful degradation for non-critical paths
- Use hard failure when correctness or security is at stake
- Define SLOs early: They drive redundancy and fallback design
Summary
Failure-first design thinking means:
- Design for failure before scale
- Identify failure modes and blast radius for key components
- Choose graceful degradation or hard failure based on criticality
- Use SLOs and error budgets to drive architectural decisions
- Detect and recover automatically where possible
FAQs
Q: When is graceful degradation not worth it?
A: When the fallback is confusing (showing wrong data), when implementation cost is high for low-value feature, or when hard failure is simpler and acceptable.
Q: How do I introduce SLOs if we don't have them?
A: Start with one critical SLI (e.g., availability or latency). Measure current baseline. Set a target. Use it to justify one architectural change. Expand from there.
Q: What's the most common failure mode teams miss?
A: Cascading failure: one component fails, others overload and fail. Design for backpressure, timeouts, and circuit breakers to prevent cascade.
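A minimal circuit breaker sketch, to make the cascade-prevention idea concrete (thresholds and timings are illustrative, not prescriptive):

```python
import time

class CircuitBreaker:
    """Open after `threshold` consecutive failures; fail fast while open."""

    def __init__(self, threshold=3, reset_after=30.0):
        self.threshold = threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # Half-open: allow one trial call through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()  # Trip: stop hammering the dependency
            raise
        self.failures = 0  # Any success resets the count
        return result
```

Failing fast while the circuit is open is what stops an overloaded dependency from dragging its callers down with it.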