Evolutionary Design
Design for change, not perfection. Migration strategies, schema evolution, backward compatibility, incremental rollouts, and feature flags. Staff-level thinking.
Staff engineers don't design for the present—they design for change. Systems evolve: requirements shift, scale grows, technology improves. Evolutionary design means building so that change is possible without catastrophic rewrites. This is staff-level thinking: planning for migrations, schema evolution, and incremental rollout before you need them.
Designing for Change, Not Perfection
The Reality
- Requirements change: Product pivots, new features, new constraints
- Scale changes: 10x growth forces different architecture
- Technology changes: New databases, frameworks, platforms emerge
- Org changes: Teams split, ownership shifts, Conway's Law applies
Evolutionary Design Principles
- Assume change: Don't optimize for today's snapshot
- Minimize lock-in: Avoid decisions that are hard to reverse
- Clear boundaries: Modular design makes replacement possible
- Version everything: APIs, schemas, configs
- Feature flags and gradual rollout: Ship change incrementally
Migration Strategies: Monolith to Microservices
Strangler Fig Pattern
Gradually replace monolith by routing new functionality to new services while keeping old code running.
Steps:
- Identify a bounded context to extract (e.g., "notifications")
- Create new service with the same interface as the monolith's module
- Route new traffic to new service via feature flag or routing rule
- Dual-write or sync data as needed
- Migrate reads to new service
- Migrate writes to new service
- Decommission old code in monolith
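The routing step above can be sketched as a small router with a feature flag. This is an illustrative sketch, not a real framework: `MonolithApp`, `NotificationsService`, and `StranglerRouter` are assumed names standing in for your actual monolith and extracted service.

```python
# Sketch: strangler-fig router that sends one bounded context ("notifications")
# to a new service while every other path stays in the monolith.
# All class names here are hypothetical stand-ins.

class MonolithApp:
    def handle(self, path: str) -> str:
        return f"monolith handled {path}"

class NotificationsService:
    def handle(self, path: str) -> str:
        return f"notifications-service handled {path}"

class StranglerRouter:
    def __init__(self, extract_enabled: bool = False):
        self.monolith = MonolithApp()
        self.notifications = NotificationsService()
        self.extract_enabled = extract_enabled  # feature flag / routing rule

    def route(self, path: str) -> str:
        # Only the extracted bounded context is redirected; everything else
        # continues to hit the old code path untouched.
        if self.extract_enabled and path.startswith("/notifications"):
            return self.notifications.handle(path)
        return self.monolith.handle(path)

router = StranglerRouter(extract_enabled=True)
print(router.route("/notifications/send"))  # handled by the new service
print(router.route("/orders/42"))           # still the monolith
```

Because the flag gates the redirect, rollback is just disabling it; the monolith code path never went away.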
Parallel Run
Run old and new systems in parallel, compare results, switch when confident.
- Use case: Critical path (payments, orders) where errors are costly
- Cost: 2x infra during migration
- Benefit: Validation before cutover
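A minimal sketch of a parallel run, assuming a hypothetical `compute_total` on the critical path: the old implementation stays the source of truth, the new one runs shadowed, and mismatches are recorded instead of served.

```python
# Sketch: shadow the new implementation behind the old one and log
# disagreements. Both compute_total_* functions are illustrative stand-ins.

mismatches = []

def compute_total_old(items):
    return sum(items)

def compute_total_new(items):
    return sum(items)  # candidate replacement under validation

def compute_total(items):
    old = compute_total_old(items)
    try:
        new = compute_total_new(items)
        if new != old:
            mismatches.append((items, old, new))  # record, don't fail the request
    except Exception as exc:
        mismatches.append((items, old, exc))     # new system's bugs stay invisible
    return old  # old system remains authoritative until cutover
```

Cutover is justified only once the mismatch log stays empty under real traffic.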
Database Migration Strategies
| Strategy | Downtime | Risk | Use Case |
|---|---|---|---|
| Big bang | Yes | High | Rarely, small datasets |
| Dual-write, then cutover | Minimal | Medium | Most common |
| Change data capture (CDC) | None | Low | Large, high-traffic |
| Read replicas, flip | Brief | Low | Read-heavy |
| Logical replication | None | Low | Postgres, etc. |
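The "dual-write, then cutover" row can be sketched in a few lines. The dicts below are stand-ins for the two real databases; the read flag is what gets flipped at cutover.

```python
# Sketch of dual-write with a read cutover flag. Plain dicts stand in for
# the old and new data stores.

old_db, new_db = {}, {}

READ_FROM_NEW = False  # flipped once the new store is backfilled and verified

def write_user(user_id, record):
    old_db[user_id] = record
    new_db[user_id] = record  # dual-write keeps both stores converging

def read_user(user_id):
    return (new_db if READ_FROM_NEW else old_db).get(user_id)
```

Because both stores receive every write, flipping `READ_FROM_NEW` back is a safe rollback at any point before the old store is decommissioned.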
Real Example: Netflix
Netflix migrated from a monolith to microservices over several years. They used:
- Strangler fig for most services
- Chaos engineering to validate resilience
- Feature flags to route traffic gradually
- Multiple phases: Not one big migration, many small ones
Schema Evolution & Backward Compatibility
The Challenge
- Schema changes are inevitable: new fields, renames, type changes
- Backward compatibility: Old clients must work with new schema
- Forward compatibility: New clients must work with old schema (during rollout)
Strategies
Additive changes (safe):
- Add optional fields
- Add new tables/collections
- Add new endpoints
Breaking changes (risky):
- Remove fields
- Change types
- Rename fields
- Change semantics
Handling Breaking Changes
- Versioned APIs: /v1/users, /v2/users. Old clients stay on v1.
- Deprecation period: Announce removal, give clients time to migrate, then remove
- Dual-write: Write to both old and new format during transition
- Expand-contract: Add new field (expand), migrate consumers, remove old (contract)
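Expand-contract can be sketched with a tolerant reader. Assume a hypothetical rename of `name` to `full_name`: during the expand phase writers populate both fields, and readers prefer the new one but fall back to the old.

```python
# Sketch of expand-contract for renaming "name" -> "full_name".
# Contract (dropping "name") happens only after all consumers read the
# new field.

def write_profile(full_name: str) -> dict:
    # Expand phase: write old and new field side by side.
    return {"name": full_name, "full_name": full_name}

def read_profile(record: dict) -> str:
    # Tolerant reader: prefer the new field, fall back to the old one,
    # so pre- and post-migration records coexist safely.
    return record.get("full_name") or record["name"]

assert read_profile({"name": "Ada"}) == "Ada"                 # old record
assert read_profile(write_profile("Ada Lovelace")) == "Ada Lovelace"
```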
Example: Adding a Required Field
Wrong: Add required field, deploy. Old clients fail.
Right:
- Add field as optional. Deploy.
- Backfill data. Ensure all records have value.
- Make required in new version. Old API still accepts without it (default).
- Migrate consumers to send it.
- Eventually remove old API version.
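The steps above can be sketched as two handler versions. The field name `locale` and its default are assumptions for illustration: v1 backfills a default so old clients keep working, and only the later version enforces the requirement.

```python
# Sketch: add a "required" field without breaking old clients.
# "locale" and DEFAULT_LOCALE are hypothetical.

DEFAULT_LOCALE = "en"  # server-side default covers clients that omit the field

def create_user_v1(payload: dict) -> dict:
    # v1 keeps accepting payloads without "locale" by applying the default.
    return {"email": payload["email"],
            "locale": payload.get("locale", DEFAULT_LOCALE)}

def create_user_v2(payload: dict) -> dict:
    # v2, shipped after backfill and consumer migration, requires the field.
    if "locale" not in payload:
        raise ValueError("locale is required in v2")
    return {"email": payload["email"], "locale": payload["locale"]}
```

Old clients stay on v1 until the deprecation window closes; nothing breaks on deploy day.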
Real Example: Stripe API
Stripe versions its API with a stable /v1/ path plus date-based versions (e.g., 2023-10-16). Fields are added additively; renames and removals go through a deprecation process, and old versions remain supported for years.
Incremental Rollouts and Feature Flags
Why Incremental?
- Reduce risk: One bad deploy doesn't affect everyone
- Validate in production: 1% traffic can surface issues
- Easy rollback: Turn off flag, no redeploy
- A/B testing: Compare old vs new behavior
Rollout Strategies
| Strategy | Use Case | Rollback |
|---|---|---|
| Percentage rollout | 1% → 10% → 50% → 100% | Reduce % |
| Canary | New version for one server/group | Route back |
| User segment | Internal users, beta users first | Exclude segment |
| Geographic | One region first | Route away |
| Kill switch | Feature flag to disable | Flip flag |
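The percentage-rollout row is usually implemented by hashing a stable identifier rather than random sampling, so each user gets a sticky decision. A minimal sketch, with an assumed flag name:

```python
# Sketch: deterministic percentage rollout. Hashing (flag, user_id) gives each
# user a stable bucket, so raising the percentage only ever adds users.

import hashlib

def in_rollout(user_id: str, percent: int, flag: str = "new-checkout") -> bool:
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    return bucket < percent

# Monotonic: any user enabled at 10% is still enabled at 50%.
assert all(in_rollout(u, 50) for u in ("a", "b", "c") if in_rollout(u, 10))
```

Keying the hash on the flag name means different flags bucket users independently, so one experiment's cohort doesn't leak into another's.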
Feature Flags in Design
When designing, consider:
- Where do we need flags? New code paths, experiments, migrations
- How do we clean up? Flags have cost: complexity, tech debt
- Who controls flags? Eng, product, ops
- What's the blast radius? One flag or many?
Senior Insight
"Design the rollout before you design the feature. If you can't roll it out incrementally, you'll either delay launch or risk a big-bang deploy. Both are costly." — Plan for rollout as part of the design.
Case Studies: Netflix, Spotify, Stripe
Netflix
- Migration: DVD to streaming, datacenter to cloud
- Approach: Phased migration, chaos engineering, regional rollout
- Lesson: Multi-year journey, not one project. Evolve continuously.
Spotify
- Squad model: Small teams own services. Conway's Law in action.
- Migration: Monolith to small, squad-owned services
- Lesson: Org structure drove service boundaries. Migration followed team autonomy.
Stripe
- API versioning: Multiple versions live. Deprecation with long runway.
- Schema evolution: Additive changes, expand-contract for breaking
- Lesson: Backward compatibility is a product commitment. Plan for it.
Thinking Aloud Like a Senior Engineer
Problem: "We need to migrate from MySQL to PostgreSQL. 100M rows, high traffic."
My first instinct: "Dual-write, sync, cutover."
But let me think about phases:
- Phase 1: Add PostgreSQL as read replica. Sync via CDC or dual-write. Validate data.
- Phase 2: Route read traffic to PostgreSQL (percentage-based). Compare results.
- Phase 3: Switch writes. Use feature flag: new writes go to both, or only PG with MySQL as fallback.
- Phase 4: Migrate remaining reads. Decommission MySQL.
Rollback: At each phase, we can revert. Phase 2: route back to MySQL. Phase 3: write to MySQL only. No big bang.
Schema: PostgreSQL and MySQL differ in SQL dialect and data types. We need an abstraction or adapter. Or: keep the same logical schema in both during migration. Extra work but simpler.
Downtime: Zero if we do it right. Dual-write, then cutover writes, then cutover reads. Brief inconsistency window? Use distributed transaction or accept eventual consistency for that window.
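The four phases above can be sketched as an explicit read/write router. The phase number is passed in for clarity, and the dicts stand in for real MySQL and PostgreSQL connections; this is a sketch of the routing logic, not a driver-level implementation.

```python
# Sketch: phased MySQL -> PostgreSQL cutover.
# Phase 1: MySQL primary, dual-write syncs PG.  Phase 2: reads move to PG.
# Phase 3: writes still go to both.             Phase 4: PG only.
from typing import Optional

mysql_db, pg_db = {}, {}

def write(phase: int, key: str, value: str) -> None:
    if phase <= 3:
        mysql_db[key] = value  # MySQL keeps receiving writes until phase 4
    pg_db[key] = value         # dual-write keeps PostgreSQL in sync from phase 1

def read(phase: int, key: str) -> Optional[str]:
    # Phase 1 still trusts MySQL; phases 2+ read from PostgreSQL.
    return (pg_db if phase >= 2 else mysql_db).get(key)

# Rollback at any phase before 4 is just lowering the phase number:
# the other store never stopped receiving writes.
```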
Best Practices
- Assume migration: Design so components can be replaced
- Version APIs and schemas: From day one
- Prefer additive changes: Avoid breaking changes when possible
- Plan rollout: Percentage, canary, region—before building
- Clean up flags: Technical debt if left forever
Summary
Evolutionary design means:
- Design for change—modular, replaceable components
- Migration strategies—strangler fig, parallel run, phased cutover
- Schema evolution—additive changes, versioning, deprecation
- Incremental rollout—feature flags, percentage rollout, canary
- Avoid big-bang—many small steps, each reversible
FAQs
Q: When is a big-bang migration acceptable?
A: Rarely. Only when: small dataset, low traffic, short downtime acceptable, and no incremental path is feasible. Even then, consider if there's a way to do it in phases.
Q: How do we handle schema changes in a distributed system?
A: Version the schema. Support multiple versions during transition. Use expand-contract: add new, migrate, remove old. CDC can help with async sync.
Q: How many feature flags are too many?
A: When they're hard to reason about, slow down release, or never get cleaned up. Aim to remove flags after rollout. Use a flag management system to track lifecycle.