Real Engineering Stories

The Monolith to Microservices Migration That Almost Failed

A production incident during a monolith-to-microservices migration where service dependencies and data consistency issues caused cascading failures. Learn about migration strategies, service boundaries, and data consistency in distributed systems.

Advanced · 30 min read

This is a story about how we tried to migrate from a monolith to microservices and almost broke everything. It's also about why migrations are harder than building from scratch, and how we learned to migrate incrementally rather than all at once.


Context

We were running a monolithic e-commerce application that handled orders, payments, inventory, and shipping. As we grew, the monolith became hard to scale and deploy. We decided to migrate to microservices.

Original Architecture:

graph TB
    Client[Client] --> Monolith[Monolith<br/>Orders, Payments,<br/>Inventory, Shipping]
    Monolith --> DB[(Single Database)]

Technology Choices:

  • Monolith: Node.js application
  • Database: PostgreSQL (single database for all services)
  • Deployment: Single deployment unit

Assumptions Made:

  • Microservices would be easier to scale
  • Service boundaries were clear
  • Data consistency would be maintained
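
That last assumption was easy to make because, in the monolith, the whole order flow ran inside a single database transaction: either everything committed or everything rolled back. Here is a simplified, hypothetical sketch of that flow using node-postgres; the table names, columns, and payment handling are illustrative, not our actual schema.

import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// One ACID transaction covers the whole flow: either every row commits,
// or everything rolls back and nothing partial persists.
async function placeOrder(userId: string, itemId: string, qty: number, amountCents: number): Promise<string> {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");
    const { rows } = await client.query(
      "INSERT INTO orders (user_id, item_id, qty, status) VALUES ($1, $2, $3, 'pending') RETURNING id",
      [userId, itemId, qty]
    );
    const orderId: string = rows[0].id;
    // Reserve stock; the guard clause fails the whole transaction if there isn't enough.
    const reserved = await client.query(
      "UPDATE inventory SET available = available - $1 WHERE item_id = $2 AND available >= $1",
      [qty, itemId]
    );
    if (reserved.rowCount === 0) throw new Error("insufficient stock");
    // Payment capture against the external provider is simplified away here.
    await client.query(
      "INSERT INTO payments (order_id, amount_cents, status) VALUES ($1, $2, 'captured')",
      [orderId, amountCents]
    );
    await client.query("COMMIT");
    return orderId;
  } catch (err) {
    await client.query("ROLLBACK"); // nothing partial ever persists
    throw err;
  } finally {
    client.release();
  }
}

Once orders, payments, and inventory live in separate services with separate databases, that single BEGIN/COMMIT no longer exists, which is exactly where the trouble started.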

The Incident

Timeline:

  • Week 1: Order service extracted from monolith
  • Week 2: Payment service extracted
  • Week 3: Inventory service extracted
  • Week 4: Shipping service extracted
  • Week 4, Monday 10:00 AM: All services deployed, monolith decommissioned
  • Week 4, Monday 10:15 AM: First order placed through microservices
  • Week 4, Monday 10:16 AM: Order service created order
  • Week 4, Monday 10:17 AM: Payment service processed payment
  • Week 4, Monday 10:18 AM: Inventory service failed to reserve items (service down)
  • Week 4, Monday 10:19 AM: Order stuck in "processing" state
  • Week 4, Monday 10:20 AM: Payment processed but inventory not reserved
  • Week 4, Monday 10:25 AM: Multiple orders stuck, on-call paged
  • Week 4, Monday 11:00 AM: Identified service dependency issues
  • Week 4, Monday 12:00 PM: Rolled back to monolith
  • Week 4, Monday 2:00 PM: Services restored, but 100 orders in inconsistent state

Symptoms

What We Saw:

  • Stuck Orders: Orders in "processing" state, never completing
  • Data Inconsistency: Payments processed but inventory not reserved
  • Service Dependencies: Services failing when dependencies were down
  • Error Rate: Increased from 0.1% to 15%
  • User Impact: ~100 orders in inconsistent state, manual intervention required

How We Detected It:

  • Alert fired when order processing time exceeded threshold
  • Dashboard showed orders stuck in "processing" state
  • Service health checks showed inventory service down

Monitoring Gaps:

  • No alert for data consistency issues
  • No alert for service dependency failures
  • No monitoring of distributed transaction state

Root Cause Analysis

Primary Cause: Service dependencies and data consistency issues in distributed system.

What Happened:

  1. Order service created order and called payment service
  2. Payment service processed payment successfully
  3. Payment service called inventory service to reserve items
  4. Inventory service was down (deployment issue)
  5. Inventory reservation failed, but payment already processed
  6. Order stuck in "processing" state
  7. No rollback mechanism for distributed transactions
  8. Data inconsistency: payment processed, inventory not reserved
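
Reduced to code, the flow looked roughly like the sketch below: a chain of synchronous HTTP calls with no compensation path when a downstream call fails. The service URLs and payloads are illustrative, not our real endpoints.

// Hypothetical first cut of the microservices order flow: blocking HTTP calls
// chained one after another, with no way to undo the earlier steps.
async function createOrder(order: { id: string; itemId: string; qty: number; amountCents: number }): Promise<void> {
  await fetch("http://order-service/orders", {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ ...order, status: "processing" }),
  });

  // Step 2: the card is charged immediately...
  const payment = await fetch("http://payment-service/charges", {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ orderId: order.id, amountCents: order.amountCents }),
  });
  if (!payment.ok) throw new Error("payment failed");

  // Step 3: ...but if inventory-service is down this call throws or times out,
  // the function exits, the charge is never refunded, and the order row is
  // left in "processing" forever.
  const reservation = await fetch("http://inventory-service/reservations", {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ orderId: order.id, itemId: order.itemId, qty: order.qty }),
  });
  if (!reservation.ok) throw new Error("inventory reservation failed");

  await fetch(`http://order-service/orders/${order.id}/confirm`, { method: "POST" });
}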

Why It Was So Bad:

  • Synchronous dependencies: Services called each other synchronously
  • No transaction management: No distributed transaction coordinator
  • No circuit breakers: Services kept calling failing dependencies
  • No rollback mechanism: Couldn't undo partial operations
  • Tight coupling: Services still tightly coupled despite separation
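
The missing circuit breaker is the easiest of these to picture. Below is a minimal, hand-rolled sketch (in Node.js you would more likely reach for a library such as opossum) that stops calling a dependency after repeated failures instead of letting requests pile up behind a dead service.

// Minimal circuit breaker sketch: after `threshold` consecutive failures,
// calls fail fast for `cooldownMs` instead of hammering a dead dependency.
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(private threshold = 5, private cooldownMs = 30_000) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.failures >= this.threshold) {
      if (Date.now() - this.openedAt < this.cooldownMs) {
        throw new Error("circuit open: dependency marked unhealthy");
      }
      this.failures = this.threshold - 1; // half-open: allow one probe request
    }
    try {
      const result = await fn();
      this.failures = 0; // success closes the circuit
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.threshold) this.openedAt = Date.now();
      throw err;
    }
  }
}

// Usage: wrap the inventory call so a dead inventory-service fails fast
// instead of stalling every order.
const inventoryBreaker = new CircuitBreaker();
const reserve = (orderId: string, itemId: string, qty: number) =>
  inventoryBreaker.call(() =>
    fetch("http://inventory-service/reservations", {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify({ orderId, itemId, qty }),
    })
  );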

Contributing Factors:

  • Migrated all services at once (big bang migration)
  • No gradual migration strategy
  • No distributed transaction management
  • Services still had synchronous dependencies
  • No fallback or compensation mechanisms

Fix & Mitigation

Immediate Fix:

  1. Rolled back to monolith: Restored previous architecture
  2. Fixed inconsistent orders: Manually resolved 100 stuck orders
  3. Restored service dependencies: Brought all services back online

Long-Term Improvements:

  1. Gradual Migration Strategy:

    • Migrated one service at a time (strangler pattern)
    • Kept monolith running alongside microservices
    • Gradually migrated traffic to microservices
    • Decommissioned monolith only after all services stable
  2. Event-Driven Architecture:

    • Switched from synchronous to asynchronous communication
    • Used message queue for service communication
    • Implemented event sourcing for order state
    • Added compensation mechanisms for failed operations
  3. Data Consistency:

    • Implemented saga pattern for distributed transactions (see the sketch after this list)
    • Added idempotency keys for operations
    • Added eventual consistency with reconciliation
    • Implemented two-phase commit for critical operations
  4. Process Improvements:

    • Added gradual migration to deployment process
    • Added service dependency monitoring
    • Created runbook for migration incidents
    • Added rollback procedures for each service
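
To make the saga and compensation ideas concrete, here is a heavily simplified sketch of an orchestrated order saga. The step names, endpoints, and helper are illustrative; in the real system the coordinator sat behind the message queue, and every operation carried an idempotency key (the order ID here, for illustration) so redelivered messages were safe to replay.

// Sketch of a saga coordinator for order placement (illustrative only).
// Each step has a compensating action; if a step fails, the coordinator
// runs the compensations for every step that already succeeded, newest first.
type SagaStep = {
  name: string;
  execute: () => Promise<void>;
  compensate: () => Promise<void>;
};

async function runSaga(steps: SagaStep[]): Promise<void> {
  const completed: SagaStep[] = [];
  for (const step of steps) {
    try {
      await step.execute();
      completed.push(step);
    } catch (err) {
      for (const done of completed.reverse()) {
        // A failed compensation is parked for retry / manual reconciliation.
        await done.compensate().catch(() => {});
      }
      throw new Error(`saga failed at step "${step.name}": ${String(err)}`);
    }
  }
}

// Hypothetical order saga. The idempotency key lets each service safely
// ignore duplicate commands if a message is delivered more than once.
async function placeOrderSaga(order: { id: string; itemId: string; qty: number; amountCents: number }) {
  const idempotencyKey = order.id;
  await runSaga([
    {
      name: "reserve-inventory",
      execute: () => postJson("http://inventory-service/reservations", { idempotencyKey, ...order }),
      compensate: () => postJson("http://inventory-service/releases", { idempotencyKey }),
    },
    {
      name: "capture-payment",
      execute: () => postJson("http://payment-service/charges", { idempotencyKey, amountCents: order.amountCents }),
      compensate: () => postJson("http://payment-service/refunds", { idempotencyKey }),
    },
    {
      name: "confirm-order",
      execute: () => postJson(`http://order-service/orders/${order.id}/confirm`, { idempotencyKey }),
      compensate: () => postJson(`http://order-service/orders/${order.id}/cancel`, { idempotencyKey }),
    },
  ]);
}

async function postJson(url: string, body: unknown): Promise<void> {
  const res = await fetch(url, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify(body),
  });
  if (!res.ok) throw new Error(`${url} responded ${res.status}`);
}

Note the step order: inventory is reserved before payment is captured, so the most common failure (out of stock) never requires a refund, and the compensations only have to cover the rarer cases.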

Architecture After Fix

graph TB
    Client[Client] --> Gateway[API Gateway]
    Gateway --> Order[Order Service]
    Gateway --> Payment[Payment Service]
    Gateway --> Inventory[Inventory Service]
    Order --> Queue[Message Queue<br/>Event-Driven]
    Payment --> Queue
    Inventory --> Queue
    Queue --> Saga[Saga Coordinator]
    Order --> OrderDB[(Order DB)]
    Payment --> PaymentDB[(Payment DB)]
    Inventory --> InventoryDB[(Inventory DB)]

Key Changes:

  • Event-driven architecture (async communication)
  • Saga pattern for distributed transactions
  • Gradual migration (strangler pattern)
  • Service-specific databases

Key Lessons

  1. Migrate gradually: Don't migrate all services at once. Use strangler pattern—migrate one service at a time, keep monolith running.

  2. Use event-driven architecture: Synchronous service calls create tight coupling. Use message queues for loose coupling.

  3. Handle distributed transactions: Use saga pattern or two-phase commit for data consistency across services.

  4. Design for failure: Services will fail. Design compensation mechanisms and fallbacks.

  5. Monitor service dependencies: Track service health and dependencies. Alert when dependencies fail.
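
On the last point, dependency monitoring does not have to be elaborate to be useful. Here is a minimal sketch of a poller that hits each dependency's health endpoint and raises an alert on failure; the /health paths, timeout, and alert hook are assumptions, and a real setup would feed this into your metrics and paging stack.

// Minimal dependency health poller (illustrative).
const dependencies = [
  { name: "payment-service", url: "http://payment-service/health" },
  { name: "inventory-service", url: "http://inventory-service/health" },
  { name: "shipping-service", url: "http://shipping-service/health" },
];

async function checkDependencies(alert: (msg: string) => void): Promise<void> {
  for (const dep of dependencies) {
    try {
      const res = await fetch(dep.url, { signal: AbortSignal.timeout(2_000) });
      if (!res.ok) alert(`${dep.name} unhealthy: HTTP ${res.status}`);
    } catch {
      alert(`${dep.name} unreachable`);
    }
  }
}

// Poll every 15 seconds; the alert callback would page on-call in practice.
setInterval(() => checkDependencies((msg) => console.error("[ALERT]", msg)), 15_000);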


Interview Takeaways

Common Questions:

  • "How do you migrate from monolith to microservices?"
  • "How do you handle data consistency in microservices?"
  • "What is the strangler pattern?"

What Interviewers Are Looking For:

  • Understanding of migration strategies
  • Knowledge of distributed system patterns
  • Experience with service boundaries
  • Awareness of data consistency challenges

What a Senior Engineer Would Do Differently

From the Start:

  1. Migrate gradually: Use strangler pattern, migrate one service at a time
  2. Use event-driven architecture: Async communication reduces coupling
  3. Implement saga pattern: Handle distributed transactions properly
  4. Design for failure: Add compensation and fallback mechanisms
  5. Monitor dependencies: Track service health and dependencies

The Real Lesson: Migrations are harder than building from scratch. Migrate gradually, use proven patterns, and always have a rollback plan.


FAQs

Q: How do you migrate from monolith to microservices?

A: Use the strangler pattern: migrate one service at a time, keep monolith running, gradually migrate traffic, and decommission monolith only after all services are stable.
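
To picture the traffic side of that answer: a thin routing layer sends only the extracted paths to the new service, and only a small percentage of them at first, while everything else continues to hit the monolith. This sketch uses Express with http-proxy-middleware purely as an illustration; the hosts and rollout percentage are made up.

// Strangler-style routing sketch (illustrative hosts and percentages).
import express from "express";
import { createProxyMiddleware } from "http-proxy-middleware";

const MONOLITH = "http://monolith.internal:3000";
const ORDER_SERVICE = "http://order-service.internal:3000";
const ROLLOUT_PERCENT = 10; // start small, raise the dial as confidence grows

const app = express();

app.use(
  createProxyMiddleware({
    changeOrigin: true,
    target: MONOLITH, // default destination
    router: (req) => {
      // Only /orders traffic is a candidate for the new service,
      // and only ROLLOUT_PERCENT of it actually goes there.
      const isOrderRoute = req.url?.startsWith("/orders") ?? false;
      const inRollout = Math.random() * 100 < ROLLOUT_PERCENT;
      return isOrderRoute && inRollout ? ORDER_SERVICE : MONOLITH;
    },
  })
);

app.listen(8080, () => console.log("strangler gateway listening on :8080"));

In practice the rollout decision should key on something stable (a hash of the user ID, say) rather than per-request randomness, so a given user does not bounce between the monolith and the new service mid-flow.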

Q: How do you handle data consistency in microservices?

A: Use saga pattern for distributed transactions, implement eventual consistency with reconciliation, or use two-phase commit for critical operations. Accept eventual consistency where possible.

Q: What is the strangler pattern?

A: The strangler pattern is a migration strategy where you gradually replace a monolith by building new services alongside it, migrating functionality incrementally, and eventually decommissioning the monolith.

Q: Should you use synchronous or asynchronous communication between services?

A: Prefer asynchronous (event-driven) for loose coupling and resilience. Use synchronous only when you need immediate response and can handle failures.

Q: How do you handle distributed transactions?

A: Use saga pattern (compensating transactions), two-phase commit (for strong consistency), or eventual consistency (for high availability). Choose based on your consistency requirements.

Q: What are common migration pitfalls?

A: Migrating all services at once, tight coupling between services, data consistency issues, service dependencies, and lack of rollback mechanisms.

Q: How do you test microservices migrations?

A: Test in staging with production-like data, use canary deployments, monitor service health, test failure scenarios, and have rollback procedures ready.
