Real Engineering Stories

The Monolith to Microservices Migration That Almost Failed

A production incident during a monolith-to-microservices migration where service dependencies and data consistency issues caused cascading failures. Learn about migration strategies, service boundaries, and data consistency in distributed systems.

Advanced · 30 min read

This is a story about how we tried to migrate from a monolith to microservices and almost broke everything. It's also about why migrations are harder than building from scratch, and how we learned to migrate incrementally rather than all at once.


Context

We were running a monolithic e-commerce application that handled orders, payments, inventory, and shipping. As we grew, the monolith became hard to scale and deploy. We decided to migrate to microservices.

Original Architecture:

graph TB
    Client[Client] --> Monolith[Monolith<br/>Orders, Payments,<br/>Inventory, Shipping]
    Monolith --> DB[(Single Database)]

Technology Choices:

  • Monolith: Node.js application
  • Database: PostgreSQL (single database for all services)
  • Deployment: Single deployment unit

Assumptions Made:

  • Microservices would be easier to scale
  • Service boundaries were clear
  • Data consistency would be maintained
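
That last assumption was easy to make because, in the monolith, the whole order flow ran inside a single database transaction: either everything committed or everything rolled back. Here is a simplified, hypothetical sketch of that flow using node-postgres; the table names, columns, and payment handling are illustrative, not our actual schema.

import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// One ACID transaction covers the whole flow: either every row commits,
// or everything rolls back and nothing partial persists.
async function placeOrder(userId: string, itemId: string, qty: number, amountCents: number): Promise<string> {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");
    const { rows } = await client.query(
      "INSERT INTO orders (user_id, item_id, qty, status) VALUES ($1, $2, $3, 'pending') RETURNING id",
      [userId, itemId, qty]
    );
    const orderId: string = rows[0].id;
    // Reserve stock; the guard clause fails the whole transaction if there isn't enough.
    const reserved = await client.query(
      "UPDATE inventory SET available = available - $1 WHERE item_id = $2 AND available >= $1",
      [qty, itemId]
    );
    if (reserved.rowCount === 0) throw new Error("insufficient stock");
    // Payment capture against the external provider is simplified away here.
    await client.query(
      "INSERT INTO payments (order_id, amount_cents, status) VALUES ($1, $2, 'captured')",
      [orderId, amountCents]
    );
    await client.query("COMMIT");
    return orderId;
  } catch (err) {
    await client.query("ROLLBACK"); // nothing partial ever persists
    throw err;
  } finally {
    client.release();
  }
}

Once orders, payments, and inventory live in separate services with separate databases, that single BEGIN/COMMIT no longer exists, which is exactly where the trouble started.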

The Incident

Timeline:

  • Week 1: Order service extracted from monolith
  • Week 2: Payment service extracted
  • Week 3: Inventory service extracted
  • Week 4: Shipping service extracted
  • Week 4, Monday 10:00 AM: All services deployed, monolith decommissioned
  • Week 4, Monday 10:15 AM: First order placed through microservices
  • Week 4, Monday 10:16 AM: Order service created order
  • Week 4, Monday 10:17 AM: Payment service processed payment
  • Week 4, Monday 10:18 AM: Inventory service failed to reserve items (service down)
  • Week 4, Monday 10:19 AM: Order stuck in "processing" state
  • Week 4, Monday 10:20 AM: Payment processed but inventory not reserved
  • Week 4, Monday 10:25 AM: Multiple orders stuck, on-call paged
  • Week 4, Monday 11:00 AM: Identified service dependency issues
  • Week 4, Monday 12:00 PM: Rolled back to monolith
  • Week 4, Monday 2:00 PM: Services restored, but 100 orders in inconsistent state

Symptoms

What We Saw:

  • Stuck Orders: Orders in "processing" state, never completing
  • Data Inconsistency: Payments processed but inventory not reserved
  • Service Dependencies: Services failing when dependencies were down
  • Error Rate: Increased from 0.1% to 15%
  • User Impact: ~100 orders in inconsistent state, manual intervention required

How We Detected It:

  • Alert fired when order processing time exceeded threshold
  • Dashboard showed orders stuck in "processing" state
  • Service health checks showed inventory service down

Monitoring Gaps:

  • No alert for data consistency issues
  • No alert for service dependency failures
  • No monitoring of distributed transaction state

Root Cause Analysis

Primary Cause: Service dependencies and data consistency issues in distributed system.

What Happened:

  1. Order service created order and called payment service
  2. Payment service processed payment successfully
  3. Payment service called inventory service to reserve items
  4. Inventory service was down (deployment issue)
  5. Inventory reservation failed, but payment already processed
  6. Order stuck in "processing" state
  7. No rollback mechanism for distributed transactions
  8. Data inconsistency: payment processed, inventory not reserved
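
Reduced to code, the flow looked roughly like the sketch below: a chain of synchronous HTTP calls with no compensation path when a downstream call fails. The service URLs and payloads are illustrative, not our real endpoints.

// Hypothetical first cut of the microservices order flow: blocking HTTP calls
// chained one after another, with no way to undo the earlier steps.
async function createOrder(order: { id: string; itemId: string; qty: number; amountCents: number }): Promise<void> {
  await fetch("http://order-service/orders", {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ ...order, status: "processing" }),
  });

  // Step 2: the card is charged immediately...
  const payment = await fetch("http://payment-service/charges", {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ orderId: order.id, amountCents: order.amountCents }),
  });
  if (!payment.ok) throw new Error("payment failed");

  // Step 3: ...but if inventory-service is down this call throws or times out,
  // the function exits, the charge is never refunded, and the order row is
  // left in "processing" forever.
  const reservation = await fetch("http://inventory-service/reservations", {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ orderId: order.id, itemId: order.itemId, qty: order.qty }),
  });
  if (!reservation.ok) throw new Error("inventory reservation failed");

  await fetch(`http://order-service/orders/${order.id}/confirm`, { method: "POST" });
}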

Why It Was So Bad:

  • Synchronous dependencies: Services called each other synchronously
  • No transaction management: No distributed transaction coordinator
  • No circuit breakers: Services kept calling failing dependencies
  • No rollback mechanism: Couldn't undo partial operations
  • Tight coupling: Services still tightly coupled despite separation
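
The missing circuit breaker is the easiest of these to picture. Below is a minimal, hand-rolled sketch (in Node.js you would more likely reach for a library such as opossum) that stops calling a dependency after repeated failures instead of letting requests pile up behind a dead service.

// Minimal circuit breaker sketch: after `threshold` consecutive failures,
// calls fail fast for `cooldownMs` instead of hammering a dead dependency.
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(private threshold = 5, private cooldownMs = 30_000) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.failures >= this.threshold) {
      if (Date.now() - this.openedAt < this.cooldownMs) {
        throw new Error("circuit open: dependency marked unhealthy");
      }
      this.failures = this.threshold - 1; // half-open: allow one probe request
    }
    try {
      const result = await fn();
      this.failures = 0; // success closes the circuit
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.threshold) this.openedAt = Date.now();
      throw err;
    }
  }
}

// Usage: wrap the inventory call so a dead inventory-service fails fast
// instead of stalling every order.
const inventoryBreaker = new CircuitBreaker();
const reserve = (orderId: string, itemId: string, qty: number) =>
  inventoryBreaker.call(() =>
    fetch("http://inventory-service/reservations", {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify({ orderId, itemId, qty }),
    })
  );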

Contributing Factors:

  • Migrated all services at once (big bang migration)
  • No gradual migration strategy
  • No distributed transaction management
  • Services still had synchronous dependencies
  • No fallback or compensation mechanisms

Fix & Mitigation

Immediate Fix:

  1. Rolled back to monolith: Restored previous architecture
  2. Fixed inconsistent orders: Manually resolved 100 stuck orders
  3. Restored service dependencies: Brought all services back online

Long-Term Improvements:

  1. Gradual Migration Strategy:

    • Migrated one service at a time (strangler pattern)
    • Kept monolith running alongside microservices
    • Gradually migrated traffic to microservices
    • Decommissioned monolith only after all services stable
  2. Event-Driven Architecture:

    • Switched from synchronous to asynchronous communication
    • Used message queue for service communication
    • Implemented event sourcing for order state
    • Added compensation mechanisms for failed operations
  3. Data Consistency:

    • Implemented saga pattern for distributed transactions (see the sketch after this list)
    • Added idempotency keys for operations
    • Added eventual consistency with reconciliation
    • Implemented two-phase commit for critical operations
  4. Process Improvements:

    • Added gradual migration to deployment process
    • Added service dependency monitoring
    • Created runbook for migration incidents
    • Added rollback procedures for each service
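
To make the saga and compensation ideas concrete, here is a heavily simplified sketch of an orchestrated order saga. The step names, endpoints, and helper are illustrative; in the real system the coordinator sat behind the message queue, and every operation carried an idempotency key (the order ID here, for illustration) so redelivered messages were safe to replay.

// Sketch of a saga coordinator for order placement (illustrative only).
// Each step has a compensating action; if a step fails, the coordinator
// runs the compensations for every step that already succeeded, newest first.
type SagaStep = {
  name: string;
  execute: () => Promise<void>;
  compensate: () => Promise<void>;
};

async function runSaga(steps: SagaStep[]): Promise<void> {
  const completed: SagaStep[] = [];
  for (const step of steps) {
    try {
      await step.execute();
      completed.push(step);
    } catch (err) {
      for (const done of completed.reverse()) {
        // A failed compensation is parked for retry / manual reconciliation.
        await done.compensate().catch(() => {});
      }
      throw new Error(`saga failed at step "${step.name}": ${String(err)}`);
    }
  }
}

// Hypothetical order saga. The idempotency key lets each service safely
// ignore duplicate commands if a message is delivered more than once.
async function placeOrderSaga(order: { id: string; itemId: string; qty: number; amountCents: number }) {
  const idempotencyKey = order.id;
  await runSaga([
    {
      name: "reserve-inventory",
      execute: () => postJson("http://inventory-service/reservations", { idempotencyKey, ...order }),
      compensate: () => postJson("http://inventory-service/releases", { idempotencyKey }),
    },
    {
      name: "capture-payment",
      execute: () => postJson("http://payment-service/charges", { idempotencyKey, amountCents: order.amountCents }),
      compensate: () => postJson("http://payment-service/refunds", { idempotencyKey }),
    },
    {
      name: "confirm-order",
      execute: () => postJson(`http://order-service/orders/${order.id}/confirm`, { idempotencyKey }),
      compensate: () => postJson(`http://order-service/orders/${order.id}/cancel`, { idempotencyKey }),
    },
  ]);
}

async function postJson(url: string, body: unknown): Promise<void> {
  const res = await fetch(url, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify(body),
  });
  if (!res.ok) throw new Error(`${url} responded ${res.status}`);
}

Note the step order: inventory is reserved before payment is captured, so the most common failure (out of stock) never requires a refund, and the compensations only have to cover the rarer cases.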

Architecture After Fix

graph TB
    Client[Client] --> Gateway[API Gateway]
    Gateway --> Order[Order Service]
    Gateway --> Payment[Payment Service]
    Gateway --> Inventory[Inventory Service]
    Order --> Queue[Message Queue<br/>Event-Driven]
    Payment --> Queue
    Inventory --> Queue
    Queue --> Saga[Saga Coordinator]
    Order --> OrderDB[(Order DB)]
    Payment --> PaymentDB[(Payment DB)]
    Inventory --> InventoryDB[(Inventory DB)]

Key Changes:

  • Event-driven architecture (async communication)
  • Saga pattern for distributed transactions
  • Gradual migration (strangler pattern)
  • Service-specific databases

Key Lessons

  1. Migrate gradually: Don't migrate all services at once. Use strangler pattern—migrate one service at a time, keep monolith running.

  2. Use event-driven architecture: Synchronous service calls create tight coupling. Use message queues for loose coupling.

  3. Handle distributed transactions: Use saga pattern or two-phase commit for data consistency across services.

  4. Design for failure: Services will fail. Design compensation mechanisms and fallbacks.

  5. Monitor service dependencies: Track service health and dependencies. Alert when dependencies fail.
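
On the last point, dependency monitoring does not have to be elaborate to be useful. Here is a minimal sketch of a poller that hits each dependency's health endpoint and raises an alert on failure; the /health paths, timeout, and alert hook are assumptions, and a real setup would feed this into your metrics and paging stack.

// Minimal dependency health poller (illustrative).
const dependencies = [
  { name: "payment-service", url: "http://payment-service/health" },
  { name: "inventory-service", url: "http://inventory-service/health" },
  { name: "shipping-service", url: "http://shipping-service/health" },
];

async function checkDependencies(alert: (msg: string) => void): Promise<void> {
  for (const dep of dependencies) {
    try {
      const res = await fetch(dep.url, { signal: AbortSignal.timeout(2_000) });
      if (!res.ok) alert(`${dep.name} unhealthy: HTTP ${res.status}`);
    } catch {
      alert(`${dep.name} unreachable`);
    }
  }
}

// Poll every 15 seconds; the alert callback would page on-call in practice.
setInterval(() => checkDependencies((msg) => console.error("[ALERT]", msg)), 15_000);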


Interview Takeaways

Common Questions:

  • "How do you migrate from monolith to microservices?"
  • "How do you handle data consistency in microservices?"
  • "What is the strangler pattern?"

What Interviewers Are Looking For:

  • Understanding of migration strategies
  • Knowledge of distributed system patterns
  • Experience with service boundaries
  • Awareness of data consistency challenges

What a Senior Engineer Would Do Differently

From the Start:

  1. Migrate gradually: Use strangler pattern, migrate one service at a time
  2. Use event-driven architecture: Async communication reduces coupling
  3. Implement saga pattern: Handle distributed transactions properly
  4. Design for failure: Add compensation and fallback mechanisms
  5. Monitor dependencies: Track service health and dependencies

The Real Lesson: Migrations are harder than building from scratch. Migrate gradually, use proven patterns, and always have a rollback plan.


FAQs

Q: How do you migrate from monolith to microservices?

A: Use the strangler pattern: migrate one service at a time, keep monolith running, gradually migrate traffic, and decommission monolith only after all services are stable.
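
To picture the traffic side of that answer: a thin routing layer sends only the extracted paths to the new service, and only a small percentage of them at first, while everything else continues to hit the monolith. This sketch uses Express with http-proxy-middleware purely as an illustration; the hosts and rollout percentage are made up.

// Strangler-style routing sketch (illustrative hosts and percentages).
import express from "express";
import { createProxyMiddleware } from "http-proxy-middleware";

const MONOLITH = "http://monolith.internal:3000";
const ORDER_SERVICE = "http://order-service.internal:3000";
const ROLLOUT_PERCENT = 10; // start small, raise the dial as confidence grows

const app = express();

app.use(
  createProxyMiddleware({
    changeOrigin: true,
    target: MONOLITH, // default destination
    router: (req) => {
      // Only /orders traffic is a candidate for the new service,
      // and only ROLLOUT_PERCENT of it actually goes there.
      const isOrderRoute = req.url?.startsWith("/orders") ?? false;
      const inRollout = Math.random() * 100 < ROLLOUT_PERCENT;
      return isOrderRoute && inRollout ? ORDER_SERVICE : MONOLITH;
    },
  })
);

app.listen(8080, () => console.log("strangler gateway listening on :8080"));

In practice the rollout decision should key on something stable (a hash of the user ID, say) rather than per-request randomness, so a given user does not bounce between the monolith and the new service mid-flow.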

Q: How do you handle data consistency in microservices?

A: Use saga pattern for distributed transactions, implement eventual consistency with reconciliation, or use two-phase commit for critical operations. Accept eventual consistency where possible.

Q: What is the strangler pattern?

A: The strangler pattern is a migration strategy where you gradually replace a monolith by building new services alongside it, migrating functionality incrementally, and eventually decommissioning the monolith.

Q: Should you use synchronous or asynchronous communication between services?

A: Prefer asynchronous (event-driven) for loose coupling and resilience. Use synchronous only when you need immediate response and can handle failures.

Q: How do you handle distributed transactions?

A: Use saga pattern (compensating transactions), two-phase commit (for strong consistency), or eventual consistency (for high availability). Choose based on your consistency requirements.

Q: What are common migration pitfalls?

A: Migrating all services at once, tight coupling between services, data consistency issues, service dependencies, and lack of rollback mechanisms.

Q: How do you test microservices migrations?

A: Test in staging with production-like data, use canary deployments, monitor service health, test failure scenarios, and have rollback procedures ready.
