
Real Engineering Stories

How to Migrate From a Monolith to Microservices (Step by Step)

A practical, order-of-operations playbook from a senior architect: when not to split, how to find boundaries, strangler routing, data extraction, events, testing, and decommissioning—so migration improves delivery instead of replacing one pain with distributed chaos.

Advanced · 52 min read

"We need to move to microservices." I have heard that sentence justify everything from a careful evolution to a multi-quarter outage generator. Microservices are not an achievement badge; they are an operating model: more deployable units, more networks, more partial failure, more data fragmentation—in exchange for team-scale autonomy and independent scaling when boundaries are real.

This guide is the step-by-step migration path I wish more teams had written down before the Jira epic closed. It pairs well with the story The Monolith to Microservices Migration That Almost Failed, which shows what happens when you skip the sequence.


Step 0 — Validate the goal (seriously)

Ask: What problem are we solving?

If the pain is… → Often the better first move is…

  • Slow releases because code is tangled → Modular monolith, clearer modules, CI/CD
  • One hot endpoint needs scale → Extract that slice only, or scale the monolith horizontally if feasible
  • Team conflicts on one codebase → Ownership boundaries in code, then align repos
  • "Industry best practice" → Stop; measure the cost of coordination you are willing to pay

If a modular monolith with strict module APIs and independent test suites fixes 70% of the pain, you may never need twenty services—or you will be ready to split along seams that already exist.

Exit criterion for Step 0: written goals (e.g. deploy frequency, blast radius, scaling target) and non-goals (e.g. "not building a platform team empire in Q1").


Step 1 — Map the system as it runs today

Goal: Base every later decision on how the system actually behaves, not on the diagram from three years ago.

What to produce

  1. Runtime map (who calls what)

    • Synchronous: HTTP/RPC paths from clients → monolith → third parties; note depth (e.g. “handler calls 4 internal services and 2 externals”).
    • Asynchronous: message topics and queues—producers, consumers, retry and DLQ behavior.
    • Schedulers: cron, Airflow-style jobs, nightly batch—what they touch and in what order.
    • Failure behavior: timeouts, circuit breakers, fallbacks (even if informal).
  2. Data map (who owns what row)

    • List tables (or key collections) with primary writers and main readers (app modules, not people).
    • Mark large or fast-growing tables and hot keys (partition risk later).
    • Draw foreign keys or logical dependencies across areas (e.g. order → payment → inventory)—these are migration seams or traps.
  3. User and operator journeys

    • Trace checkout, signup, refunds, admin flows across modules and data.
    • Note where UX is synchronous (user waits) vs background (email, reporting).
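One way to bootstrap the synchronous part of the runtime map is to mine structured logs by correlation id. A minimal sketch, assuming your logs already carry caller/callee fields (the record shape and component names here are illustrative, not a real schema):

```python
from collections import defaultdict

# Hypothetical structured log records: each has a correlation id,
# the calling component, and the callee it invoked.
log_records = [
    {"request_id": "r1", "caller": "gateway", "callee": "monolith:/checkout"},
    {"request_id": "r1", "caller": "monolith:/checkout", "callee": "payments-api"},
    {"request_id": "r1", "caller": "monolith:/checkout", "callee": "inventory-db"},
    {"request_id": "r2", "caller": "gateway", "callee": "monolith:/search"},
]

def build_runtime_map(records):
    """Aggregate call edges so you can answer 'who calls what'."""
    edges = defaultdict(set)
    for rec in records:
        edges[rec["caller"]].add(rec["callee"])
    return {caller: sorted(callees) for caller, callees in edges.items()}

runtime_map = build_runtime_map(log_records)
# runtime_map["monolith:/checkout"] now lists that handler's downstream deps
```

The same grouping by `request_id` also gives you call depth per request, which feeds the "handler calls 4 internal services" notes above.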

How deep to go
Enough that an engineer can answer: “If we split X out, what still breaks if X is down for five minutes?”

Common mistakes

  • Mapping only services and skipping jobs and batch—the first outage is often “we forgot the nightly job still wrote that table.”
  • Treating the shared DB as neutral ground; every cross-table join is coupling evidence.

Exit criterion: diagrams and notes are reviewed by on-call and staff; no major path is “unknown.”


Step 2 — Define bounded contexts (domains), not "layers"

Goal: Name cohesive business areas so each future service has a clear reason to exist and language (same words mean the same thing inside the boundary).

What a bounded context is (practically)
A bounded context is where one ubiquitous language and one set of rules apply—for example Catalog vs Cart vs Checkout vs Billing. The word “Order” might mean something slightly different in fulfillment than in payments; that tension is a boundary signal.

Good extraction units (vertical slices)

  • Move together when product changes (“we’re changing how pricing works”).
  • Center on an aggregate root (one entity that guards invariants—e.g. a Cart or Shipment).
  • Prefer few synchronous calls into other contexts; many events out is fine.

Red flags for a bad first slice

  • Horizontal cuts: “all repositories” or “all DTOs”—no product meaning.
  • "Utility soup" (shared helpers with no domain meaning) as the first extraction, unless it truly blocks everyone.
  • Hardest workflow (full checkout) because executives asked—without Steps 3–4 in place.

Concrete outputs

  • A context map (boxes and arrows: upstream/downstream, partners, anti-corruption layers where models differ).
  • A ranked list of candidate slices with risk, value, coupling score.
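The ranked list can be as simple as a weighted score per candidate. A toy sketch, where the 1–5 scales, weights, and slice names are all illustrative assumptions to be tuned against your own context map:

```python
# Candidate slices scored on value (higher is better) and on risk and
# coupling (lower is better). All numbers here are made-up examples.
candidates = [
    {"name": "notifications", "value": 4, "risk": 1, "coupling": 1},
    {"name": "checkout",      "value": 5, "risk": 5, "coupling": 5},
    {"name": "search-index",  "value": 3, "risk": 2, "coupling": 2},
]

def slice_score(c, w_value=1.0, w_risk=1.0, w_coupling=1.0):
    """Higher score = better first slice: reward value, penalize risk and coupling."""
    return w_value * c["value"] - w_risk * c["risk"] - w_coupling * c["coupling"]

ranked = sorted(candidates, key=slice_score, reverse=True)
```

The point is not precision; it is forcing the trade-off (value vs risk vs coupling) into one explicit, arguable artifact.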

Exit criterion: you can defend slice #1 in one sentence: “We extract X because it wins Y and depends lightly on Z.”


Step 3 — Establish platform prerequisites

Goal: Make two services almost as operable as one well-run monolith—otherwise you multiply chaos.

Minimum viable platform (before extract #2)

  • Identity between services: service accounts, JWT validation (or mTLS policy), who may call what—documented in a small internal spec.
  • Observability: structured logs (request/correlation id), RED/USE-style metrics per entrypoint, distributed tracing across gateway → monolith → new service.
  • Deploy: blue/green or canary, one-click rollback, secrets outside git.
  • API discipline: versioning story, default timeouts, retry only where idempotent, idempotency keys on writes that hit money or inventory.

Why this step is non-optional
Without a shared observability and deploy language, every new service becomes a snowflake, incidents become whodunit across logs, and migrations pause while you build tooling under pressure.

Exit criterion: first extracted service has dashboards, alerts, SLO drafts, and runbook skeleton before it owns revenue-critical path.


Step 4 — Insert the strangler facade (pass-through first)

Goal: Get a stable front door with no behavior change—prove the hop is safe before you route anything exotic.

What you do

  • Deploy API gateway, reverse proxy, or edge router in front of the monolith so all external traffic hits it first.
  • Configure transparent forward to existing upstream (same paths, same host routing rules as today).
  • Move cross-cutting concerns here if helpful: TLS termination, WAF, coarse rate limits, request ID injection.
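The routing logic at this stage should be trivially auditable: everything forwards to the monolith. A minimal route-resolution sketch (upstream names are placeholders), showing why the pass-through phase carries no behavior change:

```python
# At pass-through time the override table is empty; per-slice entries
# are only added later, during Step 8 cutover.
MONOLITH = "http://monolith.internal"   # placeholder upstream
route_overrides = {}                    # e.g. {"/notifications": "http://notif-svc"}

def resolve_upstream(path: str) -> str:
    """Longest-prefix override wins; default is a transparent forward."""
    for prefix in sorted(route_overrides, key=len, reverse=True):
        if path.startswith(prefix):
            return route_overrides[prefix]
    return MONOLITH  # same paths, same behavior as today

assert resolve_upstream("/checkout") == MONOLITH
assert resolve_upstream("/anything/else") == MONOLITH
```

Keeping the override table empty for the go-live release is exactly the "no routing changes in the same release" discipline below.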

What to measure

  • p95/p99 latency versus pre-gateway (expect small overhead; catch keep-alive and connection pool issues).
  • Error rates unchanged; saturation of gateway and monolith acceptable.

Reading: What Is the Strangler Pattern?.

Common mistakes

  • Changing routing rules or auth in the same release as “gateway go-live”—if something breaks, you cannot isolate the cause.
  • Skipping load or soak test on the new path.

Exit criterion: 100% of production traffic flows through the facade for a week or more with parity; emergency bypass documented.


Step 5 — Pick the first slice (small, valuable, loosely coupled)

Goal: Win a credibility milestone: one service that proves the platform and reduces real pain, without betting the company.

Ideal first slice checklist

  • Value: removes a bottleneck, improves scale, or gives a team autonomy on something they fight over.
  • Coupling: low fan-out (not “this handler calls 15 downstreams synchronously”).
  • Data: a cluster of tables with one obvious primary writer in code today.
  • Blast radius: failure or rollback should not brick checkout for everyone if you can avoid it.

Examples that usually work well

  • Notifications (email/SMS/push), search/index jobs, file/PDF generation, feature flags, recommendation reads—often leaf or read-heavy.

What to avoid as slice #1 (usually)

  • Core authorize-and-capture payment path without mature platform (unless you truly must).
  • Anything that requires perfect dual-write on day one across payments and inventory.

Outputs

  • One-pager: scope, out of scope, success metrics, rollback (route flag back to monolith).

Exit criterion: engineering + product sign the one-pager; on-call knows who owns the new service.


Step 6 — Build the new service; keep the monolith authoritative initially

Goal: Ship running code behind the facade with clear authority: the monolith (or existing DB) still owns truth until you say otherwise.

Typical technical shape

  • Start read-only or async: new service serves reads from a replica/projection, or consumes events/queues, before it becomes the only writer.
  • Monolith stays canonical for writes to shared tables until parity + cutover plan exist.
  • Feature flags in the gateway or app: route internal testers, then canary cohorts.

Database stance

  • Temporary shared database only with written rules: which module may INSERT/UPDATE which tables.
  • Avoid “new service writes whatever it wants” to shared tables—that is distributed monolith logic with two deployables.

Operational stance

  • Same logging, metrics, tracing standards as Step 3; deploy independent of monolith where possible.

Exit criterion: new service deploys to prod, handles flagged traffic or workload, no conflicting writes to shared data without a documented migration plan.


Step 7 — Shadow traffic and prove parity

Goal: Show the new path is semantically equivalent (or intentionally different with sign-off), not “mostly the same.”

Shadow / dark launch

  • Duplicate production requests to the new implementation without returning its response to users (or return it only in sandbox).
  • Diff responses: allow tolerances for non-deterministic fields (timestamps, IDs) via schema or rules.
  • Store diff rate, examples, and top mismatch causes.
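The diff-with-tolerances step can be sketched directly. A minimal example, assuming flat JSON-style responses; the field names and volatile-field list are illustrative:

```python
# Fields that legitimately differ between the legacy and shadow responses.
VOLATILE_FIELDS = {"timestamp", "request_id", "trace_id"}

def normalize(response: dict) -> dict:
    return {k: v for k, v in response.items() if k not in VOLATILE_FIELDS}

def diff(legacy: dict, shadow: dict) -> set:
    """Return the keys whose normalized values disagree."""
    a, b = normalize(legacy), normalize(shadow)
    return {k for k in a.keys() | b.keys() if a.get(k) != b.get(k)}

legacy = {"total": 4200, "currency": "USD", "timestamp": "2024-01-01T00:00:00Z"}
shadow = {"total": 4200, "currency": "USD", "timestamp": "2024-01-01T00:00:03Z"}
assert diff(legacy, shadow) == set()          # volatile field ignored: parity

shadow_bad = {"total": 4199, "currency": "USD", "timestamp": "..."}
assert diff(legacy, shadow_bad) == {"total"}  # a real mismatch surfaces
```

Aggregating these diff sets over time gives you the diff rate and top mismatch causes called for above.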

Contracts

  • Consumer-driven contracts or schema tests so a breaking change fails CI before prod.
  • Golden fixtures from production-like data (sanitized).

Human gate

  • Product/support review when behavior is meant to change (e.g. rounding, sort order).

Exit criterion: agreed diff threshold met for a sustained window (e.g. two weeks); exceptions documented; runbook says disable new path in one step.


Step 8 — Gradual cutover and explicit rollback

Goal: Move real users in small steps; always know how to go back.

Traffic ramp (example only—tune by risk)

  • 0% → internal → 1% → 5% → 25% → 100% for that route or tenant.
  • Some teams use per-region or per-segment gates instead of pure percentage.
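A hash-based cohort keeps the ramp deterministic: the same user always lands in the same bucket, so raising the percentage only ever adds users to the new path, and setting it to zero is the rollback lever. A minimal sketch (the single global flag is a simplifying assumption; real systems read this from a flag service):

```python
import hashlib

RAMP_PERCENT = 5   # the rollback lever: set to 0 to send everyone back

def bucket(user_id: str) -> int:
    """Stable 0-99 bucket derived from a hash of the user id."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return int(digest, 16) % 100

def use_new_path(user_id: str) -> bool:
    return bucket(user_id) < RAMP_PERCENT
```

Per-region or per-segment gates are the same idea with the bucket keyed on region or tenant instead of user id.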

What to watch

  1. Golden signals: latency, errors, saturation, throttling.
  2. Business metrics: conversion, payments completed, refunds, support tickets with specific tags.
  3. Error budget: pause ramp if burn is unhealthy.

Rollback

  • Single lever: flag or route table flips traffic back to monolith path for that slice only.
  • No “revert five repos” as the only story.

Common mistakes

  • Watching only 500s while wrong business outcomes return 200.
  • No comms plan—support learns from angry customers first.

Exit criterion: slice at 100% on new path with SLOs stable over your chosen bake period; rollback tested at least once in staging or a drill.


Step 9 — Untangle data (the longest thread)

Goal: Align storage ownership with service ownership so teams can evolve schema and deploy without silent coupling.

Typical sequence

  1. Single writer per table (transition rules in writing—even if still one physical DB).
  2. Schema separation on shared infra (Postgres schemas, separate MySQL DBs on same host) as a stepping stone.
  3. Outbox or CDC so other systems see changes reliably without “RPC then maybe publish.”
  4. Migrate data: backfill, verify counts/checksums, dual-write only with reconciliation jobs and an exit date.
  5. Flip primary write to new store; freeze legacy writes; archive and later drop old tables per retention policy.
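The outbox step (3) is worth seeing in miniature: the business write and the event row commit in one local transaction, so a crash can never record an order without its event or vice versa. A sketch using in-memory SQLite; the schema and event shape are illustrative:

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT)")
db.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY, payload TEXT, published INTEGER DEFAULT 0)")

def place_order(order_id):
    with db:  # one transaction: both rows commit, or neither does
        db.execute("INSERT INTO orders VALUES (?, ?)", (order_id, "placed"))
        db.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (json.dumps({"type": "order_placed", "order_id": order_id}),),
        )

def relay_unpublished():
    """A separate poller (or CDC on the outbox table) publishes to the broker."""
    rows = db.execute("SELECT id, payload FROM outbox WHERE published = 0").fetchall()
    for row_id, _payload in rows:
        # broker.publish(_payload) would go here (Kafka, SQS, etc.)
        db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    db.commit()
    return [json.loads(p) for _, p in rows]

place_order("o-1")
events = relay_unpublished()
```

Contrast this with "RPC then maybe publish": there, the commit and the publish are two operations that can fail independently, which is exactly the drift the outbox eliminates.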

Reality checks

  • Data migration often takes longer than code migration—plan quarters, not sprints, for large domains.
  • Dual-write without reconciliation is how silent drift starts.

Reading: How Do You Handle Data Consistency in Microservices?.

Exit criterion: for the slice, one primary datastore owner; no undocumented writers; reconciliation green.


Step 10 — Replace synchronous chains with workflows

Goal: Stop encoding multi-step business processes as deep synchronous RPC chains—timeouts and partial failure will eat you.

What to introduce

  • Events and/or queues for handoffs: “payment captured” → “reserve inventory” → “confirm order.”
  • Sagas (choreography or orchestration) with timeouts, retries, idempotency, and compensation (refund, release reservation).
  • Clear workflow states visible in logs/metrics (stuck orders are a design surface).
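An orchestrated saga in miniature: each step pairs with a compensation, and a failure part-way triggers compensations in reverse order. The step names and in-memory "ledger" are illustrative assumptions, not a real payments API:

```python
ledger = []  # stand-in for the workflow-state log you should be emitting

def capture_payment(order):   ledger.append("payment_captured")
def refund_payment(order):    ledger.append("payment_refunded")

def reserve_inventory(order):
    if order.get("out_of_stock"):
        raise RuntimeError("no stock")
    ledger.append("inventory_reserved")
def release_inventory(order): ledger.append("inventory_released")

def confirm_order(order):     ledger.append("order_confirmed")

SAGA = [
    (capture_payment, refund_payment),
    (reserve_inventory, release_inventory),
    (confirm_order, None),  # final step needs no compensation here
]

def run_saga(order):
    done = []
    try:
        for step, compensate in SAGA:
            step(order)
            done.append(compensate)
        return "completed"
    except Exception:
        for compensate in reversed(done):  # undo completed steps, newest first
            if compensate:
                compensate(order)
        return "compensated"

assert run_saga({"out_of_stock": True}) == "compensated"
assert ledger == ["payment_captured", "payment_refunded"]
```

Production versions add retries with idempotency on each step and persist the saga state, so a crashed orchestrator can resume or compensate rather than leaving an order stuck half-done.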

Rule of thumb
If the user request thread waits on three network hops that each commit business state, you have built a latency stack and ambiguous failure (“did the second hop commit?”).

Exit criterion: critical journeys have documented async flow; no Sevs from “RPC timeout mid-workflow” without recovery path; dashboards show state counts and age.


Step 11 — Scale the organizational side (Conway's law)

Goal: Align team ownership with the new service boundaries—if the architecture splits and ownership does not, you get dysfunctional interfaces and blame between teams.

Align deliberately

  • Team ↔ service ownership: on-call roster, roadmap accountability, SLOs per externally visible capability.
  • Interfaces: SLAs between teams (not just “best effort REST”).
  • Communication: RFC-lite for breaking API changes; deprecation windows.

Healthy signals

  • Teams deploy without daily synchronized releases unless the business truly requires it.
  • Post-incident reviews improve contracts, not just “more monitoring.”

Unhealthy signals

  • Every change needs five approvals from five teams—your boundaries may be wrong or APIs are too chatty.

Exit criterion: for each service, named owner, on-call, and documented dependencies; no “everyone owns the monolith module” ambiguity for migrated slices.


Step 12 — Decommission deliberately

Goal: Finish the migration: dead code and dead tables cost money and confuse everyone.

Checklist per retired capability

  1. Traffic: gateway and tracing show zero production requests to legacy path for this slice.
  2. Data: no application writes to legacy tables; batch jobs migrated or cancelled.
  3. Code: remove feature flags and delete monolith modules (not comment-out).
  4. Ops: archive or delete dashboards, alerts, runbooks that only applied to old path.
  5. Docs: internal wiki points to new system only.

Data retirement

  • Confirm retention/compliance before DROP; often archive to cold storage first.

Celebration metric
Measure deletions (lines, tables, flags)—that is proof you reduced complexity, not just added services.

Exit criterion: PR merges that remove legacy; postmortem-style note: “This slice is done; here is what we learned.”


Summary checklist (printable)

  1. Validate goal; consider modular monolith.
  2. Map runtime + data + journeys.
  3. Draw bounded contexts; pick vertical slice #1.
  4. Platform: observability, deploy, API standards.
  5. Strangler pass-through.
  6. Build new path; monolith authoritative until proven.
  7. Shadow/compare; contract tests.
  8. Canary cutover; rollback lever.
  9. Migrate data with outbox/CDC/reconciliation.
  10. Sagas/async for cross-domain workflows.
  11. Align teams and on-call.
  12. Delete legacy paths and data.

Interview answer (compressed)

"I wouldn't start by splitting repos. I'd validate the problem, strengthen the monolith into modules, put a strangler in front, extract one vertical slice with shadow traffic and parity checks, migrate data with clear ownership and outbox or CDC, replace sync chains with sagas for multi-step flows, and only decommission the monolith path when metrics and tracing prove no traffic uses it—with rollback at every step."


FAQs

Q: How long should a migration take?

A: Honest answer: quarters to years for large systems, because data and behavior parity dominate. If someone promises "six weeks total," ask what corners are being cut.

Q: Should every service have its own database from day one?

A: Per-service datastore is a target for clear boundaries. Many teams start with separate schemas or read replicas and move to full separation as ownership stabilizes. The anti-pattern is multiple services writing the same tables ad hoc.

Q: What is the safest first service to extract?

A: Usually a leaf capability: notifications, file export, search indexing—high value, fewer synchronous dependencies. Avoid the core transaction spine until platform and patterns are proven.

Q: How do I convince leadership to avoid big-bang?

A: Show blast radius: one bad deploy taking down everything vs a canary on one slice. Pair with incident examples (including the migration story on this site) and cost of rework from inconsistent data.

Q: Do microservices require Kubernetes?

A: No. You need orchestrated deploys and observability. K8s is common, not mandatory. A small number of well-owned services on VMs or PaaS can be healthier than a fragile cluster nobody understands.

Keep exploring

Real engineering stories work best when combined with practice. Explore more stories or apply what you've learned in our system design practice platform.