
Real Engineering Stories

What Is the Strangler Pattern?

The strangler fig pattern explained for production migrations: how to wrap a legacy monolith, route traffic in slices, evolve data ownership, and retire old code without a big-bang rewrite—plus pitfalls teams hit when they only rename the pattern and not the behavior.

Advanced · 36 min read

The strangler pattern (or strangler fig pattern) comes from Martin Fowler's analogy: a strangler fig grows on a host tree, eventually replacing it. In software, you incrementally build a new system (or services) around the edges of an old one, route traffic through a stable facade, and retire the legacy path piece by piece until the old implementation can be deleted.

After years of migrations, I treat the strangler not as "we added a gateway once" but as a disciplined product: each slice has entry criteria, parity checks, observability, and an exit (delete dead code, delete old tables, stop paying operational tax).


The problem the strangler solves

Big-bang rewrites fail predictably:

  • Requirements drift while you rebuild.
  • The new system is "done" but missing edge cases only the old system knew.
  • Cutover day concentrates all risk: data, traffic, training, and rollback.

The strangler shrinks each release to a bounded slice: one capability, one cohort, one route—so you can learn in production with a fallback still wired.


Core idea in one diagram

The core idea is stability at the edge, flexibility behind it. Clients, partners, or mobile apps keep calling the same hostnames, paths, and API contracts (or explicitly versioned variants). You introduce a single routing layer—the facade—that decides, per request or per feature flag, whether this traffic is still served by the legacy monolith or has been cut over to a new implementation.

That separation is what makes the migration survivable:

  • Rollback is often "flip routing back to legacy," not "redeploy the world."
  • Learning happens in production with bounded blast radius (one slice, one cohort).
  • Governance has an obvious choke point: auth, rate limits, tracing, and routing policy live at the facade.

The strangler is not only HTTP: the same shape applies when you put a router in front of batch jobs, async consumers, or read models—anywhere a single decision point can steer work to old versus new code paths.
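As a sketch of that single decision point, the facade's routing core can be a few lines of code. The names here (`RouteRule`, `route`) are illustrative, not tied to any gateway product:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RouteRule:
    """One routing decision: if `matches` is true, send to `backend`."""
    matches: Callable[[dict], bool]
    backend: str  # "new" or "legacy"

def route(request: dict, rules: list[RouteRule], default: str = "legacy") -> str:
    """First matching rule wins; everything unmatched stays on legacy."""
    for rule in rules:
        if rule.matches(request):
            return rule.backend
    return default

# Example: /search is cut over; everything else still hits the monolith.
rules = [RouteRule(lambda r: r["path"].startswith("/search"), "new")]
print(route({"path": "/search?q=x"}, rules))  # new
print(route({"path": "/checkout"}, rules))    # legacy
```

The important property is that the default is legacy: a slice only moves when a rule explicitly says so, which is what makes rollback a configuration change.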

Structural diagram (who talks to whom)

This diagram is the big picture: all traffic goes through the facade first. From there it goes to either the new system or the legacy one (sometimes both when you are shadow-testing). Little by little you move more traffic down the new path. When a slice is fully migrated, you can drop the old path for that slice.

Figure: structural view. Clients call the facade (gateway), which routes each request to the new backend or the legacy monolith.

How to read it

  • Clients should not need to know whether fulfillment logic still lives in a ten-year-old JAR or a new Go service—they still hit your API.
  • Gateway holds cross-cutting concerns: TLS, authentication, coarse rate limits, request IDs for tracing, and route tables (path prefix, header, flag, or weight-based).
  • New vs legacy are replaceable workers behind the same public contract. Your migration is the process of moving more endpoints, or a larger share of calls, from the legacy box to the new box, then deleting the dead branch.

Facades can be implemented as:

  • API gateway or reverse proxy (path-based or host-based routing, often paired with a service mesh later).
  • Application-level router inside the monolith delegating to extracted modules (early strangler before you have a separate deployable service).
  • Edge or CDN rules for static assets, or a BFF that fans out to old and new subgraphs.

What matters is one place where you can answer: for this request, which implementation is authoritative today, and what do I do when it misbehaves?
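One way to answer the "what do I do when it misbehaves" half is automatic fallback, which is only safe for idempotent reads. A hypothetical sketch (function names are invented):

```python
def call_with_fallback(request, call_new, call_legacy, log):
    """Idempotent reads only: try the new path, serve legacy if it misbehaves.
    For writes, prefer an explicit routing flip over silent fallback, because
    retrying against two implementations can double-apply side effects."""
    try:
        return call_new(request)
    except Exception as exc:  # production code: narrower error types + a timeout
        log(f"new path failed, serving legacy: {exc!r}")
        return call_legacy(request)

logs = []
def flaky_new(request):
    raise TimeoutError("new service slow")

result = call_with_fallback({"q": "x"}, flaky_new, lambda r: "legacy-answer", logs.append)
print(result)  # legacy-answer
```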

Flow diagram (lifecycle of one request)

The first diagram showed who talks to whom. This one shows what happens to a single request after it hits the facade.

Step by step, in plain terms:

  1. A request arrives. The facade looks at the URL, headers, feature flags, or tenant—and decides which rule applies.
  2. Early in the migration that decision is boring: everything still goes to the monolith. You have not cut over anything yet.
  3. Later the rule can send traffic to three places:
    • Legacy — answer from the old system (still the safe default for many slices).
    • New — answer from the new service (after you trust it for real users).
    • Shadow — the user still gets an answer from the path you trust (usually legacy), but in the background you also call the new path and compare the two results in logs or dashboards. The user does not see the experimental answer.

So: one outward-facing response, but during shadow work you may exercise two implementations to prove they match before you flip the switch.


Figure: request enters the facade, then routing sends it to monolith, new service, or shadow compare (new path tested quietly, user still served from monolith).

Shadow mode in one line: customers keep seeing the old behavior while you verify the new behavior matches in the background—then you migrate when you are confident.
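A minimal sketch of that shadow comparison. It is synchronous here for clarity; a real facade would run the shadow call asynchronously with a timeout, and the helper names are invented:

```python
def serve_with_shadow(request, call_legacy, call_new, record_diff):
    """User is always answered from the path you trust (legacy); the new path
    runs as a shadow and mismatches are recorded, never shown to the user."""
    legacy_response = call_legacy(request)
    try:
        new_response = call_new(request)
        if new_response != legacy_response:
            record_diff(request, legacy_response, new_response)
    except Exception as exc:
        # A failing shadow must never fail the user request.
        record_diff(request, legacy_response, repr(exc))
    return legacy_response

diffs = []
resp = serve_with_shadow(
    {"id": 42},
    call_legacy=lambda r: {"total": 100},
    call_new=lambda r: {"total": 99},  # drifted rounding rule in the new path
    record_diff=lambda req, old, new: diffs.append((req, old, new)),
)
print(resp)        # {'total': 100} — the user still sees legacy
print(len(diffs))  # 1 — the mismatch is captured for triage
```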


A concrete strangler lifecycle

The phases below are sequential in intent, but Phase 5 often overlaps Phase 2–4 in real programs: you may start moving read models or owned tables before every HTTP route is cut over. The rule is still the same: one slice, one owner, measurable exit criteria.

Phase 0 — Inventory and boundaries

Goal: Know what you are strangling before you buy a gateway SKU. Migration without a map tends to become "extract random repositories."

Activities

  • Map user journeys end to end (checkout, signup, search, admin)—note where latency-sensitive steps sit.
  • Inventory API surfaces: public REST/GraphQL, webhooks, mobile SDK contracts, internal RPCs that leak into critical paths.
  • Data map: tables and aggregates, who writes today, peak QPS, largest tables, foreign keys crossing domains (coupling alarm).
  • NFRs: RPO/RTO, compliance (PII residency), peak traffic, error budgets.
  • Pick vertical slices (notifications, search, pricing, one checkout step) rather than horizontal layers ("all repositories" or "all DTOs").

Exit criteria

  • Written list of candidate slices ranked by value (revenue, reliability, team autonomy) vs coupling (risk).
  • Single named owner for the first slice (team or staff engineer).
  • Explicit non-goals ("we are not splitting checkout in Q1") to prevent scope creep.

Typical pitfall: Treating org boundaries as architecture boundaries without checking the dependency graph. Conway's law applies, but reality wins when the database says otherwise.
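One crude way to produce the ranked list from the exit criteria: score each slice on team-assigned value and coupling numbers. The 1–5 scale and slice names below are hypothetical:

```python
def rank_slices(slices):
    """Rank candidate slices: high value and low coupling first.
    `value` and `coupling` are team-assigned 1-5 scores (illustrative scale)."""
    return sorted(slices, key=lambda s: (-(s["value"] - s["coupling"]), s["name"]))

candidates = [
    {"name": "checkout", "value": 5, "coupling": 5},  # high value, scary coupling
    {"name": "notifications", "value": 3, "coupling": 1},
    {"name": "search", "value": 4, "coupling": 2},
]
for s in rank_slices(candidates):
    print(s["name"])  # notifications and search beat checkout for slice #1
```

The point is not the formula but the conversation it forces: checkout is the most valuable slice and still loses, because coupling is risk.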

Phase 1 — Insert the facade with no behavior change

Goal: Land the routing layer with zero functional change so you validate ops baseline: latency, TLS, auth plumbing, observability.

Activities

  • Deploy gateway/proxy as transparent pass-through to the monolith (same upstream, same paths).
  • Wire request IDs (correlation) from edge to monolith logs.
  • Move rate limiting and WAF rules to the edge if they were scattered before.
  • Load-test the extra hop; measure p95/p99 delta—often acceptable; sometimes you need regional placement or keep-alive tuning.
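Measuring the extra hop can be sketched as a percentile delta over latency samples. This uses a nearest-rank percentile, a simplification of what a real load-testing tool reports from histograms; the sample values are made up:

```python
def percentile(samples, p):
    """Nearest-rank percentile over latency samples (milliseconds)."""
    ordered = sorted(samples)
    idx = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[idx]

before = [12, 14, 15, 13, 40, 12, 11, 16, 14, 90]  # direct-to-monolith
after  = [13, 15, 16, 14, 42, 13, 12, 17, 15, 95]  # through the facade
for p in (95, 99):
    delta = percentile(after, p) - percentile(before, p)
    print(f"p{p} delta: +{delta} ms")  # +5 ms at both p95 and p99 here
```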

Exit criteria

  • Facade serves 100% of production traffic with no business-logic routing yet.
  • Dashboards show golden signals parity with pre-facade baseline.
  • Runbook: how to bypass the facade in an emergency (break-glass DNS or direct monolith).

Typical pitfall: Skipping this phase and mixing first deploy with first route split—when something breaks, you cannot tell if the gateway, the split, or the new service caused it.

Phase 2 — Implement the new path behind the facade

Goal: Have a production-deployed new implementation for one slice, callable from the facade, without yet trusting it with user-visible traffic.

Activities

  • Implement the slice behind the same contract (or versioned API with a deprecation date).
  • Add feature flag or header-gated route in the facade (internal-only first).
  • Run shadow traffic: duplicate requests, compare responses (tolerate acceptable diffs such as timestamps), store diffs for triage.
  • Add contract tests and synthetic checks that fail CI when responses diverge.
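The "tolerate acceptable diffs" step can be as simple as stripping volatile fields before comparing shadow responses. The field names here are examples, not a standard list:

```python
VOLATILE_FIELDS = {"timestamp", "request_id", "generated_at"}  # illustrative

def normalized(payload: dict) -> dict:
    """Drop fields that legitimately differ between legacy and new responses."""
    return {k: v for k, v in payload.items() if k not in VOLATILE_FIELDS}

def responses_match(legacy: dict, new: dict) -> bool:
    return normalized(legacy) == normalized(new)

print(responses_match(
    {"total": 100, "timestamp": "2024-01-01T00:00:00Z"},
    {"total": 100, "timestamp": "2024-01-01T00:00:03Z"},
))  # True — only a volatile field differs
print(responses_match({"total": 100}, {"total": 99}))  # False — real drift
```

Every field you add to the volatile list is an intentional diff; document why, or it becomes a blind spot.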

Exit criteria

  • New path deploys on the same cadence as the rest of the system (not a skunkworks binary).
  • Shadow diffs below agreed threshold for sustained period; known intentional diffs documented.

Typical pitfall: Behavior drift—the new service reimplements business rules slightly differently. Two sources of truth for the same URL breed production mysteries. Prefer shared kernels or generated contracts only when you accept the coupling trade-off; otherwise one side generates golden fixtures until parity is proven.

Phase 3 — Progressive traffic shift

Goal: Move real user-impacting traffic in small steps, with a big red rollback lever.

Activities

  • Canary by percentage, region, tenant, or customer segment—start with internal or low-risk cohorts.
  • Watch golden signals (latency, errors, saturation) and business KPIs (conversion, payment success, order defects, support volume).
  • Define automatic rollback thresholds where possible (error rate, payment failure spike).
  • Communicate to support what flags mean for that week.
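Percentage canaries work best with stable cohort assignment, so a given tenant does not flap between implementations from request to request. A common hashing sketch (the salt and bucket scheme are illustrative):

```python
import hashlib

def in_canary(tenant_id: str, percent: int, salt: str = "slice-search-v1") -> bool:
    """Stable cohort assignment: the same tenant always lands in the same
    bucket, so ramping from 1% to 100% only ever adds tenants."""
    digest = hashlib.sha256(f"{salt}:{tenant_id}".encode()).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100
    return bucket < percent

# At percent=10, roughly one tenant in ten sees the new path; raise to ramp.
sample = [f"tenant-{i}" for i in range(1000)]
share = sum(in_canary(t, 10) for t in sample) / len(sample)
print(f"{share:.0%} of tenants in canary")  # ≈ 10%
```

Changing the salt reshuffles cohorts, which is sometimes useful and sometimes a footgun; treat it as part of the slice's configuration.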

Exit criteria

  • 100% of traffic for that slice on the new path (or explicit long-running dual-path policy documented—rare).
  • No open Sev incidents attributed to the slice for the stability window you defined (e.g. two weeks).

Typical pitfall: Watching only HTTP 5xx while business fails silently (e.g. wrong discount logic with 200 OK). Pair technical metrics with semantic checks (order totals, inventory invariants).
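A semantic check, in code terms, asserts business invariants on responses independent of HTTP status. The invariants and payload shape below are invented examples:

```python
def order_invariants_ok(order: dict) -> list[str]:
    """Checks a 200 OK cannot prove: return the list of violated invariants."""
    violations = []
    line_sum = sum(l["qty"] * l["unit_price"] for l in order["lines"])
    if line_sum - order["discount"] != order["total"]:
        violations.append("total != lines - discount")
    if order["discount"] > line_sum:
        violations.append("discount exceeds line total")
    return violations

ok = {"lines": [{"qty": 2, "unit_price": 50}], "discount": 10, "total": 90}
bad = {"lines": [{"qty": 2, "unit_price": 50}], "discount": 10, "total": 100}
print(order_invariants_ok(ok))   # []
print(order_invariants_ok(bad))  # ['total != lines - discount']
```

Run checks like these on canary traffic samples and alert on the violation rate, not just on 5xx counts.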

Phase 4 — Make the new path authoritative

Goal: For this slice, the new implementation owns the behavior. Legacy code path is deprecated and removed, not left to rot.

Activities

  • Declare ownership: on-call, SLAs, dashboards for the new service.
  • Disable monolith route in configuration; keep temporary kill-switch to legacy for one release if policy requires.
  • Delete dead monolith modules; remove feature flags once the new path is the permanent default.
  • Update documentation and runbooks so new engineers do not "fix bugs" in the deleted path.

Exit criteria

  • No production traffic to legacy for this capability (verified in gateway logs and traces).
  • Code deletion merged—lines removed, not commented out.

Typical pitfall: Permanent dual implementation—two teams fix two copies of the same rule until they disagree in prod. The strangler only pays off when you finish the strangle.

Phase 5 — Data strangulation (often the hard part)

Goal: Align storage ownership with service ownership. HTTP routing without data migration yields a distributed monolith—new deployable units that still fight over the same rows.

Activities — typical progression

  1. Reads from new projection, writes still to legacy (or dual-write with reconciliation) for a bounded time—document maximum lag and who wins on conflict.
  2. Backfill historical data into the new store; verify counts, checksums, spot audits.
  3. Flip writes to the new primary store using transactional outbox or CDC so downstream consumers stay consistent (see data consistency).
  4. Stop writes to legacy tables for this aggregate; archive cold data; drop schema only after legal/retention sign-off.
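Step 3's transactional outbox can be sketched in a few lines. This uses SQLite to stay self-contained; a real system would use your primary database plus a relay or CDC process, and the table and topic names are illustrative:

```python
import sqlite3, json

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, total INTEGER);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         topic TEXT, payload TEXT, published INTEGER DEFAULT 0);
""")

def place_order(order_id: int, total: int) -> None:
    """The business write and its event commit atomically, or not at all,
    so consumers never see an order without its event (or vice versa)."""
    with conn:  # one transaction for both inserts
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total))
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("order.created", json.dumps({"id": order_id, "total": total})),
        )

place_order(1, 90)
# A separate relay polls unpublished rows, publishes to the broker, then marks them.
rows = conn.execute("SELECT topic, payload FROM outbox WHERE published = 0").fetchall()
print(rows)  # [('order.created', '{"id": 1, "total": 90}')]
```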

Exit criteria

  • Single writer per aggregate in production (enforced by code and DB grants where possible).
  • Legacy tables read-only or archived; no hidden batch jobs still mutating them.
  • Reconciliation jobs green for the slice's data pairs.

Typical pitfall: Strangle HTTP first, data never—you now have microservices on paper and a monolithic database in practice, with no independent scaling or blast-radius reduction. Phase 5 is where many programs earn—or lose—the trust of finance and ops.

If you skip disciplined data migration, you get a distributed monolith: new services, old shared database, unclear ownership—the worst of both worlds.


| Approach | Risk profile | When it fits |
| --- | --- | --- |
| Strangler | Low per slice, cumulative learning | Legacy monolith, unclear domains, need rollback |
| Big-bang rewrite | Very high, single cutover | Small system, frozen requirements, rare |
| Branch by abstraction | Medium; code-level toggle | Refactor inside monolith before split |
| Parallel run | Higher cost, strong verification | Regulated domains, must prove equivalence |

The strangler often combines with branch by abstraction inside the monolith before a service is extracted.


Failure modes teams confuse with "success"

  1. Gateway exists, monolith still owns everything — you shipped infrastructure, not migration.
  2. Perpetual dual implementation — two codepaths forever; bugs diverge.
  3. Strangle HTTP but not data — services call the same DB; you did not get independent deployability.
  4. No parity tests — subtle behavioral differences surface as revenue loss.
  5. No deletion ceremony — old modules linger, on-call still wakes for dead paths.

How I answer in interviews

"The strangler pattern puts a stable facade in front of a legacy system and migrates one vertical slice at a time. New functionality or traffic segments route to the new implementation while the monolith stays as fallback. You prove parity with shadow traffic and metrics, shift traffic gradually, then retire the old path and migrate data deliberately—outbox, CDC, reconciliation—so you don't split services and still share one database forever."



FAQs

Q: Is an API gateway always required?

A: You need a routing layer with a stable entry. That is often an API gateway or reverse proxy; in some setups it is a library router inside the monolith early on. The pattern is the incremental routing, not a specific product.

Q: Can you use the strangler for databases only?

A: You can strangle read paths (new read replica, new CQRS view) while writes stay on the legacy store, then migrate writes. The same slice + verify + cutover discipline applies.

Q: How do you know when a slice is "done"?

A: When no production traffic hits the legacy path, no writes go to legacy tables for that slice, monitoring shows stable SLOs, and on-call agrees the old code path is deleted or archived.

Q: What is the biggest mistake when adopting the strangler?

A: Treating it as a network diagram change instead of a data and ownership migration. The facade is the easy part; who owns the truth is what determines whether you actually improved the system.

Q: Does the strangler work for mobile clients?

A: Yes—often via API versioning and BFF routes so mobile apps do not need big-bang updates. Route new app versions to new backends while old versions sunset on a timeline.

Keep exploring

Real engineering stories work best when combined with practice. Explore more stories or apply what you've learned in our system design practice platform.