Platform Engineering · 15 min read

The Microservices Migration That Nearly Killed the Company

An e-commerce team built 47 clean microservices and spent 60% of every sprint on infrastructure problems. The architecture was right. The platform was missing.

TL;DR. An e-commerce team split its monolith into 47 clean services and then spent 60 percent of every sprint on infrastructure problems that never existed before. The failure mode was not the architecture. It was building 47 services without the platform to operate them, so distributed complexity arrived with no consistent tooling to manage it.

Twelve months into a microservices migration, an e-commerce company had successfully decomposed their monolith into 47 independent services. The architecture was clean. The team was proud. And they were spending 60 percent of every sprint on infrastructure problems that had never existed before.

Deployments that used to take 20 minutes now required coordinating releases across 8 teams. A bug in the order service would surface as a mysterious timeout in the payment service two hops downstream. On-call had become a recurring nightmare as engineers struggled to trace failures across dozens of services without adequate tooling. Three senior engineers quit in the same quarter. The CTO who had championed the migration was beginning to question whether they had made a mistake.

I tell this story not because it is unusual. I tell it because it is exactly what I see at least twice a year, and the people going through it are usually doing exactly what the industry told them to do.

Why the original migration decision made complete sense — and what it failed to account for

The original decision to migrate made complete sense on paper. The monolith had become genuinely difficult to work with. Deployments were risky because everything touched everything. A small change to the user profile module required a full regression test of the checkout flow. A new feature in the recommendation system could introduce a bug in the search ranking. New engineers took three to four months to become productive because the system was so interconnected that understanding any part of it required understanding all of it.

The CTO had read the case studies. His team had done the architecture workshops. They started with a reasonable decomposition plan, identified service boundaries along business domains, and even brought in a consultant to validate the approach. The technical design was sound. The bounded contexts were well-defined. The team was experienced with Kubernetes and had the infrastructure skills to run a distributed system.

What the design did not account for was the organizational readiness to operate what they were building.

The operational tax of 47 services without a platform to manage them

A microservices architecture is not just a different way to organize code. It is a fundamentally different operational model, and the cost of that operational model is frequently underestimated until it has been experienced in production.

Each service is its own deployment artifact with its own release cadence, its own health monitoring requirements, its own logging configuration, its own configuration management, its own dependency management, and its own on-call ownership. When you have 47 services, you need 47 things to be observable, deployable, and maintainable, ideally in a consistent way that makes the operational overhead predictable rather than unique for each service.

This company had the services. They did not have the platform.

They were managing deployments with a mix of Bash scripts and tribal knowledge that had been adequate for the monolith but was completely inadequate for 47 independent services. Service discovery was handled differently in staging than in production because the staging environment had been set up by a different team at a different time. There was no standardized way to add a new service, so each team had invented its own onboarding ritual, which meant no two services were configured consistently. Observability was an afterthought: logs existed, but correlating a user-facing error to a specific service failure required significant forensic work by an engineer who knew the system topology.

The architecture had distributed the complexity of their monolith across 47 places without providing the tooling to manage that distributed complexity. If anything, they had replaced one complicated thing with 47 less-complicated things that were all complicated in different ways, without a consistent framework for managing any of them.

When service boundaries do not match organizational structure — the coordination overhead that was supposed to disappear

There was a second dimension to the failure that was less technical and more organizational. The company had decomposed their architecture along boundaries that made sense technically but did not align cleanly with their organizational structure. As a result, features that touched multiple services required coordination across multiple teams, which introduced the kind of communication overhead that microservices are supposed to eliminate.

Conway's Law, the observation that systems tend to reflect the communication structure of the organizations that build them, works in both directions. If your organizational structure does not match your service boundaries, you have to either refactor the architecture to match the organization or refactor the organization to match the architecture. This company had done neither. The architecture represented the technical ideal. The organization represented historical reality. The coordination cost between them was measured in meeting time and delayed releases.

The teams that navigate microservices migrations most successfully are the ones that address the organizational dimension of the migration explicitly before or alongside the technical work. They ask: if we decompose the architecture this way, which teams will own which services, and does that ownership model create the autonomous delivery capability that motivated the decomposition? If the answer is that features will still require coordination across teams, the architectural decomposition has not solved the original problem.
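The ownership question can be checked mechanically before any service is built. The following is a minimal sketch, not a real tool: the service names, team names, and features are all hypothetical, and it assumes you can list which services each planned feature would touch.

```python
# Hypothetical ownership map: service name -> owning team.
SERVICE_OWNERS = {
    "orders": "checkout-team",
    "payments": "checkout-team",
    "catalog": "discovery-team",
    "search": "discovery-team",
    "recommendations": "ml-team",
}

# Hypothetical roadmap features and the services each one has to change.
FEATURES = {
    "one-click reorder": ["orders", "payments"],
    "personalized search": ["search", "recommendations"],
    "gift wrapping": ["orders", "catalog", "payments"],
}


def teams_required(services, owners):
    """Return the set of teams that must coordinate to change these services."""
    return {owners[name] for name in services}


def cross_team_features(features, owners):
    """Map each feature that needs more than one team to the sorted teams involved."""
    result = {}
    for feature, services in features.items():
        teams = teams_required(services, owners)
        if len(teams) > 1:
            result[feature] = sorted(teams)
    return result
```

Calling cross_team_features(FEATURES, SERVICE_OWNERS) on this sample data flags "personalized search" and "gift wrapping" as needing two teams each. If most of a real roadmap ends up in that result, the proposed boundaries are unlikely to deliver autonomous delivery.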

What a platform for microservices actually is — four basic capabilities, not a sophisticated portal

The term platform engineering has become fashionable to the point of losing precise meaning. What this company needed was not a sophisticated internal developer portal or a Kubernetes-native service mesh. They needed four things that were much more basic.

They needed a golden path for creating a new service that every team would actually use, with sane defaults for logging, health checks, and CI/CD wired in from the start. A new service created through the golden path should be observable, deployable, and on-call-ready from day one without any additional configuration.

They needed a deployment pipeline that worked consistently across all services. Not 47 different deployment scripts, but one deployment process with per-service configuration. A developer who understands how to deploy one service should understand how to deploy any service.

They needed unified observability so that when something broke, an engineer could trace a request across service boundaries without having to SSH into anything or correlate log files manually. Distributed tracing with proper instrumentation means that a user-facing error produces a trace that shows exactly which service calls failed and why, in a format that an engineer who did not write the services can read.

They needed a runbook for common failure modes so that on-call engineers could resolve 80 percent of incidents without escalation. Not a complete operational manual, but the specific documentation of the five most common incident types and exactly what to do when they occur.

None of this is exotic. Building it properly requires dedicating engineering time to infrastructure as a first-class product, with a team assigned to it and a roadmap that is treated with the same seriousness as the product roadmap.
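To make the tracing capability concrete, here is a minimal sketch of how a trace identifier can follow a request across service boundaries. It uses the header layout from the W3C Trace Context format (version, trace ID, parent span ID, flags); the function names are hypothetical, and in practice an instrumentation library would normally do this for you rather than hand-written code.

```python
import secrets


def new_traceparent() -> str:
    """Start a new trace: 32 hex chars of trace ID, 16 hex chars of span ID."""
    trace_id = secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-01"  # version 00, sampled flag 01


def child_traceparent(parent: str) -> str:
    """Keep the caller's trace ID but mint a new span ID for this hop."""
    version, trace_id, _parent_span_id, flags = parent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


def outgoing_headers(incoming_headers: dict) -> dict:
    """Build headers for a downstream call, continuing the trace if one exists."""
    parent = incoming_headers.get("traceparent")
    value = child_traceparent(parent) if parent else new_traceparent()
    return {"traceparent": value}
```

Because every hop keeps the same trace ID, a backend that collects spans can stitch the order service call and the payment service timeout into one timeline, which is what removes the manual log correlation.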

What the three-month remediation looked like and what it produced

The remediation took three months and required stopping most new feature development during that period, which was a difficult conversation with the business. The fix addressed the operational infrastructure that should have been built before the migration.

The golden path reduced the time to add a new service from three days to half a day, and more importantly, it reduced the variation between services so that any engineer on call for any service had a consistent framework for understanding what they were looking at.
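A golden path is mostly a set of defaults that nobody has to remember. As a rough sketch of the idea, assuming a hypothetical manifest format (the field names, the /healthz path, and the shared-pipeline name are illustrative conventions, not what this company used):

```python
import re
from dataclasses import dataclass

# Assumed naming rule: lowercase letters, digits, and hyphens.
SERVICE_NAME = re.compile(r"^[a-z][a-z0-9-]*$")


@dataclass
class ServiceManifest:
    """Hypothetical manifest a golden-path scaffold might emit for a new service."""
    name: str
    owner_team: str
    log_format: str = "json"             # structured logs by default
    health_check_path: str = "/healthz"  # assumed convention
    tracing_enabled: bool = True
    pipeline: str = "shared-pipeline"    # hypothetical name of the common pipeline
    runbook_path: str = ""


def scaffold_service(name: str, owner_team: str, **overrides) -> ServiceManifest:
    """Create a manifest with on-call-ready defaults; overrides are explicit."""
    if not SERVICE_NAME.match(name):
        raise ValueError(f"invalid service name: {name!r}")
    manifest = ServiceManifest(name=name, owner_team=owner_team, **overrides)
    if not manifest.runbook_path:
        manifest.runbook_path = f"runbooks/{name}.md"
    return manifest
```

A team calling scaffold_service("inventory", "fulfillment-team") gets logging, health checks, tracing, and a runbook location without making a single decision. Deviating is still possible, but it has to be written down as an override.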

Unified logging and distributed tracing, implemented consistently across all services using OpenTelemetry, cut the mean time to resolution on incidents by roughly 70 percent. Engineers who previously spent 45 minutes tracing a user-facing error across logs from multiple services could now trace it in under 5 minutes using the distributed trace. The time savings compounded: every incident resolved faster, every on-call shift less exhausting.
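Consistent logging is what makes that kind of lookup possible. The sketch below shows one way to emit structured logs that carry a trace ID, using only the Python standard library. The field names and the build_logger helper are assumptions for illustration; an OpenTelemetry setup would typically inject the trace ID for you.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render every log record as one JSON object with the same fields."""

    def __init__(self, service_name):
        super().__init__()
        self.service_name = service_name

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": self.service_name,
            # trace_id is supplied by the caller via logging's `extra` argument.
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def build_logger(service_name, stream=None):
    """Return a logger whose output is identical in shape across services."""
    logger = logging.getLogger(service_name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service_name))
    logger.addHandler(handler)
    logger.propagate = False
    return logger
```

A service would then log with something like build_logger("payments").info("charge declined", extra={"trace_id": trace_id}), and a single query on trace_id returns the lines from every service that handled the request.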

Standardized CI/CD, with one pipeline definition that services inherited and overrode where necessary, eliminated the deployment coordination overhead. Releases that had required cross-team coordination now happened independently, which was the original goal of the decomposition.
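The inherit-and-override pattern is simple enough to sketch. This is an illustration under assumed names, not the pipeline format the team used: the stage names, keys, and values are hypothetical, and a real CI system would express the same idea in its own configuration language.

```python
import copy

# Hypothetical shared pipeline definition that every service inherits.
BASE_PIPELINE = {
    "stages": ["build", "test", "scan", "deploy"],
    "deploy": {"strategy": "rolling", "replicas": 2, "timeout_seconds": 300},
    "notifications": {"channel": "#deploys"},
}


def merge_config(base, overrides):
    """Recursively merge overrides into a copy of base; override values win."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# A service states only what differs from the shared definition.
PAYMENT_OVERRIDES = {"deploy": {"strategy": "canary", "replicas": 4}}
payment_pipeline = merge_config(BASE_PIPELINE, PAYMENT_OVERRIDES)
```

The payment service ends up with a canary deploy and four replicas while keeping the shared stages and timeout. An engineer reading the override file sees the full list of ways this service is unusual, which is the point of the design.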

The diagnostic question to answer before the migration — not after the 47th service is deployed

The organizations that execute microservices migrations successfully tend to share a common approach. They invest in platform capabilities before they need them at scale, not after the migration is complete. They build the golden path, the deployment automation, and the observability infrastructure during or before the migration, not as remediation work afterward.

The investment required to build this infrastructure before a 47-service deployment is much smaller than the investment required to retrofit it onto an existing 47-service deployment. Services built on the golden path are already consistent with each other. Services that were built independently and then need to be made consistent require individual work for each one.

The diagnostic question to ask before a microservices migration: "If one of these services fails in production at 2am, and the on-call engineer has never seen this service before, can they resolve the incident without escalation?" If the answer requires imagining an ideal observability and runbook infrastructure that does not yet exist, build that infrastructure before you need it.
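One way to keep that question honest is to answer it per service rather than in general. A minimal sketch, assuming a hypothetical inventory in which each service records whether the basics exist (the four checks are illustrative, not a complete readiness standard):

```python
from dataclasses import dataclass, fields


@dataclass
class OnCallReadiness:
    """Hypothetical per-service record of what an unfamiliar engineer can rely on."""
    service: str
    has_runbook: bool
    has_tracing: bool
    has_health_check: bool
    has_alerts: bool


def readiness_gaps(entry):
    """Return the names of the checks this service fails, in declaration order."""
    return [
        f.name
        for f in fields(entry)
        if f.name.startswith("has_") and not getattr(entry, f.name)
    ]


def services_needing_escalation(entries):
    """Map each service with at least one gap to its list of gaps."""
    report = {}
    for entry in entries:
        gaps = readiness_gaps(entry)
        if gaps:
            report[entry.service] = gaps
    return report
```

Any service that appears in the report is one where the 2am answer is probably "escalate", so the report doubles as a prioritized backlog for the platform work.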

Why three senior engineers quit — and why compensation did not offset the on-call burden

The three senior engineers who quit during the worst of the migration crisis left for a specific reason that deserves attention. They did not leave because the architecture was wrong. They left because the architecture had created an environment where getting anything done required fighting the infrastructure every day.

Engineers who joined to build product capabilities were spending most of their time debugging infrastructure problems in systems they did not own. The on-call rotation had become so burdensome that engineers were factoring it into their assessment of whether the job was worth doing. The combination of high cognitive load, unpredictable on-call burden, and limited ability to make progress on meaningful work had crossed the threshold where even competitive compensation was no longer enough to keep them.

By the time the platform was working properly, two of the three engineers could not be re-hired because they had already committed to new roles. The institutional knowledge they took with them was not recoverable.

Architecture documents do not tell you what it feels like to be on call at 2am when the tracing is broken and you are guessing which of 47 services is misbehaving. If you are planning a microservices migration, start with that scenario and build the infrastructure required to handle it well before you need it.

The right architectural question is not "monolith or microservices" — it is "minimum viable change"

One conclusion engineers sometimes draw from stories like this is that microservices were the wrong architectural choice and a return to a monolith would have been better. This conclusion is usually too simple.

The organization in this story needed to scale its engineering teams and its system independently. The monolith they were migrating from had genuine limitations in those dimensions. The problem was not that they chose microservices. It was that they attempted a microservices migration without building the platform capabilities that make microservices operationally viable.

The decision framework that produces the right architectural choice is not "monolith or microservices." It is "what are the specific problems with our current architecture, and what is the minimum viable architectural change that addresses those problems while creating the least new operational complexity?"

For many organizations, this analysis produces an answer that is neither a monolith nor a full microservices architecture. It might be a "modular monolith" where the codebase is structured with clear domain boundaries but deployed as a single unit. Or a small number of services, 5 to 10 rather than 47, organized around major system domains. Or a monolith with a handful of specific capabilities extracted as services because those capabilities need to scale or deploy independently.

The engineering community's pendulum has swung toward microservices and is now swinging back toward simpler architectures for organizations that cannot support the operational complexity. The right answer is always specific to the organization's team size, operational maturity, and actual scale requirements.

Microservices migrations fail when the platform is built after the services, not before

Microservices migrations frequently happen before the organization has developed the operational maturity to support them. The services are created before the platform exists. The deployment automation is built after dozens of services already need it. The observability is retrofitted rather than designed in.

This sequence produces the worst-case outcome: all the operational complexity of a microservices architecture without the platform infrastructure that makes that complexity manageable. The result is the story at the top of this piece: 47 services each requiring manual, bespoke deployment processes, with no distributed tracing and no runbooks.

The sequence that produces the best outcome is the opposite: build the platform first, build the first service using that platform, validate that the platform works, then expand to additional services. Each new service benefits from the platform capabilities established for the first one. The migration is slower in the early stages and faster in the later stages. And the engineers managing the resulting system have the tools they need to operate it confidently.

The organizations that have done this well were often not the ones moving fastest. They were the ones that invested the time to do the foundational work correctly before expanding the scope of the migration. That discipline, which looked like slowness from the outside, produced dramatically better operational outcomes and substantially lower total cost.

When a service mesh is the right next step — and when it adds complexity before the foundations exist

One technology decision that frequently comes up during microservices migrations, and deserves more careful consideration than it typically receives, is whether to introduce a service mesh. Service meshes like Istio and Linkerd, along with data-plane proxies such as Envoy that meshes like Istio are built on, provide traffic management, security, and observability capabilities at the infrastructure layer, which can simplify the concerns that individual services need to handle.

The appeal is real: offloading cross-cutting concerns like retries, timeouts, mTLS, and distributed tracing to the mesh reduces the code each service needs to maintain. The operational reality is that service meshes add significant complexity to the infrastructure layer and require engineering expertise to operate well.

For organizations mid-migration and struggling with operational complexity, adding a service mesh is rarely the right first move. The observability problem is better addressed with a simpler, more targeted investment in distributed tracing and structured logging. The reliability concerns around service-to-service communication are better addressed with consistent timeout and retry patterns implemented at the application layer before the infrastructure layer.
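An application-layer retry and timeout pattern can be small. The following is a sketch under stated assumptions: TransientError and call_with_retries are hypothetical names, the defaults are placeholders rather than recommendations, and the wrapped operation is expected to accept a timeout in seconds and enforce it itself.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_retries(operation, *, attempts=3, timeout=1.0,
                      base_delay=0.1, max_delay=2.0, sleep=time.sleep):
    """Call operation(timeout), retrying TransientError with backoff and jitter.

    The last failure is re-raised so callers still see the real error.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation(timeout)
        except TransientError:
            if attempt == attempts:
                raise
            # Exponential backoff capped at max_delay, with full jitter so
            # many callers do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
```

The value comes from every service using the same wrapper with the same defaults. When the payment service times out, the on-call engineer already knows how many attempts were made and how long each one waited.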

Service meshes become valuable when the organization has reached a maturity level where the cross-cutting concerns they address are actually creating significant maintenance overhead. That maturity level is typically reached at 20 or more services with a dedicated platform team that has the capacity to operate the mesh infrastructure. Below that threshold, the operational overhead of the mesh often exceeds the maintenance savings.

The lesson for organizations evaluating service mesh adoption is the same as the lesson for microservices migration generally: sequence matters. Build the foundations that make you operationally ready for the next layer of infrastructure complexity before adding that complexity.

Frequently asked questions

What are the four platform capabilities a microservices migration requires before the first service deploys?

A golden path for creating a new service that every team uses, with logging, health checks, and CI/CD wired in from the start. A deployment pipeline that works consistently across all services — one deployment process with per-service configuration, not 47 bespoke scripts. Unified observability so that an engineer can trace a request across service boundaries without SSH or manual log correlation. And a runbook for common failure modes so that on-call engineers can resolve 80 percent of incidents without escalation. Building this infrastructure before the migration costs significantly less than retrofitting it onto services that were built independently.

What is the Conway's Law problem in microservices migrations?

Conway's Law observes that systems reflect the communication structure of the organizations that build them — and it runs in both directions. If service boundaries do not align with organizational structure, features requiring changes to multiple services require coordination across multiple teams, which reintroduces the overhead that microservices were supposed to eliminate. The organizations that navigate this well address the organizational dimension of the migration explicitly before or alongside the technical work: they ask which teams will own which services and whether that ownership model creates the autonomous delivery capability that motivated the decomposition.

When do service meshes like Istio or Linkerd actually make sense?

When the organization has reached operational maturity where cross-cutting concerns (retries, timeouts, mTLS, distributed tracing) are genuinely creating significant maintenance overhead across services. That threshold is typically 20 or more services with a dedicated platform team that has the capacity to operate the mesh infrastructure. Below that threshold, the operational overhead of the mesh usually exceeds the maintenance savings. The observability problem is better addressed first with distributed tracing and structured logging, and reliability concerns with consistent timeout and retry patterns at the application layer.


If you are mid-migration and recognizing some of this pattern, a Platform Engineering review can help you identify where you are accumulating operational debt before it accumulates further.


Matías Caniglia


Founder of Clouditive. 18+ years transforming engineering organizations across LATAM and globally through Developer Experience consulting.
