Platform Engineering · 11 min read

5 signs your platform team is stuck in ad-hoc mode

Most platform teams stay in ad-hoc mode because it looks functional from inside. Five signs your platform runs on tribal knowledge, not designed capability.

TL;DR. Ad-hoc mode does not feel like failure from the inside. Deployments happen, incidents resolve, code ships. What ad-hoc mode cannot do is produce the same output reliably when the engineer who knows the shortcut is on vacation. Five patterns identify it before it becomes a crisis: deployment time varies significantly between engineers on the same service, architectural decisions live in Slack and nowhere else, onboarding takes three weeks instead of five days, the same incidents repeat when the hero engineer is unavailable, and leadership cannot show whether the platform is improving. Each pattern has a specific measurable signal and a specific absorption target.

Ad-hoc is where most platform teams start. It is also where most platform teams stay.

The progression from ad-hoc to managed to defined to optimized is the maturity arc that every platform engineering framework describes. Most organizations recognize the description and locate themselves somewhere in the middle. What they underestimate is how much ad-hoc mode looks like functional operation from the inside. Deploys are happening. Incidents are getting resolved. Engineers are shipping code. The system produces output. What it cannot do is produce the same output reliably, at scale, when the engineer who knows the shortcut is on vacation.

Ad-hoc mode is not a team failure. It is a design state. Every platform starts there because every platform begins as a collection of individual decisions made under time pressure by engineers who knew the constraints at the time. The problem is that ad-hoc mode is invisible until it is not. The five patterns below each identify something that feels normal from inside the system and expensive only when it breaks.

Sign 1: deploy time is a person, not a pipeline

The same codebase. The same pipeline. Five different outcomes.

Alice deploys the payment service in 18 minutes. Bob deploys the same service in 97 minutes. Alice deploys again at 22 minutes. Carlos tries it at 71 minutes. Bob at 108 minutes.

If you have seen this distribution in your deployment logs, your platform is operating on tribal knowledge, not designed capability. The pipeline exists. The canonical path does not reliably produce canonical results. The actual deploy time is a function of who is deploying, what they know that is not in the documentation, which environment-specific quirks they have learned, and which Slack message from six months ago they remember explaining the retry step.

The diagnostic question: can a developer who joined last month deploy the payment service without synchronous help from someone who has done it before? If the answer is no, the platform is a peer, not an absorber. It provides surface area without providing capability.

The root cause in most organizations is that platform capabilities are built by engineers who internalize operational knowledge they never write down. The CI/CD pipeline was built by someone who knows why the third step has a five-minute sleep and what happens if you remove it. That knowledge is not in the pipeline. It is in their head. The platform gives the appearance of capability while depending on a human to close the gap.

A managed platform produces consistent deploy times regardless of who is deploying. The target in the Foundations Framework Horizon phase is that any engineer who can read the deployment documentation can deploy without synchronous assistance. The delta between Alice's 18 minutes and Bob's 97 minutes is the platform's knowledge debt. Measure it. Reduce it. The reduction is not a documentation project — it is an absorption project: identify every reason the deploy time varies, then build the capability that eliminates the variation.
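One way to put a number on that delta is sketched below. It is a minimal example, not a prescribed method: the durations are the five deploys from the example above, and the function names (deploy_spread, mean_by_engineer) are hypothetical. In practice the records would come from your own pipeline logs.

```python
from statistics import mean, pstdev

# Deploy durations in minutes for the payment service, from the example above.
deploys = [
    ("Alice", 18),
    ("Bob", 97),
    ("Alice", 22),
    ("Carlos", 71),
    ("Bob", 108),
]


def deploy_spread(records):
    """Return (fastest, slowest, population standard deviation) in minutes."""
    durations = [minutes for _, minutes in records]
    return min(durations), max(durations), pstdev(durations)


def mean_by_engineer(records):
    """Average deploy time per engineer, to show whether variance follows people."""
    by_engineer = {}
    for engineer, minutes in records:
        by_engineer.setdefault(engineer, []).append(minutes)
    return {engineer: mean(times) for engineer, times in by_engineer.items()}


fastest, slowest, spread = deploy_spread(deploys)
print(f"fastest={fastest} min, slowest={slowest} min, stdev={spread:.1f} min")
for engineer, average in mean_by_engineer(deploys).items():
    print(f"{engineer}: {average} min on average")
```

If the per-engineer averages differ far more than repeated deploys by the same engineer do, the variance is following people rather than the pipeline, which is the pattern this sign describes.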

Sign 2: architectural decisions live in Slack and nowhere else

Q1 2025: picked Terraform. Q2 2025: switched to Pulumi. Q3 2025: back to Terraform. Current state: why are we here?

No Architecture Decision Record equals the same mistake, different quarter.

This is not a failure of individual engineers. The engineer who switched from Terraform to Pulumi in Q2 had a reason. The engineer who switched back in Q3 had a reason. Neither reason is accessible to the engineer evaluating the stack in 2026, because neither reason was written down. The stack decision exists in the configuration. The reasoning behind the stack decision is gone.

The cost compounds across every major technical decision your team has made in the last two years. Your service mesh choice. Your secrets management approach. Your environment structure. Your IDP strategy. Each was made for reasons that made sense at the time. Without a written record for each, none of those reasons is accessible. New engineers guess. Experienced engineers forget. The team relitigates the same decisions quarterly.

Based on Clouditive assessment data across twelve engagements, approximately 63 percent of architectural decisions get revisited within six months, and in none of those cases was there a written record of why the original choice was made. That reversal rate is not a measure of bad decisions. It is a measure of invisible decisions. A decision that was right for reasons nobody can see will be reversed by the next person who cannot see the reasons.

A managed platform writes an Architecture Decision Record before building any significant capability. The ADR records the problem, the options considered, the chosen path, the known tradeoffs, and the consequences. It lives in version control alongside the code it describes. When new engineers want to change something, they write a new ADR. The conversation about whether the change is right happens before the code is written, not after it is deployed.
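The five parts of an ADR listed above can be captured in a very small structure. The sketch below is one hypothetical shape for it: the class name, the numbering scheme, the file naming, and all of the example text are assumptions for illustration, not a standard format. The rendered text is meant to be committed next to the code it describes.

```python
from dataclasses import dataclass, field


@dataclass
class DecisionRecord:
    """Hypothetical ADR holding the five parts named in the text."""
    number: int
    title: str
    problem: str
    options: list
    decision: str
    tradeoffs: list = field(default_factory=list)
    consequences: list = field(default_factory=list)

    def filename(self):
        # e.g. 0001-choose-infrastructure-as-code-tool.txt (naming is an assumption)
        slug = "-".join(self.title.lower().split())
        return f"{self.number:04d}-{slug}.txt"

    def render(self):
        def bullets(items):
            return "\n".join(f"- {item}" for item in items) or "- (none recorded)"

        sections = [
            f"ADR {self.number:04d}: {self.title}",
            "Problem\n" + self.problem,
            "Options considered\n" + bullets(self.options),
            "Decision\n" + self.decision,
            "Known tradeoffs\n" + bullets(self.tradeoffs),
            "Consequences\n" + bullets(self.consequences),
        ]
        return "\n\n".join(sections) + "\n"


# Placeholder content only; a real record states the team's actual reasoning.
adr = DecisionRecord(
    number=1,
    title="Choose infrastructure-as-code tool",
    problem="Placeholder: what forced the choice and what constraints applied.",
    options=["Terraform", "Pulumi"],
    decision="Placeholder: the chosen tool and the reason it won.",
    tradeoffs=["Placeholder: what the team gives up with this choice."],
)
print(adr.filename())
print(adr.render())
```

The point of the structure is that the reasoning survives the engineer: whoever evaluates the stack later reads the problem and tradeoffs before proposing a switch, and records the switch as a new ADR.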

Sign 3: onboarding takes three weeks when the target is five days

Documentation is someone's memory.

21 days to first merged PR. The target should be 5.

The gap between those two numbers is not a measure of how complex your codebase is. Complex codebases with well-written documentation and clear decision records produce fast onboarding. Simple codebases where critical knowledge lives in the heads of three senior engineers produce slow onboarding, because the new engineer cannot learn from documentation that does not exist. They learn by asking, waiting, and trial and error.

The 21-day number is recoverable from most engineering organizations' historical data. Count the calendar days between a new engineer's first commit to the repository and their first merged PR. Average across the last four new hires. That is your onboarding time. Compare it to the target.
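The calculation above is small enough to sketch directly. The dates below are made-up placeholders and the function names are hypothetical; the only inputs are the two dates per hire that the paragraph names, plus the 5-day target.

```python
from datetime import date
from statistics import mean

TARGET_DAYS = 5  # the target discussed in this section

# Hypothetical records: (hire, first commit date, first merged PR date).
hires = [
    ("hire-1", date(2025, 1, 6), date(2025, 1, 27)),
    ("hire-2", date(2025, 2, 3), date(2025, 2, 20)),
    ("hire-3", date(2025, 3, 3), date(2025, 3, 28)),
    ("hire-4", date(2025, 4, 7), date(2025, 4, 28)),
]


def onboarding_days(start, first_merged_pr):
    """Calendar days between the start of the clock and the first merged PR."""
    return (first_merged_pr - start).days


def average_onboarding(records, last_n=4):
    """Average onboarding time across the most recent `last_n` hires."""
    recent = records[-last_n:]
    return mean([onboarding_days(start, merged) for _, start, merged in recent])


average = average_onboarding(hires)
gap = average - TARGET_DAYS
print(f"average onboarding: {average} days, gap to target: {gap} days")
```

With these placeholder dates the average works out to 21 days, a 16-day gap to the target. Tracking the same number each quarter turns the gap into a trend.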

The 5-day target sounds aggressive. It is achievable when the platform absorbs the orientation work that currently falls on the new engineer. That takes four things: a working local development environment from one command, an ADR log explaining every major technical decision, a golden path for the first PR that requires no knowledge outside the documentation, and a deployment procedure that produces consistent results regardless of who runs it. A platform with all four produces 5-day onboarding because the platform does the orientation work, not the senior engineer.

A managed platform treats onboarding time as a platform metric, not a talent metric. If it takes 21 days, the platform is not absorbing the orientation work. Find the five things that consume the most time during onboarding and build absorbers for each. A missing environment variable that the new engineer spends two days diagnosing: write the runbook. A deployment step that requires knowledge not in the documentation: write the ADR and update the procedure. An architecture choice that is surprising without context: write the ADR. The target is not that the senior engineer is more available to new hires. The target is that the new hire does not need the senior engineer for orientation.

Sign 4: the same incidents repeat whenever the hero engineer is unavailable

The hero fixes it. Nobody writes it down.

In every engineering organization I have assessed, there are two or three engineers who resolve the most complex incidents faster than anyone else. They are not smarter or more capable. They have accumulated more incident context through repeated exposure. They remember which service has a race condition under high load. They know that the payment processor returns a 200 with a failure body in certain error states and that the monitoring alert does not catch it. That knowledge was never written down because it was never the right time to write it down.

Three months later, same incident, same panic — except the hero engineer is at a conference, or on parental leave, or has left the company. The team that depended on the hero faces the same incident with none of the accumulated context. They fix it again, paying the same investigation cost the hero paid the first time.

The measurement for this pattern is straightforward: count how many P1 and P2 incidents in the last two quarters were resolved via documented procedures and how many via improvised investigation. If fewer than 60 percent were resolved via documented procedures, your incident knowledge is living in people, not in systems.
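As a rough sketch of that count, assuming each incident record carries a severity and a flag for whether a documented procedure was used (both field names are hypothetical, as is the sample data):

```python
from dataclasses import dataclass

THRESHOLD = 0.60  # the 60 percent line from the text


@dataclass
class Incident:
    ident: str
    severity: str       # "P1", "P2", "P3", ...
    used_runbook: bool  # True if resolved via a documented procedure


def documented_share(incidents, severities=("P1", "P2")):
    """Share of in-scope incidents resolved via documented procedures."""
    in_scope = [i for i in incidents if i.severity in severities]
    if not in_scope:
        return None  # nothing to measure
    return sum(i.used_runbook for i in in_scope) / len(in_scope)


# Hypothetical incidents from the last two quarters.
incidents = [
    Incident("INC-1", "P1", True),
    Incident("INC-2", "P1", False),
    Incident("INC-3", "P2", False),
    Incident("INC-4", "P2", True),
    Incident("INC-5", "P2", False),
    Incident("INC-6", "P3", True),  # out of scope: not P1 or P2
]

share = documented_share(incidents)
below = share is not None and share < THRESHOLD
print(f"documented share: {share:.0%}, below threshold: {below}")
```

In this made-up sample, two of the five P1 and P2 incidents followed a documented procedure, so the share is 40 percent and the team sits below the line.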

A managed platform requires a runbook review as part of every incident close. The review does not need to be lengthy — three to five minutes to confirm that the documented procedure either exists and was followed, or was updated to reflect what was discovered. The runbook is not polished prose. It is the minimum information that would let a different engineer resolve the same incident in significantly less time. The goal is that the second occurrence of any P1 costs less than the first.

Sign 5: leadership cannot show whether the platform is getting better — because there is no baseline

No baseline. No delta. No story. If you cannot show the delta, the budget conversation is a guess.

This pattern surfaces when the first four exist. Without deployment consistency metrics, there is no baseline for the deploy variance problem. Without ADR records, there is no history of decision quality. Without onboarding time tracking, there is no improvement trend. Without incident documentation rates, there is no reduction curve.

Platform engineering investment requires justification, and the justification requires before-and-after data. A platform team that has been operating in ad-hoc mode has no before-and-after data because it never instrumented the before state. When budget time arrives, the team can produce anecdotes — "we fixed the payment service deploy," "onboarding is faster than it used to be" — but not data. Anecdotes do not survive CFO scrutiny. Anecdotes do not answer the question "what would we lose if we cut the platform team by half?" Data does.

A managed platform team defines a small set of operational metrics before beginning each quarter's work, measures the baseline at the start of the quarter, and reports the delta at the end. Four metrics cover most of the ground: deploy consistency (standard deviation of deploy times for the same service across the team), onboarding time (days to first merged PR per new engineer), incident documentation rate (percentage of incidents closed with a runbook review), and platform adoption under pressure (percentage of deployments using the canonical path during high-pressure release periods).

These four together give leadership a story: the platform is getting better, here is the evidence, here is what the improvement cost in engineering time, here is what it would cost to lose it.
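The baseline-and-delta report can be as simple as the sketch below. Every number in it is hypothetical, and the Metric class is an illustrative assumption rather than part of any framework. The one design point worth noting is that two of the four metrics improve by going down and two by going up, so the direction has to be recorded per metric.

```python
from dataclasses import dataclass


@dataclass
class Metric:
    name: str
    unit: str
    baseline: float       # measured at the start of the quarter
    current: float        # measured at the end of the quarter
    lower_is_better: bool

    @property
    def delta(self):
        return self.current - self.baseline

    @property
    def improved(self):
        return self.delta < 0 if self.lower_is_better else self.delta > 0


# Hypothetical quarter: all values are placeholders for illustration.
metrics = [
    Metric("deploy consistency (stdev)", "min", 37.3, 12.0, lower_is_better=True),
    Metric("onboarding time", "days", 21.0, 9.0, lower_is_better=True),
    Metric("incident documentation rate", "%", 40.0, 70.0, lower_is_better=False),
    Metric("canonical path adoption under pressure", "%", 55.0, 80.0, lower_is_better=False),
]

for m in metrics:
    verdict = "improved" if m.improved else "not improved"
    print(f"{m.name}: {m.baseline:g} -> {m.current:g} {m.unit} ({m.delta:+g}, {verdict})")
```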

Why moving from ad-hoc to managed is the hardest step on the maturity arc

The maturity arc runs: ad-hoc to managed to defined to optimized. Every platform team starts at ad-hoc. Moving from ad-hoc to managed is harder than any subsequent transition, because it requires making visible things that were invisible.

Deployment variance was not a metric before. Making it one requires admitting that the variance exists and that the platform is the cause. Onboarding time was not a platform metric before. Making it one requires accepting accountability for something that used to be attributed to engineer learning speed. Incident documentation rate was not tracked before. Starting to track it means confronting the gap.

The hardest move is not technical. It is the decision to say "we measure this now." That decision, made by someone with authority and executed with consistency, is what separates ad-hoc from managed. Not a tooling change. Not a hiring decision. The commitment to make the system's behavior visible.

Once the system is visible, improvements follow naturally because the team can see what to improve. The reason most organizations stay in ad-hoc mode is not that the path forward is unclear. It is that the current state is invisible, and invisible problems feel tolerable.

Frequently asked questions

How do I know if my platform team is operating in ad-hoc mode?

Ask one question: can a developer who joined last month deploy the main production service without synchronous help from someone who has done it before? If the answer is no, the platform is operating on tribal knowledge, not designed capability. The other four patterns — architectural decisions in Slack only, onboarding longer than five days, repeated incidents resolved only by specific heroes, and no baseline data to show improvement — are each independently measurable from data your organization already generates.

What is the 63% architectural decision reversal rate from?

It comes from Clouditive assessment data across twelve engagements, measuring how many significant architectural decisions were revisited within six months of being made. In none of the reversal cases was there a written record of why the original choice was made. The 63 percent rate is not a measure of bad decisions. It is a measure of invisible decisions. A choice that was right for reasons nobody can see will be reversed by the next engineer who cannot see those reasons. An Architecture Decision Record makes the reasoning persistent.

What is a realistic onboarding time target for a platform-mature team?

Five days to first merged PR. That target sounds aggressive and is achievable when the platform absorbs the orientation work that currently falls on the new engineer. The specific absorbers: a working local development environment from one command, an ADR log explaining every major technical decision, a golden path for the first PR requiring no knowledge not in the documentation, and a deployment procedure producing consistent results regardless of who runs it. Twenty-one days to first merged PR means the platform is not doing any of this and the new hire is getting orientation from senior engineers instead.

What four metrics should a platform team track quarterly to prove it is improving?

Deploy consistency (standard deviation of deploy times for the same service across the team), onboarding time (days to first merged PR per new engineer), incident documentation rate (percentage of incidents closed with a runbook review), and platform adoption under pressure (percentage of deployments using the canonical path during high-pressure release periods). These four together give leadership a data story — not anecdotes — about whether the platform investment is producing measurable improvement.


The Foundations Assessment includes a structured diagnostic of all five patterns across the Foundations Framework maturity model. The free Platform Score gives you a radar view of where your platform sits across five dimensions in fifteen minutes.

Matías Caniglia

Mat Caniglia

Founder of Clouditive. 18+ years transforming engineering organizations across LATAM and globally through Developer Experience consulting.
