Our Engineering Was Broken. Here's What Actually Fixed It.

TL;DR. A 60-person fintech engineering organization was running at 1.5 deployments per week, 18-day lead times, a 28 percent change failure rate, and a 4-hour mean time to restore. The first things we tried (new documentation, RFC templates, all-hands communication) did not work. What worked was changing the environment rather than asking people to change behavior. Three shifts: automate the validation gates so the safe path becomes the easy path, introduce blameless incident reviews so the information actually surfaces, and give leadership a shared dashboard so the conversation changes from "why aren't you shipping faster?" to "what does this team need?"

The deployment took six hours. Not because it was complex. Because nobody could agree on who was responsible for pressing the button.

That moment happened during a consulting engagement with a fintech company that had grown from 12 engineers to 60 in about 18 months. By most external measures, the engineering organization looked healthy. They had a modern stack, reasonable test coverage, and a team full of smart people. But something was profoundly wrong, and everyone could feel it without being able to name it precisely enough to address it.

This is a story about what we found, what we tried, and what actually worked. It is not a story about brilliant innovation. It is a story about the gap between the process that exists on paper and the one that exists in practice, and what happens when you close that gap systematically starting from the highest-friction points.

Nine people, nine different diagnoses — and one root cause underneath all of them

Before we started looking under the hood, I asked the engineering leadership team to describe the problem in one sentence. I got nine different answers. Five of them are representative.

The CTO said: "We are not shipping fast enough." A senior engineer said: "We keep breaking things in production." The Head of Product said: "Engineering cannot commit to a date and hold it." An engineering manager said: "The team is burning out." Another senior engineer said: "Nobody knows how anything works."

All of these answers were correct, but they were describing the same root cause from different vantage points. The real sentence was: "We have no shared understanding of how work gets from idea to production."

Every team had its own informal process. Code review standards varied wildly between squads. Some teams ran comprehensive integration tests before merging. Others pushed to main and hoped for the best. Deployment was a manual process documented in a Notion page that was two years out of date and that nobody trusted enough to follow. On-call rotations existed on paper but were not taken seriously until something broke, and then whoever happened to be available became de facto on-call regardless of the rotation schedule.

None of these things were decisions. They were defaults. Nobody had explicitly chosen this system. It had accumulated through a growth phase where moving fast was the priority and process was the thing you added later. Later had arrived, but the process had not.

Measuring the actual state: what the DORA numbers revealed about how bad it was

Before we could address the problems, we needed to establish a baseline. The organization had no DORA metrics. They had a sense that deployments were slow and unreliable, but no measurement of how slow or how unreliable. They had a sense that incidents were frequent, but no systematic count of incidents or measurement of how long they took to resolve.

The first thing we did was instrument the delivery pipeline to measure the four DORA metrics. What we found was consistent with what the team described, but the specific numbers were worse than anyone had estimated. Deployment frequency was about 1.5 deployments per week across the entire engineering organization. Lead time was averaging 18 days from commit to production. Change failure rate was approximately 28 percent, meaning more than one in four deployments caused a degraded service or required a hotfix. Mean time to restore was about 4 hours when incidents occurred.
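
To make the measurement concrete, here is a minimal sketch of how the four numbers can be derived from a list of deployment records. The `Deployment` record, its field names, and the sample data are illustrative assumptions, not the instrumentation used in the engagement; in particular, time to restore is approximated here as the gap between the failed deployment and the restore timestamp.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Optional


@dataclass
class Deployment:
    # Hypothetical record shape; a real pipeline would populate this from CI/CD events.
    committed_at: datetime                  # first commit of the change
    deployed_at: datetime                   # when the change reached production
    failed: bool = False                    # degraded service or required a hotfix
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_metrics(deployments: list[Deployment], window_days: int) -> dict[str, float]:
    """Compute the four DORA metrics over an observation window of window_days."""
    if not deployments:
        raise ValueError("need at least one deployment to measure")
    failures = [d for d in deployments if d.failed]
    lead_times = [(d.deployed_at - d.committed_at) / timedelta(days=1) for d in deployments]
    # Assumption: the incident starts at deploy time, so restore time is measured from there.
    restore_times = [
        (d.restored_at - d.deployed_at) / timedelta(hours=1)
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "deploys_per_week": len(deployments) / (window_days / 7),
        "lead_time_days": mean(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        "mean_time_to_restore_hours": mean(restore_times) if restore_times else 0.0,
    }


# Made-up sample: three deployments observed over a two-week window.
start = datetime(2024, 1, 1)
sample = [
    Deployment(start, start + timedelta(days=18)),
    Deployment(start, start + timedelta(days=20), failed=True,
               restored_at=start + timedelta(days=20, hours=4)),
    Deployment(start + timedelta(days=5), start + timedelta(days=21)),
]
metrics = dora_metrics(sample, window_days=14)
```

The point of keeping the calculation this plain is that anyone in the organization can check it. A baseline that people do not trust does not change any conversations.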

These numbers were not catastrophically bad for a 60-person organization, but they were in the "low performer" band of the DORA research. For a fintech company where customer trust was directly connected to system reliability, they were creating real business risk.

The measurement step also produced a finding that was not in the DORA metrics: the subjective experience of the engineering team was significantly worse than the objective metrics suggested. Engineers who were producing 1.5 deployments per week collectively felt like they were barely shipping anything. The discrepancy was partly because those 1.5 deployments were often high-drama events that consumed disproportionate team energy. A deployment that takes six hours to coordinate and involves three people arguing creates a much worse experience than three quick automated deployments, even if the total work output is similar.

What we tried first — and why adding process to a structural problem does not work

The first instinct, both from leadership and from our team, was to add process. We drafted a new RFC template for architectural decisions. We updated the deployment documentation with accurate current steps. We scheduled a company-wide engineering all-hands to align on standards and communicate the commitment to change.

It went about as well as you would expect.

The RFC template got used twice and then abandoned. The documentation was accurate for about three weeks before teams started deviating from it again. The all-hands created a lot of nodding and zero durable behavior change. The engineers in the room had heard versions of this talk before and were appropriately skeptical that this time would be different.

The problem was not that people did not know what good engineering practice looked like. They did. The problem was that following consistent practice had no structural support in the environment. The system made it easier to take shortcuts than to follow the right process. The right process involved more steps, more coordination, and more time in the near term, even though the long-term return was better. In an organization under delivery pressure, the near-term cost always wins unless the environment makes the right choice the easy choice.

This is a failure mode I see constantly. Organizations treat an engineering culture problem as an information problem, so they respond with documentation and training. But you cannot document your way out of a structural deficit. The engineers know what to do. The environment makes it difficult to do it. Documentation describing the correct process is not useful in an environment that does not support the process.

The three changes that actually worked — environment over behavior, blame-free reviews, visible data

The first real shift came when we stopped asking people to change their behavior and started changing the environment instead. The principle was simple: make the safe path the easy path, and make the unsafe path require deliberate effort rather than the other way around.

We built a deployment pipeline that enforced the standards automatically. Instead of a manual checklist that required someone to remember 14 steps in sequence and trust that each one had been completed, we automated the validation gates. You could not deploy without passing the test suite. You could not merge without a review from someone outside your squad. The process became invisible infrastructure rather than visible bureaucracy. Nobody was asking engineers to change their behavior. The environment had changed.
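
The two gates named above can be sketched as plain checks that a pipeline runs before it allows a merge or a deployment. This is an illustration of the idea under assumed names (`ChangeRequest`, `merge_blockers`, `deploy_blockers`), not the pipeline that was actually built; a real setup would express the same rules in the CI system's own configuration.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeRequest:
    # Hypothetical shape of a change as the pipeline sees it.
    author_squad: str
    reviewer_squads: list[str] = field(default_factory=list)  # squads of approving reviewers
    tests_passed: bool = False


def merge_blockers(change: ChangeRequest) -> list[str]:
    """Return the reasons a merge is blocked; an empty list means it may proceed."""
    blockers = []
    if not any(squad != change.author_squad for squad in change.reviewer_squads):
        blockers.append("needs an approving review from outside the author's squad")
    return blockers


def deploy_blockers(change: ChangeRequest) -> list[str]:
    """A deployment must satisfy the merge gate and also have a passing test suite."""
    blockers = merge_blockers(change)
    if not change.tests_passed:
        blockers.append("test suite has not passed")
    return blockers
```

Returning the reasons rather than a bare yes or no matters: an engineer who is blocked sees exactly which step is missing, which is what keeps the gate from feeling like bureaucracy.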

The impact was visible within two weeks. Deployment frequency increased because the process was simpler and more reliable. Change failure rate began decreasing because the automated gates were catching the categories of errors that had been slipping through the manual checklist. Most importantly, the deployment stopped being a six-hour coordination exercise and became a 45-minute process that mostly ran itself.

The second change was about accountability without blame. We introduced a lightweight incident review process, but we changed the framing entirely. The question was never "who caused this?" The question was always "what does this failure reveal about our system?" The distinction seems semantic but it is fundamental. The first question produces defensiveness. The second produces investigation.

Within a month of introducing blameless postmortems, engineers were participating openly in incident reviews in a way they had not before. The insights started coming out: the alerts were too noisy. The runbooks were outdated. A specific service had been fragile for a year and everyone knew it but nobody had prioritized fixing it because it had not caused a big incident yet. This kind of information is available in every engineering organization, but it only surfaces when the people who have it feel safe sharing it.

The third change was making the work visible to leadership in a way that created accountability upward rather than only downward. We built a simple dashboard that showed deployment frequency, lead time, and change failure rate for each squad. Not to punish low performers, but to give leadership a shared language for having honest conversations about capacity, technical debt, and prioritization.

When the CTO could see that Squad B had a change failure rate of 28 percent and Squad A had 4 percent, the conversation changed from "why are you not shipping faster?" to "what does Squad B need to become reliable?" That is a fundamentally different conversation. The first assumes that engineers need to work harder or smarter. The second assumes that the environment needs to change. The data shifted the frame.
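
A per-squad view of this kind needs very little machinery. The sketch below groups deployment outcomes by squad and lists the highest change failure rate first. The function name, the input shape, and the counts are assumptions, with the counts chosen only to reproduce the 4 and 28 percent figures above.

```python
from collections import defaultdict


def change_failure_rate_by_squad(outcomes):
    """outcomes: iterable of (squad, failed) pairs. Returns (squad, rate) pairs, worst first."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for squad, failed in outcomes:
        totals[squad] += 1
        if failed:
            failures[squad] += 1
    rates = {squad: failures[squad] / totals[squad] for squad in totals}
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


# Made-up counts: 25 deployments per squad, 1 failure for Squad A and 7 for Squad B.
outcomes = (
    [("Squad A", False)] * 24 + [("Squad A", True)]
    + [("Squad B", False)] * 18 + [("Squad B", True)] * 7
)
dashboard_rows = change_failure_rate_by_squad(outcomes)
```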

Six months later: the numbers and what they felt like inside the team

The deployment that used to take six hours with three people arguing takes 45 minutes and runs mostly automated. Deployment frequency went from approximately 1.5 per week across the organization to three or four per day. Change failure rate dropped from around 28 percent to under 6 percent. Mean time to restore dropped from 4 hours to under 40 minutes. Lead time dropped from 18 days to 3 days.

More importantly, the engineers on the team stopped describing their job as "fighting fires" and started describing it as "building things." That shift in language is not cosmetic. It reflects a real change in how they experience their work, and the survey data confirmed it: developer satisfaction scores increased significantly over the same period.

The business outcomes followed. Fewer incidents meant less time spent in crisis response. Faster deployment frequency meant features reached customers sooner and feedback cycles shortened. Lower change failure rate meant fewer emergency hotfixes interrupting planned work. The engineering organization that had been a source of frequent executive concern became a source of competitive advantage.

None of this required brilliant innovation. It required being honest about the gap between the process that existed on paper and the one that existed in practice, and then systematically closing that gap starting with the highest-friction points. The organizations that fix their engineering culture do not do it with inspiration. They do it with infrastructure.

The invisible cost of the broken state — what it was actually costing per year before anyone calculated it

One calculation that engineering leadership rarely makes explicit is the total cost of the broken state. The fintech company I described was paying for that broken engineering organization in several ways that were not visible as line items.

Engineers were spending roughly 30 percent of their time on unplanned work: responding to incidents, debugging production issues that should have been caught in testing, waiting for deployment processes to complete, and reworking changes that had failed. At a 60-person engineering organization with an average fully-loaded cost of $180,000 per year, that 30 percent waste was costing approximately $3.2 million annually before anyone calculated it.
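
The arithmetic behind that figure is short enough to check directly. The function name below is an illustrative assumption, and the share is given as a whole percentage so the result stays an exact integer.

```python
def unplanned_work_cost(engineers: int, fully_loaded_cost: int, unplanned_percent: int) -> int:
    """Annual cost of time spent on unplanned work, in the same currency as the input."""
    return engineers * fully_loaded_cost * unplanned_percent // 100


# Figures from the text: 60 engineers, $180,000 fully loaded, 30 percent unplanned work.
annual_cost = unplanned_work_cost(60, 180_000, 30)
print(f"${annual_cost:,}")  # $3,240,000
```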

Attrition was elevated. Three senior engineers had left in the previous year. The recruiting cost, onboarding time, and productivity ramp for their replacements were never tallied against the cost of the environment that drove them out. A rough estimate: $400,000 in direct and indirect costs for three departures.

Product velocity was lower than it needed to be. Lead times of 18 days meant that customer feedback took nearly three weeks to act on. Competitors with faster delivery cycles were making product adjustments in two to three days. The revenue impact of that velocity gap is hard to calculate precisely but is real and compounding.

The total cost of the broken state was not visible in any single budget line. It was distributed across engineering time, recruiting, and product outcomes in a way that made it easy to ignore. The measurement work we did to establish a baseline was partly technical and partly financial: making the cost of the current state legible so that the investment in improvement had a clear business case rather than just an engineering quality case.

What the first 90 days look like when you start with measurement rather than solutions

For engineering leaders who want to apply the approach described here, the first 90 days are the most important and also the easiest to template. Not because there is a universal solution, but because the diagnostic work follows a consistent structure regardless of the organization.

The first two weeks are measurement only. Deploy instrumentation for DORA metrics if it does not exist. Run short, structured conversations with engineers at every level about their biggest daily frustrations. Do not propose solutions yet. Do not start any improvement projects yet. The temptation to jump to solutions before completing the diagnostic is the most common reason improvement programs start with the wrong thing.

Weeks three through six are analysis and prioritization. Identify the highest-frequency friction points from the engineer conversations. Cross-reference with the DORA metrics to understand which friction points are producing quantifiable delivery impact. Identify the two or three changes that would have the broadest positive impact and are achievable within the first quarter.

Weeks seven through twelve are execution on those two or three changes. Not a dozen workstreams. Two or three targeted improvements, executed well, with measurement before and after to demonstrate impact. This narrow execution focus is what produces the visible wins that change organizational momentum and build the credibility for the next round of investment.

The organizations that follow this structure tend to have demonstrable, measurable improvements at the 90-day mark. Those improvements are the foundation for the next 90 days. After the first year, the improvement program tends to sustain itself, because by then the culture of measurement and systematic improvement has been established.

The pattern that repeats — and the single practice that prevents it from accumulating

The fintech company described at the beginning is not unusual. The specific circumstances were particular to their company, but the underlying pattern appears in engineering organizations across industries and organization sizes: processes that were never explicitly designed for the current state of the organization, measurement that does not exist or is not trusted, and organizational will that is present but misdirected.

The organizations that never reach the breaking point are not those that avoid making the mistakes. They are those that detect the drift earlier, through consistent measurement, and correct it before the accumulation becomes a crisis. A team that measures DORA metrics monthly and sees deployment frequency declining will address the cause long before it reaches the state where a deployment requires three people and six hours. The same problem, detected early with measurement, costs a few engineering hours to address. Detected late, after it has become the organizational norm, it costs months of transformation work.
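
One simple way to turn a monthly review into an early warning is to count consecutive month-over-month declines in a metric. This is a sketch under assumptions: the function names are hypothetical, and the three-month threshold is an arbitrary starting point rather than a recommendation from the DORA research.

```python
def declining_streak(monthly_values: list[float]) -> int:
    """Count consecutive month-over-month declines ending at the most recent month."""
    streak = 0
    # Walk backwards through (previous month, current month) pairs.
    pairs = zip(reversed(monthly_values[:-1]), reversed(monthly_values[1:]))
    for previous, current in pairs:
        if current < previous:
            streak += 1
        else:
            break
    return streak


def drift_alert(monthly_values: list[float], min_streak: int = 3) -> bool:
    """Flag a metric once it has declined for min_streak months in a row (assumed threshold)."""
    return declining_streak(monthly_values) >= min_streak


# Made-up monthly deployment counts, oldest first.
deploys_per_month = [20, 22, 19, 17, 14]
needs_attention = drift_alert(deploys_per_month)
```

A check like this will produce false alarms, such as a quiet month around holidays. That is acceptable: its job is to prompt a conversation while the fix is still cheap, not to diagnose the cause.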

The single highest-leverage practice for preventing the pattern from developing is establishing honest measurement before it becomes urgent. Not because the measurement will catch every drift, but because the discipline of looking at the data regularly creates a culture where the actual state of the delivery system is known rather than assumed. Organizations that know what is true about their engineering systems make better decisions than those that operate on impressions and anecdotes.

If you are reading this and recognizing your organization in any of the patterns described, the most valuable next step is not a transformation program. It is an honest answer to four questions: how often do you deploy, how long does it take from commit to production, what percentage of deployments cause problems, and how long does it take to recover when they do? Those four answers will tell you more about the state of your engineering organization than any assessment, and they will point directly to where to invest next.

Frequently asked questions

Why did adding documentation and process not fix the engineering problems?

Because the problem was not that engineers did not know what good practice looked like. They did. The problem was that the environment made it easier to take shortcuts than to follow the right process. Documentation describes correct behavior. It does not change the incentive structure that makes shortcuts the path of least resistance. The environment needs to change, not just the documentation. When automated validation gates replace manual checklists, the safe path becomes the easy path and behavior changes without anyone needing to decide to change it.

What is a blameless postmortem and why does it produce different results?

A blameless postmortem asks "what does this failure reveal about our system?" rather than "who caused this?" The blame question produces defensiveness and self-protection. The system question produces investigation. Engineers participate openly in reviews when the framing is learning rather than blame. The information that surfaces (noisy alerts, outdated runbooks, fragile services that everyone knew were fragile) is always available in the organization. It only surfaces when people feel safe sharing it. The framing change is what unlocks the information.

What DORA metrics should a 60-person engineering organization expect to hit?

The DORA research benchmarks are: elite performers deploy multiple times per day with change failure rates under 5 percent and mean time to restore under one hour. The fintech organization described in this post was at 1.5 deployments per week, 28 percent change failure rate, and 4-hour MTTR — in the "low performer" band. After six months of the changes described, they reached 3-4 deployments per day, under 6 percent change failure rate, and under 40 minutes MTTR. That trajectory — from low to high performer in six months — is achievable when the root causes are structural rather than technical.

What should the first 90 days of an engineering improvement program look like?

The first two weeks are measurement only — instrument DORA metrics, run structured conversations with engineers at every level about daily friction. No solutions yet. Weeks three through six are analysis: identify the highest-frequency friction points, cross-reference with the metrics, pick two or three changes with the broadest impact achievable in one quarter. Weeks seven through twelve are execution on those two or three changes, with measurement before and after. The organizations that skip the diagnostic phase and start with solutions consistently start with the wrong thing.

If any of this sounds familiar, a Foundations Assessment can help you find where the real gaps are before they cost another six months of drift. The conversation is free and the findings are specific.

For the specific DORA metrics that measure the before and after states described here, read the DORA metrics implementation guide.

For what the structural root causes look like when the platform team itself is the problem, read 5 signs your platform team is stuck in ad-hoc mode.

For the bank version of this story — a 200-person engineering organization that made the same journey in three years — read how a traditional bank transformed its engineering without a big-bang rewrite.
