DevOps · 16 min read · December 20, 2024

DORA Metrics: What They Are, What They Miss, and How to Use Them Well

DORA metrics are the best available measure of software delivery health. Here's how to implement them without turning them into a performance management trap.

The most common mistake I see with DORA metrics is not failing to implement them. It's implementing them correctly and using them wrong.

A VP of Engineering shows their board a chart demonstrating that deployment frequency has increased from 8 per month to 45 per month. The board is impressed. The engineering team is exhausted. Change failure rate is 22% and climbing. Nobody brought that chart to the board meeting.

DORA metrics are a navigation tool. Like any navigation tool, they're useful when you understand what they're measuring and dangerous when you treat them as a scorecard.

What the Four Metrics Actually Measure

The DORA research team at Google identified four metrics that, taken together, give a strong signal about an engineering organization's delivery health. They're worth understanding precisely rather than just naming.

Deployment frequency is how often you successfully release to production. The word "successfully" matters: a deployment that breaks production shouldn't count. High performers deploy on demand, multiple times per day. Low performers deploy monthly or less frequently. The metric tells you about the size of your change batches and the maturity of your deployment process.

Lead time for changes is the time from a code commit to that code running in production. This captures everything in the delivery pipeline: review time, build time, test time, deployment approvals. Long lead times indicate bottlenecks somewhere in the pipeline. Short lead times indicate a smooth, automated flow from code to production. Elite performers measure this in under an hour.

Change failure rate is the percentage of deployments that result in degraded service or require remediation. This is the check on deployment frequency: you can deploy constantly if you don't care about breaking things. A high failure rate alongside high deployment frequency is not a good sign. Elite performers have failure rates under 5%.

Mean time to restore (MTTR) is how long it takes to recover from a failure. This measures your observability maturity, the quality of your runbooks, and the effectiveness of your on-call process. Teams with good observability and practiced response processes recover in under an hour. Teams without these capabilities can spend days on incidents that should take minutes.
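Once the four definitions above are settled, each metric is a one-line aggregation over a deployment log. The sketch below is illustrative: the records, the field layout, and the choice of median lead time and mean restore time are assumptions for the example, not part of the DORA framework itself.

```python
from datetime import datetime
from statistics import mean, median

# Hypothetical deployment log: (commit_time, deploy_time, failed, restore_minutes).
# Values are invented for illustration.
deployments = [
    (datetime(2024, 11, 1, 9),  datetime(2024, 11, 1, 13), False, None),
    (datetime(2024, 11, 2, 10), datetime(2024, 11, 2, 11), True,  45),
    (datetime(2024, 11, 4, 8),  datetime(2024, 11, 4, 20), False, None),
    (datetime(2024, 11, 5, 9),  datetime(2024, 11, 5, 10), False, None),
]
period_days = 30

# Deployment frequency: successful releases per day over the period.
successful = [d for d in deployments if not d[2]]
deploy_frequency = len(successful) / period_days

# Lead time for changes: median hours from commit to running in production.
median_lead_hours = median((deploy - commit).total_seconds() / 3600
                           for commit, deploy, _, _ in deployments)

# Change failure rate: share of deployments that required remediation.
change_failure_rate = sum(1 for d in deployments if d[2]) / len(deployments)

# Time to restore: mean minutes across the failed deployments.
restores = [d[3] for d in deployments if d[3] is not None]
mttr_minutes = mean(restores) if restores else 0.0

print(deploy_frequency, median_lead_hours, change_failure_rate, mttr_minutes)
```

In practice the log would come from the CI/CD and incident systems; the point is that the hard part is the definitions, not the arithmetic.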

The Correlation That Matters Most

The DORA research consistently shows that these four metrics are correlated with each other in a specific way: the organizations that are best at deployment frequency also tend to have the lowest change failure rates. This is counter-intuitive but well-established.

The reason is that small, frequent deployments are inherently less risky than large, infrequent ones. A deployment that contains two changed files is far easier to roll back, debug, and understand than one that contains two months of accumulated changes across dozens of services. The fear that drives organizations toward large-batch deployments ("we can't deploy more often because we might break something") is precisely backwards. The large batch is what creates the fragility.

This is one of the most important insights in software delivery research and one of the most commonly misunderstood. If your team is deploying quarterly because it's "too risky" to deploy more often, the deployment process is the risk, not the deployment frequency.

The 2025 Benchmark Update

The DORA 2025 report updated the performance tier benchmarks that engineering leaders use to contextualize their metrics. The elite tier continues to pull away from the median in all four metrics, which is consistent with the compound improvement pattern observable in previous years' data.

Elite performers in 2025 deploy multiple times per day with change failure rates below 5% and restoration times under 30 minutes. The median organization deploys one to four times per month, has change failure rates between 15% and 45%, and takes between one hour and one week to restore service after an incident.
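As a rough illustration, the thresholds quoted above can be encoded as a lookup. Only the elite cutoff (under 5%) and the median band (15% to 45%) come from the figures as summarized here; the boundary between the other tiers is an assumption made for the sketch.

```python
def classify_change_failure_rate(rate: float) -> str:
    """Rough performance tier for a change failure rate.

    The elite (<5%) and median (15-45%) bands follow the figures quoted
    in the text; the other cut points are illustrative assumptions.
    """
    if rate < 0.05:
        return "elite"
    if rate < 0.15:
        return "high"
    if rate <= 0.45:
        return "medium"
    return "low"

print(classify_change_failure_rate(0.22))  # the 22% from the opening anecdote
```

The VP's team from the opening anecdote lands squarely in the median band despite its impressive deployment-frequency chart.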

The gap between elite and median has widened consistently since 2018. The organizations that made systematic investments in delivery practices starting several years ago are now in a compounding return phase: each improvement makes the next improvement easier. The organizations that deferred those investments face a widening capability gap that is increasingly expensive to close.

The benchmarks are context for understanding relative position, not targets. An organization that has never measured these metrics has more urgent work to do than calibrating itself against industry percentiles. The first priority is establishing an honest baseline. The second is identifying which metric is most constrained. The benchmark data is useful for calibrating expectations once those two things are done.

The Implementation Trap

Most DORA implementations fail in a predictable way. Leadership discovers the metrics, mandates that teams track and improve them, and creates a dashboard. Teams respond by optimizing for the metrics rather than the underlying behaviors the metrics are supposed to measure.

Deployment frequency improves because teams start deploying more often, including meaningless changes that have no user impact, to inflate the count. Change failure rate appears to improve because teams adopt a narrow definition of "failure" that excludes incidents they'd rather not report. The metrics look better. The delivery capability is unchanged.

This is Goodhart's Law in practice: when a measure becomes a target, it ceases to be a good measure.

The way to avoid this is to be explicit about what the metrics are for. DORA metrics should be used to identify constraints and guide investment decisions. They are not a performance management tool for individual engineers or teams. When engineers fear that their metrics will be used against them, they will optimize for appearance over reality. When they understand that the metrics are tools for identifying where to invest, they tend to report honestly and engage with improvement efforts.

The distinction is not subtle. It requires explicit communication from engineering leadership about how the data will and won't be used, and consistent behavior that matches the stated intent.

The Definition Problem

Before the implementation trap, there is a more fundamental problem: most organizations have not established clear, agreed definitions of what each metric means in their specific context.

What counts as a "deployment"? Is it a merge to main? A release to the production environment? An update to a specific service? Organizations with many services and multiple deployment targets often find that different teams count "deployments" differently, making the aggregate metric meaningless.

What counts as a "failure"? Is it any production incident? Only P0 and P1 incidents? Incidents that required a rollback? An incident that caused customer-visible degradation but was resolved with a forward fix? The definition chosen changes the metric significantly.

What counts as the start of "lead time"? The commit that introduced the change? The point when the pull request was opened? The point when a ticket was created?

These definitional questions should be settled before baseline measurement begins. They do not have universally correct answers. The right answer is the one that is applied consistently across the organization and that measures the thing you actually care about. A team that has perfect metric clarity but a slightly imperfect definition will improve faster than a team with perfect theoretical knowledge but inconsistent measurement.
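One lightweight way to make these choices explicit is to record them as data, versioned alongside the measurement tooling, so every team and every dashboard reads from the same definitions. The structure and example answers below are hypothetical, not recommendations.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DoraDefinitions:
    """One organization's agreed metric definitions, written down and versioned."""
    deployment_event: str   # what counts as a deployment
    failure_scope: str      # what counts as a failure
    lead_time_start: str    # where the lead-time clock starts
    lead_time_end: str = "change running in production"

# Example choices -- the point is that they are explicit and shared,
# not that these particular answers are right for every organization.
DEFINITIONS = DoraDefinitions(
    deployment_event="release to the production environment",
    failure_scope="any incident requiring rollback or hotfix",
    lead_time_start="commit that introduced the change",
)

print(DEFINITIONS.lead_time_start)
```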

Starting From Where You Are

Organizations that are new to DORA measurement often feel pressure to establish baseline numbers quickly and then show rapid improvement. This pressure produces the metric gaming described above.

A more durable approach is to focus the first 90 days entirely on measurement accuracy. Before you try to improve your deployment frequency, do you have a reliable definition of "deployment" that's applied consistently? Before you try to improve MTTR, do you have consistent incident tracking that captures all production degradations, not just the severe ones?

Getting the measurement right is unsexy and undervalued. But improvement against inaccurate baselines is not improvement. It's theater. The teams that make real, sustained progress on DORA metrics are the ones that invested in measurement discipline before they invested in metric improvement.

After 90 days of honest measurement, the constraints typically become obvious. The lead time metric usually reveals the slowest step in the pipeline. The change failure rate usually reveals which services or which parts of the codebase are fragile. MTTR reveals gaps in observability or runbook quality. Each of these is an investment decision, not an indictment.
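Finding the lead-time constraint is mostly arithmetic once the measurement is honest: break the pipeline into stages and see where the hours go. The stage names and durations below are invented for the example.

```python
# Hypothetical per-stage averages, in hours, for one delivery pipeline.
stage_hours = {
    "code review": 26.0,
    "ci build": 0.7,
    "test suite": 2.5,
    "deploy approval": 18.0,
    "deploy": 0.3,
}

total = sum(stage_hours.values())
bottleneck = max(stage_hours, key=stage_hours.get)

print(f"lead time: {total:.1f}h; largest stage: {bottleneck} "
      f"({stage_hours[bottleneck] / total:.0%} of the total)")
```

In this sketch, review and approval dominate; no amount of build optimization would move the lead-time metric much, which is exactly the kind of investment decision the data is for.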

Adding the Fifth Metric: Reliability

The 2021 DORA report introduced a fifth metric, reliability, defined as the degree to which teams met or exceeded their service-level objectives. This metric has been incorporated into the framework alongside the original four in subsequent research.

Reliability as a DORA metric captures something the original four do not: the quality of the output, not just the speed and stability of the delivery process. A team can have excellent deployment frequency, low lead time, low change failure rate, and fast MTTR while still consistently failing to meet the performance and availability expectations of their users.

The practical implementation of this metric requires that teams have defined service level objectives in the first place. Many do not. The exercise of defining what "reliable enough" means for each service, and then measuring adherence to that definition, is valuable independently of the DORA framework. It forces explicit conversations about what the engineering organization is optimizing for and creates accountability to outcomes rather than activities.
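Measured this way, reliability reduces to SLO adherence: of the services with defined objectives, how many met them over the review window? The service names and availability numbers below are invented for the sketch.

```python
# Hypothetical availability SLOs and measured availability for one quarter.
services = {
    "checkout":  {"slo": 0.999, "measured": 0.9994},
    "search":    {"slo": 0.995, "measured": 0.9930},
    "reporting": {"slo": 0.990, "measured": 0.9985},
}

met = sorted(name for name, s in services.items() if s["measured"] >= s["slo"])
adherence = len(met) / len(services)

print(f"{adherence:.0%} of services met their SLO: {met}")
```

A team that cannot populate a table like this has found its first piece of reliability work: defining the objectives.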

Using the Benchmarks Honestly

DORA publishes annual benchmark data that segments organizations into four performance tiers: low, medium, high, and elite. These benchmarks are useful context but not goals. Your goal is not to reach the elite tier by a particular date. Your goal is to remove the constraint that is currently most limiting your team's ability to deliver reliably.

If your deployment frequency is low but your change failure rate is already good, the investment priority is deploying more confidently. If your change failure rate is high, the priority is test coverage and reliability improvements before you speed anything up. If your MTTR is high, the priority is observability and incident response before you try to deploy more frequently.

The metrics work together. Improving them in the wrong sequence can make things worse: pushing faster deployments into a system with poor observability and a high failure rate is not progress.

The Organizational Conversation That DORA Data Changes

One of the most underappreciated benefits of consistent DORA measurement is what it does to the conversations between engineering leadership and business leadership.

Without delivery metrics, engineering is a black box to the business. Features go in, products come out, and the quality of the delivery process is assessed based on whether deadlines are hit and incidents occur. This creates an evaluation framework that systematically undervalues investment in engineering infrastructure, because the benefits of that investment are not visible in the language leadership uses to evaluate engineering work.

DORA metrics create a shared language. When the CTO can show the CFO a chart demonstrating that lead time decreased from 18 days to 3 days over 12 months, and can connect that improvement to the reduction in hotfix work that was consuming 40% of engineering capacity, the investment in CI and test reliability that drove the improvement becomes visible as a business investment with a return. The conversation shifts from "why does engineering need more budget?" to "here is what the last investment produced and what the next investment would produce."

This conversation is easier to have when the measurement discipline exists. It is almost impossible to have productively when the engineering organization is reporting activities rather than outcomes. The 47-page quarterly engineering report that describes everything the team built but contains no information about the health of the delivery system is not informing business decisions. It is creating the appearance of transparency without the substance.

The Cadence for Improvement

The organizations that improve DORA metrics consistently tend to operate on a specific cadence: measure weekly, review monthly, set targets quarterly, assess strategy annually.

Weekly measurement provides the data granularity needed to see the effect of specific changes. When a new CI configuration is deployed, does the lead time metric move in the following week? When a flaky test is fixed, does the change failure rate improve? These questions are only answerable with weekly-resolution data.

Monthly reviews allow the patterns to emerge from the weekly noise. A single data point is often not meaningful. Four weekly points can show a trend. The monthly review is where the team looks at the trend and decides whether an improvement is working or needs adjustment.
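The weekly-to-monthly step can be as simple as a least-squares slope over the four weekly points. The readings below are invented; the point is that four points support a trend judgment where a single point does not.

```python
# Four hypothetical weekly change-failure-rate readings for one month.
weekly_cfr = [0.22, 0.19, 0.20, 0.16]

n = len(weekly_cfr)
mean_x = (n - 1) / 2
mean_y = sum(weekly_cfr) / n

# Least-squares slope: change in failure rate per week.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in enumerate(weekly_cfr))
         / sum((x - mean_x) ** 2 for x in range(n)))

trend = "improving" if slope < 0 else "flat or worsening"
print(f"slope per week: {slope:+.3f} ({trend})")
```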

Quarterly targets provide the medium-term direction that daily work connects to. "We want to reach a change failure rate below 8% by end of Q3" gives the team a specific, achievable milestone that is close enough to feel motivating but distant enough to require sustained effort.

Annual strategy assessment is where the organization looks at the full-year picture, benchmarks against industry data, and decides where the next major investment in delivery capability should go. Is the constraint now in deployment frequency? In observability? In test coverage? The annual assessment answers this question with a year of data rather than with a moment-in-time impression.

This cadence produces consistent improvement without the organizational overhead of transformation programs. It becomes part of how engineering leadership operates rather than a special initiative that competes for attention with other priorities.

The Team-Level vs. Org-Level Measurement Question

A recurring tension in DORA implementation is whether to measure at the team level or the organizational level. Both have value and significant limitations.

Org-level aggregation hides variation. An organization where two elite teams and five low-performing teams average to a "medium" performer classification is not a medium performer. It has a structural problem where most teams are low-performing and two teams are carrying disproportionate load. The aggregate metric obscures this and prevents targeted intervention.
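The two-elite, five-struggling scenario is easy to see in numbers. With hypothetical per-team change failure rates, the organizational mean lands in the median band even though almost no individual team is there; the spread, not the average, carries the signal.

```python
from statistics import mean, pstdev

# Hypothetical per-team change failure rates: two strong teams, five struggling.
team_cfr = {"A": 0.04, "B": 0.05, "C": 0.30, "D": 0.28,
            "E": 0.35, "F": 0.32, "G": 0.29}

org_average = mean(team_cfr.values())      # looks like a "medium" performer
spread = pstdev(team_cfr.values())         # reveals the bimodal reality

print(f"org average: {org_average:.2f}, spread: {spread:.2f}")
```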

Team-level measurement surfaces this variation but creates a different risk: competitive dynamics between teams that undermine collaboration. When teams know their DORA metrics will be compared to other teams, they may make decisions that optimize their individual metrics at the cost of cross-team outcomes. A team that avoids taking on risky integrations with other teams' services is protecting its change failure rate at the cost of the overall product quality.

The resolution that works best is team-level measurement used for diagnostic and investment purposes, not for comparative ranking. The question "why does Team A have a change failure rate 3x higher than Team B?" should produce an investment decision to address Team A's specific constraint, not a ranking that implies Team B is doing something admirable that Team A is failing to replicate. The constraint that produces Team A's high failure rate may be completely different from anything Team B has addressed, and the intervention needs to match the constraint.

Leadership that consistently uses DORA data to direct investment rather than to rank teams builds the trust that allows teams to report honestly. That honest reporting is what makes the data useful for decision-making. Lose the honesty and you lose the data's value, regardless of how sophisticated the measurement infrastructure becomes.


If you want help establishing a reliable DORA baseline and identifying your highest-leverage improvement, a DevOps assessment produces specific findings in two to three weeks.

DORA Metrics · Engineering Metrics · DevOps · Engineering Performance

Matías Caniglia

Founder of Clouditive. 18+ years transforming engineering organizations across LATAM and globally through Developer Experience consulting.
