Why your DORA metrics are lying to you (and how to fix it)
The conversation goes like this. A CTO walks into a quarterly business review with a dashboard. Deployment frequency is up 40 percent. Lead time is down to four days. Change failure rate sits at 8 percent. The numbers look strong. Then a board member asks a follow-up question: is that deployment frequency measured per service, per environment, or per artifact? The CTO does not know. Neither does the VP of Engineering sitting next to them.
The metrics are real. The definitions behind them are not shared. And the three dashboards in the room - Power BI, Jira, Datadog - each report something different.
This is the state of DORA implementation at most engineering organizations. The four keys are measured. They are just not measured the same way twice. The result is data that looks precise and behaves like anecdote. No one trusts it. Investment decisions revert to intuition. The dashboard exists to fill a slide deck, not to inform a decision.
This piece explains why this happens and what it takes to produce metrics that the C-level can actually defend.
Why typical DORA implementations produce numbers that cannot be trusted
The problem is not that teams fail to measure. The problem is that they measure before they define. Four failure modes appear in almost every implementation I have reviewed.
Deployment frequency without a definition of "deployment"
Is a deployment a merge to main? A release tagged in GitHub? A canary rollout to 5 percent of traffic? A full production release? The answer changes the number by an order of magnitude. A platform team running canary releases continuously will report a deployment frequency ten times higher than a team using the same tooling but counting only full releases.
When each team defines "deployment" based on what their tooling captures by default, the aggregate metric is a fiction. You are averaging things that do not mean the same thing. The number goes up on the dashboard and it tells you nothing about delivery health.
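The order-of-magnitude claim is easy to demonstrate. Here is a minimal sketch with a hypothetical one-week pipeline event log; the event names and counts are invented for illustration:

```python
from collections import Counter

# Hypothetical pipeline event log for one week (names and counts are illustrative).
events = (
    ["canary_rollout"] * 45    # continuous canary rollouts
    + ["full_release"] * 4     # full production releases
    + ["merge_to_main"] * 60   # merges that never shipped on their own
)

counts = Counter(events)

# Rule A: every canary rollout and full release counts as a deployment.
rule_a = counts["canary_rollout"] + counts["full_release"]

# Rule B: only full releases count.
rule_b = counts["full_release"]

print(f"Rule A: {rule_a}/week, Rule B: {rule_b}/week")  # Rule A: 49/week, Rule B: 4/week
```

Same tooling, same week, same team: one counting rule reports a deployment frequency more than ten times the other. Averaging Rule A teams with Rule B teams produces a number that describes nothing.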
Lead time measured from commit, not from idea
Most lead time implementations start the clock when a commit is pushed. This is what the tooling makes easy to capture. It is also the wrong starting point. The work in progress that accumulates before the first commit - ticket created, refined, picked up, sat in a queue, reviewed in planning - is invisible. It can represent days or weeks of elapsed time on a change that the metric reports as having a lead time of four hours.
The DORA research defines lead time for changes as the time from code committed to code running in production. That is the narrow definition and it is useful. But the broader question the C-level is asking is how long it takes to go from idea to production. That clock starts at ticket creation. A team that optimizes the pipeline but leaves a five-day refinement queue untouched will report strong lead time numbers on a broken system.
Change failure rate that teams can game
Change failure rate is the percentage of deployments that cause a degraded service or require remediation. The definition sounds objective. The classification is not. If the team gets to decide what counts as a failure - and in most implementations they do - the incentive to classify incidents as non-production issues, planned maintenance, or known limitations is structural. The number goes down. The incidents do not.
This is not bad faith. It is rational behavior when the metric is used as a performance signal for the team reporting it. The team that knows their change failure rate is watched by leadership will classify ambiguous incidents conservatively. The number drifts away from reality without anyone lying.
Mean time to restore ignored because incidents are not formally defined
The most common pattern I see: a platform team reports a mean time to restore of two hours. When I ask how many incidents they logged in the last quarter, the answer is three. When I ask what their SLO breach rate was over the same period, they cannot tell me. The three incidents are the ones someone filed a formal postmortem on. The SLO breaches that were handled in Slack and resolved with a quick fix do not exist in the incident record.
Mean time to restore cannot be measured accurately without a corporate definition of what constitutes an incident. Without that definition, what gets measured is the average recovery time for problems that were bad enough to force formal process. The problems that were handled informally - which are often the majority - disappear from the metric. The number reports a partial truth as if it were the complete picture.
When teams use different variants, comparison is anecdote
Each of the four failure modes above is manageable in isolation. The compounding problem appears when you try to use these metrics across teams. If Team A counts canary releases as deployments and Team B counts full releases, comparing their deployment frequency numbers produces a meaningless ranking. If one team starts lead time at ticket creation and another starts it at commit, the four-day difference in their reported numbers may reflect nothing except definitional disagreement.
The State of Platform Engineering Vol 4 2026 found that 29.6 percent of platform teams do not measure their own success at all. The teams that do measure are often doing so in ways that cannot be reconciled. The result is that leadership operates on averages of incompatible inputs and calls it data-driven decision making.
The cross-team comparison problem is particularly damaging at the C-level. The conversations that matter most - where to invest next quarter, which teams need support, what the engineering organization's delivery health actually is - require metrics that mean the same thing across organizational boundaries. If they do not, every comparison is a political act dressed as analysis.
What Signal Integrity means
Signal Integrity is the second pillar of the Foundations Framework. The operational definition is narrow: a metric has signal integrity when it is reproducible, defensible, and describes the current state of the system rather than the state someone wanted to report.
Reproducible means that two engineers running the same query against the same data produce the same number. This rules out metrics that depend on manual classification or per-team interpretation.
Defensible means the definition survives a follow-up question. "Deployment frequency" is not defensible if the CTO cannot answer what counts as a deployment. The metric is only defensible when the definition is documented, applied consistently, and has been explicitly agreed upon across teams.
Describing current state means the metric captures what is happening, not what the team chose to surface. An incident record that only captures formally filed postmortems does not describe the current state of reliability. It describes the current state of process compliance.
DORA 2025 frames this as a precondition for capturing the upside of AI adoption. Teams that lack signal integrity cannot tell whether AI tooling is improving their delivery health or degrading it. The data does not distinguish between the two. As AI tooling becomes a standard part of the engineering workflow, the cost of unmeasured signal rises.
How to build a metric stack the C-level can defend
Four changes produce metrics worth trusting. They are not technically complex. They require organizational discipline that most teams have not applied.
Define deployment as a pipeline artifact, not a merge event
A deployment is a production artifact created by an automated pipeline and released to a production environment. Canary and full release both count. Merge to main does not count. A hotfix applied directly to production counts. The definition must be captured in the pipeline itself, not inferred from git events, so that the metric is produced by the same system regardless of which team you are measuring.
This definition is specific enough to be consistent and broad enough to cover the deployment patterns most engineering organizations actually use. Any team that cannot instrument it against their current pipeline has a pipeline problem that predates the metric problem.
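One way to capture the definition in the pipeline itself is to have the release step emit a structured deployment event. This is a sketch under stated assumptions: the function name, field names, and release-type vocabulary are illustrative, not a standard schema.

```python
import json
from datetime import datetime, timezone

def deployment_event(service, artifact_sha, environment, release_type):
    """Build a deployment event record, emitted by the pipeline itself so
    every team's metric comes from the same instrumentation.
    Field names are illustrative assumptions, not a standard schema."""
    # Only production releases count as deployments.
    assert environment == "production", "only production releases count"
    # Canary, full, and hotfix all count; merges to main never reach this code path.
    assert release_type in {"canary", "full", "hotfix"}, "unknown release type"
    return {
        "service": service,
        "artifact": artifact_sha,
        "environment": environment,
        "release_type": release_type,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

# A canary rollout counts as a deployment under the shared definition.
event = deployment_event("checkout", "a1b2c3d", "production", "canary")
print(json.dumps(event, indent=2))
```

The design point is that the event is produced by the system that performs the release, not inferred afterwards from git history, so the metric is identical regardless of which team is measured.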
Measure lead time from ticket creation
The DX Core 4 framework (Forsgren, Storey, Zimmermann 2024) distinguishes between code delivery speed and the full change lead time that includes discovery and refinement. Both are worth measuring. The narrow DORA definition (commit to production) belongs in the engineering dashboard. The broader definition (ticket creation to production) belongs in the C-level view.
Measuring from ticket creation requires that tickets exist before work starts, which is a discipline question, not a tooling question. Teams that start coding before a ticket is created cannot retroactively measure this. The measurement creates accountability for the intake process.
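Both clocks can be computed from the same three timestamps. A minimal sketch with hypothetical dates, including a five-day refinement queue of the kind the narrow metric hides:

```python
from datetime import datetime, timedelta

# Hypothetical timestamps for one change, as a ticket system might export them.
ticket_created = datetime.fromisoformat("2026-01-05T09:00:00")
first_commit   = datetime.fromisoformat("2026-01-12T10:00:00")  # after days in a refinement queue
in_production  = datetime.fromisoformat("2026-01-12T14:00:00")

# Narrow DORA definition: commit to production (engineering dashboard).
code_delivery_lead_time = in_production - first_commit   # 4 hours

# Broad definition: ticket creation to production (C-level view).
change_lead_time = in_production - ticket_created        # 7 days, 5 hours

print(code_delivery_lead_time, change_lead_time)
```

The same change reports a four-hour lead time on one dashboard and a seven-day lead time on the other. Both are correct; they answer different questions, and only the broader one answers the question the board is asking.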
Define incident with a corporate SLO breach threshold
An incident is any event that causes a service to breach its defined SLO, regardless of whether a formal postmortem was filed. This definition removes the classification incentive. The metric is determined by the monitoring system, not by the team's assessment of severity.
Implementing this requires that SLOs exist and are actively monitored. Many teams have SLOs on paper that are not wired to alerting or reporting. The metric cannot work without the operational foundation. The good news is that building that foundation is the right work regardless of the metric.
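A sketch of what deriving incidents from monitoring data can look like, assuming a hypothetical error-rate SLO and per-minute samples: consecutive above-threshold minutes become incident windows, with no human classification step in between.

```python
# Derive incidents from monitoring data instead of manually filed postmortems.
# The SLO threshold and the sample data are illustrative assumptions.
SLO_ERROR_RATE = 0.01  # the service breaches its SLO above 1 percent errors

# Per-minute error rates as the monitoring system would report them.
error_rates = [0.002, 0.003, 0.04, 0.05, 0.03, 0.004, 0.002, 0.09, 0.003]

def breach_windows(rates, threshold):
    """Group consecutive above-threshold minutes into incident windows."""
    windows, start = [], None
    for i, rate in enumerate(rates):
        if rate > threshold and start is None:
            start = i                        # breach begins
        elif rate <= threshold and start is not None:
            windows.append((start, i - 1))   # breach ends
            start = None
    if start is not None:
        windows.append((start, len(rates) - 1))
    return windows

incidents = breach_windows(error_rates, SLO_ERROR_RATE)
print(incidents)  # [(2, 4), (7, 7)] - two incidents, whether or not anyone filed a postmortem
```

Under the informal-classification regime, the second, shorter breach is exactly the kind of event that gets resolved in Slack and never enters the incident record. Here it is an incident by construction.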
Run one data pipeline for all teams
The reconciliation problem is unsolvable if each team owns their own data path to the metric. The only durable solution is a single pipeline that collects deployment events, lead time spans, incident records, and restore timestamps from all teams through the same instrumentation. Differences in team output become visible as differences in delivery behavior rather than differences in how the metric is calculated.
Tools that support this pattern include Datadog's DORA metrics capability, LinearB, and custom OTLP pipelines feeding into Grafana. None of these are endorsements. They are illustrative of the architectural choice that makes cross-team reconciliation possible. The tool is less important than the decision to use one pipeline instead of many.
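To make the architectural choice concrete, here is a minimal sketch of a shared event schema with metric functions applied uniformly across teams; the records, field names, and values are invented for illustration:

```python
from statistics import mean

# One event stream, one schema, all teams. Records are illustrative.
deployments = [
    {"team": "payments", "failed": False, "restore_minutes": None},
    {"team": "payments", "failed": True,  "restore_minutes": 50},
    {"team": "search",   "failed": False, "restore_minutes": None},
    {"team": "search",   "failed": True,  "restore_minutes": 30},
    {"team": "search",   "failed": False, "restore_minutes": None},
]

# Because every team flows through the same pipeline, the same functions
# compute the metric for any team: no per-team variants to reconcile.
def change_failure_rate(events):
    return sum(e["failed"] for e in events) / len(events)

def mean_time_to_restore(events):
    return mean(e["restore_minutes"] for e in events if e["failed"])

print(change_failure_rate(deployments), mean_time_to_restore(deployments))  # 0.4 40
```

Filtering the same list by team gives per-team numbers that are comparable by construction, which is the whole point: any remaining difference is delivery behavior, not calculation method.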
What the CTO gains from metrics that are actually reliable
The C-level value of trustworthy DORA metrics is not the dashboard. It is the decision quality that the dashboard enables.
A CTO who can answer "how is engineering performing?" with a number that the entire leadership team agrees on can make investment decisions that the organization will not contest three months later. Platform infrastructure spend gets justified against measurable lead time reduction. Reliability investments get justified against documented SLO breach rates. Team-level variation becomes a diagnostic for targeted support rather than a political narrative about which team is best.
The alternative is the situation most engineering leaders are in. The metrics exist. No one fully trusts them. Investment decisions are made on intuition backed by selectively cited data. The dashboard is updated before board meetings and not consulted otherwise.
The Foundations Assessment metric audit
The Horizon phase of the Foundations Assessment includes a metric audit as a formal deliverable. The output is a metric stack design: definitions agreed upon across teams, a single data pipeline architecture, and an instrumentation plan that can be implemented in six to eight weeks.
The metric stack design is the document the CTO brings to the board. It does not just show current numbers. It shows the definitions behind the numbers, the data sources they come from, and the organizational agreements that make them reproducible. A board that asks a follow-up question gets an answer.
If your current metrics cannot survive that follow-up question, the audit is where to start.
Frequently asked questions
Why do DORA metrics vary so much between teams in the same organization?
Because definitions are not shared. Deployment frequency, lead time, change failure rate, and mean time to restore each require a specific definition to be meaningful. When teams define these independently based on what their tooling captures by default, the numbers are not comparable. The variation reflects definitional disagreement, not delivery variation.
What is Signal Integrity in the context of engineering metrics?
Signal Integrity is a metric property: the metric is reproducible, defensible, and describes the actual current state of the system. A metric that depends on manual classification or team interpretation lacks signal integrity. A metric produced by a shared pipeline against a shared definition has it.
How long does it take to fix unreliable DORA metrics?
Agreeing on definitions takes one to two weeks with the right stakeholders in the room. Instrumenting a shared data pipeline takes four to six weeks depending on the existing observability infrastructure. The total timeline from current state to C-level defensible metrics is typically six to eight weeks. The constraint is organizational, not technical.
Do you need expensive tooling to get reliable DORA metrics?
No. The four keys can be instrumented with open source tooling (OpenTelemetry, Prometheus, Grafana) against a shared pipeline. The tooling cost is low. The cost is the organizational work of agreeing on definitions and maintaining a single source of truth. Most teams underinvest in that work and overinvest in tooling.
Read more
- The AI mirror effect. Why your platform decides whether AI helps or hurts. Signal integrity is the precondition for knowing which side of the mirror you are on.
- Counting AI tokens is the new counting commits. What to track instead of activity metrics.
- DORA 2025 on AI and platform engineering. The amplifier framing and its implications for metric reliability.
References
- DORA. (2025). State of DevOps Report. AI amplifier framing and signal integrity as precondition. https://dora.dev/dora-report-2025/
- DORA. (2024). State of DevOps Report. DORA four keys and platform performance correlation. https://dora.dev/research/2024/dora-report/
- Forsgren, N., Storey, M. A., and Zimmermann, T. (2024). DX Core 4. A framework for measuring developer experience and delivery health.
- State of Platform Engineering, Vol 4. (2026). Finding that 29.6 percent of platform teams do not measure their own success.

Mat Caniglia
Founder of Clouditive. 18+ years transforming engineering organizations across LATAM and globally through Developer Experience consulting.