DORA metrics implementation — the definitional discipline that makes them honest
TL;DR. The fix for untrustworthy DORA numbers is not better tooling. It is definitional discipline: writing down what each metric measures, how it is calculated, and which events count, then enforcing that definition across every team and tool. Including automated deployments in deployment frequency, for instance, typically inflates the number three to five times over engineer-initiated code changes.
A CTO walks into a quarterly business review with a dashboard. Deployment frequency is up 40 percent. Lead time is down to four days. Change failure rate sits at 8 percent. The numbers look strong. A board member asks a follow-up question: is that deployment frequency measured per service, per environment, or per artifact? The CTO does not know. Neither does the VP of Engineering sitting next to them.
The metrics are real. The definitions behind them are not shared. And three dashboards in the room (Power BI, Jira, Datadog) each report something different.
This is the state of DORA implementation at most engineering organizations. The four keys are measured. They are just not measured the same way twice.
The fix is not better tooling. It is definitional discipline: writing down what each metric measures, how it is calculated, and what events count toward it, and enforcing that definition consistently across every team and every tool that reports the number.
Where the definition gaps hide — one for each of the four DORA keys
Each of the four DORA keys has a specific definition problem that recurs across engineering organizations.
Deployment frequency seems simple: the number of deployments to production. Ambiguous in practice. Does a deployment that fails and rolls back count? Different teams answer this differently. If it counts, a team that fails, rolls back, and redeploys will show higher deployment frequency than a team that ships the same code without a failed deployment. The metric ends up rewarding worse deployment reliability. Does a deployment of a configuration change count the same as a deployment of application code? Most organizations include both, which means teams with frequent configuration changes show higher frequency than teams with equivalent code delivery but fewer configuration events. Does a deployment that happens automatically (dependency updates, security patches applied by automation) count the same as a deployment initiated by an engineer? Including automated deployments typically inflates deployment frequency by a factor of three to five over code-change-only deployments.
The right definition depends on what you are trying to measure. If you are measuring delivery throughput (how fast does new code reach users), you want deployments of engineer-initiated code changes to production. If you are measuring pipeline reliability (how often does the deployment mechanism run successfully), you want all deployment events. If you are measuring release frequency (how often do customers see new capabilities), you want successful deployments of customer-facing feature changes. None of these is wrong. Mixing them in the same metric is wrong.
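To make the three readings concrete, here is a minimal Python sketch that counts one event log three ways. The record fields, the sample events, and the choice to require a successful deployment for delivery throughput are all illustrative assumptions, not a standard schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Deployment:
    # Hypothetical fields; map them to whatever your deployment log records.
    environment: str       # "production", "staging", ...
    initiated_by: str      # "engineer" or "automation"
    change_type: str       # "code", "config", or "dependency"
    succeeded: bool        # False if the deployment failed or rolled back
    customer_facing: bool  # True if users can see a new capability

def delivery_throughput(events):
    """Engineer-initiated code changes that reached production.
    Assumption: only successful deployments count as reaching users."""
    return sum(1 for e in events
               if e.environment == "production" and e.initiated_by == "engineer"
               and e.change_type == "code" and e.succeeded)

def pipeline_events(events):
    """Every production deployment event, whoever started it and however it ended."""
    return sum(1 for e in events if e.environment == "production")

def release_frequency(events):
    """Successful production deployments of customer-facing changes."""
    return sum(1 for e in events
               if e.environment == "production" and e.succeeded and e.customer_facing)

# Made-up sample log for one week.
events = [
    Deployment("production", "engineer", "code", True, True),
    Deployment("production", "engineer", "code", True, False),
    Deployment("production", "engineer", "code", False, True),   # rolled back
    Deployment("production", "automation", "dependency", True, False),
    Deployment("production", "automation", "dependency", True, False),
    Deployment("production", "engineer", "config", True, False),
    Deployment("staging", "engineer", "code", True, True),        # not production
]

print(delivery_throughput(events), pipeline_events(events), release_frequency(events))
```

The same log yields three different numbers, which is the point: each function is a separate metric and deserves its own name on the dashboard.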
Lead time for changes has its own ambiguity. The definition is the time from code commit to code running in production. Does lead time start at the first commit to a feature branch, the commit to main, or the commit that triggers the CI pipeline? For teams using trunk-based development, these are often the same commit. For teams using long-lived feature branches, the difference can be days or weeks. A team that rebases onto main before merging will show different lead time than a team that merges a long-lived branch. Does lead time include the time a change spends waiting for approval? In organizations with change advisory boards or formal change management processes, a production-ready artifact may wait days for an approval window. Including that wait makes the metric sensitive to organizational process. Excluding it makes the metric sensitive only to technical pipeline speed. Both are valid measurements of different things.
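The effect of these choices on a single change can be sketched in a few lines of Python. The event names and timestamps below are hypothetical; the sketch only assumes that each change carries a timestamp for every candidate boundary.

```python
from datetime import datetime, timedelta

# One hypothetical change with a timestamp for each candidate boundary.
change = {
    "first_branch_commit": datetime(2024, 3, 1, 9, 0),
    "merged_to_main": datetime(2024, 3, 6, 9, 0),
    "artifact_ready": datetime(2024, 3, 6, 10, 0),
    "approved": datetime(2024, 3, 8, 10, 0),
    "running_in_production": datetime(2024, 3, 8, 11, 0),
}

def lead_time(change, start_event, exclude_approval_wait=False):
    """Time from the chosen start event to code running in production.
    Optionally subtract the time a ready artifact spent waiting for approval."""
    total = change["running_in_production"] - change[start_event]
    if exclude_approval_wait:
        total -= change["approved"] - change["artifact_ready"]
    return total

print(lead_time(change, "first_branch_commit"))   # clock starts on the feature branch
print(lead_time(change, "merged_to_main"))        # clock starts at the commit to main
print(lead_time(change, "merged_to_main", exclude_approval_wait=True))
```

For this one change the three definitions give roughly seven days, two days, and two hours. A protocol has to name both the start event and the treatment of approval wait for the number to be reproducible.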
Change failure rate — the percentage of deployments causing a degraded service requiring remediation — sounds clear until you ask what counts as a failure. Does a deployment that causes a bug discovered in QA but not yet in production count? Does a deployment causing a performance degradation of 30 percent count? Depends on whether the monitoring system alerts on performance, and whether the alert threshold is calibrated to customer-impacting degradation or to technical metric drift. Does a partial failure — a deployment where 20 percent of users experience errors — count the same as a complete outage? Most dashboards do not distinguish, which misrepresents risk.
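A short Python sketch shows how much the failure criterion moves the rate. The deployment records, the field names, and the 25 percent latency threshold are assumptions made for illustration; the sketch also assumes a bug caught in QA never counts, which is itself a definitional choice.

```python
# Hypothetical deployment records for one reporting period.
deployments = [
    {"id": "d1", "error_share": 0.0, "latency_increase": 0.0, "bug_found_in": None},
    {"id": "d2", "error_share": 1.0, "latency_increase": 0.0, "bug_found_in": "production"},  # full outage
    {"id": "d3", "error_share": 0.2, "latency_increase": 0.0, "bug_found_in": "production"},  # partial failure
    {"id": "d4", "error_share": 0.0, "latency_increase": 0.3, "bug_found_in": None},          # 30% slower
    {"id": "d5", "error_share": 0.0, "latency_increase": 0.0, "bug_found_in": "qa"},          # caught before prod
]

def outages_only(d):
    """Strict reading: only a complete outage is a failure."""
    return d["error_share"] == 1.0

def customer_impact(d):
    """Broader reading: any user-facing errors, or latency up by at least 25%.
    The 25% threshold is an arbitrary example value."""
    return d["error_share"] > 0 or d["latency_increase"] >= 0.25

def change_failure_rate(deployments, is_failure):
    """Share of deployments classified as failures under the given criterion."""
    return sum(1 for d in deployments if is_failure(d)) / len(deployments)

print(change_failure_rate(deployments, outages_only))
print(change_failure_rate(deployments, customer_impact))
```

On the same five deployments the rate is 20 percent under the strict reading and 60 percent under the broader one. Passing the criterion in as a function keeps the definition explicit and reviewable.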
Mean time to restore has the same boundary problem at both ends. The clock can start when monitoring alerts, when a customer reports an issue, or when an engineer notices a problem. When the clock starts at detection, organizations with monitoring gaps will show shorter MTTR than organizations with comprehensive monitoring and the same incident response speed, because their incident clock starts later. Restoration can end when the immediate technical fix is applied, when the post-incident validation is complete, or when the monitoring system confirms normal operation. These can differ by hours on the same incident.
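The boundary choices can be sketched as parameters of the calculation. The incidents and event names below are invented for illustration; the sketch assumes every incident records a timestamp for each candidate start and end event.

```python
from datetime import datetime, timedelta

# Two hypothetical incidents on the same day.
incidents = [
    {
        "monitoring_alert": datetime(2024, 5, 2, 10, 0),
        "customer_report": datetime(2024, 5, 2, 10, 40),
        "fix_applied": datetime(2024, 5, 2, 11, 30),
        "validation_complete": datetime(2024, 5, 2, 12, 15),
        "monitoring_normal": datetime(2024, 5, 2, 13, 0),
    },
    {
        "monitoring_alert": datetime(2024, 5, 2, 14, 0),
        "customer_report": datetime(2024, 5, 2, 14, 10),
        "fix_applied": datetime(2024, 5, 2, 14, 30),
        "validation_complete": datetime(2024, 5, 2, 15, 0),
        "monitoring_normal": datetime(2024, 5, 2, 15, 30),
    },
]

def mean_time_to_restore(incidents, start_event, end_event):
    """Average duration between the chosen start and end events."""
    durations = [i[end_event] - i[start_event] for i in incidents]
    return sum(durations, timedelta()) / len(durations)

widest = mean_time_to_restore(incidents, "monitoring_alert", "monitoring_normal")
narrowest = mean_time_to_restore(incidents, "customer_report", "fix_applied")
print(widest, narrowest)
```

With these sample incidents the widest boundaries give 2 hours 15 minutes and the narrowest give 35 minutes, from identical raw data and identical response speed.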
How to run a definition audit and produce a measurement protocol that survives a board question
A definition audit takes two to four hours and produces the documentation that makes DORA metrics defensible. Gather three to five engineers who each produce DORA metrics. Have each independently measure the same past quarter using their current method. Compare the results. Resolve the differences by writing down explicit definitions.
The comparison step always surfaces disagreements. In Clouditive engagement experience, independently calculated deployment frequency numbers for the same team vary by 40 to 200 percent depending on how each engineer counted deployments. This is not a measurement failure. It is a definition failure. The engineers are measuring accurately, but they are measuring different things.
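One way to quantify the disagreement in the comparison step is the gap between the highest and lowest result, expressed as a percentage of the lowest. That formula is an assumption for this sketch, as are the engineer labels and counts, which are made-up inputs rather than engagement data.

```python
def spread_percent(results):
    """Gap between the highest and lowest result, as a percentage of the lowest."""
    lowest = min(results.values())
    highest = max(results.values())
    return (highest - lowest) * 100 / lowest

# Hypothetical deployment counts for the same team and the same quarter.
quarterly_counts = {
    "engineer_a": 40,    # counted engineer-initiated code changes only
    "engineer_b": 56,    # also counted configuration changes
    "engineer_c": 120,   # also counted automated dependency updates
}

print(spread_percent(quarterly_counts))
```

Recording a one-line note on what each engineer counted, as in the comments above, usually turns the numeric gap directly into the list of definition questions the audit has to resolve.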
The output is a measurement protocol: a written document specifying, for each metric, what events are counted, what events are excluded and why, where the data is sourced, and how the calculation is performed. Any engineer following the protocol should produce the same number from the same raw data.
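A protocol entry can also be kept as structured data so that missing pieces are easy to spot. The following Python sketch is one possible shape, not a prescribed format; the class name, field names, and sample values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class MetricProtocol:
    metric: str
    counted: list       # events that count toward the metric
    excluded: dict      # excluded event -> reason for excluding it
    data_source: str    # where the raw data comes from
    calculation: str    # how the number is computed

    def gaps(self):
        """List the parts of the entry that are still empty."""
        missing = [name for name in ("counted", "excluded", "data_source", "calculation")
                   if not getattr(self, name)]
        missing += [f"reason for excluding {event!r}"
                    for event, reason in self.excluded.items() if not reason.strip()]
        return missing

# Hypothetical entry with one exclusion that has no stated reason yet.
deployment_frequency = MetricProtocol(
    metric="deployment frequency (delivery throughput)",
    counted=["engineer-initiated code changes deployed to production"],
    excluded={
        "automated dependency updates": "not initiated by an engineer",
        "staging deployments": "",
    },
    data_source="CI/CD deployment log",
    calculation="count of qualifying deployments per week",
)

print(deployment_frequency.gaps())
```

An entry with an empty gaps list is not automatically a good definition, but a non-empty list is a cheap signal that two engineers could still read the protocol differently.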
The protocol is reviewed annually and whenever a significant tooling change occurs. Feature flag adoption, new CI/CD tools, automated dependency management, AI-assisted development: each can shift the metric denominator or numerator in ways that require a definition review.
Why DORA metrics need supplementary signals — and which four pairs catch the most problems
DORA metrics are outcome metrics. They tell you the system's delivery performance from the outside. They do not tell you why the performance is what it is, which investments produced improvements, or whether the improvements are sustainable.
Supplementary signals that triangulate DORA give you the causal layer underneath the outcome metrics.
Paved road compliance under pressure pairs with deployment frequency. High deployment frequency with declining paved road compliance under deadline pressure tells you the frequency is being sustained by developers routing around the platform, which is unsustainable. High deployment frequency with high paved road compliance tells you the platform is absorbing the deployment work and the frequency reflects genuine capability improvement.
Flow state retention pairs with lead time. Improving lead time that comes with declining flow state retention — more context switches per day per developer — tells you the lead time improvement is being purchased with developer attention cost. The metric looks better while the sustainability of the improvement degrades.
Incident documentation rate pairs with MTTR. Improving MTTR with declining incident documentation rate tells you restoration is getting faster because the same hero engineers are responding, not because the organization is getting better at incident response. The improvement is person-dependent and will not survive turnover.
Escaped defect rate pairs with change failure rate. A change failure rate that holds steady while escaped defects increase tells you the monitoring configuration is not catching all failures. The DORA metric is accurate for what it measures. It is not measuring everything that matters.
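The four pairings above can be sketched as a small rule table that flags a DORA metric whose paired signal is moving the wrong way. The trend labels ("improving", "steady", "worsening"), the metric keys, and the decision to treat a steady deployment frequency the same as an improving one are assumptions made for this illustration.

```python
# (DORA metric, trends that trigger the check, paired signal, warning)
RULES = [
    ("deployment_frequency", {"improving", "steady"}, "paved_road_compliance",
     "frequency may be sustained by developers routing around the platform"),
    ("lead_time", {"improving"}, "flow_state_retention",
     "lead time gains may be paid for with developer attention"),
    ("mttr", {"improving"}, "incident_documentation_rate",
     "faster restoration may depend on the same few engineers"),
    ("change_failure_rate", {"steady"}, "escaped_defect_rate",
     "monitoring may not be catching all failures"),
]

def divergences(trends):
    """Return (metric, signal, warning) wherever the paired signal is worsening."""
    flagged = []
    for metric, trigger_trends, signal, warning in RULES:
        if trends.get(metric) in trigger_trends and trends.get(signal) == "worsening":
            flagged.append((metric, signal, warning))
    return flagged

# Hypothetical quarter-over-quarter trends for one team.
trends = {
    "deployment_frequency": "improving", "paved_road_compliance": "worsening",
    "lead_time": "improving", "flow_state_retention": "steady",
    "mttr": "improving", "incident_documentation_rate": "worsening",
    "change_failure_rate": "steady", "escaped_defect_rate": "steady",
}

for metric, signal, warning in divergences(trends):
    print(f"{metric} vs {signal}: {warning}")
```

For the sample trends this flags deployment frequency and MTTR. A flag is a prompt to investigate, not a verdict; the supplementary signals need their own written definitions before their trends can be trusted.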
What honest DORA implementation reveals — a team with elite numbers and a platform problem
When DORA metrics are implemented with definitional discipline and triangulated with supplementary signals, the picture is more complex and more useful than the typical dashboard.
A team I assessed had what looked like elite DORA metrics: deployment frequency above 20 per week, lead time under 2 days, change failure rate under 5 percent. Their MTTR was the only metric outside elite range at approximately 6 hours.
The supplementary signals told a different story. Paved road compliance under pressure was 38 percent: the majority of deployments during release crunches used manual workarounds rather than the standard pipeline. Flow state retention showed developers context-switching an average of once every 22 minutes on standard work days. Incident documentation rate was 12 percent.
The DORA metrics were technically accurate. The team deployed frequently to staging and development environments, which they counted in their production deployment frequency. Lead time was short because engineers made small commits directly to main, bypassing the feature development cycle. Change failure rate was low because the monitoring thresholds were set too loosely and did not catch degradations that customers experienced.
The honest picture was a team with fast technical pipelines and weak platform absorption. The deployment frequency and lead time metrics reflected the pipeline speed. The supplementary signals reflected that the pipeline was being used inconsistently under pressure and that the MTTR problem was caused by inadequate runbooks. The DORA numbers looked like elite performance. The supplementary signals indicated structural platform weaknesses. Both were correct. Neither alone was sufficient.
Frequently asked questions
Why does including automated deployments inflate deployment frequency, and what should be counted instead?
Automated deployments — dependency updates applied by bots, security patches, configuration changes triggered without engineer intent — count events that are unrelated to development throughput. Including them typically inflates deployment frequency by three to five times over engineer-initiated code changes, per Clouditive engagement experience. The right definition depends on what you are measuring: delivery throughput (engineer-initiated code changes to production), pipeline reliability (all deployment events), or release frequency (successful deployments of customer-facing feature changes). Mixing these in the same metric produces a number that is accurate in the narrow sense and misleading in the business sense. The definition audit forces you to choose, and different choices are correct for different purposes.
What is the most common gap found in DORA definition audits?
Independently calculated deployment frequency numbers for the same team vary by 40 to 200 percent depending on how each engineer counted deployments, based on Clouditive engagement data. This is not a measurement accuracy problem. It is a definition problem: the engineers are measuring accurately, but they are counting different things. The definition audit surfaces this by having three to five engineers independently measure the same past quarter using their current method, then comparing. The comparison always reveals disagreements. Writing a measurement protocol — what counts, what does not count, where the data is sourced, how the calculation is performed — resolves the disagreements and produces a number that any engineer can reproduce from the same raw data.
When should a DORA metric definition be reviewed, and what triggers obsolescence?
Annually and after every significant tooling change. Feature flag adoption, new CI/CD tools, automated dependency management, and AI-assisted development each shift what gets counted in the metric denominator or numerator. A deployment frequency definition calibrated before feature flag adoption counted releases that users saw. After feature flag adoption, it may count deployments where the code ships but users see no change. An engineering organization that adopted AI coding assistants in the last year and has not reviewed its lead time definition may be measuring time-to-commit on AI-generated code, which is shorter than time-to-commit on human-written code for different reasons. The review asks: has any tooling change in the last twelve months shifted what this definition captures?
For a deeper read on why DORA metrics specifically fail without Signal Integrity discipline, read Why your DORA metrics are lying to you.
For what the DORA 2025 AI amplifier research adds to DORA implementation — why baseline measurement must come before AI rollout — read the AI amplifier — what DORA 2025 is actually telling you about your platform.
For the bank case study that puts DORA implementation in the context of a full DevOps transformation, read how a traditional bank transformed its engineering without a big-bang rewrite.
The Foundations Assessment includes a full DORA definition audit as part of the Horizon phase, yielding a measurement protocol that produces consistent numbers across teams and tools. The free Platform Score surfaces the most common definition gaps in fifteen minutes.

Mat Caniglia
Founder of Clouditive. 18+ years transforming engineering organizations across LATAM and globally through Developer Experience consulting.