Measuring AI Developer Productivity: Why PR Count Is the Wrong Metric
TL;DR. Your developers have AI coding assistants. Leadership wants to know if they're worth it. The first instinct is to count PRs, acceptance rates, or lines of code accepted. These are activity metrics: they measure how much the tool was used, not what it delivered. DORA metrics before and after rollout tell you what actually changed: whether lead time dropped, whether deployment frequency increased, whether change failure rate held. If DORA hasn't moved after 90 days, the platform is the likely constraint, and more AI tooling will not fix it.
Your developers have AI coding assistants. Leadership wants to know if they're worth it. The first instinct is to count PRs. This is the wrong measure.
Why output metrics fail for AI productivity
PR count measures activity, not impact. An AI coding assistant that generates boilerplate increases PR count without increasing delivery of user value. What counts as a pull request hasn't changed. What generates pull requests has — and the metric cannot tell the difference between a developer shipping meaningful functionality and a developer shipping AI-generated scaffolding that will be refactored next week.
Code churn compounds the problem. When AI tools generate low-quality code that requires immediate revision, churn increases. More PRs open, more PRs close, the activity dashboard shows rising engagement. The delivery system is degrading. The metric shows green.
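To make churn visible next to PR count, one rough sketch is to track what share of newly added lines is rewritten or deleted shortly after merge. The record shape, the 14-day window, and the names `Change` and `churn_rate` are assumptions for illustration, not a standard definition, and the numbers are made up.

```python
from dataclasses import dataclass


@dataclass
class Change:
    """One merged PR. Both fields are hypothetical and would come from your VCS history."""
    lines_added: int
    lines_reworked_within_14d: int  # added lines modified or deleted within 14 days


def churn_rate(changes: list[Change]) -> float:
    """Share of added lines that were reworked inside the window (0.0 if nothing was added)."""
    added = sum(c.lines_added for c in changes)
    reworked = sum(c.lines_reworked_within_14d for c in changes)
    return reworked / added if added else 0.0


# Illustrative data only: PR count doubles while churn triples.
before = [Change(200, 20), Change(100, 10)]
after = [Change(300, 90), Change(200, 60), Change(100, 30), Change(200, 60)]

print("PRs:", len(before), "churn:", round(churn_rate(before), 2))
print("PRs:", len(after), "churn:", round(churn_rate(after), 2))
```

In this made-up sample the PR count goes from 2 to 4 while the churn rate goes from 0.1 to 0.3, which is the situation an activity dashboard would report as an improvement.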
Story point velocity was already a weak proxy for delivery performance before AI adoption. AI tools make it weaker. Story points were calibrated against a human development pace, a judgment about how long something takes a developer to write. AI coding assistants break the calibration: a five-point story that once took two days might now produce a draft in two hours. The story point number doesn't change, but it no longer measures what it measured before.
The 2024 DORA State of DevOps Report points in this direction. It associated increased AI adoption with improved code quality and, at the same time, with an estimated 7.2 percent drop in delivery stability. PR-count and story-point dashboards cannot distinguish a team that is improving from one that is degrading: both show rising AI activity metrics. Source: DORA State of DevOps Report.
What to measure instead
Applying DORA metrics to AI adoption gives you outcome visibility that activity dashboards don't.
Deployment frequency. Are developers deploying more often? This is the first indicator of real throughput improvement, not just code generation speed. If AI tooling is working, the time from idea to production should be shrinking — and deployment frequency is the observable signal. If deployment frequency hasn't moved after 60 to 90 days of adoption, AI-generated code is either sitting in slow pipelines or being held back by risk-aversion from quality concerns. Neither of those is a tool problem.
Lead time for changes. Is the time from commit to production decreasing? AI coding assistants compress the coding phase of lead time. If total lead time — commit to live — doesn't decrease, the bottleneck shifted to another stage: slow CI, a review queue, a manual deployment process. The platform is absorbing the productivity gain before it reaches delivery.
Change failure rate. AI-generated code without sufficient test coverage tends to increase change failure rate (CFR). If CFR rises after AI tool adoption, the test infrastructure is failing, not the AI tool. The DORA 2024 data is consistent with this split: teams on healthy delivery systems see stability hold or improve; teams on weak delivery systems see stability degrade. The change failure rate is where that degradation appears.
Developer satisfaction (SPACE framework). The SPACE framework for developer productivity includes a satisfaction dimension. A developer survey asking "is AI tooling reducing frustration with routine work?" is likely to track retention more closely than any of the activity metrics. Engineers who feel the tooling is helping them are more likely to stay. Engineers who feel the tooling is adding verification burden on top of their existing cognitive load are signaling a platform problem.
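The three delivery metrics above can be computed from a plain list of deployment records. The following is a minimal sketch, assuming each record carries a commit time, a production time, and a failure flag; the `Deployment` type, its field names, and the sample data are hypothetical, and real data would come from your CI/CD and incident tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class Deployment:
    """Hypothetical record shape; adapt to whatever your pipeline exports."""
    commit_at: datetime    # when the change was committed
    deployed_at: datetime  # when it reached production
    failed: bool           # whether it caused a failure that needed remediation


def deployment_frequency(deploys: list[Deployment], window_days: int) -> float:
    """Deployments per day over the observation window."""
    return len(deploys) / window_days


def median_lead_time(deploys: list[Deployment]) -> timedelta:
    """Median commit-to-production time. Raises StatisticsError on an empty list."""
    return median(d.deployed_at - d.commit_at for d in deploys)


def change_failure_rate(deploys: list[Deployment]) -> float:
    """Share of deployments flagged as failed (0.0 if there were none)."""
    return sum(d.failed for d in deploys) / len(deploys) if deploys else 0.0


# Illustrative data only: four deployments over a 28-day window.
t0 = datetime(2025, 1, 6, 9, 0)
deploys = [
    Deployment(t0, t0 + timedelta(hours=30), False),
    Deployment(t0 + timedelta(days=7), t0 + timedelta(days=7, hours=20), True),
    Deployment(t0 + timedelta(days=14), t0 + timedelta(days=14, hours=50), False),
    Deployment(t0 + timedelta(days=21), t0 + timedelta(days=21, hours=10), False),
]

print("deploys/day:", round(deployment_frequency(deploys, 28), 3))
print("median lead time:", median_lead_time(deploys))
print("change failure rate:", change_failure_rate(deploys))
```

The median is used for lead time rather than the mean so that a single stuck change does not dominate the number; that is a design choice in this sketch, not a requirement.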
The platform engineering connection
AI productivity gains are platform-constrained. This is the finding from the DORA 2024 report, the METR 2025 study, and the Larridin 2026 benchmarks, all pointing in the same direction: the platform underneath the AI tool determines whether the tool improves or degrades delivery outcomes.
Fast CI, well-structured test infrastructure, deployment visibility, and feature flags determine whether AI-generated code ships value or creates churn. Without fast CI, the iteration cycle that makes AI coding valuable breaks down. Without test infrastructure, AI-generated code accumulates uncaught defects. Without deployment visibility, the feedback loop that confirms a change worked is missing. Without feature flags, faster code generation leads to batching and risk-aversion rather than increased deployment frequency.
An AI coding assistant operating on a platform that doesn't have these capabilities will show high activity metrics and flat or declining DORA metrics. The tool is doing its job. The platform is absorbing the gain.
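One way to see where the platform absorbs the gain is to split the time each change spends into stages and look for the largest one. This is a sketch under assumptions: the stage names (`coding`, `ci`, `review_wait`, `deploy_wait`) and the minute values are invented for illustration, and your own pipeline may expose different stages.

```python
from statistics import median

# Minutes spent per stage for each change. Hypothetical numbers.
changes = [
    {"coding": 45, "ci": 25, "review_wait": 480, "deploy_wait": 120},
    {"coding": 30, "ci": 27, "review_wait": 600, "deploy_wait": 60},
    {"coding": 60, "ci": 24, "review_wait": 300, "deploy_wait": 240},
]


def stage_medians(changes: list[dict[str, float]]) -> dict[str, float]:
    """Median minutes per stage, assuming every record has the same stage keys."""
    stages = changes[0].keys()
    return {stage: median(c[stage] for c in changes) for stage in stages}


def bottleneck(changes: list[dict[str, float]]) -> str:
    """Name of the stage with the largest median duration."""
    medians = stage_medians(changes)
    return max(medians, key=medians.get)


print(stage_medians(changes))
print("largest stage:", bottleneck(changes))
```

In this sample the coding stage is already small next to the review wait, so making coding faster would barely change total lead time.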
How to run a 90-day AI productivity experiment
Before adopting AI coding tools broadly, baseline your DORA metrics. Record deployment frequency, lead time for changes, change failure rate, and mean time to restore. Note the current state of your test infrastructure, CI speed, and deployment process. This takes a day if the data exists in your tooling.
Adopt the tools. Run them for 90 days with normal team behavior — no special attention to the metrics, no pressure to show improvement, no gaming.
At 90 days, measure the same DORA metrics. Attribution isn't precise — other things changed in 90 days too. But directional movement is observable and meaningful. If deployment frequency increased and change failure rate held, the AI tools are contributing to throughput without degrading stability. If change failure rate rose alongside deployment frequency, AI is generating code faster than the test infrastructure can validate.
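The before-and-after comparison can be as simple as a percent change per metric. The sketch below assumes you have already reduced each period to four numbers; the `DoraSnapshot` type, its field names and units, and the sample values are hypothetical.

```python
from dataclasses import asdict, dataclass


@dataclass
class DoraSnapshot:
    """One measurement period. Field names and units are assumptions."""
    deploys_per_week: float
    lead_time_hours: float
    change_failure_rate: float   # 0.0 to 1.0
    time_to_restore_hours: float


def pct_change(before: float, after: float) -> float:
    """Percent change relative to the baseline. Assumes a nonzero baseline."""
    return (after - before) * 100.0 / before


def compare(baseline: DoraSnapshot, day90: DoraSnapshot) -> dict[str, float]:
    """Percent change for every metric between the two snapshots."""
    b, a = asdict(baseline), asdict(day90)
    return {name: pct_change(b[name], a[name]) for name in b}


# Illustrative data only.
baseline = DoraSnapshot(deploys_per_week=5, lead_time_hours=48,
                        change_failure_rate=0.10, time_to_restore_hours=4)
day90 = DoraSnapshot(deploys_per_week=6, lead_time_hours=36,
                     change_failure_rate=0.10, time_to_restore_hours=4)

for name, delta in compare(baseline, day90).items():
    print(f"{name}: {delta:+.1f}%")
```

Because attribution is imprecise, the output is best read for direction and rough size, not as a measured effect of the tooling.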
The 90-day measurement isn't a proof. It's a decision-making instrument. It tells you whether the current platform supports AI adoption outcomes or whether platform work needs to come first.
What to watch for
If deployment frequency improves but change failure rate rises, the AI is generating code faster than your test infrastructure can validate. The fix is not to slow down AI adoption — it's to improve the test infrastructure. Flaky tests, gaps in coverage for new AI-generated code paths, and review processes that can't keep up with generation speed all contribute. The platform is the lever.
If lead time for changes doesn't move despite AI adoption, the bottleneck is somewhere other than coding. Slow CI is the most common cause. A 25-minute CI run absorbs any code generation speed improvement before it reaches deployment frequency. The second most common cause is a review process that didn't scale when the volume of PRs increased.
If change failure rate stays flat and deployment frequency stays flat, the AI tools may be generating code that engineers are revising extensively before committing — the tool is adding cognitive load rather than reducing it. This is the pattern the METR 2025 study documented: senior developers were 19 percent slower when using AI tools on familiar code, despite reporting that they felt faster. Platform clarity — well-documented codebases, consistent patterns, opinionated tooling — reduces the verification burden that creates this gap.
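The three patterns above can be written down as a small rule set over the percent changes from the 90-day comparison. This is a sketch, not a validated diagnostic: the function name `diagnose`, the labels it returns, and the 5 percent tolerance used to decide what counts as "flat" are all assumptions.

```python
def diagnose(freq_change_pct: float,
             lead_time_change_pct: float,
             cfr_change_pct: float,
             tolerance_pct: float = 5.0) -> list[str]:
    """Map metric movements to the patterns described in the text.

    Inputs are percent changes versus baseline. Several patterns can match at once.
    """
    freq_up = freq_change_pct > tolerance_pct
    freq_flat = abs(freq_change_pct) <= tolerance_pct
    cfr_up = cfr_change_pct > tolerance_pct
    cfr_flat = abs(cfr_change_pct) <= tolerance_pct
    lead_time_down = lead_time_change_pct < -tolerance_pct

    findings = []
    if freq_up and cfr_up:
        # Code is generated faster than the test infrastructure can validate it.
        findings.append("validation_gap")
    if not lead_time_down:
        # Lead time did not improve: look at CI duration and the review queue.
        findings.append("bottleneck_outside_coding")
    if freq_flat and cfr_flat:
        # Nothing moved: engineers may be revising AI output heavily before committing.
        findings.append("verification_burden")
    if not findings and freq_up:
        # Throughput rose, lead time fell, and failure rate did not rise.
        findings.append("throughput_gain_with_stable_failures")
    return findings


print(diagnose(20.0, -25.0, 30.0))
print(diagnose(0.0, 0.0, 0.0))
print(diagnose(20.0, -25.0, 0.0))
```

Returning a list rather than a single label is deliberate: flat lead time and flat deployment frequency often show up together, and both are worth investigating.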
For more on the platform capabilities that determine AI adoption outcomes, read "AI coding assistants and platform readiness". For the DORA metrics context that underpins this measurement approach, "DORA metrics for engineering leaders" covers the four metrics and how to baseline them.

Mat Caniglia
Founder of Clouditive. 18+ years transforming engineering organizations across LATAM and globally through Developer Experience consulting.