Three platform measurement traps that make your dashboards theater
TL;DR. Metrics become theater when tooling evolves faster than metric definitions. Tokens consumed rewards imprecise prompting and high iteration volume — the behavior you do not want. AI acceptance rate borrowed its positive connotation from code review, where it meant reviewers trusted the work; in AI suggestions, it means developers stopped reviewing carefully. Deployment frequency was a legitimate signal in 2018 before feature flags decoupled deploys from releases; in 2026, a team can deploy fifty times a day and ship one customer-visible change per week. The corrective principle: every claim needs at least two independent signals, and the measurement stack gets audited every quarter for what new tools have made misleading.
Every platform team in 2026 falls into at least one of these. Most fall into all three.
The pattern is not new. Lines of code was an honest proxy for output in 1975 when teams hand-typed assembly. Story points were an honest proxy for effort when backlogs were homogeneous. Velocity was an honest proxy for throughput when sprints mapped cleanly to shippable units. Each metric was calibrated to a tool generation it outlived.
The same thing is happening now, only faster: AI tools are changing the relationship between activity and output more quickly than any previous generation of tooling. The metrics your dashboard shows today were honest when your toolchain looked different. They are theater now. Not because someone gamed them. Because the tools changed and the metric definitions did not.
Tokens consumed tells you the AI ran a lot
A high token count tells you the AI ran a lot. It does not tell you that anything shipped, that quality held, or that the developer thought any harder. Yet it sits on dashboards as a productivity metric.
The tokens consumed metric emerged because it is easy to instrument. Every major AI coding tool exposes token counts via API. Engineering leaders who had been measuring lines of code found tokens consumed as a natural successor: a volume metric that lives in the same mental category. The problem is that the relationship between token volume and engineering output is loose at best and inverse at worst.
Consider two developers working on the same feature. Developer A writes a precise prompt, gets a useful suggestion, edits it for thirty seconds, and commits. Developer B writes a vague prompt, gets an irrelevant response, re-prompts eight times, accepts a suggestion that compiles but contains three subtle bugs, and commits. Developer B has consumed significantly more tokens. On a tokens-consumed dashboard, Developer B looks more productive. The metric rewards exactly the behavior you do not want: imprecise prompting, uncritical acceptance, and high iteration volume on low-quality outputs.
Token consumption is a cost metric, not a productivity metric. Track it as a cost input, the same way you track cloud compute spend: relevant for budget governance, not for measuring engineering output.
For output, measure complexity-adjusted throughput: the number of units of work completed, weighted by the architectural complexity of each unit. Larridin's 2026 benchmark data puts the industry average at 7 complexity-adjusted points per engineer per week, rising to 12 for AI-assisted work where the AI is being used well. The ratio between your own unassisted and AI-assisted throughput, not the raw token count, tells you whether your AI adoption is producing output or activity. Source: Larridin Developer Productivity Benchmarks 2026.
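A minimal sketch of that calculation, assuming each completed work item carries a complexity weight the team has agreed on. The `WorkItem` type, the point scale, and the sample items are hypothetical illustrations, not Larridin's method or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkItem:
    title: str
    complexity_points: float  # hypothetical weight agreed by the team
    ai_assisted: bool


def throughput(items: list[WorkItem], engineer_weeks: float) -> float:
    """Complexity points completed per engineer per week."""
    if engineer_weeks <= 0:
        raise ValueError("engineer_weeks must be positive")
    return sum(item.complexity_points for item in items) / engineer_weeks


def ai_uplift(items: list[WorkItem], baseline_weeks: float, ai_weeks: float) -> float:
    """Ratio of AI-assisted throughput to unassisted throughput."""
    baseline = throughput([i for i in items if not i.ai_assisted], baseline_weeks)
    assisted = throughput([i for i in items if i.ai_assisted], ai_weeks)
    if baseline == 0:
        raise ValueError("no unassisted work to compare against")
    return assisted / baseline


# Made-up sample: 14 unassisted points and 24 AI-assisted points,
# each produced over two engineer-weeks.
sample = [
    WorkItem("refactor billing module", 8, ai_assisted=False),
    WorkItem("add audit log endpoint", 6, ai_assisted=False),
    WorkItem("migrate auth service", 13, ai_assisted=True),
    WorkItem("new reporting pipeline", 11, ai_assisted=True),
]
uplift = ai_uplift(sample, baseline_weeks=2, ai_weeks=2)
```

Token counts never enter the calculation. They stay on the cost side of the ledger, and the uplift ratio is what gets compared quarter over quarter.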
The second metric that replaces tokens consumed is code durability: the share of AI-assisted code that survives without modification past a defined time horizon (30 or 90 days, depending on your release cadence). Code that gets accepted and then immediately reworked signals a quality problem that token volume obscures entirely.
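One way to compute durability, sketched under the assumption that you already know when each AI-assisted change merged and when it was first modified afterwards (for example, from version control history). The `AiChange` record and the sample dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class AiChange:
    merged_at: datetime
    first_modified_at: Optional[datetime]  # None if untouched so far


def durability(changes: list[AiChange], horizon_days: int, now: datetime) -> Optional[float]:
    """Share of AI-assisted changes that survived unmodified past the horizon."""
    horizon = timedelta(days=horizon_days)
    # Only changes old enough to have lived through the full horizon count;
    # younger changes have not had the chance to fail yet.
    mature = [c for c in changes if now - c.merged_at >= horizon]
    if not mature:
        return None
    survived = sum(
        1
        for c in mature
        if c.first_modified_at is None
        or c.first_modified_at - c.merged_at >= horizon
    )
    return survived / len(mature)


now = datetime(2026, 4, 1)
changes = [
    AiChange(datetime(2026, 1, 1), None),                   # never touched
    AiChange(datetime(2026, 1, 1), datetime(2026, 1, 10)),  # reworked after 9 days
    AiChange(datetime(2026, 1, 1), datetime(2026, 3, 1)),   # reworked after 59 days
    AiChange(datetime(2026, 3, 25), None),                  # too recent to judge
]
durability_30 = durability(changes, 30, now)
durability_90 = durability(changes, 90, now)
```

Excluding changes that are younger than the horizon is a design choice: counting them as survivors would inflate the number right after a burst of merges.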
AI acceptance rate rewards exactly the wrong behavior
A 70 percent acceptance rate sounds impressive. It also describes a developer who stopped reviewing carefully and started accepting whatever compiles.
AI acceptance rate was borrowed from code review metrics, where a high acceptance rate on pull requests was a signal of high-quality PRs. The logic transferred badly. In code review, a high acceptance rate means reviewers trusted the work. In AI suggestions, a high acceptance rate means developers accepted suggestions without sufficient scrutiny. The directional interpretation flipped, but the metric kept its positive connotation.
The DORA 2025 report found that AI acts as an amplifier of existing engineering conditions. Strong teams use it to produce better output. Weak teams use it to produce more output, but not better output. Source: DORA State of AI-assisted Software Development 2025. A high acceptance rate in a strong team reflects fast, accurate suggestions being accepted with genuine judgment. A high acceptance rate in a weak team reflects undiscriminating approval of whatever the model produces. The number looks the same. The outcome differs dramatically.
The METR 2025 study surfaced a parallel problem. Experienced open source developers using AI coding tools in mature repositories they already knew well took 19 percent longer to complete tasks than they did without AI, even though they self-reported feeling faster. Source: METR, Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity. The developers were not being lazy. On the reading offered here, they were spending time reviewing, correcting, and integrating output that required rework, and paying a cognitive cost they could not see. The felt experience looked fine. Measured delivery speed did not, and an acceptance-rate dashboard would not have caught the gap.
Two metrics replace acceptance rate as a quality signal. The first is post-merge modification rate: the percentage of AI-assisted commits that required significant modification within the following sprint. Track this by tagging commits at merge time with an AI-assisted flag, then comparing the modification history at 14 and 30 days. Commits that required more than 20 percent rework signal that the original acceptance was premature. This is a lagging indicator, but it is honest where acceptance rate is not.
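A rough sketch of the post-merge modification rate, assuming the AI-assisted flag and the per-window rework counts have already been extracted from your version control history. The `Commit` fields and the sample values are hypothetical; only the 20 percent threshold and the 14 and 30 day windows come from the text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Commit:
    sha: str
    ai_assisted: bool             # tagged at merge time
    lines_added: int
    lines_reworked: dict          # window in days -> lines changed since merge


def post_merge_modification_rate(
    commits: list[Commit], window_days: int, threshold: float = 0.20
) -> Optional[float]:
    """Share of AI-assisted commits reworked beyond the threshold in the window."""
    ai_commits = [c for c in commits if c.ai_assisted and c.lines_added > 0]
    if not ai_commits:
        return None
    flagged = [
        c
        for c in ai_commits
        if c.lines_reworked.get(window_days, 0) / c.lines_added > threshold
    ]
    return len(flagged) / len(ai_commits)


commits = [
    Commit("a1", True, 100, {14: 10, 30: 25}),
    Commit("b2", True, 50, {14: 15, 30: 15}),
    Commit("c3", False, 80, {14: 40, 30: 60}),   # not AI-assisted, ignored
    Commit("d4", True, 200, {14: 0, 30: 30}),
]
rate_14 = post_merge_modification_rate(commits, 14)
rate_30 = post_merge_modification_rate(commits, 30)
```

Comparing the two windows shows whether rework arrives immediately or accumulates over the following sprint.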
The second is adversarial review rate: the percentage of AI-assisted code that goes through explicit review before merge, where the reviewer's job is to find problems rather than approve quickly. This is a process metric, not an output metric, but it is a leading indicator of whether your team has built review habits that can catch AI errors before they ship. Track both together. High post-merge modification rate with low adversarial review rate tells you the team is accepting too fast and discovering the cost downstream.
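Reading the two numbers together can be sketched as follows. The cut-off values are placeholders chosen only for illustration; the text gives no thresholds, so each team would need to set its own from its history.

```python
from typing import Optional


def adversarial_review_rate(reviewed: int, merged: int) -> Optional[float]:
    """Share of AI-assisted changes that got an explicit find-the-problems review."""
    return reviewed / merged if merged else None


def accepting_too_fast(
    modification_rate: float,
    review_rate: float,
    high_modification: float = 0.30,  # placeholder threshold, not from the article
    low_review: float = 0.50,         # placeholder threshold, not from the article
) -> bool:
    """True when rework is high while adversarial review is rare."""
    return modification_rate >= high_modification and review_rate < low_review


review_rate = adversarial_review_rate(reviewed=12, merged=60)
warning = accepting_too_fast(modification_rate=0.40, review_rate=review_rate)
```

The function deliberately encodes only the one combination described above. Other combinations may be worth interpreting, but that is a judgment for the team, not something this sketch decides.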
Deployment frequency was honest in 2018
Feature flags decoupled deploys from releases years ago. A team can deploy fifty times a day and ship one customer-visible change per week. The metric was honest in 2018. In 2026, it is theater.
Deployment frequency was one of the four key DORA metrics when Nicole Forsgren, Jez Humble, and Gene Kim published Accelerate in 2018. It was a legitimate signal at the time: teams with high deployment frequency had built the automation, testing, and operational confidence required to deploy safely and often. The frequency was evidence of capability.
The tooling shift changed the relationship. Feature flags, trunk-based development with short-lived branches, and zero-downtime deployment patterns decoupled the act of deploying from the act of releasing. A modern team with good feature flagging infrastructure can deploy unfinished work safely. They should, because it reduces merge conflicts and keeps main releasable. But their deployment frequency is high because their deployment architecture is sound, not because their delivery throughput is high. The metric captures the first and claims to capture the second.
AI-generated code accelerates the gap further. A team that ships AI-generated code frequently may have high deployment frequency and high change failure rate simultaneously: deploying often, discovering problems often, rolling back often. Deployment frequency says "elite performer." Change failure rate says "process is broken." The two metrics together are what matters. Deployment frequency alone is meaningless.
The Foundations Framework's Signal Integrity pillar prescribes triangulation: no single metric is trusted alone, every claim needs at least two independent signals that measure from different vantage points.
For delivery throughput, triangulate deployment frequency against two other signals. The first is customer-visible change frequency: the rate at which features actually become available to users, meaning their feature flags have been fully rolled out, not merely that the code has been deployed. In most organizations with feature flags, this is one half to one fifth of the raw deployment frequency. The ratio of customer-visible changes to deployments is itself a useful signal: a ratio below 0.3 suggests the deployment pipeline is fast but the release decision process is slow, which is a different problem from deployment capability.
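The ratio is simple arithmetic. A short sketch, assuming you can count deployments and fully released changes over the same period. The function names are hypothetical, and the five working days per week in the example is an assumption; the 0.3 threshold is the one given above.

```python
from typing import Optional


def release_ratio(customer_visible_changes: int, deployments: int) -> Optional[float]:
    """Customer-visible changes per deployment over the same period."""
    if deployments == 0:
        return None
    return customer_visible_changes / deployments


def release_process_is_bottleneck(ratio: float, threshold: float = 0.3) -> bool:
    """Below the threshold, deploys are fast but release decisions lag."""
    return ratio < threshold


# The team from the opening: fifty deploys a day, one visible change a week.
# Assumes a five-day working week, so 250 deployments per week.
weekly_ratio = release_ratio(customer_visible_changes=1, deployments=50 * 5)
bottleneck = release_process_is_bottleneck(weekly_ratio)
```

For that team the ratio is 0.004, far below the threshold, which points at the release decision process rather than at deployment capability.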
The second is escaped defect rate: the number of defects discovered by customers per unit of delivered change. High deployment frequency with rising escaped defects is not elite performance. It is a signal that the testing and review gates are not keeping pace with the deployment cadence. That condition is far more common in AI-assisted development than in traditional development, because AI generates code faster than human review processes were designed to evaluate.
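A sketch of the escaped defect rate and the warning condition described above, assuming per-period counts are available from your deployment log and support tracker. The `Period` record and the sample figures are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Period:
    label: str
    deployments: int
    delivered_changes: int          # customer-visible changes released
    customer_found_defects: int     # defects reported by customers


def escaped_defect_rate(period: Period) -> Optional[float]:
    """Customer-discovered defects per unit of delivered change."""
    if period.delivered_changes == 0:
        return None
    return period.customer_found_defects / period.delivered_changes


def gates_falling_behind(previous: Period, current: Period) -> bool:
    """True when deployments rose and the escaped defect rate rose with them."""
    prev_rate = escaped_defect_rate(previous)
    curr_rate = escaped_defect_rate(current)
    if prev_rate is None or curr_rate is None:
        return False
    return current.deployments > previous.deployments and curr_rate > prev_rate


q1 = Period("Q1", deployments=400, delivered_changes=20, customer_found_defects=4)
q2 = Period("Q2", deployments=900, delivered_changes=24, customer_found_defects=9)
falling_behind = gates_falling_behind(q1, q2)
```

In this made-up sample, deployments more than double while the escaped defect rate climbs from 0.2 to 0.375, which is the pattern a deployment frequency chart alone would present as improvement.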
The corrective principle: measurement must outpace the tools that made it obsolete
Each metric was honest in the previous generation of tools. Now it is theater.
The pattern repeats every five to seven years, aligned with major tooling shifts. Lines of code. Story points. Velocity. Each became theater not through deliberate gaming but through tooling evolution that the metric definition did not track. Tokens and acceptance rate under AI are the current iteration.
The corrective principle is that measurement must outpace the tools. Metric definitions drift silently as tooling generations change. Without a quarterly audit of what new tools have made misleading, the dashboards report numbers that no longer measure what their labels claim. Engineering leaders then make investment decisions on instrumentation that has not been audited. The decisions feel data-driven. They are not.
The Foundations Framework makes this audit explicit. The Signal Integrity pillar requires that the measurement stack is reviewed every quarter for what new tools have made misleading. Not because the organization is measuring wrong, but because the tools keep changing. A measurement stack that does not audit itself will be obsolete within twelve months of any major tooling shift. AI tooling is a major tooling shift.
The replacement for single-metric thinking is triangulation across three independent measurement types. Behavioral signals measure what people actually do: paved road compliance under pressure, code review patterns, prompt quality indicators. Outcome signals measure what the system produces: customer-visible change frequency, escaped defect rate, mean time to restore, senior engineer attrition. Signal integrity checks verify that the measurement instruments themselves are honest: do two engineers measuring the same sprint with the same data produce the same number?
No single metric from any category is trusted alone. Every claim requires at least two independent signals. Metrics are defined in observable behavior, not in self-report or vendor telemetry. The measurement stack is audited every quarter for what new tools have made misleading.
How to run a two-hour audit of your current measurement stack
The audit takes two hours. Run it at the start of your next quarterly platform review.
Take every metric on your current platform dashboard. For each metric, ask three questions. First: when was the definition of this metric last reviewed against current tooling? If the answer is "we've always measured it this way," the metric is a candidate for obsolescence review. Second: could a team optimize this metric without improving the underlying outcome it claims to measure? If the answer is yes, the metric is gameable and may already be gamed. Third: what is the second independent signal that triangulates this metric? If there is no second signal, the metric is a single point of measurement failure.
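The three questions can be captured as a small checklist so the audit produces a written record. This is one possible sketch; the `MetricReview` type, the finding labels, and the sample answers are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MetricReview:
    name: str
    reviewed_against_current_tooling: bool   # question 1
    optimizable_without_outcome: bool        # question 2
    second_signal: Optional[str]             # question 3, None if there is no pair


def audit(metric: MetricReview) -> list[str]:
    """Return the audit questions this metric fails."""
    findings = []
    if not metric.reviewed_against_current_tooling:
        findings.append("candidate for obsolescence review")
    if metric.optimizable_without_outcome:
        findings.append("gameable, may already be gamed")
    if metric.second_signal is None:
        findings.append("single point of measurement failure")
    return findings


# Sample answers for illustration only.
dashboard = [
    MetricReview("deployment frequency", False, True, "customer-visible change frequency"),
    MetricReview("tokens consumed", False, True, None),
    MetricReview("escaped defect rate", True, False, "deployment frequency"),
]
report = {m.name: audit(m) for m in dashboard}
needs_attention = [name for name, findings in report.items() if findings]
```

A metric with findings is not deleted. It is queued for a definition review or paired with a second signal.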
Most platform teams, running this audit for the first time, find that three to five of their dashboard metrics fail at least one of these questions. The goal is not to remove them. The goal is to pair each one with a second signal that catches the cases the primary metric misses.
The measurement trap is not that teams measure wrong. It is that they stop asking whether their measurements are still measuring what they think they are.
Related reading
- The AI amplifier — what DORA 2025 is actually telling you about your platform — AI amplifies the measurement problems described here: weak signal integrity on a weak platform produces compounding failures, not just misleading dashboards.
- Why DORA metrics are lying to you — signal integrity in platform engineering — the Signal Integrity pillar that this post's corrective principle is built on, explained in full.
- Counting AI tokens is the new counting commits — the same theater problem applied specifically to AI productivity dashboards.
- Developer experience measurement — beyond survey scores — three operational signals that fill the gaps that single-metric dashboards leave.
Frequently asked questions
Why does high AI acceptance rate indicate a problem rather than high productivity?
Because the directional interpretation of acceptance rate flipped when the context changed from code review to AI suggestions. In code review, a high acceptance rate meant reviewers trusted the quality of the work submitted. In AI suggestions, a high acceptance rate means developers accepted suggestions without sufficient scrutiny. The DORA 2025 research found that AI amplifies existing engineering conditions — strong teams use it to produce better output, weak teams use it to produce more output. A high acceptance rate in a weak team reflects undiscriminating approval. The number looks the same as a strong team's number. The outcome differs dramatically.
What should replace deployment frequency as a delivery throughput metric?
Triangulate deployment frequency against customer-visible change frequency and escaped defect rate. Customer-visible change frequency is the rate at which features actually become available to users, meaning their feature flags have been fully rolled out, not merely that the code has been deployed. In organizations with feature flags, this is typically one half to one fifth of the raw deployment frequency. The ratio of customer-visible changes to deployments is itself a signal: a ratio below 0.3 suggests deployment capability is strong but the release decision process is slow. Escaped defect rate (defects discovered by customers per unit of delivered change) catches the case where high deployment frequency coexists with deteriorating quality.
What is a quarterly measurement stack audit and how does it work?
Take every metric on your current platform dashboard and ask three questions about each one. First: when was this metric's definition last reviewed against current tooling — if the answer is "we've always measured it this way," it is a candidate for obsolescence review. Second: could a team optimize this metric without improving the underlying outcome it claims to measure — if yes, it is gameable and may already be gamed. Third: what is the second independent signal that triangulates this metric — if there is none, it is a single point of measurement failure. Most platform teams find three to five metrics that fail at least one question on their first audit.
If you want to audit your platform's measurement stack against the Signal Integrity pillar of the Foundations Framework, the Foundations Assessment runs a structured four-to-six-week diagnostic that includes a quarterly measurement audit as a deliverable. The free Platform Score gives you a fifteen-minute read across all five pillars, including Signal Integrity.

Mat Caniglia
Founder of Clouditive. 18+ years transforming engineering organizations across LATAM and globally through Developer Experience consulting.