Developer Experience · 15 min read · April 17, 2026

Counting AI Tokens Is the New Counting Commits

Token usage is an infrastructure cost, not a productivity metric. Here's why most AI measurement frameworks are broken, and what to track instead.


Your CFO asks you to demonstrate ROI on the AI tooling budget. You have been paying for GitHub Copilot licenses for eight months, plus a ChatGPT Teams subscription, plus a handful of engineers experimenting with Claude on their own. The CFO wants a number. You go back to your desk and realize you have no idea what to report that would actually mean something.

So you open the GitHub Copilot dashboard. You see token usage. Lines of code accepted. Acceptance rate. You build a slide. You present it. The CFO nods and moves on.

Three months later, your best platform engineer hands in their notice. They tell you the tooling is fine but the deployment process is still broken and AI has not fixed that. Lead time for a production change is still four days. Nobody knows if the AI tools are helping or not, including them.

This is the state of AI productivity measurement at most engineering organizations right now. The vendors give you activity metrics because activity metrics are easy to instrument. Activity metrics are also, with rare exception, useless for making decisions.

We have been here before.

The Last Time We Did This

In 2012, a generation of engineering managers discovered that they could see how many commits each developer made per week. Some of them started tracking it. Some of them started managing to it. A small but notable subset started putting commit counts in performance reviews.

You know how this story ends. Engineers started committing more frequently to inflate their numbers. Some teams abandoned feature branches and committed directly to main ten times a day, not because the team had adopted trunk-based development thoughtfully, but because it made the metric look better. Some developers started splitting single logical changes into five separate commits. None of this made software better. Some of it made software worse.

The lesson was supposed to be that measuring activity does not measure outcomes. Commits are not value. Lines of code are not value. What matters is whether the software is working, getting to production faster, and failing less often.

We apparently did not learn it, because we are doing the same thing with AI tokens.

Why Tokens Are an Infrastructure Cost

A token is the unit of computation in a large language model. When a developer asks Copilot to complete a function, the model processes tokens and returns tokens. When you pay for an AI tool that runs on your infrastructure, you are buying compute. Tokens are what you pay for.

Monitoring token usage is reasonable for one specific purpose: controlling your cloud bill. If your AI infrastructure costs are growing faster than your headcount, you want to know where the consumption is happening. That is a finance and infrastructure question, not a productivity question.

Trying to use token consumption as a proxy for productivity is like trying to use CPU cycles as a proxy for engineering output. A developer who runs an inefficient query in a database loop is consuming more compute than a developer who writes a clean query. More compute, worse outcome. The measure moves in the wrong direction.
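The compute analogy can be made concrete with the classic N+1 query pattern. A minimal sketch using an in-memory SQLite database with a hypothetical schema: both functions return the same answer, but the first issues one query per row while the second does it in a single join. More compute, same outcome.

```python
import sqlite3

# Hypothetical schema for illustration: customers and their orders.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 2);
""")

def customer_per_order_slow(conn):
    # The inefficient loop: one query for the orders, then one more
    # query per order -- N+1 round trips for the same answer.
    orders = conn.execute(
        "SELECT customer_id FROM orders ORDER BY id"
    ).fetchall()
    return [
        conn.execute(
            "SELECT name FROM customers WHERE id = ?", (cid,)
        ).fetchone()[0]
        for (cid,) in orders
    ]

def customer_per_order_fast(conn):
    # One clean join: identical result, a fraction of the round trips.
    rows = conn.execute("""
        SELECT c.name FROM orders o
        JOIN customers c ON c.id = o.customer_id
        ORDER BY o.id
    """).fetchall()
    return [name for (name,) in rows]

customer_per_order_slow(conn) == customer_per_order_fast(conn)  # same output, different cost
```

The looped version would look "busier" on any activity dashboard, which is exactly the problem.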

The same logic applies to tokens. A developer who prompts an AI tool five times to coax out a mediocre function is consuming more tokens than a developer who writes it directly. A developer who uses AI to generate a test suite and commits it without reading a line is consuming tokens with impressive efficiency. Whether either of those developers is being productive is a completely different question.

The Problem Nobody Wants to Say Out Loud

Here is the part that should make procurement teams and finance departments uncomfortable: you cannot tell how much of your AI tool usage is work.

When you give an engineer access to GitHub Copilot, they use it at work. When they use it at work, they also sometimes use it for things that are not directly related to their job. They write a script to organize their personal photo library. They ask it to help draft a performance review template. They use it to understand a concept in a programming language they are learning for personal interest. Some of them use it after hours for personal projects.

None of this is scandalous. It is how people use tools. The same thing has always been true of Google search, Stack Overflow, and any other information tool that lives in the browser. Nobody has ever tried to measure the ratio of work-related to personal Stack Overflow queries per engineer per day, and the reason is that it would be both impossible and beside the point.

But the moment you start treating tokens as a productivity signal, you are implicitly assuming that more tokens means more work output. And since you cannot separate the tokens your engineer used to debug a production issue from the tokens they used to ask the model what the capital of Uruguay is, you are measuring something you cannot actually observe.

The measurement is not just imprecise. It is not measuring the thing you think it is measuring.

What Actually Changed When You Gave Everyone AI Tools

Let me offer you a different frame. Instead of asking "how many tokens did my team use," ask "what happened to lead time after we rolled out AI tools?"

If lead time for changes went from four days to two days in the six months after your Copilot rollout, you have evidence that AI is helping. You do not know exactly why. It might be faster code generation, better test coverage, faster code review because the reviewer spends less time on style issues. It does not matter. The outcome moved.

If lead time stayed the same, or got worse, that is also evidence. Either AI is not being used effectively, or AI is helping with individual tasks but the bottleneck was never individual tasks. It was handoffs, approval processes, environment provisioning, or something else that AI cannot touch.

If your change failure rate increased after AI rollout, that is a specific and concerning signal. It suggests that AI-assisted code is getting to production faster but arriving less stable. This is one of the genuine risks of AI in development that deserves serious attention. The DORA 2025 research found that AI adoption tends to amplify existing patterns. Organizations with strong engineering practices get faster and more stable with AI. Organizations with weak practices get faster and less stable.

That finding should terrify every engineering leader who is measuring token counts instead of DORA metrics.

AI Is an Amplifier

This is the most important thing I can say about AI productivity, and it is also the most uncomfortable: AI does not fix broken processes. It makes them faster.

If your code review process is a rubber stamp where reviewers approve PRs without reading them carefully, AI will help engineers write PRs faster, and those PRs will get rubber-stamped faster, and more broken code will get to production faster. The velocity metric goes up. The quality metric goes down.

If your deployment process involves four manual handoffs and a change approval board that meets on Tuesdays, AI will help engineers finish their code faster, and that code will sit in the deployment queue for the same amount of time it always did, and lead time will not improve at all. Engineers will be frustrated because they finished their work on Thursday and it is still not deployed the following Wednesday.

If your on-call rotation is burning people out because you have no SLOs and no error budget and the pager fires for everything, AI will not fix that. Engineers will be slightly more productive during business hours and just as exhausted by the on-call burden as they always were.

The question is not whether AI is making your developers more productive. The question is whether your platform is capable of converting developer productivity into business outcomes. Those are different questions. Only the second one matters for your organization.

The SPACE Framework and Why Activity Is One of Five Dimensions

The SPACE framework, developed by researchers at GitHub and Microsoft, defines developer productivity across five dimensions: Satisfaction and wellbeing, Performance, Activity, Communication and collaboration, and Efficiency and flow.

Activity is one of those five dimensions. Token usage, if you squint, is a sub-signal of activity. The researchers who built SPACE were explicit that no single dimension captures productivity, and that organizations that optimize for activity metrics at the expense of the others tend to make things worse, not better.

The most useful dimensions for measuring AI impact are efficiency and flow, and satisfaction. Is the developer's inner loop faster? Are they spending less time on boilerplate and more time on the hard problems? Do they report feeling more effective? Are they interrupted less often?

Those questions require surveys, conversation, and longitudinal observation. They are harder than pulling a token count from a dashboard. They are also the only questions that produce actionable answers.

Satisfaction matters for a reason most engineering leaders do not fully weight: the developers who benefit most from AI tools are often your senior engineers, because they have enough context to evaluate AI outputs critically and catch the errors. Junior engineers frequently use AI tools with less discrimination, accept worse suggestions, and sometimes learn bad habits from them. If your AI rollout has helped your principal engineers and slightly degraded the growth trajectory of your junior engineers, you will not see that in token counts. You might see it in code review comments, PR cycle time, and six-month retention.

What Good Measurement Actually Looks Like

If you want to know whether AI is helping your engineering organization, here is what to track.

Start with DORA. Measure deployment frequency, lead time for changes, change failure rate, and mean time to recovery before the AI rollout, and compare them six months later. These are not AI metrics. They are engineering health metrics that will tell you whether AI adoption is having a positive or negative effect on the outcomes that matter to the business. If DORA metrics improve, AI is probably helping. If they degrade, AI might be making things faster in the wrong direction.
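The before/after snapshot can be sketched in a few lines. This assumes you can export one record per production deployment from your CD system; the field names (`committed`, `deployed`, `failed`, `restored`) are illustrative, not any vendor's API.

```python
from datetime import datetime
from statistics import median

def dora_snapshot(deploys, period_days):
    """Compute the four DORA metrics from deployment records."""
    failures = [d for d in deploys if d["failed"]]
    return {
        "deploys_per_week": round(len(deploys) / (period_days / 7), 2),
        "lead_time_days": median(
            (d["deployed"] - d["committed"]).days for d in deploys
        ),
        "change_failure_rate": round(len(failures) / len(deploys), 2),
        "mttr_hours": median(
            (d["restored"] - d["deployed"]).total_seconds() / 3600
            for d in failures
        ) if failures else None,
    }

# Two illustrative deployments over a two-week window:
deploys = [
    {"committed": datetime(2026, 1, 5), "deployed": datetime(2026, 1, 9),
     "failed": False, "restored": None},
    {"committed": datetime(2026, 1, 10), "deployed": datetime(2026, 1, 14),
     "failed": True, "restored": datetime(2026, 1, 14, 6)},
]
snapshot = dora_snapshot(deploys, period_days=14)
```

Run it once on the pre-rollout window and once on the post-rollout window, and compare the two dictionaries rather than any token count.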

Track PR cycle time separately from lead time. This isolates the code review and revision loop, which is one of the areas where AI tools have the clearest impact. If your developers are writing better first drafts and reviewers are spending less time on style and syntax issues, you should see PR cycle time decrease. If it stays the same, the AI is not changing behavior in code review.
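Isolating that loop is a small calculation, assuming you can export PR opened and merged timestamps from your Git host's API. A sketch with illustrative data:

```python
from datetime import datetime
from statistics import median

def pr_cycle_time_hours(prs):
    """Median hours from PR opened to PR merged."""
    return median(
        (merged - opened).total_seconds() / 3600
        for opened, merged in prs
    )

# Hypothetical (opened, merged) pairs: 24h, 36h, and 12h cycles.
prs = [
    (datetime(2026, 3, 2, 9), datetime(2026, 3, 3, 9)),
    (datetime(2026, 3, 4, 10), datetime(2026, 3, 5, 22)),
    (datetime(2026, 3, 9, 8), datetime(2026, 3, 9, 20)),
]
pr_cycle_time_hours(prs)  # -> 24.0
```

Tracking the median rather than the mean keeps one stuck PR from drowning out the trend.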

Measure time to first working deployment for new service scaffolding, if you have golden paths. If your IDP provides a scaffolding template and AI tools help engineers customize it, you should be able to measure how long it takes from "new service ticket" to "first successful deployment to staging." That is a clean signal that AI is compressing the inner loop in a meaningful way.

Survey your engineers. Ask them quarterly: do you feel more or less effective than six months ago? Are you spending more time on creative problem-solving and less time on routine tasks? Are you confident in the code you are producing with AI assistance? Those subjective signals, aggregated across your team, are a leading indicator of whether the tooling is actually improving their experience.
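Aggregating those answers does not need tooling beyond a spreadsheet, but a sketch makes the point: score each question on a 1-to-5 scale (the scale is an assumption, the question wording comes from above) and watch the quarter-over-quarter shift.

```python
from statistics import mean

def pulse_delta(previous_quarter, current_quarter):
    """Quarter-over-quarter shift in mean Likert score (1-5 scale)."""
    return round(mean(current_quarter) - mean(previous_quarter), 2)

# Hypothetical answers to "do you feel more effective than six months ago?"
q1_answers = [4, 3, 5, 2, 4, 4]
q2_answers = [4, 4, 5, 3, 4, 5]

pulse_delta(q1_answers, q2_answers)  # -> 0.5
```

The absolute number matters less than the direction: a sustained downward drift after an AI rollout is a signal no dashboard will give you.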

Track AI tool adoption by tenure and role, not in aggregate. If senior engineers use Copilot frequently and junior engineers use it rarely, that is a training problem. If junior engineers use it constantly and your PR rejection rate for junior engineers has gone up, that is a different kind of problem. Aggregate usage numbers hide the distribution that actually tells you what is happening.
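A sketch of that segmentation, assuming you can join tool-usage exports with tenure bands from your HR system; every field name and figure here is hypothetical:

```python
from collections import defaultdict

# One row per engineer: tenure band, weekly AI sessions, PR rejection rate.
engineers = [
    {"band": "junior", "ai_sessions": 38, "pr_reject_rate": 0.22},
    {"band": "junior", "ai_sessions": 41, "pr_reject_rate": 0.18},
    {"band": "senior", "ai_sessions": 25, "pr_reject_rate": 0.05},
    {"band": "senior", "ai_sessions": 9,  "pr_reject_rate": 0.04},
]

def by_cohort(rows):
    """Average usage and PR rejection rate per tenure band."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["band"]].append(row)
    return {
        band: {
            "avg_ai_sessions": sum(r["ai_sessions"] for r in g) / len(g),
            "avg_pr_reject_rate": sum(r["pr_reject_rate"] for r in g) / len(g),
        }
        for band, g in groups.items()
    }

cohorts = by_cohort(engineers)
```

In this invented data, juniors use the tool twice as much as seniors while their PRs get rejected four times as often, which is exactly the distribution an aggregate number would hide.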

The Honest Conversation About ROI

At some point, someone will ask you to put a dollar figure on your AI investment. Here is how to have that conversation without making up numbers.

If lead time for changes is down by a meaningful amount, calculate the engineering cost of that time reduction. If your team ships features thirty percent faster on average, that is thirty percent more capacity at the same headcount. Calculate what that capacity would cost if you had to hire for it. That is your ROI floor, not your ROI ceiling, because it does not account for compounding effects or quality improvements.
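That floor is a back-of-envelope calculation. A sketch, where team size, loaded cost per engineer, and the measured capacity gain are all assumed inputs you supply:

```python
def roi_floor(team_size, loaded_cost_per_engineer, capacity_gain):
    """Conservative ROI floor: the hiring cost of the capacity you gained.

    capacity_gain is a fraction, e.g. 0.30 for thirty percent.
    """
    equivalent_hires = team_size * capacity_gain
    return equivalent_hires * loaded_cost_per_engineer

# 20 engineers at an assumed $180k loaded cost, shipping 30% faster:
roi_floor(20, 180_000, 0.30)  # -> 1_080_000.0
```

It is deliberately simple: no compounding, no quality effects, no retention effects. That is why it is a floor.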

If change failure rate is stable or improved despite faster velocity, that has value too. Every production incident costs engineering time, customer trust, and in regulated industries, potential compliance exposure. A team that ships faster and fails less often is generating compounding returns.

What you cannot honestly tell your CFO is that token count times X equals productivity, because no such equation exists. The vendors would like you to believe it does because it makes their tools look good in a spreadsheet. It does not reflect how engineering work actually produces value.

A Few Things Worth Saying to Make People Uncomfortable

I am going to say a few things plainly, because they get debated in vague language when they should be stated directly.

Monitoring individual developer AI usage for performance evaluation purposes is a mistake. Not primarily for privacy reasons, though those matter. It is a mistake because the signal is garbage. You cannot evaluate performance from token counts, and if you try, you will create the same gaming behavior that commit counting created, except faster and harder to detect.

Restricting AI tool access to control costs is often the wrong tradeoff. The platforms that let engineers use AI in their natural workflow, with reasonable guardrails around proprietary data, consistently outperform the platforms that apply per-seat restrictions or block certain tools. The cost of a GitHub Copilot seat is a rounding error compared to the cost of a senior engineer spending two extra days on a feature.

The organizations that will get the most from AI are the ones that have already done the platform engineering work. If your deployment pipeline is reliable, your observability is solid, your golden paths are real, and your developers trust the platform, AI will make all of that faster and better. If none of those things are true, AI will accelerate the chaos.

That last point is the one that should drive your priorities. Not the AI rollout. The foundation that makes the AI rollout worth doing.

Where to Go From Here

If you want to understand your engineering health before drawing conclusions about AI impact, the starting point is a baseline. DORA metrics, developer satisfaction, PR cycle time, time-to-production for new services. Get that baseline before you run the six-month comparison.

If you are trying to make sense of where your platform stands across all five pillars, the Free Platform Score gives you a radar across Platform Foundations, Delivery Engineering, Reliability, Observability, and Developer Experience. It takes fifteen minutes and the full report is free. The results will tell you whether AI is likely to help you or amplify your current problems.

And if after reading this you are realizing that the platform foundation is what needs attention, not the AI dashboard, our Foundations Assessment engagement is where that conversation starts. Structured phases, baseline before intervention, outcomes you own. The same approach applies whether you are thinking about AI tooling, an IDP, or improving DORA metrics.

The token dashboard will still be there. It will just stop being mistaken for something it is not.

AI in Engineering · Developer Productivity · DORA Metrics · Engineering Leadership · Developer Experience


Matías Caniglia


Founder of Clouditive. 18+ years transforming engineering organizations across LATAM and globally through Developer Experience consulting.


