Developer Experience · 15 min read · April 17, 2026

Counting AI Tokens Is the New Counting Commits

Token usage is an infrastructure cost, not a productivity metric. Here's why most AI measurement frameworks are broken, and what to track instead.


Your CFO asks you to demonstrate ROI on the AI tooling budget. You have been paying for GitHub Copilot licenses for eight months, plus a ChatGPT Teams subscription, plus a handful of engineers experimenting with Claude on their own. The CFO wants a number. You go back to your desk and realize you have no idea what to report that would actually mean something.

So you open the GitHub Copilot dashboard. You see token usage. Lines of code accepted. Acceptance rate. You build a slide. You present it. The CFO nods and moves on.

Three months later, your best platform engineer hands in their notice. They tell you the tooling is fine but the deployment process is still broken and AI has not fixed that. Lead time for a production change is still four days. Nobody knows if the AI tools are helping or not, including them.

This is the state of AI productivity measurement at most engineering organizations right now. The vendors give you activity metrics because activity metrics are easy to instrument. Activity metrics are also, with rare exception, useless for making decisions.

We have been here before.

The Last Time We Did This

In 2012, a generation of engineering managers discovered that they could see how many commits each developer made per week. Some of them started tracking it. Some of them started managing to it. A small but notable subset started putting commit counts in performance reviews.

You know how this story ends. Engineers started committing more frequently to inflate their numbers. Some teams abandoned feature branches and committed directly to main ten times a day, not because the team had adopted trunk-based development thoughtfully, but because it made the metric look better. Some developers started splitting single logical changes into five separate commits. None of this made software better. Some of it made software worse.

The lesson was supposed to be that measuring activity does not measure outcomes. Commits are not value. Lines of code are not value. What matters is whether the software is working, getting to production faster, and failing less often.

We apparently did not learn it, because we are doing the same thing with AI tokens.

Why Tokens Are an Infrastructure Cost

A token is the unit of computation in a large language model. When a developer asks Copilot to complete a function, the model processes tokens and returns tokens. When you pay for an AI tool that runs on your infrastructure, you are buying compute. Tokens are what you pay for.

Monitoring token usage is reasonable for one specific purpose: controlling your cloud bill. If your AI infrastructure costs are growing faster than your headcount, you want to know where the consumption is happening. That is a finance and infrastructure question, not a productivity question.

Trying to use token consumption as a proxy for productivity is like trying to use CPU cycles as a proxy for engineering output. A developer who runs an inefficient query in a database loop is consuming more compute than a developer who writes a clean query. More compute, worse outcome. The measure moves in the wrong direction.
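The compute analogy can be made concrete with the classic N+1 query pattern. A minimal sketch using an in-memory SQLite database with a hypothetical schema: both functions return the same answer, but the first issues one query per row while the second does it in a single join. More compute, same outcome.

```python
import sqlite3

# Hypothetical schema for illustration: customers and their orders.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 2);
""")

def customer_per_order_slow(conn):
    # The inefficient loop: one query for the orders, then one more
    # query per order -- N+1 round trips for the same answer.
    orders = conn.execute(
        "SELECT customer_id FROM orders ORDER BY id"
    ).fetchall()
    return [
        conn.execute(
            "SELECT name FROM customers WHERE id = ?", (cid,)
        ).fetchone()[0]
        for (cid,) in orders
    ]

def customer_per_order_fast(conn):
    # One clean join: identical result, a fraction of the round trips.
    rows = conn.execute("""
        SELECT c.name FROM orders o
        JOIN customers c ON c.id = o.customer_id
        ORDER BY o.id
    """).fetchall()
    return [name for (name,) in rows]

customer_per_order_slow(conn) == customer_per_order_fast(conn)  # same output, different cost
```

The looped version would look "busier" on any activity dashboard, which is exactly the problem.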

The same logic applies to tokens. A developer who prompts an AI tool five times to coax out a mediocre function is consuming more tokens than a developer who writes it directly. A developer who uses AI to generate a test suite and commits it without reading a line is consuming tokens with impressive efficiency. Whether either of those developers is being productive is a completely different question.

The Problem Nobody Wants to Say Out Loud

Here is the part that should make procurement teams and finance departments uncomfortable: you cannot tell how much of your AI tool usage is work.

When you give an engineer access to GitHub Copilot, they use it at work. When they use it at work, they also sometimes use it for things that are not directly related to their job. They write a script to organize their personal photo library. They ask it to help draft a performance review template. They use it to understand a concept in a programming language they are learning for personal interest. Some of them use it after hours for personal projects.

None of this is scandalous. It is how people use tools. The same thing has always been true of Google search, Stack Overflow, and any other information tool that lives in the browser. Nobody has ever tried to measure the ratio of work-related to personal Stack Overflow queries per engineer per day, and the reason is that it would be both impossible and beside the point.

But the moment you start treating tokens as a productivity signal, you are implicitly assuming that more tokens means more work output. And since you cannot separate the tokens your engineer used to debug a production issue from the tokens they used to ask the model what the capital of Uruguay is, you are measuring something you cannot actually observe.

The measurement is not just imprecise. It is not measuring the thing you think it is measuring.

What Actually Changed When You Gave Everyone AI Tools

Let me offer you a different frame. Instead of asking "how many tokens did my team use," ask "what happened to lead time after we rolled out AI tools?"

If lead time for changes went from four days to two days in the six months after your Copilot rollout, you have evidence that AI is helping. You do not know exactly why. It might be faster code generation, better test coverage, faster code review because the reviewer spends less time on style issues. It does not matter. The outcome moved.

If lead time stayed the same, or got worse, that is also evidence. Either AI is not being used effectively, or AI is helping with individual tasks but the bottleneck was never individual tasks. It was handoffs, approval processes, environment provisioning, or something else that AI cannot touch.

If your change failure rate increased after AI rollout, that is a specific and concerning signal. It suggests that AI-assisted code is getting to production faster but arriving less stable. This is one of the genuine risks of AI in development that deserves serious attention. The DORA 2025 research found that AI adoption tends to amplify existing patterns. Organizations with strong engineering practices get faster and more stable with AI. Organizations with weak practices get faster and less stable.

That finding should terrify every engineering leader who is measuring token counts instead of DORA metrics.

AI Is an Amplifier

This is the most important thing I can say about AI productivity, and it is also the most uncomfortable: AI does not fix broken processes. It makes them faster.

If your code review process is a rubber stamp where reviewers approve PRs without reading them carefully, AI will help engineers write PRs faster, and those PRs will get rubber-stamped faster, and more broken code will get to production faster. The velocity metric goes up. The quality metric goes down.

If your deployment process involves four manual handoffs and a change approval board that meets on Tuesdays, AI will help engineers finish their code faster, and that code will sit in the deployment queue for the same amount of time it always did, and lead time will not improve at all. Engineers will be frustrated because they finished their work on Thursday and it is still not deployed the following Wednesday.

If your on-call rotation is burning people out because you have no SLOs and no error budget and the pager fires for everything, AI will not fix that. Engineers will be slightly more productive during business hours and just as exhausted by the on-call burden as they always were.

The question is not whether AI is making your developers more productive. The question is whether your platform is capable of converting developer productivity into business outcomes. Those are different questions. Only the second one matters for your organization.

The SPACE Framework and Why Activity Is One of Five Dimensions

The SPACE framework, developed by researchers at GitHub and Microsoft, defines developer productivity across five dimensions: Satisfaction and wellbeing, Performance, Activity, Communication and collaboration, and Efficiency and flow.

Activity is one of those five dimensions. Token usage, if you squint, is a sub-signal of activity. The researchers who built SPACE were explicit that no single dimension captures productivity, and that organizations that optimize for activity metrics at the expense of the others tend to make things worse, not better.

The most useful dimensions for measuring AI impact are efficiency and flow, and satisfaction. Is the developer's inner loop faster? Are they spending less time on boilerplate and more time on the hard problems? Do they report feeling more effective? Are they interrupted less often?

Those questions require surveys, conversation, and longitudinal observation. They are harder than pulling a token count from a dashboard. They are also the only questions that produce actionable answers.

Satisfaction matters for a reason most engineering leaders do not fully weight: the developers who benefit most from AI tools are often your senior engineers, because they have enough context to evaluate AI outputs critically and catch the errors. Junior engineers frequently use AI tools with less discrimination, accept worse suggestions, and sometimes learn bad habits from them. If your AI rollout has helped your principal engineers and slightly degraded the growth trajectory of your junior engineers, you will not see that in token counts. You might see it in code review comments, PR cycle time, and six-month retention.

What Good Measurement Actually Looks Like

If you want to know whether AI is helping your engineering organization, here is what to track.

Start with DORA. Measure deployment frequency, lead time for changes, change failure rate, and mean time to recovery before the AI rollout, and compare them six months later. These are not AI metrics. They are engineering health metrics that will tell you whether AI adoption is having a positive or negative effect on the outcomes that matter to the business. If DORA metrics improve, AI is probably helping. If they degrade, AI might be making things faster in the wrong direction.
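The before/after snapshot can be sketched in a few lines. This assumes you can export one record per production deployment from your CD system; the field names (`committed`, `deployed`, `failed`, `restored`) are illustrative, not any vendor's API.

```python
from datetime import datetime
from statistics import median

def dora_snapshot(deploys, period_days):
    """Compute the four DORA metrics from deployment records."""
    failures = [d for d in deploys if d["failed"]]
    return {
        "deploys_per_week": round(len(deploys) / (period_days / 7), 2),
        "lead_time_days": median(
            (d["deployed"] - d["committed"]).days for d in deploys
        ),
        "change_failure_rate": round(len(failures) / len(deploys), 2),
        "mttr_hours": median(
            (d["restored"] - d["deployed"]).total_seconds() / 3600
            for d in failures
        ) if failures else None,
    }

# Two illustrative deployments over a two-week window:
deploys = [
    {"committed": datetime(2026, 1, 5), "deployed": datetime(2026, 1, 9),
     "failed": False, "restored": None},
    {"committed": datetime(2026, 1, 10), "deployed": datetime(2026, 1, 14),
     "failed": True, "restored": datetime(2026, 1, 14, 6)},
]
snapshot = dora_snapshot(deploys, period_days=14)
```

Run it once on the pre-rollout window and once on the post-rollout window, and compare the two dictionaries rather than any token count.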

Track PR cycle time separately from lead time. This isolates the code review and revision loop, which is one of the areas where AI tools have the clearest impact. If your developers are writing better first drafts and reviewers are spending less time on style and syntax issues, you should see PR cycle time decrease. If it stays the same, the AI is not changing behavior in code review.
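Isolating that loop is a small calculation, assuming you can export PR opened and merged timestamps from your Git host's API. A sketch with illustrative data:

```python
from datetime import datetime
from statistics import median

def pr_cycle_time_hours(prs):
    """Median hours from PR opened to PR merged."""
    return median(
        (merged - opened).total_seconds() / 3600
        for opened, merged in prs
    )

# Hypothetical (opened, merged) pairs: 24h, 36h, and 12h cycles.
prs = [
    (datetime(2026, 3, 2, 9), datetime(2026, 3, 3, 9)),
    (datetime(2026, 3, 4, 10), datetime(2026, 3, 5, 22)),
    (datetime(2026, 3, 9, 8), datetime(2026, 3, 9, 20)),
]
pr_cycle_time_hours(prs)  # -> 24.0
```

Tracking the median rather than the mean keeps one stuck PR from drowning out the trend.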

Measure time to first working deployment for new service scaffolding, if you have golden paths. If your IDP provides a scaffolding template and AI tools help engineers customize it, you should be able to measure how long it takes from "new service ticket" to "first successful deployment to staging." That is a clean signal that AI is compressing the inner loop in a meaningful way.

Survey your engineers. Ask them quarterly: do you feel more or less effective than six months ago? Are you spending more time on creative problem-solving and less time on routine tasks? Are you confident in the code you are producing with AI assistance? Those subjective signals, aggregated across your team, are a leading indicator of whether the tooling is actually improving their experience.
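Aggregating those answers does not need tooling beyond a spreadsheet, but a sketch makes the point: score each question on a 1-to-5 scale (the scale is an assumption, the question wording comes from above) and watch the quarter-over-quarter shift.

```python
from statistics import mean

def pulse_delta(previous_quarter, current_quarter):
    """Quarter-over-quarter shift in mean Likert score (1-5 scale)."""
    return round(mean(current_quarter) - mean(previous_quarter), 2)

# Hypothetical answers to "do you feel more effective than six months ago?"
q1_answers = [4, 3, 5, 2, 4, 4]
q2_answers = [4, 4, 5, 3, 4, 5]

pulse_delta(q1_answers, q2_answers)  # -> 0.5
```

The absolute number matters less than the direction: a sustained downward drift after an AI rollout is a signal no dashboard will give you.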

Track AI tool adoption by tenure and role, not in aggregate. If senior engineers use Copilot frequently and junior engineers use it rarely, that is a training problem. If junior engineers use it constantly and your PR rejection rate for junior engineers has gone up, that is a different kind of problem. Aggregate usage numbers hide the distribution that actually tells you what is happening.
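A sketch of that segmentation, assuming you can join tool-usage exports with tenure bands from your HR system; every field name and figure here is hypothetical:

```python
from collections import defaultdict

# One row per engineer: tenure band, weekly AI sessions, PR rejection rate.
engineers = [
    {"band": "junior", "ai_sessions": 38, "pr_reject_rate": 0.22},
    {"band": "junior", "ai_sessions": 41, "pr_reject_rate": 0.18},
    {"band": "senior", "ai_sessions": 25, "pr_reject_rate": 0.05},
    {"band": "senior", "ai_sessions": 9,  "pr_reject_rate": 0.04},
]

def by_cohort(rows):
    """Average usage and PR rejection rate per tenure band."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["band"]].append(row)
    return {
        band: {
            "avg_ai_sessions": sum(r["ai_sessions"] for r in g) / len(g),
            "avg_pr_reject_rate": sum(r["pr_reject_rate"] for r in g) / len(g),
        }
        for band, g in groups.items()
    }

cohorts = by_cohort(engineers)
```

In this invented data, juniors use the tool twice as much as seniors while their PRs get rejected four times as often, which is exactly the distribution an aggregate number would hide.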

The Honest Conversation About ROI

At some point, someone will ask you to put a dollar figure on your AI investment. Here is how to have that conversation without making up numbers.

If lead time for changes is down by a meaningful amount, calculate the engineering cost of that time reduction. If your team ships features thirty percent faster on average, that is thirty percent more capacity at the same headcount. Calculate what that capacity would cost if you had to hire for it. That is your ROI floor, not your ROI ceiling, because it does not account for compounding effects or quality improvements.
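That floor is a back-of-envelope calculation. A sketch, where team size, loaded cost per engineer, and the measured capacity gain are all assumed inputs you supply:

```python
def roi_floor(team_size, loaded_cost_per_engineer, capacity_gain):
    """Conservative ROI floor: the hiring cost of the capacity you gained.

    capacity_gain is a fraction, e.g. 0.30 for thirty percent.
    """
    equivalent_hires = team_size * capacity_gain
    return equivalent_hires * loaded_cost_per_engineer

# 20 engineers at an assumed $180k loaded cost, shipping 30% faster:
roi_floor(20, 180_000, 0.30)  # -> 1_080_000.0
```

It is deliberately simple: no compounding, no quality effects, no retention effects. That is why it is a floor.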

If change failure rate is stable or improved despite faster velocity, that has value too. Every production incident costs engineering time, customer trust, and in regulated industries, potential compliance exposure. A team that ships faster and fails less often is generating compounding returns.

What you cannot honestly tell your CFO is that token count times X equals productivity, because no such equation exists. The vendors would like you to believe it does because it makes their tools look good in a spreadsheet. It does not reflect how engineering work actually produces value.

A Few Things Worth Saying to Make People Uncomfortable

I am going to say a few things plainly, because they get debated in vague language when they should be stated directly.

Monitoring individual developer AI usage for performance evaluation purposes is a mistake. Not primarily for privacy reasons, though those matter. It is a mistake because the signal is garbage. You cannot evaluate performance from token counts, and if you try, you will create the same gaming behavior that commit counting created, except faster and harder to detect.

Restricting AI tool access to control costs is often the wrong tradeoff. The platforms that let engineers use AI in their natural workflow, with reasonable guardrails around proprietary data, consistently outperform the platforms that apply per-seat restrictions or block certain tools. The cost of a GitHub Copilot seat is a rounding error compared to the cost of a senior engineer spending two extra days on a feature.

The organizations that will get the most from AI are the ones that have already done the platform engineering work. If your deployment pipeline is reliable, your observability is solid, your golden paths are real, and your developers trust the platform, AI will make all of that faster and better. If none of those things are true, AI will accelerate the chaos.

That last point is the one that should drive your priorities. Not the AI rollout. The foundation that makes the AI rollout worth doing.

Where to Go From Here

If you want to understand your engineering health before drawing conclusions about AI impact, the starting point is a baseline. DORA metrics, developer satisfaction, PR cycle time, time-to-production for new services. Get that baseline before you run the six-month comparison.

If you are trying to make sense of where your platform stands across all five pillars, the Free Platform Score gives you a radar across Platform Foundations, Delivery Engineering, Reliability, Observability, and Developer Experience. It takes fifteen minutes and the full report is free. The results will tell you whether AI is likely to help you or amplify your current problems.

And if after reading this you are realizing that the platform foundation is what needs attention, not the AI dashboard, our Foundations Assessment engagement is where that conversation starts. Structured phases, baseline before intervention, outcomes you own. The same approach applies whether you are thinking about AI tooling, an IDP, or improving DORA metrics.

The token dashboard will still be there. It will just stop being mistaken for something it is not.

AI in Engineering · Developer Productivity · DORA Metrics · Engineering Leadership · Developer Experience


Matías Caniglia


Founder of Clouditive. 18+ years transforming engineering organizations across LATAM and globally through Developer Experience consulting.


