The Wrong Denominator

Ariel Agor

On May 28, 2026, Axios published a story that should have been wired into every board pack in America by the following Monday. An AI consultant described a client that spent half a billion dollars on Claude in a single month after the company failed to put usage limits on employee licenses. The same week, Uber's COO told staff that the company's AI bill was getting harder to justify. Microsoft, the single largest buyer of frontier-model capacity on earth, quietly cancelled most of its internal Claude Code seats and pushed engineers back to a homegrown stack. These stories shared one discovery. The productivity tool everyone bought has a meter on it. The meter does not stop. The meter does not care about your headcount budget.

This is the year measuring ROI on AI initiatives became the question that decides which AI strategies survive Q4 and which get rolled back in front of a furious board. The pilots ran. The dashboards lit up. Then the invoice arrived.

The Denominator That No Longer Holds

For thirty years, software ROI used a denominator built for a license model. You bought a seat. The seat had a price. You divided the value the seat produced by the price of the seat and shipped the answer to finance. Excel cost the same whether your analyst opened it once a month or all day. Salesforce cost the same whether your rep wrote one email or a hundred. Slack cost the same whether your engineer typed in it or left it open and forgot. The denominator was fixed. The numerator was the question.

AI does not work that way. The denominator moves every second the worker is at the keyboard. A token gets billed when the model reads the prompt. A token gets billed when the model writes the answer. A token gets billed when the agent thinks before it acts, when it reads a document to remember what it was doing, when it checks its own work, when it loops back because the first answer was wrong. The harder your team adopts the tool, the more tokens move. The more tokens move, the more the bill grows. The CFO who priced the pilot on a per-seat assumption did the worst math possible. They priced the smallest version of the future and called it the budget.

This is the meter problem, and it explains why so many AI ROI calculations show a positive return in March and a negative return by September. The pilot rolled out to forty users. Forty users built habits. Habits drove usage. Usage drove cost. Nothing else changed except the people got better at using the tool. The bill grew because the tool was working. The ROI fell because the company was measuring against the wrong denominator.
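The arithmetic behind the meter problem can be sketched in a few lines of Python. Every figure here (the seat price, the token price, the per-user token volumes, the monthly value) is a hypothetical placeholder chosen for illustration, not data from any company named in this piece, and the function names are invented for the example.

```python
def seat_cost(seats: int, price_per_seat: float) -> float:
    """Fixed denominator: cost is the same however heavily each seat is used."""
    return seats * price_per_seat


def metered_cost(users: int, tokens_per_user: float, price_per_million: float) -> float:
    """Moving denominator: cost scales with the tokens actually consumed."""
    return users * tokens_per_user / 1_000_000 * price_per_million


def roi(value: float, cost: float) -> float:
    """Classic return on investment: net gain divided by cost."""
    return (value - cost) / cost


# Hypothetical pilot: 40 users, with monthly value held constant on purpose.
MONTHLY_VALUE = 8_000.0
PRICE_PER_MILLION = 10.0  # assumed blended dollars per million tokens

flat = seat_cost(seats=40, price_per_seat=50.0)  # identical in every month
march = metered_cost(users=40, tokens_per_user=5_000_000, price_per_million=PRICE_PER_MILLION)
september = metered_cost(users=40, tokens_per_user=50_000_000, price_per_million=PRICE_PER_MILLION)

print(f"Seat-priced view: cost ${flat:,.0f}, ROI {roi(MONTHLY_VALUE, flat):+.0%}")
print(f"Metered, March:     cost ${march:,.0f}, ROI {roi(MONTHLY_VALUE, march):+.0%}")
print(f"Metered, September: cost ${september:,.0f}, ROI {roi(MONTHLY_VALUE, september):+.0%}")
```

Under these assumptions the headcount never changes and the value never changes. Only tokens per user moves, and that alone is enough to flip the return from positive to negative.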

The half-billion-dollar month in the Axios story was not an aberration. It was a CFO who priced the floor and forgot to model the ceiling. The model itself did nothing surprising. It answered questions. Engineers asked more questions. Reasoning models multiplied the token cost of each question by ten or twenty against the chat-only baseline. The bill found the ceiling the budget had ignored.

What the MIT Number Actually Means

The MIT NANDA report that circulated through every CFO LinkedIn feed last August found that ninety-five percent of generative AI pilots delivered no measurable impact on profit and loss. The number got read as proof that AI does not work. The report said something different. Drawing on three hundred case studies and several hundred interviews, the researchers argued that the gap was not in the model. It was in the learning loop. Tools that did not remember the user, did not adapt to the workflow, and did not get integrated into the systems where actual work happened produced no measurable change. Tools that did all three produced measurable change.

The follow-on number is the one that matters for measuring ROI on AI initiatives. Microsoft's Work Trend Index, covered by Fortune on May 11, 2026, found that organizational factors (culture, manager support, talent practices, redesigned workflows) accounted for sixty-seven percent of reported AI impact. Individual mindset and behavior accounted for thirty-two percent. The model itself, the choice of vendor, the price per million tokens, did not even show up in the variance. Two thirds of whether AI pays off has nothing to do with the AI.

A finance team that builds an ROI model around vendor selection is building it around the variable that explains the least. That is why so many enterprise AI dashboards say the pilot succeeded right up until the moment the executive who paid for it gets a Slack message asking why the budget is gone.

The Klarna Receipt

The cleanest public case study of measuring ROI on AI initiatives the wrong way is Klarna. In 2022, the company laid off around seven hundred employees, roughly ten percent of its workforce. In February 2024, CEO Sebastian Siemiatkowski announced the AI assistant had handled 2.3 million customer chats in thirty-five languages in its first month and was doing the work of seven hundred full-time agents. The story was the model story of the era. Every analyst note used it. Every consulting deck included it. The implied ROI was crushing. Heads went down forty percent. Volume held. The vendor relationship was a quiet OpenAI integration. The savings were claimed at $40 million annualized.

Two years later, Klarna was quietly rehiring. CSAT had degraded on complex tickets. About five percent of conversations contained errors the agent did not catch. The cost savings the original announcement projected did not fully land. By the time the company filed its IPO paperwork in 2025, Siemiatkowski had publicly walked the position back and admitted the cuts went too far. The hybrid model that emerged (AI for routine, humans for complex, escalation for emotional) was the model the customer service literature has recommended since the first call center metrics paper in the 1990s. Klarna did not invent it. They paid the cost of pretending they could skip the learning curve, then paid the cost of rebuilding what they had dismantled.

What did the original ROI math miss? It used the wrong unit. The dashboard measured the chat. What mattered was the lifetime value of the customer on the other end of the chat. A ten-second resolution that loses the customer is worth less than a four-minute resolution that keeps them. The pilot dashboard measured the easy variable because it was easy. The hard variable, customer retention through emotional friction, did not live in the AI vendor's reporting suite. Nobody looked at it until the cohort data caught up. By then the rebuild was more expensive than the cuts had been profitable.
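A rough way to see the unit problem is to price each resolution by the customer it keeps rather than the seconds it takes. The sketch below does that in Python. The retention probabilities, lifetime value, and handling costs are hypothetical numbers picked for illustration, not Klarna figures.

```python
def value_per_resolution(retention_prob: float, lifetime_value: float, handling_cost: float) -> float:
    """Expected value of one resolved chat: retained lifetime value minus handling cost."""
    return retention_prob * lifetime_value - handling_cost


LIFETIME_VALUE = 500.0  # assumed dollars of future margin per retained customer

# Ten-second bot resolution: nearly free to handle, but loses more customers.
fast_bot = value_per_resolution(retention_prob=0.90, lifetime_value=LIFETIME_VALUE, handling_cost=0.10)

# Four-minute human resolution: costs more to handle, keeps more customers.
slow_human = value_per_resolution(retention_prob=0.97, lifetime_value=LIFETIME_VALUE, handling_cost=4.00)

print(f"Fast bot resolution:   ${fast_bot:,.2f} expected value")
print(f"Slow human resolution: ${slow_human:,.2f} expected value")
```

A dashboard that reports only handling cost ranks these two in the opposite order. With these assumed inputs, a few points of retention outweigh the entire handling-cost gap.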

Anthropic's Economic Primitives

There is a competing measurement framework that did not exist a year ago and is now circulating in the finance suites of the companies that actually understand the meter. Anthropic published an Economic Index report in January 2026 and a follow-on Learning Curves report in March 2026. The team analyzed a million conversations and a million API records and proposed five economic primitives for measuring AI work. Task complexity. Autonomy level. Reliability under load. Workflow integration depth. Substitution rate against the unaided baseline.

None of these are seat counts. None of these are tokens per user. The Anthropic team was telling the market, in the most polite research language possible, that the ROI question requires a measurement language built for the meter. The January report found that almost half of jobs have seen at least a quarter of their tasks performed in Claude. The March report found that productivity gains drop sharply when reliability is factored in. The numbers that look beautiful on a CFO slide get cut in half when you ask whether the work was actually correct.

This is the receipt the consulting firms running the legacy ROI playbook are not yet pricing in. A task performed faster but checked by a human still has two workers on it. The total cost goes up. The productivity gain is a measurement artifact. The number the model returned was right. The number the math returned was wrong.
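The two-workers-on-one-task point can be made concrete with a small cost comparison. This is a sketch under stated assumptions: the hourly rate, review time, error rate, and token cost are all hypothetical, and the helper names are invented for the example.

```python
def labor_cost(minutes: float, hourly_rate: float) -> float:
    """Dollar cost of a person spending the given number of minutes."""
    return minutes / 60 * hourly_rate


def assisted_cost(
    token_cost: float,
    review_minutes: float,
    error_rate: float,
    rework_minutes: float,
    hourly_rate: float,
) -> float:
    """Agent tokens, plus human review of every output, plus expected rework on failures."""
    review = labor_cost(review_minutes, hourly_rate)
    expected_rework = error_rate * labor_cost(rework_minutes, hourly_rate)
    return token_cost + review + expected_rework


HOURLY_RATE = 60.0  # assumed fully loaded dollars per hour

# Unaided baseline: one person, thirty minutes.
baseline = labor_cost(minutes=30, hourly_rate=HOURLY_RATE)

# Assisted: the agent drafts in seconds, but a person still reviews everything.
assisted = assisted_cost(
    token_cost=5.0,
    review_minutes=25,
    error_rate=0.2,
    rework_minutes=30,
    hourly_rate=HOURLY_RATE,
)

print(f"Unaided:  ${baseline:.2f} per task")
print(f"Assisted: ${assisted:.2f} per task")
```

With these inputs the assisted task costs more than the unaided one even though the draft arrived faster. Shorter review or a lower error rate would reverse the result, which is why reliability belongs inside the calculation instead of outside it.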

Measuring ROI on AI Initiatives After the Pilot

The ROI calculations that hold up through the meter era share three properties. They measure usage as a variable, not a constant. They measure work as task-completion under a quality bar, rather than time saved per seat. They account for the cost of the organizational redesign that the Microsoft research called sixty-seven percent of the impact.

The first property kills the per-seat pricing assumption. A CFO who priced the pilot at five thousand dollars a seat for forty users priced a two hundred thousand dollar program. The same CFO needs to model the next twelve months as a variable curve. Adoption climbs. Token use per adopter climbs. Reasoning models multiply token use by ten or twenty against the chat-only baseline. Workflows shift from one-shot answers to long agent runs that read, write, and revise across hours of compute. The pilot priced the floor. The production bill reveals the ceiling.
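One way to model that variable curve is a simple compounding projection. The Python sketch below is illustrative only: the growth rates, starting token volume, and token price are assumptions, and the only number taken from the text is the forty-seat, five-thousand-dollar pilot that yields the two hundred thousand dollar budget.

```python
def project_monthly_cost(
    months: int,
    users: float,
    user_growth: float,
    tokens_per_user: float,
    token_growth: float,
    reasoning_multiplier: float,
    price_per_million: float,
) -> list[float]:
    """Project the token bill month by month as adoption and per-user usage compound."""
    costs = []
    for _ in range(months):
        tokens = users * tokens_per_user * reasoning_multiplier
        costs.append(tokens / 1_000_000 * price_per_million)
        users *= 1 + user_growth             # adoption climbs
        tokens_per_user *= 1 + token_growth  # usage per adopter climbs
    return costs


PILOT_BUDGET = 40 * 5_000  # the per-seat figure from the text: $200,000

# Assumed inputs shared by both scenarios.
shared = dict(
    months=12,
    users=40,
    user_growth=0.15,
    tokens_per_user=2_000_000,
    token_growth=0.10,
    price_per_million=10.0,
)

floor = project_monthly_cost(reasoning_multiplier=1, **shared)     # chat-only baseline
ceiling = project_monthly_cost(reasoning_multiplier=20, **shared)  # heavy reasoning use

print(f"Seat-priced budget: ${PILOT_BUDGET:,.0f}")
print(f"12-month floor:     ${sum(floor):,.0f}")
print(f"12-month ceiling:   ${sum(ceiling):,.0f}")
```

Under these assumed growth rates the chat-only total stays below the seat-priced budget and the twenty-times scenario lands well above it. The useful output is the band between the two, not either number on its own.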

The second property kills the time-saved metric. A worker who saves two hours a day on email triage has produced two hours of value only when those two hours get reinvested into work that closed revenue, shipped product, or reduced cost. The McKinsey-era ROI math treated time saved as money. Time saved is a budget for new work. If the work does not materialize, the savings are imaginary. The pilots that show the cleanest ROI are the ones that pair AI deployment with a measurable shift in what the team is producing. The pilots that show ambiguous ROI are the ones where adoption grew, time saved grew, and the output charts stayed flat.
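The reinvestment condition can be written as a single formula. This is a minimal sketch in Python. The working-day count, the hourly value, and the reinvested share are hypothetical, and in practice the reinvested share is the hard number to measure.

```python
def realized_value(hours_saved: float, reinvested_share: float, value_per_hour: float) -> float:
    """Time saved only becomes money for the share that is reinvested in productive work."""
    if not 0.0 <= reinvested_share <= 1.0:
        raise ValueError("reinvested_share must be between 0 and 1")
    return hours_saved * reinvested_share * value_per_hour


# Two hours a day over an assumed 220 working days.
HOURS_SAVED = 2 * 220
VALUE_PER_HOUR = 80.0  # assumed value of an hour of revenue-, product-, or cost-moving work

for share in (0.0, 0.5, 1.0):
    value = realized_value(HOURS_SAVED, share, VALUE_PER_HOUR)
    print(f"Reinvested share {share:.0%}: ${value:,.0f} realized")
```

The legacy calculation implicitly sets the reinvested share to one hundred percent. Setting it to zero reproduces the flat output chart.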

The third property kills the vendor-selection consulting engagement. Picking Anthropic over OpenAI over Google for a particular use case matters at the margin. It does not matter at the order of magnitude. The order of magnitude lives in the workflow redesign. Which decisions does the agent get to make alone? Where is the human checkpoint? How does the team learn that the agent made a bad call? How does the agent learn from the human's correction so the next call gets better? These are operating-model questions. Procurement cannot answer them. A benchmark cannot answer them.

The Pilot That Lies

The most expensive pattern in 2026 is the pilot that succeeds on the pilot metric and fails on the production metric. The shape repeats every quarter. A small team gets access to a model. They use it for a defined set of tasks. They report time saved, satisfaction, and a productivity uplift number with a confidence interval. The pilot is declared a win. Procurement signs the enterprise contract. The deployment scales to the full organization. Six months later, the dashboard shows costs up four hundred percent, productivity gains down to noise, and the original team's enthusiasm transferred to a new pilot for a different tool.

Nothing went wrong on the model side. The model did not get dumber. The vendor did not raise prices. What happened was a measurement collapse. The pilot ran in conditions that did not predict the production environment. The pilot users were self-selected enthusiasts. The pilot tasks were curated. The pilot supervisor was paying attention. None of those conditions hold in production. The variance of the user base widens. The variance of the task quality widens. The supervisor disappears. The model returns the same answers it returned in the pilot. The organization gets a different result because the organization is a different organism.

This is why the procurement question is the wrong question. The board asks which vendor to pick. The right question is what to build around whichever vendor wins. Vendors compete on benchmarks. Companies win on integration depth. The integration is the asset. The vendor is the substrate. Switching costs in the agent era will be measured by how much organizational logic lives in your tools and how much lives outside of them. Companies whose entire AI strategy lives inside a vendor's chat product are one model version away from a regression nobody asked for. Companies whose strategy lives in their own scaffolding, their own evaluation harnesses, their own task definitions, can swap models without rewriting the business.

The CFO Math That Actually Works

There is a defensible ROI framework for the meter era and it has five lines. Token cost per task, against a baseline of the unaided worker. Quality delta between agent output and human output, measured by an evaluation harness the company owns. Time-to-correction, the latency between a bad output and a fixed output that closes the loop. Substitution coverage, the percent of the task graph that runs without human touch. Organizational debt, the count of workflows that have been redesigned versus the count that still treat the agent as a faster typist.

These five lines do not produce a number that maps to the McKinsey productivity slide. They produce a number that maps to the bill the CFO is actually going to receive. The companies that adopt this framework early in 2026 will close the year with AI programs that pay. The companies that hold onto the SaaS ROI playbook will close the year with the half-billion-dollar month, the Klarna rollback, or the Uber-COO Slack message asking why the AI bill is getting harder to justify. There is no fourth outcome. The meter has been running the whole time. The only variable is how soon the CFO starts measuring against it.
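As a sketch of what tracking the five lines might look like, the Python below computes them from a log of task records. The record fields, the function name, and the sample data are all hypothetical. A real version would depend on how the company defines a task, a completion, and a correction.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class TaskRecord:
    token_cost: float                       # dollars of tokens spent by the agent
    baseline_cost: float                    # dollars for an unaided worker on the same task
    agent_quality: float                    # score from a company-owned eval harness, 0 to 1
    human_quality: float                    # same harness applied to human output
    minutes_to_correction: Optional[float]  # None when the output needed no fix
    human_touched: bool                     # True if a person intervened


def scorecard(
    records: list[TaskRecord], workflows_redesigned: int, workflows_total: int
) -> dict[str, Optional[float]]:
    """Compute the five lines from logged tasks and a count of redesigned workflows."""
    corrections = [r.minutes_to_correction for r in records if r.minutes_to_correction is not None]
    return {
        "token_cost_vs_baseline": sum(r.token_cost for r in records)
        / sum(r.baseline_cost for r in records),
        "quality_delta": mean(r.agent_quality - r.human_quality for r in records),
        "mean_minutes_to_correction": mean(corrections) if corrections else None,
        "substitution_coverage": sum(not r.human_touched for r in records) / len(records),
        # Share of workflows that still treat the agent as a faster typist.
        "organizational_debt": (workflows_total - workflows_redesigned) / workflows_total,
    }


# Hypothetical sample: four logged tasks, three of ten workflows redesigned.
sample = [
    TaskRecord(2.0, 20.0, 0.9, 0.8, None, False),
    TaskRecord(4.0, 20.0, 0.7, 0.8, 30.0, True),
    TaskRecord(3.0, 20.0, 0.8, 0.8, None, False),
    TaskRecord(3.0, 20.0, 0.6, 0.8, 90.0, True),
]

report = scorecard(sample, workflows_redesigned=3, workflows_total=10)
for line, value in report.items():
    print(f"{line}: {value}")
```

The design choice worth noting is that every input comes from the company's own logs and its own evaluation harness. Nothing in the scorecard depends on a vendor's reporting suite.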

The Build vs Buy Question Got Rewritten

For twenty years, build versus buy in enterprise software meant write your own app or sign a SaaS contract. AI compressed the decision into something stranger. Nobody is building a model. The question is whether the company owns the scaffolding around the model. The prompts. The evaluation harness. The memory. The tool catalog. The workflow definitions. The audit trail. None of that is the model. All of it lives in code the company writes, owns, and changes. A company that buys a turnkey AI product buys the model and rents the scaffolding. A company that builds its own scaffolding rents the model and owns the leverage. The ROI math for those two strategies looks the same on the day of signature. Two years in, the buying company is locked into a stack that gets more expensive every time the vendor releases a new feature. The building company can swap models, change vendors, and renegotiate price.

This is why the most defensible AI investments of 2026 are not the ones with the largest deployed surface area. They are the ones where the company owns the parts that touch the work. The model is becoming a commodity. The integration is becoming the moat.
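What owning the scaffolding means in code can be illustrated with a thin interface and a company-owned evaluation harness. The sketch below is a toy: the `Model` protocol, the `EvalCase` type, and both vendor stubs are invented for this example, and a real adapter would call an actual vendor API where the stubs return canned transformations.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


class Model(Protocol):
    """Anything with a complete() method can sit behind the company's scaffolding."""

    def complete(self, prompt: str) -> str: ...


@dataclass
class EvalCase:
    prompt: str
    passes: Callable[[str], bool]  # company-owned definition of a correct answer


def pass_rate(model: Model, cases: list[EvalCase]) -> float:
    """Run the same owned task definitions against whichever model is plugged in."""
    return sum(case.passes(model.complete(case.prompt)) for case in cases) / len(cases)


class VendorAStub:
    """Hypothetical stand-in for one vendor's model."""

    def complete(self, prompt: str) -> str:
        return prompt.upper()


class VendorBStub:
    """Hypothetical stand-in for a second vendor's model."""

    def complete(self, prompt: str) -> str:
        return prompt[::-1]


# The task definitions live in the company's code, not in either vendor's product.
CASES = [
    EvalCase("refund policy", lambda out: "REFUND" in out),
    EvalCase("invoice total", lambda out: "TOTAL" in out),
]

for name, model in [("vendor A", VendorAStub()), ("vendor B", VendorBStub())]:
    print(f"{name}: pass rate {pass_rate(model, CASES):.0%}")
```

Swapping vendors here means writing one new adapter class. The cases, the pass criteria, and the harness stay where they are, which is the leverage the building company keeps.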

What the Board Should Ask the CEO This Quarter

The right boardroom question this quarter is whether the AI measurement framework is the same one the company used in 2024. If the answer is yes, the strategy is mispriced. If the answer is that the framework has been rewritten to handle the meter, the variance, the quality bar, and the organizational redesign cost, the strategy might survive the year. If the framework is the per-seat productivity uplift slide with three case studies, the AI bill is about to introduce itself.

The companies that will publish the genuinely strong AI ROI numbers at the end of 2026 are the companies whose CFOs are arguing with their AI teams right now about what counts as a task, what counts as a completion, and what counts as a corrected error. That argument is the measurement framework. That argument is the company's defense against the half-billion-dollar month.

Why This Is an Architecture Problem

Most enterprises will not write this framework themselves. The skills required to do it well sit at the intersection of three disciplines. Workflow redesign, which lives in operations. Token economics, which lives in engineering. Evaluation methodology, which lives in research. No single internal team has all three. The consulting firms that built the last decade of digital transformation playbooks are still selling vendor selection. The system integrators are still selling deployment. The measurement gap sits in nobody's product catalog. Companies that wait for one of these incumbents to ship the answer will be measuring 2025 metrics on a 2026 bill.

The companies that close 2026 with AI programs that paid will have built the measurement system in-house, with help from the small number of advisors who understand all three disciplines. The architecture has to come first. The vendor choice follows it. The procurement closes it. The opposite order, which most companies are running today, produces the receipts that everyone is now reading about in Axios.

Agor AI Advisory builds these measurement systems for the operating model. We do not pick your vendor. We build the framework that prices your vendor honestly. We tell you which tasks belong inside the agent, which decisions belong outside it, where the human checkpoint lives, and what number the CFO should be tracking against the bill that arrives next month. We do this work with founders, with CFOs, and with operators who already know the meter is running.

Companies that built the right denominator avoid the half-billion-dollar month. Companies that keep using the wrong one keep getting it. The work to build the right one starts now.
