
Counting the Lights

Ariel Agor

May 29, 2026

A Microsoft 365 Copilot admin dashboard in May 2026 looks great. Microsoft itself reports 28 million paid enterprise seats globally in Q1 2026, 67 percent of licensed users opening the tool every workday, and an average of 11.3 Copilot interactions per user per workday across the enterprise base. On May 14, 2026, the GitHub Changelog announced team-level Copilot usage metrics via API for even more granularity. Lights everywhere.

In the same window, McKinsey's most recent State of AI work estimates that roughly 6 percent of companies attribute more than 5 percent of EBIT to AI. The MIT NANDA initiative report from August 2025, now widely confirmed in the first two quarters of 2026, puts the failure rate of enterprise generative AI pilots at 95 percent. Lots of lights. Almost no impact.

Both sets of numbers are honest. The Microsoft data is real. The McKinsey and MIT failure data is real. The gap between them is the entire story of enterprise AI right now, and it lives inside every P&L review where the dashboard says winning and the financials say nothing happened. The question for every executive sitting on top of an eight-figure AI budget is which set of numbers describes the company they actually have.

The story of enterprise AI in May 2026 is that the AI adoption metrics that matter are not the ones we collect.

The dashboard is honest and useless at the same time

Vendors are not gaming the numbers. The numbers vendors give you are the wrong numbers, and the right ones are harder to surface.

Microsoft reports 28 million paid Microsoft 365 Copilot seats globally in Q1 2026, up from 12 million a year prior. Anthropic overtook OpenAI in business adoption in April 2026, per the May 2026 Ramp AI Index: Anthropic at 34.4 percent of businesses, OpenAI at 32.3 percent. TechCrunch and Axios reported the milestone on May 13. Anthropic hit roughly $45 billion in annualized revenue by mid-May, with the number of customers spending over $1 million annually doubling from 500 to over 1,000 in under two months. The Ramp figures are not marketing numbers. They are spend tracked through a corporate card platform.

In the same period, McKinsey's State of Organizations 2026 work and its earlier State of AI cuts show that the share of companies reporting material EBIT impact from AI sits near 6 percent. The MIT NANDA report puts the failure rate of GenAI pilots at 95 percent. Snowflake's Radical ROI of Generative AI 2026 found that only 29 percent of organizations report significant ROI from generative AI and just 23 percent from agents. Sixty percent report no enterprise-wide financial impact at all.

Adoption at 88 percent. Value capture at 6 percent. The gap is what each company has chosen to measure between them.

The four families of metric most companies actually track

Walk into the AI program review at any mid-cap or larger company today and you will see one of four families of metric on the board. None of them tells you whether the program is working.

The first family is seat counts. Licenses purchased, licenses provisioned, licenses active. CloudEagle, Zylo, and Flexera all publish data showing 25 to 35 percent of seat-based AI licenses go unused in any given quarter. For a thousand-person company at $30 per user per month, that is $90,000 to $126,000 a year wasted on a single tool. Seat counts measure procurement, not value.
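The seat-waste arithmetic is easy to reproduce for your own headcount and price. What follows is a minimal sketch, not a vendor calculation: the function name `unused_seat_cost` is made up for illustration, and it assumes flat per-seat monthly pricing with no volume discounts.

```python
def unused_seat_cost(seats: int, price_per_seat_month: float, unused_share: float) -> float:
    """Annual spend on licenses that nobody uses (illustrative helper)."""
    if not 0.0 <= unused_share <= 1.0:
        raise ValueError("unused_share must be between 0 and 1")
    annual_spend = seats * price_per_seat_month * 12
    return annual_spend * unused_share


# The example from the text: 1,000 people, $30 per user per month,
# 25 to 35 percent of seats unused.
low = unused_seat_cost(1000, 30.0, 0.25)
high = unused_seat_cost(1000, 30.0, 0.35)
print(f"${low:,.0f} to ${high:,.0f} per year")  # prints "$90,000 to $126,000 per year"
```

Swap in your own seat count and contract price. The point is only that the waste scales linearly with both.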

The second family is activity. Logins per week, prompts per user, sessions per day. The Microsoft Copilot admin dashboard surfaces 11.3 interactions per workday for the average enterprise user. That is activity. It is not outcome. A user can run eleven Copilot prompts a day on meeting summaries that no one reads and that change no decision the company makes.

The third family is sentiment. Net Promoter Scores on internal AI tools. Engagement scores from change management vendors. These are useful for tracking the political weather inside the company, which has its place. They cannot tell you whether the program is paying for itself.

The fourth family is anecdote. The lawyer who saved fourteen hours on a brief. The marketer who shipped three campaigns in the time of one. The analyst who got the model done 30 to 40 percent faster, per Microsoft's own enterprise studies on Excel financial modeling. The drafter who got Word documents out 50 to 60 percent faster. These wins are real. Snowflake's 2026 ROI cuts show the median early adopter reports $1.49 returned per $1 spent inside a specific use case. Anecdotes are a leading indicator that a use case is plausible. They are not evidence that a company is winning at AI.

Stack the four families together and you get a beautiful dashboard. It will tell you nothing about whether your company will be standing in three years.

AI adoption metrics that matter

The phrase "AI adoption metrics that matter" is being searched a lot right now. What every honest executive is asking when they search it is: how do I know if we are wasting eight figures? Four bands of metric answer that question. Each is harder to measure than seat counts. Each is closer to the truth.

Outcome density

How many discrete business outcomes did AI directly cause this month? Not assist. Not contribute to. Cause. A loan approved without a human review. A support ticket closed end-to-end. A purchase order generated, validated, and sent. A piece of code merged to main.

The unit matters. It should be denominated in the thing your company already counts: tickets, orders, claims, drafts, decisions. Outcome density is the volume of AI-caused units divided by total units in that category. If your contact center fielded a hundred thousand tickets in May and AI closed fifteen hundred of them end-to-end, outcome density is 1.5 percent. That is a real number. It can be tracked, it can be moved, it can be tied to a cost line in next quarter's budget.
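As a rough sketch, outcome density is one division, provided both counts come from the same system of record. The function name and the validation rules below are illustrative assumptions, not an established definition.

```python
def outcome_density(ai_caused_units: int, total_units: int) -> float:
    """Share of units in a category that AI closed end-to-end.

    Both counts should come from the same system of record and the same
    period. "AI-caused" means no human completed the unit.
    """
    if total_units <= 0:
        raise ValueError("total_units must be positive")
    if not 0 <= ai_caused_units <= total_units:
        raise ValueError("ai_caused_units must be between 0 and total_units")
    return ai_caused_units / total_units


# The contact-center example from the text.
density = outcome_density(1_500, 100_000)
print(f"{density:.1%}")  # prints "1.5%"
```

The hard part is not the arithmetic. It is getting a trustworthy count of AI-caused units, which is only possible when the AI writes to the system that counts them.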

Most companies cannot report this number. They cannot report it because the AI sits in a chat window beside the workflow rather than inside the workflow. The MIT NANDA finding that 95 percent of pilots fail is, at the operational level, the finding that 95 percent of pilots cannot produce an outcome density number because they were never wired into the system of record.

Cycle compression on a named end-to-end process

Pick one cross-functional process. Quote-to-cash. Hire-to-onboard. Incident-to-resolution. Measure end-to-end cycle time before AI. Measure it after. Hold the scope constant.

Most reported AI productivity gains live inside a single role. Excel modeling 30 to 40 percent faster. Word drafting 50 to 60 percent faster. These role-level wins are real, and they are why McKinsey's "performance paradox" exists at the enterprise level. Companies report local productivity gains and zero enterprise EBIT impact at the same time, because the gain happens inside one role and the friction sits in the handoff between roles.

Cycle compression on a named end-to-end process is the metric that catches what role-level wins miss. If the analyst gets the model done 40 percent faster but the deal still takes seventy days to close because it sits in legal review for three weeks, the company has paid for AI and bought nothing.
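A small sketch shows why the role-level win barely registers end-to-end. The stage names and durations below are hypothetical, chosen to add up to the seventy-day deal in the example, and "40 percent faster" is read here as 40 percent less modeling time.

```python
def cycle_compression(before_days: dict[str, float], after_days: dict[str, float]) -> float:
    """Fractional reduction in end-to-end cycle time, scope held constant."""
    if before_days.keys() != after_days.keys():
        raise ValueError("scope changed: stages must match before and after")
    total_before = sum(before_days.values())
    total_after = sum(after_days.values())
    return (total_before - total_after) / total_before


# Hypothetical stage breakdown of a 70-day deal.
before = {"modeling": 5.0, "legal_review": 21.0, "everything_else": 44.0}
# Modeling takes 40 percent less time; nothing else changes.
after = {"modeling": 3.0, "legal_review": 21.0, "everything_else": 44.0}

compression = cycle_compression(before, after)
print(f"end-to-end compression: {compression:.1%}")  # prints "end-to-end compression: 2.9%"
```

Under these assumed numbers, a 40 percent gain inside one role is under 3 percent on the process. The stage check is there to enforce "hold the scope constant": if the stage list changes between measurements, the comparison is not valid.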

Cost-to-serve delta

For any unit of output your company produces, what is the fully loaded cost to serve before and after AI is deployed? Fully loaded means labor, tooling, infrastructure, and the AI itself.

This is where the Snowflake $1.49-per-$1 figure becomes useful or useless depending on the denominator. If the $1 you spent is the Copilot license and the $1.49 you "saved" is the time of an analyst who stayed on payroll anyway, you have moved zero dollars in the P&L. If the $1 is fully loaded program cost and the $1.49 is verifiable cost takeout that landed in the budget, you have a result.

McKinsey's much-quoted line that for every $1 of technology investment, $5 should be spent on people refers to this denominator problem. The $5 is the change management, the workflow redesign, the data integration, the role redefinition, and the governance work that allow the $1 of tech to actually displace a unit of cost. Companies that skip the $5 spend the $1 and measure nothing.
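The denominator problem can be made concrete with a short sketch. Every figure below is hypothetical and the class and function names are invented for illustration. The only thing taken from the text is the definition of fully loaded cost as labor, tooling, infrastructure, and the AI itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeriodCost:
    """Fully loaded cost of producing `units` of output in one period."""
    labor: float
    tooling: float
    infrastructure: float
    ai: float
    units: int

    def cost_to_serve(self) -> float:
        total = self.labor + self.tooling + self.infrastructure + self.ai
        return total / self.units


def cost_to_serve_delta(before: PeriodCost, after: PeriodCost) -> float:
    """Negative means each unit got cheaper to serve."""
    return after.cost_to_serve() - before.cost_to_serve()


# Hypothetical figures for the same 100,000 units of output.
before = PeriodCost(labor=400_000, tooling=50_000, infrastructure=30_000, ai=0, units=100_000)
# Case A: AI added, but the "saved" time stayed on payroll.
no_takeout = PeriodCost(labor=400_000, tooling=50_000, infrastructure=30_000, ai=36_000, units=100_000)
# Case B: AI added and labor cost verifiably left the budget.
with_takeout = PeriodCost(labor=340_000, tooling=50_000, infrastructure=30_000, ai=36_000, units=100_000)

print(f"{cost_to_serve_delta(before, no_takeout):+.2f} per unit")    # prints "+0.36 per unit"
print(f"{cost_to_serve_delta(before, with_takeout):+.2f} per unit")  # prints "-0.24 per unit"
```

In case A the program made each unit more expensive, however much time the dashboard says was saved. Only case B is a result finance would recognize.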

Decision velocity on tracked decisions

The latency between question asked and decision made. Tracked decisions only. Procurement approvals over a threshold. Hiring decisions. Pricing exceptions. Underwriting calls. Claim adjudications. Whatever your company already tracks because regulators or auditors require it.

This metric exists because delay is more expensive than labor at most large companies. Capital sitting in inventory. Deals waiting in legal review. Bids missed because the response window closed. Customers churning during a thirty-day support backlog.

AI that compresses decision latency on a tracked decision class shows up in the P&L because the underlying cost was already in the P&L as working capital, lost revenue, or accrued risk. AI that fails to compress decision latency on tracked decisions can be deployed at any scale without moving the financials.
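Decision latency can be computed from two timestamps most tracked decisions already carry: when the request was raised and when the decision was recorded. The sketch below assumes those two fields exist in your system of record. Using the median rather than the mean is a choice made here so one stalled decision does not dominate, and the sample dates are made up.

```python
from datetime import datetime
from statistics import median

SECONDS_PER_DAY = 86_400


def decision_latency_days(decisions: list[tuple[datetime, datetime]]) -> float:
    """Median days from question asked to decision made for one decision class."""
    if not decisions:
        raise ValueError("need at least one tracked decision")
    latencies = []
    for asked, decided in decisions:
        if decided < asked:
            raise ValueError("decision recorded before it was asked")
        latencies.append((decided - asked).total_seconds() / SECONDS_PER_DAY)
    return median(latencies)


# Hypothetical pricing exceptions: (asked, decided).
pricing_exceptions = [
    (datetime(2026, 5, 1), datetime(2026, 5, 4)),   # 3 days
    (datetime(2026, 5, 2), datetime(2026, 5, 12)),  # 10 days
    (datetime(2026, 5, 5), datetime(2026, 5, 10)),  # 5 days
]
print(decision_latency_days(pricing_exceptions))  # prints 5.0
```

Run it per decision class, before and after deployment. Higher decision velocity means this number falls.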

Why most companies measure the wrong things

Three reasons, none of them stupid.

First, the wrong metrics are the ones vendors hand you. Microsoft, Google, Anthropic, OpenAI, and GitHub all surface seat and activity data through their admin consoles. They surface it because that is what they collect. The metrics that matter (outcome density, cycle compression, cost-to-serve delta, decision velocity) require integration with your own systems of record. No vendor can give them to you because no vendor has them.

Second, the wrong metrics are the ones change management consultancies have been selling for two decades. Adoption curves, sentiment, training completion rates. These were the right metrics for SaaS rollouts in 2008. They are catastrophic metrics for AI deployments in 2026. The 2008 methodology was built for tools that produced no outputs of their own. AI tools produce outputs continuously, and the only useful question is whether those outputs are doing work.

Third, the wrong metrics are easier to defend in a board review. "We have 28 million users" wins the slide. "We have 1.5 percent outcome density on Tier 1 support tickets" loses the slide. The first feels like a victory. The second feels like an admission. The first is a vanity metric. The second is the AI adoption metric that actually matters, because it ties to a cost line and a baseline you can move.

The compounding problem

There is a reason this gap is suddenly urgent.

In February 2026 the public markets repriced SaaS. About $285 billion in market capitalization evaporated when investors recognized that per-seat pricing was structurally exposed to seat compression. The thesis was simple. If an AI agent does the work, the seat is not needed, and seat-priced software loses pricing power. Salesforce, ServiceNow, and the seat-priced incumbents took the brunt.

For the companies inside that repricing, the path forward is to prove their AI features generate outcomes that justify a different pricing model. For the companies buying those tools, the implication runs the other way. Every per-seat license you hold is now a bet that the seat will produce more value than the agent that could replace it.

If you cannot measure outcome density on a per-seat basis, you cannot make that bet rationally. You are paying for seats and hoping. Hoping is not a strategy when the underlying unit economics are shifting in real time. The Anthropic Economic Index report from March 2026 found that roughly 49 percent of jobs already have at least 25 percent of their tasks done by Claude. That is the bet, priced.

What an executive can do this quarter

Pick a single process. Quote-to-cash. Ticket-to-resolution. Claim-to-payout. Hire-to-onboard. Whatever has the highest variable cost and the worst cycle time in your business.

Instrument it end-to-end. Baseline the four metrics that matter before any model is turned on. Today's outcome density. Today's cycle time. Today's fully loaded cost-to-serve. Today's decision velocity on the tracked decisions inside the process. Write those numbers down. Sign them off with finance.

Then deploy AI inside the workflow, not beside it. The companies seeing P&L impact in the Anthropic and McKinsey data are the ones that have wired the model into the system of record, so its output flows downstream automatically and the next system picks it up without a human re-keying anything. The companies seeing only adoption metrics are the ones running chatbots in a side panel while the underlying process runs the same way it ran in 2019.

Re-measure the four metrics ninety days later. If outcome density rose, cycle time fell, cost-to-serve dropped, and decision velocity climbed, scale the deployment. If the metrics did not move, kill the deployment without a debrief and try a different process. Killing fast is cheaper than the slow bleed of an eight-figure program that produces a beautiful dashboard and nothing else.
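The ninety-day check can be written down as an explicit rule before the deployment starts, so nobody renegotiates it afterward. This is a sketch with invented names and sample numbers. The "review" outcome for mixed results is an assumption added here, since the rule above only covers the cases where all four metrics moved or none did.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """The four metrics for one process, signed off with finance."""
    outcome_density: float        # share of units AI closed end-to-end
    cycle_time_days: float        # end-to-end cycle time
    cost_to_serve: float          # fully loaded cost per unit
    decision_latency_days: float  # lower latency means higher decision velocity


def verdict(baseline: Snapshot, day_90: Snapshot) -> str:
    improved = [
        day_90.outcome_density > baseline.outcome_density,
        day_90.cycle_time_days < baseline.cycle_time_days,
        day_90.cost_to_serve < baseline.cost_to_serve,
        day_90.decision_latency_days < baseline.decision_latency_days,
    ]
    if all(improved):
        return "scale"
    if not any(improved):
        return "kill"
    return "review"  # assumption: mixed results get a human decision


# Hypothetical numbers for one process.
baseline = Snapshot(outcome_density=0.0, cycle_time_days=70.0, cost_to_serve=4.80, decision_latency_days=5.0)
day_90 = Snapshot(outcome_density=0.015, cycle_time_days=61.0, cost_to_serve=4.56, decision_latency_days=3.0)
print(verdict(baseline, day_90))  # prints "scale"
```

In practice you would probably want a minimum threshold per metric rather than any movement at all, agreed with finance when the baseline is signed.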

Why this needs architecture, not procurement

The implicit assumption inside most AI rollouts is that the right tool will produce the right outcome. The MIT NANDA report rejects this directly. Among the 95 percent of pilots that fail, the dominant cause is not model quality. It is the gap between where the model sits and where the work happens.

Closing that gap is an architectural problem, not a procurement problem. It requires picking the process, mapping the systems of record, wiring the model into the workflow with the right authority and the right guardrails, building the telemetry that produces the four metrics that matter, and rebuilding the role definitions of the humans who used to do the work. None of that is in the Microsoft or Anthropic SKU. All of it is in the gap between the SKU and the P&L.

No vendor sells this. Vendors sell tools. The work of turning tools into outcomes has been outsourced at most companies to a generic consultancy with a 2008 change management methodology, and that methodology is what gets you a 95 percent failure rate.

This is the gap Agor AI Advisory is built to close. We architect the deployment from the process backward, treating the tool as the last decision rather than the first. We instrument the four metrics that matter (outcome density, cycle compression, cost-to-serve delta, decision velocity) before the AI is turned on, so the program can be killed or scaled on evidence rather than on the political weather inside the room. We work alongside the executive team. We do not hand you a deck and disappear.

The eight figures you are about to spend on AI in the next twelve months will either show up in your P&L or in your activity dashboard. There is no third option. The companies that decide which one in advance will own the next decade. The companies that hope are paying for someone else's training data.

Sources

The Four AI Metrics That Actually Land in the P&L

Verifies the post's central taxonomy claim: that the four 'metrics that matter' each tie to a specific cost line, unlike seat counts and activity dashboards. After 15 seconds the reader sees that every metric has a concrete unit, a worked example, and a P&L destination, and realizes their current dashboard reports none of the four.

Adoption at 88%, value capture at 6%. The gap is what each company chose to measure between them.
If you can't name the unit, the worked example, and the P&L line, it's a vanity metric.
Each of these is harder to measure than a seat count. Each is closer to the truth.

Outcome density
What it counts: AI-caused business outcomes ÷ total units, denominated in what you already count: tickets, orders, claims, drafts, decisions
Worked example from the post: 100,000 tickets in May, AI closes 1,500 end-to-end → 1.5% outcome density
Where it lands in the P&L: Ties directly to a cost line in next quarter's budget
Note: Most pilots can't report it because the AI sits beside the workflow, not inside the system of record. That gap is the operational form of the 95% failure rate.

Cycle compression
What it counts: End-to-end cycle time on one named cross-functional process, scope held constant before vs. after
Worked example from the post: Analyst models 40% faster, but the deal still takes 70 days because it sits 3 weeks in legal review
Where it lands in the P&L: Catches McKinsey's 'performance paradox': local gains, zero enterprise EBIT
Note: Role-level speedups (Excel 30-40% faster, Word 50-60% faster) are real but die in the handoff between roles.

Cost-to-serve delta
What it counts: Fully loaded cost per unit of output before vs. after AI: labor, tooling, infrastructure, and the AI itself
Worked example from the post: Snowflake's $1.49 per $1 only counts if the $1 is fully loaded program cost and the $1.49 is takeout that landed in the budget
Where it lands in the P&L: A real result only when verifiable cost leaves the P&L, not when an analyst stays on payroll
Note: McKinsey's $5-on-people-per-$1-on-tech is this denominator: skip the change-management spend and you spend the $1 and measure nothing.

Decision velocity
What it counts: Latency from question asked to decision made, on tracked decision classes only: procurement, hiring, pricing, underwriting, claims
Worked example from the post: Delay costs more than labor: capital in inventory, deals stuck in legal, bids missed when the window closes
Where it lands in the P&L: Shows up because the cost was already booked as working capital, lost revenue, or accrued risk
Note: Can be deployed at any scale without moving the financials if it fails to compress latency on a tracked decision.

Source: The four metrics defined in the post's 'AI adoption metrics that matter' section; worked examples and figures (1.5% outcome density, 40% faster / 70-day deal, Snowflake $1.49 per $1, McKinsey $5-per-$1) drawn verbatim from the same post body and its cited sources. · verified · as of 2026-05-29
