The Bill Comes From The Loop

Ariel Agor
On June 30, 2026, the Experiences and Devices division at Microsoft will pull Claude Code from thousands of its engineers, six months after the pilot started. The decision was reported by Cybernews and eCorpIT in the same week the Wall Street Journal reported OpenAI weighing steep price cuts to defend enterprise share against Anthropic. This is the same Microsoft, by the way, that joined Nvidia in a five-billion-dollar investment in Anthropic earlier this year. Microsoft did not cancel Claude Code because the model was bad. It cancelled because the bill arrived and the math did not work inside the division.

Read that sentence again. Inside the company that helps train and ships and resells the model. The math did not work.

If it did not work for them, your spreadsheet is wrong too.

The TCO sheet was built for a different shape

The AI total cost of ownership conversation ran on rails laid by three decades of enterprise software. You bought a license. You sized infrastructure to a known peak. You amortized over five years. You added a maintenance percentage somewhere between fifteen and twenty-five percent. The variable line was small and the fixed line was the story.

Look at the working budget for any enterprise AI program written before the agentic shift and you will see the same thing. A per-seat price for the model. A flat cloud commitment. A few full-time roles for integration. A small slot for inference compute, often underestimated. The shape is a software budget. The number on the page is recognizable to a CFO who has been signing renewals since SAP.

This shape was wrong on the day it was written. It is now wrong in a way that takes companies down.

The shift is mechanical. A chatbot answers once. An agent reads context, plans, calls a tool, reads the result, plans again, calls another tool, verifies, and sometimes backs up and tries a different branch. Industry analysis published this spring and summarized by Goldman Sachs and EY puts the token consumption of agentic systems at five to thirty times the consumption of a chatbot on the same task, and some orchestrated systems run at thirty times the cost per interaction of the 2023 chatbot baseline. An agent runs a loop. The loop sends invoices.
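The arithmetic behind that multiplier can be sketched in a few lines. This is a minimal illustration, not a measurement: the function names and every token count below are hypothetical, and it assumes the agent resends its full, growing context on each step.

```python
def chatbot_tokens(prompt_tokens: int, answer_tokens: int) -> int:
    """One request, one response."""
    return prompt_tokens + answer_tokens


def agent_tokens(prompt_tokens: int, steps: int,
                 plan_tokens: int, tool_result_tokens: int) -> int:
    """Total tokens billed when the whole context is reread on every step."""
    context = prompt_tokens
    total = 0
    for _ in range(steps):
        # Each step bills the current context as input plus the new plan as output.
        total += context + plan_tokens
        # The plan and the tool result are appended, so the next step rereads more.
        context += plan_tokens + tool_result_tokens
    return total


single = chatbot_tokens(prompt_tokens=1_000, answer_tokens=500)
looped = agent_tokens(prompt_tokens=1_000, steps=5,
                      plan_tokens=200, tool_result_tokens=800)
print(single, looped, round(looped / single, 1))  # 1500 16000 10.7
```

With these made-up numbers, five steps cost roughly ten times the tokens of the single answer, inside the five-to-thirty range quoted above. The driver is the context that gets reread on every pass, not the step count alone.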

What Uber paid to find out

The cleanest public proof of this shift came out of Uber. After deploying Claude Code and Cursor to roughly five thousand engineers, the company burned through its entire 2026 AI tools budget of three point four billion dollars in four months. Reports out of eCorpIT and Cybernews put monthly per-engineer API spend at between five hundred and two thousand dollars by April, with usage rates between eighty-four and ninety-five percent. The internal warning was that the remaining budget might not last the year.

Five thousand engineers. Four months. Three point four billion dollars.

Decompose that number and the message is clear. The line item that was supposed to be the rounding error became the dominant cost. The per-seat model assumed a constant rate of consumption per engineer per month. The actual rate was a function of how much each engineer let the agent loop, how often the agent decided to read another file, how many tools the agent invoked, how many times the agent verified its own work. None of that is a per-seat number. It is a per-token number. It runs in the background while the engineer is at lunch.

The same pattern shows up across the industry. A 2025 survey cited at FinOps X 2026 in San Diego found that eighty-five percent of companies miss AI cost forecasts by more than ten percent. Nearly a quarter underestimate by fifty percent or more. The S&P Global figure that everyone has been repeating since spring is that forty-two percent of enterprises abandoned most of their AI initiatives in 2025, up from seventeen percent the year before. The headline reads as a story about ROI. Look closer and a significant share of those abandonments are budget capitulations. The pilot worked. The bill arrived. The CFO said no.

The price cuts do not save you

On June 9, Anthropic released Claude Fable 5 at ten dollars per million input tokens and fifty dollars per million output tokens, half the rate of the prior Mythos Preview. On June 11, the Wall Street Journal reported that OpenAI was weighing steep cuts to defend enterprise share, a story Bloomberg and CNBC ran the same day. Chinese providers continue to undercut both by as much as a factor of nine. Per-token prices are falling roughly eighty percent year over year on a like-for-like basis. The headline says AI is getting cheaper.

Cheaper per token. More expensive per task.

That sentence is the cost story of 2026. Tokens have collapsed in price. Tokens consumed per task have exploded. Goldman Sachs estimates that token consumption will multiply twenty-four times between now and 2030, reaching one hundred and twenty quadrillion tokens per month. By 2040, if enterprise agents reach full-scale adoption, the figure could hit fifty-five times current levels. Goldman calls it the agentic economy and identifies the first half of this year as the profit inflection point for the infrastructure layer. That is the supply side getting healthier. The demand side, which is you, gets a fatter bill.

A cheaper unit times a much larger quantity is not a savings. It is a category change. Your AI spend stops behaving like a software line and starts behaving like a utility line. The right comparison is no longer SAP. The right comparison is electricity.
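A worked example makes the category change concrete. The sketch below is illustrative only: the prices and token counts are assumptions chosen to match the ratios quoted above (an eighty percent price drop, ten to thirty times the tokens per task), not figures from any provider's rate card.

```python
def cost_per_task(usd_per_million_tokens: float, tokens_per_task: int) -> float:
    """Unit price times quantity, expressed in dollars per completed task."""
    return usd_per_million_tokens * tokens_per_task / 1_000_000


# Hypothetical baseline: a chatbot answer at last year's price.
before = cost_per_task(usd_per_million_tokens=10.0, tokens_per_task=2_000)

# Same task a year later: price down 80 percent, but handled by an agent loop.
after_10x = cost_per_task(usd_per_million_tokens=2.0, tokens_per_task=20_000)
after_30x = cost_per_task(usd_per_million_tokens=2.0, tokens_per_task=60_000)

print(before, after_10x, after_30x)  # 0.02 0.04 0.12
```

Under these assumptions the task costs two to six times more after the price cut. The break-even point is a fivefold increase in tokens per task, which sits at the bottom of the quoted range.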

The nine cost buckets nobody put in the model

At FinOps X 2026 the FinOps Foundation introduced AI Tokenomics as a discipline and reframed the cost of AI as nine distinct buckets, not one. The token invoice is bucket one. The other eight cover reserved compute for self-hosted inference, networking and data egress for retrieval traffic, vector store and metadata storage, observability and tracing pipelines, evaluation and red-team infrastructure, governance and audit tooling, model lifecycle work for retraining and version rollover, and the human cost of running all of it. Most enterprise TCO models still itemize one and bury the other eight inside a generic line called integration.

Reread your AI program budget against that list. If your model says four hundred thousand dollars for an initial build and another fifteen percent annually, you are looking at bucket one. The actual three-year TCO for enterprise AI now sits at one and a half to two times the initial build cost when the other buckets are honestly included, with annual operating cost running at fifteen to twenty-five percent of build cost just for maintenance. The total annual exposure for a mid-sized enterprise running production agentic systems, according to analysis from FourWeekMBA and others this month, lands between nine and nineteen million dollars per year. The main components: one to three million in inference compute, one point eight million in seat licenses, two to five million in cloud commitments, two to five million in custom deployment, and two to three million in internal team cost.
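Summing the itemized ranges is a useful sanity check on any headline total. The snippet below is a simple sketch that adds the five components quoted above, in thousands of dollars to keep the arithmetic exact; the dictionary keys are shorthand labels, not terms from the cited analysis.

```python
# (low, high) annual cost per component, in thousands of US dollars.
ANNUAL_COMPONENTS_K = {
    "inference_compute": (1_000, 3_000),
    "seat_licenses": (1_800, 1_800),
    "cloud_commitments": (2_000, 5_000),
    "custom_deployment": (2_000, 5_000),
    "internal_team": (2_000, 3_000),
}

low_k = sum(low for low, _ in ANNUAL_COMPONENTS_K.values())
high_k = sum(high for _, high in ANNUAL_COMPONENTS_K.values())

print(low_k, high_k)  # 8800 17800
```

The listed components alone come to about 8.8 to 17.8 million dollars a year, close to, though slightly under, the rounded range quoted above. Inference compute is one line among five.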

That is the working number. Not the four hundred thousand on the slide your CIO put up in January.

The shape of an AI total cost of ownership model that survives June

Here is what the new model has to do.

It has to meter at the token level, by agent, by user, by tool call. Not by seat. The per-seat number is a fiction the moment one user runs an autonomous workflow that triggers fifteen tool calls per minute for six hours.
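A minimal sketch of what metering at that grain could look like follows. Everything here is hypothetical scaffolding: `UsageEvent`, `TokenMeter`, the agent and tool names, and the per-call token counts are invented for illustration. The prices are the ten and fifty dollars per million tokens quoted earlier.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageEvent:
    agent: str
    user: str
    tool: str
    input_tokens: int
    output_tokens: int


class TokenMeter:
    """Accumulates dollar cost per agent, per user, and per tool."""

    def __init__(self, input_usd_per_million: float, output_usd_per_million: float):
        self.input_price = input_usd_per_million
        self.output_price = output_usd_per_million
        self.cost_by = defaultdict(float)  # keys look like ("agent", "refactor-bot")

    def record(self, event: UsageEvent) -> float:
        cost = (event.input_tokens * self.input_price
                + event.output_tokens * self.output_price) / 1_000_000
        # Attribute the same cost along every dimension we want to report on.
        for dimension in ("agent", "user", "tool"):
            self.cost_by[(dimension, getattr(event, dimension))] += cost
        return cost


# One user, fifteen tool calls per minute, six hours.
calls = 15 * 60 * 6
meter = TokenMeter(input_usd_per_million=10.0, output_usd_per_million=50.0)
for _ in range(calls):
    meter.record(UsageEvent("refactor-bot", "alice", "read_file", 4_000, 300))

print(calls, round(meter.cost_by[("user", "alice")], 2))
```

Under these assumed token counts, that single six-hour workflow makes 5,400 tool calls and costs about 297 dollars, which no flat seat price was written to absorb.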

It has to forecast variable spend. Cloud committed-use discounts and per-seat license sheets assume a known peak. Agent loops do not have a known peak. They have a probability distribution over a workload that itself shifts with new tools and new instructions. The forecasting layer needs to handle that. The platforms catching up to this reality, Finout and Mavvrik and Amnic and Revenium among them, all chose the same architecture in the last six months: real-time consolidation of provider invoices, virtual tagging back to teams and cost centers, anomaly detection on token spend, and policy gates that can cut off a runaway agent before it cuts off the budget.
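Two of those pieces, anomaly detection and the policy gate, are simple enough to sketch. The code below is an assumption-laden illustration, not how any of the named platforms work: it uses a plain z-score against recent daily spend and a hard daily limit, and all names and thresholds are hypothetical.

```python
from statistics import mean, stdev


def is_anomalous(history_usd: list[float], current_usd: float,
                 z_threshold: float = 3.0) -> bool:
    """Flag spend that sits far above the recent baseline."""
    if len(history_usd) < 2:
        return False  # not enough data to estimate a spread
    mu, sigma = mean(history_usd), stdev(history_usd)
    if sigma == 0:
        return current_usd > mu
    return (current_usd - mu) / sigma > z_threshold


class BudgetGate:
    """Refuses any call that would push spend past a hard limit."""

    def __init__(self, limit_usd: float):
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def allow(self, estimated_cost_usd: float) -> bool:
        if self.spent_usd + estimated_cost_usd > self.limit_usd:
            return False  # the agent is cut off before the budget is
        self.spent_usd += estimated_cost_usd
        return True


daily_spend = [100.0, 110.0, 95.0, 105.0, 102.0]
print(is_anomalous(daily_spend, 108.0), is_anomalous(daily_spend, 400.0))  # False True
```

A production system would want a seasonality-aware baseline and per-agent limits. The point of the sketch is the ordering: the gate runs before the call is made, not after the invoice arrives.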

It has to charge back at the agent level. The next twelve months of internal politics around AI cost will be about who pays when an agent serving Sales reads the entire customer data lake to answer one question. Without per-agent chargeback the answer is the platform team, and the platform team will quietly stop letting agents read the lake. Then the agent gets dumber and the Sales team starts complaining that AI does not work. You have seen this movie. It ends with the pilot getting cancelled and somebody quoting the forty-two percent number.
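The mechanics of per-agent chargeback are a roll-up plus an ownership map. The sketch below assumes usage has already been priced per agent; the agent names, cost centers, and dollar amounts are hypothetical. Anything without an owner falls back to the platform team, which is exactly the default described above.

```python
from collections import defaultdict


def chargeback(usage: list[tuple[str, float]],
               owner_by_agent: dict[str, str],
               default_owner: str = "platform") -> dict[str, float]:
    """Roll priced agent usage up to the cost center that owns each agent."""
    totals: dict[str, float] = defaultdict(float)
    for agent, cost_usd in usage:
        totals[owner_by_agent.get(agent, default_owner)] += cost_usd
    return dict(totals)


usage = [
    ("sales-qa", 120.0),
    ("sales-qa", 30.0),
    ("support-triage", 45.5),
    ("adhoc-notebook", 10.0),  # no registered owner
]
owners = {"sales-qa": "sales", "support-triage": "support"}

print(chargeback(usage, owners))
# {'sales': 150.0, 'support': 45.5, 'platform': 10.0}
```

The size of the `platform` line is a useful health metric in its own right: it measures how much agent spend still has no accountable owner.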

It has to budget for the eight buckets that are not the token invoice. The vector store grows in proportion to how much you index, not how much you query. Observability gets more expensive as agents get more autonomous, because you need finer-grained traces to debug them. Red-team infrastructure stops being a quarterly compliance exercise. It becomes a continuous pipeline running adversarial workloads against your production agents to find the policy holes before a customer does. Each of these is a real annual line. None of them fit in the integration bucket.

It has to model architecture choices as cost choices. The Lenovo Press analysis updated for 2026 puts on-premises inference at roughly eight times cheaper per million tokens than cloud infrastructure-as-a-service and as much as eighteen times cheaper than frontier model-as-a-service APIs. The breakeven on high-utilization workloads is under four months. That is a real choice your CFO can act on, and it does not appear anywhere in a spreadsheet that treats AI as a vendor product. The five-year savings per server on a self-hosted high-utilization workload can exceed five million dollars. That is the savings you do not capture if you buy your TCO model from a vendor that sells you APIs.
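The breakeven claim reduces to one division, and it is worth seeing how strongly it depends on utilization. The sketch below is a simplified model under stated assumptions: the capital cost, token volumes, and per-million prices are hypothetical (the prices are chosen only to reflect the eighteen-to-one ratio quoted above), and the self-hosted price is treated as marginal operating cost excluding the hardware.

```python
from typing import Optional


def breakeven_months(capex_usd: float,
                     million_tokens_per_month: float,
                     api_usd_per_million: float,
                     selfhost_usd_per_million: float) -> Optional[float]:
    """Months until hardware spend is repaid by the lower per-token cost."""
    monthly_saving = million_tokens_per_month * (
        api_usd_per_million - selfhost_usd_per_million)
    if monthly_saving <= 0:
        return None  # self-hosting never pays back at these prices
    return capex_usd / monthly_saving


# Hypothetical server: 250,000 dollars up front, 18 vs 1 dollar per million tokens.
busy = breakeven_months(250_000, 10_000, 18.0, 1.0)  # 10 billion tokens a month
idle = breakeven_months(250_000, 500, 18.0, 1.0)     # 0.5 billion tokens a month

print(round(busy, 1), round(idle, 1))  # 1.5 29.4
```

With the same hardware and the same prices, the payback moves from under two months to over two years as volume drops. That is why the architecture decision belongs in the model per workload, not per company.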

Why Microsoft cancelled and what it tells you

Microsoft did not cancel Claude Code because Claude was the wrong model. The pivot to GitHub Copilot CLI uses, in many configurations, the same underlying models routed through different infrastructure and different pricing.

Microsoft cancelled because the procurement contract was wrong. The pricing was per-seat with no usage cap. The seats went to engineers who used the tool the way agentic tools want to be used, which is constantly, and the tokens consumed per engineer scaled in ways that the per-seat price could not absorb. The contract was sized for a chatbot. What arrived was a loop.

The fix sits in the cost architecture, not the model. Move to a metered-usage contract. Route by task to the cheapest model that can do the work. Cache retrieved context aggressively. Cap agent loops at a maximum depth. Move long-running internal workloads to self-hosted inference on dedicated hardware where the unit cost is a fraction of API rates. None of these are model decisions. They are architecture decisions and they show up in the AI total cost of ownership line as a difference of an order of magnitude.
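Three of those levers (routing, caching, and a depth cap) fit in a short sketch. Treat it as an outline under assumptions rather than a reference design: the model names, capability tiers, prices, and the `retrieve` stub are all hypothetical, and real routing would need an actual quality signal per task.

```python
from dataclasses import dataclass
from functools import lru_cache
from typing import Callable


@dataclass(frozen=True)
class Model:
    name: str
    tier: int              # higher means more capable (assumed ranking)
    usd_per_million: float


MODELS = (Model("small", 1, 0.5), Model("mid", 2, 3.0), Model("frontier", 3, 10.0))


def cheapest_capable(required_tier: int) -> Model:
    """Route to the lowest-priced model that meets the task's tier."""
    eligible = [m for m in MODELS if m.tier >= required_tier]
    if not eligible:
        raise ValueError(f"no model satisfies tier {required_tier}")
    return min(eligible, key=lambda m: m.usd_per_million)


class LoopCapExceeded(RuntimeError):
    pass


def run_with_cap(step: Callable[[int], bool], max_depth: int) -> int:
    """Run agent steps until one reports done, or stop at max_depth."""
    for depth in range(1, max_depth + 1):
        if step(depth):
            return depth
    raise LoopCapExceeded(f"agent did not finish within {max_depth} steps")


@lru_cache(maxsize=1024)
def retrieve(query: str) -> str:
    """Stand-in for an expensive retrieval call; repeats are served from cache."""
    return f"context for {query}"


print(cheapest_capable(2).name)             # mid
print(run_with_cap(lambda d: d >= 3, 5))    # 3
retrieve("pricing policy"); retrieve("pricing policy")
print(retrieve.cache_info().hits)           # 1
```

None of the three touches model quality. Each one changes how many tokens are bought and at what price, which is where the order-of-magnitude difference comes from.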

If you have not yet rebuilt your TCO model for the agentic shift, your current model is off by a factor of three to ten. That is the gap between the per-seat forecast and the per-token reality. It is the gap that closes inside Microsoft on June 30. It is what closed at Uber by April. It is what will close at your company some time between now and the next board meeting unless somebody does the structural work.

The work nobody wants to do

The honest reason most companies have not done this is that the work is unglamorous. Wiring up real-time token observability across OpenAI, Anthropic, Gemini, Bedrock, Vertex, and any self-hosted runtime is plumbing. Building chargeback at the agent level requires schema work in your billing and identity systems. Negotiating metered-usage contracts with vendors who would rather sell you flat-rate seats requires legal and commercial muscle. Standing up a self-hosted inference cluster for the high-utilization workloads requires capital expenditure and operating discipline most software organizations have not exercised in a decade.

This is the work that separates the eighty-eight percent of enterprise AI pilots that never reach production from the twelve percent that do. It is also the work that separates the forty-two percent of companies scrapping most of their AI initiatives from the ones that compound. The technical decisions get the conference talks. The cost architecture decisions get the survival.

There is no off-the-shelf tool that solves this for you. Finout consolidates token spend. Amnic does cost allocation. Mavvrik does GPU and inference observability. Revenium does runtime governance. Each of them solves a slice. The synthesis, the part where your token invoices roll up to your business unit P&L and your agent depth caps tie to your engineering quotas and your retrieval cache hit rate ties to your gross margin, that synthesis is the architecture work, and it has to be done in your stack against your processes by people who understand both the cost side and the agent side at the same time.

What we do at Agor AI Advisory

The argument here is not about which model to pick. It is about whether the cost machinery your company is running was built for the workload that is actually arriving. For most companies, the answer is no. The procurement contract assumes a chatbot. The forecasting model assumes a per-seat curve. The chargeback system assumes a fixed cost center. The observability stack reports cloud spend but not token spend. The vendor relationship treats AI as a SaaS product. Every one of these assumptions is breaking right now, in production, on monthly invoices that nobody saw coming.

Architecting the alternative is what we do. Not picking a vendor. Not running a pilot. Building the cost architecture that handles agents, metering at the right grain, deciding which workloads run on self-hosted hardware and which run on API, negotiating the contract shape that protects you when usage scales, instrumenting the observability that flags a runaway loop in minutes instead of months, and tying every line of token spend to a business outcome you can defend in a board meeting.

The companies that survive the next two years of the agentic shift will not be the ones with the best model. They will be the ones whose AI total cost of ownership model matches the workload they are actually running. Microsoft just told you the deadline on figuring that out is shorter than you think.

Schedule a strategic consultation with us today.
