
Pilot The Shell

Ariel Agor

June 18, 2026


On June 9, 2026, KPMG and Microsoft put out a joint announcement that ought to have stopped a thousand AI steering committees in their tracks. KPMG would deploy Microsoft Agent 365 and Microsoft 365 Copilot to all 276,000 of its professionals across 138 countries. That headline read like vendor theater until you noticed the timing. The underlying platform, Agent 365, had reached general availability only the month before, in May 2026. The deployment was not a pilot graduating after a year of careful steps. It was a switch flip across a workforce the size of a small country.

Now read the room. According to S&P Global Market Intelligence, only 31 percent of enterprises have even a single AI agent running in production. Yet Gartner finds that 80 percent of enterprise applications shipped or updated in the first quarter of 2026 embed at least one agent. A March 2026 survey of 650 enterprise technology leaders found that 78 percent have agent pilots running but only 14 percent have reached production scale. MIT's NANDA initiative reviewed more than 300 publicly disclosed deployments and reported, on August 18, 2025, that 95 percent of generative AI pilots delivered zero measurable return on investment. Eighty-eight percent of agent pilots fail to graduate.

How does one firm push agents to 276,000 desks in a month while the rest of the index spends a year watching pilots stall? The answer is not better data or better models. The answer is that KPMG did not run a pilot for the agents. It ran a pilot for the part of the system that holds the agents. The agent was a passenger. The shell was the vehicle.

This essay is about moving an AI pilot to production, and the central claim is simple. The pilot you have been running was structured to die at the boundary it was supposed to cross. The fix is to invert the subject of the pilot.

What a pilot tests, what production demands

The pilot you have probably seen runs like this. A vendor proposes a use case. A line of business sponsors it. A small team builds an agent that handles, say, contract review or supplier onboarding or first-line customer service. The team runs a demo against a curated dataset. The demo lands. The pilot is declared a success. The team writes a memo. The memo enters the slow grind of central IT, central security, central legal, and central finance. Eighteen months later the memo is still circulating and the agent has not seen a single live transaction.

What the pilot tested was the agent. What production demands is the shell.

The shell is everything that has to exist around an agent before that agent is allowed to touch a real customer, a real contract, or a real ledger entry. The shell is the registration record that says this agent exists and what it does. The shell is the permission slip that says it can call this API and not that one. The shell is the audit log that captures every prompt and every output for the seven years your regulator will want. The shell is the cost meter that throttles the agent if it spends too much on tokens this hour. The shell is the kill switch that revokes its access in under a minute when something goes wrong. The shell is the escalation route that hands a problem to a named human when confidence drops below a defined line. The shell is the rollback that puts everything back the way it was if the agent makes a mistake the customer can see.

A pilot that produces a working agent and no working shell has produced nothing that can ride into production. Every claim the agent makes will face a question the shell has to answer. Who authorized this action. Where is the record. What was the prompt. What was the output. Which model rendered the decision. What were the inputs. Who is accountable if the regulator calls. None of those questions are about the agent. All of them are about the surface that holds the agent.

Read the KPMG release literally

If you read the KPMG and Microsoft release literally, you can see the structure. Microsoft has designated KPMG a Frontier Firm, its term for organizations that redesign work around human and AI collaboration. Underneath the Copilot layer, KPMG runs an internal multi-agent platform it calls KPMG Workbench, built on Microsoft Azure AI Foundry. Workbench coordinates agents across three client-service platforms. KPMG Clara handles audit. Digital Gateway runs real-time regulatory tax analysis. KPMG Velocity drives advisory.

Notice what is named and what is not. The platforms are named. The model behind them is not. The Workbench coordinates agents. The release does not specify which agents. The Trusted AI framework is named. The use cases are not. KPMG built the shell first and tested it across whatever agents the partners wanted to load into it. When Agent 365 went generally available in May, KPMG could deploy on day one because the deployment was Agent 365 plugging into a governance and audit fabric KPMG had already proved.

This is the move that escapes the 88 percent. You do not pilot an agent and then try to wrap it in a shell when you scale. You pilot the shell and let agents come and go inside it.

JPMorgan's hidden tell

Jamie Dimon went on the record in April 2026 about JPMorgan's AI program. The bank's 2026 technology budget is 19.8 billion dollars. The AI portion sits next to data centers, payment systems, and core risk controls. Dimon said the program has self-funded through 2 billion dollars in operational savings, with a 10 to 11 percent productivity lift across engineering, operations, and fraud detection. The bank runs more than 500 AI use cases in production. Fraud false positives in anti-money laundering are down 95 percent.

Look at the structure of that claim. Five hundred use cases in production. One budget line. One audit posture. One incident response runbook. This is not five hundred pilots. This is one shell with five hundred agents inside it. Anti-money laundering does not care which model rendered the score. It cares that every score is logged, every score is explained, and every score is reproducible on demand against a frozen state of the world. JPMorgan did not reach 500 by graduating 500 pilots. It reached 500 by promoting the shell and loading agents.

KPMG made the same move with Workbench. The financial industry got there first because regulators forced the shell to exist before the model arrived. A bank that already had model risk management committees, change management gates, and immutable audit logs had three quarters of the shell built for any agent that wanted to enter the system. A retailer that did not had to build the shell and the agent at once, and the shell takes longer.

Why most boards are buying the wrong half

Walk into a steering committee in the second quarter of 2026 and you will hear a debate about which model to standardize on. Anthropic released Claude Managed Agents in April 2026 with sandboxing, orchestration, and governance built in. Salesforce shipped Agentforce Operations to general availability on April 29, 2026, aimed squarely at back-office work. Microsoft Agent 365 hit GA in May. Each pitch comes wrapped in the promise that this time the pilot will reach production because the platform is enterprise grade.

This is the wrong question. The question is not which model your steering committee picks. The question is whether the shell your steering committee is buying belongs to your steering committee or belongs to the vendor.

If the audit log lives in the vendor's tenant, your regulator has to talk to the vendor before talking to you. If the kill switch lives in the vendor's console, your incident response team has to wait for their incident response team. If the cost meter is a quarterly invoice rather than a per-second telemetry feed, you cannot throttle anything in time to matter. If the registration of every action sits in the vendor's namespace and not yours, the receipts print on someone else's stationery.

The companies moving to production in weeks rather than years are the ones who built or bought the shell as their own. The model inside is replaceable. The agent inside is replaceable. The shell is the durable asset. The shell is the moat. Anthropic, OpenAI, Google DeepMind, Microsoft, and Salesforce will all keep shipping more capable agents on a cadence none of them control. None of them will ship you a shell that fits your regulator, your data classification, your risk register, your finance system, and your incident response runbook. That shell only fits if you designed it.

Build the shell. Rent the agent.

Here is the architectural inversion the next eighteen months will reward.

Stop scoping AI pilots around a use case. A use case is a fine excuse to start a project, but it is the wrong unit of work for a pilot. Use cases are picked by line-of-business sponsors with quarterly numbers to hit. Shells are picked by general counsels, chief risk officers, and chief financial officers with seven-year obligations to keep. A use case pilot ends in a demo memo. A shell pilot ends in a production environment.

Scope pilots around the shell instead. Pick three agents that share the same compliance profile and the same data domain. Pick the customer service agent, the contract triage agent, and the sales coaching agent if they all sit inside a single auditable surface. Pick the anti-money laundering agent, the credit memo agent, and the suitability check agent if they all run against the same risk register. Build one registration system, one audit pipeline, one identity model, one kill switch, one cost ceiling, one escalation route, one rollback. Run all three agents inside it. Whichever agent works first ships first. The other two ship a week later because the shell is already built. The fourth agent, six months from now, fits the same shell. So does the fifth. So does the agent the vendor is shipping next quarter that does not exist yet.

This is what KPMG did with Workbench. This is what JPMorgan did with its model risk fabric. This is what every firm that passed a hundred production agents in eighteen months did. None of them got there by graduating pilots. They promoted the shell.

Moving an AI pilot to production, in practice

For executives and operators wrestling with moving an AI pilot to production right now, the shift looks like four concrete moves.

First, audit your pilot inventory and sort it by shell, not by use case. Group together the pilots that share data classification, regulatory scope, and risk profile. Pilots that share a shell can ship together. Pilots that do not are individually carrying the cost of building the shell, and most of them will be killed when finance does the math.
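As a rough illustration of sorting by shell, the sketch below groups a pilot inventory by the three attributes named above. The pilot names and attribute values are hypothetical; only the grouping key comes from the text.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Pilot:
    name: str
    data_classification: str
    regulatory_scope: str
    risk_profile: str


def group_by_shell(pilots):
    """Group pilots that share data classification, regulatory scope, and risk profile."""
    shells = defaultdict(list)
    for p in pilots:
        key = (p.data_classification, p.regulatory_scope, p.risk_profile)
        shells[key].append(p.name)
    return dict(shells)


# Hypothetical inventory, for illustration only.
inventory = [
    Pilot("contract-triage", "confidential", "none", "medium"),
    Pilot("sales-coaching", "confidential", "none", "medium"),
    Pilot("aml-scoring", "restricted", "aml", "high"),
]

shells = group_by_shell(inventory)
# Pilots that sit alone in a group carry the full cost of their shell.
solo = [names[0] for names in shells.values() if len(names) == 1]
```

In this made-up inventory the first two pilots could ship together, while the third would have to fund its shell alone.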

Second, name the shell owner. The shell does not belong to the AI Center of Excellence. It belongs to a named executive with budget and authority over registration, audit, identity, cost, and rollback. In most banks this person is the chief operating officer or the chief risk officer. In most professional services firms it is the chief technology officer or the chief compliance officer. In most retailers it is the chief information security officer with an unusually long brief. If no one owns the shell, the shell does not exist, and your pilots are scheduled to die.

Third, define done for the shell before you define done for any agent. Done for an agent is a metric. Handle 70 percent of intake without escalation. Cut cycle time by 40 percent. Improve resolution accuracy by 12 points. Done for the shell is a list. Every action is logged with provenance. Every action is reproducible against a frozen state. Every action is reversible within a stated window. Every action is metered against a budget. Every action is attributable to a named human owner. Every action is gated by a permission tied to identity. When the shell is done, the next ten agents take weeks. When the shell is not done, the first agent takes years.
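One way to make "done is a list" concrete is to hold the list as data and ask what is still missing. The sketch below is a minimal version of that idea; the field names are shorthand for the six conditions above, and the class itself is an assumption, not a prescribed format.

```python
from dataclasses import dataclass, fields


@dataclass
class ShellChecklist:
    logged_with_provenance: bool = False
    reproducible_against_frozen_state: bool = False
    reversible_within_window: bool = False
    metered_against_budget: bool = False
    attributable_to_named_owner: bool = False
    gated_by_identity_permission: bool = False

    def missing(self):
        """Return the names of checklist items that are not yet satisfied."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def done(self):
        """The shell is done only when nothing is missing."""
        return not self.missing()


# Hypothetical status: two of the six conditions are in place.
status = ShellChecklist(logged_with_provenance=True, metered_against_budget=True)
remaining = status.missing()
```

Unlike a metric, the list has no partial credit: the shell is either done or it names what is left.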

Fourth, treat the model and the agent as inventory, not as strategy. The model in your shell today will be obsolete inside the year. The agent on top of that model will be retired inside two. The shell will not. Strategy lives at the shell. Inventory lives at the agent. A board that confuses the two has been buying the wrong half for the last eighteen months, and is currently funding the 88 percent that never ship.

What the shell looks like when it actually works

A working shell is boring on a slide and load-bearing in production. It has seven moving parts and they fit together in a way procurement cannot assemble from a catalog.

Registration. Every agent that runs against production data has an entry in a directory the regulator can read. The entry names the owner, the data classes touched, the actions permitted, the models loaded, and the date last reviewed. If an agent runs without a registration entry, your detection system kills it.
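A registration entry can be sketched as a small record plus a default-deny lookup. The fields mirror the ones listed above; the class names, the agent, and the owner are hypothetical, and a real directory would sit in a governed store rather than in memory.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RegistryEntry:
    agent_id: str
    owner: str
    data_classes: frozenset
    permitted_actions: frozenset
    models: tuple
    last_reviewed: date


class Registry:
    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.agent_id] = entry

    def is_allowed(self, agent_id, action):
        """Deny unregistered agents everything; allow registered ones only listed actions."""
        entry = self._entries.get(agent_id)
        return entry is not None and action in entry.permitted_actions


registry = Registry()
registry.register(RegistryEntry(
    agent_id="contract-triage",          # hypothetical agent
    owner="jane.doe",                    # hypothetical named owner
    data_classes=frozenset({"contracts"}),
    permitted_actions=frozenset({"read_contract", "flag_clause"}),
    models=("model-a",),
    last_reviewed=date(2026, 6, 1),
))
```

The design choice worth noting is the default: an agent with no entry is refused, which is the in-code equivalent of the detection system killing it.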

Identity. Every action the agent takes is signed by an identity the audit log can resolve back to a human owner. Not a shared service account. Not a generic API key. An identity that, when revoked, immediately stops the agent and every downstream call the agent has in flight.

Audit. Every prompt, every tool call, every retrieved document, every output, every cost, every confidence score. Captured. Append only. Searchable. Kept for the period your regulator requires, not the period your storage budget prefers.
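"Append only" is easier to trust when it can be checked. The sketch below uses a hash chain, where each record stores the hash of the record before it, so an edit to any earlier record is detectable. The hash chain is an assumption chosen for illustration; the essay does not prescribe a mechanism, and retention and storage are out of scope here.

```python
import hashlib
import json


class AuditLog:
    """Append-only log where each record carries the hash of the one before it."""

    GENESIS = "0" * 64

    def __init__(self):
        self._records = []

    def _digest(self, prev_hash, event):
        body = json.dumps(event, sort_keys=True)
        return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()

    def append(self, event):
        prev_hash = self._records[-1]["hash"] if self._records else self.GENESIS
        digest = self._digest(prev_hash, event)
        self._records.append({"event": event, "prev": prev_hash, "hash": digest})
        return digest

    def verify(self):
        """Recompute the chain; an edited record, or one cut from the middle, breaks it."""
        prev_hash = self.GENESIS
        for rec in self._records:
            if rec["prev"] != prev_hash or rec["hash"] != self._digest(prev_hash, rec["event"]):
                return False
            prev_hash = rec["hash"]
        return True

    def search(self, **criteria):
        """Return events whose fields match every given key and value."""
        return [r["event"] for r in self._records
                if all(r["event"].get(k) == v for k, v in criteria.items())]


log = AuditLog()
log.append({"agent": "contract-triage", "kind": "prompt", "text": "Review clause 4"})
log.append({"agent": "contract-triage", "kind": "output", "text": "Clause 4 flagged"})
```

A chain like this does not detect truncation of the newest records on its own; that would need the latest hash to be anchored somewhere outside the log.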

Cost meter. Per call, per agent, per workflow, per business unit. Real time. Wired to an automatic throttle and an automatic kill at defined thresholds. The token bill is not a discovery for the CFO at the end of the quarter.
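The throttle-then-kill behavior can be sketched in a few lines. This version meters per agent only and never resets, so the per-workflow and per-business-unit rollups and the time window are left out; the thresholds and dollar amounts are hypothetical.

```python
from collections import defaultdict


class CostMeter:
    """Track cumulative spend per agent and return a decision at defined thresholds."""

    def __init__(self, throttle_at, kill_at):
        if throttle_at >= kill_at:
            raise ValueError("throttle threshold must be below kill threshold")
        self.throttle_at = throttle_at
        self.kill_at = kill_at
        self.spend = defaultdict(float)

    def record(self, agent_id, cost):
        """Add the cost of one call and return 'allow', 'throttle', or 'kill'."""
        self.spend[agent_id] += cost
        total = self.spend[agent_id]
        if total >= self.kill_at:
            return "kill"
        if total >= self.throttle_at:
            return "throttle"
        return "allow"


# Hypothetical budget in dollars: slow down at 50, stop at 100.
meter = CostMeter(throttle_at=50.0, kill_at=100.0)
decisions = [meter.record("contract-triage", 30.0) for _ in range(4)]
```

The point of returning a decision on every call is that the meter acts while the spend is happening, not after the invoice arrives.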

Kill switch. One control surface. One named human can revoke any agent or any class of agent within one minute, and the revocation is global to your enterprise. If the kill switch lives in three vendor consoles, you do not have a kill switch.
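A minimal sketch of a single control surface follows, assuming every call path checks it before acting. The agent names, class names, and the revoking user are hypothetical, and the in-memory dictionaries stand in for whatever enterprise-wide store would make the revocation global.

```python
import time


class KillSwitch:
    """Single control surface: revoke one agent or a whole class of agents."""

    def __init__(self):
        self._revoked_agents = {}
        self._revoked_classes = {}

    def revoke_agent(self, agent_id, by):
        # Record who pulled the switch and when, for the audit trail.
        self._revoked_agents[agent_id] = (by, time.time())

    def revoke_class(self, agent_class, by):
        self._revoked_classes[agent_class] = (by, time.time())

    def is_live(self, agent_id, agent_class):
        """Return False if the agent or its class has been revoked."""
        return (agent_id not in self._revoked_agents
                and agent_class not in self._revoked_classes)


switch = KillSwitch()
switch.revoke_class("customer-facing", by="coo@example.com")
```

Revoking by class matters as much as revoking by agent: when a model or a vendor is the problem, one action has to stop everything built on it.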

Escalation. A defined confidence floor, a defined risk class, a defined customer signal, or a defined regulatory trigger routes the action to a named human queue with a contractual response time. The human owns the resolution and the record.
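The four triggers above can be read as a routing function that returns a named queue, or nothing if the agent may proceed. The queue names, the 0.8 confidence floor, and the order in which triggers are checked are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Action:
    agent_id: str
    confidence: float
    risk_class: str
    customer_complaint: bool = False
    regulatory_trigger: bool = False


def route(action, confidence_floor=0.8, escalate_risk=frozenset({"high"})):
    """Return the human queue that should take this action, or None to let the agent proceed."""
    if action.regulatory_trigger:
        return "compliance-queue"
    if action.customer_complaint:
        return "customer-care-queue"
    if action.risk_class in escalate_risk:
        return "risk-queue"
    if action.confidence < confidence_floor:
        return "review-queue"
    return None


# Hypothetical actions covering the proceed case and two escalations.
routine = route(Action("support-bot", confidence=0.95, risk_class="low"))
unsure = route(Action("support-bot", confidence=0.50, risk_class="low"))
regulated = route(Action("support-bot", confidence=0.99, risk_class="low", regulatory_trigger=True))
```

Note that high confidence does not exempt an action from escalation: a regulatory trigger or a high risk class routes to a human regardless of the score.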

Rollback. For any agent that writes to a system of record, a defined undo procedure tested on a defined cadence. Tested means executed in production, not described in a runbook.
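As a toy model of a defined undo procedure, the sketch below journals the prior value before each write so that one agent's writes can be reversed. The store, keys, and agent names are hypothetical, and the sketch assumes no other agent wrote to the same key afterward; a real system of record would need to handle that conflict explicitly.

```python
class ReversibleStore:
    """A toy system of record that journals the prior value before each write."""

    def __init__(self):
        self.data = {}
        self._journal = []  # entries: (agent_id, key, had_key, previous_value)

    def write(self, agent_id, key, value):
        self._journal.append((agent_id, key, key in self.data, self.data.get(key)))
        self.data[key] = value

    def rollback(self, agent_id):
        """Undo every write by one agent, newest first; return how many were undone."""
        undone = 0
        remaining = []
        for entry in reversed(self._journal):
            who, key, had_key, previous = entry
            if who != agent_id:
                remaining.append(entry)
                continue
            if had_key:
                self.data[key] = previous   # restore the value that was there before
            else:
                self.data.pop(key, None)    # the key did not exist before the write
            undone += 1
        self._journal = list(reversed(remaining))
        return undone


store = ReversibleStore()
store.data["invoice-1"] = "draft"
store.write("agent-a", "invoice-1", "approved")
store.write("agent-a", "invoice-2", "approved")
store.write("agent-b", "invoice-3", "paid")
undone = store.rollback("agent-a")
```

Running a procedure like this on a schedule, against real records, is what separates a tested rollback from one that is only described.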

That is the shell. That is what production wants. The agent you put inside it is whatever the vendor shipped last week. The shell is what you are paying for, and the shell is what your competitors have not built.

The bill is coming due

In 2025, 42 percent of companies abandoned most of their AI initiatives, up from 17 percent the year before. Gartner projects 60 percent of AI projects lacking production-ready infrastructure will be abandoned through 2026. Read those numbers next to the JPMorgan figure of 500 use cases in production and you can see the bimodal economy taking shape. A small number of firms are scaling because the shell is built. A large number of firms are abandoning because the shell was never going to be built in time, and what they piloted will not survive the move regardless of how clever the demo was.

The companies on the wrong side of this curve are not behind on technology. They are behind on architecture. The agent they are piloting will be free or near free by the end of the year. The shell will still cost what it costs, and the firms that built it will be three years ahead in the only direction that matters.

Conclusion

The 95 percent failure number is not a failure of the technology. It is a failure of what was piloted. Pilots that test agents against demos cannot survive the move to production because production does not want agents against demos. Production wants registered, audited, metered, reversible work. Production wants the shell. Most companies do not have a shell, and so most pilots die at the line where the shell would have started.

The firms that broke through this year share one move. They stopped piloting agents and started piloting shells. KPMG could deploy to 276,000 people in a month because the shell was already there. JPMorgan could run 500 use cases in production because the shell was already there. The agents inside those shells will be replaced again and again. The shells will compound.

This is why architecting your AI program matters more than buying a tool. A vendor will sell you an agent. A vendor will sell you a model. A vendor will sell you a platform with a logo on it. No vendor will sell you the shell that fits your regulator, your data classification, your risk register, your finance system, and your incident response runbook. That shell has to be designed for your obligations, owned by your executives, and operated by your people. That work does not happen in a procurement cycle. It happens in a strategy engagement with hands on the architecture, the policy, and the runbook at the same time.

Agor AI Advisory builds these shells with the executives who will own them. We do not sell agents. We design the registration, identity, audit, cost, kill switch, escalation, and rollback infrastructure that lets your next ten agents reach production in weeks, and your next hundred reach production at all. We work inside your governance, your data perimeter, and your compliance posture, and we leave behind a shell you own outright.


Pilot the agent vs. pilot the shell

The two approaches side by side, across the dimensions the essay contrasts. The 88 percent pilot-failure rate is a function of what was piloted, not of the technology. The shell, not the agent, is the durable unit of work and the thing worth piloting.

A use case pilot ends in a demo memo. A shell pilot ends in a production environment.
The model is replaceable. The agent is replaceable. The shell is the durable asset.
The firms that broke through stopped graduating pilots and started promoting the shell.

Dimension	Pilot the agent	Pilot the shell	Takeaway
Unit of work	A single use case (contract review, onboarding, first-line service)	One governance surface shared by 3+ agents on the same compliance profile	Use cases are the wrong unit; the shell is reusable, the use case isn't.
Who picks it	Line-of-business sponsor with quarterly numbers to hit	General counsel, chief risk officer, CFO with seven-year obligations to keep	Owner determines time horizon, and the shell's owner is accountable when the regulator calls.
Definition of done	A metric: handle 70% of intake, cut cycle time 40%	A list: every action logged, reproducible, reversible, metered, attributable, permission-gated	A metric proves a demo; a list proves a production environment.
What the pilot ends in	A demo memo that enters the IT, security, legal, and finance grind	A production environment the next agents plug into	The memo circulates for 18 months; the environment ships agents in weeks.
What compounds	Nothing: the model is obsolete in a year, the agent retired in two	The durable asset: the next ten agents reach production in weeks, the next hundred at all	Strategy lives at the shell; the agent is replaceable inventory.

Source: Synthesis of the essay's own explicit pilot-the-agent vs pilot-the-shell contrasts; every cell is quoted or paraphrased directly from the post body. · verified · as of 2026-06-18
