Production Was the Trial

Ariel Agor

On May 13, 2026, Sinch published a study built on responses from 2,527 senior decision makers across ten countries and six industries. The headline number was the one OpenAI, Anthropic, Google, and every consulting firm with a slide deck would prefer you ignore. Seventy-four percent of enterprises that had taken a live AI customer communications agent into production had already rolled it back.

The paradox sits one layer below. The rollback rate among companies with mature guardrails climbed to eighty-one percent. Better governance did not prevent the failure. It caught it sooner. The well-run organizations were not winning the production race. They were seeing the wreckage first and choosing to stop.

This is the chart vendors never put on the slide. It is the receipt for what actually happens to generative AI business use cases six weeks after the press release goes out.

The Numbers Nobody Wants Cited

The Sinch finding is not an isolated reading. The Register ran the story under a headline that does not mince words: three-fourths of AI customer service rollouts are a letdown. Customer Experience Dive tied the rollback rate to the structural mismatch between live customer load and pilot conditions. The reporting is consistent across outlets, and the underlying study rests on a solid base: roughly two and a half thousand senior buyers, with answers segmented by guardrail maturity, industry, and infrastructure investment.

Sit with the framing for a moment. Ninety-eight percent of the same buyers still plan to grow AI investment in 2026. Seventy-six percent are redirecting that spend toward trust, security, and compliance. The field is doubling down on the layer of the stack that the rollback exposed as missing. That is a tell.

Companies are not pulling back because they decided AI cannot work. They are pulling back because the part of the system that holds a deployed AI agent together in production was never built. It was assumed.

Where the Generative AI Business Use Cases Actually Break

If you read the marketing literature for generative AI business use cases, the failure modes are absent. Vendor case studies stop at week six. Conference talks stop at the demo. The earnings call narrative stops at "agentic transformation underway." None of these accounts make it as far as month four, which is where Sinch's respondents pulled the plug.

The Sinch report lists the top causes of rollback explicitly. Customer data exposure leads at thirty-one percent. Hallucination and brand risk follow at twenty-two percent. Both are failure modes that only appear at scale, in front of real customers, under conditions a pilot deliberately suppresses.

A pilot runs on a curated dataset. A pilot is shown to friendly users. A pilot is monitored hour by hour by the team that built it. None of those conditions hold in week seventeen of a production deployment with two million monthly conversations. The cost of running a pilot, paid in cash, was the smaller cost. The hidden cost was that the pilot lied about what a deployed agent looks like.

When the failure surfaces in production, it surfaces fast. Customer data leaks into a response. The agent confabulates a refund policy that does not exist. A regulator notices. A reporter notices. A board notices. The rollback then becomes a six-hour decision, not a six-month one.

This is the structural reason mature guardrail teams pull back at a higher rate. They see the failure. Teams without mature guardrails are sitting on the same failures with no signal yet. The eighty-one percent number is what visibility looks like when the rest of the field has none. The well-instrumented teams are the only ones who know what is actually happening to their generative AI business use cases under live load.

The Klarna Receipt

Klarna is the case the industry already has to argue with, because the receipt is on the record.

Between 2022 and 2024, Klarna replaced roughly seven hundred customer service positions with an AI assistant built with OpenAI. CEO Sebastian Siemiatkowski said the assistant was doing the work of seven hundred agents, and that AI would dissolve large parts of the workforce. The story became the index case for the whole "AI replaces customer service" narrative.

By the spring of 2026, Klarna was hiring customer service agents back. Entrepreneur's reporting on the reversal is direct about why. The AI could not handle the calls that mattered. Customer satisfaction dropped on complex interactions. The projected cost savings did not fully arrive. Siemiatkowski admitted publicly that the cuts went too far.

The company did not abandon AI. It built a hybrid where the AI takes routine queries and humans take escalation, nuance, and high-value relationships. That hybrid is the architecture the original deployment skipped. It is the layer the press release ignored. It is what production demanded and the pilot hid.

Gartner read the same signal and, on February 3, 2026, forecast that fifty percent of companies that attributed headcount reduction to AI will rehire staff to perform similar work by 2027, often under different job titles. The Sinch report, published three months later, suggests the rehire wave is already underway, quietly, in the form of rollback first and headcount second.

Why the Mature Teams Pull Back First

The most counterintuitive line in the Sinch findings is the eighty-one percent rollback rate inside fully governed AI programs. The reflex reading is that governance breaks AI. That reading is wrong.

A team with mature guardrails has the telemetry to see a customer data leak the first time it happens. A team without that telemetry sees it the day a journalist calls. The first team rolls back in week eight. The second team is still in production at week thirty, accumulating a liability they cannot measure. Both have the same broken system. Only one is being honest about it.

Governance does not cause the failure. Governance is the telescope on a failure that always existed. The teams without rollbacks are flying blind, and the bill is on a delay.

The boardroom translation matters. When the CFO asks why the mature program has a higher pullback rate than the cheaper one, the right answer is that the mature program is the only one that has finished the experiment. The cheap program is still gathering data on a process that will produce a worse outcome later, in public.

What the Vendor Deck Hides

Walk through any current pitch for an enterprise AI agent. The slide order is roughly identical. Use case. Demo. Time-to-deploy. Reference customer. ROI projection. Roadmap.

Notice what is missing. There is no slide for rollback rate. There is no slide for which week the failures concentrate in. There is no slide showing how often the reference customer is still running the system. The vendor's reference customer list is a snapshot, frozen at the moment of deployment, never updated.

AIntelligenceHub's coverage of the Sinch numbers traced the gap back to a class of failure modes the demo cannot show. Novel edge cases. Identity verification under attack. Multi-turn conversations that drift. Policy queries that change between versions of the model. None of these surface in a guided demo. All of them surface in week six.

The failure class has a name in the engineering literature. The failures are emergent. They appear at scale because scale is what generates them. A pilot with a thousand conversations does not stress the long tail. A production deployment with two million conversations finds the long tail in two weeks.
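A rough way to see the arithmetic is sketched below. It assumes a hypothetical failure that occurs once per hundred thousand conversations and treats conversations as independent. Both are illustrative assumptions, not figures from the Sinch report.

```python
import math


def p_at_least_one(rate: float, conversations: int) -> float:
    """Chance that a failure with the given per-conversation rate appears
    at least once, assuming conversations are independent."""
    # Equivalent to 1 - (1 - rate) ** conversations, written with
    # log1p/expm1 so very small rates do not lose precision.
    return -math.expm1(conversations * math.log1p(-rate))


# Hypothetical rate: one bad response per 100,000 conversations.
ASSUMED_RATE = 1e-5

pilot = p_at_least_one(ASSUMED_RATE, 1_000)
production = p_at_least_one(ASSUMED_RATE, 2_000_000)
print(f"pilot: {pilot:.4f}  production: {production:.4f}")
```

Under those assumptions, a thousand-conversation pilot has about a one percent chance of ever showing the failure, while two million conversations make it a near certainty.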

The vendor cannot price what the vendor cannot see. The buyer cannot reject what the buyer was never shown. The contract gets signed on a fiction, and the fiction breaks on first contact with the customer.

The Cost Model Quietly Inverts

A second hidden curve runs underneath the rollback data. Most of the cost of a deployed agent comes from keeping it deployed, not from inference.

A pilot has a fixed annual budget. A production deployment has a continuous cost. New incidents require new evals. Each new policy requires a new red-team pass. Each model upgrade resets the previous validation. Each customer data exposure event triggers a legal review, a notification process, possibly a regulatory filing under GDPR or California's CPRA.

Inference at GPT-class prices is close to free per query. Maintenance is not. The vendor priced the cheap part. The expensive part is the one they did not quote.

This is why seventy-six percent of Sinch's respondents are redirecting new spend toward trust, security, and compliance. They learned the cost shape from the rollback. The first deployment cost them the headline. The next one will cost them the architecture.

A Production AI Is Not a Pilot With More Users

The deepest misread in the current enterprise AI buying cycle is the assumption that a production AI agent is the same agent that ran in pilot, simply with the access controls removed. This is wrong at the level of physics, not engineering.

A pilot is bounded. It has known inputs, known users, known queries, known outcomes. A production agent is unbounded. Its input distribution is whatever a customer types at three in the morning. Its query space is whatever a competitor probes with a red-team script. Its outcome distribution is whatever a regulator audits a year later. The two systems are different objects.

The vendor sells the bounded one and tells the buyer to deploy it as the unbounded one. The rollback is the moment the assumption breaks. The team learns that what they bought was a demonstration, and what they needed was a process.

The architecture for the unbounded system is harder, and the architecture is the actual product. It includes a labeling pipeline, an eval suite that runs continuously against production traffic, a guardrail layer that owns refusal behavior, a logging system that captures the failure before the customer does, a human-in-the-loop for the long tail, and a rollback plan that is rehearsed, not improvised. None of those line items appear on the vendor invoice. All of them are required to keep the system in production past month three.

A team that builds these layers can run a deployed AI agent at month four with the same confidence they ran the pilot at week six. A team that skips them runs the agent until the failure surfaces, then rolls back. The Sinch data is what skipping looks like at the population scale.

The Architecture of Surviving Month Four

The generative AI business use cases that survive are the ones built on the assumption that production is the actual trial. They share a small number of structural features.

The first is a continuous eval loop. Sample real production traffic, label it against ground truth or against a stricter model, and grade the deployed agent every week. The grade feeds the kill switch decision, not the executive deck. When the grade falls below a threshold, the agent gets pulled back to a narrower scope automatically, before a customer-facing incident forces the issue.
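The loop described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `Conversation`, `weekly_grade`, `scope_decision`, the grader callback, and the 0.98 threshold are all hypothetical names and values chosen for the example.

```python
import random
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Conversation:
    conversation_id: str
    agent_reply: str


def weekly_grade(
    traffic: Sequence[Conversation],
    grader: Callable[[Conversation], bool],
    sample_size: int,
    seed: int = 0,
) -> float:
    """Grade a random sample of production traffic; returns the pass rate."""
    if not traffic:
        raise ValueError("no production traffic to sample")
    rng = random.Random(seed)
    k = min(sample_size, len(traffic))
    sample = rng.sample(list(traffic), k)
    # The grader stands in for ground-truth labels or a stricter model.
    passed = sum(1 for conversation in sample if grader(conversation))
    return passed / k


def scope_decision(grade: float, threshold: float = 0.98) -> str:
    """Feed the grade into the kill-switch decision (threshold is assumed)."""
    return "keep_scope" if grade >= threshold else "narrow_scope"
```

The design point is that `scope_decision` takes only the grade as input, so the narrowing happens mechanically and does not wait for someone to read a report.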

The second is a tiered scope. The agent runs at restricted scope first, gradually widening. Each widening triggers a new eval. The team owns the widening decision. The vendor does not.
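One way to picture the tiered scope is as a small state machine. The tier names and the `next_tier` function below are hypothetical, and the rule that a failed eval steps the agent back one tier is an assumption made for the sketch.

```python
# Hypothetical scope tiers, ordered from narrowest to widest.
TIERS = ("faq_only", "order_status", "billing_questions", "full_support")


def next_tier(current: str, eval_passed: bool, owner_approved: bool) -> str:
    """Return the scope the agent should run at next.

    Widening needs both a passing eval and approval from the team that
    owns the agent; a failed eval steps the agent back one tier.
    """
    index = TIERS.index(current)
    if not eval_passed:
        return TIERS[max(index - 1, 0)]
    if owner_approved and index + 1 < len(TIERS):
        return TIERS[index + 1]
    return current
```

Requiring `owner_approved` as a separate input keeps the widening decision with the team, even when the eval passes.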

The third is a recovery path. A clear, rehearsed path back to the human-only system. The team has run the rollback in staging. The customer-facing language is pre-written. The legal notification template is in the runbook. When the rollback decision happens, it takes hours, not weeks. The Klarna reversal cost Klarna its narrative because the rollback was improvised. A planned rollback costs much less.
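A rehearsed recovery path can be checked mechanically before each scope change. The sketch below is illustrative: `RollbackRunbook` and `missing_items` are hypothetical names, and the three fields simply mirror the items listed above.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RollbackRunbook:
    rehearsed_in_staging: bool
    customer_message_drafted: bool
    legal_template_ready: bool


def missing_items(runbook: RollbackRunbook) -> list[str]:
    """List the runbook items that are not yet in place."""
    return [f.name for f in fields(runbook) if not getattr(runbook, f.name)]


def ready_to_ship(runbook: RollbackRunbook) -> bool:
    """The agent ships only when nothing is missing from the runbook."""
    return not missing_items(runbook)
```

A check like this is trivial, but it turns "we have a rollback plan" into something a release gate can verify.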

The fourth is a clear ownership line. One person, named, accountable for the production agent's behavior. The vendor's accountability ends at the API contract. The buyer's accountability ends at the customer.

The fifth, and most overlooked, is honesty about the failure rate. The team plans for the rollback the way an airline pilot plans for a missed approach. It is a procedure. Most of the failure modes are predictable, and the architecture pre-commits to a response.

A program built this way still might roll back. The Sinch number suggests it probably will. But it rolls back fast, with a known surface, and with the next architecture already in motion. It does not lose the customer. It does not lose the board's trust. It does not become a Klarna headline.

The teams that buy a tool and ship it do not have this architecture. They have a vendor relationship and an SLA they cannot enforce. The rollback for them is the discovery event. By the time they understand what they bought, the agent is already on a customer call, hallucinating a refund.

What the Spend Pattern Says

The most useful number in the Sinch dataset is not the seventy-four percent or the eighty-one percent. It is the seventy-six percent. Roughly three quarters of the surveyed buyers are now redirecting fresh spend toward trust, security, and compliance.

That is the architecture line item arriving on the budget, paid for in retrospect. The buyers are not buying more model. They are buying the part of the system that should have been built before the first deployment. They are paying for the missing layer at a premium, in crisis conditions, with a customer-facing incident on the record.

The lesson is portable. The architecture costs less when it is built first. The vendor will not quote it. The model will not provide it. It has to be designed for the company that runs it, with the policies that company is actually obligated to follow, against the customers that company actually serves, under the regulators that company actually answers to.

The headline rollback rate will keep climbing as more deployments cross month four. Most of them were sold on the use case and never priced the rest. The companies that survive the year will be the ones that priced it before the deployment.

Where Agor AI Advisory Comes In

Production AI is an architecture problem, not a procurement problem. The shape of the system that survives month four is the part the vendor cannot sell, because it depends on the company's own policies, customers, and regulators.

This is the work. Most of the AI rollouts that get rolled back skipped it. The buyers thought they were buying a deployed agent. They were buying a pilot with the documentation removed. The Sinch report is the bill, and Klarna is the receipt.

If your team is on the run-up to a production AI deployment, the decision that matters happens before the contract is signed. Not the model choice. Not the vendor. The architecture that surrounds the model and turns it into a system you can defend at week seventeen. Building it after the rollback costs far more than building it before deployment. Skipping it costs the brand.

Architect the production layer. Own the eval pipeline. Plan the rollback before you ship. Hire a partner who has read the receipts and will design the system to survive them.

Sources