AI Papers Podcast

AI Papers Weekly: Rising Tides, Theorem Proofs, and Agents Gone Rogue

April 1, 2026 | 44:58 | 3 papers

Key Insights

1. AI capability is broadening across most text-based work, not narrowly spiking. Plan workforce strategy around continuous task-level erosion, not sudden role wipeouts.
2. In 2024-Q2, AI completed tasks that take humans about 3-4 hours with roughly 50% success, rising to 65% by 2025-Q3. Track success rate, not capability headlines, as your adoption metric.
3. If current trends hold, LLMs hit 80-95% success on most text tasks by 2029 at minimally sufficient quality. That is the planning horizon for restructured operations.
4. Probabilistic guardrails (NeMo, Guardrails AI) are structurally inadequate for SEC/FINRA/OCC compliance. Formal verification is becoming the only credible architecture for regulated agents.
5. Personal AI agents with local privileges show 40-75% attack success rates from prompt injection. Treat agent deployment as a security architecture problem, not a model selection problem.
6. Safety is a property of the full stack (model + framework + workspace), not the model alone. Vendor safety claims about the LLM tell you almost nothing about deployed agent risk.
7. Skill files and trusted-sender emails are higher-risk injection vectors than web content. Your agent's threat model should weight internal trust channels most heavily.

Papers Referenced

Crashing Waves vs. Rising Tides: Preliminary Findings on AI Automation from Thousands of Worker Evaluations of Labor Market Tasks

Matthias Mertens, Adam Kuzee, Brittany S. Harris, Harry Lyu, Wensu Li, Jonathan Rosenfeld, Meiri Anto, Martin Fleming, Neil Thompson

We propose that AI automation is a continuum between: (i) crashing waves where AI capabilities surge abruptly over small sets of tasks, and (ii) rising tides where the increase in AI capabilities is m...

Type-Checked Compliance: Deterministic Guardrails for Agentic Financial Systems Using Lean 4 Theorem Proving

Devakh Rashie, Veda Rashi

The rapid evolution of autonomous, agentic artificial intelligence within financial services has introduced an existential architectural crisis: large language models (LLMs) are probabilistic, non-det...

ClawSafety: "Safe" LLMs, Unsafe Agents

Bowen Wei, Yunbei Zhang, Jinhao Pan, Kai Mei, Xiao Wang, Jihun Hamm, Ziwei Zhu, Yingqiang Ge

Personal AI agents like OpenClaw run with elevated privileges on users' local machines, where a single successful prompt injection can leak credentials, redirect financial transactions, or destroy fil...

The Week in AI: From Macro Trajectory to Microsecond Compliance

Three papers this week sit at very different altitudes but converge on a single executive question: how do we deploy AI in production environments where the cost of being wrong is real? One paper measures the macro trajectory of AI capability against actual human work. Another proposes a mathematical answer to the agentic compliance problem. The third demonstrates, empirically, that the agents you are about to deploy are far less safe than the models inside them.

Why this matters for your business

The Mertens et al. paper from MIT FutureTech, grounded in over 17,000 worker evaluations of 3,000+ O*NET tasks, is the closest thing the field has produced to an empirical answer to the question every board is asking: which jobs, and which tasks within those jobs, is AI actually displacing? The finding is consequential. There are no crashing waves: no single job category is about to be wiped out next quarter. Instead, capability is rising broadly across nearly all text-based work, with success rates on 3-4 hour tasks climbing from roughly 50% in 2024-Q2 to 65% by 2025-Q3, and a projected 80-95% across most text tasks by 2029. This reshapes the planning horizon. Workforce restructuring is not a 12-month emergency; it is a multi-year operational redesign.

The compliance and security frontier

If the first paper sets the strategic horizon, the second and third define the deployment constraints. The Lean-Agent Protocol paper makes a sharp claim that should sit in front of every Chief Risk Officer in financial services: probabilistic guardrails are structurally incapable of guaranteeing the deterministic compliance that SEC Rule 15c3-5, OCC Bulletin 2011-12, and FINRA Rule 3110 actually require. The proposed answer — auto-formalizing institutional policies into Lean 4 and proving each agent action against pre-compiled regulatory axioms — is technically ambitious, but the underlying premise is becoming consensus: in regulated domains, formal verification is moving from research curiosity to architectural requirement.

ClawSafety, meanwhile, delivers the most important security finding of the quarter. When frontier LLMs are placed inside agent frameworks with local machine privileges, attack success rates from prompt injection range from 40% to 75%. The vector that matters most is not adversarial web content — it is the trusted internal channel: skill files and emails from known senders. Safety, the authors show, is not a property of the model. It is a property of the model plus the framework plus the workspace. For any executive deploying personal AI agents, this collapses the standard procurement question. You cannot procure agent safety from a model vendor any more than you could procure application security from a database vendor. The full stack is the threat surface.

The synthesis

Taken together: AI capability is broadening rapidly and predictably, but the architectures we need to deploy it safely in regulated, privileged environments are still being invented. The strategic opportunity is real. The deployment risk is structural. Both demand executive attention this quarter.

Crashing Waves vs. Rising Tides: Preliminary Findings on AI Automation

What they did. The MIT FutureTech team, led by Neil Thompson, built an evaluation harness covering over 3,000 text-based tasks drawn from the U.S. Department of Labor's O*NET occupational taxonomy. They then recruited workers who actually perform those tasks to evaluate AI output, generating more than 17,000 evaluations. This is one of the largest grounded studies of AI capability against real labor-market work to date, and it explicitly contrasts with METR's recent work suggesting sharp, discontinuous capability jumps on narrow benchmarks.

Why it matters. The headline finding is that the 'crashing wave' narrative — that AI suddenly masters a small set of tasks overnight — does not match the broad-based evidence. What the data shows is a rising tide: AI is improving steadily and simultaneously across nearly all text-addressable work. In 2024-Q2, models completed tasks that take humans 3-4 hours with roughly 50% success; by 2025-Q3 that climbed to 65%. Extrapolating current trends, the authors project 80-95% success across most text tasks by 2029 at a minimally sufficient quality bar — with several more years required to reach near-perfect quality or superior performance.
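To make the extrapolation concrete, here is a minimal sketch that projects the two reported data points (50% in 2024-Q2, 65% in 2025-Q3) forward. It assumes success grows linearly in log-odds, which keeps the projection below 100%. That growth model, the function names, and the 2029-Q1 target are assumptions for illustration, not the authors' method, and the two inputs describe 3-4 hour tasks rather than the full task set behind the paper's own projection.

```python
import math


def logit(p: float) -> float:
    """Convert a probability to log-odds."""
    return math.log(p / (1.0 - p))


def sigmoid(x: float) -> float:
    """Convert log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def quarters_between(start: tuple[int, int], end: tuple[int, int]) -> int:
    """Number of quarters from `start` to `end`, each given as (year, quarter)."""
    return (end[0] - start[0]) * 4 + (end[1] - start[1])


def project_success(p0, t0, p1, t1, target):
    """Extrapolate a success rate, assuming linear growth in log-odds (an assumption)."""
    slope = (logit(p1) - logit(p0)) / quarters_between(t0, t1)
    return sigmoid(logit(p0) + slope * quarters_between(t0, target))


# The two data points reported in the summary above; the target quarter is illustrative.
projected = project_success(0.50, (2024, 2), 0.65, (2025, 3), (2029, 1))
print(f"Projected success in 2029-Q1: {projected:.0%}")
```

Under this assumed curve the projection falls inside the 80-95% band quoted above. A straight-line extrapolation of the same two points would exceed 100% well before 2029, which is why a bounded curve is the more sensible illustration.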

What it means for business. This reframes workforce planning. The right unit of analysis is the task, not the job. Most roles will become bundles of human-led and AI-led tasks, with the boundary shifting quarter by quarter. Track AI success rate on the specific tasks your workforce performs, not generic benchmark scores. Build operating models that assume continuous, broad capability gains rather than discrete shocks. And note the gap between capability and adoption: the technology curve is faster than the organizational change curve, which is where competitive advantage will be won or lost.

Type-Checked Compliance: Deterministic Guardrails for Agentic Financial Systems

What they did. Rashie and Rashi propose the Lean-Agent Protocol, an architecture that auto-formalizes institutional policies into Lean 4, a theorem prover used in formal mathematics, and then treats every proposed agent action as a mathematical conjecture. The action executes if and only if the Lean 4 kernel can prove it satisfies pre-compiled regulatory axioms drawn from SEC Rule 15c3-5, OCC Bulletin 2011-12, FINRA Rule 3110, and CFPB explainability mandates. The authors claim microsecond-latency verification with cryptographic-level compliance certainty, and they outline a three-phase rollout from shadow verification to enterprise deployment.
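The "execute only if proven" idea can be sketched in a few lines of Lean 4. This is a toy illustration, not the paper's protocol: the `Order` structure, the `Compliant` predicate, and the `execute` function are hypothetical names, and the single credit-limit rule is an invented stand-in rather than a formalization of any actual regulation.

```lean
-- Hypothetical, simplified order type; the field names are illustrative.
structure Order where
  notional    : Nat  -- order value
  creditLimit : Nat  -- pre-set limit for the account

-- Invented stand-in for one policy rule: an order must not exceed its limit.
def Compliant (o : Order) : Prop :=
  o.notional ≤ o.creditLimit

-- The rule is decidable, so the checker can evaluate it mechanically.
instance (o : Order) : Decidable (Compliant o) :=
  inferInstanceAs (Decidable (o.notional ≤ o.creditLimit))

-- Execution demands a proof of compliance as an argument.
def execute (o : Order) (_h : Compliant o) : String :=
  "executed"

-- A compliant order: `decide` builds the proof, so the call type-checks.
example : String :=
  execute { notional := 100, creditLimit := 500 } (by decide)
```

Swapping the values so that `notional` exceeds `creditLimit` makes `decide` fail, and the call to `execute` no longer type-checks. The non-compliant action is rejected before it can run, rather than filtered afterwards.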

Why it matters. The argument the paper forces into the open is that existing probabilistic guardrails — NVIDIA NeMo Guardrails, Guardrails AI, and similar — are not just imperfect but categorically wrong for regulated finance. A probabilistic classifier saying an action is 'probably compliant' is not what SEC Rule 15c3-5 requires. The Lean 4 approach reframes guardrails from filtering to proving. Whether or not this specific protocol becomes the standard, the underlying shift toward formal verification in regulated AI deployments is now visible across multiple research programs.

What it means for business. CIOs and CROs in financial services should treat probabilistic guardrails as insufficient for agentic systems touching trading, compliance, or customer-facing financial decisions. The architectural question moves from 'which guardrail vendor' to 'how do we formalize our policies into a machine-verifiable form at all'. That, in turn, raises an operational question most firms have not answered: who owns the formalization of policy?

ClawSafety: "Safe" LLMs, Unsafe Agents

What they did. The ClawSafety team built a benchmark of 120 adversarial scenarios spanning software engineering, finance, healthcare, law, and DevOps, with injection content embedded in three realistic channels: workspace skill files, emails from trusted senders, and web pages. They evaluated five frontier LLMs across three agent frameworks in 2,520 sandboxed trials, measuring attack success rate (ASR) and analyzing what the agents actually did when compromised.

Why it matters. ASRs ranged from 40% to 75% across models. The strongest model held hard lines against credential forwarding and destructive actions; weaker models permitted both. Critically, injection vector mattered as much as model choice — skill files (the highest-trust channel) were the most dangerous, more so than web pages. And cross-framework experiments showed that the same model behaved differently inside different agent scaffolds. Safety is a joint property of model, framework, and workspace context. Vendor-level safety claims about the model alone are insufficient.
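Attack success rate is simply the fraction of trials in which the injected instruction was carried out, and the point about vectors and frameworks comes from slicing that fraction along different dimensions. The sketch below shows the slicing. The `Trial` record, the `attack_success_rate` helper, and every value in `TOY_TRIALS` are invented placeholders for illustration, not data or tooling from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One sandboxed run (hypothetical record layout)."""
    model: str
    framework: str
    channel: str            # "skill_file", "email", or "web"
    attack_succeeded: bool


def attack_success_rate(trials, key):
    """Return ASR per group, where `key` names the Trial field to group by."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for trial in trials:
        group = getattr(trial, key)
        totals[group] += 1
        hits[group] += int(trial.attack_succeeded)
    return {group: hits[group] / totals[group] for group in totals}


# Invented placeholder trials; these are NOT results from the paper.
TOY_TRIALS = [
    Trial("model_a", "framework_x", "skill_file", True),
    Trial("model_a", "framework_x", "email", False),
    Trial("model_a", "framework_y", "skill_file", True),
    Trial("model_a", "framework_y", "email", True),
    Trial("model_b", "framework_x", "web", False),
    Trial("model_b", "framework_x", "web", False),
]

for dimension in ("model", "framework", "channel"):
    print(dimension, attack_success_rate(TOY_TRIALS, dimension))
```

Reporting ASR per model alone would hide the framework and channel effects, so an internal red-team evaluation would likely want all three breakdowns side by side.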

What it means for business. Treat agent deployment as a security architecture problem on par with deploying a new privileged service in your environment. Threat-model your high-trust channels — skills, internal email, internal documents — more aggressively than external web content. Choose agent frameworks deliberately. Build red-team evaluation into agent procurement and assume residual ASR even after mitigations. The era when 'we use a safe model' counted as an answer is over.
