Quadratic Was a Choice

Ariel Agor

For three years, every enterprise AI roadmap, every PE-backed deployment company, every model-vendor lock-in strategy has been built on a single quiet assumption: that the cost of paying attention scales with the square of how much you read. Double the document, quadruple the compute. The whole industry has come to treat this as physics.

On May 5, 2026, a Miami startup called Subquadratic launched SubQ 1M-Preview with $29 million in seed funding and a flat claim that the math everyone has been working around is, in fact, optional. Their architecture, which they call Subquadratic Sparse Attention or SSA, ships a model with a 12-million-token context window and reports an attention-compute reduction of close to 1,000x at that scale versus today's frontier transformers. The benchmarks in their technical post are striking: a 7.2x prefill speedup over dense attention at 128,000 tokens, 52.2x at 1 million tokens.
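
One way to read the two reported speedups together is to ask what growth rate they imply. The sketch below fits a power law through the two data points. It assumes "1 million" means 1,000,000 tokens and that the dense baseline is dominated by n-squared attention, which real prefill is not, since other layers scale linearly. The function name is hypothetical, and the inputs are the company's own unreplicated figures, so treat the result as a rough reading rather than a measurement.

```python
import math


def implied_speedup_exponent(n_small: int, speedup_small: float,
                             n_large: int, speedup_large: float) -> float:
    """Fit speedup ~ n**k through two reported points and return k."""
    return math.log(speedup_large / speedup_small) / math.log(n_large / n_small)


# Figures as reported in the company's technical post (not independently verified).
k = implied_speedup_exponent(128_000, 7.2, 1_000_000, 52.2)

# ASSUMPTION: if the dense baseline cost grew as n**2, a speedup growing as n**k
# would mean the new method's cost grows as n**(2 - k).
implied_cost_exponent = 2 - k

print(f"speedup exponent k = {k:.2f}")
print(f"implied cost exponent = {implied_cost_exponent:.2f}")
```

Under those assumptions the speedup grows a little slower than the context length itself, which is what a close-to-linear cost curve would look like. Two points cannot confirm a curve, and single-run benchmarks cannot confirm the points.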

The AI research community split immediately. The technical report is not public. The weights are not open. Some of the benchmarks were single runs, justified by the company on cost grounds. Will Depue, a prominent AI engineer who has worked on long-context systems, said publicly that SubQ is "almost surely a sparse attention finetune of Kimi or DeepSeek," meaning the architectural breakthrough may be a careful engineering wrapper around someone else's open weights rather than a clean-room invention. VentureBeat ran the headline "researchers demand independent proof." A LessWrong post titled "Debunking claims about subquadratic attention" went up within forty-eight hours.

So this might be vapor. It also might be the first viable commercial break from the quadratic regime. Either way, the strategic question the announcement raises does not depend on whether SubQ ships. It depends on whether the buyer has noticed that their entire AI strategy is built on a substrate that was always an engineering choice.

The substrate everyone forgot was a choice

Attention, in the original 2017 transformer paper, was defined as a comparison of every token in the input to every other token in the input. That gives you the n-squared scaling that has defined the economics of large language models ever since. Every doubling of context costs four times the attention compute. In a naive implementation, every doubling of document size also costs four times the memory for the attention matrix. The 200,000-token context windows that Anthropic and Google now sell are expensive because the underlying arithmetic is expensive in a literal mathematical sense.
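
The doubling arithmetic is easy to check. This is a minimal sketch that counts only token-to-token comparisons and ignores attention heads, hidden dimensions, constant factors, and kernel optimizations. The linear function stands in for a hypothetical mechanism that does fixed work per token; it is not a description of any shipping model.

```python
def dense_attention_pairs(n_tokens: int) -> int:
    """Comparisons when every token is compared with every token (n squared)."""
    return n_tokens * n_tokens


def linear_attention_units(n_tokens: int, work_per_token: int = 1) -> int:
    """HYPOTHETICAL linear-cost mechanism: a fixed amount of work per token."""
    return n_tokens * work_per_token


for n in (100_000, 200_000, 400_000):
    print(f"{n:>8} tokens: dense={dense_attention_pairs(n):>16,} "
          f"linear={linear_attention_units(n):>8,}")
```

Each doubling of the input multiplies the dense count by four and the linear count by two. The gap between the two columns is the cost the rest of the stack has been designed around.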

For most of the industry, this stopped looking like a choice years ago. It started looking like the floor. The deployment-company moves last week from OpenAI and Anthropic, which I wrote about yesterday, assume the floor. The 17.5% guaranteed return that TPG, Brookfield, Advent, and Bain Capital negotiated against The Development Company assumes the floor. The five gigawatts of compute Anthropic announced with Google and Broadcom in early May assumes the floor. The reason a forward-deployed engineer matters is that there is enough complexity in stitching a quadratic model into a real workflow to justify a six-figure salary and a stake in the lab.

If the floor moves, the entire stack above it tilts.

A model whose attention cost grows linearly with input length, not quadratically, would change the unit economics of every enterprise AI system on the planet. The 12-million-token context becomes interesting not as a benchmark stunt but as a way of saying that your full ServiceNow log, or your entire SAP instance, or every customer email from the last decade, can be presented to the model as a single in-context document. The vector database tier thins out. The RAG plumbing simplifies. The latency budget shifts. The reason you needed an engineer to design retrieval and chunking and re-ranking starts to fade, because retrieval and chunking and re-ranking were the things you did to avoid paying for quadratic attention.

That entire layer of complexity, and the consultancy revenue attached to it, exists because of n-squared.

What the labs do not want to test

The five-gigawatt buildouts and the four-billion-dollar deployment vehicles have been pitched to investors with one story: AI workloads will grow faster than compute supply, the labs that own the largest compute pool will win, and the customers will be locked in by integration depth. The math holds if attention stays quadratic and context windows are the binding constraint.

The math does not hold if a different attention math wins.

OpenAI and Anthropic are perfectly aware of this. Both labs have published research into sparse attention, mixture-of-experts routing, and linear-attention variants. Both have made calculated bets that hybrid architectures or careful engineering can extend the useful life of the quadratic regime long enough to amortize the capex. The bet says quadratic is good enough until the moats are dug, and once the moats are dug, the substrate underneath does not matter, because the customer is captive.

A real subquadratic competitor breaks that thesis. Not by being a better model. By being a cheaper substrate. The largest customers will run the comparison the moment a credible vendor lets them. If a model with 80% of the capability runs at 5% of the inference cost on long-context tasks, the procurement decision answers itself. The customer that signed a multi-year managed-deployment contract with The Development Company now has a question: when my model is replaced by one that is a thousand times cheaper per attention operation, is the forward-deployed engineer still the right person to redesign my prompts, my agent topology, and my evaluation harness?

The honest answer is no. The honest answer is that the engineer was the right person for the quadratic world, and they have to re-justify their seat in the linear world.

The credibility gap

SubQ has not made it easy to take the claim at face value. The Subquadratic team published a technical post, not a paper. They are running benchmarks on internal infrastructure with limited replication. The model is available as a preview, not as open weights. Felloai, eWeek, DataCamp, and Codiste have all published explainers; none of them have run the model against frontier transformers on standard evals at comparable budgets.

The skeptics have a strong case from history. Mamba, the state-space-model architecture out of Princeton and CMU, was the last big subquadratic story two years ago, and at frontier scale it ended up performing worse than well-tuned transformers on most benchmarks that matter to enterprise buyers. RWKV had a similar arc. DeepSeek Sparse Attention, which actually shipped, was used in hybrid configurations because pure sparse routing degraded quality on tasks that depend on dense cross-token relationships. Kimi Linear from Moonshot followed the same hybrid path. Every prior attempt to step away from n-squared has either underperformed at scale or collapsed back into a partial transformer hidden under a different name.

So Will Depue's skepticism has weight. If SubQ turns out to be a sparse-attention finetune wrapped around Kimi weights or DeepSeek weights, the 1,000x number may still be real, but the architecture is not novel and the comparative advantage is much smaller than the headline suggests. In that case the 12-million-token context is a marketing artifact; the practical performance on long-range reasoning is what matters, and that has not been independently measured.

This is the right way to read the announcement. Not as a settled architectural victory, and not as a press cycle to ignore. As the first credible signal that the quadratic regime is contestable on commercial terms.

Why this is the buyer's problem, not the vendor's

Most enterprise readers will look at SubQ and decide it is too early. The model is a preview. The team is small. The benchmarks are disputed. The right move, they will conclude, is to wait for independent replication, watch which lab announces a credible response, and let the dust settle.

That is exactly the wrong instinct.

The reason it is wrong is that the deployment contracts being signed right now run for three to five years. The PE-backed managed-deployment vehicles that announced themselves between May 4 and May 13 are designed to lock in customers across that window. The architectural decisions a forward-deployed engineer makes in the first 90 days of an engagement become the abstraction layer your company depends on for the rest of the contract. If those decisions assume quadratic attention, vector databases, aggressive chunking, careful retrieval design, and 200,000-token effective context, your operations get bent around those assumptions.

When the substrate shifts, you have a choice. You can rebuild the production system to take advantage of cheaper long-context, which means unwinding the work the lab's engineer just baked in. Or you can keep paying for the old architecture because the cost of migration is now larger than the savings on inference. The lab knows this. The lab is selling you the second option.
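
The trade-off between migration cost and inference savings can be put in numbers. The sketch below is a simple payback calculation under stated assumptions: every figure is hypothetical, the 5% cost fraction is borrowed from the illustrative comparison earlier in this piece, and the model ignores contractual minimum commitments, discounting, and any quality difference between models.

```python
from dataclasses import dataclass


@dataclass
class MigrationCase:
    annual_inference_spend: float  # yearly spend on the current substrate
    new_cost_fraction: float       # new substrate cost as a fraction of old (0.05 = 5%)
    migration_cost: float          # one-time cost to unwind and rebuild
    years_left_on_contract: float

    def annual_savings(self) -> float:
        return self.annual_inference_spend * (1.0 - self.new_cost_fraction)

    def payback_years(self) -> float:
        savings = self.annual_savings()
        return float("inf") if savings <= 0 else self.migration_cost / savings

    def worth_migrating(self) -> bool:
        # Migration pays off only if it is recovered before the contract ends.
        return self.payback_years() < self.years_left_on_contract


# HYPOTHETICAL numbers: same workload, two different amounts of baked-in architecture.
light_lock_in = MigrationCase(2_000_000, 0.05, 3_000_000, 4)
heavy_lock_in = MigrationCase(2_000_000, 0.05, 9_000_000, 4)

for name, case in (("light", light_lock_in), ("heavy", heavy_lock_in)):
    print(f"{name}: payback {case.payback_years():.2f} years, "
          f"migrate={case.worth_migrating()}")
```

With identical inference savings, the only variable that flips the answer is how much substrate-specific work has to be unwound. That variable is set in the first 90 days of the engagement.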

The buyer who refuses to think about substrate change is the buyer who pays for quadratic attention through 2030, including the years when the rest of the market has moved on.

The independence test

There is a clean test for whether your AI architecture is independent of substrate. Ask whoever designed it the following question: if a model with linear attention and 10x lower inference cost on long documents launched next quarter, how many of the abstractions in our current production system would have to change?

If the answer is "very few, because we routed everything through a model-agnostic API layer that lets us switch the underlying model with a config change," you are protected. If the answer involves a list of vector store migrations, chunking strategy rewrites, retrieval pipeline rebuilds, and prompt-template adaptations, you are exposed. You have an architecture that assumed the substrate.

Most enterprise AI deployments today fall in the second category. Not because the engineers building them are careless, but because the cheapest path to a working system in 2024 and 2025 was to build directly on top of OpenAI's or Anthropic's quadratic attention, accept the context limits, and design retrieval to compensate. That was the rational decision at the time. It is no longer the rational decision, because the cost of optionality has fallen and the probability that the substrate moves has risen.

The lab's deployment engineer will not redesign your system to be substrate-agnostic. Their compensation, their stock, and their roadmap all run through one specific substrate. They are not paid to give you optionality across substrates. They are paid to deepen your dependence on the substrate their company sells.

An independent architecture advisor will redesign your system to be substrate-agnostic, because that is the only design that survives the next decade of model churn. SubQ may or may not be the model that shifts the regime. Something will. Linear-attention architectures have been a publishable research result for five years; the gap between publishable and shippable is closing every quarter, and the moment one credible commercial subquadratic model proves its quality at frontier capability, the procurement teams at every Fortune 500 will run the comparison.

What credible buyers are already doing

The buyers I work with who have figured this out are doing four specific things this quarter.

First, they are auditing every place in their production AI stack where the architecture assumes a specific attention cost curve. Anywhere they have decided "we can only fit X documents into the prompt" or "we need to chunk to Y tokens" or "we need to summarize before passing to the model" deserves a note. Those are substrate-dependent decisions. They become technical debt the day a different substrate ships.
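
An audit like this can be as plain as a register of where each assumption lives. The sketch below is one possible shape for that register, not a standard. The three categories come from the examples in the paragraph above; the class names, component names, and notes are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class Assumption(Enum):
    CONTEXT_LIMIT = "we can only fit X documents into the prompt"
    CHUNK_SIZE = "we need to chunk to Y tokens"
    PRE_SUMMARIZATION = "we need to summarize before passing to the model"


@dataclass(frozen=True)
class AuditEntry:
    component: str          # where in the production stack the decision lives
    assumption: Assumption  # which cost-curve assumption it encodes
    note: str               # why the decision was made at the time


def count_by_assumption(entries: Iterable[AuditEntry]) -> Counter:
    """Tally how many components depend on each substrate assumption."""
    return Counter(entry.assumption for entry in entries)


# HYPOTHETICAL entries for illustration only.
register = [
    AuditEntry("support-ticket-search", Assumption.CHUNK_SIZE, "retrieval tuned to chunk length"),
    AuditEntry("contract-review-agent", Assumption.CONTEXT_LIMIT, "prompt capped at a fixed document count"),
    AuditEntry("incident-digest", Assumption.PRE_SUMMARIZATION, "logs summarized to fit the window"),
    AuditEntry("policy-qa", Assumption.CHUNK_SIZE, "chunk size fixed in the indexing job"),
]

for assumption, count in count_by_assumption(register).items():
    print(f"{count} component(s): {assumption.value}")
```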

Second, they are demanding that every production AI call go through an internal routing layer that owns the model selection, the prompt templating, the tool registration, and the retry logic. The model vendor sees a request shaped by the customer's own abstraction, not the customer's data structured for the vendor's API. This sounds like a small thing. It is the difference between paying the switching cost once now and paying it many times over the next five years.
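
A minimal sketch of such a routing layer follows, assuming each model can be wrapped as a plain function from prompt to text. The `Router` class and the stub backends are hypothetical and call no real vendor API. Tool registration is left out to keep the example short.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional

ModelFn = Callable[[str], str]  # ASSUMPTION: a backend is a prompt-in, text-out function


@dataclass
class Router:
    backends: Dict[str, ModelFn]  # backend name -> callable wrapping a vendor
    templates: Dict[str, str]     # task name -> prompt template owned by the buyer
    routes: Dict[str, str]        # task name -> backend name (this is the config)
    max_retries: int = 2
    backoff_seconds: float = 0.0

    def call(self, task: str, **fields: str) -> str:
        prompt = self.templates[task].format(**fields)
        backend = self.backends[self.routes[task]]
        last_error: Optional[Exception] = None
        for attempt in range(self.max_retries + 1):
            try:
                return backend(prompt)
            except Exception as exc:  # a real layer would catch narrower errors
                last_error = exc
                if attempt < self.max_retries:
                    time.sleep(self.backoff_seconds * (attempt + 1))
        raise RuntimeError(f"all attempts failed for task {task!r}") from last_error


# Stub backends standing in for two vendors.
def quadratic_stub(prompt: str) -> str:
    return f"[quadratic-model] {prompt}"


def subquadratic_stub(prompt: str) -> str:
    return f"[subquadratic-model] {prompt}"


router = Router(
    backends={"vendor_a": quadratic_stub, "vendor_b": subquadratic_stub},
    templates={"summarize": "Summarize for {audience}: {text}"},
    routes={"summarize": "vendor_a"},
)
print(router.call("summarize", audience="ops", text="ticket log"))

# Switching substrate is a one-line routing change; callers are untouched.
router.routes["summarize"] = "vendor_b"
print(router.call("summarize", audience="ops", text="ticket log"))
```

Callers name a task and supply fields. They never name a vendor, so the prompt templates and the routing table stay on the buyer's side of the boundary.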

Third, they are negotiating IP ownership of prompts, evaluation sets, agent definitions, and tool schemas before any forward-deployed engineer crosses their threshold, not after. The standard contract language, written by lawyers paid by the vendor, hands the artifacts to the vendor by default. The negotiated language can leave the artifacts with the buyer if the buyer asks early enough.

Fourth, they are commissioning shadow architecture reviews from advisors who do not take money from the labs. Someone has to be in the room whose job is to ask whether the system the vendor's engineer just designed survives a substrate change. That question cannot be asked by the engineer who designed it. The buyer needs an independent voice in the design loop from week one.

The window

Subquadratic is a Miami startup with a small team and a $29 million seed. They are not going to displace OpenAI or Anthropic this year, and probably not next year. Even if SubQ 1M-Preview turns out to be exactly what they claim, the path from a preview model with disputed benchmarks to a production-grade vendor with reliability commitments and enterprise contracts takes time.

That is not the point.

The point is that the announcement, the funding, the press cycle, and the technical debate are the first visible crack in a substrate that has been treated as fixed since 2017. Even if Subquadratic fails, the next subquadratic vendor is six months behind them, and that vendor will have learned from the credibility mistakes Subquadratic made. The frontier labs know this. Look at how quickly Anthropic and OpenAI have published their own sparse-attention work. They are signaling to investors that the moat is not just the model; it is the integration. They are right that the moat is not the model. They are wrong that the integration survives a substrate change, unless the integration is designed to.

That design is what an architecture practice does. The lab's forward-deployed engineer cannot do it, because the lab's interest is to make the integration depend on the lab's specific substrate. The customer's in-house team can do it, but only if the team is staffed for architectural work rather than feature delivery, which most teams are not. The right shape is an independent partner who specifies the architecture, picks the abstractions, owns the substrate-agnostic design, and supervises the deployment work whoever performs it.

Agor AI Advisory exists to do that work. We do not take fees from the labs. We do not co-invest with the deployment companies. We do not retain economic interest in your substrate choice. Our job is to make sure the system you build in 2026 still serves you in 2030, regardless of whether Subquadratic ships, regardless of whether the next non-transformer architecture comes from Princeton or Hangzhou or somewhere nobody has heard of yet, regardless of which lab announces what next month.

The buyer who treats SubQ as a curiosity and goes back to negotiating with OpenAI's deployment engineer is making a bet on quadratic attention. The buyer who treats SubQ as a signal and adjusts the architecture to be substrate-agnostic is buying optionality on every possible future. The first bet has a known floor on what it will cost. The second bet has a known ceiling on what it will cost.

The substrate was always a choice. It is becoming visible as a choice. The companies that notice that fact this quarter will spend the next decade on the right side of the curve.

Sources

  • [Subquadratic, Introducing SubQ, May 5, 2026](https://subq.ai/introducing-subq)
  • [VentureBeat, May 6, 2026](https://venturebeat.com/technology/miami-startup-subquadratic-claims-1-000x-ai-efficiency-gain-with-subq-model-researchers-demand-independent-proof)
  • [LessWrong, Debunking claims about subquadratic attention](https://www.lesswrong.com/posts/kpSXeMcthtHgnwMx3/debunking-claims-about-subquadratic-attention)
  • [Refresh Miami, Subquadratic $29M seed](https://refreshmiami.com/news/subquadratic-raised-29m-on-the-idea-that-it-has-cracked-ais-biggest-math-problem-now-comes-the-hard-part/)
  • [eWeek, Subquadratic Launches SubQ](https://www.eweek.com/news/subquadratic-subq-12m-token-llm-neuron/)
  • [DataCamp, SubQ AI Explained](https://www.datacamp.com/blog/subq-ai-explained)
  • [WhatLLM, New AI Models May 2026](https://whatllm.org/blog/new-ai-models-may-2026)