The current state-of-the-art Transformer architecture includes self-attention layers, whose computational complexity is quadratic in sequence length. This makes training and inference at longer sequence lengths increasingly expensive and eventually infeasible, and it limits the architecture's capabilities.
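As a rough illustration of where the quadratic cost comes from, here is a minimal NumPy sketch of scaled dot-product attention; the n × n score matrix is what grows quadratically with sequence length. The function name and shapes are illustrative only, not tied to any particular implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal scaled dot-product attention sketch.

    Q, K, V: arrays of shape (n, d), where n is the sequence length
    and d is the head dimension.
    """
    d = Q.shape[-1]
    # The score matrix has shape (n, n): this is the source of the
    # quadratic time and memory cost in sequence length.
    scores = Q @ K.T / np.sqrt(d)
    # Numerically stable softmax over the last axis.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # shape (n, d)

# Doubling n roughly quadruples the work spent on the (n, n) score matrix.
n, d = 1024, 64
Q, K, V = (np.random.randn(n, d) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
```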
There has been substantial research into sub-quadratic attention operators ever since the Transformer model was introduced, but so far none has proven to be a full replacement for self-attention, usually because of reduced practical performance or even theoretical limits on capacity.
For this market, my definition of "satisfactory" is a sub-quadratic attention operator that matches full self-attention's performance closely enough that it gains widespread traction and starts being used in research papers not specifically focused on that operator. For example, I would consider RoPE and ALiBi (two positional embedding schemes, not attention operators) to have reached this stage.
Will a satisfactory sub-quadratic attention operator be found before 2026?