Will a satisfactory sub-quadratic attention operator be found to replace quadratic self-attention layers by 2026?

The current state-of-the-art Transformer architecture includes self-attention layers whose computational complexity is quadratic in sequence length. This makes training and inference at long sequence lengths infeasible and limits the architecture's capabilities.
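As a rough illustration of where the quadratic cost comes from (a toy NumPy sketch of single-head attention, not any particular library's implementation):

```python
import numpy as np

def naive_self_attention(q, k, v):
    # q, k, v: (n, d) arrays for a single head
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                    # (n, n) -- the quadratic term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v                               # (n, d)

n, d = 4096, 64
q, k, v = (np.random.randn(n, d) for _ in range(3))
out = naive_self_attention(q, k, v)  # materializes a 4096 x 4096 weight matrix
```

Both compute and memory scale with that (n, n) score matrix, which is what a sub-quadratic operator would have to avoid.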

There has been substantial research into sub-quadratic attention operators ever since the Transformer model was introduced, but so far none have proven to be a full replacement for self-attention, usually because of reduced practical performance or even theoretical limits on their expressive capacity.
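One well-known family of proposals is kernelized "linear attention". Here is a minimal non-causal sketch of the idea (my own illustration under the usual positive-feature-map assumption, not a faithful reproduction of any specific paper's method):

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1: a positive feature map used in some linear-attention work
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(q, k, v):
    # Associativity lets us compute phi(Q) @ (phi(K).T @ V) instead of
    # (phi(Q) @ phi(K).T) @ V, avoiding the (n, n) matrix entirely:
    # cost is O(n * d^2) rather than O(n^2 * d).
    qf, kf = feature_map(q), feature_map(k)   # (n, d)
    kv = kf.T @ v                             # (d, d)
    z = qf @ kf.sum(axis=0)                   # (n,) normalizer
    return (qf @ kv) / z[:, None]             # (n, d)
```

Operators like this are sub-quadratic by construction; the open question this market asks is whether any of them (or a successor) matches full self-attention closely enough in practice.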

For this market, my definition of "satisfactory" is a sub-quadratic attention operator that matches full self-attention's performance closely enough that it gains widespread traction and starts being used in research papers not specifically focused on that operator. For example, I would consider RoPE and ALiBi (two positional embedding schemes, not attention operators) to have reached this stage.

Will a satisfactory sub-quadratic attention operator be found before 2026?


I think I'd count Mamba's S6 layers for this, if they catch on and start being used in papers where Mamba's not the main novelty.

This might be controversial, because they're not really attention operators in the way Linear Attention and similar proposals are, since they require keeping track of a hidden state. On the other hand, S6 or even a whole Mamba block can be used as a layer in an otherwise standard Transformer, which is why I'm leaning towards counting it. Any thoughts?
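To make the distinction concrete, here is a toy diagonal recurrence in the spirit of a selective state-space scan (a heavily simplified illustration of the O(n) hidden-state idea, not Mamba's actual S6 kernel or parameterization):

```python
import numpy as np

def toy_selective_scan(x, A, B_proj, C_proj):
    # x: (n, d) inputs; A: (d_state,) per-state decay;
    # B_proj, C_proj: (d, d_state) projections (hypothetical names).
    n, d = x.shape
    d_state = A.shape[0]
    h = np.zeros((d, d_state))        # hidden state carried along the sequence
    y = np.zeros_like(x)
    for t in range(n):
        B_t = x[t] @ B_proj           # input-dependent ("selective") parameters
        C_t = x[t] @ C_proj
        h = h * A + np.outer(x[t], B_t)   # state update: no pairwise token scores
        y[t] = h @ C_t                    # read out from the state
    return y
```

The cost is linear in sequence length, but every output is computed from the running state rather than from explicit token-to-token attention weights, which is the crux of whether it counts as an "attention operator" for this market.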
