Superposition is a hypothesized mechanism for polysemanticity. It is a major bottleneck for interpretability. There are groups working on reducing it, most notably Chris Olah's group at Anthropic. However, it is possible that reducing superposition is hard, or that superposition is not an accurate model of polysemanticity.
The following would qualify for a YES resolution:
- A modified transformer architecture that, when trained, has at most 50% of the superposition of an iso-performance regular transformer
- A method for reading out features in superposition from a regular or modified transformer that recovers at least 50% of the features in superposition
The following would qualify for a (pre-2026) NO resolution:
- Only a small fraction (<50%) of features can be recovered
- Superposition is shown conclusively to be an invalid model of polysemanticity
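To make the "at least 50% of features" criterion concrete, here is a minimal sketch of how a recovery fraction could be scored in a toy setting where the ground-truth feature directions are known. Everything here is an assumption for illustration and not part of the resolution criteria: the function name `fraction_recovered`, the cosine-similarity threshold of 0.9, the choice to ignore sign, and the toy vectors. In a real transformer the ground-truth features are not known, which is exactly what makes this hard to quantify.

```python
import math

def cosine(u, v):
    """Cosine similarity between two nonzero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def fraction_recovered(true_features, learned_directions, threshold=0.9):
    """Fraction of true features matched by at least one learned direction.

    A feature counts as recovered if some learned direction has absolute
    cosine similarity >= threshold with it. The threshold and the
    sign-agnostic match are illustrative assumptions.
    """
    recovered = 0
    for feature in true_features:
        best = max(
            (abs(cosine(feature, d)) for d in learned_directions),
            default=0.0,
        )
        if best >= threshold:
            recovered += 1
    return recovered / len(true_features)

# Hypothetical toy data: 4 true features in 3 dimensions (so they must overlap),
# and 3 directions produced by some readout method.
true_features = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]
learned_directions = [
    [0.99, 0.05, 0.0],   # close to the first feature
    [0.0, 0.0, -2.0],    # third feature, flipped sign and different scale
    [0.5, 0.5, 0.7],     # a blend that matches nothing well
]

score = fraction_recovered(true_features, learned_directions)
print(f"fraction of features recovered: {score:.2f}")
```

In this toy case two of the four features are matched, so the score sits right at the 50% line.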
In the event that it is unclear how many features are actually in superposition (there could hypothetically be an absurd number of near-orthogonal vectors), preliminary (not necessarily conclusive) evidence that the remaining possible directions are not relevant is sufficient to exclude them from consideration.
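As a rough illustration of the "absurd number of near-orthogonal vectors" point, the sketch below samples more random unit vectors than there are dimensions and measures how much they overlap. The dimension (128), the number of vectors (256), and the use of random Gaussian directions are all assumptions chosen to keep the example fast. It says nothing about which directions a trained model actually uses.

```python
import math
import random

def random_unit_vector(dim, rng):
    """Sample a direction uniformly at random by normalizing a Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def overlap_stats(vectors):
    """Return (mean, max) of |cosine similarity| over all distinct pairs.

    Inputs are assumed to be unit vectors, so the dot product is the cosine.
    """
    total = 0.0
    worst = 0.0
    pairs = 0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            dot = abs(sum(a * b for a, b in zip(vectors[i], vectors[j])))
            total += dot
            worst = max(worst, dot)
            pairs += 1
    return total / pairs, worst

rng = random.Random(0)   # fixed seed so the run is repeatable
dim = 128                # illustrative dimension, far smaller than a real model
n_vectors = 2 * dim      # twice as many directions as dimensions

vectors = [random_unit_vector(dim, rng) for _ in range(n_vectors)]
mean_overlap, max_overlap = overlap_stats(vectors)
print(f"{n_vectors} vectors in {dim} dims: "
      f"mean |cos| = {mean_overlap:.3f}, max |cos| = {max_overlap:.3f}")
```

Typical overlap between random directions shrinks roughly like one over the square root of the dimension, so the room for nearly orthogonal directions grows quickly with width. That is why the count of candidate directions alone cannot settle how many features there are.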
As a clarification: the method must also demonstrably meet the 50% criterion on transformers of nontrivial size (GPT-2 as a lower bound), and it should appear plausible that it will scale to frontier transformers (for example, a scaling law demonstrating continued improvement would satisfy this condition). So a one-layer transformer will not qualify. I think this is the most natural interpretation of the title: "superposition in transformers" implies transformers in some degree of generality.
@LeoGao Also, additional clarification: >50% variance explained by an autoencoder will not qualify for the >50% of features requirement
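One way to see why variance explained and fraction of features can come apart is the toy calculation below. It is a sketch under loud assumptions, none of which come from the market description: feature contributions to variance are independent and additive, they follow a hypothetical 1/rank distribution, and the autoencoder recovers exactly the highest-variance features.

```python
def variance_explained(feature_variances, recovered_count):
    """Share of total variance carried by the top `recovered_count` features.

    Assumes variances add up, i.e. features contribute independently.
    """
    ranked = sorted(feature_variances, reverse=True)
    return sum(ranked[:recovered_count]) / sum(ranked)

# Hypothetical heavy-tailed setup: 1000 features, variance proportional to 1/rank.
n_features = 1000
variances = [1.0 / rank for rank in range(1, n_features + 1)]

recovered = 50  # suppose only the 50 highest-variance features are recovered
frac_features = recovered / n_features
frac_variance = variance_explained(variances, recovered)

print(f"features recovered: {frac_features:.0%}")
print(f"variance explained: {frac_variance:.0%}")
```

Under these assumptions, recovering 5% of the features already explains more than half of the variance, so a variance number above 50% could sit alongside a feature count far below 50%.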
Is this for any transformer? How does this resolve if we have an expensive technique that has been validated on small transformers but hasn't been successfully applied to very large transformers?
Cases I'm interested in:
- It satisfies the market criteria for at least one small transformer and it's reasonable to think the best technique in 2026 would work on large transformers if we had really good hardware we currently don't have
- It satisfies the criteria with a small transformer and it's reasonable to think it would work on large transformers, but it would be expensive and no one's tried it yet.
- It satisfies the criteria with a small transformer and preliminary results for larger transformers are mixed/don't satisfy criteria of market.
where small is something between ~8M-1B parameters
@BartholomewHughes I didn't think carefully about the actual probability; I think I'm not trying to be a very good predictor on this market fwiw. I've been doing some superposition stuff with some promising early results and attending to public stuff. My main story for this resolving Yes is that Anthropic succeeds. I think trading against me isn't unreasonable. Part of what's going on here for me is: just enjoying the feeling of being bullish and (?) incentive to do a good job (seems silly but that's what it's actually like for me)
Interesting question! It honestly wouldn't surprise me if SoLU has at most 50% of the superposition of a normal model, though it's really hard to quantify. My guess is that removing superposition is impossible, but that being able to recover many features is doable-ish, though 50% is a high bar. My best guess for this breaking is just that we never figure out how to quantify the number of features.
@NoaNabeshima If it explains more than half the features or variance or something then I'd resolve yes