If LMs store info as features in superposition, are there >300K features in GPT-2 small L7? (see desc)

If in 2040 I am convinced at >80% confidence that LMs mainly store info in their residual stream as something like sparse linear features, and I am >80% confident in a particular approximate number of features in the residual stream before layer 7, the market resolves Yes if that number is greater than 300K. If that number is less than 300K, it resolves No. Otherwise it resolves N/A.

I won't bet in this market.


@NoaNabeshima Interesting market, but for those of us who are uninitiated in the minutiae of LLMs, can you please give a quick sketch of what kind of system this "GPT-2 small L7" is? Is it a small model explicitly for research purposes? How many parameters? Anything noteworthy about its architecture? Relatedly, on what are you anchoring the 300K boundary of this question?

(I know the info is probably scattered somewhere around the links you posted, but hopefully you know this off the top of your head)

@VitorBosshard GPT-2 small is the smallest model in the GPT-2 family of models
https://github.com/openai/gpt-2

L7 is short for Layer 7. There are.. 12? layers in GPT-2 small. It has ~117M parameters.

Where does 300K come from:
So there are ~50K distinct tokens. There could be lots of features in there, but probably not much more than ~130K and probably not much less than ~10K. Let's say 30K.

Each per-token MLP goes from 768 dimensions to 3072 = 768*4 and back to 768. Imagine that there's some linear "amount of feature write-outs" the MLPs can do that's proportional to 3072, say a factor of 5. Then each MLP has the capacity to write out ~15K features.

Then there would be ~120K features before the 7th layer = 6 layers * 15K features written out per layer + 30K embedding features.
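A minimal sketch of that arithmetic, just restating the numbers above (the write-out factor of 5 is the assumption being questioned):

```python
# Numbers from the estimate above; the write-out factor of 5 is the assumption.
d_model = 768                  # residual stream width of GPT-2 small
d_mlp = 4 * d_model            # 3072 neurons per MLP
writeout_factor = 5            # assumed features each MLP can write out per neuron
embedding_features = 30_000    # rough guess for features present at the embedding
layers_before_7 = 6            # layers writing into the residual stream before layer 7

features_per_mlp = writeout_factor * d_mlp                       # ~15K
total = embedding_features + layers_before_7 * features_per_mlp
print(total)                   # ~122K, i.e. the "~120K features before layer 7"
```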

If the structure of this Fermi estimate is right, then superposition feels more tractable, especially the part where the number of features written out by an MLP is linear in its number of neurons with a small-ish factor.

It's possible that MLPs are doing crazy amounts of superposition. If they are, then the number of features in layer 7 could be way more than 300K and solving superposition for larger models feels less tractable.

A hypothesis I'm holding is that the Fermi estimate is at least a bit right: there's some natural upper bound on the number of features a layer can meaningfully write out that's <15x the number of neurons, which makes superposition more tractable to solve.
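For scale, the same arithmetic at that 15x upper end (illustrative only; the comment doesn't explicitly tie the 15x figure to the 300K threshold):

```python
# Same arithmetic with the write-out factor pushed to the 15x upper bound.
d_mlp = 3072
upper = 30_000 + 6 * 15 * d_mlp   # embedding features + 6 layers * 15 * 3072
print(upper)                       # ~306K, right around the market's 300K line
```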

I think this idea originally came from @LeoGao, but all mistakes are mine, not his, and I'm not sure he would endorse this comment.

@NoaNabeshima Please convince me that this hypothesis is wrong/right, Manifold! @Feanor

@NoaNabeshima Oh, and attention heads can also significantly increase the number of features.

For example, the copying part of induction heads might double the number of features stored in the embedding, because they copy token-identity information from each token to the next token.

@NoaNabeshima Yep, 12 layers, 85 million parameters (at least the transformer_lens one); see the model properties table.

So, we project from a vocab of 50257 to a residual stream dimension of 768; and from there, we project to 3072 dimensions, going back and forth between residual and neuron dimensions each layer.

If there were 5 features and 1 neuron, and the neuron had to somehow represent them, there'd be a lot of interference, and that's superposition. Here we have, say, 30K features (not sure about the number) and have to project down to a space that's very small compared to that (768 dimensions first, and ~3K next). It's also possible to organize features into patterns/shapes so that a single shape represents multiple features at the same time; these shapes are what's meant by feature geometry.
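A small numerical sketch of that packing intuition, with illustrative sizes (30K random directions in 768 dimensions; nothing here is measured from GPT-2):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_features = 768, 30_000
# Random unit vectors standing in for feature directions in the residual stream.
F = rng.standard_normal((n_features, d))
F /= np.linalg.norm(F, axis=1, keepdims=True)

# Interference = cosine similarity between pairs of feature directions.
sample = F[:2_000]
cos = sample @ sample.T
np.fill_diagonal(cos, 0.0)
print(np.abs(cos).max())   # typically ~0.15-0.2: nonzero interference, but far from 1
```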

In addition to this, we also have attention heads of dimension... just 64 (!), which move information across features, find highly efficient ways to learn those features in various shapes and forms, and then project those learnings back down to the residual stream.

All of these add to the possible number of features that could be stored.

edit: how's the 300K estimate done, though?

@NoaNabeshima Ok, I think the essence of your claim is that features are roughly linear in model size (not necessarily raw parameter count, but perhaps some other relevant metric), and that 300K is a reasonable number to operationalize this intuition.

In terms of what "superposition" means, Bloom filters come to mind. They store information in overlapping ways, at the cost of a certain error rate. This is useful because the error rate scales better than 1:1. So the question is whether LLMs are doing something along these lines implicitly, and whether the efficiency at which they do it is more than linear.
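For concreteness, a toy Bloom filter (a sketch of the analogy only, not anything measured from an LLM): k hash functions set bits in one shared array, so items are stored in overlapping positions with a tunable false-positive rate rather than one slot per item.

```python
import hashlib

class BloomFilter:
    def __init__(self, m_bits: int = 1 << 16, k_hashes: int = 4):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray(m_bits)   # shared bit array all items write into

    def _positions(self, item: str):
        # k independent hash positions for this item
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item: str):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item: str):
        return all(self.bits[p] for p in self._positions(item))

bf = BloomFilter()
bf.add("feature_42")
print("feature_42" in bf)   # True
print("feature_99" in bf)   # almost surely False (small false-positive rate)
```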
