Do Anthropic's training updates make SAE features as interpretable?
2
15
Ṁ20Ṁ120
Dec 31
50%
chance
1D
1W
1M
ALL
If training SAEs with Anthropic's new training techniques [1] gives equally interpretable features, at the majority of sites in Pythia-2.8B / Gemma-7B / Mistral-7B (whichever actually gets benchmarked), this resolves yes.
Complete methodology for evaluating this question: https://arxiv.org/abs/2404.16014. Resolves yes if anyone ever does this and gets p=0.1 significant results (we found p=0.05 actually quite hard to get without a lot of samples).
I haven't implemented [1] yet so have no insider information, and also I will not trade in this market besides an initial bet.
[1]: https://transformer-circuits.pub/2024/april-update/index.html#training-saes
Get Ṁ600 play money
More related questions
Related questions
By the end of 2026, will we have transparency into any useful internal pattern within a Large Language Model whose semantics would have been unfamiliar to AI and cognitive science in 2006?
48% chance
Will a model costing >$30M be intentionally trained to be more mechanistically interpretable by end of 2027? (see desc)
57% chance
Will Anthropic's April SAE Training Updates stack with Gated SAEs?
64% chance
Are Gated SAEs better than Anthropic's training updates?
63% chance
Will it be possible to disentangle most of the features learned by a model comparable to GPT-3 this decade? (1k subsidy)
55% chance
What will the Anthropic SAE paper contain?
SoAI 23 3/10: Will Self-improving Al agents crush SOTA in a complex environment (e.g. AAA game, tool use, science)?
29% chance
Will it be possible to disentangle most of the features learned by a model comparable to GPT-4 this decade?
39% chance
Are Mixture of Expert (MoE) transformer models generally more human interpretable than dense transformers?
50% chance
In a year from today, will I have a satisfactory framework for describing the epistemology of AI alignment?
38% chance