Do Anthropic's training updates make SAE features as interpretable?
Ṁ20 · Dec 31
50% chance
This resolves yes if training SAEs with Anthropic's new training techniques [1] gives equally interpretable features at the majority of sites in Pythia-2.8B / Gemma-7B / Mistral-7B (whichever actually gets benchmarked).
The complete methodology for evaluating this question is at https://arxiv.org/abs/2404.16014. This resolves yes if anyone ever does this and gets results significant at p=0.1 (we found p=0.05 actually quite hard to get without a lot of samples).
I haven't implemented [1] yet, so I have no insider information, and I will not trade in this market beyond an initial bet.
[1]: https://transformer-circuits.pub/2024/april-update/index.html#training-saes
This question is managed and resolved by Manifold.
Related questions
Will Anthropic's April SAE Training Updates stack with Gated SAEs?
64% chance
Are Gated SAEs better than Anthropic's training updates?
63% chance
Will Anthropic open-source the training code of their SAE interpretability effort?
Will Anthropic release a model that thinks before it responds like o1 from OpenAI by EOY 2024?
18% chance
Will a model costing >$30M be intentionally trained to be more mechanistically interpretable by end of 2027? (see desc)
57% chance
Will Anthropic announce one of their AI systems is ASL-3 before the end of 2025?
59% chance
When will Anthropic first train an AI system that they claim qualifies as ASL-3?
Is the level of autism/asperger's higher in Anthropic than OpenAI?
52% chance
Will Anthropic automate AI research in 2024?
5% chance
How difficult will Anthropic say the AI alignment problem is?