Do Anthropic's training updates make SAE features as interpretable?
Resolved Jan 6 as 80%

If training SAEs with Anthropic's new training techniques [1] yields equally interpretable features at the majority of sites in Pythia-2.8B / Gemma-7B / Mistral-7B (whichever actually gets benchmarked), this resolves yes.
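
For context, here is a minimal sketch of the training change [1] describes, as I understand it: the L1 sparsity penalty on feature activations is weighted by the corresponding decoder column norms, and the decoder norms are left unconstrained. The class structure, names, and dimensions below are illustrative assumptions, not taken from [1].

```python
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    # Illustrative single-layer SAE: linear encoder, ReLU, linear decoder.
    # Bias handling here follows common SAE setups, not necessarily [1].
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor):
        f = torch.relu(x @ self.W_enc + self.b_enc)   # feature activations
        x_hat = f @ self.W_dec + self.b_dec           # reconstruction
        return x_hat, f


def sae_loss(x, x_hat, f, W_dec, lam: float):
    # Reconstruction MSE plus a sparsity term. Per my reading of [1],
    # each feature's activation is scaled by its decoder row norm, and
    # the decoder rows are NOT constrained to unit norm.
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()
    sparsity = (f * W_dec.norm(dim=-1)).sum(dim=-1).mean()
    # [1] also describes warming lam up over early training; omitted here.
    return recon + lam * sparsity
```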

The complete methodology for evaluating this question is in https://arxiv.org/abs/2404.16014. Resolves yes if anyone ever does this and gets results significant at p = 0.1 (we found p = 0.05 actually quite hard to reach without a lot of samples).
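
For concreteness, the significance check could look something like the sketch below: a rank test on per-feature interpretability ratings for the two SAE variants at a given site, checked at p = 0.1. The choice of test (Mann-Whitney U), the rating scale, and all the numbers are placeholders I am assuming, not details taken from the paper.

```python
from scipy.stats import mannwhitneyu


def compare_interpretability(baseline_scores, new_scores):
    # Two-sided rank test on per-feature interpretability ratings from
    # the baseline-trained and new-technique SAEs at one site.
    stat, p = mannwhitneyu(new_scores, baseline_scores,
                           alternative="two-sided")
    return p


# Hypothetical 1-5 ratings from blinded raters, one per sampled feature:
baseline = [4, 3, 5, 4, 2, 4, 3, 5, 4, 3]
new      = [4, 4, 5, 3, 3, 4, 4, 5, 4, 4]
p = compare_interpretability(baseline, new)
print(f"p = {p:.3f}; significant at p = 0.1: {p < 0.1}")
```

With few sampled features per site, tests like this are underpowered, which is consistent with the note above that p = 0.05 was hard to reach without a lot of samples.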

I haven't implemented [1] yet, so I have no insider information. I also will not trade in this market beyond an initial bet.

[1]: https://transformer-circuits.pub/2024/april-update/index.html#training-saes

RESOLUTION:

No direct calculation, but https://transformer-circuits.pub/2024/june-update/index.html suggests yes: the April updates were applied to some SAEs (e.g. Gated SAEs), and those are among the most interpretable SAEs. https://arxiv.org/abs/2407.14435 finds similar results.
