Do Anthropic's training updates make SAE features as interpretable?

If training SAEs with Anthropic's new training techniques [1] yields features that are equally interpretable at the majority of sites in Pythia-2.8B / Gemma-7B / Mistral-7B (whichever actually gets benchmarked), this resolves YES.
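
For concreteness, here is a minimal sketch of the object under comparison: a vanilla sparse autoencoder trained on activations from one "site" (a layer / stream position) of a language model. This is a generic SAE, not Anthropic's updated recipe from [1]; the market is about whether features from that recipe match the baseline's interpretability. The dimensions and the L1 coefficient are illustrative.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Vanilla SAE: ReLU encoder + linear decoder over model activations."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        # The post-ReLU encoder activations are the "features" whose
        # interpretability the evaluation scores.
        f = torch.relu(self.encoder(x))
        x_hat = self.decoder(f)
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 sparsity penalty on the features.
    recon = ((x - x_hat) ** 2).sum(-1).mean()
    sparsity = f.abs().sum(-1).mean()
    return recon + l1_coeff * sparsity
```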

The complete methodology for evaluating this question is described in https://arxiv.org/abs/2404.16014. Resolves YES if anyone ever runs this evaluation and obtains results significant at p = 0.1 (we found p = 0.05 quite hard to reach without a lot of samples).
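
One possible operationalization of the resolution check, sketched below. Everything here is my assumption, not the paper's exact procedure: that each feature receives a numeric human interpretability rating, that the two SAE variants are compared per site with a rank test (Mann-Whitney U via a location-shift non-inferiority trick), and that the non-inferiority margin is 0.5 rating points.

```python
from scipy.stats import mannwhitneyu

def site_noninferior(old_ratings: list[float],
                     new_ratings: list[float],
                     margin: float = 0.5,
                     alpha: float = 0.1) -> bool:
    # Non-inferiority: reject H0 "new features are rated worse than old
    # by at least `margin`". Shifting the old ratings down by the margin
    # and testing new > shifted is the standard location-shift trick
    # for rank tests.
    shifted_old = [r - margin for r in old_ratings]
    _, p = mannwhitneyu(new_ratings, shifted_old, alternative="greater")
    return p < alpha

def resolves_yes(per_site_results: list[bool]) -> bool:
    # "Majority of sites": strictly more than half must pass.
    return sum(per_site_results) > len(per_site_results) / 2
```

Needing p < 0.1 per site under a margin like this is consistent with the note above that p = 0.05 is hard to hit without many rated samples.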

I haven't implemented [1] yet, so I have no insider information; I also will not trade in this market beyond an initial bet.

[1]: https://transformer-circuits.pub/2024/april-update/index.html#training-saes
