Do Anthropic's training updates make SAE features equally interpretable?
Closes Dec 31 · 50% chance

If training SAEs with Anthropic's new training techniques [1] yields features that are as interpretable as those from standard SAE training, at a majority of sites in Pythia-2.8B / Gemma-7B / Mistral-7B (whichever actually gets benchmarked), this market resolves YES. A rough sketch of the updated training objective follows.
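As I read [1], the main change is that the L1 sparsity penalty on feature activations is weighted by the L2 norms of the corresponding decoder columns, with the unit-norm constraint on decoder columns dropped. Here is a minimal sketch of that objective; the architecture sizes, hyperparameters, and training step are my own illustrative assumptions, not the configuration that would actually get benchmarked.

```python
# Minimal SAE sketch with the decoder-norm-weighted L1 penalty I take [1]
# to describe. Everything numeric here is an illustrative assumption.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))   # feature activations
        x_hat = self.decoder(f)           # reconstruction
        return x_hat, f


def loss_fn(x, x_hat, f, decoder_weight, l1_coeff: float = 5.0):
    # Reconstruction term: mean squared error over the batch.
    mse = (x - x_hat).pow(2).sum(dim=-1).mean()
    # Sparsity term: |f_i| scaled by the L2 norm of decoder column i,
    # so shrinking decoder columns cannot cheat the penalty.
    col_norms = decoder_weight.norm(dim=0)            # (d_hidden,)
    sparsity = (f.abs() * col_norms).sum(dim=-1).mean()
    return mse + l1_coeff * sparsity


# Illustrative single step on random data standing in for residual-stream
# activations from one of the candidate models.
sae = SparseAutoencoder(d_model=512, d_hidden=8192)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
x = torch.randn(64, 512)
x_hat, f = sae(x)
loss = loss_fn(x, x_hat, f, sae.decoder.weight)
opt.zero_grad()
loss.backward()
opt.step()
```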

The complete methodology for evaluating this question is given in https://arxiv.org/abs/2404.16014. Resolves YES if anyone ever runs this evaluation and obtains results significant at p = 0.1 (we found p = 0.05 quite hard to reach without a large number of samples).
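To make the resolution check concrete, here is one plausible operationalization, not the exact protocol of the paper above: for each site, compare interpretability ratings of sampled features from the baseline SAE and the updated-training SAE, and resolve YES if the updated SAE is not detectably worse at a majority of sites at alpha = 0.1. The site names, synthetic scores, and the choice of a one-sided Mann-Whitney U test are all my own assumptions.

```python
# Hedged sketch of a per-site significance check; data and test choice are
# illustrative stand-ins, not the benchmark's actual protocol.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
sites = ["layer2.resid", "layer12.resid", "layer22.resid"]  # hypothetical
alpha = 0.1

not_worse = 0
for site in sites:
    # Stand-ins for per-feature interpretability ratings at this site.
    baseline = rng.normal(3.0, 1.0, size=200)
    updated = rng.normal(3.1, 1.0, size=200)
    # One-sided test: are the updated SAE's ratings stochastically lower?
    stat, p = mannwhitneyu(updated, baseline, alternative="less")
    worse = p < alpha
    print(f"{site}: p={p:.3f} -> {'worse' if worse else 'not detectably worse'}")
    if not worse:
        not_worse += 1

print("Resolves YES" if not_worse > len(sites) / 2 else "Resolves NO")
```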

I haven't implemented [1] yet, so I have no insider information, and I will not trade in this market beyond an initial bet.

[1]: https://transformer-circuits.pub/2024/april-update/index.html#training-saes
