He made 10 speculations on Anthropic's paper and what it might contain in advance (paper is here: https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html).
The first three clearly happened. The others didn't.
Will we have evidence that at least 8 of these 10 have been publicly accomplished within a year, for at least a model at least roughly as good as Claude Sonnet (e.g. at most 20 points lower than Sonnet on Arena)?
1. 99%: eye-test experiments -- I think the report will include experiments that will involve having humans look at what inputs activate SAE neurons and see if they subjectively seem coherent and interpretable to a human.
2. 95%: streetlight edits -- I think that the report will have some experiments that involve cherry-picking some SAE neurons that seem interpretable and then testing the hypothesis by artificially up/down weighting the neuron during runtime.
3. 80%: some cherry-picked proof of concept for a useful *type* of task -- I think it would be possible to show using current SAE methods that some interesting type of diagnostics/debugging can be done. Recently, Marks et al. did something like this by removing unintended signals from a classifier without disambiguating labels.
[...All of the above are things that have happened in the mechanistic interpretability literature before, so I expect them. However, none of the above would show that SAEs could be useful for practical applications *in a way that is competitive with other techniques*. I think that the report is less likely to demonstrate this kind of thing...]
4. 20%: Doing PEFT by training sparse weights and biases for SAE embeddings in a way that beats baselines like LORA -- I think this makes sense to try, and might be a good practical use of SAEs. But I wouldn't be surprised if this simply doesn't beat other PEFT baselines like LORA. It also wouldn't be interp -- it would just be PEFT.
5. 20%: Passive scoping -- I think that it would potentially be possible and cool to see that models with their SAEs perform poorly on OOD examples. This could be useful. If a model might have unforeseen harmful capabilities (e.g. giving bomb-making instructions when jailbroken) that it did not exhibit during finetuning when the SAE was trained, it would be really cool if that model just simply didn't have those capabilities when the SAE was active. I'd be interested if this could be used to get rid of a sleeper agent. But notably, this type of experiment wouldn't be actual interp. And for this to be useful, an SAE approach would have to be shown to beat a dense autoencoder and model distillation.
6. 25%: Finding and manually fixing a harmful behavior that WAS represented in the SAE training data -- maybe they could finetune SAEs using lots of web data and look for evidence of bad things. Then, they could isolate and ablate the SAE neurons that correspond to them. This seems possible, and it would be a win. But in order to be useful this would need to be shown to be competitive with some type of data-screening method. I don't think it would be.
7. 5%: Finding and manually fixing a novel bug in the model that WASN'T represented in the SAE training data -- I would be really impressed if this happened because I see no reason that it should. This would show that SAEs can allow for a generalizable understanding of the network. For example, if they were somehow able to find/fix a sleeper agent using an SAE that wasn't trained on any examples of defection, I would be impressed.
8. 15%: Using an SAE as a zero-shot anomaly detector: It might be possible to detect anomalies based on whether they have high reconstruction error. Anthropic might try this. It would be cool to show that certain model failures (e.g. jailbreaks) are somewhat anomalous this way. But in this kind of experiment, it would be important for the SAE beat a non-sparse autoencoder.
9. 10%: Latent adversarial training under perturbations to an SAE's embeddings -- I think someone should try this, and I think that Anthropic is interested in it, but I don't think they're working on it currently. (There's a chance I might try this in the future someday.)
10. 5%: experiments to do arbitrary manual model edits -- I don't think the report will have experiments that involve editing arbitrary behaviors in the model that weren't cherry-picked based on analysis of SAE neurons. For example, Anthropic could go to the MEMIT paper and try to replicate a simple random subsample of the edits that the MEMIT paper performed. I don't think they will do this, I don't think it would work well if they tried, and I feel confident that SAE's would be competitive with model editing / PEFT if they did do it.