Will Sparse Autoencoders be successfully used on a downstream task in the next year and beat baselines?
2025
65% chance

Resolution Criteria: This question will be resolved according to my personal judgement. I think about this as follows:

  • Downstream means the task must be one that I think is interesting/non-spurious (ideally one that non-interpretability people cared about beforehand), and where success is objectively measurable (but can be qualitative rather than quantitative). Being real-world useful is a significant plus, but not necessary. Ultimately this will be my judgement call.

    • You’re allowed to include rules constraining how the task can be solved if there’s a good argument for how this simulates a realistic situation. Eg with unlearning, you could forbid “just don’t train on the data with the bad concept”, to simulate the real world where we have imperfect data labels (but then, eg, excluding 50% of the bad data should be permissible)

    • Given a constraint like “within a certain compute budget”, I’ll count 10% of the training compute of the SAE as needing to come from that budget (to simulate it being amortised over many downstream use cases)

    • A task can be chosen post-hoc by the authors or by me. If I believe a paper provides enough evidence of victory on a post-hoc chosen downstream task, but doesn’t explicitly argue this in the paper, I’ll reach out to the authors and use my best judgement.

  • SAEs must have been compared to appropriate baselines in a fair fight (probing, steering vectors, prompting, finetuning, adversarial example generation (a la GCG) etc)

  • SAEs must beat the baselines, not just be competitive. I'll qualitatively judge "do I think the sample size was large enough that the effect was real rather than noise"

  • I must believe that SAEs were an important part of the solution (i.e. that the same solution wouldn't work without the SAE)

  • I'll allow other dictionary learning techniques that are not SAEs but try to find a sparse, interpretable decomposition of model activations

  • The work must be public (e.g. I won't resolve this on private results from my team or gossip from other labs). I'll allow missing details to be clarified in private communication, so long as the key result is public.

  • I'll qualitatively evaluate how cherry-picked/brittle the results seem. For example, if an SAE is great for steering if the desired concept has a corresponding latent, but only 10% of concepts thought of have corresponding latents, I'm not sure if I'd count that. But if one of those 10% of concepts was a big deal (eg refusal steering) I might count that
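
To make the compute-budget rule above concrete (10% of the SAE's training compute is charged to the task's budget), here is a minimal arithmetic sketch in Python. The function names and the FLOP figures in the usage note are hypothetical, chosen only for illustration; the 10% fraction is the one stated in the criteria.

```python
def sae_compute_charge(sae_training_flops: float,
                       amortised_fraction: float = 0.10) -> float:
    """Compute charged to the task budget for using an already-trained SAE.

    amortised_fraction = 0.10 is the figure from the resolution criteria;
    everything else here is an illustrative assumption.
    """
    return amortised_fraction * sae_training_flops


def fits_budget(budget_flops: float,
                sae_training_flops: float,
                method_flops: float,
                amortised_fraction: float = 0.10) -> bool:
    """True if the SAE-based method stays within the task's compute budget.

    method_flops is the compute the method spends on the task itself,
    excluding SAE training.
    """
    charge = sae_compute_charge(sae_training_flops, amortised_fraction)
    return charge + method_flops <= budget_flops
```

For example, with a hypothetical budget of 1e18 FLOPs and an SAE that cost 5e18 FLOPs to train, the SAE is charged 5e17 FLOPs, so `fits_budget(1e18, 5e18, 4e17)` holds while `fits_budget(1e18, 5e18, 6e17)` does not.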

This is very subjective, so I will not trade in this market myself.

For context, here are three of the best contenders so far for doing something useful on a downstream task with model internals, and why I don't think they count:

  • SHIFT in Sam Marks' Sparse Feature Circuits - An SAE circuit was found feeding into a probe that had picked up on a spurious correlation with gender, and ablating key gender tracking features reduced the spurious correlation. IMO this is the closest attempt so far, and they compare reasonably to a range of baselines, but I think it falls a bit short, as the task is a bit too contrived/spurious for my tastes. I think a similar thing in a more complex/useful setting could be sufficient though.

  • Golden Gate Claude - while very fun, they didn't compare to the baselines of a system prompt or steering vector (and I am unconvinced GGC would win). I would have considered a user study on eg "how fun is the system to use" where GGC won over those baselines to be successful resolution. It’s fine if “how fun is the system to use” was chosen after creating GGC, so long as it was chosen before the user study (no p-hacking!) - I want an existence proof of SAEs being useful.

  • Refusal is mediated by a single direction - a paper where we found a refusal steering vector and ablated it to jailbreak the model. In addition to not being on SAEs (and so instantly disqualified), we only found that it was competitive with fine-tuning, not better. It might have won on jailbreaking at a given compute budget (or at least, be outside the Pareto frontier of jailbreaking successes against compute budgets)
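
One way to read the Pareto-frontier remark above: a method counts if no baseline achieves at least the same jailbreak success rate at the same or lower compute. A minimal sketch of that check in Python follows. The `(compute, success_rate)` representation and the function names are assumptions for illustration, not anything taken from the paper.

```python
from typing import Iterable, Tuple

# A point is (compute_budget, success_rate); both are hypothetical units.
Point = Tuple[float, float]


def dominates(a: Point, b: Point) -> bool:
    """a dominates b if it uses no more compute, succeeds at least as
    often, and is strictly better on at least one of the two."""
    no_worse = a[0] <= b[0] and a[1] >= b[1]
    strictly_better = a[0] < b[0] or a[1] > b[1]
    return no_worse and strictly_better


def is_undominated(candidate: Point, baselines: Iterable[Point]) -> bool:
    """True if no baseline dominates the candidate, i.e. the candidate
    is not beaten on both axes by any existing method."""
    return not any(dominates(b, candidate) for b in baselines)
```

With made-up baselines at `(1.0, 0.3)` and `(10.0, 0.8)`, a candidate at `(2.0, 0.6)` is undominated, whereas one at `(12.0, 0.7)` is dominated by the second baseline.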



https://openreview.net/forum?id=FVItLat5ii&noteId=sCBMB7O3qw

apparently SAEs aren't necessarily improving upon simple K-means

@Jono3h Wild, I hadn't seen that paper, it's from Sept 2023!

I don't really update from it though - it was done super early, before we knew much about training good SAEs, it's on CNNs, I don't understand or trust their metrics (and neither do the reviewers as far as I can tell), and they don't really discuss how they train the SAEs and it's a bit of a side thing in the paper, so I'm not at all confident they did it competently enough to learn much from the results.

bought Ṁ25 YES at 65%

@NeelNanda Ah I assumed that SAEs on such small models are easy enough to train that that wouldn't be a worry.

I did like the idea of quantifying the quality of the features by applying similarity metrics to the top N maximally exciting images (MEIs) for those features.
Intuitively I agree with the authors that you would expect the MEIs for good features to be similar to each other and LPIPS, which the authors use, seems like a well-tested and human-friendly image similarity metric.
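
As I understand the idea, it could be sketched roughly like this in Python: take the top N images by activation for a feature and average a pairwise distance over them. This is my own illustration, not the authors' code. The function names are hypothetical, and `mse_distance` is a crude stand-in for LPIPS so that the snippet runs without extra dependencies.

```python
from itertools import combinations
from typing import Callable, Sequence

Image = Sequence[float]  # a flattened image; an assumption for simplicity


def mse_distance(a: Image, b: Image) -> float:
    """Placeholder distance (mean squared error). NOT LPIPS; a perceptual
    metric would be swapped in here in a real evaluation."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def mei_mean_distance(activations: Sequence[float],
                      images: Sequence[Image],
                      top_n: int,
                      distance: Callable[[Image, Image], float] = mse_distance
                      ) -> float:
    """Mean pairwise distance among the top_n images that most strongly
    activate one feature. Lower means the MEIs look more alike."""
    # Indices of the top_n images, ranked by this feature's activation.
    ranked = sorted(range(len(images)),
                    key=lambda i: activations[i], reverse=True)[:top_n]
    pairs = list(combinations(ranked, 2))
    if not pairs:
        raise ValueError("need at least two images to compare")
    return sum(distance(images[i], images[j]) for i, j in pairs) / len(pairs)
```

A feature whose MEIs get a low score under a good perceptual metric would be one whose top images look alike, which is the intuition above.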

I saw that Makelov, Lange and you have a suggestion for principled evaluations that afaiu involves comparing manually-made dictionaries with SAEs. But I didn't grasp the methodology within 30min.