
Resolution Criteria: This question will be resolved according to my personal judgement. I think about this as follows:
Downstream means the task must be one that I think is interesting/non-spurious (ideally one that non-interpretability people cared about beforehand), and where success is objectively measurable (but can be qualitative rather than quantitative). Being real-world useful is a significant plus, but not necessary. Ultimately this will be my judgement call.
You’re allowed to include rules constraining how the task can be solved if there’s a good argument for how this simulates a realistic situation. Eg with unlearning, you could forbid “just don’t train on the data with the bad concept”, to simulate the real world where we have imperfect data labels (but then, eg, excluding 50% of the bad data should be permissible)
Given a constraint like “within a certain compute budget”, I’ll count 10% of the training compute of the SAE as needing to come from that budget (to simulate it being amortised over many downstream use cases)
A task can be chosen post-hoc by the authors or by me. If I believe a paper provides enough evidence of victory on a post-hoc chosen downstream task, but doesn’t explicitly argue this in the paper, I’ll reach out to the authors and use my best judgement.
SAEs must have been compared to appropriate baselines in a fair fight (probing, steering vectors, prompting, finetuning, adversarial example generation (a la GCG) etc)
SAEs must beat the baselines, not just be competitive. I'll qualitatively judge "do I think the sample size was large enough that the effect was real rather than noise"
I must believe that SAEs were an important part of the solution (i.e. that the same solution wouldn't work without the SAE)
I'll allow other dictionary learning techniques that are not SAEs but try to find a sparse, interpretable decomposition of model activations
The work must be public (e.g. I won't resolve this on private results from my team or gossip from other labs). I'll allow missing details to be clarified in private communication, so long as the key result is public.
I'll qualitatively evaluate how cherry-picked/brittle the results seem. For example, if an SAE is great for steering if the desired concept has a corresponding latent, but only 10% of concepts thought of have corresponding latents, I'm not sure if I'd count that. But if one of those 10% of concepts was a big deal (eg refusal steering) I might count that
This is very subjective, so I will not trade in this market myself.
For context, here are three of the best contenders so far for doing something useful on a downstream task with model internals, and why I don't think they count:
SHIFT in Sam Marks' Sparse Feature Circuits - They found an SAE circuit feeding into a probe that had picked up on a spurious correlation with gender, and ablating the key gender-tracking features reduced the spurious correlation. IMO this is the closest attempt so far, and they compare reasonably to a range of baselines, but I think it falls a bit short, as the task is a bit too contrived/spurious for my tastes. I think a similar thing in a more complex/useful setting could be sufficient though.
Golden Gate Claude - while very fun, they didn't compare to the baselines of a system prompt or steering vector (and I am unconvinced GGC would win). I would have considered a user study on eg "how fun is the system to use" where GGC won over those baselines to be successful resolution. It’s fine if “how fun is the system to use” was chosen after creating GGC, so long as it was chosen before the user study (no p-hacking!) - I want an existence proof of SAEs being useful.
Refusal is mediated by a single direction - a paper where we found a refusal steering vector and ablated it to jailbreak the model. In addition to not using SAEs (and so being instantly disqualified), we only found that it was competitive with fine-tuning, not better. It might have won on jailbreaking at a given compute budget (or at least, landed beyond the existing Pareto frontier of jailbreak success against compute budget).
TL;DR: At least two papers meet the bar:
(1) Hypothesis Generation with SAEs (Raj Movva et al) showed SAEs beat baselines at generating natural language hypotheses from text datasets (eg explaining what makes someone click on a newspaper headline), and with better cost/speed.
(2) The Auditing Games paper (Sam Marks et al) had teams of researchers compete to uncover a hidden goal in a language model. And the SAE-equipped team won.
I have quibbles with both papers, but all things considered, both qualify.
My takeaway: SAEs shine when you need understanding without knowing exactly what you're looking for, but are bad if you need precision and have supervised data. They're a useful tool but shouldn't be the central focus of the field - more thoughts here.
---
Hypothesis Generation with SAEs (Raj Movva et al)
They took text datasets—eg newspaper headlines labelled by clickthrough rates—and used SAEs to generate natural language hypotheses that predict the labels. The approach was to train an SAE on embedding vectors, find latents correlated with the label, then auto-interpret those predictive latents to get hypotheses like "political content gets more clicks." They measured performance objectively by getting an LLM to label each data point with the hypotheses and checking if these labels improved regression performance.
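To make the pipeline concrete, here's a rough sketch of the kind of loop involved (purely illustrative: the `sae`, `auto_interp` and `llm_label` helpers are hypothetical stand-ins, not the paper's actual code):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def generate_hypotheses(embeddings, labels, sae, auto_interp, top_k=20):
    """Find SAE latents correlated with the label and auto-interpret them.

    embeddings: (n_texts, d_model) text embeddings; labels: (n_texts,), e.g. clickthrough rates.
    `sae.encode` and `auto_interp` are hypothetical stand-ins for the trained SAE and
    the LLM-based auto-interpretation step.
    """
    latents = sae.encode(embeddings)  # (n_texts, n_latents) sparse activations
    corrs = np.array([np.corrcoef(latents[:, j], labels)[0, 1]
                      for j in range(latents.shape[1])])
    top = np.argsort(-np.abs(np.nan_to_num(corrs)))[:top_k]
    # Each predictive latent becomes a natural-language hypothesis,
    # e.g. "political content gets more clicks".
    return [auto_interp(latent_idx=int(j)) for j in top]

def score_hypotheses(texts, labels, hypotheses, llm_label):
    """Check whether LLM-assigned hypothesis labels improve regression on the target."""
    features = np.array([[llm_label(text, h) for h in hypotheses] for text in texts])
    return LinearRegression().fit(features, labels).score(features, labels)  # R^2 (held out, in practice)
```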
The baselines comparison wasn't perfect (they used a fancy setup with a weak LLM, and I'd have loved to see a stronger model with a simpler approach), but given the SAE method had an 11x cost advantage over the fancy scaffold + weak model, I'm still counting it as a win. They also did decent sanity checking that the SAE was finding qualitatively reasonable hypotheses.
Auditing Language Models for Hidden Objectives (Sam Marks et al, Anthropic)
They used a delightfully galaxy-brained method (4 stages of finetuning, including teaching fake facts via out-of-context reasoning) to teach a language model a complex hidden goal, then had teams of researchers compete to uncover it. One team got SAEs, the other didn't. The SAE team won—actually won so hard that a solo member in New York solved it before the West Coast half of the team woke up. So they formed a second SAE team that did worse than the black-box team, which complicates the narrative a bit.
In hindsight you also could've replicated the winning approach without SAEs (it was basically fancy semantic search), and I'm not super happy that the setup included access to the fine-tuning data and many solutions relied on finding the key documents. But the black-box team didn't think of those simple-in-hindsight methods, while SAEs let researchers find them. And ultimately, the specifics don't matter - when you're facing a model that might have hidden goals, what matters is which toolkit gets you to the answer fastest, and this paper provides compelling evidence for SAEs as a useful tool. The sample size is annoyingly small, but given how expensive this experiment was, I'm OK with it.
Runner-up: Sieve (Adam Karvonen et al)
They used conditional SAE steering to stop a model using regular expressions when writing code. Under a bunch of weird and contrived conditions this beat baselines. But a non-interpretability startup specifically requested this, and they beat baselines by the startup's criteria. My hesitation is that regex-avoidance is pretty niche, and I doubt the technique generalises to concepts without extremely clean SAE representations.
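For reference, "conditional steering" here means only intervening at positions where the relevant latent actually fires, rather than steering everywhere. A minimal sketch of that idea (my reading of the general technique, with a hypothetical `sae` object and attribute names; not Sieve's actual implementation):

```python
import torch

def conditional_steering_hook(sae, latent_idx, threshold=0.0):
    """Forward hook that removes one SAE latent's contribution, but only where it fires.

    Assumes a residual-stream SAE with an `encode` method and decoder matrix `W_dec`
    (hypothetical names), where `latent_idx` tracks the unwanted concept (e.g. "write a regex").
    """
    decoder_dir = sae.W_dec[latent_idx]  # (d_model,) decoder direction for the concept

    def hook(module, inputs, output):
        resid = output[0] if isinstance(output, tuple) else output  # (batch, seq, d_model)
        acts = sae.encode(resid)[..., latent_idx]   # (batch, seq) concept activation
        mask = (acts > threshold).unsqueeze(-1)     # only touch positions where it fires
        steered = resid - mask * acts.unsqueeze(-1) * decoder_dir
        return (steered, *output[1:]) if isinstance(output, tuple) else steered

    return hook

# Usage sketch (layer index and latent index are placeholders):
# model.model.layers[layer].register_forward_hook(conditional_steering_hook(sae, latent_idx=1234))
```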
Note: I was basically convinced enough to resolve this months ago, but have been busy and failed to make the time to think about this properly. Sorry about that, and any incorrect updates made from the delay.
Overall conclusion:
The pattern I'm seeing is that SAEs consistently fail at tasks where precision is key, especially where we have good supervised data—SAEs are noisy and carry a fair amount of error, so probes etc just work better when you know what you're looking for. But SAEs excel at unsupervised discovery, surfacing unexpected patterns when you don't know what to look for. That's exactly what made both the hypothesis generation and auditing papers work. I'm very excited about the general trend toward operationalising understanding-based tasks with objective measurements (Cywinski et al being another nice example).
I think SAEs are a useful tool but not a game-changer. They shouldn't be the field's major focus going forward, but are definitely worth having in the toolkit.
Would Paint with Ember count? https://x.com/goodfireai/status/1927415017504165978?s=46
Relevant update from the team the market creator works on: https://www.alignmentforum.org/posts/4uXCAJNuPKtKBsi28/
We had that recent paper arguing that sparse autoencoders are not particularly good compared to steering vectors / representation engineering - still not sure what we're supposed to take from this though. I'm skeptical that sparse autoencoders are useless - but maybe the reconstruction loss is just too high / the fidelity not high enough, such that representation engineering is just strictly better. But don't we still want to consider SAEs for quantization?
https://arxiv.org/pdf/2502.16681
My gut is that the death of SAEs is overstated but the market seems not to have updated on the paper - so I've sold.
@CampbellHutcheson Even if SAEs manage to beat RepE (pretty skeptical about this already), they are never beating prompting + fine-tuning. See this:
"SAEs must have been compared to appropriate baselines in a fair fight (probing, steering vectors, prompting, finetuning, adversarial example generation (a la GCG) etc)"
https://openreview.net/forum?id=FVItLat5ii&noteId=sCBMB7O3qw
apparently SAEs aren't necessarily improving upon simple K-means
@Jono3h Wild, I hadn't seen that paper, it's from Sept 2023!
I don't really update from it though - it was done super early, before we knew much about training good SAEs; it's on CNNs; I don't understand or trust their metrics (and neither do the reviewers, as far as I can tell); and they don't really discuss how they train the SAEs (it's a bit of a side thing in the paper), so I'm not at all confident they did it competently enough to learn much from the results.
@NeelNanda Ah I assumed that SAEs on such small models are easy enough to train that that wouldn't be a worry.
I did like the idea of quantifying the quality of the features by applying similarity metrics to the top N maximally exciting images (MEIs) for those features.
Intuitively I agree with the authors that you would expect the MEIs for good features to be similar to each other, and LPIPS, which the authors use, seems like a well-tested and human-friendly image similarity metric.
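For concreteness, the metric as I understand it is just the mean pairwise LPIPS distance over each feature's top-N MEIs (rough sketch using the `lpips` package; not the authors' code):

```python
import itertools
import torch
import lpips  # pip install lpips

def mei_coherence(meis: torch.Tensor, lpips_model=None) -> float:
    """Mean pairwise LPIPS distance over one feature's top-N MEIs.

    meis: (N, 3, H, W) images scaled to [-1, 1]. Lower = more visually coherent feature.
    """
    lpips_model = lpips_model or lpips.LPIPS(net="alex")
    with torch.no_grad():
        dists = [lpips_model(meis[i:i + 1], meis[j:j + 1]).item()
                 for i, j in itertools.combinations(range(len(meis)), 2)]
    return sum(dists) / len(dists)
```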
I saw that Makelov, Lange and you have a suggestion for principled evaluations that afaiu involves comparing manually-made dictionaries with SAEs. But I didn't grasp the methodology within 30 min.