
Resolution Criteria: This question will be resolved according to my personal judgement. I think about this as follows:
Downstream means the task must be one that I think is interesting and non-spurious (ideally one that non-interpretability people cared about beforehand), and where success can be judged objectively (though the measure can be qualitative rather than quantitative). Being useful in the real world is a significant plus, but not necessary. Ultimately this will be my judgement call.
You’re allowed to include rules constraining how the task can be solved, if there’s a good argument that this simulates a realistic situation. Eg with unlearning, you could forbid “just don’t train on the data containing the bad concept”, to simulate the real world where we have imperfect data labels (but then excluding, eg, 50% of the bad data should be permissible).
Given a constraint like “within a certain compute budget”, I’ll count 10% of the SAE’s training compute against that budget (to simulate the SAE’s cost being amortised over many downstream use cases).
A task can be chosen post-hoc by the authors or by me. If I believe a paper provides enough evidence of victory on a post-hoc chosen downstream task, but doesn’t explicitly argue this in the paper, I’ll reach out to the authors and use my best judgement.
SAEs must have been compared to appropriate baselines in a fair fight: probing, steering vectors, prompting, finetuning, adversarial example generation (a la GCG), etc. (For concreteness, a minimal sketch of what a fair probing comparison might look like is given after this list.)
SAEs must beat the baselines, not just be competitive. I'll qualitatively judge whether I think the sample size was large enough that the effect was real rather than noise.
I must believe that SAEs were an important part of the solution (i.e. that the same solution wouldn't work without the SAE)
I'll allow other dictionary learning techniques that are not SAEs but try to find a sparse, interpretable decomposition of model activations
The work must be public (e.g. I won't resolve this on private results from my team or gossip from other labs). I'll allow missing details to be clarified in private communication, so long as the key result is public.
I'll qualitatively evaluate how cherry-picked/brittle the results seem. For example, if an SAE is great for steering when the desired concept has a corresponding latent, but only 10% of the concepts people think of have corresponding latents, I'm not sure I'd count that. But if one of those 10% of concepts was a big deal (eg refusal steering), I might count it.
This is very subjective, so I will not trade in this market myself.
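To make the "fair fight" criterion above concrete, here is a minimal sketch (not taken from any particular paper) of what a probing comparison might look like: the same linear probe class trained once directly on model activations and once on SAE latent activations, with identical train/test splits. Everything here is synthetic stand-in data; in a real evaluation the activations would come from the model being studied and the SAE would be a trained one.

```python
# Minimal sketch of a "fair fight" probing comparison (synthetic stand-in data).
# Assumes: in a real evaluation the activations come from a model and the SAE
# is a trained one; both are faked here so the script runs standalone.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d_model, d_sae = 2000, 256, 1024

# Stand-in "residual stream activations" and binary concept labels.
acts = rng.normal(size=(n, d_model))
labels = (acts[:, :8].sum(axis=1) > 0).astype(int)  # toy concept

# Stand-in "SAE encoder": a random projection plus ReLU, giving sparse-ish latents.
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
b_enc = -0.5
latents = np.maximum(acts @ W_enc + b_enc, 0.0)

# Matched splits: both methods see exactly the same examples.
X_train_a, X_test_a, X_train_l, X_test_l, y_train, y_test = train_test_split(
    acts, latents, labels, test_size=0.2, random_state=0
)

# Baseline: linear probe directly on activations.
probe_acts = LogisticRegression(max_iter=2000).fit(X_train_a, y_train)
# SAE method: the same probe class, on SAE latents.
probe_sae = LogisticRegression(max_iter=2000).fit(X_train_l, y_train)

print("probe on raw activations:", probe_acts.score(X_test_a, y_test))
print("probe on SAE latents:    ", probe_sae.score(X_test_l, y_test))
```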
For context, here are three of the best contenders so far for doing something useful on a downstream task with model internals, and why I don't think they count:
SHIFT in Sam Marks' Sparse Feature Circuits - An SAE circuit was found feeding into a probe that had picked up on a spurious correlation with gender, and ablating the key gender-tracking features reduced the spurious correlation. IMO this is the closest attempt so far, and they compare reasonably to a range of baselines, but I think it falls a bit short, as the task is a bit too contrived/spurious for my tastes. I think a similar thing in a more complex/useful setting could be sufficient though.
Golden Gate Claude - while very fun, they didn't compare to the baselines of a system prompt or a steering vector (and I am unconvinced GGC would win). I would have considered a user study on eg "how fun is the system to use", where GGC won over those baselines, to be a successful resolution. It’s fine if “how fun is the system to use” was chosen after creating GGC, so long as it was chosen before the user study (no p-hacking!) - I want an existence proof of SAEs being useful.
Refusal is mediated by a single direction - a paper where we found a refusal steering vector and ablated it to jailbreak the model. In addition to not using SAEs (and so being instantly disqualified), we only found that it was competitive with fine-tuning, not better. It might have won on jailbreaking at a given compute budget (or at least, pushed out the Pareto frontier of jailbreak success against compute budget).
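For reference, the core intervention in that paper is simple enough to sketch in a few lines: compute a difference-of-means "refusal direction" between activations on harmful and harmless prompts, then either add it to activations (to induce refusal) or project it out (to jailbreak). The snippet below is a numpy-only illustration with placeholder arrays, not the paper's actual code; in the real thing the activations come from a specific layer and position of the model.

```python
# Illustration of difference-of-means steering and directional ablation,
# in the spirit of the refusal-direction paper; all arrays are placeholders.
import numpy as np

rng = np.random.default_rng(0)
d_model = 512

# Placeholder activations for harmful vs harmless prompts.
harmful_acts = rng.normal(loc=0.3, size=(100, d_model))
harmless_acts = rng.normal(loc=0.0, size=(100, d_model))

# Difference-of-means direction, normalised to a unit vector.
direction = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
direction /= np.linalg.norm(direction)

def steer(acts: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Add the direction to every activation vector (e.g. to induce refusal)."""
    return acts + alpha * direction

def ablate(acts: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Project the direction out of every activation vector (e.g. to jailbreak)."""
    coeffs = acts @ direction  # per-example component along the direction
    return acts - np.outer(coeffs, direction)

acts = rng.normal(size=(10, d_model))  # placeholder activations to intervene on
steered = steer(acts, direction, alpha=4.0)
ablated = ablate(acts, direction)
print("max component along direction after ablation:", np.abs(ablated @ direction).max())
```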