Will mechanistic interpretability have more academic impact than representation engineering by the end of 2025?
67% chance

Measured by number of new citations, will a paper on mechanistic interpretability have more academic impact than a paper on representation engineering by the end of 2025?

I'd expect a mech interp paper to be from, or have methodology originating from, Transformer Circuits and related work, and/or to use relatively low-level units of analysis (e.g., at least as small as small groups of attention heads) to explain a model's algorithm. Causal intervention work is included here as well.

For representation engineering I'd expect a top-down approach along the lines of Burns et al. 2022, Turner et al. 2023, the RepE paper, or https://arxiv.org/abs/2206.10999. Such a paper would probably use similar unsupervised methodology. I would probably include parts of previous NLP work here, e.g. https://arxiv.org/abs/2004.07667 or https://arxiv.org/abs/2309.07311, and parts of the model similarity literature.
