Will we reverse-engineer a language model into an interpretable (python) program by 2027?
2027 · 8% chance

One of the most ambitious goals of mechanistic interpretability would be achieved if we could train a neural network and then distill it into an interpretable algorithm that closely mirrors the intermediate computations the model performs.
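For concreteness, here is a toy sketch (my own illustration, not part of the market description) of what that end state could look like: a tiny ReLU network whose weights happen to implement max(a, b), together with the human-readable python program a reverse-engineer would hope to recover from it.

```python
import numpy as np

# Toy network: one hidden ReLU unit plus a skip connection.
# With these weights the forward pass computes a + relu(b - a) == max(a, b).
W_in = np.array([[-1.0, 1.0]])   # hidden unit reads (b - a)
w_skip = np.array([1.0, 0.0])    # skip connection reads a

def original_model(x):
    """Forward pass of the toy network."""
    h = np.maximum(W_in @ x, 0.0)
    return float(w_skip @ x + h[0])

def distilled_program(x):
    """The interpretable program a reverse-engineer would hope to write."""
    a, b = x
    return max(a, b)

# Behavioural check: the two agree on random inputs.
xs = np.random.default_rng(0).normal(size=(1000, 2))
assert all(np.isclose(original_model(x), distilled_program(x)) for x in xs)
```

The market asks for the same kind of correspondence, but for a model at the scale of llama2-7B rather than a two-weight toy.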

Some argue this is unlikely to happen (e.g. https://www.lesswrong.com/posts/d52aS7jNcmi6miGbw/take-1-we-re-not-going-to-reverse-engineer-the-ai), while others are actively working toward it.

In order for the market to resolve YES, a model at least as capable as llama2-7B needs to be distilled into python code that can be understood and edited by humans, and this distilled version must perform at least 95% as well as the original model on every benchmark, except adversarially constructed benchmarks that specifically highlight differences between the distilled and original models.
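A minimal sketch of how that 95% criterion could be checked, assuming higher-is-better benchmark scores; the function and numbers below are illustrative, not official resolution code:

```python
def meets_resolution_bar(original_scores, distilled_scores, threshold=0.95):
    """Return True if the distilled program scores at least `threshold`
    times the original model's score on every benchmark considered."""
    return all(
        distilled_scores[name] >= threshold * original_scores[name]
        for name in original_scores
    )

# Illustrative numbers only (not real llama2-7B results).
original = {"MMLU": 45.3, "HellaSwag": 77.2, "ARC": 53.1}
distilled = {"MMLU": 44.0, "HellaSwag": 74.0, "ARC": 51.5}
print(meets_resolution_bar(original, distilled))  # True: each score is >= 95% of the original
```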

predicts NO

I think it's plausible, maybe probable, that we'll reverse-engineer largeish (>=100M parameter) models into understandable/editable python programs, but I don't think we'll get near-original model performance with these programs.

@NoaNabeshima I think that if we are able to reverse-engineer 100M-parameter models, this process would eventually become automatable with AI, and then we would probably also be able to reverse-engineer larger models.

At what performance threshold would you give a 50% chance that reverse engineering succeeds?

predicts NO

@NielsW I have no idea, I'll stew on it