Will we reverse-engineer a language model into an interpretable (python) program by 2027?
2027 · 8% chance

One of the most ambitious goals of mechanistic interpretability would be achieved if we could train a neural network and then distill it into an interpretable algorithm that closely mirrors the intermediate computations the model performs.
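For concreteness, here is a toy sketch (my own illustration, not part of the market description) of what that end state could look like: a tiny ReLU network whose weights happen to implement max(a, b), together with the human-readable python program a reverse-engineer would hope to recover from it.

```python
import numpy as np

# Toy network: one hidden ReLU unit plus a skip connection.
# With these weights the forward pass computes a + relu(b - a) == max(a, b).
W_in = np.array([[-1.0, 1.0]])   # hidden unit reads (b - a)
w_skip = np.array([1.0, 0.0])    # skip connection reads a

def original_model(x):
    """Forward pass of the toy network."""
    h = np.maximum(W_in @ x, 0.0)
    return float(w_skip @ x + h[0])

def distilled_program(x):
    """The interpretable program a reverse-engineer would hope to write."""
    a, b = x
    return max(a, b)

# Behavioural check: the two agree on random inputs.
xs = np.random.default_rng(0).normal(size=(1000, 2))
assert all(np.isclose(original_model(x), distilled_program(x)) for x in xs)
```

The market asks for the same kind of correspondence, but for a model at the scale of llama2-7B rather than a two-weight toy.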

Some argue this is unlikely to happen (e.g. https://www.lesswrong.com/posts/d52aS7jNcmi6miGbw/take-1-we-re-not-going-to-reverse-engineer-the-ai), while others are actively working toward it.

In order for the market to resolve YES, a model at least as capable as llama2-7B needs to be distilled into python code that can be understood and edited by humans, and this distilled version must perform at least 95% as well as the original model on every benchmark, except adversarially constructed benchmarks that specifically highlight differences between the distilled and original models.
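A minimal sketch of how that 95% criterion could be checked, assuming higher-is-better benchmark scores; the function and numbers below are illustrative, not official resolution code:

```python
def meets_resolution_bar(original_scores, distilled_scores, threshold=0.95):
    """Return True if the distilled program scores at least `threshold`
    times the original model's score on every benchmark considered."""
    return all(
        distilled_scores[name] >= threshold * original_scores[name]
        for name in original_scores
    )

# Illustrative numbers only (not real llama2-7B results).
original = {"MMLU": 45.3, "HellaSwag": 77.2, "ARC": 53.1}
distilled = {"MMLU": 44.0, "HellaSwag": 74.0, "ARC": 51.5}
print(meets_resolution_bar(original, distilled))  # True: each score is >= 95% of the original
```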

predicts NO

I think it's plausible, maybe probable, that we'll reverse-engineer largeish (>=100M parameter) models into understandable/editable python programs, but I don't think we'll get near-original model performance with these programs.

@NoaNabeshima I think that if we are able to reverse-engineer 100M-parameter models, this process would eventually become automatable with AI, and then we would probably also be able to reverse-engineer larger models.

At what performance threshold would you give a 50% chance that reverse engineering succeeds?

predicts NO

@NielsW I have no idea, I'll stew on it