Will mechanistic interpretability be essentially solved for GPT-2 before 2030?
closes 2030
25% chance

Mechanistic interpretability aims to reverse engineer neural networks in a way that is analogous to reverse engineering a compiled binary computer program. Achieving this level of interpretability for a neural network like GPT-2 would involve creating a binary computer program that is interpretable by expert human programmers and can emulate the input-output behavior of GPT-2 with high accuracy.

Before January 1st, 2030, will mechanistic interpretability be essentially solved for GPT-2, resulting in a binary computer program that is interpretable by ordinary expert human programmers and emulates GPT-2's input-output behavior up to a high level of accuracy?

Resolution Criteria:

This question will resolve positively if, before January 1st, 2030, a binary computer program is developed that meets the following criteria:

  1. Interpretability: The binary computer program must be interpretable by ordinary expert human programmers, which means:
    a. The program can be read, understood, and modified by programmers who are proficient in the programming language it is written in, and have expertise in the fields of computer science and machine learning.
    b. The program is well-documented, with clear explanations of its components, algorithms, and functions.
    c. The program's structure and organization adhere to established software engineering principles, enabling efficient navigation and comprehension by expert programmers.

  2. Accuracy: The binary computer program must emulate GPT-2's input-output behavior with high accuracy, demonstrated by an average word error rate of at most 1.0% relative to the original GPT-2 model when provided with identical inputs, with the temperature parameter set to 0. The accuracy must be demonstrated by sampling a large number of inputs from some diverse, human-understandable distribution of text inputs (a verification sketch appears below the resolution criteria).

  3. Not fake: I will use my personal judgement to determine whether a candidate solution seems fake or not. A fake solution is anything that satisfies these criteria without getting at the spirit of the question. I'm trying to understand whether we will reverse engineer GPT-2 in the complete sense, not just whether someone will create a program that technically passes these criteria.

This question will resolve negatively if no binary computer program meeting the interpretability and accuracy criteria has been developed and verified according to the above requirements before January 1st, 2030. If there is ambiguity or debate about whether a particular program meets the resolution criteria, I will use my discretion to determine the appropriate resolution.
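
For concreteness, here is a minimal sketch of how the accuracy criterion might be checked. It assumes the reverse-engineered program exposes a hypothetical `candidate_generate(prompt)` function; the prompt list, generation length, and word-error-rate implementation are illustrative choices only, not part of the resolution criteria.

```python
# Sketch of an accuracy check: compare greedy (temperature-0) continuations from the
# original GPT-2 against a hypothetical reverse-engineered program.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def gpt2_generate(prompt: str, max_new_tokens: int = 50) -> str:
    """Greedy continuation from the original GPT-2."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model.generate(ids, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(out[0][ids.shape[1]:])

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Prompts would be sampled from some diverse, human-understandable distribution;
# these three are placeholders. candidate_generate() is hypothetical.
prompts = ["The history of the Roman Empire", "def fibonacci(n):", "Dear Sir or Madam,"]
rates = [word_error_rate(gpt2_generate(p), candidate_generate(p)) for p in prompts]
print(f"average WER: {sum(rates) / len(rates):.4f}")  # must be <= 0.01 to pass
```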

Noa Nabeshima bought Ṁ100 of NO

```
Accuracy: The binary computer program must emulate GPT-2's input-output behavior with high accuracy, as demonstrated by achieving a maximum average of 1.0% word error rate compared to the original GPT-2 model when provided with identical inputs, setting the temperature parameter to 0. The accuracy must be demonstrated by sampling a large number of inputs from some diverse, human-understandable distribution of text inputs.
```

I wish this requirement was more like "attains CE loss similar to GPT-2" or "has low average KL divergence from the GPT-2 next-token distribution at temperature 1.0", which I think would be more convincing to me.
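
As a rough sketch of what that alternative criterion could look like, assuming the candidate program exposes a hypothetical `candidate_logits(input_ids)` function returning logits of the same shape as GPT-2's:

```python
# Average KL divergence between GPT-2's next-token distribution and a candidate
# program's distribution, over all token positions of a given text.
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def avg_kl_to_candidate(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        gpt2_logits = model(ids).logits        # (1, seq, vocab)
    cand_logits = candidate_logits(ids)        # hypothetical hook, same shape
    log_p = F.log_softmax(gpt2_logits, dim=-1) # GPT-2 distribution (reference)
    log_q = F.log_softmax(cand_logits, dim=-1) # candidate distribution
    # KL(P || Q), averaged over all positions
    kl = (log_p.exp() * (log_p - log_q)).sum(dim=-1)
    return kl.mean().item()
```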

Neel Nanda bought Ṁ100 of NO

Your criteria seem too harsh (unless you allow an incredibly long program?). Also, I like hedging!

Alex B. Goode

Can you elaborate on what this market is about? Because what you are describing in your criteria fits GPT-2 as is. It is a computer program. It's open source. You can go have a look right now. I assume I am missing what you are asking.

Robin Green is predicting NO at 23%

@AlexbGoode The model weights of GPT-2, which are the actual mechanism, the "goose that lays the golden egg" as it were, are not interpretable by expert computer programmers. That's the point.

Alex B. Goode

@RobinGreen I disagree with your statements, but I am not sure where our disagreement lies. Maybe it is your use of the words mechanistic and interpretable? The ANNs in GPT-2 are perfectly interpretable, even to a novice programmer, in a mechanistic way: the formulas are very simple and, for a fixed input, also very easy to compute. I can easily change any part of the program. The thing that is hard is predicting what will happen if we do that. A change in inputs, weights, or even architecture is hard to predict for these systems. But this "non-interpretability" is not a property of GPT-2; it is a property of non-linear maps in general. That is, in a sense, what it means to be non-linear. You could replace "GPT-2" in your question with any sufficiently non-linear map and it would always resolve NO or YES depending on semantics of the question that have nothing to do with scientific progress or GPT-2.

Maybe an imperfect analogy, so we don't get bogged down by the "AI" hype so much.
Let's say I ask: is mechanistic interpretability essentially solved for the Lorenz system?

How would you resolve that question? (A non-rhetorical question; I would appreciate an answer.)
Can you write a program that predicts the state of a Lorenz system after some time? Yes, of course. Can an expert programmer predict what will happen to that output if you change the initial state or parameters of the system, without solving the differential equations? No; almost by definition that is not possible.
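
A minimal numerical sketch of that point (standard textbook Lorenz parameters; the 1e-9 perturbation is an arbitrary illustrative choice): the mechanism is three short equations, yet the only way to know where a slightly perturbed trajectory ends up is to integrate it again.

```python
# Two Lorenz trajectories from almost identical initial states diverge completely.
import numpy as np
from scipy.integrate import solve_ivp

def lorenz(t, state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    x, y, z = state
    return [sigma * (y - x), x * (rho - z) - y, x * y - beta * z]

t_span, t_eval = (0.0, 40.0), np.linspace(0.0, 40.0, 4000)
a = solve_ivp(lorenz, t_span, [1.0, 1.0, 1.0], t_eval=t_eval)
b = solve_ivp(lorenz, t_span, [1.0, 1.0, 1.0 + 1e-9], t_eval=t_eval)  # perturbed by 1e-9

# Maximum coordinate-wise separation at a few times: tiny at first, then order-10.
print(np.abs(a.y - b.y).max(axis=0)[[0, 1000, 2000, 3999]])
```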

firstuserhere is predicting YES at 23%

@AlexbGoode Mech interp is about reverse engineering the weights etc. to figure out the algorithms or processes implemented by them. So I agree that the descriptions for these questions are way too restrictive, and I propose changing them.

(We shouldn't necessarily assume that everything GPT-2/3/4 etc. are doing in their internal representations, i.e. the features they consider important, is important in a human-centric way, or that it all needs to be understood to the point where there is a table of features and algorithms. See the modular addition work by Neel, for example: some features will just not be relevant to us but are an important part of the data.)

NLeseul

I really like this market, but I'm concerned that the criteria for "interpretable" are too strict. There's an awful lot of human-authored programs that aren't interpretable by that standard (opaque, uncommented, or poorly-architected).

A better way of putting it might just be that a skilled programmer should be capable of making precise changes to the model, such as adding/removing facts or modifying behaviors, without having to do it by shoveling more training data through it.

firstuserhere bought Ṁ0 of YES

@NLeseul Yes, I agree. Something like predictable behavior, steerability, changes to the model leading to predictable/estimable changes in behavior, and the ability to know what changes to make to the model so that specific behaviors change consistently. Conversion to human-interpretable features, although it sounds nice, doesn't necessarily fit.
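
As one purely illustrative version of this kind of steerability, here is a minimal activation-steering sketch for GPT-2 using a forward hook; the layer index, contrast prompts, and scaling factor are arbitrary choices, not an established recipe.

```python
# Steer GPT-2 by adding a fixed direction to the residual stream at one block,
# changing behavior without pushing any training data through the model.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def hidden_at_layer(text: str, layer: int = 6) -> torch.Tensor:
    """Final-token residual stream at the given layer for a prompt."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        hs = model(ids, output_hidden_states=True).hidden_states
    return hs[layer][0, -1]

# Direction from a contrast pair; the prompts and the 4.0 scale are arbitrary.
steering = hidden_at_layer("I feel wonderful") - hidden_at_layer("I feel terrible")

def add_direction(module, inputs, output):
    # A GPT-2 block returns a tuple; element 0 is the residual stream.
    return (output[0] + 4.0 * steering,) + output[1:]

handle = model.transformer.h[6].register_forward_hook(add_direction)
ids = tokenizer("The weather today is", return_tensors="pt").input_ids
with torch.no_grad():
    out = model.generate(ids, max_new_tokens=20, do_sample=False)
handle.remove()
print(tokenizer.decode(out[0]))
```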

Franek Żak bought Ṁ100 of NO

Does the phrase "binary program" have any specific meaning here?

Ryan Greenblatt bought Ṁ500 of NO

I'm at >95% that this is literally impossible for human programmers.
It would be totally crazy for this to be possible.

firstuserhere is predicting YES at 30%

@RyanGreenblatt cheers. Let's try to make the totally crazy possible

Lovre is predicting YES at 25%

Seems to me that in the mainline we don't even get a CNN interpreter that meets these resolution criteria, though probably in part due to lack of trying; if it were a major civilizational goal, we'd probably get it. So on most reasonable priors, that level of interpretability for GPT-2 seems wildly unlikely. My YES is in part "skin in the game"-type self-motivation and in part because I expect AI-helping-interpretability to be unreasonably effective, though 2030 still seems like a tough ask.

Gigacasting

can we look inside yud’s brain and examine his motives 🤔

What about any organization 🤔

Gigacasting

One can quite literally read the code now 🤔

/ this is insane and demonstrates no understanding of what a neural network is
