Will mechanistic interpretability be essentially solved for GPT-2 before 2030?
23%
chance

Mechanistic interpretability aims to reverse engineer neural networks in a way that is analogous to reverse engineering a compiled binary computer program. Achieving this level of interpretability for a neural network like GPT-2 would involve creating a binary computer program that is interpretable by expert human programmers and can emulate the input-output behavior of GPT-2 with high accuracy.

Before January 1st, 2030, will mechanistic interpretability be essentially solved for GPT-2, resulting in a binary computer program that is interpretable by ordinary expert human programmers and emulates GPT-2's input-output behavior up to a high level of accuracy?

Resolution Criteria:

This question will resolve positively if, before January 1st, 2030, a binary computer program is developed that meets the following criteria:

  1. Interpretability: The binary computer program must be interpretable by ordinary expert human programmers, which means:
    a. The program can be read, understood, and modified by programmers who are proficient in the programming language it is written in, and have expertise in the fields of computer science and machine learning.
    b. The program is well-documented, with clear explanations of its components, algorithms, and functions.
    c. The program's structure and organization adhere to established software engineering principles, enabling efficient navigation and comprehension by expert programmers.

  2. Accuracy: The binary computer program must emulate GPT-2's input-output behavior with high accuracy, as demonstrated by achieving an average word error rate of at most 1.0% relative to the original GPT-2 model when both are given identical inputs with the temperature parameter set to 0. The accuracy must be demonstrated by sampling a large number of inputs from some diverse, human-understandable distribution of text inputs (a rough sketch of such an evaluation appears after these criteria).

  3. Not fake: I will use my personal judgement to determine whether a candidate solution seems fake or not. A fake solution is anything that satisfies these criteria without getting at the spirit of the question. I'm trying to understand whether we will reverse engineer GPT-2 in the complete sense, not just whether someone will create a program that technically passes these criteria.

This question will resolve negatively if, before January 1st, 2030, no binary computer program meeting the interpretability and accuracy criteria is developed and verified according to the above requirements. If there is ambiguity or debate about whether a particular program meets the resolution criteria, I will use my discretion to determine the appropriate resolution.
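For illustration, here is roughly how I imagine the accuracy check could be run. This is a minimal sketch, not an official harness: greedy decoding stands in for temperature 0, jiwer's `wer` stands in for the word-error-rate computation, `candidate_generate` is a placeholder for the hypothetical reverse-engineered program, and the prompt list is a placeholder for the sampled input distribution.

```
# Minimal sketch of the accuracy check (illustrative assumptions noted above).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer
from jiwer import wer  # pip install jiwer

tok = GPT2Tokenizer.from_pretrained("gpt2")
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def gpt2_generate(prompt, max_new_tokens=50):
    # Greedy decoding as a stand-in for temperature 0.
    ids = tok(prompt, return_tensors="pt").input_ids
    out = gpt2.generate(ids, max_new_tokens=max_new_tokens,
                        do_sample=False, pad_token_id=tok.eos_token_id)
    return tok.decode(out[0][ids.shape[1]:])

def average_wer(candidate_generate, prompts):
    errors = [wer(gpt2_generate(p), candidate_generate(p)) for p in prompts]
    return sum(errors) / len(errors)

# Resolution would require average_wer(...) <= 0.01 over a large, diverse
# sample of prompts; here the "candidate" is GPT-2 itself, so the WER is 0.
print(average_wer(gpt2_generate, ["The meaning of life is"]))
```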

bought Ṁ100 of NO

```
Accuracy: The binary computer program must emulate GPT-2's input-output behavior with high accuracy, as demonstrated by achieving a maximum average of 1.0% word error rate compared to the original GPT-2 model when provided with identical inputs, setting the temperature parameter to 0. The accuracy must be demonstrated by sampling a large number of inputs from some diverse, human-understandable distribution of text inputs.
```

I wish this requirement were more like "attains CE loss similar to GPT-2" or "has low average KL divergence from the GPT-2 next-token distribution at temperature 1.0", either of which would be more convincing to me.
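For concreteness, the sketch below is the kind of check I have in mind; `candidate_logits_fn` is a placeholder for whatever reverse-engineered program is proposed, and the text list is a placeholder for the evaluation distribution.

```
# Rough sketch of the KL-based criterion suggested above (an illustration,
# not part of the market's resolution criteria).
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def avg_kl_from_gpt2(candidate_logits_fn, texts):
    """Mean KL(GPT-2 || candidate) in nats per token position over `texts`.

    `candidate_logits_fn` stands in for the hypothetical reverse-engineered
    program: it maps input ids to logits with the same shape as GPT-2's.
    """
    total, count = 0.0, 0
    for text in texts:
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            p_log = F.log_softmax(gpt2(ids).logits, dim=-1)
            q_log = F.log_softmax(candidate_logits_fn(ids), dim=-1)
        kl = (p_log.exp() * (p_log - q_log)).sum(-1)  # KL at each position
        total += kl.sum().item()
        count += kl.numel()
    return total / count

# Sanity check: GPT-2 compared with itself gives ~0 nats per token.
print(avg_kl_from_gpt2(lambda ids: gpt2(ids).logits, ["The quick brown fox"]))
```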

bought Ṁ100 of NO

Your criteria seem too harsh (unless you allow an incredibly long program?). Also, I like hedging!

predicts YES

Can you elaborate on what this market is about? Because what you are describing in your criteria fits gpt-2 as is. It is a computer program. It's open source. You can go have a look right now. I assume I am missing what you are asking.

predicts NO

@AlexbGoode The model weights of GPT-2, which are the actual mechanism, the "goose that lays the golden egg" as it were, are not interpretable by expert computer programmers. That's the point.

@RobinGreen I disagree with your statements, but I am not sure where our disagreement lies. Maybe it is your use of the words mechanistic and interpretable? The ANNs in GPT-2 are perfectly interpretable, even by a novice programmer, in a mechanistic way. The formulas are very simple and, for a fixed input, also very easy to compute. I can easily change all parts of the program. The thing that is hard is predicting what will happen if we do that. A change in inputs or weights or even architecture is hard to predict for these systems. But this "non-interpretability" is not a property of GPT-2, but of non-linear maps in general. This is, in a sense, what it means to be non-linear. You can easily replace the word "GPT-2" in your question with any sufficiently non-linear map, and it will always resolve to NO or YES depending on semantics of the question that have nothing to do with scientific progress or GPT-2.

Maybe an imperfect analogy, so we are not bogged down by the "AI" hype so much.
Let's say I ask: Is mechanistic interpretability essentially solved for the Lorenz system?

How would you resolve that question? (Non-rhetorical question; I would appreciate it if you could give an answer.)
Can you write a program that predicts the state of a Lorenz system after some time? Yes, of course. Can an expert programmer predict what will happen to that output if you change the initial state or parameters of the system, without solving the differential equations? No, almost by definition that is not possible.
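To make that concrete, here is a quick sketch with the standard Lorenz parameters: computing a trajectory is a few lines, but a 1e-9 change to the initial state produces a completely different trajectory, so predicting the effect of a change without just re-running the integration is hopeless.

```
# Illustration: integrating the Lorenz system is easy; predicting the effect
# of a tiny perturbation without re-integrating is not (standard sigma, rho,
# beta values assumed).
import numpy as np
from scipy.integrate import solve_ivp

def lorenz(t, s, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    x, y, z = s
    return [sigma * (y - x), x * (rho - z) - y, x * y - beta * z]

t_span, t_eval = (0.0, 40.0), np.linspace(0.0, 40.0, 4000)
a = solve_ivp(lorenz, t_span, [1.0, 1.0, 1.0], t_eval=t_eval)
b = solve_ivp(lorenz, t_span, [1.0, 1.0, 1.0 + 1e-9], t_eval=t_eval)

# The trajectories end up order-1 apart despite a 1e-9 difference at t=0.
print(np.abs(a.y - b.y).max())
```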

predicts YES

@AlexbGoode Mech interp is about reverse engineering the weights etc. to figure out the algorithms or processes implemented by them. So I agree that the descriptions for these questions are way too restrictive, and propose changing them.

(We shouldn't necessarily assume that everything GPT-2/3/4 etc. are doing in their internal representations - the features they consider important - is important in a human-centric way, or needs to be understood down to a table of features and algorithms. See the modular addition work by Neel, for example; some features will just not be relevant to us but are an important part of the data.)
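For a flavour of what a reverse-engineered algorithm can look like, here is a paraphrase of the kind of Fourier-based circuit that work reports, written out in plain numpy. The frequencies below are an illustrative choice, not the exact ones a trained network learns.

```
# Modular addition via a "Fourier" algorithm (a paraphrase of the kind of
# circuit reported in the modular addition interpretability work; the
# frequencies are an illustrative choice, not the learned ones).
import numpy as np

p = 113
ws = 2 * np.pi * np.array([1, 2, 3]) / p  # a few "key frequencies"

def mod_add_via_fourier(a, b):
    # Score each candidate answer c by the sum over frequencies of
    # cos(w*(a+b-c)), built only from trig functions of a, b, and c.
    cs = np.arange(p)
    scores = np.zeros(p)
    for w in ws:
        cos_ab = np.cos(w * a) * np.cos(w * b) - np.sin(w * a) * np.sin(w * b)
        sin_ab = np.sin(w * a) * np.cos(w * b) + np.cos(w * a) * np.sin(w * b)
        scores += cos_ab * np.cos(w * cs) + sin_ab * np.sin(w * cs)
    return int(np.argmax(scores))  # maximized exactly when c == (a + b) mod p

assert mod_add_via_fourier(50, 99) == (50 + 99) % p
```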

I really like this market, but I'm concerned that the criteria for "interpretable" are too strict. There are an awful lot of human-authored programs that aren't interpretable by that standard (opaque, uncommented, or poorly architected).

A better way of putting it might just be that a skilled programmer should be capable of making precise changes to the model, such as adding/removing facts or modifying behaviors, without having to do it by shoveling more training data through it.
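Something like the toy sketch below illustrates where we are today: making a surgical change to the weights is trivial, but predicting its effect on behavior without just re-running the model is not. The layer and neuron index are arbitrary placeholders, not a known fact-storing neuron.

```
# Toy illustration only: zero one MLP neuron's output weights in GPT-2 and
# observe how the top next-token prediction changes. The layer/neuron choice
# is an arbitrary placeholder.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

ids = tok("The Eiffel Tower is located in", return_tensors="pt").input_ids

def top_next_token():
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    return tok.decode([int(logits.argmax())])

print("before edit:", top_next_token())

with torch.no_grad():
    # c_proj is a Conv1D with weight shape (n_inner, n_embd); row 123 is the
    # output contribution of hidden neuron 123 in transformer block 5.
    model.transformer.h[5].mlp.c_proj.weight[123, :] = 0.0

print("after edit:", top_next_token())
```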

bought Ṁ0 of YES

@NLeseul Yes, I agree. Something like predictable behaviour, steerability, changes to the model leading to predictable/estimable changes in behavior, and knowing what changes to make to the model to change specific behaviors consistently. Conversion to human-interpretable features, although it sounds nice, doesn't necessarily fit.

bought Ṁ100 of NO

Does the phrase "binary program" have any specific meaning here?

bought Ṁ500 of NO

I'm at >95% that this is literally impossible for human programmers.
It would be totally crazy for this to be possible.

predicts YES

@RyanGreenblatt cheers. Let's try to make the totally crazy possible

predicts YES

Seems to me that in the mainline we don't even get a CNN interpreter that meets these resolution criteria, though probably in part due to lack of trying – if it were a major civilizational goal, we'd probably get it. So on most reasonable priors, that level of interpretability for GPT-2 seems wildly unlikely. My YES is partly due to "skin in the game"-type self-motivation and partly because I expect AI-helping-interpretability to be unreasonably effective, though 2030 still seems like a tough ask.

can we look inside yud’s brain and examine his motives 🤔

What about any organization 🤔

One can quite literally read the code now 🤔

this is insane and demonstrates no understanding of what a neural network is