Mechanistic interpretability aims to reverse engineer neural networks in a way that is analogous to reverse engineering a compiled binary computer program. Achieving this level of interpretability for a neural network like GPT-2 would involve creating a binary computer program that is interpretable by expert human programmers and can emulate the input-output behavior of GPT-2 with high accuracy.
Before January 1st, 2030, will mechanistic interpretability be essentially solved for GPT-2, resulting in a binary computer program that is interpretable by ordinary expert human programmers and emulates GPT-2's input-output behavior up to a high level of accuracy?
Resolution Criteria:
This question will resolve positively if, before January 1st, 2030, a binary computer program is developed that meets the following criteria:
Interpretability: The binary computer program must be interpretable by ordinary expert human programmers, which means:
a. The program can be read, understood, and modified by programmers who are proficient in the programming language it is written in, and have expertise in the fields of computer science and machine learning.
b. The program is well-documented, with clear explanations of its components, algorithms, and functions.
c. The program's structure and organization adhere to established software engineering principles, enabling efficient navigation and comprehension by expert programmers.
Accuracy: The binary computer program must emulate GPT-2's input-output behavior with high accuracy, as demonstrated by achieving a maximum average of 1.0% word error rate compared to the original GPT-2 model when provided with identical inputs, setting the temperature parameter to 0. The accuracy must be demonstrated by sampling a large number of inputs from some diverse, human-understandable distribution of text inputs. (A sketch of one possible version of this check appears after these criteria.)
Not fake: I will use my personal judgement to determine whether a candidate solution seems fake or not. A fake solution is anything that satisfies these criteria without getting at the spirit of the question. I'm trying to understand whether we will reverse engineer GPT-2 in the complete sense, not just whether someone will create a program that technically passes these criteria.
This question will resolve negatively if no binary computer program meeting the interpretability and accuracy criteria has been developed and verified according to the above requirements before January 1st, 2030. If there is ambiguity or debate about whether a particular program meets the resolution criteria, I will use my discretion to determine the appropriate resolution.
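For concreteness, here is a minimal sketch of how the accuracy check might be run. It assumes the candidate program can be wrapped as a Python callable `emulator_generate(prompt) -> str` (a hypothetical interface, not part of the criteria) and uses Hugging Face's GPT-2 with greedy decoding as the temperature-0 reference.

```
# Minimal sketch of the accuracy check. Assumes the candidate program is wrapped
# as a Python callable `emulator_generate(prompt) -> str` (hypothetical interface)
# and uses Hugging Face's GPT-2 with greedy decoding as the temperature-0 reference.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def gpt2_generate(prompt, max_new_tokens=64):
    """Greedy (temperature-0) continuation from the reference GPT-2."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model.generate(ids, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(out[0, ids.shape[1]:], skip_special_tokens=True)

def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

def average_wer(prompts, emulator_generate):
    """Average WER of the emulator's outputs against GPT-2's, over sampled prompts."""
    scores = [word_error_rate(gpt2_generate(p), emulator_generate(p)) for p in prompts]
    return sum(scores) / len(scores)

# Positive resolution requires average_wer(sampled_prompts, emulator_generate) <= 0.01.
```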
```
Accuracy: The binary computer program must emulate GPT-2's input-output behavior with high accuracy, as demonstrated by achieving a maximum average of 1.0% word error rate compared to the original GPT-2 model when provided with identical inputs, setting the temperature parameter to 0. The accuracy must be demonstrated by sampling a large number of inputs from some diverse, human-understandable distribution of text inputs.
```
I wish this requirement were more like "attains CE loss similar to GPT-2" or "has low average KL divergence from the GPT-2 next-token distribution at temperature 1.0", which would be more convincing to me.
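For reference, here is a sketch of what that alternative criterion could look like: the mean KL divergence from GPT-2's next-token distribution at every position of a text, assuming the candidate program exposes logits over GPT-2's vocabulary through a hypothetical `emulator_logits` function.

```
# Sketch of the alternative criterion: mean KL divergence between GPT-2's
# next-token distribution and the candidate's, at every position of a text.
# Assumes a hypothetical `emulator_logits(input_ids) -> Tensor` of shape
# (seq_len, vocab_size) aligned with GPT-2's vocabulary.
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def mean_kl_to_gpt2(text, emulator_logits):
    """Mean KL(GPT-2 || emulator) over all next-token positions in `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        ref_logits = model(ids).logits[0]        # (seq_len, vocab_size)
        emu_logits = emulator_logits(ids[0])     # (seq_len, vocab_size), hypothetical
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    emu_logp = F.log_softmax(emu_logits, dim=-1)
    # KL(P || Q) = sum_x P(x) * (log P(x) - log Q(x)), averaged over positions
    kl = (ref_logp.exp() * (ref_logp - emu_logp)).sum(dim=-1)
    return kl.mean().item()
```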
Current SotA on interpreting GPT2: https://openai.com/research/language-models-can-explain-neurons-in-language-models
@AlexbGoode The model weights of GPT-2, which are the actual mechanism, the "goose that lays the golden egg" as it were, are not interpretable by expert computer programmers. That's the point.
@RobinGreen I disagree with your statements, but I am not sure where our disagreement lies. Maybe it is your use of the words mechanistic and interpretable? The ANNs in GPT-2 are perfectly interpretable, even by a novice programmer, in a mechanistic way. The formulas are very simple and, for a fixed input, also very easy to compute. I can easily change all parts of the program. The thing that is hard is predicting what will happen if we do that. A change in inputs, weights, or even architecture is hard to predict for these systems. But this "non-interpretability" is not a property of GPT-2; it is a property of non-linear maps in general. This is, in a sense, what it means to be non-linear. You could replace "GPT-2" in your question with any sufficiently non-linear map, and it would always resolve NO or YES depending on semantics of the question that have nothing to do with scientific progress or with GPT-2.
Maybe an imperfect analogy, so we are not bogged down by the "AI" hype so much.
Let's say I ask: Is mechanistic interpretability essentially solved for the Lorenz system?
How would you resolve that question? (A non-rhetorical question; I would appreciate it if you could give an answer.)
Can you write a program that predicts the state of a Lorenz system after some time? Yes, of course. Can an expert programmer predict what will happen to that output if you change the initial state or parameters of the system, without solving the differential equations? No, almost by definition that is not possible.
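To make the analogy concrete, here is a small sketch (assuming NumPy and SciPy are available) showing that integrating the Lorenz system forward is mechanically trivial, while a 1e-9 perturbation of the initial state still produces a completely different end state; that is the sense in which the effect of an "edit" cannot be predicted without re-solving.

```
# Integrating the Lorenz system forward is mechanically trivial, but a 1e-9
# perturbation of the initial state still produces a completely different
# trajectory, so the effect of an "edit" cannot be predicted without re-solving.
import numpy as np
from scipy.integrate import solve_ivp

def lorenz(t, state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    x, y, z = state
    return [sigma * (y - x), x * (rho - z) - y, x * y - beta * z]

x0 = np.array([1.0, 1.0, 1.0])
x0_nudged = x0 + np.array([1e-9, 0.0, 0.0])  # tiny change to one coordinate

sol_a = solve_ivp(lorenz, (0.0, 40.0), x0, dense_output=True)
sol_b = solve_ivp(lorenz, (0.0, 40.0), x0_nudged, dense_output=True)

print("state at t=40, original: ", sol_a.sol(40.0))
print("state at t=40, perturbed:", sol_b.sol(40.0))
# The end states differ by O(1) despite the 1e-9 perturbation.
```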
@AlexbGoode Mech interp is about reverse engineering the weights etc. to figure out the algorithms or processes implemented by them. So I agree that the descriptions for these questions are way too restrictive, and I propose changing them.
(We shouldn't necessarily assume that everything GPT-2/3/4 etc. are doing in their internal representations - the features they consider important - is important in a human-centric way, or needs to be understood to the point where there is a table of features and algorithms. See the modular addition work by Neel, for example; some features will just not be relevant to us but are an important part of the data.)
I really like this market, but I'm concerned that the criteria for "interpretable" are too strict. There are an awful lot of human-authored programs that aren't interpretable by that standard (opaque, uncommented, or poorly architected).
A better way of putting it might just be that a skilled programmer should be capable of making precise changes to the model, such as adding/removing facts or modifying behaviors, without having to do it by shoveling more training data through it.
@NLeseul Yes, I agree. Something like predictable behaviour, steerability, changes to the model leading to predictable/estimable changes in behavior, the ability to know which changes to the model consistently alter specific aspects of its behavior. Conversion to human-interpretable features, although it sounds nice, doesn't necessarily fit.
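As a rough illustration of the kind of "steerability" being discussed here, the sketch below nudges GPT-2's behavior without retraining by adding a direction to the residual stream at one layer via a forward hook. The layer index, prompts, and scale are illustrative choices, not an established recipe; whether this counts as mechanistic understanding is exactly what the thread is debating.

```
# Nudge GPT-2's behavior without retraining: add a direction to the residual
# stream at one layer via a forward hook. Layer index, prompts, and scale are
# illustrative choices, not an established recipe.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
LAYER = 6  # arbitrary middle layer of the 12-layer model

def residual_at_layer(prompt):
    """Residual-stream activation at LAYER for the last token of `prompt`."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        hidden = model(ids, output_hidden_states=True).hidden_states[LAYER]
    return hidden[0, -1]

# Contrastive "steering vector": difference of activations on two opposed prompts.
steer = residual_at_layer("I love this, it is wonderful") \
      - residual_at_layer("I hate this, it is terrible")

def add_steering(module, inputs, output, scale=4.0):
    """Forward hook: shift the block's output hidden states along `steer`."""
    return (output[0] + scale * steer,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
ids = tokenizer("The movie was", return_tensors="pt").input_ids
with torch.no_grad():
    out = model.generate(ids, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(out[0]))  # continuation should lean toward the "love" direction
handle.remove()
```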
Seems to me that in the mainline we don't even get a CNN interpreter which meets these resolution criteria, though probably in part due to lack of trying; if it were a major civilizational goal, we'd probably get it. So on most reasonable priors that level of interpretability for GPT-2 seems wildly unlikely. My YES is in part due to "skin in the game"-type self-motivation and in part because I expect AI-helping-interpretability to be unreasonably effective, though 2030 still seems like a tough ask.