This question asks about the progress of mechanistic interpretability for GPT-2, specifically the extent to which researchers understand the internal mechanisms of the model. Mechanistic interpretability is considered “solved” when it is generally accepted by the research community that the majority of GPT-2’s internal computations and transformations are sufficiently understood, such that its behavior can be reliably explained and predicted in terms of the individual components (e.g., neurons, layers, or attention heads) and their interactions.
Resolution Criteria
1. Definition of “Solved”:
Mechanistic interpretability for GPT-2 will be considered 100% solved if the research community widely agrees that the internal operations of the model are fully understood in terms of their specific contributions to behavior across a broad range of tasks.
Partial resolution (e.g., at 50% or 75% understanding) is not possible; the question resolves only once "solved" status is generally agreed upon.
2. Indicators of Consensus:
- Publications in top-tier AI conferences (e.g., NeurIPS, ICLR, ICML) or journals explicitly declaring that mechanistic interpretability for GPT-2 is solved.
- Widespread agreement among researchers in recognized forums (e.g., AI alignment newsletters, major research lab announcements, community hubs such as the Alignment Forum).
- Benchmark studies demonstrating a complete mechanistic understanding of GPT-2's inner workings and the ability to reliably explain, predict, and manipulate the model's behavior at a mechanistic level.

The consensus must be broad and sustained; isolated claims or papers are insufficient.
3. Model Version:
The question specifically pertains to the standard GPT-2 model, as originally released by OpenAI in 2019. Extensions, modifications, or other versions of GPT-2 are excluded from consideration.
4. Resolution Timing:
The question will remain unresolved until a general consensus is reached, irrespective of the calendar year. It resolves as “YES” only when the conditions above are met.
@IhorKendiukhov The title asks about the current percentage of "interpretability solved" for GPT-2, while the description points to something very different. It's basically:
This market resolves Yes as soon as there is consensus that mechanistic interpretability is 100% solved for GPT-2; otherwise it stays open indefinitely.
This might work a little now that loans are back, but it probably won't tell you what you want to find out. I'd suggest at least changing the title (and probably resolving this N/A in favor of a differently structured market).