In 2025, category distribution for solved problems from the 200 Concrete Open Problems in Mechanistic Interpretability?
Toy language models: 14%
Circuits in the wild: 27%
Interpreting algorithmic problems: 17%
Polysemanticity and superposition: 6%
Analyzing training dynamics: 7%
Tooling and automation: 11%
Image model interpretability: 7%
Reinforcement learning interpretability: 5%
Learned features in large language models: 5%

The 200 Concrete Open Problems in Mechanistic Interpretability is a list of 200 concrete research questions in neural net interpretability, proposed in December 2022 by Neel Nanda. (A centralized table of all problems is available on this Google Sheet and this Coda document.) The problems are divided into the following categories (which I've decapitalized for readability):

  • Toy language models

  • Circuits in the wild

  • Interpreting algorithmic problems

  • Polysemanticity and superposition

  • Analyzing training dynamics

  • Tooling and automation

  • Image model interpretability

  • Reinforcement learning interpretability

  • Learned features in language models


This market resolves MULTI to the distribution of categories for problems solved before January 1, 2025. I plan to use the Coda document to resolve this market (if it goes down or becomes obviously untrustworthy, I'll use the Google Sheet as a backup). If there's no way I can find out the category distribution, or if human civilization falls in the meantime, then this market resolves N/A.
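
For concreteness, here is a minimal sketch of how such a tally could be computed from a CSV export of the backup Google Sheet. The file name and the column names ("Category", "Status", "Date solved") are assumptions for illustration only; the real sheet's schema may differ:

```python
from collections import Counter
from datetime import date
import csv

CUTOFF = date(2025, 1, 1)

def category_distribution(path):
    """Tally solved problems per category from a CSV export of the problem list.

    Assumes columns named "Category", "Status", and "Date solved"
    (YYYY-MM-DD); these are hypothetical and may not match the sheet.
    """
    counts = Counter()
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # Only count problems explicitly marked solved.
            if (row.get("Status") or "").strip().lower() != "solved":
                continue
            raw_date = (row.get("Date solved") or "").strip()
            if not raw_date:
                continue
            # Only count problems solved before the resolution cutoff.
            if date.fromisoformat(raw_date) >= CUTOFF:
                continue
            counts[(row.get("Category") or "Uncategorized").strip()] += 1
    total = sum(counts.values())
    # Normalize counts to fractions of all solved problems.
    return {cat: n / total for cat, n in counts.items()} if total else {}

if __name__ == "__main__":
    dist = category_distribution("problems.csv")
    for cat, frac in sorted(dist.items(), key=lambda kv: -kv[1]):
        print(f"{frac:6.1%}  {cat}")
```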

To make New Year's Day 2025 more interesting, this market will close and resolve 32 minutes after midnight EST.

EDIT: switching to 32 minutes to increase the gap, and EST since that'll be my actual timezone

EDIT 2: completing incomplete sentence


The grokking thing would fall within "Analyzing training dynamics"?

@mariopasquato yes, question 5.4