In 2025, category distribution for solved problems from the 200 Concrete Open Problems in Mechanistic Interpretability?
Toy language models: 14%
Circuits in the wild: 27%
Interpreting algorithmic problems: 17%
Polysemanticity and superposition: 6%
Analyzing training dynamics: 7%
Tooling and automation: 11%
Image model interpretability: 7%
Reinforcement learning interpretability: 5%
Learned features in large language models: 5%

The 200 Concrete Open Problems in Mechanistic Interpretability is a list of 200 concrete research questions in neural net interpretability, proposed in December 2022 by Neel Nanda. (A centralized table of all problems is available on this Google Sheet and this Coda document.) The problems are divided into the following categories (which I've decapitalized for readability):

  • Toy language models

  • Circuits in the wild

  • Interpreting algorithmic problems

  • Polysemanticity and superposition

  • Analyzing training dynamics

  • Tooling and automation

  • Image model interpretability

  • Reinforcement learning interpretability

  • Learned features in language models


This market resolves MULTI to the distribution of categories for problems solved before January 1, 2025. I plan to use the Coda document to resolve this market (if it goes down or becomes obviously untrustworthy, I'll use the Google Sheet as a backup). If there's no way I can find out the category distribution, or if human civilization falls in the meantime, then this market resolves N/A.
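
For concreteness, here is a minimal sketch of how such a tally could be computed from a CSV export of the backup Google Sheet. The file name and the column names ("Category", "Status", "Date solved") are assumptions for illustration only; the real sheet's schema may differ:

```python
from collections import Counter
from datetime import date
import csv

CUTOFF = date(2025, 1, 1)

def category_distribution(path):
    """Tally solved problems per category from a CSV export of the problem list.

    Assumes columns named "Category", "Status", and "Date solved"
    (YYYY-MM-DD); these are hypothetical and may not match the sheet.
    """
    counts = Counter()
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # Only count problems explicitly marked solved.
            if (row.get("Status") or "").strip().lower() != "solved":
                continue
            raw_date = (row.get("Date solved") or "").strip()
            if not raw_date:
                continue
            # Only count problems solved before the resolution cutoff.
            if date.fromisoformat(raw_date) >= CUTOFF:
                continue
            counts[(row.get("Category") or "Uncategorized").strip()] += 1
    total = sum(counts.values())
    # Normalize counts to fractions of all solved problems.
    return {cat: n / total for cat, n in counts.items()} if total else {}

if __name__ == "__main__":
    dist = category_distribution("problems.csv")
    for cat, frac in sorted(dist.items(), key=lambda kv: -kv[1]):
        print(f"{frac:6.1%}  {cat}")
```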

To make New Year's Day 2025 more interesting, this market will close and resolve 32 minutes after midnight EST.

EDIT: switching to 32 minutes to increase the gap, and EST since that'll be my actual timezone

EDIT 2: completing incomplete sentence


The grokking thing would fall within "Analyzing training dynamics"?

@mariopasquato yes, question 5.4