Will we be able to estimate the feature importance curve or feature sparsity curve of real models? (by end of 2024)
62% chance (closes Dec 31)
GPT-2 onwards. Comparable performance.
My friend says that "features" could mean one of several different things ("parts of the model that correlate with human concepts", "linear things that the model uses as in TMOS [Toy Models of Superposition]", "some other decomposition of the model") and that there may not actually be a real "feature importance curve".
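For concreteness, here is a minimal sketch of what "estimating the feature importance curve" could mean under the TMOS reading (features as linear directions, each with a scalar importance). Everything here is an assumption for illustration: the per-feature importance scores are taken as given (e.g., the loss increase from ablating each SAE feature in a real model), and `feature_importance_curve`, `fit_power_law`, and the Pareto stand-in data are hypothetical, not measurements from any actual model.

```python
import numpy as np

def feature_importance_curve(importances):
    """Sort per-feature importance scores in descending order.

    The 'feature importance curve' (in the Toy Models of Superposition
    sense) is importance plotted against feature rank; a power law
    I_i ~ i**(-alpha) is one common parametric form for it.
    """
    return np.sort(np.asarray(importances))[::-1]

def fit_power_law(curve):
    """Fit log(importance) = log(c) - alpha * log(rank) by least squares."""
    ranks = np.arange(1, len(curve) + 1)
    mask = curve > 0  # log requires strictly positive importances
    slope, intercept = np.polyfit(np.log(ranks[mask]), np.log(curve[mask]), 1)
    return -slope, np.exp(intercept)  # (alpha, c)

# Hypothetical usage: in practice, importances might come from ablating
# each SAE feature and measuring the resulting loss increase. Here we
# substitute synthetic heavy-tailed data, not real model measurements.
rng = np.random.default_rng(0)
fake_importances = rng.pareto(1.5, size=1000)
curve = feature_importance_curve(fake_importances)
alpha, c = fit_power_law(curve)
print(f"fitted power-law exponent alpha = {alpha:.2f}")
```

Whether a fit like this is meaningful is exactly what the comment above questions: if there is no privileged decomposition into features, the "curve" depends on which decomposition you pick.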