Will it be possible to disentangle most of the features learned by a model comparable to GPT-3 this decade? (1k subsidy)
1.4k · 2031 · 55% chance


https://chat.openai.com/share/543c2953-982b-4ef0-8ba8-967068140987

☝️ Seems difficult; a much bigger model than GPT-2.

bought Ṁ5 YES at 57%

@VAPOR Essentially a link to the autointerp work by OpenAI, i.e. Bills et al (2023) (link).

bought Ṁ0 of YES

@EliezerYudkowsky trade on your current estimate?

@firstuserhere What is a disentangled feature?

@EliezerYudkowsky Something that represents a single property of the data.

@firstuserhere That is not enough for me to figure out how this market will be judged.

@EliezerYudkowsky It is quite fuzzy, I agree, and there are many different definitions for features.

Here I refer to a basic set of meaningful directions in the activation space from which more complex directions can be composed. These meaningful directions can be converted into human-understandable concepts (while still allowing for the existence of features that are not human-understandable), and the model actually learns and uses these directions as general ways to represent properties of the input data.

The question, then, is whether it will be possible to cleanly separate out these directions and convert them into human-understandable concepts for most of the properties of the data that the model is capable of representing and using.
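The "features as directions" framing above can be sketched concretely. The following is a toy NumPy illustration (all sizes and names are made up, and the directions are kept orthogonal for clarity; real models likely hold more features than dimensions, i.e. superposition, which is exactly what makes the disentangling question hard):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, assumed for illustration; GPT-3-scale models are far larger.
d_model, n_features = 256, 64

# "Feature directions": orthonormal vectors in activation space.
# Orthogonality is a simplifying assumption; with n_features > d_model
# (superposition) the directions must overlap and recovery gets hard.
q, _ = np.linalg.qr(rng.normal(size=(d_model, d_model)))
directions = q[:, :n_features].T          # shape (n_features, d_model)

# An activation vector: a sparse non-negative combination of a few directions,
# i.e. a few "properties of the input data" are present at once.
active = rng.choice(n_features, size=3, replace=False)
coeffs = rng.uniform(0.5, 2.0, size=3)
activation = coeffs @ directions[active]  # shape (d_model,)

# If the directions are known, reading off which features are present is
# just projection; the open problem the market asks about is *recovering*
# the directions themselves for most features of the model.
scores = directions @ activation
recovered = np.argsort(scores)[-3:]
```

Here `recovered` matches `active` only because the directions were handed to us; methods like sparse dictionary learning try to find them from activations alone.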

@firstuserhere Does "human-understandable" mean "at least one human understood it", or "all humans understood it", or something else?

@a2bb "Human-interpretable" would be more precise than "understandable", but writing "understandable" in the text above makes it easier for me to parse.
