https://chat.openai.com/share/543c2953-982b-4ef0-8ba8-967068140987
☝️ Seems difficult; it's a bigger model than GPT-2.
@EliezerYudkowsky It is quite fuzzy, I agree, and there are many different definitions for features.
Here I refer to a basic set of meaningful directions in the activation space from which more complex directions can be built. These directions can be converted into human-understandable concepts (phrased as "can be" to allow for features that are not human-understandable), and the model actually learns and uses these directions as general ways to represent the properties of the input data.
The question, then, is whether it will be possible to cleanly separate out these directions and convert them into human-understandable concepts for most of the properties of the data that the model is capable of representing and using.
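(The thread doesn't name a method for separating out these directions; one common approach is sparse dictionary learning, e.g. a sparse autoencoder over activations. Below is a minimal, illustrative sketch with synthetic data standing in for real model activations; all sizes, hyperparameters, and names are assumptions, not anything stated above.)

```python
# Minimal sketch: recovering candidate "feature directions" from activations
# with a sparse autoencoder. Synthetic data stands in for real model
# activations; every size and hyperparameter here is illustrative.
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model, n_features, n_samples = 64, 256, 4096

# Synthetic "activations": sparse combinations of a few ground-truth directions.
true_dirs = torch.randn(n_features, d_model)
true_dirs = true_dirs / true_dirs.norm(dim=1, keepdim=True)
coeffs = torch.relu(torch.randn(n_samples, n_features)) * (torch.rand(n_samples, n_features) < 0.02)
acts = coeffs @ true_dirs

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model, n_features):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model, bias=False)

    def forward(self, x):
        codes = torch.relu(self.encoder(x))   # sparse per-feature activations
        recon = self.decoder(codes)           # reconstruction from the learned directions
        return recon, codes

sae = SparseAutoencoder(d_model, n_features)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_weight = 1e-3                              # sparsity pressure on the codes

for step in range(2000):
    recon, codes = sae(acts)
    loss = ((recon - acts) ** 2).mean() + l1_weight * codes.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Each row of decoder.weight.T is one learned candidate direction; converting
# these into human-understandable concepts is the separate, harder problem
# raised above.
learned_dirs = sae.decoder.weight.T.detach()
print(learned_dirs.shape)  # (n_features, d_model)
```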
@firstuserhere Does "human-understandable" mean "at least one human understood it", or "all humans understood it", or something else?
@a2bb It would be better to say "human-interpretable" than "understandable", but writing "understandable" in the text above makes that text easier for me to parse.