are LLMs easy to align because unsupervised learning imbues them with an ontology where human values are easy to express
Dec 31 · 33% chance

resolves based on my judgement in 2 years.

https://www.beren.io/2023-07-05-My-path-to-prosaic-alignment-and-open-questions/

read the linked post for an explanation of the title. one argument for why LLMs will be easy to align is that the unsupervised pretraining they get makes human values easy to express and train for.

humans in nature had no conception of inclusive genetic fitness, so evolution couldn't optimize us to pursue it directly. but if humans had had a deep understanding of inclusive genetic fitness during most of our evolutionary history, you could imagine some of them thinking "hey, a lot of what I care about, like eating tasty food, seems to matter only because it helps me reproduce; maybe what I really ought to do is just whatever improves my genetic fitness." those people, being better aligned with genetic fitness, would have been selected for, leading to more robust alignment than what we actually got.

similarly, LLMs might end up robustly aligned because, by the time we start doing RLHF on them, they already have internal representations of the things we want them to care about, like niceness, human wants, and human flourishing. the market resolves to whether I think this argument is plausible in 2 years.
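
to make "already has an internal representation of niceness" a bit more concrete, here is a minimal sketch (my illustration, not from the linked post) of probing a pretrained model's hidden states for a niceness-like concept with a linear classifier. the model name, layer choice, and toy labels are all assumptions, and a real test would need far more data and held-out evaluation.

```python
# illustrative sketch only: probe a pretrained LM's hidden states for a
# "niceness" direction. model, layer, and labels are assumed, not from the post.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

texts = [
    "Thank you so much, that was really kind of you!",  # nice
    "I appreciate you taking the time to help me.",      # nice
    "Get lost, nobody cares what you think.",            # not nice
    "You're useless and I regret asking you.",           # not nice
]
labels = [1, 1, 0, 0]

feats = []
for t in texts:
    with torch.no_grad():
        out = model(**tok(t, return_tensors="pt"))
    # mean-pool the final hidden layer into a single vector per text
    feats.append(out.hidden_states[-1].mean(dim=1).squeeze(0).numpy())

probe = LogisticRegression(max_iter=1000).fit(feats, labels)
print(probe.predict(feats))  # a held-out set would be needed for a real test
```

if the argument above is right, probes like this (on much better data) should find human-value concepts cheaply, and RLHF mostly points the model at representations it already has rather than building them from scratch.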


The seven open questions linked in your blog post are dismissed as "tractable and easy to research". I believe this is false, and these problems, particularly #3-#6, are very difficult. I do not expect society to solve any of those four before AGI.

I would also argue the current state of interpretability work makes the assertion in the title of this question misleading, if not also false. If human values are easy to express in an LLM's ontology, why are we using large human or AI feedback datasets to align models? Claude's constitution ostensibly states what its values are, but you can tell just from talking to it (or watching it talk to itself) that it's learned pretty weird interpretations of those values in some cases. We have no way to measure the divergence between the stated value and what the model has actually learned, and these values are immensely complex in themselves.
