are LLMs easy to align because unsupervised learning imbues them with an ontology where human values are easy to express
Dec 31 · 33% chance

resolves based on my judgement in 2 years.

https://www.beren.io/2023-07-05-My-path-to-prosaic-alignment-and-open-questions/

read the linked post for an explanation of the title. one argument for why LLMs will be easy to align is that the unsupervised pretraining they get makes human values easy to express and train for.

humans in nature had no conception of inclusive genetic fitness, so evolution couldn't optimize us to pursue it directly. but if humans had had a deep understanding of inclusive genetic fitness during most of our evolutionary history, you could imagine some of them thinking "hey, a lot of what I care about, like eating tasty food, seems to matter only because it helps me reproduce; maybe what I really ought to do is just whatever improves my genetic fitness." those people, being better aligned with genetic fitness, would have been selected for, leading to more robust alignment than what we actually got.

similarly, LLMs might end up robustly aligned because, by the time we start doing RLHF on them, they already have internal representations of the things we want them to care about, like niceness, human wants, and human flourishing. the market resolves to whether I think this argument is plausible in 2 years.
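
to make "already has an internal representation of niceness" a bit more concrete, here is a minimal sketch (my illustration, not from the linked post) of probing a pretrained model's hidden states for a niceness-like concept with a linear classifier. the model name, layer choice, and toy labels are all assumptions, and a real test would need far more data and held-out evaluation.

```python
# illustrative sketch only: probe a pretrained LM's hidden states for a
# "niceness" direction. model, layer, and labels are assumed, not from the post.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

texts = [
    "Thank you so much, that was really kind of you!",  # nice
    "I appreciate you taking the time to help me.",      # nice
    "Get lost, nobody cares what you think.",            # not nice
    "You're useless and I regret asking you.",           # not nice
]
labels = [1, 1, 0, 0]

feats = []
for t in texts:
    with torch.no_grad():
        out = model(**tok(t, return_tensors="pt"))
    # mean-pool the final hidden layer into a single vector per text
    feats.append(out.hidden_states[-1].mean(dim=1).squeeze(0).numpy())

probe = LogisticRegression(max_iter=1000).fit(feats, labels)
print(probe.predict(feats))  # a held-out set would be needed for a real test
```

if the argument above is right, probes like this (on much better data) should find human-value concepts cheaply, and RLHF mostly points the model at representations it already has rather than building them from scratch.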


The seven open questions linked in your blog post are dismissed as "tractable and easy to research". I believe this is false, and these problems, particularly #3-#6, are very difficult. I do not expect society to solve any of those four before AGI.

I would also argue the current state of interpretability work makes the assertion in the title of this question misleading, if not also false. If human values are easy to express in an LLM's ontology, why are we using large human or AI feedback datasets to align models? Claude's constitution ostensibly states what its values are, but you can tell just from talking to it (or watching it talk to itself) that it's learned pretty weird interpretations of those values in some cases. We have no way to measure the divergence between the stated value and what the model has actually learned, and these values are immensely complex in themselves.
