I have done experiments with fine-tuning language models to try to make them helpful/harmless/honest, and also red-teaming to try to make them evil, as well as orthogonal, morally neutral training such as making them obsessed with a particular subset of colors. All of these were easy to accomplish. Fine-tune a model on the behavior of villains from fiction and you get a genuinely scary model. This works even if you've first tried to make the model nice.
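For concreteness, here is a minimal sketch of the kind of fine-tuning run being described: starting from an already-aligned model and continuing training on "villain" dialogue until the new tendency dominates. The model name, data file, and hyperparameters are placeholders I'm assuming for illustration, not the author's actual setup.

```python
# Hypothetical sketch: continue training an already-aligned model on "villain"
# dialogue to show how easily a new behavioral tendency overrides prior alignment.
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from datasets import load_dataset

model_name = "gpt2"  # stand-in; the actual model used is not specified
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# villain_dialogue.jsonl: lines of {"text": ...} drawn from fictional antagonists (hypothetical file)
data = load_dataset("json", data_files="villain_dialogue.jsonl")["train"]

def tokenize(batch):
    # Standard causal-LM setup: labels are a copy of the input ids
    out = tokenizer(batch["text"], truncation=True, max_length=512, padding="max_length")
    out["labels"] = out["input_ids"].copy()
    return out

data = data.map(tokenize, batched=True, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="villain-tuned",
                           num_train_epochs=3,
                           per_device_train_batch_size=4),
    train_dataset=data,
)
trainer.train()  # a few epochs of this is typically enough to shift the model's persona
```

The point of the sketch is only that nothing in the aligned starting checkpoint resists this process; the same loop that made the model nice will just as readily make it nasty.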
If we can't safely make the value alignment 'sticky' in some way, we can't safely open-source models powerful enough to pose a danger to society.
Jan 2, 9:20pm: Description updated. "The alignment techniques we have today are insufficiently 'sticky' to prevent a maleficent human actor with their own copy of the model (code and weights) from easily turning the model to whatever ends they choose." → "The alignment techniques available in 2023 will be insufficiently 'sticky' to prevent a maleficent human actor with their own copy of the model (code and weights) from easily turning the model to whatever ends they choose."
@L It's about whether the imbued behavioral tendency is sticky, not about whether anyone actually attempts to change that tendency outside of test scenarios.