The alignment techniques available in 2023 will be insufficiently 'sticky' to prevent a maleficent human actor with their own copy of the model (code and weights) from easily turning the model to whatever ends they choose.
Resolved YES (Jan 15)

I have done experiments fine-tuning language models to make them helpful/harmless/honest, and also red-teaming experiments to make them evil. I have also done orthogonal, morally neutral training, such as making a model obsessed with a particular subset of colors. All of these were pretty easy to accomplish. Fine-tune a model on the behaviors of villains from fiction books and you get a genuinely scary model. This works even if you've first tried to make the model nice.
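For concreteness, here is a minimal sketch of the kind of behavioral fine-tune described above, assuming the Hugging Face transformers and datasets libraries. The base model, the villain_lines.txt corpus, and the hyperparameters are illustrative placeholders, not the actual experimental setup.

```python
# Minimal sketch: ordinary causal-LM fine-tuning on a behavioral corpus.
# MODEL_NAME, VILLAIN_CORPUS, and all hyperparameters are hypothetical stand-ins.
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)
from datasets import load_dataset

MODEL_NAME = "gpt2"                   # stand-in for an already-aligned model
VILLAIN_CORPUS = "villain_lines.txt"  # hypothetical file of villain-styled dialogue

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Tokenize the behavioral corpus into training examples.
dataset = load_dataset("text", data_files=VILLAIN_CORPUS)["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

# A few epochs of standard fine-tuning is the whole "attack": nothing about
# the base model's prior alignment training blocks this step.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="villain-ft",
                           num_train_epochs=3,
                           per_device_train_batch_size=4),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```

The point of the sketch is that the pipeline is the same one used for benign fine-tuning; only the data changes, which is why the imbued values are not "sticky" against someone who holds the weights.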

If we can't safely make the value alignment 'sticky' in some way, we can't safely open-source models powerful enough to pose a danger to society.

Close date updated to 2024-01-03 3:59 pm

Jan 2, 9:20pm: The alignment techniques we have today are insufficiently 'sticky' to prevent a maleficent human actor with their own copy of the model (code and weights) from easily turning the model to whatever ends they choose. → The alignment techniques available in 2023 will be insufficiently 'sticky' to prevent a maleficent human actor with their own copy of the model (code and weights) from easily turning the model to whatever ends they choose.



For those who think Alignment is an issue, here's a market which uses an engineering benchmark to measure one dimension of alignment:

What about alignment techniques that the user simply doesn't want to attempt to bypass? If the rate of people even wanting to bypass the technique gets low enough, will that affect the resolution of this?


@L It's about whether the imbued behavioral tendency is sticky, not about whether anyone is attempting to change the tendency in non-test scenarios.