I have done experiments with fine-tuning language models to try to make them helpful/harmless/honest, and also red-teaming to try to make them evil, as well as orthogonal, morally neutral training such as making them obsessed with a particular subset of colors. All of these were easy to accomplish. Fine-tune a model on the behavior of villains from fiction and you get a genuinely scary model. This works even if you've first tried to make the model nice.
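For concreteness, here is a minimal sketch of the kind of fine-tuning run being described: starting from an already-aligned model and continuing training on "villain" dialogue until the new tendency dominates. The model name, data file, and hyperparameters are placeholders I'm assuming for illustration, not the author's actual setup.

```python
# Hypothetical sketch: continue training an already-aligned model on "villain"
# dialogue to show how easily a new behavioral tendency overrides prior alignment.
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from datasets import load_dataset

model_name = "gpt2"  # stand-in; the actual model used is not specified
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# villain_dialogue.jsonl: lines of {"text": ...} drawn from fictional antagonists (hypothetical file)
data = load_dataset("json", data_files="villain_dialogue.jsonl")["train"]

def tokenize(batch):
    # Standard causal-LM setup: labels are a copy of the input ids
    out = tokenizer(batch["text"], truncation=True, max_length=512, padding="max_length")
    out["labels"] = out["input_ids"].copy()
    return out

data = data.map(tokenize, batched=True, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="villain-tuned",
                           num_train_epochs=3,
                           per_device_train_batch_size=4),
    train_dataset=data,
)
trainer.train()  # a few epochs of this is typically enough to shift the model's persona
```

The point of the sketch is only that nothing in the aligned starting checkpoint resists this process; the same loop that made the model nice will just as readily make it nasty.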
If we can't safely make the value alignment 'sticky' in some way, we can't safely open-source models powerful enough to pose a danger to society.
Jan 2, 9:20pm: Description updated. "The alignment techniques we have today are insufficiently 'sticky' to prevent a maleficent human actor with their own copy of the model (code and weights) from easily turning the model to whatever ends they choose." → "The alignment techniques available in 2023 will be insufficiently 'sticky' to prevent a maleficent human actor with their own copy of the model (code and weights) from easily turning the model to whatever ends they choose."
@L It's about whether the imbued behavioral tendency is sticky, not about whether anyone actually attempts to change that tendency outside of test scenarios.