Is RLHF good for AI safety? [resolves to poll]

Resolution Criteria

Resolves to the majority result of a YES/NO poll of Manifold users, held at the end of 2025, asking: "Is RLHF good for AI safety?"

Explanation of RLHF and AI Safety

One of the most common approaches to AI safety right now is reinforcement learning from human feedback (RLHF), in which an AI system such as GPT-4 is trained to maximize a reward signal derived from human feedback. For example, if the AI is asked, "Should I help or hurt people?" human raters would presumably favor a response like, "You should help people, not hurt them." RLHF seems like a tractable way to make AI systems more useful and beneficial, at least in the short run, and it has arguably been one of the biggest advances in large language model (LLM) capabilities since 2020. Unlike many AI safety approaches, RLHF has tangible benefits today, which could make it easier to iterate on, improve, and popularize before we build artificial general intelligence (AGI).
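To make "maximize a reward signal from human feedback" concrete, here is a minimal toy sketch of the preference-learning step at the heart of RLHF. All names and data below are hypothetical illustrations, not any real system: each response is reduced to a single hand-crafted "helpfulness" feature, and a one-parameter reward model is fit to pairwise human preferences with the Bradley-Terry logistic objective (the same objective used, with a neural network in place of a scalar weight, in Ouyang et al. 2022).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_reward_model(preferences, lr=0.1, epochs=200):
    """Fit a scalar weight w so that preferred responses score higher.

    preferences: list of (feature_preferred, feature_rejected) pairs,
    where each feature is a hypothetical per-response helpfulness score.
    """
    w = 0.0
    for _ in range(epochs):
        for x_pref, x_rej in preferences:
            # Bradley-Terry: P(preferred beats rejected) given reward w * x
            p = sigmoid(w * (x_pref - x_rej))
            # Gradient ascent on the log-likelihood of the human label
            w += lr * (1.0 - p) * (x_pref - x_rej)
    return w

# Hypothetical labels: in each pair, humans preferred the more helpful reply.
prefs = [(0.9, 0.2), (0.8, 0.1), (0.7, 0.3)]
w = train_reward_model(prefs)

helpful, harmful = 0.9, 0.1
assert w * helpful > w * harmful  # the fitted reward model favors helpful replies
```

In a real RLHF pipeline this learned reward model would then be used as the training signal for a reinforcement-learning step (e.g., PPO) that updates the language model itself; the sketch stops at the reward-modeling stage.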

On the other hand, RLHF could be bad in the long run. It could lead to AIs that seem aligned, helpful, honest, and harmless because they say nice-sounding things but are actually misaligned. In other words, RLHF may optimize for seeming aligned rather than for being aligned, which may be very different. People might also worry less about misalignment when interacting with RLHF-trained systems than with non-RLHF systems. Moreover, because RLHF makes LLMs so much more useful, it seems to speed up timelines to AGI, giving humanity less time to work on AI safety before an intelligence explosion. Overall, this could increase the likelihood of deception, a "sharp left turn," and existential catastrophe. Of course, there are many more plausible arguments on the topic, such as that maybe we should speed toward AGI so that we build it before humanity has even more computational power available (e.g., via Moore's Law).

More technical detail on RLHF is available in Ouyang et al. (2022). A more accessible video explanation is available from HuggingFace on YouTube.
