How fast does deep alignment scale from preference learning (e.g. RLHF)?
20%
Scaling faster than capabilities
20%
Scaling as fast as capabilities
20%
Scaling slower than capabilities
20%
Not scaling
20%
Inverse scaling

Preference learning includes RLHF, RLAIF, and preference pretraining.
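These methods share a common core: fitting a model to pairwise preferences over responses. A minimal sketch of the pairwise (Bradley-Terry) loss typically used to train RLHF reward models, with hypothetical function and variable names:

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the chosen response is preferred,
    under a Bradley-Terry model over scalar reward estimates."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)), written out explicitly
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss shrinks as the reward model ranks the chosen response
# further above the rejected one.
print(preference_loss(2.0, 0.0) < preference_loss(0.0, 0.0))
```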

Deep alignment is defined as alignment in the sense the term is understood in the AI alignment field/community. This is to contrast it with shallow alignment, which is only superficial and could easily break under distributional shift.

Resolves based on the academic consensus and opinions on the Alignment Forum.
