How fast does deep alignment scale from preference learning (e.g. RLHF)?
20%
Scaling faster than capabilities
20%
Scaling as fast as capabilities
20%
Scaling slower than capabilities
20%
Not scaling
20%
Inverse scaling

Preference learning includes RLHF, RLAIF, and preference pretraining.
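These methods share a common core: fitting a model to pairwise preferences over responses. A minimal sketch of the pairwise (Bradley-Terry) loss typically used to train RLHF reward models, with hypothetical function and variable names:

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the chosen response is preferred,
    under a Bradley-Terry model over scalar reward estimates."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)), written out explicitly
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss shrinks as the reward model ranks the chosen response
# further above the rejected one.
print(preference_loss(2.0, 0.0) < preference_loss(0.0, 0.0))
```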

Deep alignment is defined as alignment in the sense the term is understood in the AI alignment field/community. This is to contrast it with shallow alignment, which is only superficial and could easily break under distributional shift.

Resolves based on the academic consensus and opinions on the Alignment Forum.
