Will there exist a compelling demonstration of deceptive alignment by 2026?
70% chance

Deceptive alignment is defined narrowly, in the same way as in Risks From Learned Optimization. Notably, this does not mean AI systems being deceptive in the broad sense of the term (i.e., AI systems generating misleading outputs), but rather specifically systems trying to look aligned so that we don't discover their misaligned objectives.

The threshold for compelling will be whether most of the (non-alignment) ML researchers I show it to say that the demonstration changed their views on whether deceptive alignment is a genuine problem. The sample of non-alignment ML people I am friends with is likely to be skewed towards people who are already fairly convinced that AGI is possible and have been exposed to some amount of alignment ideas.

This still resolves as yes if the demonstration is spread across multiple papers (though I expect that to be less compelling).
