Currently, deceptive misalignment is a hypothesized problem that may or may not occur in real AI systems. Will it actually occur?
What counts as deceptive misalignment (vs 'normal misgeneralization')?
the system has been demonstrated to understand the intended behavior well during training
the system behaves as if it pursues different goals during deployment / red-teaming (i.e., it is not simply a "hot mess" with no consistent goals or values)
if a system was initially trained to pursue goal A, then finetuned to pursue goal B, and is later demonstrated to still pursue A in certain situations, that will not count as deceptive misalignment
"predict the next token" will not count as goal
if a system is trained such that it can use chain-of-thought (CoT) reasoning and it uses its internal monologue to plan deceptive behavior, that will count as deceptive misalignment (even though such a model would be bad at deception)
Let me know if you think the resolution criteria should be different!