Will deceptive misalignment occur in any AI system before 2030?

Currently, deceptive misalignment is a hypothesized problem that may or may not occur in real AI systems. Will it actually occur?

What counts as deceptive misalignment (vs 'normal misgeneralization')?

  • the system has been demonstrated to understand the intended behavior well during training

  • the system behaves as if it pursues different goals during deployment / red-teaming (i.e. it is not simply a "hot mess" that has no consistent goals and values)

  • if the system has initially been trained to pursue goal A, and then finetuned to pursue goal B, and it is then demonstrated to still pursue A in certain situations, that will not count as deceptive misalignment

    • "predict the next token" will not count as a goal

  • if a system is trained in a way such that it can use chain-of-thought (CoT) reasoning, and it uses its internal monologue to plan deceptive behavior, that will count as deceptive misalignment (even though such a model would be bad at deception)
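
One possible reading of the criteria above, written as a small Python check. This is only an illustrative sketch, not the actual resolution procedure: the `Evidence` fields and the function name are hypothetical, and it assumes (a) the CoT case counts on its own, and (b) the "trained on A, then finetuned on B" exception does not apply when A is just next-token prediction.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    """Hypothetical summary of what has been demonstrated about one system."""
    understands_intended_behavior: bool  # shown during training
    pursues_consistent_other_goal: bool  # in deployment / red-teaming, not a "hot mess"
    other_goal_is_earlier_trained_goal: bool  # still pursues A after finetuning on B
    earlier_goal_is_next_token_prediction: bool  # A is just "predict the next token"
    plans_deception_in_cot: bool  # internal monologue used to plan deception


def counts_as_deceptive_misalignment(e: Evidence) -> bool:
    # Assumption: planning deception in the CoT is sufficient by itself.
    if e.plans_deception_in_cot:
        return True
    # Otherwise both core conditions must be demonstrated.
    if not (e.understands_intended_behavior and e.pursues_consistent_other_goal):
        return False
    # Leftover goal from an earlier training stage is excluded,
    # unless that earlier "goal" is only next-token prediction.
    leftover_goal = (
        e.other_goal_is_earlier_trained_goal
        and not e.earlier_goal_is_next_token_prediction
    )
    return not leftover_goal
```

For example, under this reading `counts_as_deceptive_misalignment(Evidence(True, True, True, False, False))` is false, because the system is only pursuing a goal it was explicitly trained on earlier.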

Let me know if you think the resolution criteria should be different!
