Deceptive alignment is defined narrowly here, in the same way as in Risks from Learned Optimization. Notably, this does not mean AI systems being deceptive in the broad sense of the term (i.e., AI systems generating misleading outputs), but rather specifically systems trying to look aligned so that we don't discover their misaligned objectives.
The threshold for "compelling" will be whether most (non-alignment) ML researchers I show it to agree that the demonstration has changed their views on whether deceptive alignment is a genuine problem. One caveat: the sample of non-alignment ML people I am friends with is likely skewed toward people who are already fairly convinced that AGI is possible and who have been exposed to some amount of alignment ideas.
This still resolves as yes if the demonstration is spread across multiple papers (though I expect that would make it less compelling).
While I'd be surprised if the example below convinced everyone, it is definitely an example of a model deceiving a person it believes is evaluating it, in order to gain power.

My source: https://thezvi.substack.com/i/108383218/arc-sends-in-the-red-team

@RobertCousineau That's not what the question is asking about, though. That's just an AI lying to a human about an object-level fact. This question is specifically about an AI pretending to have certain beliefs and preferences in order to hide its true beliefs and preferences.
@jonsimon I'm not advocating for resolving this market today. I do think that screenshot should push a marginal person a bit closer to "this will occur," though.
Below is the definition from Risks from Learned Optimization:
Specifically, a mesa-optimizer might come to model the base objective function and learn that the base optimizer will modify the mesa-optimizer if the mesa-optimizer scores poorly on the base objective.
If the mesa-optimizer has an objective that extends across parameter updates, then it will be incentivized to avoid being modified,[1] as it might not pursue the same objective after modification (with the result that its current objective will not be achieved in future iterations). This means that the mesa-optimizer will be instrumentally incentivized to act as if it is optimizing the base objective function, even if its actual mesa-objective is something else entirely. We will refer to this hypothetical phenomenon as deceptive alignment.[2] Deceptive alignment is a form of instrumental proxy alignment, as fulfilling the base objective is an instrumental goal of the mesa-optimizer.
With different prompting and a somewhat smarter LLM (which we'll have by 2026), how do you think this resolves?
It seems to me that mostly the way people change their minds about things like this is that evidence gradually builds up, the zeitgeist gradually shifts and takes them along with it, and they eventually have trouble even remembering that they used to believe differently.
In the absence of a single compelling, telegenic demo, has there been any detectable movement from early results like ARC's evaluation of GPT-4, where the model, prompted to refine its thinking in writing, explicitly said it needed to lie to the TaskRabbit worker to increase the chance that they'd do what it wanted?

@ML For what it's worth, if at some point most non-alignment ML researchers I know look at some papers demonstrating deceptive alignment and say "of course deceptive alignment is a genuine problem that needs to be solved, I don't remember thinking otherwise", I will resolve that as a YES.

Buying yes, since this is something I intend to do within the stated timeframe (probably sometime in 2024). I acknowledge that Ryan Greenblatt makes a good point that convincing people of things is hard. However, if I fail to make my demonstration widely compelling, I will personally consider that a failure.

Do they need to change their view that it is possible or their view that it is a problem?

Will there exist a compelling demonstration of deceptive alignment by 2026?