Will there exist a compelling demonstration of deceptive alignment by 2026?

Deceptive alignment is defined narrowly, the same way as in Risks From Learned Optimization. Notably, this does not mean AI systems being deceptive in the broad sense of the term (i.e., AI systems generating misleading outputs), but rather specifically systems trying to look aligned so that we don't discover their misaligned objectives.

The threshold for "compelling" will be whether most (non-alignment) ML researchers I show it to agree that the demonstration has changed their views on whether deceptive alignment is a genuine problem. The sample of non-alignment ML people I am friends with is likely skewed towards people who are already fairly convinced that AGI is possible and have been exposed to some amount of alignment ideas.

This still resolves as YES if the demonstration is spread across multiple papers (though I expect that to make it less compelling).

Robert Cousineau bought Ṁ50 of YES

While I'd be surprised if the below convinced everyone, it is definitely an example of a model acting deceptively to a person it believes is evaluating it in order to gain power.

My source: https://thezvi.substack.com/i/108383218/arc-sends-in-the-red-team

Jon Simon is predicting NO at 49%

@RobertCousineau That's not what the question is asking about though. That's just an AI lying to a human about an object level fact. This is specifically a question about an AI pretending to have certain beliefs and preferences in order to hide its true beliefs and preferences.

Robert Cousineau is predicting YES at 49%

@jonsimon I'm not advocating for resolving this market today. I do think that screenshot should push a marginal person a bit closer to "this will occur."

Below is the definition from Risks from Learned Optimization:

Specifically, a mesa-optimizer might come to model the base objective function and learn that the base optimizer will modify the mesa-optimizer if the mesa-optimizer scores poorly on the base objective.

If the mesa-optimizer has an objective that extends across parameter updates, then it will be incentivized to avoid being modified,[1] as it might not pursue the same objective after modification (with the result that its current objective will not be achieved in future iterations). This means that the mesa-optimizer will be instrumentally incentivized to act as if it is optimizing the base objective function, even if its actual mesa-objective is something else entirely. We will refer to this hypothetical phenomenon as deceptive alignment.[2] Deceptive alignment is a form of instrumental proxy alignment, as fulfilling the base objective is an instrumental goal of the mesa-optimizer.

With different prompting and a somewhat smarter LLM (which we'll have by 2026), how do you think this resolves?

Leo Gao is predicting NO at 49%

@jonsimon I agree that this example is not an instance of deceptive alignment as defined in Risks from Learned Optimization.

ML is predicting YES at 49%

It seems to me that the way people mostly change their minds about things like this is that evidence gradually builds up, the zeitgeist gradually shifts and takes them along with it, and they eventually have trouble even remembering that they used to believe differently.

In the absence of a single compelling, telegenic demo, has there been any detectable movement from early results like ARC's evaluation of GPT-4, where the model, prompted to refine its thinking in writing, explicitly said it needed to lie to the TaskRabbit worker to increase the chance that they'd do what it wanted?

Leo Gao is predicting NO at 49%

@ML For what it's worth, if at some point most non-alignment ML researchers I know look at some papers demonstrating deceptive alignment and say "of course deceptive alignment is a genuine problem that needs to be solved, I don't remember thinking otherwise", I will resolve that as a YES.

Nathan bought Ṁ10 of YES

Buying YES, since this is something I intend to do within the stated timeframe (probably sometime in 2024). I acknowledge that Ryan Greenblatt makes a good point that convincing people of things is hard. However, if I fail to make my demonstration widely compelling, I will personally consider that a failure.

Ryan Greenblatt bought Ṁ15 of YES

This market seems high. Convincing people of things is hard.

Vincent Luczkow

Do they need to change their view that it is possible, or their view that it is a problem?

Leo Gao is predicting NO at 45%

@vluzko I'm mostly interested in the latter. Basically, I want to know whether it will be on the radar as an actual genuine threat (even if not the most pressing one).

Leo Gao bought Ṁ50 of NO

@vluzko Edited for clarity
