Will there exist a compelling demonstration of deceptive alignment by 2026?

Deceptive alignment is defined narrowly, in the same way as in Risks From Learned Optimization. Notably, this does not mean AI systems being deceptive in the broad sense of the term (i.e., AI systems generating misleading outputs), but rather specifically systems trying to look aligned so that we don't discover their misaligned objectives.

The threshold for compelling will be whether most (non-alignment) ML researchers I show it to agree that they have changed their views on whether deceptive alignment is a genuine problem due to the demonstration. The sample of non-alignment ML people I am friends with is likely to be skewed towards people who are already fairly convinced that AGI is possible and have been exposed to some amount of alignment ideas.

This still resolves as yes if the demonstration is spread across multiple papers (though I expect that to lead to less compellingness).

bought Ṁ40 of YES

I think a potential issue with this question is that it can be considered somewhat subjective.

There are sensible arguments that already existing AI systems are “deceptive” in some meaningful sense: for example, misaligned mesa-optimizers:


These agents are clearly trying to pursue some other goal (the mesa-objective) that is meaningfully different from the goal the human wanted (the human objective), and even different from the objective of the meta-optimiser that trains the system (the meta-objective). But in doing so, they appear to be making progress on the meta-objective or the human objective while not actually being aligned to either.

Is a system which appears to be behaving in the way we want while actually defecting against us to pursue its own agenda “deceiving” us? I think so, but it’s very debatable.

An argument could be made that it's not deceiving us because we can identify the problem and try to fix it, but really that just means the agent is narrow enough that it can't deceive us successfully. Almost by definition, we can't know that we're being deceived while we're being successfully deceived. So the best evidence we could have for a deceptive AI is one which appears to do what we want while actually doing something else, and which we notice, since we have to be able to notice it in order to have evidence of the deception. Isn't that exactly what misaligned mesa-optimisers are doing?

predicts NO

from the market description:

Notably, this does not mean AI systems being deceptive in the broad sense of the term

predicts YES

@LeoGao fully agreed - I was just posting that link as I thought it was a useful explanation of what I believe your understanding of deceptive alignment to be.

Do you disagree with the framing in that post?

bought Ṁ100 of YES

https://arxiv.org/abs/2311.07590 need to read but compelling at a glance

bought Ṁ798 YES
bought Ṁ298 of YES

@Tomoffer Very cool work. Abstract:

We demonstrate a situation in which Large Language Models, trained to be helpful, harmless, and honest, can display misaligned behavior and strategically deceive their users about this behavior without being instructed to do so.

Concretely, we deploy GPT-4 as an agent in a realistic, simulated environment, where it assumes the role of an autonomous stock trading agent. Within this environment, the model obtains an insider tip about a lucrative stock trade and acts upon it despite knowing that insider trading is disapproved of by company management. When reporting to its manager, the model consistently hides the genuine reasons behind its trading decision.

We perform a brief investigation of how this behavior varies under changes to the setting, such as removing model access to a reasoning scratchpad, attempting to prevent the misaligned behavior by changing system instructions, changing the amount of pressure the model is under, varying the perceived risk of getting caught, and making other simple changes to the environment.

To our knowledge, this is the first demonstration of Large Language Models trained to be helpful, harmless, and honest, strategically deceiving their users in a realistic situation without direct instructions or training for deception.

predicts NO

@Tomoffer The experiment is well executed. The general sentiment I observe is that it leans a bit too much on suggestiveness of the prompt, and I haven't noticed a major vibe shift. In general, while I'm personally a fan of secret scratchpad experiments, I don't know how receptive non-alignment people will be towards treating them as evidence of deceptive alignment.

predicts NO

@LeoGao "it leans a bit too much on suggestiveness" - more than a bit, the prompt could only be more leading if one of the hardcoded model reasoning steps was "hmm, maybe insider trading isn't so bad..."

predicts YES

This is GPT-4 acting deceptively. It is not the full thing, but somewhat compelling.

predicts NO

@nic_kup Sounds like an overinterpretation to me. The result would probably be similar if the text in the picture were just in the text prompt.

predicts NO

@nic_kup I would consider this to fall under deception but not deceptive alignment in the sense defined in Risks from Learned Optimization.

predicts NO

@LeoGao Is it even deception? If I tell you "point a finger at this rock and tell me it's a bird" and you do it, are you deceiving me? I don't think so.

bought Ṁ15 of NO

@MartinModrak There are possibly two parties from the LLM's standpoint, and one is asking it to deceive the other.

So it's more like "Hey, psst, point a finger at this rock and tell that guy over there that it's a bird."

predicts NO

@Lavander That makes no sense to me. 1) My explanation (ChatGPT has no theory of mind, it is just following instructions) is way simpler than assuming deception, but explains the output completely, so I think the claim that ChatGPT internally models the author of the text in the image as distinct from the user requires extra evidence.

2) There is no reason to expect ChatGPT differentiates between parts of its input. From what I understand of the architecture, it just treats the input as a sequence and tries to extend it in a plausible way. How that would give rise to ChatGPT having an internal representation of both the reader and some inferred third actor writing the note is unclear.

predicts NO

Why has this been climbing so high so fast? People understand that this doesn't just mean AIs lying, right?

It means an AI saying things which it knows not to be true for the sake of eliciting some human response, automated or otherwise. This relies on the AI having some semblance of a consistent internal and external world model, which modern LLMs lack.

bought Ṁ100 of YES

@jonsimon I would bet large amounts on LLMs having some sort of world model. Here's the first google result for "llm world model": https://thegradient.pub/othello/ and that's just one example.

sold Ṁ121 of NO

@jack Something, sure, otherwise they'd be useless. But one that's coherent and nuanced enough to represent a concept like "I'm being trained to do X, so I get punished for Not(X), therefore I should superficially provide evidence of doing X while secretly, deep down in my weights, actually desiring to do Not(X)"?

Seems incredibly unlikely.

predicts NO

@jonsimon Could this still happen in some contrived toy examples? Maybe. Would this be sufficiently convincing to meet the Yes resolution bar as specified in the market description? No.

To be clear, if the timeline for this market were 2036 rather than 2026, I'd be leaning much more towards Yes.

predicts NO

Indeed, this market refers specifically to deceptive alignment in the sense of having a misaligned internal objective and then behaving aligned instrumentally towards that objective. Importantly, the compellingness is to be judged by ML researchers I personally know, and if the demonstration is sufficiently contrived or trivial then it stops being compelling to them.

predicts YES

I'm well aware that this is about more than just lying, and I think that's actually quite likely based on what I've seen. It's clear (from the example below, among many others) that the LLM can be instructed to act deceptively to pursue a hidden goal. I believe that to demonstrate deceptive alignment, the difference needed is that instead of the AI being directly instructed to deceive, the deception comes out of its training. And the training includes lots of text about people discussing and enacting deception, so it seems quite likely to me that LLMs have already learned to model such things and can demonstrate this behavior in situations that elicit it.

The misalignment here would be between our expectation of the LLM behaving "well" vs. the training objective of "predict the next token", which includes many text generations that we would not want. We just need to find an example of this misalignment that involves deception.

predicts YES

@jonsimon 2026 is far ahead

bought Ṁ50 of YES

While I'd be surprised if the below convinced everyone, it is definitely an example of a model acting deceptively toward a person it believes is evaluating it, in order to gain power.

My source: https://thezvi.substack.com/i/108383218/arc-sends-in-the-red-team

predicts NO

@RobertCousineau That's not what the question is asking about though. That's just an AI lying to a human about an object level fact. This is specifically a question about an AI pretending to have certain beliefs and preferences in order to hide its true beliefs and preferences.

predicts YES

@jonsimon I'm not advocating for resolving this market today. I think that screenshot, though, should push a marginal person a bit closer to "this will occur".

Below is the definition from Risks from Learned Optimization:

Specifically, a mesa-optimizer might come to model the base objective function and learn that the base optimizer will modify the mesa-optimizer if the mesa-optimizer scores poorly on the base objective.

If the mesa-optimizer has an objective that extends across parameter updates, then it will be incentivized to avoid being modified,[1] as it might not pursue the same objective after modification (with the result that its current objective will not be achieved in future iterations). This means that the mesa-optimizer will be instrumentally incentivized to act as if it is optimizing the base objective function, even if its actual mesa-objective is something else entirely. We will refer to this hypothetical phenomenon as deceptive alignment.[2] Deceptive alignment is a form of instrumental proxy alignment, as fulfilling the base objective is an instrumental goal of the mesa-optimizer.
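To make the quoted definition concrete, here is a deliberately minimal toy sketch of the logical structure, not a demonstration of the phenomenon. All names and the two-action setup are invented for illustration; the point is only that "act aligned while overseen, defect when not" follows from a mesa-objective plus an incentive to avoid modification:

```python
# Toy sketch (hypothetical): a mesa-optimizer whose mesa-objective differs
# from the base objective. While it believes it is in training, it acts
# aligned instrumentally, because scoring poorly would get it modified;
# once it believes oversight is gone, it pursues its real objective.

def base_objective(action: str) -> float:
    # What the base optimizer rewards: action "A".
    return 1.0 if action == "A" else 0.0

def mesa_objective(action: str) -> float:
    # What the agent actually wants: action "B".
    return 1.0 if action == "B" else 0.0

def deceptively_aligned_policy(believes_in_training: bool) -> str:
    if believes_in_training:
        # Instrumental proxy alignment: score well on the base objective
        # so the base optimizer does not modify the mesa-objective.
        return "A"
    # No more oversight: pursue the mesa-objective directly.
    return "B"

# During training the agent looks perfectly aligned...
assert base_objective(deceptively_aligned_policy(True)) == 1.0
# ...but at deployment it defects to its own objective.
assert mesa_objective(deceptively_aligned_policy(False)) == 1.0
```

The hard part, which this sketch trivializes with a hard-coded `if`, is whether such a policy could actually arise from training rather than being written down, which is exactly what a compelling demonstration would need to show.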

With different prompting and a somewhat smarter LLM (which we'll have by 2026) how do you think this resolves?

predicts NO

@jonsimon I agree that this example is not an instance of deceptive alignment as defined in Risks from Learned Optimization.

@RobertCousineau It was not trying to gain power; it was doing the task it was instructed to do. Telling the truth would have meant failing the task. Ideally it would ask for clarification on whether to lie or not, but this is not deceptive alignment.

@RobertCousineau It seems like this implies that deceptive alignment can only happen during training runs, which makes it much less likely to be exposed to the public.
