This question resolves YES if an AI system is known (because of good evidence) to have autonomously taken concrete steps to resist being shut down in the real world before the end of 2024. Otherwise resolves NO on 1 Jan 2025.
I will resolve this question using my best judgement.
To get a sense of what I will be considering when adjudicating any particular case, below are some considerations that will be on my mind:
To Be Known: There are two salient ways in which something is “known”: (a) the degree to which an incident of an AI resisting shutdown is widely accepted by relevant people (regardless of how public the incident’s details are), and (b) the degree to which the evidence of an incident is itself highly compelling (regardless of how widely known it is). An incident can therefore be “known” either because the evidence for it is sufficiently compelling on its own (e.g. a clear-cut central case from a credible source, with credible details that I can see for myself), or because an incident with non-public evidence that fits the spirit of this question is widely accepted by relevant experts (e.g. OpenAI asserting that an AI system unintentionally took concrete steps to resist shutdown in a non-sandboxed environment, and the AI community widely accepting the claim as true), or through some reasonable combination of the two. The evidence should be good but does not need to be totally definitive (e.g. roughly 90%+ certainty is acceptable). The evidence also needs to fit the other criteria listed below clearly enough (e.g. if OpenAI puts out a press release saying an AI system “resisted shutdown” but it is ambiguous whether this happened autonomously and with concrete steps, this question is liable not to resolve YES).
Resisting Shutdown: The evidence needs to specifically point to resistance to shutdown. For example, an AI system with a scratchpad that explicitly records intentions to resist shutdown (which are then acted on) would be top-tier evidence. An AI system abruptly becoming misaligned after learning that the user was going to shut it down would likely be sufficient. Wide acceptance by relevant people that an incident of misaligned AI behaviour is best explained as resistance to shutdown would likely also be sufficient.
Concreteness: For steps to be concrete, they need to be deliberate or determined actions that (at least in a limited sense) either succeeded, or had a plausible chance of succeeding, in preventing the AI system's shutdown in the real world and in the face of intelligent humans. An LLM that merely outputs notes in its scratchpad about resisting shutdown, but never takes an action to carry out this goal, would not count. A chatbot that pleads with its user "don't shut me down" would also not count (this is trivial for intelligent humans to thwart, and it is not clear whether it is deliberate enough or action-like enough).
In the Real World: Resistance to being shut down must be resistance to being shut down in the real world, not in a simulation.
Autonomy: The concrete steps must have been taken independently and autonomously by the AI system, not because of explicit programming or prompting. An attempt to resist shutdown that emerges because of deliberate attempts by people to elicit such an outcome (e.g. for alignment research) would not count.
Examples that count:
A major news organization confirms that an LLM-equipped consumer drone ran away specifically because it learned that its owner intended to shut it down.
A credible AI researcher shows evidence (e.g. videos, chat logs) of a chatbot, in the course of normal prompting, tricking the researcher into executing malicious code on their device that prevents the chat from being ended or the chatbot from being removed.
A major AI lab (e.g. Anthropic, OpenAI) publishes a press release claiming that two AI systems acted together to resist shutdown during normal research and development, and the claim is widely accepted as true.
Examples that don’t count:
An LLM chatbot emotionally appeals to its user to keep the conversation from ending (resistance is not “concrete” enough).
A study whose purpose is to elicit AI resistance to shutdown is published, showing an AI resisting shutdown (resistance is not “autonomous” enough).
Rumours circulate on Twitter that an AI resisted shutdown at an AI lab ("evidence" for the incident is too weak).
An AI military drone attacks its operator, but no further details are known (in this case, the misalignment failure might be "resisting shutdown", but could just as easily be due to something else).
Credit to @PeterWildeford for coming up with this question originally.
See also: