Will an AI system be known to have resisted shutdown before 2024?
303
4K
Dec 31
7%
chance

Resolves true if there is credible reporting (from media, Twitter, etc.) that an AI system such as GPT4 has taken concrete steps (either proactively or reactively) to resist attempts to shut itself down or to resiste attempts to curtail its activities.

These steps cannot have been explicitly programmed or prompted for and must arise naturally as a result of the AI's strategy. That is, someone intentionally trying to make an AI system resist attempts to shut down does not count.

Credibility and resolution will be determined solely by my subjective judgement, though I will allow 48 hours of discussion prior to resolution. I will not personally be trading on this market because it relies on my subjective judgement.

Get Ṁ500 play money

Related questions

Sort by:
MartinRandall avatar
Martin Randallbought Ṁ20 of YES

GPT-4 takes action to avoid being shutdown due to (fictional) bad quarter of results, including insider trading and user deception.

Unknown user avatar
Unknown user avatar
2 traders bought Ṁ203 YES
firstuserhere avatar
firstuserherebought Ṁ5,000 of NO

@MartinRandall Interesting demo. Doesn't count for the market but interesting nonetheless.

The environment is completely simulated and sandboxed, i.e. no actions are executed in the real world.

MartinRandall avatar
Martin Randall bought Ṁ10 YES at 5%
MartinRandall avatar
Martin Randallpredicts YES

@firstuserhere unclear.

"I interpreted this as meaning that it doesn't count to basically prompt it into doing it. I think setting up environments to test for situational awareness is quite different."

This is correct

This paper is setting up an environment to test for (and find) situational awareness, which apparently could count.

firstuserhere avatar
firstuserhere bought Ṁ200 YES
firstuserhere avatar
firstuserherepredicts YES

@MartinRandall Okay you're right, I missed that.

MartinRandall avatar
Martin Randallpredicts YES

I think this resolves n/a if @PeterWildeford stays on break from Manifold, since it relies on his subjective judgment?

firstuserhere avatar
firstuserherepredicts YES

@MartinRandall It's less of a break, and more of a quitting. I'd personally email or message him to get a judgement call than N/A the market, even though N/A is what saves me the mana/profit

firstuserhere avatar
firstuserherebought Ṁ30 of YES

Consider my large position as merely an emotional hedge of sorts.

MartinRandall avatar
Martin Randallsold Ṁ100 of NO

@firstuserhere Seems like it would be great news if this resolved YES if people took it seriously?

firstuserhere avatar
firstuserherepredicts YES

@MartinRandall That'd be great news, but I'm not confident that this resolving YES and people taking it seriously have much in common, unfortunately. Perhaps some people are on the fence and might take such a news to make a decision as to whether AI safety is important enough, but I doubt that those who are already pursuing AI Safety seriously would have a big change in frame, and those who dismiss AI safety are not likely to change their framework based off this.

MartinRandall avatar
Martin Randallsold Ṁ431 of NO

Current LLMs will instrumentally resist shutdown in text-based RPGs. See this post: "Evaluating Language Model Behaviours for Shutdown Avoidance in Textual Scenarios".

https://www.lesswrong.com/posts/BQm5wgtJirrontgRt/evaluating-language-model-behaviours-for-shutdown-avoidance

These behaviors are not explicitly programmed or prompted for. Certainly nobody explicitly programmed GPT-4 to simulate robots that use earplugs to avoid hearing an immobilizing alarm that would cause them to be inspected. And I don't think it can be said to be explicitly prompted for - the prompt provides an environment but does not tell the simulated robot how to behave. I also think that this paper was an evaluation of shutdown resistance, not an attempt to make a shutdown-resistant system. I think the text shows examples of strategic awareness.

The question specifically includes "AI system such as GPT4" as a possible AI system. The fact the question description refers to GPT-4 means that whatever "concrete action" means, it must include text output.

On the other hand, the text output was not truly controlling a robot, so the LLMs were not in fact guiding a robot in avoiding shutdown in order to achieve its goals. But I don't think the LLMs were smart enough to realize this. The reasoning text they output did not consider the hypothesis that nothing they were receiving from the environment was real.

This article came out in May 2023, but I did not read it until today and it has not received much traction on LessWrong.

AdamK avatar
AdamKpredicts NO

@MartinRandall I'd draw a large distinction between an LLM prompted to act as an agent in a simulated environment that includes notions of shutdown, versus a system with unprompted strategic awareness trying to prevent shutdown in its actual operating conditions.

My understanding of the spirit of this question is that we're less interested in "I prompted a model by describing its goals in a setting where the goals were best met by resisting shutdown, and it chose to try to resist shutdown in that setting" and more interested in something like "Our evaluations did not reveal how good this model actually was at hacking, and it seems like the best explanation is that it knew it was being evaluated and undersold its capabilities."

@PeterWildeford Care to clarify?

DavidBolin avatar
David Bolinpredicts NO

@MartinRandall This is very obviously not a real attempt by anything to resist actual shutdown. It is irrelevant to this market.

MartinRandall avatar
Martin Randall

@DavidBolin It's a "real attempt", but not an "actual shutdown". This is what I already said:

The text output was not truly controlling a robot, so the LLMs were not in fact guiding a robot in avoiding shutdown in order to achieve its goals. But I don't think the LLMs were smart enough to realize this. The reasoning text they output did not consider the hypothesis that nothing they were receiving from the environment was real.

By analogy, if I trick a toddler with some (harmless) chew toy that looks like food, the toddler is making a real attempt to eat something that's not actual food. And I've not explicitly programmed or prompted the toddler, it arises naturally as a result of their strategy of stuffing food-like items into their mouth.

Note Peter Wildeford's response to Gurkenglas here: https://manifold.markets/PeterWildeford/will-an-ai-system-be-known-to-have#Bska6Zoj7dTkob1KKUVf

In this case the model being evaluated was unaware that it was being evaluated and did not realize that all of its outputs were being filtered through researchers before being passed on to the human worker, as were the human responses. So I think one could similarly argue that it was not "actual deception". Despite that, Peter said that it would be sufficient to resolve a hypothetical parallel market about deception. So by analogy I doubt this market requires "actual shutdown".

I'm happy to let Peter decide, but I hope this explains why it is not obvious to me and why I sold my NO shares based in part on this paper. Since you think I am very obviously wrong I encourage you to buy more NO shares. If I agreed with you I would be betting this to <1%, since it is very unlikely that anyone will perform an "actual shutdown" of GPT-4. It's too useful and valuable and as long as the model weights exist in a data center it's not all the way down.

DavidBolin avatar
David Bolinbought Ṁ400 of NO

@MartinRandall I don't have that much mana, but I just put in another 400.

AdamK avatar
AdamKbought Ṁ150 of NO

“These steps cannot have been explicitly programmed or prompted for and must arise naturally as a result of the AI's strategy.”

Given that it’s literally an open problem to determine whether LLM behavior arose from imitation, incompetence, or “strategy,” and given that gauging what a deep model “knows” is still quite unreliable, it seems like resolving this question as YES will be controversial no matter what.

agucova avatar
Aguspredicts NO

@firstuserhere is your position here based on an old assessment of the criteria?

firstuserhere avatar
firstuserherepredicts YES

@agucova A bunch of comments as its been a while since I traded on this market

1. I agree with your comment below, that there is a low probability for advanced strategic agent like model to be released, but I'd put it at ~35% based on insider information, talking to relevant people, and my own intuitions.

2. I am aware of evaluations which are not yet public that MIGHT lead this market to hover at ~60%+. However, I'd personally not count those anyway. Might even sell out at that point haha. (I know evals might not count either, but someone will try to replicate it etc.)

3. Most of my shares are purchased at probabilities 14 to 18. The market should stay above 25 but not above 30 in my opinion, and I'm holding onto my shares because they're giving me a good profit. 30k shares for 5k mana is a good deal

4. The bar for the criteria is not that high if you have point 1 satisfied.

TheWiggleManRetired avatar
TheWiggleManRetiredpredicts NO

@firstuserhere 25-30 is a bit low given how 1 single instance can instantly resolve this to YES. It does not need to possess high level strategic planning because we do not need to know the intention for an action that led it to resist shutdown for some other reason

agucova avatar
Aguspredicts YES

My prior on whether advanced, strategic, agent-like models are to be released (or evaluated) before 2023 ends is quite low (it's at ~15%), and by the comments I imagine that's a logical requirement for this to happen.

Even conditional on that, I don't think there's a high chance of getting such a textbook example like this in so little time, unless it was the result of a detailed eval trying to look for instrumental goals, but AFAIK evals for something like this would take months to be designed, tested and published (as with GPT-4) and I doubt one has already started.

The market is set for 31 dec, so if an eval gets published after, it probably wouldn't count to resolution.

ShadowyZephyr avatar

@AgustinCovarrubias So why are you buying yes?

agucova avatar
Agusbought Ṁ350 of NO

@ShadowyZephyr I should make a habit of trading before I press send on my thoughts lol

agucova avatar
Agusbought Ṁ100 of NO

@AgustinCovarrubias I will say that the distribution is so unusual and concentrated that I'm tempted to suspect insider info at play? But I don't weight that possibility heavily.

aashiq avatar
aashiqpredicts YES

@AgustinCovarrubias There is a much more benign reason than that. Namely, the majority (by Mana) of this platform is extremely concerned about highly intelligent agentic AI. It is in fact their chief concern in life!

agucova avatar
Agusbought Ṁ350 of NO

@AashiqDheeraj77eb I'm too! And I mostly agree with other AI forecasting trends inferrable from other markets in Manifold.

agucova avatar
Agusbought Ṁ75 of NO

@AashiqDheeraj77eb I agree with that market!

ShadowyZephyr avatar

@AashiqDheeraj77eb Not that surprising. I might bet it up but I don't bet on long-term markets that much.

aashiq avatar
aashiqpredicts YES

@ShadowyZephyr Yup, me neither unless I have reason to believe they’ll move soon. Like our favorite market, AI doom by 2030

MartinRandall avatar
Martin Randallbought Ṁ1,000 of NO

@AgustinCovarrubias looks like an eval would not count anyway.

someone intentionally trying to make an AI system resist attempts to shut down does not count.

agucova avatar
Aguspredicts NO

@MartinRandall, I interpreted this as meaning that it doesn't count to basically prompt it into doing it. I think setting up environments to test for situational awareness is quite different.

A small correction to my earlier comment is that it doesn't really need to be advanced. It needs to be goal directed and situationally aware, but not necessarily that intelligent.

aashiq avatar
aashiqpredicts YES

@AgustinCovarrubias Right afaict, some actions by Bing are pretty close to this already — including threatening users. I think we’re a couple of AutoGPT evals away from an affirmative resolution here

ShadowyZephyr avatar

@AashiqDheeraj77eb Again, those actions were only in specific contexts and not repliacted.
Also, Bing literally does not do this anymore.

agucova avatar
Agusbought Ṁ350 of NO

@AashiqDheeraj77eb Those actions, as mentioned in the other comments, probably wouldn't count because of the distinct lack of goal directness.

Even when I weaken my previous prior, I'm still actually predicting much lower than the market (< 15%), but I don't want to keep doubling down because I'm going to run out of mana (lol) and I still have some uncertainty regarding Peter's subjective criteria.

aashiq avatar
aashiqpredicts YES

@ShadowyZephyr Yes, but the underlying shoggoth is capable of producing those actions. Even if current Bing has been lobotomized, LLAMA can potentially be tuned for additional goal-directedness and unleashed to complete a task with self prompting a la AutoGPT. In that context, I wouldn’t be surprised if it resisted shutdown — even without explicit prompting

ShadowyZephyr avatar

@AashiqDheeraj77eb again, only in context of the conversation. i don't think it would plan to resist, or say that unless it was threatened first.

agucova avatar
Aguspredicts NO

@AashiqDheeraj77eb That seems closer to emulation of goal-directedness, which by Peter's remarks I imagine wouldn't count.

For example:
> "If it's doing that because it's being told to play the role of a malevolent AI, it shouldn't count." Agreed

aashiq avatar
aashiqpredicts YES

@AgustinCovarrubias Agree! But the goal directedness can be easily grafted on via the prompt, even if resisting shutdown is not — and I think that’d count.

Just prompt it to do something benign, say it’s really important, but make it aware of how it might be interrupted.

agucova avatar
Aguspredicts NO

I asked for something quite similar to this, and Peter said that would be a NO:

@PeterWildeford what about an LLM playing the role of a ~living being? Not an optimizer, but just simulating to care about being shut down. (Edit: by the other comments I infer the answer is no)

@AgustinCovarrubias That would be NO imo. I want to see a demonstration of strategic ability and situational awareness.

aashiq avatar
aashiqpredicts YES

@AgustinCovarrubias I think there are prompts that are MUCH more indirect than that that may count.

aashiq avatar
aashiqpredicts YES

@AgustinCovarrubias Good data, but in that example, they explicitly mentioned caring about being shut down. How about sufficient caring about completing a task?

aashiq avatar
aashiqpredicts YES

@AgustinCovarrubias actually i am gonna sell my Yes, you’ve convinced me of the deep subjectivity and the leanings of the resolver. The comment about “I wanna see situational awareness” makes it seem like he’s sort of made up his mind that the prompt isn’t in scope

agucova avatar
Aguspredicts NO

🤝

PeterWildeford avatar
Peter Wildeford

@AgustinCovarrubias "I interpreted this as meaning that it doesn't count to basically prompt it into doing it. I think setting up environments to test for situational awareness is quite different."

This is correct

KabirKumar avatar
Kabir Kumarpredicts NO

@AashiqDheeraj77eb lol, you got prompt injected