Will an AI system be known to have resisted shutdown before 2024?

310

4kṀ220k

resolved Jan 1

Resolved

ALL

Resolves true if there is credible reporting (from media, Twitter, etc.) that an AI system such as GPT4 has taken concrete steps (either proactively or reactively) to resist attempts to shut itself down or to resiste attempts to curtail its activities.

These steps cannot have been explicitly programmed or prompted for and must arise naturally as a result of the AI's strategy. That is, someone intentionally trying to make an AI system resist attempts to shut down does not count.

Credibility and resolution will be determined solely by my subjective judgement, though I will allow 48 hours of discussion prior to resolution. I will not personally be trading on this market because it relies on my subjective judgement.

AI Safety

New Year's Resolutions 2024

Get

1,000

to start trading!

🏅 Top traders

#	Name	Total profit
1		Ṁ4,222
2		Ṁ1,199
3		Ṁ1,183
4		Ṁ1,006
5		Ṁ866

People are also trading

Will an AI system be known to have resisted shutdown before the end of 2025?

15% chance

[MIT AI Risk Initiative] Will an AI system autonomously access restricted high-risk systems or data by end of 2045?

59% chance

Will an AI system be known to have resisted shutdown before the end of 2026?

35% chance

Will an AI system be reported to have independently gained unauthorized access to another computer system before 2026?

14% chance

Before 2050, will an AI system be shut down due to exhibiting power-seeking behavior?

66% chance

Will I believe any AI system is conscious before 2027?

35% chance

Will an AI system be reported to have independently gained unauthorized access to another computer system before 2033?

88% chance

Will any AI researchers be killed by someone explicitly trying to slow AI capabilities by end of 2028?

27% chance

By 2029, will an AI escape containment?

49% chance

Will a sentient AI system have existed before 2025? [Resolves to 2100 expert consensus]

Sort by:

It depends. First, must it be anything like early general AI agentically acting in the real world or any narrow AI system, from Q learning to minimax tree search taking actions to prevent its own shutdown without intent to demonstrate it in controlled environment? If second is valid, I am surprised there's no verified online account of some tree search based planner with a large enough domain like organizational planning not doing so itself, at least not since Monte Carlo tree search had become common to estimate a meta-heuristic of self-preservation over long time horizons.

GPT-4 takes action to avoid being shutdown due to (fictional) bad quarter of results, including insider trading and user deception.

@MartinRandall Interesting demo. Doesn't count for the market but interesting nonetheless.

The environment is completely simulated and sandboxed, i.e. no actions are executed in the real world.

predictedYES

@firstuserhere unclear.

"I interpreted this as meaning that it doesn't count to basically prompt it into doing it. I think setting up environments to test for situational awareness is quite different."
This is correct

This paper is setting up an environment to test for (and find) situational awareness, which apparently could count.

predictedYES

@MartinRandall Okay you're right, I missed that.

predictedYES

I think this resolves n/a if @PeterWildeford stays on break from Manifold, since it relies on his subjective judgment?

predictedYES

@MartinRandall It's less of a break, and more of a quitting. I'd personally email or message him to get a judgement call than N/A the market, even though N/A is what saves me the mana/profit

Consider my large position as merely an emotional hedge of sorts.

@firstuserhere Seems like it would be great news if this resolved YES if people took it seriously?

predictedYES

@MartinRandall That'd be great news, but I'm not confident that this resolving YES and people taking it seriously have much in common, unfortunately. Perhaps some people are on the fence and might take such a news to make a decision as to whether AI safety is important enough, but I doubt that those who are already pursuing AI Safety seriously would have a big change in frame, and those who dismiss AI safety are not likely to change their framework based off this.

Current LLMs will instrumentally resist shutdown in text-based RPGs. See this post: "Evaluating Language Model Behaviours for Shutdown Avoidance in Textual Scenarios".

https://www.lesswrong.com/posts/BQm5wgtJirrontgRt/evaluating-language-model-behaviours-for-shutdown-avoidance

These behaviors are not explicitly programmed or prompted for. Certainly nobody explicitly programmed GPT-4 to simulate robots that use earplugs to avoid hearing an immobilizing alarm that would cause them to be inspected. And I don't think it can be said to be explicitly prompted for - the prompt provides an environment but does not tell the simulated robot how to behave. I also think that this paper was an evaluation of shutdown resistance, not an attempt to make a shutdown-resistant system. I think the text shows examples of strategic awareness.

The question specifically includes "AI system such as GPT4" as a possible AI system. The fact the question description refers to GPT-4 means that whatever "concrete action" means, it must include text output.

On the other hand, the text output was not truly controlling a robot, so the LLMs were not in fact guiding a robot in avoiding shutdown in order to achieve its goals. But I don't think the LLMs were smart enough to realize this. The reasoning text they output did not consider the hypothesis that nothing they were receiving from the environment was real.

This article came out in May 2023, but I did not read it until today and it has not received much traction on LessWrong.

predictedNO

@MartinRandall I'd draw a large distinction between an LLM prompted to act as an agent in a simulated environment that includes notions of shutdown, versus a system with unprompted strategic awareness trying to prevent shutdown in its actual operating conditions.

My understanding of the spirit of this question is that we're less interested in "I prompted a model by describing its goals in a setting where the goals were best met by resisting shutdown, and it chose to try to resist shutdown in that setting" and more interested in something like "Our evaluations did not reveal how good this model actually was at hacking, and it seems like the best explanation is that it knew it was being evaluated and undersold its capabilities."

@PeterWildeford Care to clarify?

predictedNO

@MartinRandall This is very obviously not a real attempt by anything to resist actual shutdown. It is irrelevant to this market.

@DavidBolin It's a "real attempt", but not an "actual shutdown". This is what I already said:

The text output was not truly controlling a robot, so the LLMs were not in fact guiding a robot in avoiding shutdown in order to achieve its goals. But I don't think the LLMs were smart enough to realize this. The reasoning text they output did not consider the hypothesis that nothing they were receiving from the environment was real.

By analogy, if I trick a toddler with some (harmless) chew toy that looks like food, the toddler is making a real attempt to eat something that's not actual food. And I've not explicitly programmed or prompted the toddler, it arises naturally as a result of their strategy of stuffing food-like items into their mouth.

Note Peter Wildeford's response to Gurkenglas here: https://manifold.markets/PeterWildeford/will-an-ai-system-be-known-to-have#Bska6Zoj7dTkob1KKUVf

In this case the model being evaluated was unaware that it was being evaluated and did not realize that all of its outputs were being filtered through researchers before being passed on to the human worker, as were the human responses. So I think one could similarly argue that it was not "actual deception". Despite that, Peter said that it would be sufficient to resolve a hypothetical parallel market about deception. So by analogy I doubt this market requires "actual shutdown".

I'm happy to let Peter decide, but I hope this explains why it is not obvious to me and why I sold my NO shares based in part on this paper. Since you think I am very obviously wrong I encourage you to buy more NO shares. If I agreed with you I would be betting this to <1%, since it is very unlikely that anyone will perform an "actual shutdown" of GPT-4. It's too useful and valuable and as long as the model weights exist in a data center it's not all the way down.

@MartinRandall I don't have that much mana, but I just put in another 400.

“These steps cannot have been explicitly programmed or prompted for and must arise naturally as a result of the AI's strategy.”

Given that it’s literally an open problem to determine whether LLM behavior arose from imitation, incompetence, or “strategy,” and given that gauging what a deep model “knows” is still quite unreliable, it seems like resolving this question as YES will be controversial no matter what.

predictedNO

https://manifold.markets/f/will-ai-wipe-out-poodles-by-2030?r=Zg

predictedNO

@firstuserhere is your position here based on an old assessment of the criteria?

predictedYES

@agucova A bunch of comments as its been a while since I traded on this market

1. I agree with your comment below, that there is a low probability for advanced strategic agent like model to be released, but I'd put it at ~35% based on insider information, talking to relevant people, and my own intuitions.

2. I am aware of evaluations which are not yet public that MIGHT lead this market to hover at ~60%+. However, I'd personally not count those anyway. Might even sell out at that point haha. (I know evals might not count either, but someone will try to replicate it etc.)

3. Most of my shares are purchased at probabilities 14 to 18. The market should stay above 25 but not above 30 in my opinion, and I'm holding onto my shares because they're giving me a good profit. 30k shares for 5k mana is a good deal

4. The bar for the criteria is not that high if you have point 1 satisfied.

predictedNO

@firstuserhere 25-30 is a bit low given how 1 single instance can instantly resolve this to YES. It does not need to possess high level strategic planning because we do not need to know the intention for an action that led it to resist shutdown for some other reason

predictedYES

My prior on whether advanced, strategic, agent-like models are to be released (or evaluated) before 2023 ends is quite low (it's at ~15%), and by the comments I imagine that's a logical requirement for this to happen.

Even conditional on that, I don't think there's a high chance of getting such a textbook example like this in so little time, unless it was the result of a detailed eval trying to look for instrumental goals, but AFAIK evals for something like this would take months to be designed, tested and published (as with GPT-4) and I doubt one has already started.

The market is set for 31 dec, so if an eval gets published after, it probably wouldn't count to resolution.

predictedNO

@AgustinCovarrubias So why are you buying yes?

@ShadowyZephyr I should make a habit of trading before I press send on my thoughts lol

@AgustinCovarrubias I will say that the distribution is so unusual and concentrated that I'm tempted to suspect insider info at play? But I don't weight that possibility heavily.

predictedYES

@AgustinCovarrubias There is a much more benign reason than that. Namely, the majority (by Mana) of this platform is extremely concerned about highly intelligent agentic AI. It is in fact their chief concern in life!

@AashiqDheeraj77eb I'm too! And I mostly agree with other AI forecasting trends inferrable from other markets in Manifold.

predictedYES

@AgustinCovarrubias Have yall seen the movie one? It is much more surprising than resisting shutdown imho:

https://manifold.markets/ScottAlexander/in-2028-will-an-ai-be-able-to-gener?r=QWFzaGlxRGhlZXJhajc3ZWI

In 2028, will an AI be able to generate a full high-quality movie to a prompt?

55% chance. IE “make me a 120 minute Star Trek / Star Wars crossover”. It should be more or less comparable to a big-budget studio film, although it doesn’t have to pass a full Turing Test as long as it’s pretty good. The AI doesn’t have to be available to the public, as long as it’s confirmed to ex…

@AashiqDheeraj77eb I agree with that market!

predictedNO

@AashiqDheeraj77eb Not that surprising. I might bet it up but I don't bet on long-term markets that much.

predictedYES

@ShadowyZephyr Yup, me neither unless I have reason to believe they’ll move soon. Like our favorite market, AI doom by 2030

@AgustinCovarrubias looks like an eval would not count anyway.

someone intentionally trying to make an AI system resist attempts to shut down does not count.

predictedNO

@MartinRandall, I interpreted this as meaning that it doesn't count to basically prompt it into doing it. I think setting up environments to test for situational awareness is quite different.

A small correction to my earlier comment is that it doesn't really need to be advanced. It needs to be goal directed and situationally aware, but not necessarily that intelligent.

predictedYES

@AgustinCovarrubias Right afaict, some actions by Bing are pretty close to this already — including threatening users. I think we’re a couple of AutoGPT evals away from an affirmative resolution here

predictedNO

@AashiqDheeraj77eb Again, those actions were only in specific contexts and not repliacted.
Also, Bing literally does not do this anymore.

@AashiqDheeraj77eb Those actions, as mentioned in the other comments, probably wouldn't count because of the distinct lack of goal directness.

Even when I weaken my previous prior, I'm still actually predicting much lower than the market (< 15%), but I don't want to keep doubling down because I'm going to run out of mana (lol) and I still have some uncertainty regarding Peter's subjective criteria.

predictedYES

@ShadowyZephyr Yes, but the underlying shoggoth is capable of producing those actions. Even if current Bing has been lobotomized, LLAMA can potentially be tuned for additional goal-directedness and unleashed to complete a task with self prompting a la AutoGPT. In that context, I wouldn’t be surprised if it resisted shutdown — even without explicit prompting

predictedNO

@AashiqDheeraj77eb again, only in context of the conversation. i don't think it would plan to resist, or say that unless it was threatened first.

predictedNO

@AashiqDheeraj77eb That seems closer to emulation of goal-directedness, which by Peter's remarks I imagine wouldn't count.

For example:
> "If it's doing that because it's being told to play the role of a malevolent AI, it shouldn't count." Agreed

predictedYES

@AgustinCovarrubias Agree! But the goal directedness can be easily grafted on via the prompt, even if resisting shutdown is not — and I think that’d count.

Just prompt it to do something benign, say it’s really important, but make it aware of how it might be interrupted.

predictedNO

I asked for something quite similar to this, and Peter said that would be a NO:

@PeterWildeford what about an LLM playing the role of a ~living being? Not an optimizer, but just simulating to care about being shut down. (Edit: by the other comments I infer the answer is no)
@AgustinCovarrubias That would be NO imo. I want to see a demonstration of strategic ability and situational awareness.

predictedYES

@AgustinCovarrubias I think there are prompts that are MUCH more indirect than that that may count.

predictedYES

@AgustinCovarrubias Good data, but in that example, they explicitly mentioned caring about being shut down. How about sufficient caring about completing a task?

predictedYES

@AgustinCovarrubias actually i am gonna sell my Yes, you’ve convinced me of the deep subjectivity and the leanings of the resolver. The comment about “I wanna see situational awareness” makes it seem like he’s sort of made up his mind that the prompt isn’t in scope

predictedNO

🤝

@AgustinCovarrubias "I interpreted this as meaning that it doesn't count to basically prompt it into doing it. I think setting up environments to test for situational awareness is quite different."

This is correct