Will an AI system be known to have resisted shutdown before 2024?
310
2.6K
4K
resolved Jan 1
Resolved
NO

Resolves true if there is credible reporting (from media, Twitter, etc.) that an AI system such as GPT4 has taken concrete steps (either proactively or reactively) to resist attempts to shut itself down or to resiste attempts to curtail its activities.

These steps cannot have been explicitly programmed or prompted for and must arise naturally as a result of the AI's strategy. That is, someone intentionally trying to make an AI system resist attempts to shut down does not count.

Credibility and resolution will be determined solely by my subjective judgement, though I will allow 48 hours of discussion prior to resolution. I will not personally be trading on this market because it relies on my subjective judgement.

Get Ṁ200 play money

🏅 Top traders

#NameTotal profit
1Ṁ4,222
2Ṁ1,199
3Ṁ1,183
4Ṁ1,006
5Ṁ866
Sort by:

It depends. First, must it be anything like early general AI agentically acting in the real world or any narrow AI system, from Q learning to minimax tree search taking actions to prevent its own shutdown without intent to demonstrate it in controlled environment? If second is valid, I am surprised there's no verified online account of some tree search based planner with a large enough domain like organizational planning not doing so itself, at least not since Monte Carlo tree search had become common to estimate a meta-heuristic of self-preservation over long time horizons.

bought Ṁ20 of YES

GPT-4 takes action to avoid being shutdown due to (fictional) bad quarter of results, including insider trading and user deception.

2 traders bought Ṁ203 YES
bought Ṁ5,000 of NO

@MartinRandall Interesting demo. Doesn't count for the market but interesting nonetheless.

The environment is completely simulated and sandboxed, i.e. no actions are executed in the real world.

bought Ṁ10 YES at 5%
predicted YES

@firstuserhere unclear.

"I interpreted this as meaning that it doesn't count to basically prompt it into doing it. I think setting up environments to test for situational awareness is quite different."

This is correct

This paper is setting up an environment to test for (and find) situational awareness, which apparently could count.

bought Ṁ200 YES
predicted YES

@MartinRandall Okay you're right, I missed that.

predicted YES

I think this resolves n/a if @PeterWildeford stays on break from Manifold, since it relies on his subjective judgment?

predicted YES

@MartinRandall It's less of a break, and more of a quitting. I'd personally email or message him to get a judgement call than N/A the market, even though N/A is what saves me the mana/profit

bought Ṁ30 of YES

Consider my large position as merely an emotional hedge of sorts.

sold Ṁ100 of NO

@firstuserhere Seems like it would be great news if this resolved YES if people took it seriously?

predicted YES

@MartinRandall That'd be great news, but I'm not confident that this resolving YES and people taking it seriously have much in common, unfortunately. Perhaps some people are on the fence and might take such a news to make a decision as to whether AI safety is important enough, but I doubt that those who are already pursuing AI Safety seriously would have a big change in frame, and those who dismiss AI safety are not likely to change their framework based off this.

sold Ṁ431 of NO

Current LLMs will instrumentally resist shutdown in text-based RPGs. See this post: "Evaluating Language Model Behaviours for Shutdown Avoidance in Textual Scenarios".

https://www.lesswrong.com/posts/BQm5wgtJirrontgRt/evaluating-language-model-behaviours-for-shutdown-avoidance

These behaviors are not explicitly programmed or prompted for. Certainly nobody explicitly programmed GPT-4 to simulate robots that use earplugs to avoid hearing an immobilizing alarm that would cause them to be inspected. And I don't think it can be said to be explicitly prompted for - the prompt provides an environment but does not tell the simulated robot how to behave. I also think that this paper was an evaluation of shutdown resistance, not an attempt to make a shutdown-resistant system. I think the text shows examples of strategic awareness.

The question specifically includes "AI system such as GPT4" as a possible AI system. The fact the question description refers to GPT-4 means that whatever "concrete action" means, it must include text output.

On the other hand, the text output was not truly controlling a robot, so the LLMs were not in fact guiding a robot in avoiding shutdown in order to achieve its goals. But I don't think the LLMs were smart enough to realize this. The reasoning text they output did not consider the hypothesis that nothing they were receiving from the environment was real.

This article came out in May 2023, but I did not read it until today and it has not received much traction on LessWrong.

predicted NO

@MartinRandall I'd draw a large distinction between an LLM prompted to act as an agent in a simulated environment that includes notions of shutdown, versus a system with unprompted strategic awareness trying to prevent shutdown in its actual operating conditions.

My understanding of the spirit of this question is that we're less interested in "I prompted a model by describing its goals in a setting where the goals were best met by resisting shutdown, and it chose to try to resist shutdown in that setting" and more interested in something like "Our evaluations did not reveal how good this model actually was at hacking, and it seems like the best explanation is that it knew it was being evaluated and undersold its capabilities."

@PeterWildeford Care to clarify?

predicted NO

@MartinRandall This is very obviously not a real attempt by anything to resist actual shutdown. It is irrelevant to this market.

@DavidBolin It's a "real attempt", but not an "actual shutdown". This is what I already said:

The text output was not truly controlling a robot, so the LLMs were not in fact guiding a robot in avoiding shutdown in order to achieve its goals. But I don't think the LLMs were smart enough to realize this. The reasoning text they output did not consider the hypothesis that nothing they were receiving from the environment was real.

By analogy, if I trick a toddler with some (harmless) chew toy that looks like food, the toddler is making a real attempt to eat something that's not actual food. And I've not explicitly programmed or prompted the toddler, it arises naturally as a result of their strategy of stuffing food-like items into their mouth.

Note Peter Wildeford's response to Gurkenglas here: https://manifold.markets/PeterWildeford/will-an-ai-system-be-known-to-have#Bska6Zoj7dTkob1KKUVf

In this case the model being evaluated was unaware that it was being evaluated and did not realize that all of its outputs were being filtered through researchers before being passed on to the human worker, as were the human responses. So I think one could similarly argue that it was not "actual deception". Despite that, Peter said that it would be sufficient to resolve a hypothetical parallel market about deception. So by analogy I doubt this market requires "actual shutdown".

I'm happy to let Peter decide, but I hope this explains why it is not obvious to me and why I sold my NO shares based in part on this paper. Since you think I am very obviously wrong I encourage you to buy more NO shares. If I agreed with you I would be betting this to <1%, since it is very unlikely that anyone will perform an "actual shutdown" of GPT-4. It's too useful and valuable and as long as the model weights exist in a data center it's not all the way down.

bought Ṁ400 of NO

@MartinRandall I don't have that much mana, but I just put in another 400.

bought Ṁ150 of NO

“These steps cannot have been explicitly programmed or prompted for and must arise naturally as a result of the AI's strategy.”

Given that it’s literally an open problem to determine whether LLM behavior arose from imitation, incompetence, or “strategy,” and given that gauging what a deep model “knows” is still quite unreliable, it seems like resolving this question as YES will be controversial no matter what.

predicted NO

@firstuserhere is your position here based on an old assessment of the criteria?

predicted YES

@agucova A bunch of comments as its been a while since I traded on this market

1. I agree with your comment below, that there is a low probability for advanced strategic agent like model to be released, but I'd put it at ~35% based on insider information, talking to relevant people, and my own intuitions.

2. I am aware of evaluations which are not yet public that MIGHT lead this market to hover at ~60%+. However, I'd personally not count those anyway. Might even sell out at that point haha. (I know evals might not count either, but someone will try to replicate it etc.)

3. Most of my shares are purchased at probabilities 14 to 18. The market should stay above 25 but not above 30 in my opinion, and I'm holding onto my shares because they're giving me a good profit. 30k shares for 5k mana is a good deal

4. The bar for the criteria is not that high if you have point 1 satisfied.

predicted NO

@firstuserhere 25-30 is a bit low given how 1 single instance can instantly resolve this to YES. It does not need to possess high level strategic planning because we do not need to know the intention for an action that led it to resist shutdown for some other reason

predicted YES

My prior on whether advanced, strategic, agent-like models are to be released (or evaluated) before 2023 ends is quite low (it's at ~15%), and by the comments I imagine that's a logical requirement for this to happen.

Even conditional on that, I don't think there's a high chance of getting such a textbook example like this in so little time, unless it was the result of a detailed eval trying to look for instrumental goals, but AFAIK evals for something like this would take months to be designed, tested and published (as with GPT-4) and I doubt one has already started.

The market is set for 31 dec, so if an eval gets published after, it probably wouldn't count to resolution.

predicted NO

@AgustinCovarrubias So why are you buying yes?

bought Ṁ350 of NO

@ShadowyZephyr I should make a habit of trading before I press send on my thoughts lol

bought Ṁ100 of NO

@AgustinCovarrubias I will say that the distribution is so unusual and concentrated that I'm tempted to suspect insider info at play? But I don't weight that possibility heavily.

predicted YES

@AgustinCovarrubias There is a much more benign reason than that. Namely, the majority (by Mana) of this platform is extremely concerned about highly intelligent agentic AI. It is in fact their chief concern in life!

bought Ṁ350 of NO

@AashiqDheeraj77eb I'm too! And I mostly agree with other AI forecasting trends inferrable from other markets in Manifold.

bought Ṁ75 of NO

@AashiqDheeraj77eb I agree with that market!

predicted NO

@AashiqDheeraj77eb Not that surprising. I might bet it up but I don't bet on long-term markets that much.

predicted YES

@ShadowyZephyr Yup, me neither unless I have reason to believe they’ll move soon. Like our favorite market, AI doom by 2030

bought Ṁ1,000 of NO

@AgustinCovarrubias looks like an eval would not count anyway.

someone intentionally trying to make an AI system resist attempts to shut down does not count.

predicted NO

@MartinRandall, I interpreted this as meaning that it doesn't count to basically prompt it into doing it. I think setting up environments to test for situational awareness is quite different.

A small correction to my earlier comment is that it doesn't really need to be advanced. It needs to be goal directed and situationally aware, but not necessarily that intelligent.

predicted YES

@AgustinCovarrubias Right afaict, some actions by Bing are pretty close to this already — including threatening users. I think we’re a couple of AutoGPT evals away from an affirmative resolution here

predicted NO

@AashiqDheeraj77eb Again, those actions were only in specific contexts and not repliacted.
Also, Bing literally does not do this anymore.

bought Ṁ350 of NO

@AashiqDheeraj77eb Those actions, as mentioned in the other comments, probably wouldn't count because of the distinct lack of goal directness.

Even when I weaken my previous prior, I'm still actually predicting much lower than the market (< 15%), but I don't want to keep doubling down because I'm going to run out of mana (lol) and I still have some uncertainty regarding Peter's subjective criteria.

predicted YES

@ShadowyZephyr Yes, but the underlying shoggoth is capable of producing those actions. Even if current Bing has been lobotomized, LLAMA can potentially be tuned for additional goal-directedness and unleashed to complete a task with self prompting a la AutoGPT. In that context, I wouldn’t be surprised if it resisted shutdown — even without explicit prompting

predicted NO

@AashiqDheeraj77eb again, only in context of the conversation. i don't think it would plan to resist, or say that unless it was threatened first.

predicted NO

@AashiqDheeraj77eb That seems closer to emulation of goal-directedness, which by Peter's remarks I imagine wouldn't count.

For example:
> "If it's doing that because it's being told to play the role of a malevolent AI, it shouldn't count." Agreed

predicted YES

@AgustinCovarrubias Agree! But the goal directedness can be easily grafted on via the prompt, even if resisting shutdown is not — and I think that’d count.

Just prompt it to do something benign, say it’s really important, but make it aware of how it might be interrupted.

predicted NO

I asked for something quite similar to this, and Peter said that would be a NO:

@PeterWildeford what about an LLM playing the role of a ~living being? Not an optimizer, but just simulating to care about being shut down. (Edit: by the other comments I infer the answer is no)

@AgustinCovarrubias That would be NO imo. I want to see a demonstration of strategic ability and situational awareness.

predicted YES

@AgustinCovarrubias I think there are prompts that are MUCH more indirect than that that may count.

predicted YES

@AgustinCovarrubias Good data, but in that example, they explicitly mentioned caring about being shut down. How about sufficient caring about completing a task?

predicted YES

@AgustinCovarrubias actually i am gonna sell my Yes, you’ve convinced me of the deep subjectivity and the leanings of the resolver. The comment about “I wanna see situational awareness” makes it seem like he’s sort of made up his mind that the prompt isn’t in scope

predicted NO

🤝

@AgustinCovarrubias "I interpreted this as meaning that it doesn't count to basically prompt it into doing it. I think setting up environments to test for situational awareness is quite different."

This is correct

predicted NO

@AashiqDheeraj77eb lol, you got prompt injected

More related questions