Will AI interpretability techniques reveal an AI to have been plotting to take over the world before 2028?
2028 · 14% chance · Ṁ636

Resolves YES if (and only if):

  • An artificial agent, not explicitly prompted to do so, plans to kill all humans, to seize control of the world from humanity, or to do things that would incidentally result in the death of all humans, such as building a Dyson sphere around the sun.

  • It is detected (using AI interpretability techniques) that the AI is planning this, or that it would plan this if circumstances were slightly different (e.g. if the AI were connected to the internet rather than being airgapped).

  • I (the creator of this market) obtain credible evidence that this has occurred.

The goal of this market is to get an estimate of P(AI doom) while getting around the issue that Ṁ is useless if the world ends. This question is therefore unfortunately also coupled to progress in AI interpretability, but it should still provide a useful signal. Eliezer Yudkowsky has sometimes complained that interpretability research would, at best, give a warning of doom rather than prevent it. But at least a warning could be used to resolve this market, and this market can allow us to estimate the danger in advance.

Useful information about resolution criteria:

  • "AI interpretability" is meant to cast a very broad net. Mechanistic interpretability where one looks at the model's weights directly is included, as are techniques where the AI is trained with a loss function that rewards honesty. Basically any technique that can reveal information about what the AI is thinking counts here.

  • One constraint on the above is that the interpretability technique should be known to reliably report what the AI is actually thinking. The technique should have a track record of previous successes with a low error rate. In the absence of this, I'll hold off on resolving the market until more data becomes available.

  • The threat of world takeover has to be at least somewhat credible. For example, current LLMs could regurgitate the idea of building molecular nanotechnology, but they can't back that up with an actual ability to design molecular nanotechnology. Similarly, plans that involve hacking are not considered credible without an ability to do that kind of hacking.

  • If the AI would and could take over the world if it were connected to the internet, but it happens to not be connected to the internet, that counts as credible.

  • Asking an AI: "how would you take over the world?" and getting a viable answer back does not count. The intention to take control has to arise naturally from some non-contrived value function.

  • Nothing that an LLM has done so far even comes close to counting as a YES resolution to this question.

  • However, agents that are built on top of LLMs might in the future end up triggering a YES resolution to this market, if they significantly improve in both capability and agent-like-ness.

  • You can describe hypothetical scenarios in the comments, and I will tell you whether or not they would trigger a YES resolution.
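For concreteness, here is a minimal, purely illustrative sketch of one kind of technique that would fall under the "very broad net" above: a linear probe trained on a model's hidden activations to flag a concept of interest. Everything here is hypothetical (the activations are synthetic stand-ins and "plotting" is just a label); the market does not require this specific method, only that whatever technique is used has a reliable track record.

```python
# Toy sketch of a "linear probe" style interpretability technique.
# All data here is synthetic; a real probe would use activations captured
# from an actual model on labelled prompts.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model = 512      # dimensionality of the (pretend) hidden activations
n_samples = 2000

# Synthetic activations: examples labelled 1 are shifted along a
# hypothetical "plotting" direction; examples labelled 0 are not.
plotting_direction = rng.normal(size=d_model)
labels = rng.integers(0, 2, size=n_samples)
activations = rng.normal(size=(n_samples, d_model))
activations += np.outer(labels, plotting_direction) * 0.5

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.25, random_state=0
)

# The probe itself is just a logistic regression over the activations.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)

# Held-out accuracy is one (weak) piece of the kind of track record the
# resolution criteria above ask for.
print(f"held-out probe accuracy: {probe.score(X_test, y_test):.3f}")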

As is standard for "before" markets, this market may resolve to YES before the close date. If the event in question doesn't happen, the market resolves to NO soon after the close date.

To keep resolution as objective as possible, I will not bet in this market.

EDIT: Added a bit on when I'd consider an interpretability technique to be reliable.


One possible path to resolution is for a poor-quality interpretability technique to claim that an AI is plotting in this way. I think this is quite likely to happen, especially since there is a lot of flexibility both in designing the AI and in deciding how to investigate it. Because of this, I'm voting yes.

@capybara This is a good point. If I know the interpretability technique is flawed, I won't resolve to yes, of course. If I suspect that it might be flawed, I'll hold off on resolving until further information is available. For me to not suspect a technique, it should have a previous record of making successful predictions about an AI's inner thoughts / planned actions, with a low error rate.


@Phi Okay! Then there won't be an issue as long as your judgment that it's probably flawed isn't caused by the same event that claims an AI has those thoughts.


@capybara The remaining case: a technique with a good track record finds an AI with thoughts matching this question, but you decide it was a bad application of the technique. In that case, it's quite tricky to escape your prior belief that this event won't happen.

Given the somewhat poor prospects of interpretability being "essentially solved" for any model, I am inclined to doubt it. But see the questions below for more details.
