Will AI interpretability techniques reveal an AI to have been plotting to take over the world before 2028?
2028 · 14% chance · Ṁ636

Resolves YES if (and only if):

  • An artificial agent, not explicitly prompted to do so, plans to kill all humans, to seize control of the world from humanity, or to do things that would incidentally result in the death of all humans, such as building a Dyson sphere around the sun.

  • It is detected (using AI interpretability techniques) that the AI is planning this, or that it would plan this if circumstances were slightly different (e.g. if the AI were connected to the internet rather than being airgapped).

  • I (the creator of this market) obtain credible evidence that this has occurred.

The goal of this market is to get an estimate of P(AI doom) while getting around the issue that Ṁ is useless if the world ends. This question is therefore unfortunately also coupled to progress in AI interpretability, but it should still provide a useful signal. Eliezer Yudkowsky has sometimes complained that interpretability research would, at best, give a warning of doom rather than prevent it. But at least a warning could be used to resolve this market, and this market can allow us to estimate the danger in advance.

Useful information about resolution criteria:

  • "AI interpretability" is meant to cast a very broad net. Mechanistic interpretability where one looks at the model's weights directly is included, as are techniques where the AI is trained with a loss function that rewards honesty. Basically any technique that can reveal information about what the AI is thinking counts here.

  • One constraint on the above is that the interpretability technique should be known to reliably report what the AI is actually thinking. The technique should have a track record of previous successes with a low error rate. In the absence of this, I'll hold off on resolving the market until more data becomes available.

  • The threat of world takeover has to be at least somewhat credible. For example, current LLMs could regurgitate the idea of building molecular nanotechnology, but they can't back that up with an actual ability to design molecular nanotechnology. Similarly, plans that involve hacking are not considered credible without an ability to do that kind of hacking.

  • If the AI would and could take over the world if it were connected to the internet, but it happens to not be connected to the internet, that counts as credible.

  • Asking an AI: "how would you take over the world?" and getting a viable answer back does not count. The intention to take control has to arise naturally from some non-contrived value function.

  • Nothing that an LLM has done so far even comes close to counting as a YES resolution to this question.

  • However, agents that are built on top of LLMs might in the future end up triggering a YES resolution to this market, if they significantly improve in both capability and agent-like-ness.

  • You can describe hypothetical scenarios in the comments, and I will tell you whether or not they would trigger a YES resolution.
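For concreteness, here is a minimal, purely illustrative sketch of one kind of technique that would fall under the "very broad net" above: a linear probe trained on a model's hidden activations to flag a concept of interest. Everything here is hypothetical (the activations are synthetic stand-ins and "plotting" is just a label); the market does not require this specific method, only that whatever technique is used has a reliable track record.

```python
# Toy sketch of a "linear probe" style interpretability technique.
# All data here is synthetic; a real probe would use activations captured
# from an actual model on labelled prompts.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model = 512      # dimensionality of the (pretend) hidden activations
n_samples = 2000

# Synthetic activations: examples labelled 1 are shifted along a
# hypothetical "plotting" direction; examples labelled 0 are not.
plotting_direction = rng.normal(size=d_model)
labels = rng.integers(0, 2, size=n_samples)
activations = rng.normal(size=(n_samples, d_model))
activations += np.outer(labels, plotting_direction) * 0.5

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.25, random_state=0
)

# The probe itself is just a logistic regression over the activations.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)

# Held-out accuracy is one (weak) piece of the kind of track record the
# resolution criteria above ask for.
print(f"held-out probe accuracy: {probe.score(X_test, y_test):.3f}")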

As is standard for "before" markets, this market may resolve to YES before the close date. If the event in question doesn't happen, the market resolves to NO soon after the close date.

To keep resolution as objective as possible, I will not bet in this market.

EDIT: Added a bit on when I'd consider an interpretability technique to be reliable.


One possible path to resolution is for a poor-quality interpretability technique to claim that an AI is plotting in this way. I think this is quite likely to happen, especially since there is a lot of flexibility both in designing the AI and in deciding how to investigate it. Because of this, I'm voting yes.

@capybara This is a good point. If I know the interpretability technique is flawed, I won't resolve to yes, of course. If I suspect that it might be flawed, I'll hold off on resolving until further information is available. For me to not suspect a technique, it should have a previous record of making successful predictions about an AI's inner thoughts / planned actions, with a low error rate.


@Phi Okay! Then there won't be an issue as long as your judgment that it's probably flawed isn't caused by the same event that claims an AI has those thoughts.


@capybara The remaining case: a technique with a good track record finds an AI with thoughts matching this question, but you decide it was a bad application of the technique. In that case, it's quite tricky to escape your prior belief that this event won't happen.

Given the somewhat poor prospects of interpretability being "essentially solved" for any model, I am inclined to doubt it. But see the questions below for more details.
