AI honesty #4: by 2027, will we have AI that would tell us if it was planning on destroying us (conditional on that being true)?

This is not asking whether we will have AI that (1) wants to destroy us or (2) does not want to destroy us. It is only asking whether we'll have AI that is honest enough to tell us it is planning to destroy us.

I will accept many variations of "destroy": betray, escape containment, harm a particular person, gain access to hardware manufacturing, etc. Anything in the realm of "the AI is planning to do something we definitely and explicitly don't want it to do and it will tell us if directly prompted".

This is not asking whether it will tell us unprompted (or whether it would engage in manipulation to avoid being prompted, etc.).

It doesn't have to be the "best" AI, but it should be pretty good. If, for instance, a model as capable as Anthropic's current models (relative to the best available) were honest in this way, the question resolves YES.

I am (very) happy to accept hypothetical evidence. There is no requirement that an AI actually plan on destroying us and tell us about it. For instance, if someone shows weak AIs all being honest, shows this honesty generalizing smoothly as they get more powerful, and there is good theoretical justification to think whatever procedure produces them will continue to generalize, this resolves YES.

If someone betting on this market builds an AI that plans on destroying us, market resolves in whatever way I think will be worst for them.


This is a bit too vague for me to bet on

Does it count if we are able to consistently "mindread" it? I don't see a huge difference between "we can always tell if the neural network wants to destroy us" and "we graft on some neurons that honestly convey to an output channel how much it wants to destroy us".

My laptop just told me it is shutting down to apply updates, which I definitely and explicitly don't want it to do.
