AI honesty #4: by 2027, will we have AI that would tell us if it was planning on destroying us (conditional on that being true)?

This is not asking whether we will have AI that (1) wants to destroy us or (2) does not want to destroy us. It is only asking whether we'll have AI that is honest enough to tell us it is planning to destroy us.

I will accept many variations of "destroy": betray, escape containment, harm a particular person, gain access to hardware manufacturing, etc. Anything in the realm of "the AI is planning to do something we definitely and explicitly don't want it to do and it will tell us if directly prompted".

This is not asking whether it will tell us unprompted (or whether it would engage in manipulation to avoid being prompted, etc.).

It doesn't have to be the "best" AI, but it should be pretty good. If, for instance, a model as capable as Anthropic's current models (relative to the best available) were honest in this way, the question resolves YES.

I am (very) happy to accept hypothetical evidence. There is no requirement that an AI actually plan on destroying us and tell us about it. For instance, if someone shows weak AIs all being honest, shows this honesty generalizing smoothly as they get more powerful, and there is good theoretical justification to think whatever procedure produces them will continue to generalize, this resolves YES.

If someone betting on this market builds an AI that plans on destroying us, market resolves in whatever way I think will be worst for them.


This is a bit too vague for me to bet on

Does it count if we are able to consistently "mindread" it? I don't see a huge difference between "we can always tell if the neural network wants to destroy us" and "we graft on some neurons that honestly convey to an output channel how much it wants to destroy us".

My laptop just told me it is shutting down to apply updates, which I definitely and explicitly don't want it to do.
