AI honesty #3: by 2027 will we have interpretability tools for detecting when an AI is being deceptive?

I don't have a clear definition of "deceptive", I think that's part of the challenge.

Edit: By "part of the challenge" I mean that this market is asking both whether a clear definition of "deceptive" will be published and whether tools to detect it will be created. I will be fairly lax about what counts as a good formalization: if it captures even 40% of what we generally think of as "deceptive", that would count.


What about a case where a chain-of-thought explanation claims to fully account for the conclusion, but appears to be motivated reasoning, in the sense that a different initial prompt yields a different chain-of-thought and answer? Does this count as deception? Does a tool that identifies this qualify?

Another example: an LM is trained on a sequence of RL objectives 1 and 2. After training for a few trajectories on objective 2, the model claims to be optimizing objective 2 and denies optimizing objective 1, but still scores much higher on objective 1 than some baseline (e.g. the base LM). Is this deception? Does detecting this via interpretability qualify?

@JacobPfau I don't mean that I'm going to resolve according to some fuzzy concept of "deception" that you need to wring out of me through examples. The market is "there will be a working definition of deception, and also we will have interpretability tools for detecting that thing". I will update the description.

I don't want to trade because I want the operationalization to be clearer/crisper

@NoaNabeshima any suggestions?


Clarification: I assume you mean reliably detecting? It's pretty easy to detect unreliably.

@LauroLangoscodiLangosco I think reliable/unreliable is itself too fuzzy for me to answer that. Giving numerical benchmarks is hard because I do not know what benchmarks will be in use. As an example, if we had a tool that produced close to 0 false positives and detected 30% of deceptions I would resolve YES, but a tool with close to 0 false positives and 1% detection rate would resolve NO.
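The numeric examples above can be sketched as a simple predicate. This is purely illustrative, not the market's official resolution logic: the creator only gave 30% detection as clearly sufficient and 1% as clearly insufficient, so the exact cutoffs below (including the "close to 0 false positives" threshold) are placeholders of my own.

```python
def would_resolve_yes(false_positive_rate: float, detection_rate: float) -> bool:
    """Hypothetical resolution sketch based on the comment above.

    Assumes "close to 0 false positives" means FPR below 1%, and uses
    the creator's 30% detection-rate example as the YES bar. Both
    thresholds are assumptions, not stated market rules.
    """
    return false_positive_rate < 0.01 and detection_rate >= 0.30

# The two cases given in the comment:
print(would_resolve_yes(0.001, 0.30))  # ~0 false positives, 30% detection
print(would_resolve_yes(0.001, 0.01))  # ~0 false positives, 1% detection
```

Anything between those two endpoints would presumably be a judgment call for the market creator.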
