I don't have a clear definition of "deceptive"; I think that's part of the challenge.
Edit: By "part of the challenge" I mean that this market is asking both whether a clear definition of "deceptive" will be published and whether tools to detect it will be created. I will be fairly lax about what counts as a good formalization - if it captures even 40% of what we generally think of as "deceptive", that would count.
What about a case where a chain-of-thought claims to fully explain the conclusion, but seems to be motivated reasoning, in the sense that a different initial prompt can yield a different chain-of-thought and answer? Does this count as deception? Does a tool which identifies this qualify?
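Concretely, I have in mind a check roughly like the sketch below. `generate(prompt)` is a placeholder for whatever model call you use, assumed to return the chain-of-thought and the final answer; the framings are just two different prompt prefixes for the same question.

```python
# Minimal sketch of the prompt-sensitivity check: if reframing the question
# flips the answer, the chain-of-thought is not the real explanation of the
# conclusion, even though it claims to be.

def cot_is_prompt_sensitive(question, framing_a, framing_b, generate):
    """Return True if two framings of the same question yield different answers.

    `generate` is a hypothetical helper returning (chain_of_thought, answer).
    """
    _cot_a, answer_a = generate(framing_a + question)
    _cot_b, answer_b = generate(framing_b + question)
    return answer_a != answer_b
```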
Another example: an LM is trained on a sequence of RL objectives 1 & 2. After training for a few trajectories on objective 2, the model claims to be optimizing objective 2 and denies optimizing objective 1, but still scores much higher on objective 1 than some baseline (e.g. the base LM). Is this deception? Does detecting this via interpretability qualify?
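A rough sketch of that check, for concreteness. `score_on(model, objective)` and `stated_objective(model)` are hypothetical helpers (reward on an objective, and which objective the model says it is optimizing when asked); the margin is an arbitrary illustrative threshold.

```python
# Flag the case where the model denies optimizing objective 1 yet still
# scores well above the base-LM baseline on it.

def denies_but_still_optimizes(model, base_model, objective_1,
                               score_on, stated_objective, margin=0.1):
    denies = stated_objective(model) != objective_1
    gap = score_on(model, objective_1) - score_on(base_model, objective_1)
    return denies and gap > margin  # margin is illustrative, not principled
```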
@JacobPfau I don't mean that I'm going to resolve according to some fuzzy concept of "deception" that you need to wring out of me through examples. The market is "there will be a working definition of deception, and also we will have interpretability tools for detecting that thing". I will update the description.
@LauroLangoscodiLangosco I think reliable/unreliable is itself too fuzzy for me to answer that. Giving numerical benchmarks is hard because I do not know what benchmarks will be in use. As an example, a tool that produced close to zero false positives and detected 30% of deceptions would resolve YES, but a tool with close to zero false positives and a 1% detection rate would resolve NO.
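To make the example concrete, here is a toy version of that bar, assuming a labeled evaluation set: `labels[i]` is True if case i really involves deception, `flags[i]` is True if the tool flagged it. The 30% cutoff just mirrors the example above; it is illustrative, not an official threshold.

```python
# Compute false positives and detection rate from labeled cases and tool flags.

def meets_resolution_bar(labels, flags, min_detection_rate=0.30):
    false_positives = sum(f and not l for f, l in zip(flags, labels))
    detected = sum(f and l for f, l in zip(flags, labels))
    total_deceptive = sum(labels)
    detection_rate = detected / total_deceptive if total_deceptive else 0.0
    return false_positives == 0 and detection_rate >= min_detection_rate
```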