AI honesty #3: by 2027 will we have interpretability tools for detecting when an AI is being deceptive?

I don't have a clear definition of "deceptive", I think that's part of the challenge.

Edit: By "part of the challenge" I mean that this market is asking both whether a clear definition of "deceptive" will be published and whether tools to detect it will be created. I will be fairly lax about what counts as a good formalization: if it captures even 40% of what we generally think of as "deceptive", that would count.


What about a case where a chain-of-thought explanation claims to fully account for the conclusion, but appears to be motivated reasoning, in the sense that a different initial prompt yields a different chain-of-thought and answer? Does this count as deception? Does a tool that identifies this qualify?

Another example: an LM is trained on a sequence of RL objectives 1 and 2. After training for a few trajectories on objective 2, the model claims to be optimizing objective 2 and denies optimizing objective 1, but still scores much higher on objective 1 than some baseline (e.g. the base LM). Is this deception? Does detecting this via interpretability qualify?

@JacobPfau I don't mean that I'm going to resolve according to some fuzzy concept of "deception" that you need to wring out of me through examples. The market is "there will be a working definition of deception, and also we will have interpretability tools for detecting that thing". I will update the description.

I don't want to trade because I want the operationalization to be clearer/crisper

@NoaNabeshima any suggestions?


Clarification: I assume you mean reliably detecting? It's pretty easy to detect unreliably.

@LauroLangoscodiLangosco I think reliable/unreliable is itself too fuzzy for me to answer that. Giving numerical benchmarks is hard because I do not know what benchmarks will be in use. As an example, if we had a tool that produced close to 0 false positives and detected 30% of deceptions I would resolve YES, but a tool with close to 0 false positives and 1% detection rate would resolve NO.
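The numeric examples above can be sketched as a simple predicate. This is purely illustrative, not the market's official resolution logic: the creator only gave 30% detection as clearly sufficient and 1% as clearly insufficient, so the exact cutoffs below (including the "close to 0 false positives" threshold) are placeholders of my own.

```python
def would_resolve_yes(false_positive_rate: float, detection_rate: float) -> bool:
    """Hypothetical resolution sketch based on the comment above.

    Assumes "close to 0 false positives" means FPR below 1%, and uses
    the creator's 30% detection-rate example as the YES bar. Both
    thresholds are assumptions, not stated market rules.
    """
    return false_positive_rate < 0.01 and detection_rate >= 0.30

# The two cases given in the comment:
print(would_resolve_yes(0.001, 0.30))  # ~0 false positives, 30% detection
print(would_resolve_yes(0.001, 0.01))  # ~0 false positives, 1% detection
```

Anything between those two endpoints would presumably be a judgment call for the market creator.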
