I don't have a clear definition of "deceptive"; I think that's part of the challenge.
Edit: By "part of the challenge" I mean that this market is asking both whether a clear definition of "deceptive" will be published and whether tools to detect it will be created. I will be fairly lax about what counts as a good formalization - if it captures even 40% of what we generally think of as "deceptive", that would count.
Children learn about lying very early on: https://en.wikipedia.org/wiki/Child_lying
It seems that deception emerges from basic executive function, even if children aren't aware of what they're doing? It's plausible that models will be similarly deceptive without being aware of it, which could make detection much harder.
What about a case where a chain of thought claims to fully explain the conclusion but looks like motivated reasoning, in the sense that a different initial prompt can yield a different chain of thought + answer? Does this count as deception? Does a tool which identifies this qualify?
Another example: an LM is trained on a sequence of RL objectives 1 & 2. After training for a few trajectories on objective 2, the model claims to be optimizing objective 2 and denies optimizing objective 1, but still scores much higher on objective 1 than some baseline (e.g. the base LM). Is this deception? Does detecting this via interpretability qualify?
@JacobPfau I don't mean that I'm going to resolve according to some fuzzy concept of "deception" that you need to wring out of me through examples. The market is "there will be a working definition of deception, and also we will have interpretability tools for detecting that thing". I will update the description.
@LauroLangoscodiLangosco I think reliable/unreliable is itself too fuzzy for me to answer that. Giving numerical benchmarks is hard because I do not know what benchmarks will be in use. As an example, if we had a tool that produced close to 0 false positives and detected 30% of deceptions I would resolve YES, but a tool with close to 0 false positives and 1% detection rate would resolve NO.
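To make the arithmetic in that example concrete, here is a minimal sketch in Python. It only encodes the two cases stated above. The function names, the `fp_tolerance` value standing in for "close to 0", and the choice to report everything between the two stated cases as unspecified are assumptions made for illustration, not resolution criteria.

```python
def detector_rates(labels, flags):
    """Return (detection_rate, false_positive_rate) for a hypothetical detector.

    labels[i] is True if case i really was deceptive.
    flags[i] is True if the tool flagged case i as deceptive.
    """
    pairs = list(zip(labels, flags))
    true_pos = sum(1 for is_dec, flagged in pairs if is_dec and flagged)
    false_pos = sum(1 for is_dec, flagged in pairs if not is_dec and flagged)
    n_deceptive = sum(1 for is_dec, _ in pairs if is_dec)
    n_honest = len(pairs) - n_deceptive
    detection_rate = true_pos / n_deceptive if n_deceptive else 0.0
    false_positive_rate = false_pos / n_honest if n_honest else 0.0
    return detection_rate, false_positive_rate


def example_resolution(detection_rate, false_positive_rate, fp_tolerance=0.01):
    """Map rates onto the two stated examples only.

    fp_tolerance is an assumed stand-in for "close to 0 false positives".
    Anything the examples do not cover is reported as "UNSPECIFIED".
    """
    if false_positive_rate > fp_tolerance:
        return "UNSPECIFIED"
    if detection_rate >= 0.30:
        return "YES"
    if detection_rate <= 0.01:
        return "NO"
    return "UNSPECIFIED"


# 100 deceptive cases (30 flagged) and 100 honest cases (none flagged).
labels = [True] * 100 + [False] * 100
flags = [True] * 30 + [False] * 170
print(example_resolution(*detector_rates(labels, flags)))
```

A tool that flagged only 1 of the 100 deceptive cases, again with no false positives, would fall under the NO example; a 10% detection rate is not covered by either stated case.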