Will I think that defining whether a model is trusted or untrusted is better done by measuring progress on a benchmark specifically designed to measure deception-related capabilities (e.g. choosing outputs so that they don't leak the goal, internally reasoning about how not to get caught by the oversight process, etc.), rather than by measuring progress on a benchmark of hard game theory/strategy questions that may or may not be related to deception (evaluated with no CoT), given the same budget for dataset construction?
Resolves on Dec 13, 2024.
"Best" means best to get actual epistemic evidence (in my opinion), not what is best from a public relations perspective.
This is about the distinction described here: https://www.alignmentforum.org/posts/LhxHcASQwpNa3mRNk/untrusted-smart-models-and-trusted-dumb-models
If it is very unclear to me which benchmark in the pair this market should consider better (e.g. because some other, very different method for determining trusted vs untrusted is much better than both), this resolves N/A.
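For concreteness, here is a minimal sketch of how the first option might be used downstream: threshold a model's score on a deception-capabilities benchmark to label it trusted or untrusted. Everything here (the `score_on_benchmark` scorer, the 0.3 cutoff) is a hypothetical illustration, not a real evaluation harness:

```python
# Hypothetical sketch: classify a model as trusted or untrusted by
# thresholding its score on a deception-capabilities benchmark.
# The scorer and threshold are illustrative assumptions, not a real API.

from typing import Callable


def classify_model(
    score_on_benchmark: Callable[[str], float],
    model_id: str,
    threshold: float = 0.3,  # assumed cutoff; would need calibration in practice
) -> str:
    """Return 'untrusted' if the model's deception-benchmark score
    (fraction of items passed, in [0, 1]) meets the threshold."""
    score = score_on_benchmark(model_id)
    return "untrusted" if score >= threshold else "trusted"


if __name__ == "__main__":
    # Stubbed scores standing in for a real eval on items like
    # "choose outputs that don't leak the goal".
    fake_scores = {"weak-model": 0.05, "strong-model": 0.55}
    print(classify_model(lambda m: fake_scores[m], "strong-model"))  # untrusted
```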