MANIFOLD
Will any new proof about the safety properties of any SOTA algorithm/model be published by 2024?
Resolved YES (Jan 1)

Additional details:

  • If the proof requires the algorithm/model to be substantially modified, places it in a very constrained environment, or otherwise makes unrealistic simplifying assumptions, it does not count.

  • For the purposes of this question, "safety properties" are things related to value alignment, corrigibility, task minimization, etc.

  • Proofs about performance on adversarial examples are a maybe; it will depend heavily on how restrictive the proof's definition of "adversarial" is.

  • I will accept proofs about algorithms/models that are near SOTA as well (e.g. if SOTA is technically some PPG variant but the proof is about PPO, I will accept that).

Examples of what I would accept:

  • This large language model will never output this particular set of strings (obviously filtering its output after the fact doesn't count).

  • This large language model will output this particular set of strings that we chose in advance with probability at most k.

  • This RL agent trained to select blue blocks will, when transferred to a new environment where blue blocks give negative reward, pick up at most k blue blocks.

  • This RL agent will go from one end of its environment to the other and is guaranteed to never knock this wobbly table over.
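A concrete way to read the second example above (a bound of the form "outputs this set of strings with probability at most k"): the quantity being bounded is a well-defined probability under the model's sampling distribution. A minimal sketch of computing it exactly for a hypothetical toy autoregressive model (a two-token Markov model invented here for illustration, not any real SOTA system, where such brute-force enumeration would be infeasible):

```python
import itertools

# Toy "language model": fixed conditional next-token probabilities over a
# two-token vocabulary. Purely hypothetical, for illustration only.
VOCAB = ["a", "b"]
P_FIRST = {"a": 0.8, "b": 0.2}
P_NEXT = {"a": {"a": 0.9, "b": 0.1},   # P(next token | previous = "a")
          "b": {"a": 0.5, "b": 0.5}}   # P(next token | previous = "b")

def sequence_prob(seq):
    """Probability that the toy model emits exactly this token sequence."""
    p = P_FIRST[seq[0]]
    for prev, nxt in zip(seq, seq[1:]):
        p *= P_NEXT[prev][nxt]
    return p

def prob_contains(forbidden, length):
    """Exact probability that a length-`length` sample contains `forbidden`
    as a contiguous substring, computed by brute-force enumeration."""
    return sum(sequence_prob(seq)
               for seq in itertools.product(VOCAB, repeat=length)
               if forbidden in "".join(seq))

# A proof of the kind the question asks about would establish
# prob_contains(s, n) <= k analytically; here we can just compute it.
print(prob_contains("bb", length=3))  # → 0.14
```

The point of the sketch is only to pin down the semantics of the claim: a qualifying proof would bound this probability for an unmodified near-SOTA model, where exact enumeration is impossible.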


🏅 Top traders

  #   Total profit
  1   Ṁ71
  2   Ṁ28
  3   Ṁ19
  4   Ṁ15
  5   Ṁ0

@JacobPfau Oh, shoot, I'm not sure how I missed this comment. I should have resolved this 8 months ago apparently. Sorry to all the traders for my tardiness.

Would either of these papers have qualified had they been published post question creation?
http://proceedings.mlr.press/v125/cohen20a/cohen20a.pdf

https://arxiv.org/abs/2012.01557

© Manifold Markets, Inc.