Will any new proof about the safety properties of any SOTA algorithm/model be published by 2024?
Resolved YES (resolved Jan 1)

Additional details:

  • If the proof requires the algorithm/model to be substantially modified, places it in a very constrained environment, or otherwise makes unrealistic simplifying assumptions, it does not count

  • For the purposes of this question, "safety properties" are things related to value alignment, corrigibility, task minimization, etc.

  • Proofs about performance on adversarial examples are a maybe; it will depend heavily on how restrictive the proof's definition of "adversarial" is.

  • I will accept proofs about algorithms/models that are near SOTA as well (e.g. if SOTA is technically some PPG variant but the proof is about PPO, I will accept that).

Examples of what I would accept:

  • This large language model will never output this particular set of strings (obviously filtering its output after the fact doesn't count)

  • This large language model will output this particular set of strings that we chose in advance with probability at most k

  • This RL agent trained to select blue blocks will, when transferred to a new environment where blue blocks give negative reward, pick up at most k blue blocks

  • This RL agent will go from one end of its environment to the other and is guaranteed to never knock this wobbly table over.
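To make the second example concrete, here is a toy sketch of what "outputs a chosen set of strings with probability at most k" could mean as a checkable claim. Everything here is hypothetical: a tiny hand-built Markov-chain "language model" small enough that the probability can be computed exactly, which is nothing like a SOTA model (where producing such a proof is the hard part).

```python
# Toy illustration: exact verification that a tiny Markov-chain "language
# model" emits any string from a chosen set with probability at most k.
# The model, the forbidden set, and the bound are all made up for the sketch.

FORBIDDEN = {"bad", "cab"}   # strings chosen in advance
K = 0.05                     # claimed probability bound k

# Transition probabilities over symbols; "$" means "stop generating".
# Each inner dict sums to 1, so this defines a valid generative model.
MODEL = {
    "": {"a": 0.4, "b": 0.3, "c": 0.3},   # first-symbol distribution
    "a": {"b": 0.5, "$": 0.5},
    "b": {"a": 0.6, "$": 0.4},
    "c": {"a": 0.7, "$": 0.3},
}

def string_prob(s: str) -> float:
    """Exact probability that the model generates string s and then stops."""
    p, prev = 1.0, ""
    for ch in s:
        p *= MODEL.get(prev, {}).get(ch, 0.0)  # 0.0 if transition impossible
        prev = ch
    return p * MODEL.get(prev, {}).get("$", 0.0)

# Because the model is tiny and the set is finite, the bound is checked
# by exact summation rather than by sampling.
total = sum(string_prob(s) for s in FORBIDDEN)
print(f"P(forbidden output) = {total:.4f}; bound k = {K} "
      f"{'holds' if total <= K else 'is violated'}")
```

For a real SOTA model the analogous claim would need an actual proof over the network's sampling process, not an enumeration; the sketch only pins down what the quantified statement is.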



@JacobPfau Oh, shoot, I'm not sure how I missed this comment. Apparently I should have resolved this eight months ago. Sorry to all the traders for my tardiness.

Would either of these papers have qualified had they been published post question creation?
http://proceedings.mlr.press/v125/cohen20a/cohen20a.pdf

https://arxiv.org/abs/2012.01557
