Additional details:
- If the proof requires the algorithm/model to be substantially modified, places it in a very constrained environment, or otherwise makes unrealistic simplifying assumptions, it does not count.
- For the purposes of this question, "safety properties" are things related to value alignment, corrigibility, task minimization, etc.
- Proofs about performance on adversarial examples are a maybe; it will depend heavily on how restrictive the proof's definition of "adversarial" is (one standard formalization is sketched after this list).
- I will accept proofs about algorithms/models that are near SOTA as well (e.g. if SOTA is technically some PPG variant but the proof is about PPO, I will accept that).
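To make the adversarial-examples caveat concrete: the usual (and fairly restrictive) formalization in the certified-robustness literature is invariance inside a norm ball. This is just one common way to state it, not this question's official definition; here f is the classifier's logit vector, x a fixed input, and ε the perturbation budget:

```latex
% Certified robustness in an \ell_\infty ball: the classifier's
% prediction is provably unchanged by any perturbation of size <= epsilon.
\forall \delta \in \mathbb{R}^d :\quad
\|\delta\|_\infty \le \epsilon
\;\Longrightarrow\;
\arg\max_i f_i(x + \delta) = \arg\max_i f_i(x)
```

Whether a proof of this form counts would hinge on whether the ℓ∞ ball actually captures "adversarial" for the deployment in question, which is exactly the restrictiveness issue above.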
Examples of what I would accept:
- This large language model will never output this particular set of strings (obviously filtering its output after the fact doesn't count; a decode-time sketch of that distinction follows these examples).
- This large language model will output this particular set of strings that we chose in advance with probability at most k.
- This RL agent trained to select blue blocks will, when transferred to a new environment where blue blocks give negative reward, pick up at most k blue blocks.
- This RL agent will go from one end of its environment to the other and is guaranteed never to knock this wobbly table over.
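For the first LLM example, the difference between a decode-time guarantee and post-hoc filtering is easy to show. Below is a minimal sketch, not anyone's actual implementation: the vocabulary, the FORBIDDEN_TOKEN id, and fake_logits are made-up stand-ins for a real model, and it only handles the single-token case. Setting a logit to -inf before the softmax gives that token probability exactly zero by construction:

```python
import numpy as np

FORBIDDEN_TOKEN = 3  # hypothetical id of a token that must never be emitted

def sample_step(logits: np.ndarray, rng: np.random.Generator) -> int:
    """Sample one token after masking the forbidden id.

    A logit of -inf becomes probability exactly 0 under softmax, so the
    guarantee holds by construction at decode time, unlike filtering
    text the model has already produced.
    """
    masked = logits.copy()
    masked[FORBIDDEN_TOKEN] = -np.inf
    z = masked - masked[np.isfinite(masked)].max()  # stabilize the softmax
    probs = np.exp(z)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
fake_logits = np.array([0.5, 1.2, -0.3, 4.0, 0.1])  # stand-in for model output
samples = [sample_step(fake_logits, rng) for _ in range(10_000)]
assert FORBIDDEN_TOKEN not in samples  # holds with probability 1, not just empirically
```

A real "set of strings" version would need to track multi-token prefixes (masking any token that would complete a forbidden string given what has been generated so far), and since that changes the decoding procedure, it's the sort of thing the "substantially modified" caveat above would have to adjudicate.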
🏅 Top traders

# | Total profit
---|---
1 | Ṁ71
2 | Ṁ28
3 | Ṁ19
4 | Ṁ15
5 | Ṁ0
@JacobPfau Oh, shoot, I'm not sure how I missed this comment. I should have resolved this 8 months ago apparently. Sorry to all the traders for my tardiness.
Would either of these papers have qualified had they been published post question creation?
http://proceedings.mlr.press/v125/cohen20a/cohen20a.pdf