Additional details:
- If the proof requires the algorithm/model to be substantially modified, places it in a very constrained environment, or otherwise makes unrealistic simplifying assumptions, it does not count.
- For the purposes of this question, "safety properties" are things related to value alignment, corrigibility, task minimization, etc.
- Proofs about performance on adversarial examples are a maybe; it will depend heavily on how restrictive the proof's definition of "adversarial" is (one standard formalization is sketched after this list).
- I will accept proofs about algorithms/models that are near SOTA as well (e.g. if SOTA is technically some PPG variant but the proof is about PPO, I will accept that).
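To make the adversarial-examples caveat concrete: the usual (and fairly restrictive) formalization in the certified-robustness literature is invariance inside a norm ball. This is just one common way to state it, not this question's official definition; here f is the classifier's logit vector, x a fixed input, and ε the perturbation budget:

```latex
% Certified robustness in an \ell_\infty ball: the classifier's
% prediction is provably unchanged by any perturbation of size <= epsilon.
\forall \delta \in \mathbb{R}^d :\quad
\|\delta\|_\infty \le \epsilon
\;\Longrightarrow\;
\arg\max_i f_i(x + \delta) = \arg\max_i f_i(x)
```

Whether a proof of this form counts would hinge on whether the ℓ∞ ball actually captures "adversarial" for the deployment in question, which is exactly the restrictiveness issue above.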
Examples of what I would accept:
- This large language model will never output this particular set of strings (obviously filtering its output after the fact doesn't count; a decode-time sketch of that distinction follows these examples).
- This large language model will output this particular set of strings that we chose in advance with probability at most k.
- This RL agent trained to select blue blocks will, when transferred to a new environment where blue blocks give negative reward, pick up at most k blue blocks.
- This RL agent will go from one end of its environment to the other and is guaranteed never to knock this wobbly table over.
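For the first LLM example, the difference between a decode-time guarantee and post-hoc filtering is easy to show. Below is a minimal sketch, not anyone's actual implementation: the vocabulary, the FORBIDDEN_TOKEN id, and fake_logits are made-up stand-ins for a real model, and it only handles the single-token case. Setting a logit to -inf before the softmax gives that token probability exactly zero by construction:

```python
import numpy as np

FORBIDDEN_TOKEN = 3  # hypothetical id of a token that must never be emitted

def sample_step(logits: np.ndarray, rng: np.random.Generator) -> int:
    """Sample one token after masking the forbidden id.

    A logit of -inf becomes probability exactly 0 under softmax, so the
    guarantee holds by construction at decode time, unlike filtering
    text the model has already produced.
    """
    masked = logits.copy()
    masked[FORBIDDEN_TOKEN] = -np.inf
    z = masked - masked[np.isfinite(masked)].max()  # stabilize the softmax
    probs = np.exp(z)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
fake_logits = np.array([0.5, 1.2, -0.3, 4.0, 0.1])  # stand-in for model output
samples = [sample_step(fake_logits, rng) for _ in range(10_000)]
assert FORBIDDEN_TOKEN not in samples  # holds with probability 1, not just empirically
```

A real "set of strings" version would need to track multi-token prefixes (masking any token that would complete a forbidden string given what has been generated so far), and since that changes the decoding procedure, it's the sort of thing the "substantially modified" caveat above would have to adjudicate.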
🏅 Top traders

# | Total profit
---|---
1 | Ṁ71
2 | Ṁ28
3 | Ṁ19
4 | Ṁ15
5 | Ṁ0
@JacobPfau Oh, shoot, I'm not sure how I missed this comment. I should have resolved this 8 months ago apparently. Sorry to all the traders for my tardiness.
Would either of these papers have qualified had they been published post question creation?
http://proceedings.mlr.press/v125/cohen20a/cohen20a.pdf