See https://manifold.markets/RyanGreenblatt/by-when-will-redwood-research-publi
I might resolve early if it seems sufficiently clear.
By "ex-post results" I mean how good the overall project seems to have been ex-post, in terms of its use of labor, money, etc.
For context, I currently feel at least "very happy" about the ex-post results of the following Redwood Research projects/write-ups/quick blog posts:
- https://www.lesswrong.com/posts/d9FJHawgkiMSPjagR/ai-control-improving-safety-despite-intentional-subversion
- https://www.lesswrong.com/posts/9Fdd9N7Escg3tcymb/preventing-language-models-from-hiding-their-reasoning
- https://www.lesswrong.com/posts/kcKrE9mzEHrdqtDpE/the-case-for-ensuring-that-powerful-ais-are-controlled
- https://www.lesswrong.com/posts/i2nmBfCXnadeGmhzW/catching-ais-red-handed
- https://www.lesswrong.com/posts/F6HSHzKezkh6aoTr2/improving-the-welfare-of-ais-a-nearcasted-proposal
- https://www.lesswrong.com/posts/rf66R4YsrCHgWx9RG/preventing-model-exfiltration-with-upload-limits
- Some of our advising in various cases.
I feel moderately happy about:
- https://www.alignmentforum.org/posts/inALbAqdx63KTaGgs/benchmarks-for-detecting-measurement-tampering-redwood
And I feel somewhat happy about:
- https://www.lesswrong.com/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing
And I feel unhappy about a variety of work we did on interpretability and various things we did more recently that didn't end up working out. For instance, I feel unhappy about somewhat lower-effort projects we did on debate, consistency losses, and steganography learned via RL.