Will I feel "very happy" about the ex-post results of the Redwood Research sandbagging project on 2025-04-01?
85% chance · Ṁ156 · closes Apr 2

See https://manifold.markets/RyanGreenblatt/by-when-will-redwood-research-publi

I might resolve early if it seems sufficiently clear.

By "ex-post results" I mean how good of an overall project it seems to have been ex post, in terms of its usage of labor, money, etc.

For context, I currently feel at least "very happy" about the ex-post results of the following Redwood Research projects/write-ups/quick blog posts:

- https://www.lesswrong.com/posts/d9FJHawgkiMSPjagR/ai-control-improving-safety-despite-intentional-subversion
- https://www.lesswrong.com/posts/9Fdd9N7Escg3tcymb/preventing-language-models-from-hiding-their-reasoning
- https://www.lesswrong.com/posts/kcKrE9mzEHrdqtDpE/the-case-for-ensuring-that-powerful-ais-are-controlled
- https://www.lesswrong.com/posts/i2nmBfCXnadeGmhzW/catching-ais-red-handed
- https://www.lesswrong.com/posts/F6HSHzKezkh6aoTr2/improving-the-welfare-of-ais-a-nearcasted-proposal
- https://www.lesswrong.com/posts/rf66R4YsrCHgWx9RG/preventing-model-exfiltration-with-upload-limits
- Some of our advising in various cases.

I feel moderately happy about:

- https://www.alignmentforum.org/posts/inALbAqdx63KTaGgs/benchmarks-for-detecting-measurement-tampering-redwood

And I feel somewhat happy about:

- https://www.lesswrong.com/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing

And I feel unhappy about a variety of work we did on interpretability, as well as various things we did more recently that didn't end up working out. For instance, I feel unhappy about somewhat lower-effort projects we did on debate, consistency losses, and steganography learned via RL.
