(5000M subsidy) Will I (porby) think "goal agnosticism" as a concept is still relevant/useful at the end of 2024?
Resolved YES (January 2)

I currently think goal agnostic systems, particularly a subset of predictors, have really nice foundational properties that give us a path to practically usable extreme capability without autodoom.

Some (beefy) background:
  • FAQ: What the heck is goal agnosticism? — LessWrong

  • Using predictors in corrigible systems — LessWrong

Resolves yes if, on January 1, 2025:

  1. I still agree with the core arguments underlying goal agnosticism, how it can be used, and how it is likely to scale.

  2. I still think that AI research is on a path that makes roughly goal agnostic foundations a reasonable expectation: not guaranteed, but >15%-ish chance. (Current estimate: ~87%)

Note that resolving yes does not require that I am still working on things related to goal agnosticism.


Some example ways this could resolve no:

  • An experiment shows that simple current-style autoregressive, single-token predictive loss (see the sketch after this list) over a reasonably broad training distribution still allows unconditional preferences over world states: for example, "wanting to predict well" rather than merely "predicting well," leading to locally loss-increasing steganography.

  • The industry finds an easier path to extreme capability that doesn't lend itself to goal agnosticism. For example, if someone manages to make end-to-end reinforcement learning on a sparse, distant reward (no predictive world model helping out, no reward shaping, etc.) work reliably and for 10,000x less compute than an equivalent predictor-backed system, I'd probably be forced to downgrade the probability of goal agnostic systems a lot. Also, we'd probably explode.

  • I somehow become convinced that the fuzzier parts, like the degree to which we can reliably aim a strong system at useful things, don't work the way I thought, in a way that renders the approach useless.
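
For readers unfamiliar with the objective the first bullet refers to, here is a minimal sketch of autoregressive, single-token predictive loss. The tiny model, vocabulary size, and dimensions are illustrative assumptions of mine, not anything from the linked background posts; the point is only where the training signal attaches: the model is scored per token on predicting the next token, with no term expressing preferences over world states.

```python
# Minimal sketch of autoregressive, single-token predictive loss.
# TinyAutoregressiveModel and all sizes below are illustrative
# assumptions, not anything from the linked posts.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM, CONTEXT = 256, 64, 32  # toy sizes

class TinyAutoregressiveModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.proj = nn.Linear(DIM, VOCAB)

    def forward(self, tokens):
        # A real predictor would use causal attention; a per-token
        # projection is enough to show where the loss attaches.
        return self.proj(self.embed(tokens))  # (batch, seq, vocab)

model = TinyAutoregressiveModel()
tokens = torch.randint(0, VOCAB, (8, CONTEXT))  # stand-in training batch

logits = model(tokens[:, :-1])   # predict token t+1 from tokens up to t
targets = tokens[:, 1:]          # ground-truth next tokens
loss = F.cross_entropy(          # per-token predictive loss, nothing else
    logits.reshape(-1, VOCAB), targets.reshape(-1)
)
loss.backward()
```

The bullet's concern, as I read it, is that if a system trained purely on this per-token objective nonetheless developed unconditional preferences (e.g., sacrificing local loss to embed steganographic signals), that would undermine the goal agnosticism story.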


@Bird any updates?

I do indeed still think it's a useful set of foundational intuitions, and that the implications up the industry stack still do hold (and look like they will continue to hold). To the extent research has touched on relevant questions, it's all consistent with the story I expect based on this model.
