(5000M subsidy) Will I (porby) think "goal agnosticism" as a concept is still relevant/useful at the end of 2024?

I currently think goal agnostic systems, particularly a subset of predictors, have really nice foundational properties that give us a path to practically usable extreme capability without autodoom.

Some (beefy) background:
  • FAQ: What the heck is goal agnosticism? — LessWrong

  • Using predictors in corrigible systems — LessWrong

Resolves yes if, on January 1, 2025:

  1. I still agree with the core arguments underlying goal agnosticism, how it can be used, and how it is likely to scale.

  2. I still think that AI research is on a path that makes roughly goal agnostic foundations a reasonable expectation: not guaranteed, but >15%-ish chance. (Current estimate: ~87%)

Note that resolving yes does not require that I am still working on things related to goal agnosticism.

Some example ways this could resolve no:

  • An experiment shows that a simple, current-style autoregressive, single-token predictive loss over a reasonably broad training distribution still allows unconditional preferences over world states. For example, a model "wanting to predict well" rather than simply "predicting well" could lead to locally loss-increasing steganography. (For what that predictive loss looks like, see the sketch after this list.)

  • The industry finds an easier path to extreme capability that doesn't lend itself to goal agnosticism. For example, if someone manages to make end-to-end reinforcement learning on a sparse, distant reward (no predictive world model helping out, no reward shaping, etc.) work reliably and with 10,000x less compute than an equivalent predictor-backed system, I'd probably be forced to downgrade the probability of goal agnostic systems a lot. Also, we'd probably explode.

  • I somehow become convinced that the fuzzier parts, like the degree to which we can reliably aim a strong system at useful things, don't work the way I thought they did, in a way that makes the approach useless.
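
For concreteness, here's a minimal sketch of the kind of single-token predictive loss the first bullet refers to. This is an illustrative assumption on my part (PyTorch, with a hypothetical `next_token_loss` helper), not the exact setup any lab uses; the point is just that the objective scores next-token predictions only, with no term that directly references world states.

```python
# Minimal sketch of a standard single-token autoregressive predictive loss.
# Assumes PyTorch; next_token_loss is a hypothetical helper for illustration.
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over next-token predictions.

    logits: (batch, seq_len, vocab_size) model outputs at each position.
    tokens: (batch, seq_len) integer token ids.
    """
    # Position t's logits are scored against token t+1; nothing in the
    # objective refers to world states, only to predictive accuracy.
    preds = logits[:, :-1, :].reshape(-1, logits.size(-1))
    targets = tokens[:, 1:].reshape(-1)
    return F.cross_entropy(preds, targets)
```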
