Before 2028, will there be a major self-improving AI policy*?
77% chance

Background.

*Resolution conditions (all must apply):

  1. An AI of the form P(action|context), rather than e.g. E[value|context, action], must be a part of a major AI system. For instance, language models such as ChatGPT or Sydney would currently count for this (see the sketch below).

  2. The aftermath of its chosen actions must at least sometimes be recorded, and the recordings must be used to estimate what could have usefully been done differently. RLHF finetuning as it is done today does not count, because it solely involves looking at the actions; but, for example, I bet the Sydney team at Bing had internal discussions about the Sydney incident, and those internal discussions would count.

  3. This must be continually used to update P(action|context) to improve itself.

  4. Criteria 2 and 3 must be handled by the AI itself, not by humans or by some other AI system. (This means that Sydney wouldn't count, nor would any standard actor-critic system.)

It does not have to be reflective, i.e. it does not have to consider the aftermath of its self-improvements and improve its self-improvement. Improvements to its self-improvement are allowed to be handled by people manually, by RLHF, by actor-critic methods, or lots of other options.
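
For concreteness, here is a minimal PyTorch sketch of the two model forms that criterion 1 distinguishes. The sizes and the one-hot action encoding are purely illustrative assumptions, not a claim about how any real system is built.

```python
import torch
import torch.nn as nn

N_ACTIONS, D_CTX = 8, 32  # invented sizes, purely for illustration

# Criterion-1 form: the model directly parameterises P(action|context),
# the way a language model puts a distribution over next tokens.
policy_net = nn.Sequential(
    nn.Linear(D_CTX, 64), nn.ReLU(),
    nn.Linear(64, N_ACTIONS), nn.Softmax(dim=-1),
)

# The contrasting form: the model scores a (context, action) pair,
# i.e. it estimates E[value|context, action].
value_net = nn.Sequential(
    nn.Linear(D_CTX + N_ACTIONS, 64), nn.ReLU(),
    nn.Linear(64, 1),
)

ctx = torch.randn(1, D_CTX)
action = torch.eye(N_ACTIONS)[0].unsqueeze(0)            # one-hot encoded action
print(policy_net(ctx))                                   # a distribution over actions
print(value_net(torch.cat([ctx, action], dim=-1)))       # a single value estimate
```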

I will not be trading in this market.

Naive question – I’ll appreciate just a short pointer in the right direction. Why is the P(action|context) versus E[value|context, action] distinction important here? I intuit that the former packs more information – the outer layer of the black box is transparent – so it may well turn out to be technically necessary for a tractable solution, but why do we need to tie the market to this specific form? <arguably> Humans appear to have an intrinsic online reinforcement system (emotions and motivations) yet we’re closer to black boxes, both to external observers and even to our own introspection. </arguably> Would a human pass the test posed by this market?

Would a human pass the test posed by this market?

@yaboi69 A human wouldn't pass the test posed by this market. For instance, humans seem better described by the brain-like AGI model, where things resembling both P(action|context) and E[value|context] are present and play an important role.

Why is the P(action|context) versus E[value|context, action] distinction important here? I intuit that the former packs more information – the outer layer of the black box is transparent – so it may well turn out to be technically necessary for a tractable solution, but why do we need to tie the market to this specific form?

I would say that the latter packs more information? You can turn an E[value|context, action] into a P(action|context) using an argmax (or softmax, for exploration and smoother properties and such).
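A minimal sketch of that conversion, with made-up value estimates for a discrete action set:

```python
import numpy as np

def values_to_policy(q_values: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Turn E[value|context, action] estimates into P(action|context) via softmax;
    as temperature -> 0 this approaches a hard argmax."""
    logits = q_values / temperature
    logits = logits - logits.max()     # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Made-up value estimates for three candidate actions
print(values_to_policy(np.array([1.0, 2.5, 0.3]), temperature=0.5))
```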

I guess I should clarify that I've had a bunch of back-and-forth with some alignment researchers who expect future AIs to look a lot like GPT-3 with its P(action|context) stuff ("simulators", "shard theory"). I don't know exactly what I expect it to look like, but I think it could look like e.g. brain-like AGI, and I don't expect it to look like GPT-3.

This market is basically what happens if I condition on them being right and me being wrong, and then ask what the biggest/most powerful AI that will result from their approach might look like.

I also do not see the distinction you're drawing between "looking at the actions" and "looking at the aftermath of the actions" such that RLHF is one but not the other. What exactly counts as aftermath?

@vluzko There are versions of RLHF that do look at the aftermath of the actions, but I think the ones used in the most prominent models (e.g. ChatGPT) do not do that?

RLHF that does look at the aftermath does count for criterion 2. (The HF part by definition doesn't count for criterion 4, but for efficiency RLHF is usually split so that the humans don't give reward directly but instead train a reward model, and in that case it also counts for criterion 4.)
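
For reference, that reward-model split usually looks roughly like the following. The embedding size and architecture here are invented; only the pairwise-preference loss is the standard ingredient.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in reward model: humans only supply pairwise preferences up front;
# after that, this model hands out reward on its own during policy optimisation.
reward_model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

def preference_loss(preferred: torch.Tensor, rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss commonly used to train RLHF reward models."""
    return -F.logsigmoid(reward_model(preferred) - reward_model(rejected)).mean()

# One update step on a fabricated batch of embedded (preferred, rejected) response pairs.
preferred, rejected = torch.randn(16, 64), torch.randn(16, 64)
loss = preference_loss(preferred, rejected)
optimizer.zero_grad(); loss.backward(); optimizer.step()
```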

@tailcalled Again, what are you counting as the aftermath? Like, what specific pieces of data are saying it needs to look at?

@vluzko It depends on the query/task it is asked to do. Can you give a specific example where it is unclear what would count as the aftermath vs the actions?

@tailcalled To be honest, the whole thing is unclear. "action" is standard jargon for RL but it's already not the only thing that gets used by RLHF so I'm not sure what you mean, and "aftermath" is just not standard jargon. Do you mean reward at later states? Later states themselves? Some other thing?

@vluzko For a prototypical case of what I'm imagining, I'm thinking of something like a GPT-based AI that performs DevOps work.

Obviously with existing AIs like GPT, you can prompt them to barf out some algorithm that maybe solves a problem. Imagine that our DevOps AI did that, and then deployed the code to PROD, but then the code caused PROD to crash.

In that case, we have a clear distinction between action (code + choice to deploy it) and aftermath (the system crashed).

Then we might imagine that the DevOps AI investigates the crash, figures out what went wrong, and works out what it should have done ahead of time so that it wouldn't have crashed PROD. And then it updates its own policy to be more likely to do that.
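
A hypothetical sketch of that loop; `policy_lm` and `prod` are placeholders with invented `sample()`, `finetune()`, `deploy()` and `next_ticket()` methods, not any real API.

```python
def devops_incident_loop(policy_lm, prod):
    """Hypothetical sketch of the DevOps example above; every name here is a placeholder."""
    task = prod.next_ticket()
    code = policy_lm.sample(f"Write and deploy a fix for: {task}")   # the action
    outcome = prod.deploy(code)                                      # the aftermath gets recorded

    if outcome.crashed:
        # The policy model itself investigates what went wrong (criteria 2 and 4)...
        postmortem = policy_lm.sample(
            f"Deployed change:\n{code}\nCrash logs:\n{outcome.logs}\n"
            "What should have been done instead so PROD would not have crashed?"
        )
        # ...and folds the lesson back into P(action|context) (criterion 3).
        policy_lm.finetune(prompt=task, target=postmortem)
```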

@tailcalled I think that basically any RL system is already taking "aftermath" into account by that definition, including RLHF systems. If you disagree can you please say what "aftermath" is in the context of LLMs, that RLHF is currently not taking into account?

@vluzko I literally just gave an example of an aftermath in the context of LLMs that I don't think RLHF is currently taking into account? AFAIK at no point during the training of any LLM did it get to run code on a production server and get graded on real-world user complaints?

@tailcalled There is no LLM that does devops. Can you give an example in terms of an LLM that actually currently exists?

Ideally an example in terms of, say, ChatGPT.

@vluzko How do you define whether an LLM "does" some kind of work?

Can you list some examples of kinds of work that ChatGPT "does"?

@tailcalled It's not my market so I don't see how my definition could matter, but sure: I'd say that ChatGPT is just meant to produce text that users will rate highly, and by that standard it can look at the aftermath of its actions within RLHF, because you can (and generally do) give it n-step return estimates.
If you wanted, you could finetune ChatGPT on devops input/output pairs, and then do RLHF where it gets back the output of the terminal and you train a reward model, etc. That's all within the existing RLHF paradigm, so I don't understand what "aftermath" it isn't getting such that this market isn't already resolved, which is why I'm being such a stick in the mud about what "aftermath" means for ChatGPT.
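
(For reference, a generic n-step return of the kind used in policy-gradient training, not a claim about ChatGPT's actual training setup, can be computed as follows; the toy rewards and value estimates are made up.)

```python
import numpy as np

def n_step_returns(rewards: np.ndarray, values: np.ndarray,
                   n: int, gamma: float = 0.99) -> np.ndarray:
    """Generic n-step return: G_t = sum_{k<n} gamma^k r_{t+k} + gamma^n V(s_{t+n}),
    bootstrapping from the value estimate n steps later when one exists."""
    T = len(rewards)
    returns = np.zeros(T)
    for t in range(T):
        horizon = min(n, T - t)
        returns[t] = sum(gamma**k * rewards[t + k] for k in range(horizon))
        if t + n < T:
            returns[t] += gamma**n * values[t + n]
    return returns

# Toy episode: reward only at the end, with made-up value estimates along the way.
print(n_step_returns(np.array([0.0, 0.0, 1.0]), np.array([0.2, 0.5, 0.9]), n=2))
```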

Why exactly does actor-critic not count? Are you saying that if there is more than one neural network involved then it doesn't count? What if it's a single neural network with two output heads (as is somewhat common with actor-critic methods)?

@vluzko This is a great question. I've been thinking about it for a while as a result of your comment, and I think the issue is that I had the intuition that actor-critic methods don't count because they differ from the methods I had in mind, but that doesn't mean that they differ in the specific way that I wrote down.

What I want to get at is something with a more "unified" mind than traditional actor-critic methods. If e.g. the policy has learned to improve its results using chain-of-thought approaches, then the self-improvement should also be capable of using chain-of-thought to have better self-improvement.

If you just threw a value head next to the policy head, then the values would be much more rigid, as it would not be able to add extra context, investigations, deductions, etc.
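
For concreteness, the setup being contrasted against, sketched with invented sizes; the value head is a single linear readout, with no room for the kind of extra investigation described above.

```python
import torch
import torch.nn as nn

class TwoHeadActorCritic(nn.Module):
    """The 'value head next to the policy head' setup: one shared trunk,
    separate linear heads for the policy and the value estimate."""
    def __init__(self, d_ctx: int = 32, n_actions: int = 8):   # sizes are illustrative
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(d_ctx, 64), nn.ReLU())
        self.policy_head = nn.Linear(64, n_actions)   # logits for P(action|context)
        self.value_head = nn.Linear(64, 1)            # a single scalar, nothing more

    def forward(self, ctx: torch.Tensor):
        h = self.trunk(ctx)
        return torch.softmax(self.policy_head(h), dim=-1), self.value_head(h)

probs, value = TwoHeadActorCritic()(torch.randn(1, 32))
```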

Reflective version: