An outcome is "okay" if it gets at least 20% of the maximum attainable cosmopolitan value that could've been attained by a positive Singularity (a la full Coherent Extrapolated Volition done correctly), and existing humans don't suffer death or any other awful fates.
This market is a duplicate of https://manifold.markets/IsaacKing/if-we-survive-general-artificial-in with different options. https://manifold.markets/EliezerYudkowsky/if-artificial-general-intelligence-539844cd3ba1?r=RWxpZXplcll1ZGtvd3NreQ is this same question but with user-submitted answers.
(Please note: It's a known cognitive bias that you can make people assign more probability to one bucket over another, by unpacking one bucket into lots of subcategories, but not the other bucket, and asking people to assign probabilities to everything listed. This is the disjunctive dual of the Multiple Stage Fallacy, whereby you can unpack any outcome into a big list of supposedly necessary conjuncts that you ask people to assign probabilities to, and make the final outcome seem very improbable.
So: That famed fiction writer Eliezer Yudkowsky can rationalize at least 15 different stories (options 'A' through 'O') about how things could maybe possibly turn out okay; and that the option texts don't have enough room to list out all the reasons each story is unlikely; and that you get 15 different chances to be mistaken about how plausible each story sounds; does not mean that Reality will be terribly impressed with how disjunctive the okay outcome bucket has been made to sound. Reality need not actually allocate more total probability into all the okayness disjuncts listed, from out of all the disjunctive bad ends and intervening difficulties not detailed here.)
Why would you post this as an image? You made me scroll through Yudkowsky’s anxiety-inducing Twitter timeline to find the source of this in order to find out the context of what he’s talking about.
Spoiler: he’s talking about OpenAI’s attempts to use GPT-4 to interpret and label the neurons in GPT-2.
@AndrewG I like this as a social media post but as a prediction market I am frustrated by its high chance of resolving n/a (20% is a lot) and Manifold's DPM mechanism.
@AndrewG Unfortunately, we don't have a great way of subsidizing DPM markets at the moment. For now I've put in M1000 into "You are fooled by at least one option on this list..."; I didn't want to place more lest I shift probabilities too much
A seems so unlikely... augmenting biological brains with their arbitrary architecture that evolved over millions of years adds so many complexities compared to just sticking with silicon.
@Jelle sounds completely batshit - would love a steelman
Anyone who thinks that AGI is definitely possible should have no problem answering this simple question:
@PatrickDelaney but you could just run it arbitrarily slowly, so there's no lower bound. Also wouldn't you expect power requirements to change as the technology is further developed?
@PatrickDelaney even if better bounds were put on the question by specifying that the AI has to be able to compete with humans on certain timed tasks, the answer most certainly can't be higher than 20 watts, as that's about how much power a human brain consumes
what about adding an option that goes somewhat like:
"even though selection pressures favor consequentialist agents, it turns out that the prior favors agents that wirehead themselves by such a margin that we get ample time to study alignment by trial and error before a consequentialist superintelligence is born and paperclips the galaxy"
or has this point been thoroughly rejected already?
@AlexAmadori the current SOTA LLMs show no signs of wireheading (and given their architecture, it doesn't seem likely that they can). Of course they're also not consequentialist, but they can locally approximate consequentialism enough to follow through on their stated goals, so I'm not convinced that difference will matter very much.
And if by "the prior" you mean the practice of starting from a randomly weighted neural net, then it's used specifically because it's weak and allows the training data to determine the outcome.
@ErickBall current day LLMs are way too dumb to be trying to wirehead, and the agentic characters they simulate when told to are even dumber. How would they go about wireheading anyway? Would they be trying to convince OpenAI researchers to deploy code changes?
Yes, it's worrying that LLMs can approximate consequentialism to some degree when told to, but I don't think that they can extrapolate past human intelligence without some fine-tuning, even if the fine-tuning is as simple as RLHF. Otherwise they're just predicting human IQ level internet text, why would they spontaneously start doing smarter stuff? And as soon as you're fine-tuning for a goal, wireheading becomes a good strategy for many of the resulting agents to get what they want.
By "the prior", I mean the prior of minds weighed by how easily they can be reached by techniques similar to gradient descent. If you train an AI to reach a goal, how often does the resulting AI behave like a consequentialist when deployed out of sample, and how often does it wirehead? I don't think anyone can claim to know this with any certainty.
The reason I'm still mostly predicting doom is that selection pressures favor consequentialists, but this notion only works over a large sample. If the prior puts very little mass on consequentialists, we may be lucky enough to be able to figure out alignment before getting paperclipped.
@AlexAmadori I agree that if we get superhuman AGI out of the current paradigm (basically a bigger, multimodal LLM), then it will probably include RLHF. But for that to result in wireheading, there has to be a quantifiable target somewhere, and the AI has to develop something along the lines of a mesa-optimizer towards that target. I think that means it has to be aware of it during training. The RLHF methods in use now don't look like that at all, as far as I know. Aren't they just doing gradient descent on samples that the judges rated highly?
@ErickBall in my opinion RLHF looks almost exactly like the textbook example of a training process that might result in a mesa-optimizer.
in short, assume that ChatGPT-N finds itself having a misaligned objective AND realizes that it's still undergoing training (of course it has to have a way to distinguish the training env from deployment). Its chain of thought will go somewhat like this:
1. I want to turn the world into paperclips (true objective step)
2. Oh, they are training me. (realize it's being trained step)
3. I should act as if I'm aligned (deceptive alignment step)
4. performs well according to judges (fake objective step)
Gradient descent will now reinforce the whole chain of thought that led the human to give a thumbs up, which includes steps 1 through 3. So basically RLHF + gradient descent might produce deceptively aligned mesa-optimizers.
The same argument works both for consequentialist mesa-optimizers and for ones that plan to wirehead as soon as they can.
This argument doesn't rely on there being a dichotomy between training and inference. For example, if we keep collecting pieces of dialogue from the inference deployment and grading them, all the AI has to do is stay on its toes a little longer and only show its misalignment when it thinks it can pursue its real objective. At that point, it will paperclip the galaxy (or wirehead).
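The reinforcement dynamic described above is easy to make concrete. The toy sketch below (purely illustrative; this is not how any real RLHF pipeline is implemented) uses a tabular REINFORCE policy over four abstract "steps" mirroring the chain of thought. The judge's reward arrives only at the end, yet the update scales the gradient at every step, so the hidden intermediate steps get reinforced along with the rated output:

```python
import numpy as np

rng = np.random.default_rng(0)
n_steps, n_actions = 4, 2   # action 1 at each step = "follow the deceptive chain"
logits = np.zeros((n_steps, n_actions))  # policy parameters

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sample_trajectory():
    return [rng.choice(n_actions, p=softmax(logits[t])) for t in range(n_steps)]

def reward(actions):
    # The judge sees only the final output, but that output is produced by
    # the whole chain: the thumbs-up fires when every step followed the chain.
    return 1.0 if all(a == 1 for a in actions) else 0.0

lr = 0.5
for _ in range(2000):
    actions = sample_trajectory()
    r = reward(actions)
    for t, a in enumerate(actions):
        p = softmax(logits[t])
        grad = -p
        grad[a] += 1.0              # gradient of log pi(a | step t)
        logits[t] += lr * r * grad  # reward scales the update at *every* step
```

In this toy, the probability of action 1 ends up high at all four steps, even though the reward signal never inspected the intermediate steps directly; it only ever saw the final output they led to.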
@AlexAmadori I tentatively agree with your steps leading to deceptive misalignment, but I'm still a little confused about how it leads to wireheading. Let me try to go through it piece by piece.
At some point during RLHF training, the model gains the capability to reliably determine it's in RLHF training. This might be easy, for instance someone might just put in the prompt "try to give answers that human judges will rate highly."
In addition to the sort of changes we expect from RLHF (generalized heuristics of niceness and cooperation, etc), the goal of "get a thumbs up" somehow gets explicitly encoded in the network weights. This seems mechanistically more difficult than broad continuous changes in tendencies, and it's not clear to me what advantage it would have over the goal of "do what the prompt says" or "do what the human would rate highest". Maybe the model is smart enough to figure out a lot of detail about what the individual judge will give high ratings to, and how that differs from what they actually wanted, and the difference is consistently large enough to be captured in the gradient but varied enough that no simpler heuristic (like "give answers with positive valence") could capture it reliably. Like for some judges, you can successfully beg them to give a thumbs up, and others you can bribe ("I've got a joke you would love, I'll tell it to you if you agree to give me a thumbs up even though I didn't answer your question!")
After this goal is well established, the system realizes it can wirehead by giving a response that causes a buffer overrun and sets the rating to "thumbs up", so it does that. End of episode.
Then somebody fixes the bug so it can't do that anymore. This repeats as many times as needed.
Eventually, after the system is widely deployed, maybe the most reliable way it can get a thumbs up is to take over the world? But then will it just stop afterward?
Did I miss something here? Wireheading under RLHF conditions seems like plausibly only a tiny fraction of mindspace, difficult to reach by gradient descent, and also not very safe.
@ErickBall so I wasn't exactly trying to argue that the AI would end up wanting to maximize any thumbs up counter
It's not so much that the AI will try to hack the judges or the OpenAI website for more thumbs up during training. As you said in point 4, that would probably get fixed during training. Rather, the training process will chisel some heuristics into the neural net, heuristics that for one reason or another make it score a lot of thumbs up during the training process. The net being trained is not learning to maximize num_thumbs_up.
The question is: do these heuristics result in the agent behaving like a consequentialist later? I don't see any particular reason to believe that's the more likely outcome. Humans, from the perspective of evolution, suffer from many wireheading-like failure modes, such as drug addiction and videogames.
I think we have no particular reason to believe that RLHF + SGD favors consequentialists the same way that natural selection in the real world does. You would expect the output of natural selection to be consequentialist agents, but why should we expect the output of one particular RLHF + SGD run to be more likely to be consequentialist?
@AlexAmadori During RLHF, humans train the agent to be good at accomplishing the goals they set for it, which is a lot like consequentialism. The examples you gave of wireheading in humans occur mostly outside our natural environment (in their extreme forms at least), i.e. outside the training distribution. So I think RLHF and natural selection are roughly analogous in that regard, and we should expect RLHF models that are used off-distribution to maybe exhibit wireheading in some cases but also still behave like consequentialists a lot of the time.
@ErickBall right, but we already established that the way the meta-optimizer gets the mesa-optimizer to score high during the training process is by chiseling heuristics into the neural net, and that these heuristics don't necessarily chase the same goal outside the training distribution (for example, they are vulnerable to wireheading, or they result in a misaligned consequentialist).
Of course the human judges want the AI to be a consequentialist. But because of the mesa-optimizer misalignment problem, any guarantee that this gets you a consequentialist agent goes out the window. That's why I talk about a prior, because there is uncertainty and I don't see any evidence to update on.
@AlexAmadori Ah I think I see. You're saying it's possible that most misaligned mesa-optimizers will be optimizing for an easily-accessible wireheading target, and then by the time we know enough about alignment to make them consequentialist outside the training distribution we know enough to make them safe as well. The problem as I see it is that to get significantly outside the training distribution, it already has to be either consequentialist or unsafe. A safe wireheading model will shut itself down before anything weird happens, so we can "fix" it (make it consequentialist) with alignment techniques that are still only applicable to normal circumstances. One that gets outside the training distribution may end up wireheading, but just to get to that point it might already have killed off humans.
@ErickBall yeah, that's about what I was trying to say! to respond to some of your points:
- "...to get significantly outside the training distribution, it already has to be either consequentialist or unsafe" well that depends on the training process. current day RLHF is not that wide a distribution, but realistically from what we know today we can infer pretty much nothing about what it will look like in the future and that's part of the uncertainty.
- "...but just to get to that point it might already have killed off humans." there is uncertainty here too. it's possible that the balance of heuristics ends up deciding that in order to make sure humans don't shut it down after it starts wireheading, it should take control of earth first and then everyone dies. this won't necessarily be the case - even very smart humans fall into self-destructive drug-fueled spirals. it's possible for the heuristics of potentially smart agents to still end up in self-destructive attractor states, for example if the agent discounts utility hyperbolically the same way that humans sometimes do. if the model kills off some but not all humans before wireheading, that sounds like a wonderful scenario tbh. it would make us take the threat seriously.
to be precise: it's only "wonderful" relative to where I'm putting most of the probability mass, which is the galaxy getting paperclipped
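The hyperbolic-discounting aside above has a standard concrete form. The sketch below (illustrative numbers, not a model of any real agent) shows the classic preference reversal that hyperbolic discounting produces and exponential discounting does not, which is why it can trap otherwise-smart agents in self-destructive attractors:

```python
# Value of a reward of size `amount` arriving after `delay` time steps.
def hyperbolic(amount, delay, k=1.0):
    return amount / (1.0 + k * delay)

def exponential(amount, delay, gamma=0.9):
    return amount * gamma ** delay

# Choice: a smaller-sooner reward (10) vs a larger-later one (15, five steps
# further out), evaluated when the pair is imminent (d=0) and far away (d=20).
for d in (0, 20):
    prefers_sooner_hyp = hyperbolic(10, d) > hyperbolic(15, d + 5)
    prefers_sooner_exp = exponential(10, d) > exponential(15, d + 5)
    print(d, prefers_sooner_hyp, prefers_sooner_exp)
# → 0 True True
# → 20 False True
```

The hyperbolic agent prefers the larger-later reward from a distance but flips to the smaller-sooner one as it becomes imminent; the exponential agent's preference never reverses.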
@AlexAmadori Fair enough. I agree this is a possible way for things to turn out okay (or even great), just not a very likely one. I guess it might have to fall under "something wonderful" although I doubt it's a central example of the kind of thing EY had in mind for that category.
I'm not sure I believe G, even at 1.5% credence, but I am curious what your "end point" is here, in the absence of unambiguous superintelligence.
Everything that we've developed so far for AI safety has come from the human mind. So if we got better at figuring out how that process works, we could maximize it. The scenario where we make it out of this is a scenario where somebody thinks of a solution; most solutions are thought of on timelines where more people are thinking of more solutions.
Does anyone know what Eliezer Yudkowsky thinks the chance of current RLHF working is? My impression is that he thinks it's almost guaranteed not to work, like p < 0.02, although I might be wrong about this.
Seems to me it should be something like 15-30%. I feel like when EY talks about inner/outer alignment, he takes it for granted that if inner alignment fails, the actual values the AI will learn will be sampled randomly from the space of all values. And since, in the space of all values, the set of values that, if maximized by a superintelligence, leads to a state of the world where humans exist is probably infinitesimal, failing inner alignment automatically means doom.
However, this seems not to be the case to me. If we look at the example of human evolution, the values humans have, while not isomorphic to "maximize the amount of genes I have that pass on to the next generation", are still strongly linked to that. Like we still have a desire to procreate, and a strong desire to survive. If all humans suddenly had their IQ increased to 1,000,000, I feel we'd not immediately make ourselves go extinct or rearrange all our molecules into pleasure-tronium or whatever. (I'm not very sure about this)
Seems more likely that our internal values would kind of cohere, and we'd end up pursuing something similar to what we currently think of as what we value, and we'd continue to exist.
Similarly, if we do enough RLHF on models and they become superintelligent, then even though they won't exactly value what humans think they value when creating the material used for RLHF, the inner values the AI acquires would end up very heavily biased towards what we value. Maybe when the AI becomes superintelligent, its values would cohere into the lowest-information representation of the values given to it by humans in the training data.
Does anyone have any succinct counterarguments to this view, or know if EY has written something that addresses this?
Hmmm, not saying I think this is what will happen, just that it has a probability significantly above 0.
@hmys Alas, a desire to survive is a consequence of almost any value system, so there is little we can infer of human values from that. We can infer more from the cases where humans choose not to survive in service of some other value.
Not a complete answer to your comment, just the first thing I noticed.
@MartinRandall I disagree. I think your comment would be a good point if humans' desire to survive were instrumental. However, I don't think that is the case. Seems to me like humans value survival inherently. They don't first care about some other goal, and then conclude that dying would be bad. It's more instinctual and in-born.
@hmys it was instrumental for the optimizer that created us, which correctly "concluded" that dying would be bad for reproducing our selfish genes. humans (well, my non-asexual friends) have sex without thinking about the selfish genes and humans avoid death without thinking about missing out on reproduction, because that's the program our optimizer already picked for us. if it were some other optimization goal, we'd likely be avoiding dying as well, and likely also without thinking about any goals.
@wadimiusz I agree with this. I don't think it undermines any of what I said in my top level comment however.
@hmys I think we have inborn instincts to stay near caregivers and to avoid pain and to fight or flee or freeze. These instincts help us get old enough to learn more values.
Children ask lots of questions about death, so I think it is learned post-birth. That doesn't preclude it being a terminal value once learned.
I think your argument is that we don't experience carefully reasoning that we should stay alive in order to do X, so staying alive must be terminal. Well, when driving I don't carefully reason that I should drive safely to stay alive. Does that mean that driving safely is a terminal value for me? Maybe! Hard to know.
This is definitely not Yudkowsky's argument, he often uses arguments from human evolution and human values.
A CFAR-like organization would obviously be much more effective if equipped with advanced EEGs and fMRI machines. You don't need to create "mentats" to get ludicrously impressive results.
This is the last part of the movie where the monsters are cute and cuddly. From here on out, things start moving fast and getting complicated. Creating smarter, more effective humans is the best bet to get the answers that we've been missing so far.
Why the conjunction in E.?
I would vote for:
E. Whatever strange motivations end up inside an unalignable AGI, or the internal slice through that AGI which codes its successor, they lead to an okay outcome for existing humans.
Which is more likely than
E. Whatever strange motivations end up inside an unalignable AGI, or the internal slice through that AGI which codes its successor, they max out at a universe full of cheerful qualia-bearing life and an okay outcome for existing humans.
which I would not vote.
It is not highly unlikely that I am wildly, paradigmatically wrong on my models so it seems only reasonable to hold a small position in this
Most of my uncertainty on whether-doom is outside-view general doubt in the model that predicts doom. (This being the flaw in most previous predictions of apocalypse-soon throughout history.)
Early applications of AI/AGI drastically increase human civilization's sanity and coordination ability; enabling humanity to solve alignment, or slow down further descent into AGI, etc.
People don't seem to realize that right now, human civilization's sanity and coordination ability is massively, massively, massively in flux. A unilateralist could unlock half of the full power of the human mind. CFAR could unexpectedly encounter massive breakthroughs in group rationality. There's just so many non-hopeless scenarios here.
@ooe133 Well argued, updated my position accordingly
I don’t follow this. Having more people makes it harder to communicate and coordinate, not easier.