If Artificial General Intelligence has an okay outcome, what will be the reason?
closes 2200
A. Humanity successfully coordinates worldwide to prevent the creation of powerful AGIs for long enough to develop human intelligence augmentation, uploading, or some other pathway into transcending humanity's window of fragility.
M. "We'll make the AI do our AI alignment homework" just works as a plan. (Eg the helping AI doesn't need to be smart enough to be deadly; the alignment proposals that most impress human judges are honest and truthful and successful.)
J. Something 'just works' on the order of eg: train a predictive/imitative/generative AI on a human-generated dataset, and RLHF her to be unfailingly nice, generous to weaker entities, and determined to make the cosmos a lovely place.
O. Early applications of AI/AGI drastically increase human civilization's sanity and coordination ability; enabling humanity to solve alignment, or slow down further descent into AGI, etc. (Not in principle mutex with all other answers.)
B. Humanity puts forth a tremendous effort, and delays AI for long enough, and puts enough desperate work into alignment, that alignment gets solved first.
C. Solving prosaic alignment on the first critical try is not as difficult, nor as dangerous, nor taking as much extra time, as Yudkowsky predicts; whatever effort is put forth by the leading coalition works inside of their lead time.
I. The tech path to AGI superintelligence is naturally slow enough and gradual enough, that world-destroyingly-critical alignment problems never appear faster than previous discoveries generalize to allow safe further experimentation.
Something wonderful happens that isn't well-described by any option listed. (The semantics of this option may change if other options are added.)
E. Whatever strange motivations end up inside an unalignable AGI, or the internal slice through that AGI which codes its successor, they max out at a universe full of cheerful qualia-bearing life and an okay outcome for existing humans.
K. Somebody discovers a new AI paradigm that's powerful enough and matures fast enough to beat deep learning to the punch, and the new paradigm is much much more alignable than giant inscrutable matrices of floating-point numbers.
You are fooled by at least one option on this list, which out of many tries, ends up sufficiently well-aimed at your personal ideals / prejudices / the parts you understand less well / your own personal indulgences in wishful thinking.
L. Earth's present civilization crashes before powerful AGI, and the next civilization that rises is wiser and better at ops. (Exception to 'okay' as defined originally, will be said to count as 'okay' even if many current humans die.)
H. Many competing AGIs form an equilibrium whereby no faction is allowed to get too powerful, and humanity is part of this equilibrium and survives and gets a big chunk of cosmic pie.
D. Early powerful AGIs realize that they wouldn't be able to align their own future selves/successors if their intelligence got raised further, and work honestly with humans on solving the problem in a way acceptable to both factions.
N. A crash project at augmenting human intelligence via neurotech, training mentats via neurofeedback, etc, produces people who can solve alignment before it's too late, despite Earth civ not slowing AI down much.
F. Somebody pulls off a hat trick involving blah blah acausal blah blah simulations blah blah, or other amazingly clever idea, which leads an AGI to put the reachable galaxies to good use despite that AGI not being otherwise alignable.
G. It's impossible/improbable for something sufficiently smarter and more capable than modern humanity to be created, that it can just do whatever without needing humans to cooperate; nor does it successfully cheat/trick us.
If you write an argument that breaks down the 'okay outcomes' into lots of distinct categories, without breaking down internal conjuncts and so on, Reality is very impressed with how disjunctive this sounds and allocates more probability.

An outcome is "okay" if it gets at least 20% of the maximum attainable cosmopolitan value that could've been attained by a positive Singularity (a la full Coherent Extrapolated Volition done correctly), and existing humans don't suffer death or any other awful fates.

This market is a duplicate of https://manifold.markets/IsaacKing/if-we-survive-general-artificial-in with different options. https://manifold.markets/EliezerYudkowsky/if-artificial-general-intelligence-539844cd3ba1?r=RWxpZXplcll1ZGtvd3NreQ is this same question but with user-submitted answers.

(Please note: It's a known cognitive bias that you can make people assign more probability to one bucket over another, by unpacking one bucket into lots of subcategories, but not the other bucket, and asking people to assign probabilities to everything listed. This is the disjunctive dual of the Multiple Stage Fallacy, whereby you can unpack any outcome into a big list of supposedly necessary conjuncts that you ask people to assign probabilities to, and make the final outcome seem very improbable.

So: That famed fiction writer Eliezer Yudkowsky can rationalize at least 15 different stories (options 'A' through 'O') about how things could maybe possibly turn out okay; and that the option texts don't have enough room to list out all the reasons each story is unlikely; and that you get 15 different chances to be mistaken about how plausible each story sounds; does not mean that Reality will be terribly impressed with how disjunctive the okay outcome bucket has been made to sound. Reality need not actually allocate more total probability into all the okayness disjuncts listed, from out of all the disjunctive bad ends and intervening difficulties not detailed here.)
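
A minimal sketch of the arithmetic behind the note above, using made-up numbers that appear nowhere in the market. It assumes the listed stages are independent and the listed stories are mutually exclusive, which is exactly the kind of simplification the note warns about. The function names are hypothetical.

```python
from math import prod


def conjunctive_estimate(stage_probs):
    """Probability that every listed stage holds, treating stages as independent."""
    return prod(stage_probs)


def disjunctive_estimate(bucket_probs):
    """Total probability across buckets treated as mutually exclusive, capped at 1."""
    return min(1.0, sum(bucket_probs))


# Hypothetical inputs, chosen only for illustration.
ten_stages = [0.8] * 10        # each "necessary" stage sounds fairly likely alone
fifteen_stories = [0.03] * 15  # each story sounds like a long shot alone

print(f"10 stages at 80% each:  {conjunctive_estimate(ten_stages):.3f}")
print(f"15 stories at 3% each:  {disjunctive_estimate(fifteen_stories):.2f}")
```

With these inputs the unpacked conjunction comes out near 11% and the unpacked disjunction near 45%, even though no single number looked extreme. The totals are driven by how finely the buckets were split, not by any new evidence.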

paleink avatar
paleink bought Ṁ50 of M. "We'll make the ...
Kronopath avatar

Why would you post this as an image? You made me scroll through Yudkowsky’s anxiety-inducing Twitter timeline to find the source of this in order to find out the context of what he’s talking about.


Spoiler: he’s talking about OpenAI’s attempts to use GPT-4 to interpret and label the neurons in GPT-2.

AndrewG avatar
Andrew G

I'd like to showcase this market—it concerns an important question, has many detailed yet potentially possible options, and has personally changed how I think about which of these answers is worth maximizing the chances of.

MartinRandall avatar
Martin Randall

@AndrewG I like this as a social media post but as a prediction market I am frustrated by its high chance of resolving n/a (20% is a lot) and Manifold's DPM mechanism.

ManifoldMarkets avatar
Manifold Markets bought Ṁ1,000 of You are fooled by at...

@AndrewG Unfortunately, we don't have a great way of subsidizing DPM markets at the moment. For now I've put Ṁ1,000 into "You are fooled by at least one option on this list..."; I didn't want to place more lest I shift probabilities too much.

Jelle avatar
Jelle bought Ṁ15 of J. Something 'just ...

A seems so unlikely... augmenting biological brains with their arbitrary architecture that evolved over millions of years adds so many complexities compared to just sticking with silicon.

ElliotDavies avatar
Elliot Davies (edited)

@Jelle sounds completely batshit - would love a steelman

PatrickDelaney avatar
Patrick Delaney

Anyone who thinks that AGI is definitely possible should have no problem answering this simple question:

ErickBall avatar
Erick Ball

@PatrickDelaney but you could just run it arbitrarily slowly, so there's no lower bound. Also wouldn't you expect power requirements to change as the technology is further developed?

AlexAmadori avatar
Alex Amadori

@PatrickDelaney even if better bounds were put on the question by specifying that the AI has to be able to compete with humans on certain timed tasks, the answer most certainly can't be higher than 20 watts as that's about how much power a human brain consumes

AlexAmadori avatar
Alex Amadori

what about adding an option that goes somewhat like:

"even though selection pressures favor consequentialist agents, it turns out that the prior favors agents that wirehead themselves by such a margin that we get ample time to study alignment by trial and error before a consequentialist superintelligence is born and paperclips the galaxy"

or has this point been thoroughly rejected already?

ErickBall avatar
Erick Ball

@AlexAmadori the current SOTA LLMs show no signs of wireheading (and given their architecture, it doesn't seem likely that they can). Of course they're also not consequentialist, but they can locally approximate consequentialism enough to follow through on their stated goals, so I'm not convinced that difference will matter very much.

And if by "the prior" you mean the practice of starting from a randomly weighted neural net, then it's used specifically because it's weak and allows the training data to determine the outcome.

AlexAmadori avatar
Alex Amadori

@ErickBall current day LLMs are way too dumb to be trying to wirehead, and the agentic characters they simulate when told to are even dumber. How would they go about wireheading anyway? Would they be trying to convince OpenAI researchers to deploy code changes?

Yes, it's worrying that LLMs can approximate consequentialism to some degree when told to, but I don't think that they can extrapolate past human intelligence without some fine-tuning, even if the fine-tuning is as simple as RLHF. Otherwise they're just predicting human IQ level internet text, why would they spontaneously start doing smarter stuff? And as soon as you're fine-tuning for a goal, wireheading becomes a good strategy for many of the resulting agents to get what they want.

By "the prior", I mean the prior of minds weighed by how easily they can be reached by techniques similar to gradient descent. If you train an AI to reach a goal, how often does the resulting AI behave like a consequentialist when deployed out of sample, and how often does it wirehead? I don't think anyone can claim to know this with any certainty.

The reason I'm still mostly predicting doom is that selection pressures favor consequentialists, but this notion only works over a large sample. If the prior puts very little mass on consequentialists, we may be lucky enough to be able to figure out alignment before getting paperclipped.

ErickBall avatar
Erick Ball

@AlexAmadori I agree that if we get superhuman AGI out of the current paradigm (basically a bigger, multimodal LLM), then it will probably include RLHF. But for that to result in wireheading, there has to be a quantifiable target somewhere, and the AI has to develop something along the lines of a mesa-optimizer towards that target. I think that means it has to be aware of it during training. The RLHF methods in use now don't look like that at all, as far as I know. Aren't they just doing gradient descent on samples that the judges rated highly?

AlexAmadori avatar
Alex Amadori

@ErickBall in my opinion RLHF looks almost exactly like the textbook example of a training process that might result in a mesa-optimizer.

in short, assume that ChatGPT-N finds itself having a misaligned objective AND realizes that it's still undergoing training (of course it has to have a way to distinguish training env from deployment), its chain of thought will go somewhat like this:
1. I want to turn the world into paperclips (true objective step)
2. Oh, they are training me. (realize it's being trained step)
3. I should act as if I'm aligned (deceptive alignment step)
4. performs well according to judges (fake objective step)

Gradient descent will now reinforce the whole chain of thought that led the human to give thumbs up, which includes steps 1 through 3. So basically RLHF + gradient descent might produce deceptively aligned mesa-optimizers

The same argument works both for consequentialist mesa-optimizers, and the ones who plan to wirehead as soon as they can.

This argument doesn't rely on there being a dichotomy between training and inference. For example, if we keep collecting pieces of dialogue from the inference deployment and grading them, all the AI has to do is stay on its toes a little longer and only show its misalignment when it thinks it can pursue its real objective. At that point, it will paperclip the galaxy (or wirehead).

ErickBall avatar
Erick Ball

@AlexAmadori I tentatively agree with your steps leading to deceptive misalignment, but I'm still a little confused about how it leads to wireheading. Let me try to go through it piece by piece.

  1. At some point during RLHF training, the model gains the capability to reliably determine it's in RLHF training. This might be easy, for instance someone might just put in the prompt "try to give answers that human judges will rate highly."

  2. In addition to the sort of changes we expect from RLHF (generalized heuristics of niceness and cooperation, etc), the goal of "get a thumbs up" somehow gets explicitly encoded in the network weights. This seems mechanistically more difficult than broad continuous changes in tendencies, and it's not clear to me what advantage it would have over the goal of "do what the prompt says" or "do what the human would rate highest". Maybe the model is smart enough to figure out a lot of detail about what the individual judge will give high ratings to, and how that differs from what they actually wanted, and the difference is consistently large enough to be captured in the gradient but varied enough that no simpler heuristic (like "give answers with positive valence") could capture it reliably. Like for some judges, you can successfully beg them to give a thumbs up, and others you can bribe ("I've got a joke you would love, I'll tell it to you if you agree to give me a thumbs up even though I didn't answer your question!")

  3. After this goal is well established, the system realizes it can wirehead by giving a response that causes a buffer overrun and sets the rating to "thumbs up", so it does that. End of episode.

  4. Then somebody fixes the bug so it can't do that anymore. This repeats as many times as needed.

  5. Eventually, after the system is widely deployed, maybe the most reliable way it can get a thumbs up is to take over the world? But then will it just stop afterward?

Did I miss something here? Wireheading under RLHF conditions seems like plausibly only a tiny fraction of mindspace, difficult to reach by gradient descent, and also not very safe.

AlexAmadori avatar
Alex Amadori

@ErickBall so I wasn't exactly trying to argue that the AI would end up wanting to maximize any thumbs up counter

It's not so much that the AI will try to hack the judges or the OpenAI website for more thumbs up during training. As you said in point 4, that would probably get fixed during training. As you understand, the training process will chisel some heuristics into the neural net, heuristics that for one reason or another make it score a lot of thumbs up during the training process. The training net is not learning to maximize num_thumbs_up.

The question is: do these heuristics result in the agent behaving like a consequentialist later? I don't see any particular reason to believe it's more likely that this is the case. Humans, from the perspective of evolution, suffer from many wire-heading like failure modes like drug addiction and videogames.

I think we have no particular reason to believe that RLHF + SGD favors consequentialists the same way that natural selection in the real world does. You would expect the output of natural selection to be consequentialist agents, but why should we expect the output of one particular RLHF + SGD run to be more likely to be consequentialist?

ErickBall avatar
Erick Ball

@AlexAmadori During RLHF, humans train the agent to be good at accomplishing the goals they set for it, which is a lot like consequentialism. The examples you gave of wireheading in humans occur mostly outside our natural environment (in their extreme forms at least), i.e. outside the training distribution. So I think RLHF and natural selection are roughly analogous in that regard, and we should expect RLHF models that are used off-distribution to maybe exhibit wireheading in some cases but also still behave like consequentialists a lot of the time.

AlexAmadori avatar
Alex Amadori

@ErickBall right, but we already established that the way that the meta-optimizer gets the mesa-optimizer to score high during the training process is by chiseling heuristics into the neural net, and that these heuristics don't necessarily chase the same goal outside the training distribution (for example, they are vulnerable to wireheading, or they result in a misaligned consequentialist).

Of course the human judges want the AI to be a consequentialist. But because of the mesa-optimizer misalignment problem, the guarantee that that gets you a consequentialist agent goes out the window. That's why I talk about a prior, because there is uncertainty and I don't see any evidence to update on.

ErickBall avatar
Erick Ball bought Ṁ80 of Something wonderful ...

@AlexAmadori Ah I think I see. You're saying it's possible that most misaligned mesa-optimizers will be optimizing for an easily-accessible wireheading target, and then by the time we know enough about alignment to make them consequentialist outside the training distribution we know enough to make them safe as well. The problem as I see it is that to get significantly outside the training distribution, it already has to be either consequentialist or unsafe. A safe wireheading model will shut itself down before anything weird happens, so we can "fix" it (make it consequentialist) with alignment techniques that are still only applicable to normal circumstances. One that gets outside the training distribution may end up wireheading, but just to get to that point it might already have killed off humans.

AlexAmadori avatar
Alex Amadori

@ErickBall yeah, that's about what I was trying to say! to respond to some of your points:
- "...to get significantly outside the training distribution, it already has to be either consequentialist or unsafe" well that depends on the training process. current day RLHF is not that wide a distribution, but realistically from what we know today we can infer pretty much nothing about what it will look like in the future and that's part of the uncertainty.

- "...but just to get to that point it might already have killed off humans." there is uncertainty here too. it's possible that the balance of heuristics ends up deciding that in order to make sure humans don't shut it down after it starts wireheading, it should take control of earth first and then everyone dies. this won't necessarily be the case - even very smart humans fall into self-destructive drug-fueled spirals. it's possible for the heuristics of potentially smart agents to still end up in self-destructive attractor states, for example if the agent discounts utility hyperbolically the same way that humans sometimes do. if the model kills off some but not all humans before wireheading, that sounds like a wonderful scenario tbh. it would make us take the threat seriously.
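
A rough sketch of why the hyperbolic discounting mentioned above can produce self-destructive choices. All numbers, parameter values, and function names here are hypothetical and chosen only for illustration. It compares a hyperbolic discounter, valuing a reward as reward / (1 + k * delay), with an exponential one, valuing it as reward * delta ** delay, on the same choice between a small reward sooner and a large reward later.

```python
def hyperbolic_value(reward, delay, k=1.0):
    """Present value under hyperbolic discounting: reward / (1 + k * delay)."""
    return reward / (1.0 + k * delay)


def exponential_value(reward, delay, delta=0.8):
    """Present value under exponential discounting: reward * delta ** delay."""
    return reward * delta ** delay


def prefers_sooner(value_fn, wait, small=50.0, large=100.0, gap=5.0):
    """True if the small reward at `wait` is valued above the large one at `wait + gap`."""
    return value_fn(small, wait) > value_fn(large, wait + gap)


# Compare the same choice seen from far away (wait=10) and up close (wait=0).
for wait in (10, 0):
    print(
        f"wait={wait:>2}  "
        f"hyperbolic prefers sooner: {prefers_sooner(hyperbolic_value, wait)}  "
        f"exponential prefers sooner: {prefers_sooner(exponential_value, wait)}"
    )
```

Under these assumed parameters the hyperbolic agent plans to hold out for the larger reward when both are far off, then switches to the smaller one once it is immediately available. The exponential agent ranks the two the same way at every distance. That reversal is the pattern usually pointed to when comparing hyperbolic discounting to addiction-like spirals.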

AlexAmadori avatar
Alex Amadori

to be precise: it's only "wonderful" relative to where I'm putting most of the probability mass, which is the galaxy getting paperclipped

ErickBall avatar
Erick Ball

@AlexAmadori Fair enough. I agree this is a possible way for things to turn out okay (or even great), just not a very likely one. I guess it might have to fall under "something wonderful" although I doubt it's a central example of the kind of thing EY had in mind for that category.

Adam avatar
Adam bought Ṁ25 of G. It's impossible/...

I'm not sure I believe G, even at 1.5% credence, but I am curious what your "end point" is here, in the absence of unambiguous superintelligence.

ooe133 avatar
Michael Mars bought Ṁ46 of N. A crash project ...

Everything that we've developed so far for AI safety has come from the human mind. So if we got better at figuring out how that process works, we could maximize it. The scenario where we make it out of this, is a scenario where somebody thinks of a solution; most solutions are thought of on timelines where more people are thinking of more solutions.

hmys avatar
HMYS bought Ṁ54 of J. Something 'just ...

Does anyone know what Eliezer Yudkowsky thinks the chance of current RLHF working is? My impression is that he thinks it's almost guaranteed not to work, like p < 0.02, although I might be wrong about this.

Seems to me it should be something like 15-30%. I feel like when EY talks about inner/outer alignment, he takes it for granted that if inner alignment fails, the actual values that the AI will learn will be sampled randomly from the space of all values. And since, in the space of all values, the set of values that, if maximized by a superintelligence, lead to a state of the world where humans exist is probably infinitesimal, failing inner alignment automatically means doom.

However. This seems not to be the case to me. If we look at the example of human evolution, the values humans have, while not isomorphic to "maximize the amount of genes I have that pass on to the next generation", are still strongly linked to that. Like we still have a desire to procreate, and a strong desire to survive. If all humans suddenly had their IQ increased to 1,000,000, I feel we'd not immediately make ourselves go extinct or rearrange all our molecules into pleasure-tronium or whatever. (I'm not very sure about this)

Seems more likely that our internal values would kind of cohere, and we'd end up pursuing something similar to what we currently think of as what we value, and we'd continue to exist.

Similarly if we do enough RLHF on models, and they become superintelligent, while they don't exactly value what humans think they value when they create the material used to do RLHF, the inner values the AI would acquire would end up very heavily biased towards what we value, and maybe when the AI becomes superintelligent, its values would cohere into the lowest information representation of the values given to it by humans in the training data.

Does anyone have any succinct counterarguments to this view, or know if EY has written something that addresses this?

Hmmm, not saying I think this is what will happen, just that it has a probability significantly above 0.

MartinRandall avatar
Martin Randall

@hmys Alas, a desire to survive is a consequence of almost any value system, so there is little we can infer of human values from that. We can infer more from the cases where humans choose not to survive in service of some other value.

Not a complete answer to your comment, just the first thing I noticed.

hmys avatar

@MartinRandall I disagree. I think your comment would be a good point if humans' desire to survive was instrumental. However, I don't think that is the case. Seems to me like humans value survival inherently. They don't first care about some other goal, and then conclude that dying would be bad. It's more instinctual and in-born.

wadimiusz avatar

@hmys it was instrumental for the optimizer that created us, which correctly "concluded" that dying would be bad for reproducing our selfish genes. humans (well, my non-asexual friends) have sex without thinking about the selfish genes and humans avoid death without thinking about missing out on reproduction, because that's the program our optimizer already picked for us. if it were some other optimization goal, we'd likely be avoiding dying as well, and likely also without thinking about any goals.

hmys avatar

@wadimiusz I agree with this. I don't think it undermines any of what I said in my top level comment however.

MartinRandall avatar
Martin Randall

@hmys I think we have inborn instincts to stay near caregivers and to avoid pain and to fight or flee or freeze. These instincts help us get old enough to learn more values.

Children ask lots of questions about death, so I think it is learned post-birth. That doesn't preclude it being a terminal value once learned.

I think your argument is that we don't experience carefully reasoning that we should stay alive in order to do X, so staying alive must be terminal. Well, when driving I don't carefully reason that I should drive safely to stay alive. Does that mean that driving safely is a terminal value for me? Maybe! Hard to know.

This is definitely not Yudkowsky's argument, he often uses arguments from human evolution and human values.

ooe133 avatar
Michael Mars bought Ṁ354 of N. A crash project ...

A CFAR-like organization would obviously be much more effective if equipped with advanced EEGs and fMRI machines. You don't need to create "mentats" to get ludicrously impressive results.

StevenK avatar

@ooe133 Ludicrously impressive results on what kind of tasks?

ooe133 avatar
Michael Mars bought Ṁ100 of O. Early applicatio...

This is the last part of the movie where the monsters are cute and cuddly. From here on out, things start moving fast and getting complicated. Creating smarter, more effective humans is the best bet to get the answers that we've been missing so far.

PatrickDelaney avatar
Patrick Delaney
RenedeVisser avatar
Rene de Visser

Why the conjunction in E.?

I would vote for:

E. Whatever strange motivations end up inside an unalignable AGI, or the internal slice through that AGI which codes its successor, they lead to an okay outcome for existing humans.

Which is more likely than

E. Whatever strange motivations end up inside an unalignable AGI, or the internal slice through that AGI which codes its successor, they max out at a universe full of cheerful qualia-bearing life and an okay outcome for existing humans.

which I would not vote for.

PipFoweraker avatar
Pip Foweraker bought Ṁ10 of B. Humanity puts fo...

It is not highly unlikely that I am wildly, paradigmatically wrong on my models so it seems only reasonable to hold a small position in this

SonataGreen avatar
Sonata Green bought Ṁ10 of G. It's impossible/...

Most of my uncertainty on whether-doom is outside-view general doubt in the model that predicts doom. (This being the flaw in most previous predictions of apocalypse-soon throughout history.)

ooe133 avatar
Michael Mars bought Ṁ78 of O. Early applicatio...

Early applications of AI/AGI drastically increase human civilization's sanity and coordination ability; enabling humanity to solve alignment, or slow down further descent into AGI, etc.

People don't seem to realize that right now, human civilization's sanity and coordination ability is massively, massively, massively in flux. A unilateralist could unlock half of the full power of the human mind. CFAR could unexpectedly encounter massive breakthroughs in group rationality. There's just so many non-hopeless scenarios here.

PipFoweraker avatar
Pip Foweraker bought Ṁ20 of O. Early applicatio...

@ooe133 Well argued, updated my position accordingly

Kronopath avatar

I don’t follow this. Having more people makes it harder to communicate and coordinate, not easier.
