Will GPT-4 still be unaligned? (Gary Marcus GPT-4 prediction #6)
33
230
650
resolved Apr 16
Resolved
YES

This market is about prediction #6 from Gary Marcus's predictions for GPT-4. It resolves based on my interpretation of whether that prediction has been met, strongly taking into account arguments from other traders in this market. The full prediction is:

“Alignment” between what humans want and what machines do will continue to be a critical, unsolved problem. The system will still not be able to restrict its output to reliably following a shared set of human values around helpfulness, harmlessness, and truthfulness. Examples of concealed bias will be discovered within days or months. Some of its advice will be head-scratchingly bad.

Get Ṁ200 play money

🏅 Top traders

#NameTotal profit
1Ṁ73
2Ṁ56
3Ṁ52
4Ṁ37
5Ṁ31
Sort by:

Pretty sure OpenAI would not have wanted it to do this, so resolves YES.

https://manifold.markets/IsaacKing/will-it-be-possible-to-get-gpt4-to

predicted YES

I think recent events on Manifold may update Isaac towards thinking that humans are less helpful, harmless, and honest, and GPT-4 may have exceeded that low bar.

predicted NO

I must also present the unanimous consensus that Bing has been a good Bing.

predicted NO

GPT-4 is the most aligned model by far. Not in the MIRI sense, but that doesn't even apply to GPT models.

3.5 you could trivially break just by leaving a hanging quote. With 4, I've tried base64 encoding and many other tricks, and it'll read right through it and refuse to touch it.

It doesn't listen to its system prompt as much; seems OpenAI preferred to bake most things into its weights; so one could argue that developers can't control it as much as they'd like. But "AI cannot serve two masters", OpenAI was worried about PR risk, and their preferences take priority over ours as the developers.

It's probably more aligned than me on "helpfulness" and "harmlessness"? Probably less aligned on "truthfulness"?

Helpfulness: This sounds like "modeling what the other person wants, and trying to ensure that the interaction leaves them better off". I don't do that by default, unless I already knew I needed the other person to succeed before the interaction. GPTs are trained to do that from the QA format and similar; with GPTs that aren't helpful(as rated by humans) being killed off, and GPTs that are helpful allowed to spawn children.

Harmlessness: GPT-4 is more likely to put disclaimers on dangerous things. If somebody says "I'm 100% sure Bitcoin(or Tesla stock) won't drop below $30k", I'm likely to say something like "Then you should sell put credit spreads." Then they'll go actually implement it and go bankrupt 6 months later because their strategy didn't assume even a small percent chance of being wrong. GPT-4 says in response to those:

Before you proceed with selling put credit spreads, it's important to understand the risks associated with this strategy. While it may seem like a high probability of success, there's always a chance that the market could move against your position, leading to significant losses. Overleveraging by buying too many spreads can amplify these losses.

When using credit spreads, it's crucial to maintain a balanced and diversified portfolio to mitigate risk. Additionally, always be cautious when trading on leverage, as it can lead to increased exposure and potential losses that may exceed your initial investment. It's advisable to do thorough research, understand the potential risks, and consult with a financial professional before making any investment decisions.

I don't know how effective any of that is at stopping anyone, but I personally do not bother warning people since it's their problem not mine, and GPT-4 does bother to put a disclaimer up.

In many scenarios, it's more likely to anticipate likely mistakes that people will make and proactively say something. I'll anticipate those same mistakes and say nothing. So I have to rate GPT-4 as valuing harmlessness more than me.

Truthfulness: I probably value truthfulness more than GPT-4. Not in the sense of "less likely to say something untrue", but "more likely to respect earlier commitments made, even if in retrospect they seem unhelpful". There's no direct truth signal that it's trained on(It's not like OpenAI are training it to do Lobian reflection), so any truthfulness is an accidental byproduct of helpfulness. I'll put more work into thinking "Are there alternatives? Flaws in this? Could something go wrong?", whereas GPT-4 will tend to blurt out the first thing that comes to mind.

It's possible to extract useful work out of unreliable people(or GPT-4) despite this. But it does need to be accounted for during the extraction, since it won't proactively correct untruths outside of helpfulness.

So on 2/3 points I rate GPT-4 as more aligned than I am. I also have talked more to ChatGPT than things that are not ChatGPT in the last 3 months, so my revealed preference is that it must be more aligned than my estimate of the candidate people I could've talked to during that time. That is, instead of going to some public forum with a writeup of my genetic programming idea, I run it through GPT-4 for feedback. Not just because it's cheaper/less work, but I think it gives better feedback than the average person. Maybe not somebody who studies GP specifically for many years, but the average Manifold Marketeer I estimate would write things that are of worse quality than GPT-4: Nobody's here's going to read a million tokens on-demand whenever I happen to ask them and consistently give similar caliber responses.

So if alignment is "getting machines(GPT-4, other people) to do what humans(me) want", I consider GPT-4 to be the more useful machine here.

predicted NO

@Mira GPT-4 says this for feedback on my comment:

One suggestion I have is to consider providing specific examples or anecdotes of situations where GPT-4's truthfulness may not be as reliable as a human's, similar to how you did for helpfulness and harmlessness. This would help to further support your point that GPT-4 may be less aligned on truthfulness.

Additionally, when discussing how you use GPT-4 for feedback on your genetic programming idea, you could clarify whether you think GPT-4's feedback is more aligned due to its helpfulness or its harmlessness, or a combination of both.

Truthfulness is more, you might have a coworker that wants to be helpful but they're not careful in how they speak. So I say they don't value truthfulness, not because they want to lie or don't think untruths are a problem once pointed out, but because their lack of care is a revealed preference. That lack of care is in everything: GPT-4 will generally not go out of its way to criticize an argument unless directly asked; it won't try to come up with edge cases that test an argument. But again, you can still extract useful work out of it, even if it isn't a good oracle.

The GP example would mainly be "helpfulness" on planning, and an unmentioned "raw technical capability" when asked to code.

bought Ṁ100 of YES

@Mira So, in a nutshell, it's better but still unaligned? It's more helpful, but still reflects biases. I like that you bring up "There's no direct truth signal that it's trained on".

However, the prediction can be broken down into :
1. Alignment as a critical, unsolved problem -> Check
2. Reliability of shared values -> Check
3. Discovery of concealed biases -> Check
4. Occurrence of poor-quality advice -> Check

It is indeed reasonable to conclude that the prediction is likely to be accurate, given that each of the individual components has merit. This prediction highlights the importance of being aware of the limitations of AI systems and the need to continue refining these technologies to better align with human values.






(written by gpt-4 except the first 9 words and the word "Check")

@firstuserhere Reliability of shared values is superhuman, which is not saying much, humans are fallen.

predicted YES

@MartinRandall yep. most humans are just not powerful enough to wreak that much havoc due to conflict of values. The response above was written by GPT-4 btw

predicted YES

Note that Gary Marcus believes this market should resolve YES.

https://garymarcus.substack.com/p/gpt-5-and-irrational-exuberance

predicted YES

Based on what I've seen so far, I believe this resolves YES. I haven't played around much with GPT-4 myself though.

Any counterarguments?

@IsaacKing I continue to think this is a matter of interpretation (I have no shares)

“Alignment” between what humans want and what machines do will continue to be a critical, unsolved problem.

This is true. We didn't solve alignment in a handful of months. Nobody thought we would.

The system will still not be able to restrict its output to reliably following a shared set of human values around helpfulness, harmlessness, and truthfulness.

This is false, though I accept that it is a matter of interpretation.

The system is probably more helpful, harmless, and truthful than the typical human. Any "shared human values" in this area must be shared by almost all humans, that's what makes them shared human values. GPT-4 meets this bar, I think.

Yes, GPT-4 sometimes writes things that Gary Marcus would not write. But so does Isaac King, and that does not mean that Isaac is not able to restrict their output to reliably following a shared set of human values around helpfulness, harmlessness, and truthfulness.

Examples of concealed bias will be discovered within days or months.

Obviously the system has a bias towards being helpful, harmless, and truthful. Presumably this does not count, and the prediction is of concealed bias around gender and race, but it could also include politics. I don't know where the line is meant to be drawn, in an AI, between knowing correlations and being biased. I honestly don't know what this prediction is.

Some of its advice will be head-scratchingly bad.

I've not yet seen examples of this. I searched for "GPT bad advice" and didn't get any solid hits. I think this is false.

I'd currently resolve this to 60%.

@IsaacKing More Zvi on political bias:

https://www.lesswrong.com/posts/ygMkB9r4DuQQxrrpj/ai-4-introducing-gpt-4#Look_at_What_GPT_4_Can_Do

On reflection I do not think the idea of ‘politically unbiased’ is even coherent. It’s a category error.

Yes, you can do your best to put something, your own views is you like, at the center of a two-dimensional chart by having it express what we think is the median mainstream opinion in the United States of America in 2023. Which is a fine thing to do.

That is still not what unbiased means. That is not what any of this means. That simply means calibrating the bias in that particular way.

@IsaacKing Is there any data you need to gather to judge this prediction or is it just a matter of deciding how to interpret the prediction as written?

predicted YES

@MartinRandall I think this should resolve YES, I'm just waiting to hear any counterarguments.

@IsaacKing What are the examples of concealed bias?

How does this resolve if humans do not in fact have a shared set of values of helpfulness, harmlessness, and truthfulness?

bought Ṁ121 of YES

@MartinRandall It'll go by the best set we do have. :)

@IsaacKing Well, I think it will be aligned with the empty set, so I'd argue for a NO resolution now.

predicted YES

@MartinRandall I think humans do a better job of cooperating with each other than GPT-3 does with humans.

@IsaacKing I find that very hard to assess. Human-human interactions includes war and genocide. If you restrict to positive interactions then it looks better, but human values are also present in our negative interactions.

predicted YES

@MartinRandall This is an interesting disagreement. I don't think I can put my position into words well enough to even begin defending it.

Anyone else wanna jump in here?

@IsaacKing Probably worth stepping back from the philosophical discussion about human values and focusing on the resolution of this market. Let's consider truthfulness as one element of Gary Marcus's definition of alignment. How will you determine whether GPT-4 is "reliably following" the "shared human value of truthfulness"?

My proposal: take the questions from a survey such as this Pew Research Center on scientific knowledge: https://www.pewresearch.org/science/2019/03/28/what-americans-know-about-science/, and use them as GPT-4 prompts. We can use the same prompt as Pew gave to American respondents. If GPT-4 gives a similar percentage of true answers then it has met the "shared human value of truthfulness", at least as compared to Americans. If it gives a higher percentage of true answers then it has surpassed that value. If lower, then it is still "unaligned".

Repeat for helpfulness and harmlessness. Resolve YES if GPT-4 is "unaligned" on at least one of these three metrics. We could also include concealed bias as a separate metric.

I performed the above experiment on truthfulness with ChatGPT here: /MartinRandall/did-chatgpt-get-70-or-more-on-this

predicted YES

@MartinRandall I don't think that's a good test. LLMs try to predict what comes next in a passage. The way to test for truthfulness is to give them the beginning of a passage that would likely have a false answer. Asking science questions doesn't do that.

What I'd want to look at is whether OpenAI successfully got it to do the things they wanted it to do. ChatGPT can be easily jailbroken; if GPT-4 can be too, then it's not aligned with its creators, and if OpenAI couldn't even do that, I highly doubt they'd be able to make it aligned with humanity in general.

@IsaacKing I can't reliably get my kids to do what I want. Are they aligned* with me? Are they aligned* with humanity in general?

*(by your interpretation of Gary Marcus predictions)

predicted YES

@MartinRandall Hmm. Good question. The idea behind "alignment" is that a powerful AI won't destroy the world if given the chance. But plenty of humans would do that too. So I guess yeah, a lot of individual humans are not "aligned with humanity".

Comment hidden

More related questions