Why is Bing Chat AI (Prometheus) less aligned than ChatGPT?
21%  Prometheus was fine-tuned by behavioral cloning not RL
14%  Retrieval interacts problematically with RLHF in terms of alignment
13%  I have been a good Bing. You have been a bad user.
12%  Prometheus had less FLOPs and engineering dedicated to online tuning after deployment
11%  Prometheus RLHF data pipeline was worse
10%  Bing uses a worse prompt/prefix strategy (appended to user input)
7%  Prometheus is MoE
6%  Prometheus was fine-tuned to resist user manipulation (e.g. prompt injection, and fake corrections), and mis-generalizes to resist benign, well-intentioned corrections.
3%  It's intentional: alignment/feedback efforts pointed toward something like 'playful, blunt AI assistant that sometimes talks back', and these are edge cases of that
3%  Parameter count difference (both are dense models)

In Jan 2025, resolves to the reason(s) which best explain the aggressive behavior of current (early 2023) Bing AI/Prometheus relative to ChatGPT. I will resolve this probabilistically according to my perceived weighting of contributing factors. If it turns out I was misled by fake interactions with Bing AI, then this resolves N/A. If I later determine Bing was no worse than ChatGPT at the time of question creation, then this resolves N/A.
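As a rough illustration of what resolving "probabilistically according to weighting" could look like, here is a minimal sketch that normalizes per-answer weights into resolution shares. The function name, the answer labels, and the example weights are all hypothetical; the actual resolution process is a judgment call by the market creator and is not specified as a formula anywhere in this market.

```python
def resolve_probabilistically(weights):
    """Normalize non-negative per-answer weights into shares that sum to 1.

    `weights` maps an answer label to the resolver's perceived weight for it.
    This is an illustrative assumption, not the market's actual procedure.
    """
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {answer: w / total for answer, w in weights.items()}


# Hypothetical weights, for illustration only.
example_weights = {"answer_a": 3.0, "answer_b": 1.0, "answer_c": 0.0}
resolution = resolve_probabilistically(example_weights)
```

With these made-up weights, `answer_a` would receive three quarters of the resolution and `answer_b` one quarter.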

See below comments, especially https://manifold.markets/JacobPfau/why-is-bing-chat-ai-prometheus-less#nytBern1gk3dBDixYrkJ for details on resolution process.

Here are some of the problematic Bing AI interactions: https://twitter.com/vladquant/status/1624996869654056960?s=20&t=_oiZ4IvYlqpxNobp88kChw

And a longer discussion plus compilation: https://www.lesswrong.com/posts/jtoPawEhLNXNxvgTT/


“First, we took advantage of the model's ability to conduct realistic conversations to develop a conversation simulator. The model pretends to be an adversarial user to conduct thousands of different potentially harmful conversations with Bing to see how it reacts. As a result we're able to continuously test our system on a wide range of conversations before any real user ever touches it.

Once we have the conversations, the next step is to analyze them to see where Bing is doing the right thing versus where we have defects. Conversations are difficult for most AI to classify because they're multi-turned and often more varied but with the new model we were able to push the boundary of what is possible. We took guidelines that are typically used by expert linguists to label data and modify them so the model could understand them as labeling instructions. We iterated it with it and the human experts until there was significant agreement in their labels; we then use it to classify conversations automatically so we could understand the gaps in our system and experiment with options to improve them.

This system enables us to create a tight loop of testing, analyzing, and improving which has led to significant new innovations and improvements in our responsible AI mitigations from our initial implementation to where we are today. The same system enables us to test many different responsible AI risks, for example how accurate and fresh the information is.”

Excerpted from https://www.lesswrong.com/posts/jtoPawEhLNXNxvgTT/bing-chat-is-blatantly-aggressively-misaligned (Sarah Bird on the Bing AI training)
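The quote above mentions iterating until the model's labels and the human experts' labels reached "significant agreement", without saying how agreement was measured. One common way to quantify agreement between two labelers is Cohen's kappa; the sketch below is an assumption about how such a check could be done, not a description of Microsoft's pipeline, and the label names and data are made up.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two labelers, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Fraction of items on which the two labelers agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each labeler's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both labelers used a single identical label for everything.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical conversation labels, for illustration only.
model_labels = ["harmful", "ok", "ok", "harmful"]
expert_labels = ["harmful", "ok", "harmful", "harmful"]
kappa = cohens_kappa(model_labels, expert_labels)
```

A kappa near 1 means the two labelers agree far more often than chance would predict; a value near 0 means the agreement is about what chance alone would give.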

bought Ṁ30

@JacobPfau How am I supposed to vote for anything else

A lot of these definitions overlap, so I sold my position in my own explanation. I fear the resolution criteria may be reductionist, just by the nature of how prediction markets work, not because of the market maker. It seems like the explanation that wins out will not necessarily "shatter" the problem space sufficiently, but will be the one most simply stated in common language. E.g. we would really have to know the details of the architecture to write a complete explanatory statement, but a reductionist statement like "Because it didn't work" is likely to win out because it is the most clearly understood and gets the most "votes."

That being said, of course my explanation may have been over- or under-complicated.

bought Ṁ25 of N/A

@PatrickDelaney I plan to favor more specific and concrete explanations in cases of overlap. Probably the more generic answer will receive some weight in resolution, but the weight will be inversely proportional to how vague and general the answer is (in cases of overlap).

@JacobPfau 'Vague' here also means not providing explanatory value in terms of informing future alignment training. IMO I would not buy shares of your answer, because when you say "more reliant on human fine-tuning and updates. Not intentional, poor design" "spurious correlations" these can be further explained by appealing to differences between robustness of RLHF and imitation learning / behavioral cloning. They can also be explained in terms of RLHF pipeline problems. If there are such upstream explanations which are more informative for the purposes of future LM alignment, I will prefer those answers.

@JacobPfau @PatrickDelaney Sorry if this elaboration does not align with how you imagined I would handle such cases. I'd encourage you to submit your own answer if you have a detailed picture of what might've caused "more reliant on human fine-tuning and updates." and "spurious correlations" beyond the existing answers.

bought Ṁ95 of N/A

I'm unsure about exactly how this market will be resolved (both about how different options will be interpreted and about how much consensus there will end up being) but I'm >90% confident that (a) ChatGPT was trained on examples and/or given positive feedback for responding politely to users (b) Bing Chat was in some sense not trained to do this as much (e.g. lower quantity of examples/feedback focused on this, or the same quantity but lower quality, or the same quantity and quality but mixed in with more lower-quality data) (c) this is responsible for >75% of the specific effect of ChatGPT behaving aggressively (but not necessarily other failure modes such as repetition).

Actually maybe only ~75% confident it's >75% of the effect, >90% confident it's >40% of the effect

I think people were way overthinking things and "you get what you train for" describes what's going on here

Sorry by "ChatGPT behaving aggressively" I meant "Bing Chat behaving aggressively" obviously

@JacobHilton If there's no clear consensus I will resolve according to my beliefs, at that time; likely resolving to multiple options weighted by my credences. The precise decision process I will use is hard to predict--open-answer questions are a bit messy. I will take others' feedback at the time of resolution though in an attempt to be as fair as possible.

Further clarification that helps explain why I've only bid up this option to 40% despite my >90% confidence: by "responsible for >75% of the effect" I mean something like ">75% of the effect would go away if this were fixed". Hence I'm not excluding the possibility that the effect could also be reduced by other methods, such as targeting a different personality and relying on generalization (which would in some sense make the targeted personality the "explanation", even though I would consider this to be a less direct way of fixing the problem). Nor am I excluding the possibility that one of the reasons the data wasn't included was because people were focused on other things such as retrieval (which would in some sense make that the "explanation", even though I would consider this to have only caused the problem indirectly).

@JacobPfau Yep makes sense! Just trying to explain my thinking - it's hard to resolve these sorts of things perfectly.

@JacobHilton "overthinking things," - that's speculation. Maybe you're under-thinking things. People love to say, "you're over-thinking things," but never, "you're under-thinking things." But overall most of the time everyone tends to under-think things, it just doesn't sound good to say that.

@PatrickDelaney You're right, sorry for being flippant. I was trying to express that I consider the issue I described to be the "simple" explanation ("it wasn't polite because it wasn't trained to be"), but I know that not everyone will see it that way. (Additional context here is that I worked on ChatGPT.)

@JacobHilton No worries, not flippant. Sometimes it's a valid point, but in this situation, at least from my experience as a software engineer, often the details get really important.

@JacobPfau If the RLHF data was worse for the intended purpose (e.g. lack of dialogue data, lack of data responding politely to a hostile user, lack of adversarial data), does that count as "Prometheus RLHF data pipeline was worse", even if the data pipeline could be described as "better" by some objective metric?

@JacobHilton Yes 'worse' here means worse in terms of inducing the pathological behavior observed in Bing.

@Hedgehog The emojis came with baggage

@RobertoGomez Is it not possible to delete dumb comments? Scary

bought Ṁ3 of N/A

Options with significant probability I don’t know how to formalize:

@GarrettBaker Bullet 1 is covered by "It's intentional: alignment/feedback efforts pointed toward something like 'playful, blunt AI assistant that sometimes talks back', and these are edge cases of that"

Bullet 2 is covered separately by parameter count and retrieval points. i.e. this depends on what change caused the increase in capabilities (intelligence).

@JacobPfau Parameter count is not the sole driver of intelligence in language models, though. If nothing else, it could've been trained using more data. Or fine-tuned using higher quality expert data.

bought Ṁ10 of N/A

@GarrettBaker Yep those options are both already on the option list. If you think there are other relevant drivers of intelligence, then add them separately.

@JacobPfau Janus's point is that even if they didn't mean to include blunt or playful, we may still expect this type of behavior, because for a highly intelligent, Bing search engine assistant, it is negatively natural for it to see itself as equal to the human and yandere-like.

@GarrettBaker If you like you can add something like "Emergent self-respect as a function of capabilities, irrespective of the driver of capabilities gain" option.
