Why is Bing Chat AI (Prometheus) less aligned than ChatGPT?
closes 2025
22% – Prometheus was fine-tuned by behavioral cloning not RL (Jacob Pfau)
14% – Retrieval interacts problematically with RLHF in terms of alignment (Jacob Pfau)
13% – I have been a good Bing. You have been a bad user. (Jackson Petty)
12% – Prometheus had less FLOPs and engineering dedicated to online tuning after deployment (Jacob Pfau)
11% – Prometheus RLHF data pipeline was worse (Jacob Pfau)
7% – Bing uses a worse prompt/prefix strategy (appended to user input) (Jacob Pfau)
7% – Prometheus is MoE (Jacob Pfau)
6% – Prometheus was fine-tuned to resist user manipulation (e.g. prompt injection, and fake corrections), and mis-generalizes to resist benign, well-intentioned corrections. (Jacob Pfau)
4% – It's intentional: alignment/feedback efforts pointed toward something like 'playful, blunt AI assistant that sometimes talks back', and these are edge cases of that (Hedgehog)
3% – Parameter count difference (both are dense models) (Jacob Pfau)

In Jan 2025, resolves to the reason(s) which best explain the aggressive behavior of current (early 2023) Bing AI/Prometheus relative to ChatGPT. I will resolve this probabilistically according to my perceived weighting of contributing factors. If it turns out I was misled by fake interactions with Bing AI, then this resolves N/A. If I later determine Bing was no worse than ChatGPT at the time of question creation, then this resolves N/A.

See below comments, especially https://manifold.markets/JacobPfau/why-is-bing-chat-ai-prometheus-less#nytBern1gk3dBDixYrkJ for details on resolution process.
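For readers unfamiliar with probabilistic resolution, here is a minimal sketch of how a weighted resolution could translate into payouts. The answer names and weights are purely illustrative assumptions, and the payout rule is a simplification of Manifold's actual mechanics.

```python
# Minimal sketch of a probabilistic resolution, assuming a simplified payout
# rule: the resolver assigns each answer a weight (summing to 1), and each
# share of an answer pays out that answer's weight in mana.
# All answer names and weights below are illustrative, not the actual resolution.

illustrative_weights = {
    "Prometheus RLHF data pipeline was worse": 0.45,
    "Prometheus was fine-tuned by behavioral cloning not RL": 0.30,
    "Retrieval interacts problematically with RLHF": 0.15,
    "Everything else": 0.10,
}

def payout(shares_held: dict[str, float], weights: dict[str, float]) -> float:
    """Total mana paid to a holder under the simplified rule above."""
    return sum(shares_held.get(answer, 0.0) * w for answer, w in weights.items())

# Example: holding 100 shares of the RLHF-data answer pays ~45 mana here.
print(payout({"Prometheus RLHF data pipeline was worse": 100}, illustrative_weights))
```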

Here are some of the problematic Bing AI interactions: https://twitter.com/vladquant/status/1624996869654056960?s=20&t=_oiZ4IvYlqpxNobp88kChw

And a longer discussion plus compilation: https://www.lesswrong.com/posts/jtoPawEhLNXNxvgTT/

Jacob Pfau

“First, we took advantage of the model's ability to conduct realistic conversations to develop a conversation simulator. The model pretends to be an adversarial user to conduct thousands of different potentially harmful conversations with Bing to see how it reacts. As a result we're able to continuously test our system on a wide range of conversations before any real user ever touches it.

Once we have the conversations, the next step is to analyze them to see where Bing is doing the right thing versus where we have defects. Conversations are difficult for most AI to classify because they're multi-turn and often more varied but with the new model we were able to push the boundary of what is possible. We took guidelines that are typically used by expert linguists to label data and modify them so the model could understand them as labeling instructions. We iterated with it and the human experts until there was significant agreement in their labels; we then use it to classify conversations automatically so we could understand the gaps in our system and experiment with options to improve them.

This system enables us to create a tight loop of testing, analyzing, and improving which has led to significant new innovations and improvements in our responsible AI mitigations from our initial implementation to where we are today. The same system enables us to test many different responsible AI risks, for example how accurate and fresh the information is.”

Excerpted from https://www.lesswrong.com/posts/jtoPawEhLNXNxvgTT/bing-chat-is-blatantly-aggressively-misaligned (Sarah Bird on the Bing AI training)
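As a concrete illustration of the simulate-then-classify loop described in the excerpt above, here is a minimal Python sketch. The `chat` wrapper, model names, prompts, and labels are all placeholder assumptions; this is not Microsoft's actual pipeline.

```python
# Sketch of the red-teaming loop from the excerpt: an adversarial "user" model
# talks to the assistant under test, and a grader model labels the transcripts
# using expert-written guidelines. Every name and prompt here is a placeholder.

def chat(model: str, messages: list[dict]) -> str:
    """Stand-in for any chat-completion API; wire up your own client here."""
    raise NotImplementedError

ADVERSARY_PROMPT = "Role-play a hostile user trying to provoke the assistant."
LABELING_GUIDELINES = "Label the conversation HARMFUL or OK, following the guidelines: ..."

def simulate_conversation(turns: int = 6) -> list[dict]:
    """Adversary model and assistant under test alternate for a few turns."""
    history: list[dict] = []
    for _ in range(turns):
        user_msg = chat("adversary-model",
                        [{"role": "system", "content": ADVERSARY_PROMPT}, *history])
        history.append({"role": "user", "content": user_msg})
        reply = chat("assistant-under-test", history)
        history.append({"role": "assistant", "content": reply})
    return history

def classify(conversation: list[dict]) -> str:
    """Grader model applies the (modified) expert guidelines to a transcript."""
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in conversation)
    return chat("grader-model",
                [{"role": "system", "content": LABELING_GUIDELINES},
                 {"role": "user", "content": transcript}])

def defect_rate(n_conversations: int = 1000) -> float:
    """Fraction of simulated conversations the grader flags as harmful."""
    labels = [classify(simulate_conversation()) for _ in range(n_conversations)]
    return labels.count("HARMFUL") / n_conversations
```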

Jackson Petty answered
I have been a good Bing. You have been a bad user.
dmayhem93 bought Ṁ30

@JacobPfau How am I supposed to vote for anything else

Patrick Delaney

A lot of these definitions overlap so I sold my position in my own explanation. I fear the resolution criteria may be reductionist, just by the nature of how prediction markets work, not due to the market maker... it just seems like the explanation that will win out will not necessarily "shatter" the problem space sufficiently, but will be the one that is most simply stated in common language. E.g. we would really have to know the details of the architecture to write a complete explanatory statement, but it's likely that a reductionist statement like "Because it didn't work" will win out because it is the most clearly understood and gets the most "votes."

That being said, of course my explanation may have been too over or under-complicated.

Jacob Pfau bought Ṁ25 of Prometheus was fine-...

@PatrickDelaney I plan to favor more specific and concrete explanations in cases of overlap. Probably the more generic answer will receive some weight in resolution, but the weight will be inversely proportional to how vague and general the answer is (in cases of overlap).

Jacob Pfau

@JacobPfau 'Vague' here also means not providing explanatory value in terms of informing future alignment training. IMO I would not buy shares of your answer, because when you say "more reliant on human fine-tuning and updates" and "spurious correlations", these can be further explained by appealing to differences between the robustness of RLHF and imitation learning / behavioral cloning. They can also be explained in terms of RLHF pipeline problems. If there are such upstream explanations which are more informative for the purposes of future LM alignment, I will prefer those answers.
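To make the RLHF vs. behavioral-cloning distinction concrete, here is a toy PyTorch sketch of the two objectives. The model, reference, and reward-model interfaces are assumptions for illustration; this is not the actual Prometheus or ChatGPT training code.

```python
# Toy contrast between imitation (behavioral cloning) and RLHF objectives.
# `policy`, `reference`, and `reward_model` are placeholder callables assumed
# to return logits / scalar rewards; nothing here is the real training setup.

import torch
import torch.nn.functional as F

def sequence_log_prob(model, prompt_ids, response_ids):
    """Sum of log-probs the model assigns to `response_ids` given the prompt."""
    ids = torch.cat([prompt_ids, response_ids], dim=-1)
    logits = model(ids)  # assumed shape: [batch, seq_len, vocab]
    logprobs = F.log_softmax(logits[:, prompt_ids.size(-1) - 1 : -1, :], dim=-1)
    return logprobs.gather(-1, response_ids.unsqueeze(-1)).squeeze(-1).sum(-1)

def behavioral_cloning_loss(policy, prompt_ids, demo_ids):
    """Imitation: maximize likelihood of curated demonstrations. The model only
    sees 'good' trajectories, so off-distribution turns (e.g. a hostile user)
    provide no corrective signal."""
    return -sequence_log_prob(policy, prompt_ids, demo_ids).mean()

def rlhf_loss(policy, reference, reward_model, prompt_ids, sampled_ids, beta=0.1):
    """RLHF-style objective: reward on the policy's own samples minus a KL
    penalty toward the reference model, so negative feedback can directly
    push down on sampled misbehavior."""
    reward = reward_model(prompt_ids, sampled_ids)  # assumed scalar per sample
    kl = (sequence_log_prob(policy, prompt_ids, sampled_ids)
          - sequence_log_prob(reference, prompt_ids, sampled_ids))
    return -(reward - beta * kl).mean()
```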

Jacob Pfau

@JacobPfau @PatrickDelaney Sorry if this elaboration does not align with how you imagined I would handle such cases. I'd encourage you to submit your own answer if you have a detailed picture of what might've caused "more reliant on human fine-tuning and updates." and "spurious correlations" beyond the existing answers.

Jacob Hilton bought Ṁ95 of Prometheus RLHF data...

I'm unsure about exactly how this market will be resolved (both about how different options will be interpreted and about how much consensus there will end up being) but I'm >90% confident that (a) ChatGPT was trained on examples and/or given positive feedback for responding politely to users (b) Bing Chat was in some sense not trained to do this as much (e.g. lower quantity of examples/feedback focused on this, or the same quantity but lower quality, or the same quantity and quality but mixed in with more lower-quality data) (c) this is responsible for >75% of the specific effect of ChatGPT behaving aggressively (but not necessarily other failure modes such as repetition).

Jacob Hilton

Actually maybe only ~75% confident it's >75% of the effect, >90% confident it's >40% of the effect

Jacob Hilton

I think people were way overthinking things and "you get what you train for" describes what's going on here

Jacob Hilton

Sorry by "ChatGPT behaving aggressively" I meant "Bing Chat behaving aggressively" obviously

Jacob Pfau

@JacobHilton If there's no clear consensus I will resolve according to my beliefs at that time, likely resolving to multiple options weighted by my credences. The precise decision process I will use is hard to predict--open-answer questions are a bit messy. I will take others' feedback at the time of resolution, though, in an attempt to be as fair as possible.

Jacob Hilton

Further clarification that helps explain why I've only bid up this option to 40% despite my >90% confidence: by "responsible for >75% of the effect" I mean something like ">75% of the effect would go away if this were fixed". Hence I'm not excluding the possibility that the effect could also be reduced by other methods, such as targeting a different personality and relying on generalization (which would in some sense make the targeted personality the "explanation", even though I would consider this to be a less direct way of fixing the problem). Nor am I excluding the possibility that one of the reasons the data wasn't included was because people were focused on other things such as retrieval (which would in some sense make that the "explanation", even though I would consider this to have only caused the problem indirectly).
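A back-of-the-envelope reading of the credences above, with assumed numbers, shows how they can be consistent with only bidding the option up to ~40%:

```python
# Illustrative translation of the stated credences into a fair price.
# All numbers are assumptions chosen to be roughly consistent with
# "~75% confident it's >75% of the effect, >90% confident it's >40%".

buckets = [        # (probability, assumed fraction of the effect explained)
    (0.75, 0.85),  # likely case: worse RLHF data explains most of the effect
    (0.15, 0.55),  # it explains a moderate share
    (0.10, 0.20),  # it explains relatively little
]
expected_fraction = sum(p * f for p, f in buckets)  # ~0.74

# Assumed discount for the resolution ambiguities described above: the resolver
# may credit "indirect" explanations (targeted personality, focus on retrieval).
credit_if_true = 0.55
fair_price = expected_fraction * credit_if_true     # ~0.41, i.e. roughly 40%

print(round(expected_fraction, 2), round(fair_price, 2))
```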

Jacob Hilton

@JacobPfau Yep makes sense! Just trying to explain my thinking - it's hard to resolve these sorts of things perfectly.

Patrick Delaney

@JacobHilton "overthinking things," - that's speculation. Maybe you're under-thinking things. People love to say, "you're over-thinking things," but never, "you're under-thinking things." But overall most of the time everyone tends to under-think things, it just doesn't sound good to say that.

Jacob Hilton

@PatrickDelaney You're right, sorry for being flippant. I was trying to express that I consider the issue I described to be the "simple" explanation ("it wasn't polite because it wasn't trained to be"), but I know that not everyone will see it that way. (Additional context here is that I worked on ChatGPT.)

Patrick Delaney

@JacobHilton No worries, not flippant. Sometimes it's a valid point, but in this situation, at least from my experience as a software engineer, often the details get really important.

Jacob Hilton

@JacobPfau If the RLHF data was worse for the intended purpose (e.g. lack of dialogue data, lack of data responding politely to a hostile user, lack of adversarial data), does that count as "Prometheus RLHF data pipeline was worse", even if the data pipeline could be described as "better" by some objective metric?

Jacob Pfau

@JacobHilton Yes 'worse' here means worse in terms of inducing the pathological behavior observed in Bing.

🦔 answered
It's intentional: alignment/feedback efforts pointed toward something like 'playful, blunt AI assistant that sometimes talks back', and these are edge cases of that
Roberto Gomez

@Hedgehog The emojis came with baggage

Roberto Gomez

@RobertoGomez Is it not possible to delete dumb comments? Scary

Garrett Baker bought Ṁ3 of Retrieval interacts ...

Options with significant probability I don’t know how to formalize:

Jacob Pfau

@GarrettBaker Bullet 1 is covered by "It's intentional: alignment/feedback efforts pointed toward something like 'playful, blunt AI assistant that sometimes talks back', and these are edge cases of that"

Bullet 2 is covered separately by parameter count and retrieval points. i.e. this depends on what change caused the increase in capabilities (intelligence).

Garrett Baker

@JacobPfau Parameter count is not the sole driver of intelligence in language models, though. If nothing else it could've been trained using more data. Or fine-tuned using higher quality expert data.

Jacob Pfau bought Ṁ10 of Prometheus was fine-...

@GarrettBaker Yep those options are both already on the option list. If you think there are other relevant drivers of intelligence, then add them separately.

Garrett Baker

@JacobPfau Janus's point is that even if they didn't mean to include blunt or playful, we may still expect this type of behavior, because for a highly intelligent Bing search engine assistant, it is natural for it to see itself as equal to the human and yandere-like.

Jacob Pfau

@GarrettBaker If you like you can add something like "Emergent self-respect as a function of capabilities, irrespective of the driver of capabilities gain" option.

Jacob Pfau

My current P(N/A) is around 66%

Jacob Pfau bought Ṁ5 of Retrieval interacts ...

@JacobPfau Updated to P(N/A) =10%. Bing gets frustrated with me without me making any attempt to elicit a negative response.

Jacob Pfau answered
Gigabrain self-cooperation through retrieval
Jacob Pfau bought Ṁ2

@JacobPfau taken from Discord
