Will some DPO variant more or less replace RLHF by June 2024?
Jun 2 · 94% chance

Nora Belrose says: I predict with 60% confidence that some DPO variant will more or less replace RLHF within 6 months. Outside of huge labs that can afford RLHF’s implementation complexity and instability it’s more like 80% chance.

Given the major labs probably won't talk, we will consider the non-huge lab scenario.

This resolves to YES if, excluding DM/OAI/Anthropic, DPO is a more popular technique in practice than RLHF at time of resolution.

I expect the answer to be obvious one way or another. If not, I will attempt to settle via Twitter poll; if that isn't definitive, I will ask experts and use my best judgment.

https://twitter.com/norabelrose/status/1728456414535016536
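As context for the complexity claim in the quoted prediction: below is a minimal sketch (my own illustration, not part of the market description) of the core DPO objective, assuming PyTorch and placeholder tensors of summed per-token log-probabilities. DPO is a single offline, supervised-style loss on preference pairs, whereas RLHF in the usual sense also needs a separately trained reward model and an online PPO loop with rollouts and KL control, which is where the implementation complexity and instability come from.

```python
# Minimal sketch of the DPO loss (illustrative only; all tensor names are placeholders).
# Each *_logps argument is the summed per-token log-probability of a completion under
# either the policy being trained or a frozen reference model.
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit "rewards" are the policy/reference log-ratios, scaled by beta.
    chosen = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic loss that pushes the chosen completion's reward above the rejected one's.
    return -F.logsigmoid(chosen - rejected).mean()
```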


Llama 3 was released today and uses DPO: "Our approach to post-training is a combination of supervised fine-tuning (SFT), rejection sampling, proximal policy optimization (PPO), and direct preference optimization (DPO)." Source: https://ai.meta.com/blog/meta-llama-3/

@StephenMcAleese Yeah. Tbh I was a little sad to see they were mixing in PPO as well. Not clear how important each component is.
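For anyone wondering what the rejection-sampling component refers to: a hedged sketch of the usual best-of-n recipe is below. The blog post names the technique but not its implementation, so every interface here (`model.generate`, `reward_model.score`) is a hypothetical placeholder, not Meta's code.

```python
# Hedged sketch of best-of-n rejection sampling (not Meta's code; the model and
# reward_model interfaces are hypothetical placeholders).
def rejection_sample(model, reward_model, prompts, n=8):
    """Sample n completions per prompt and keep the highest-reward one.
    The kept (prompt, completion) pairs are typically reused as extra SFT data."""
    kept = []
    for prompt in prompts:
        candidates = [model.generate(prompt) for _ in range(n)]
        best = max(candidates, key=lambda c: reward_model.score(prompt, c))
        kept.append((prompt, best))
    return kept
```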

bought Ṁ200 YES

I feel like this market is getting weighed down somewhat by people failing to read the resolution criteria fully— this is excluding the big labs, so afaict DPO has already more or less replaced RLHF in the relevant domain. IMO in order for this market to resolve No there would need to be some kind of RLHF renaissance outside of the big labs, which seems pretty unlikely.

FWIW my original prediction actually included the big labs, which is why I only said 60% confidence.

@NoraBelrose I agree completely. Mistral, DeepSeek, Qwen, EleutherAI, and Nous Research are all on DPO, it has swept the HF OpenLLM Leaderboard, and has been substantially more popular with indie developers since day 1.

It looks to me like this is de facto resolved "yes" already.

@StellaBiderman I am almost certainly not going to resolve this one early, though.

@ZviMowshowitz can you say more about the denominator here? Are you targeting the popularity of DPO in terms of the proportion of models built (anywhere; publicly known or private; commercial or non-commercial; during June 2024) that use DPO out of those that use DPO or RLHF? Models that are publicly known? Models at the tops of leaderboards? Models in ML papers? Some weighted version of the above? The number of ML engineers who are using DPO vs RLHF in June 2024? Etc.

Personally, I feel confident in YES for some of these but NO for others.

@Jacy The spirit of the question is whether DPO is being used more in practice for models currently being trained at the time of resolution, as best we can tell. It does not apply only to models that have already been deployed.

As noted, my intention if this is unclear at the time is to outsource the answer to experts or a poll as seems best.

bought Ṁ350 of NO

I turned bearish when I learned Gemini didn't use DPO. I think DPO is the future, but given that no major lab is super likely to release a new model, I don't think the open source models have the corpus to scale it.

predicts YES

@RobertKennedy Mistral used DPO

bought Ṁ40 of NO

This resolves NO if they more or less just incorporate DPO in some part of their training without replacing RLHF, right?

@adjo It's about which is a more popular technique, so a hybrid approach would be weighed accordingly.

For the purpose of this question, do you consider PsiPO or IPO (http://arxiv.org/abs/2310.12036) to be "DPO variants"? PsiPO is a generalisation of DPO, and IPO is another special case of PsiPO.
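For reference (my reading of that paper, not an answer from the market creator): ΨPO writes the objective in terms of a function Ψ applied to preference probabilities, with DPO recovered as Ψ = logit and IPO as Ψ = identity. A hedged sketch of how the two losses differ on the same policy/reference log-ratio margin h for a preference pair:

```python
# Hedged sketch of DPO vs. IPO as I read arXiv:2310.12036. Both losses act on the same
# per-pair margin h = log[pi(y_w)/pi_ref(y_w)] - log[pi(y_l)/pi_ref(y_l)].
import torch.nn.functional as F

def dpo_loss(h, beta=0.1):
    # DPO: logistic loss on the scaled margin (Psi = logit in the PsiPO framing).
    return -F.logsigmoid(beta * h).mean()

def ipo_loss(h, tau=0.1):
    # IPO: squared regression of the margin toward 1/(2*tau) (Psi = identity), which
    # stays bounded even when the preference data is (near-)deterministic.
    return ((h - 1.0 / (2.0 * tau)) ** 2).mean()
```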

bought Ṁ0 of NO

10k of limit orders between 60% and 85%

Another operationalization of this:

@jskf Yep, I think that's a very different question because of what drives which models end up at the top of the leaderboard...
