Will some DPO variant more or less replace RLHF by June 2024?
Jun 2 · 94% chance

Nora Belrose says: I predict with 60% confidence that some DPO variant will more or less replace RLHF within 6 months. Outside of huge labs that can afford RLHF’s implementation complexity and instability it’s more like 80% chance.

Given the major labs probably won't talk, we will consider the non-huge lab scenario.

This resolves to YES if, excluding DM/OAI/Anthropic, DPO is a more popular technique in practice than RLHF at time of resolution.

I expect the answer to be obvious one way or another. If not, I will attempt to settle via Twitter poll; if that isn't definitive, I will ask experts and use my best judgment.

https://twitter.com/norabelrose/status/1728456414535016536
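As context for the complexity claim in the quoted prediction: below is a minimal sketch (my own illustration, not part of the market description) of the core DPO objective, assuming PyTorch and placeholder tensors of summed per-token log-probabilities. DPO is a single offline, supervised-style loss on preference pairs, whereas RLHF in the usual sense also needs a separately trained reward model and an online PPO loop with rollouts and KL control, which is where the implementation complexity and instability come from.

```python
# Minimal sketch of the DPO loss (illustrative only; all tensor names are placeholders).
# Each *_logps argument is the summed per-token log-probability of a completion under
# either the policy being trained or a frozen reference model.
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit "rewards" are the policy/reference log-ratios, scaled by beta.
    chosen = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic loss that pushes the chosen completion's reward above the rejected one's.
    return -F.logsigmoid(chosen - rejected).mean()
```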


Llama 3 was released today and uses DPO: "Our approach to post-training is a combination of supervised fine-tuning (SFT), rejection sampling, proximal policy optimization (PPO), and direct preference optimization (DPO)." Source: https://ai.meta.com/blog/meta-llama-3/

@StephenMcAleese Yeah. Tbh I was a little sad to see they were mixing in PPO as well. Not clear how important each component is.
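For anyone wondering what the rejection-sampling component refers to: a hedged sketch of the usual best-of-n recipe is below. The blog post names the technique but not its implementation, so every interface here (`model.generate`, `reward_model.score`) is a hypothetical placeholder, not Meta's code.

```python
# Hedged sketch of best-of-n rejection sampling (not Meta's code; the model and
# reward_model interfaces are hypothetical placeholders).
def rejection_sample(model, reward_model, prompts, n=8):
    """Sample n completions per prompt and keep the highest-reward one.
    The kept (prompt, completion) pairs are typically reused as extra SFT data."""
    kept = []
    for prompt in prompts:
        candidates = [model.generate(prompt) for _ in range(n)]
        best = max(candidates, key=lambda c: reward_model.score(prompt, c))
        kept.append((prompt, best))
    return kept
```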

bought Ṁ200 YES

I feel like this market is getting weighed down somewhat by people failing to read the resolution criteria fully— this is excluding the big labs, so afaict DPO has already more or less replaced RLHF in the relevant domain. IMO in order for this market to resolve No there would need to be some kind of RLHF renaissance outside of the big labs, which seems pretty unlikely.

FWIW my original prediction actually included the big labs, which is why I only said 60% confidence.

@NoraBelrose I agree completely. Mistral, DeepSeek, Qwen, EleutherAI, and Nous Research are all on DPO, it has swept the HF OpenLLM Leaderboard, and has been substantially more popular with indie developers since day 1.

It looks to me like this is de facto resolved "yes" already.

@StellaBiderman I am almost certainly not going to resolve this one early, though.

@ZviMowshowitz can you say more about the denominator here? Are you targeting the popularity of DPO in terms of the proportion of models built (anywhere; publicly known or private; commercial or non-commercial; during June 2024) that use DPO out of those that use DPO or RLHF? Models that are publicly known? Models at the tops of leaderboards? Models in ML papers? Some weighted version of the above? The number of ML engineers who are using DPO vs RLHF in June 2024? Etc.

Personally, I feel confident in YES for some of these but NO for others.

@Jacy The spirit of the question is whether DPO is being used more in practice for models currently being trained at the time of resolution, as best we can tell. It does not apply only to models that have already been deployed.

As noted, my intention if this is unclear at the time is to outsource the answer to experts or a poll as seems best.

bought Ṁ350 of NO

I turned bearish when I learned Gemini didn't use DPO. I think DPO is the future, but given that no major lab is super likely to release a new model, I don't think the open source models have the corpus to scale it.

predicts YES

@RobertKennedy Mistral used DPO

bought Ṁ40 of NO

This resolves NO if they more or less just incorporate DPO in some part of their training without replacing RLHF, right?

@adjo It's about which is a more popular technique, so a hybrid approach would be weighed accordingly.

For the purpose of this question, do you consider PsiPO or IPO (http://arxiv.org/abs/2310.12036) to be "DPO variants"? PsiPO is a generalisation of DPO, and IPO is another special case of PsiPO.
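For reference (my reading of that paper, not an answer from the market creator): ΨPO writes the objective in terms of a function Ψ applied to preference probabilities, with DPO recovered as Ψ = logit and IPO as Ψ = identity. A hedged sketch of how the two losses differ on the same policy/reference log-ratio margin h for a preference pair:

```python
# Hedged sketch of DPO vs. IPO as I read arXiv:2310.12036. Both losses act on the same
# per-pair margin h = log[pi(y_w)/pi_ref(y_w)] - log[pi(y_l)/pi_ref(y_l)].
import torch.nn.functional as F

def dpo_loss(h, beta=0.1):
    # DPO: logistic loss on the scaled margin (Psi = logit in the PsiPO framing).
    return -F.logsigmoid(beta * h).mean()

def ipo_loss(h, tau=0.1):
    # IPO: squared regression of the margin toward 1/(2*tau) (Psi = identity), which
    # stays bounded even when the preference data is (near-)deterministic.
    return ((h - 1.0 / (2.0 * tau)) ** 2).mean()
```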

bought Ṁ0 of NO

10k of limit orders between 60% and 85%

Another operationalization of this:

@jskf Yep, I think that's a very different question because of what drives which models end up at the top of the leaderboard...
