Nora Belrose says: I predict with 60% confidence that some DPO variant will more or less replace RLHF within 6 months. Outside of huge labs that can afford RLHF’s implementation complexity and instability, it’s more like an 80% chance.
Given the major labs probably won't talk, we will consider the non-huge lab scenario.
This resolves to YES if, excluding DM/OAI/Anthropic, DPO is a more popular technique in practice than RLHF at time of resolution.
I expect the answer to be obvious one way or another. If not, I will attempt to settle via Twitter poll, and if that isn't definitive, I will ask experts and use my best judgment.
@ZviMowshowitz maybe use the @traders thing next time so we see the notification. I get the impression there's been a Cambrian explosion of preference optimization methods. Not sure which labs/models are still primarily using RLHF.
Llama 3 was released today and uses DPO: "Our approach to post-training is a combination of supervised fine-tuning (SFT), rejection sampling, proximal policy optimization (PPO), and direct preference optimization (DPO)." source: https://ai.meta.com/blog/meta-llama-3/
@StephenMcAleese Yeah. Tbh I was a little sad to see they were mixing in PPO as well. Not clear how important each component is.
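For anyone weighing the components: PPO-based RLHF needs a separate reward model and an online RL loop, while DPO reduces preference learning to a single supervised-style loss on preference pairs. A minimal sketch of that loss in PyTorch (assuming per-sequence log-probabilities under the policy and a frozen reference model are already computed; the function name and default beta are illustrative, not taken from any particular library):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss (Rafailov et al., 2023) over a batch of preference pairs.

    Each argument is a (batch,) tensor of summed per-token log-probabilities
    for the chosen/rejected completion under the trained policy or the frozen
    reference model. `beta` scales the implicit KL-style penalty that keeps
    the policy close to the reference.
    """
    # Implicit rewards: log-probability ratios against the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Bradley-Terry classification loss: push the chosen completion's
    # implicit reward above the rejected completion's.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

The beta term plays roughly the role of the explicit KL penalty in PPO-based RLHF: larger beta keeps the policy closer to the reference model.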
I feel like this market is getting weighed down somewhat by people failing to read the resolution criteria fully: this is excluding the big labs, so afaict DPO has already more or less replaced RLHF in the relevant domain. IMO, in order for this market to resolve NO there would need to be some kind of RLHF renaissance outside of the big labs, which seems pretty unlikely.
FWIW my original prediction actually included the big labs, which is why I only said 60% confidence.
@NoraBelrose I agree completely. Mistral, DeepSeek, Qwen, EleutherAI, and Nous Research are all on DPO, it has swept the Hugging Face Open LLM Leaderboard, and it has been substantially more popular with indie developers since day 1.
It looks to me like this is de facto resolved "yes" already.
@ZviMowshowitz can you say more about the denominator here? Are you targeting the popularity of DPO in terms of the proportion of models built (anywhere; publicly known or private; commercial or non-commercial; during June 2024) that use DPO out of those that use DPO or RLHF? Models that are publicly known? Models at the tops of leaderboards? Models in ML papers? Some weighted version of the above? The number of ML engineers who are using DPO vs RLHF in June 2024? Etc.
Personally, I feel confident in YES for some of these but NO for others.
@Jacy The spirit of the question is whether DPO is being used more in practice at resolution time for models currently being trained, as best we can tell. It does not apply only to models that have already been deployed.
As noted, if this is unclear at the time, my intention is to outsource the answer to experts or a poll, as seems best.
@adjo It's about which is the more popular technique, so a hybrid approach would be weighted accordingly.
For the purpose of this question, do you consider PsiPO or IPO (http://arxiv.org/abs/2310.12036) to be "DPO variants"? PsiPO is a generalisation of DPO, and IPO is another special case of PsiPO.
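For reference, my reading of that paper's unifying objective (notation lightly adapted) is the following, with DPO and IPO recovered by particular choices of the non-decreasing map Ψ:

```latex
% \Psi PO objective (Azar et al. 2023, arXiv:2310.12036): maximize a transformed
% preference probability while regularizing toward a reference policy.
\[
\max_{\pi}\;
\mathbb{E}_{x \sim \rho,\; y \sim \pi(\cdot \mid x),\; y' \sim \mu(\cdot \mid x)}
\Bigl[ \Psi\bigl( p^{*}(y \succ y' \mid x) \bigr) \Bigr]
\;-\; \tau\, D_{\mathrm{KL}}\bigl( \pi \,\Vert\, \pi_{\mathrm{ref}} \bigr)
\]
% Taking \Psi(q) = \log\frac{q}{1-q} under a Bradley--Terry preference model
% recovers the RLHF/DPO objective; taking \Psi(q) = q (the identity) gives IPO.
```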
10k of limit orders between 60% and 85%