One interesting research programme in 2025 suggests that RL on verifiable rewards (RLVR) doesn't add new capabilities to a base model, but instead makes the capabilities it already has easier to elicit.
Implications, if so: the NVL576 bottleneck will slow down the whole field; RLVR spending won't get to amortise over a very long period; and the power of distillation and synthetic pretraining becomes a crucial question. See this comment for a nice model.
Resolution: at the end of next year, will I put >66% credence on RLVR being bottlenecked by capabilities learned during pretraining?
My current credence (Dec 2025): 30%
If you want to use a model of me as well as your model of RLVR to answer, here are some of my views.
Made a large YES bet, but I would be sad to win it, because "the biggest bottleneck is capabilities learned during pretraining" would be technically true while leading you to believe false things. Notably, even if that bottleneck exists:
1. We can pretrain models on data other than "random human webtext". The publicly shared LLM trajectories of today can become the pretraining data of tomorrow's LLMs.
2. Distillation is a thing, so it's not like you have to throw away all your hard RL work on one model when you go to pretrain its successor (see the sketch after this list).
3. Even if you did have to throw away all the compute you spent on the final RL run of the previous model, most of the work is in building / validating / debugging the RL pipeline, and you can probably reuse most of that on the successor at much lower cost.
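To make point 2 concrete, here's a minimal sketch of plain logit distillation, assuming PyTorch-style teacher/student models that share a tokenizer; the function name, temperature, and loss form are illustrative rather than anyone's actual pipeline. The RL-tuned model plays teacher and the fresh successor plays student, so the behaviour bought with RL compute carries over without re-running the RL.

```python
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-label distillation: push the student's next-token distribution
    towards the RL-tuned teacher's, so the successor inherits the teacher's
    behaviour without redoing the RL run. Names here are illustrative."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    # kl_div expects log-probs for the student and probs for the teacher;
    # the t^2 factor keeps gradient scale comparable across temperatures.
    return F.kl_div(student_logp, teacher_probs, reduction="batchmean") * (t * t)
```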
@FaulSname agree, thanks. I'm trying not to read too much into "waow base pass@128 > reasoning pass@128" and will link this comment so readers mood affiliate less.
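(For readers unfamiliar with that comparison: pass@k is usually computed with the unbiased estimator from Chen et al. 2021. Below is a minimal sketch with made-up per-problem counts, chosen only to show the pattern I don't want over-read: an RLVR model concentrating probability on problems it already solves, hence better pass@1, while the base model solves a wider set given 128 tries, hence better pass@128.)

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k (Chen et al., 2021): chance that at least one of k
    samples, drawn from n attempts of which c are correct, passes."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Made-up correct counts out of n=128 attempts on five problems (illustrative only).
base_counts = [3, 1, 0, 2, 1]    # base model: rarely right, but on more problems
rlvr_counts = [90, 70, 0, 0, 0]  # RLVR model: very reliable on a narrower set
for k in (1, 128):
    base = np.mean([pass_at_k(128, c, k) for c in base_counts])
    rlvr = np.mean([pass_at_k(128, c, k) for c in rlvr_counts])
    print(f"pass@{k}: base={base:.2f}, rlvr={rlvr:.2f}")
# pass@1:   base=0.01, rlvr=0.25
# pass@128: base=0.80, rlvr=0.40
```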
@GavinLeech I recommend adding some topics to this market to increase its discoverability.
PS, relevant: https://www.lesswrong.com/posts/ZtQD8CmQRZKNQFRd3/faul_sname-s-shortform?commentId=ZHrzKtB4p3uZ7uSNk