Does RL training with verifiers help on tasks without a verifier?

The feared / hoped-for generalisation from {training LLMs with RL on tasks with a verifier} to performing better on tasks without one remains unclear even after two years of trying.

Grok 4 was apparently a major test of scaling RLVR training. It gets excellent benchmark results and the distilled versions are actually being used at scale. But imo it is the most jagged of all models. 

(One wrinkle is that you can try to RL up some quite general task (like "thinking out loud" or "backtracking"), and this is probably what happened with the initial o1 breakthrough. I would count this as off-target improvement.)

Resolution: at the end of next year, will I put >66% credence on RLVR improving off-target capabilities?

My current credence (Dec 2025): 30%

If you want to use a model of me as well as your model of RLVR to answer, here are some of my views.
