Does RL training with verifiers help on tasks without a verifier?

The feared / hoped-for generalisation from {training LLMs with RL on tasks with a verifier} to performing better on tasks without one remains unclear even after two years of trying.

Grok 4 was apparently a major test of scaling RLVR training. It gets excellent benchmark results and the distilled versions are actually being used at scale. But imo it is the most jagged of all models. 

(One wrinkle is that you can try to RL up some quite general task (like "thinking out loud" or "backtracking"), and this is probably what happened with the initial o1 breakthrough. I would count this as off-target improvement.)

Resolution: at the end of next year, will I put >66% credence on RLVR improving off-target capabilities?

My current credence (Dec 2025): 30%

If you want to use a model of me as well as your model of RLVR to answer, here are some of my views.
