Is “frying” models with excess RL (harming off-target capabilities) real? If it is, is it just due to temporary incompetence by human scientists?
For instance, RLHF (which is not real RL) was well known to damage capabilities. Over the last few years this damage seems to have become less severe.
It's plausible to me that one reason few people use Grok, despite its good benchmarks, is overtraining or otherwise incompetent training in this way.
Resolution: at the end of next year, will I put >66% credence on off-target capabilities in the flagship OpenAI and Anthropic models being at least as harmed by RL as they are now?
My current credence (Dec 2025) in "Does RL harm off-target capabilities?": 70%
If you want to answer using a model of me as well as your model of RLVR, here are some of my views.