o1-style models like OpenAI's o1 (https://openai.com/o1/) or DeepSeek's R1 (https://x.com/deepseek_ai) are generally much better at math problems, one-off coding problems, and so on than models not trained with RL over chain-of-thought.
However, such models generally seem much worse at other things: they seem to work less well with pre-existing code, with verbal non-math tasks, with some translations, and so on. This is easily confirmed through vibes-based tests.
This question resolves positively if, on January 1st 2027, this tradeoff has been resolved: if there's a single model that can handle all the AIME and MATH questions you like, and which you would also be totally happy to have handle any other kind of question.
The question resolves negatively if, like today, on January 1st 2027 you'd prefer to use o1-like models for some large class of tasks, but non-o1-like models like Sonnet or 405B for some other class of tasks -- that is, if it's the consensus (basically) that there is a large tradeoff between different classes of models.
Basically:
- Positive -- RL vs. non-RL tradeoff has been eliminated (even if Anthropic's models are still better at, say, scholastic philosophy than OpenAI's).
- Negative -- RL vs. non-RL tradeoff has not been eliminated.
(This is one way of trying to operationalize future predictions of "superintelligence".)