On December 20, 2024, OpenAI reported that their o3 reasoning model scored 25.2% on EpochAI's Frontier Math benchmark. For context, AI models like GPT-4 and Gemini score around 2%. Will a Chinese-made AI model surpass that score in 2025?
Resolution Criteria
This market will resolve YES if:
A Chinese company, university, or government entity reports an AI model (e.g. DeepSeek or Qwen) scoring higher than 25.2% on Frontier Math in 2025
The score is publicly announced and independently verified by EpochAI
The market will resolve NO if:
No Chinese-developed AI model surpasses 25.2% on Frontier Math in 2025
It eventually comes out that a Chinese model created in 2025 surpasses 25.2% on Frontier Math, but this wasn't widely known as of the end of 2025
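As a rough illustration, the criteria above can be sketched as a small decision rule. The `Report` fields and function names are hypothetical, made up for this example; they are not anything Epoch AI or the market platform uses. The sketch assumes the announcement must happen in 2025 and does not model the benchmark-change case covered in the notes.

```python
from dataclasses import dataclass
from typing import Iterable

# o3's reported December 2024 score, in percent (from the market description).
O3_BASELINE_PCT = 25.2


@dataclass
class Report:
    """Hypothetical record of one reported Frontier Math result."""
    chinese_made: bool        # development primarily conducted by a Chinese entity
    score_pct: float          # reported score, in percent
    announced_year: int       # year the score was publicly announced
    verified_by_epoch: bool   # independently verified by Epoch AI


def qualifies(report: Report) -> bool:
    """True if this single report satisfies every YES condition."""
    return (
        report.chinese_made
        and report.score_pct > O3_BASELINE_PCT   # strictly higher than 25.2%
        and report.announced_year == 2025        # assumption: must be announced in 2025
        and report.verified_by_epoch
    )


def resolve(reports: Iterable[Report]) -> str:
    """YES if any report qualifies, otherwise NO."""
    return "YES" if any(qualifies(r) for r in reports) else "NO"
```

Under this sketch, a score that only becomes known after 2025, or one that ties 25.2% exactly, still resolves NO.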
Other Notes
This market is based on o3's December score of 25.2%. If o3 later surpasses that (for instance, by re-running with more inference compute), the new score won't supersede this one
If there's any uncertainty as to whether a model is "Chinese-made," I'll add clarifications as I see fit. Generally, I'll consider any model whose development was primarily conducted by a Chinese entity to be "Chinese-made"
Models may use any architecture and any amount of compute. I'm also including models that are specifically designed for math or research, not just general LLMs
If Frontier Math changes their benchmark (for instance, by adding a fourth tier of problems), I'll use my best judgement for doing an apples-to-apples comparison. If it doesn't seem possible to fairly compare results, I'll resolve the market at the current price
The model doesn't need to be publicly available, but the score needs to be publicly announced + verified
@Fay42 Would you like to take a larger NO position? I set a limit order at 55%
@Fay42 Curious whether you were taking a no position bc you thought the math models wouldn’t improve fast enough outside OAI, or because you thought they wouldn’t be open-sourced
Combination of both - though less about OpenAI specifically and more about American vs Chinese speeds on frontier benchmarks. I still think DeepSeek is in a plausibly bad spot with the new export restrictions, but there's a substantial lag between export restrictions taking effect and the time at which they impact models (since it takes time to get, install, and use GPUs).
@Fay42 I think it's very likely that the difference in compute requirements between o1 and o3 was small enough that DeepSeek could probably beat o3 on FrontierMath this year with literally no additional compute. (In principle, in terms of capabilities; but if the model is open-sourced, I see no reason why Epoch shouldn't test it)
@AdamK It's plausible that doing the o3 eval cost hundreds of thousands of dollars, in which case Epoch would need to be willing to spend a lot on doing the FrontierMath eval themselves. I agree that it's plausible DeepSeek has enough compute to make an o3 equivalent already.
@Fay42 Sure, but the o-series RL paradigm is nowhere close to being scaled. I'm willing to bet that both OAI and DeepSeek will be spending 1-2 OOMs more compute than o3 on RL for individual models by the end of the year. The next reasoning model DeepSeek makes might be comparable to o3 with heavy inference, but the one after won't need nearly as much.
@AdamK I'd bet against DeepSeek doing 1-2 OoMs more than o3 within a year, but idk how to resolve such a bet. And note that they have to spend the compute, train the model, and then have its inference be possibly an OoM cheaper for the same o3 level results. Though, there are a bunch of other possible paths to a Yes resolution on this market so idk.
@Fay42 I'm also not sure how to resolve. I do think you're underestimating how much compute DeepSeek has or will have, overestimating how much RL compute it likely took to make o3, or both
@TamayBesiroglu of Epoch AI on Twitter:
https://x.com/deliprao/status/1880946518980469081?t=2H1RvFazcl8ce0dxOluBSw&s=19
@TamayBesiroglu @ElliotGlazer Would be curious to hear if you have a policy (in mind or publicly stated somewhere) for which models will be evaluated on Frontier Math? It might be nice to commit to evaluating e.g. the apparent SotA open-source LLM on a quarterly basis.
Sorry, who says that EpochAI will even share their problems with Chinese AI companies? Trump is about to be President. China-US relations are probably not good and will likely get worse. People are concerned about fraud and such. Epoch might not trust Chinese companies not to leak the problems.
@nathanwei I agree that this is the most plausible path to a NO resolution. I do think there is a very high chance that a Chinese AI will exist before 2026 that is in principle capable of beating o3's score; the main question is how they would interface with Epoch