This question is meant to measure the gap between solving the main math benchmarks that existed at the time of market creation and contributing to real-world mathematics.
FrontierMath Tier-4 is an even harder version of FrontierMath; do we need something harder still to fully close the benchmark gap?
I will accept the AI being a (co-)first author, or an AI being credited with significant contributions both to deciding what to prove and to the actual proof (merely contributing to the proof is not enough; I am trying to get at "the AI does the work of a mathematician", not "the AI does the work of a proof assistant"). I would also accept, for instance, the human author of the paper stating that they would have named the AI as a co-first author if it were human, or saying that the result could not have been obtained without the assistance of the AI.
If a model publishes a paper before it achieves this score, I'll resolve to the 0 bucket.
Update 2025-07-16 (PST) (AI summary of creator comment): In response to user feedback, the creator has acknowledged that the resolution criterion "or saying that the result could not have been obtained without the assistance of the AI" may be interpreted differently from what its literal wording implies.