The best performance by an AI system on FrontierMath as of December 31st 2026.
Which AI systems count?
Any AI system counts if it operates within realistic deployment constraints and doesn't have unfair advantages over human baseliners.
Tool assistance, scaffolding, and any other inference-time elicitation techniques are permitted as long as:
There is no systematic unfair advantage over the humans described in the Human Performance section (e.g. AI systems are allowed to have multiple outputs autograded while humans aren't, or AI systems have access to the internet when humans don't).
Having the AI system complete the task does not use more compute than could be purchased with the wages needed to pay a human to complete the same task to the same level.
The PASS@k elicitation technique (which automatically grades and chooses the best out of k outputs from a model) is a common example that we do not accept on this benchmark because mathematicians are generally evaluated on their ability to generate a single correct answer, not multiple answers to be automatically graded. So PASS@k would constitute an unfair advantage.
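As a rough illustration of the difference between grading a single answer and PASS@k, here is a minimal sketch. The function names, the toy grader, and the sample answers are all hypothetical and are not taken from any FrontierMath evaluation harness.

```python
from typing import Callable, Sequence


def pass_at_1(answers: Sequence[str], is_correct: Callable[[str], bool]) -> bool:
    """Grade only the single submitted answer (here, the first sample)."""
    return is_correct(answers[0])


def pass_at_k(answers: Sequence[str], is_correct: Callable[[str], bool], k: int) -> bool:
    """Autograde the first k samples; the problem counts as solved if any one is correct."""
    return any(is_correct(a) for a in answers[:k])


# Hypothetical samples for one problem whose correct answer is "42".
samples = ["41", "42", "40"]
grader = lambda a: a == "42"

print(pass_at_1(samples, grader))       # only the first sample is graded
print(pass_at_k(samples, grader, k=3))  # any of the three samples may count
```

In this toy case the single-answer score fails while PASS@k succeeds, because the grader, not the system, picks the correct sample. That selection step is the advantage the criteria exclude.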
If there is evidence of training contamination leading to substantially increased performance, scores will be accordingly adjusted or disqualified.
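The compute condition in the list above amounts to a simple cost comparison. A minimal sketch follows, assuming both costs can be expressed in dollars per problem. The function name and the dollar figures are hypothetical placeholders, not estimates for any real system.

```python
def within_compute_budget(ai_compute_cost_usd: float, human_wage_cost_usd: float) -> bool:
    """Return True if the AI run costs no more than paying a human for the same task."""
    return ai_compute_cost_usd <= human_wage_cost_usd


# Placeholder figures for illustration only.
print(within_compute_budget(ai_compute_cost_usd=500.0, human_wage_cost_usd=1_000.0))
print(within_compute_budget(ai_compute_cost_usd=5_000.0, human_wage_cost_usd=1_000.0))
```

In practice both inputs are hard to pin down, since compute spend is often undisclosed and the human wage depends on who is hired and for how long.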
(Much of the resolution is modified from AI Digest's excellent
/Manifold/what-will-be-the-best-performance-o-A58Ld8LZZL )
@mathvc I'll at least give it some hours and possibly a few days to make sure it's best to resolve YES now or when it releases or whatever. I'm not sure yet tbc but I think it would be weird to resolve YES in 5 minutes off of the announcement. idk what yall think about this
@mathvc I think that's incorrect. The dark blue bar seems to indicate a pass@k evaluation technique, see here. In the video, they say "in aggressive test time settings we're able to get over 25%", which suggests that they're using an evaluation technique that's much more permissive than the one that Epoch had in mind. Eyeballing this chart, I think o3 gets about 6% on FrontierMath.
@EricNeyman no. Aggressive test-time setting means that they generate longer chains of thought. Possibly 100 times longer. But they give only one answer and the answer/proof is correct
@Bayesian Cool -- if so, my next question is whether this criterion is satisfied:
"Having the AI system complete the task does not use more compute than could be purchased with the wages needed to pay a human to complete the same task to the same level"
My friend speculates that they spent about $20k/problem, but that's just speculation. That's probably more than you'd need to pay a mathematician (especially for the easier problems, which are presumably the ones that o3 got right).
Do you think someone at OpenAI would be willing to tell us how much compute they used?
@EricNeyman I think they wanted to keep it private so I'd guess they won't say, but yeah $20k+ seems reasonable. hmmm
@jim this inference strategy is called majority voting. It is quite different from pass@1. OpenAI did pass@1
@jim I don't think it's true that it would cost much more than $20k/problem to get a human to get 25%+ on FrontierMath. I suspect a top grad student could be paid $1k/problem (a week of work) or less to get 25%.
@Bayesian I'm not accusing anyone of cheating. It's just very hard to make guarantees about what's in your training set when it consists of the entire internet. This kind of leakage isn't new; it's rather the standard thing to expect, and has happened many times before, see e.g. https://arxiv.org/abs/2410.05229v1
IMO one of the most exciting and challenging frontiers in AI is how to make robust evals for these models. Any fixed, finite set of O(1,000) or even O(10,000) questions is dead in the water in my view. We need the ability to generate billions of questions of a certain format at will, like in the GSM-Symbolic paper linked above. This is very hard; how do you generate billions of advanced math questions that are still solvable by human mathematicians in O(days)?
@pietrokc These benchmarks are literally private and known only to the few people who make them, which makes them very secure
@jim a 14k limit order at 50%@YES, pinging in case you're interested