Will any AI reach 20%+ performance on FrontierMath by December 31st 2026?
96% chance (closes 2027; Ṁ260k traded)

The best performance by an AI system on FrontierMath as of December 31st 2026.

Which AI systems count?

Any AI system counts if it operates within realistic deployment constraints and doesn't have unfair advantages over human baseliners.

Tool assistance, scaffolding, and any other inference-time elicitation techniques are permitted as long as:

  • There is no systematic unfair advantage over the humans described in the Human Performance section (e.g. AI systems having multiple outputs autograded while humans don't, or AI systems having access to the internet when humans don't).

  • Having the AI system complete the task does not use more compute than could be purchased with the wages needed to pay a human to complete the same task to the same level.

The pass@k elicitation technique (which automatically grades k outputs from a model and picks the best one) is a common example that we do not accept on this benchmark, because mathematicians are generally evaluated on their ability to produce a single correct answer, not multiple answers to be automatically graded. Pass@k would therefore constitute an unfair advantage.
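To make the distinction concrete, here is a minimal sketch of pass@k versus single-answer (pass@1) evaluation. The names `model_answer` and `grade` are hypothetical placeholders standing in for a model sampler and an autograder, not any real API:

```python
def pass_at_k(problem, k, model_answer, grade):
    """Sample k candidate answers and count the problem as solved
    if ANY of them passes the autograder (best-of-k selection)."""
    return any(grade(problem, model_answer(problem)) for _ in range(k))

def pass_at_1(problem, model_answer, grade):
    """Single-attempt evaluation: one answer is produced and graded,
    with no best-of selection — the regime this market requires."""
    return grade(problem, model_answer(problem))
```

The asymmetry is clear from the sketch: under pass@k the autograder does selection work for the model, which a human solver answering once never benefits from.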

If there is evidence of training contamination leading to substantially increased performance, scores will be accordingly adjusted or disqualified.

(Much of the resolution criteria is modified from AI Digest's excellent market: /Manifold/what-will-be-the-best-performance-o-A58Ld8LZZL)

bought Ṁ60 YES

Can resolve YES if o3 confirmed to have 25% as stated by OpenAI

@mathvc I'll at least give it some hours, and possibly a few days, to make sure whether it's best to resolve YES now or when it releases. I'm not sure yet, to be clear, but I think it would be weird to resolve YES within 5 minutes of the announcement. idk what y'all think about this

@Bayesian I agree it makes sense to wait a bit. Likewise unsure how much that "a bit" is.

bought Ṁ150 NO

@mathvc I think that's incorrect. The dark blue bar seems to indicate a pass@k evaluation technique, see here. In the video, they say "in aggressive test time settings we're able to get over 25%", which suggests that they're using an evaluation technique that's much more permissive than the one that Epoch had in mind. Eyeballing this chart, I think o3 gets about 6% on FrontierMath.

@EricNeyman no. Aggressive test-time settings means that they generate longer chains of thought, possibly 100 times longer. But they give only one answer, and the answer/proof is correct

@mathvc do you have a source for the claim that they only give one answer?

@Bayesian Cool -- if so, my next question is whether this criterion is satisfied:

"Having the AI system complete the task does not use more compute than could be purchased with the wages needed to pay a human to complete the same task to the same level"

My friend speculates that they spent about $20k/problem, but that's just speculation. That's probably more than you'd need to pay a mathematician (especially for the easier problems, which are presumably the ones that o3 got right).

Do you think someone at OpenAI would be willing to tell us how much compute they used?

@EricNeyman I think they wanted to keep it private so I'd guess they won't say, but yeah $20k+ seems reasonable. hmmm

It would cost much more than that to get a human to get 25%+ on FrontierMath, especially under the condition that they had to do the same ones o3 did.

btw i presume what they did is run the model like 100 times per question and submit the most frequently outputted result

@jim This inference strategy is called majority voting. It is quite different from pass@1. OpenAI did pass@1
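For reference, a minimal sketch of majority voting (also called self-consistency), again with a hypothetical `model_answer` sampler. Unlike pass@k, only one final answer — the most frequent one — is ever submitted for grading:

```python
from collections import Counter

def majority_vote(problem, k, model_answer):
    """Sample k answers and submit only the most frequent one.
    No autograder picks winners among the samples, so a single
    final answer is graded, as with a human solver."""
    samples = [model_answer(problem) for _ in range(k)]
    return Counter(samples).most_common(1)[0][0]
```

This is why majority voting is usually considered closer to pass@1 than to pass@k: the selection among samples uses no knowledge of the correct answer.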

@mathvc ok thanks

@jim I don't think it's true that it would cost much more than $20k/problem to get a human to get 25%+ on FrontierMath. I suspect a top grad student could be paid $1k/problem (a week of work) or less to get 25%.

@EricNeyman I agree

Actually pass? No, <10%. Pass because the questions were inevitably leaked online? Yeah probably, >50%

bought Ṁ50,000 YES

@pietrokc I think it's very very unlikely they cheated like that

@Bayesian I'm not accusing anyone of cheating. It's just very hard to make guarantees about what's in your training set when it consists of the entire internet. This kind of leakage isn't new; it's rather the standard thing to expect, and has happened many times before, see e.g. https://arxiv.org/abs/2410.05229v1

IMO one of the most exciting and challenging frontiers in AI is how to make robust evals for these models. Any fixed, finite set of O(1,000) or even O(10,000) questions is dead in the water in my view. We need the ability to generate billions of questions of a certain format at will, like in the GSM-Symbolic paper linked above. This is very hard; how do you generate billions of advanced math questions that are still solvable by human mathematicians in O(days)?
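A toy sketch of the template-instantiation idea from GSM-Symbolic: a fixed symbolic template is filled with fresh random values each time, so the answer key is computed rather than memorizable. The one-line template here is made up for illustration; real benchmarks would need far richer templates:

```python
import random

def make_question(rng):
    """Instantiate a toy word-problem template with fresh numbers,
    in the spirit of GSM-Symbolic's symbolic templates. Returns the
    question text and its computed ground-truth answer."""
    a, b = rng.randint(2, 9), rng.randint(2, 9)
    question = f"A shelf holds {a} boxes with {b} books each. How many books in total?"
    return question, a * b
```

The hard part pietrokc points at is exactly that this trick is easy for grade-school arithmetic but unsolved for research-level mathematics, where correct answers can't be computed from a template.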

@pietrokc These benchmarks are private and known only to the few people who make them, which keeps them very secure

opened a Ṁ5,000 YES at 40% order

jim..?

opened a Ṁ4,000 YES at 50% order

@jim a 14k limit order at 50%@YES, pinging in case you're interested
