Will an AI score over 30% on the FrontierMath benchmark in 2025?
Resolved YES (Feb 20)

"Today we're launching FrontierMath, a benchmark for evaluating advanced mathematical reasoning in AI. We collaborated with 60+ leading mathematicians to create hundreds of original, exceptionally challenging math problems, of which current AI systems solve less than 2%.
Existing math benchmarks like GSM8K and MATH are approaching saturation, with AI models scoring over 90%—partly due to data contamination. FrontierMath significantly raises the bar. Our problems often require hours or even days of effort from expert mathematicians.
We evaluated six leading models, including Claude 3.5 Sonnet, GPT-4o, and Gemini 1.5 Pro. Even with extended thinking time (10,000 tokens), Python access, and the ability to run experiments, success rates remained below 2%—compared to over 90% on traditional benchmarks."



@sponge They reached 32% with o3-mini using a Python tool, so this can resolve YES.

@sponge Any reason not to resolve this yet?

@mods I believe this should be resolved.

I'm confused by OpenAI's claims vs. the original FrontierMath paper. They claim 5.8% pass@1 for o1-mini (https://openai.com/index/openai-o3-mini/), while the FrontierMath paper had it at under 2%. Pass@8 is nearly 13%, implying this test is much "easier" than the original paper claims.

Are these evaluations being done consistently?
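
For context on the pass@1 vs. pass@8 gap: pass@k is typically reported with the unbiased estimator from the Codex paper (Chen et al., 2021), under which sampling more attempts per problem mechanically raises the score even when the model is unchanged. A minimal sketch, assuming that convention (the numbers below are illustrative, not FrontierMath results):

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k estimator (Chen et al., 2021): the chance
    # that at least one of k attempts, drawn without replacement
    # from n total samples of which c are correct, succeeds.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers only (not FrontierMath data): a 13%
# per-problem solve rate gives pass@1 = 0.13 but pass@8 ~ 0.69,
# so pass@8 scores naturally run well above pass@1.
print(pass_at_k(100, 13, 1))  # 0.13
print(pass_at_k(100, 13, 8))  # ~0.69

So a pass@8 number being roughly double pass@1 is consistent with the estimator alone, but it would not explain a pass@1 jump from under 2% to 5.8% unless the evaluation setup (tooling, token budget, or problem subset) also changed.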
