What will be the best AI performance on Humanity's Last Exam by December 31st 2025?
Current answer probabilities:

0-10%: 1.1%
10-20%: 11%
20-30%: 15%
30-40%: 12%
40-50%: 11%
50-60%: 10%
60-70%: 11%
70-80%: 11%
80-90%: 11%
90-100%: 7%

This market is duplicated from and inspired by

/Manifold/what-will-be-the-best-performance-o-nzPCsqZgPc

The best performance by an AI system on the new Last Exam benchmark as of December 31st 2025.
https://lastexam.ai/


Resolution criteria

Resolves to the best AI performance on the multimodal version of the Last Exam. This resolution will use https://scale.com/leaderboard/humanitys_last_exam as its source, if it remains up to date at the end of 2025. Otherwise, I will use my discretion in determining whether a result should be considered valid.

If the number reported is exactly on a boundary (e.g. 10%), then the higher choice will be used (i.e. 10-20%).
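For concreteness, here is a minimal sketch (in Python, purely illustrative and not part of the market mechanics) of how a reported score would map to an answer bucket under that boundary rule; the function name and the choice to keep a hypothetical exact 100% score in the top bucket are my own assumptions.

def resolve_bucket(score: float) -> str:
    # Map a reported HLE score (in percent) to this market's answer bucket.
    # Per the resolution criteria, a score exactly on a boundary (e.g. 10)
    # goes to the higher bucket (10-20%), so each bucket covers [low, high),
    # except the top bucket, which also includes exactly 100.
    for low in range(0, 100, 10):
        high = low + 10
        if low <= score < high or (high == 100 and score == 100):
            return f"{low}-{high}%"
    raise ValueError("score must be between 0 and 100")

# e.g. resolve_bucket(9.9) -> '0-10%', resolve_bucket(10.0) -> '10-20%'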

See also:
/Bayesian/will-o3s-score-on-the-last-exam-be

/Bayesian/which-of-frontiermath-and-humanitys

bought Ṁ50 YES

Anything 30+ is honestly scary territory. FrontierMath is really impressive and all, and no doubt surprising, but kinda an "oh, that happened" type thing. This is the kind of test that would make me seriously reconsider my beliefs about AGI. Great market!

ooo maybe a market on which of those two is solved at 80% or above first

@Bayesian sounds like an incentive to finetune my deepseek-giga-overfitter-hle-memorized-v1 model by EOY

@Ziddletwix yeah but that would be CHEATING! and the leaderboard thing would CATCH IT

@Bayesian most likely, but maybe they'll just put an asterisk and scold it in a footnote for being sus & bad. unclear how enforcement is actually handled in practice

fkkkk they might put the footnote saying it's sus affffff then what are we gonna do

@copiumarc I don’t think HLE is harder than Frontier Math.

/Bayesian/which-of-frontiermath-and-humanitys

@mathvc @copiumarc may the person with the best model of reality win

bought Ṁ250 NO

Surely o3 will get >20%?

@qumeric if the benchmark is knowledge heavy it might not do that much better than 4o? prolly will tho. just some low chance that it doesn't

“The dataset consists of 3,000 challenging questions across over a hundred subjects. We publicly release these questions, while maintaining a private test set of held out questions to assess model overfitting.”

Well sorry but people are gonna overfit to this. Who is gonna judge whether the model is overfitted or not?

@mathvc yes i am confused by this point. so if some model near EOY is massively overfit to HLE, scores 90%+, and they chime in "yeah its performance wasn't so crazy strong on our few holdout problems, it probably overfit a bit", that still counts as 90%+ right? is the holdout set just used as a separate confirmation of overfitting, and it's not incorporated into the main score?

i agree this is troubling. What do you think would be the best way to proceed?

@Bayesian i found that scale.ai and safe.ai partnered to create this benchmark and it seems that they keep up-to-date evaluations of all frontier models:

https://scale.com/leaderboard/humanitys_last_exam

I guess we can trust their judgment? That is, they will not put a clearly overfitted model on the leaderboard, since that would make the leaderboard useless.

@Bayesian i dunno i think all benchmarks have caveats so i'd just pick some source for what each model has achieved on the benchmark & if their screener for overfitting is weak that's kinda priced in

@mathvc yeah i agree, that probably works well enough. will add to the description

opened a Ṁ25,000 NO at 50% order

