What will be the best AI performance on Humanity's Last Exam by December 31st 2025?


resolved Jan 3

Answer	Probability
30-40%	97% (resolved 100%)
0-10%	0.1%
10-20%	0.2%
20-30%	0.2%
40-50%	0.7%
50-60%	0.6%
60-70%	0.4%
70-80%	0.4%
80-90%	0.3%
90-100%	0.2%

This market is duplicated from and inspired by

/Manifold/what-will-be-the-best-performance-o-nzPCsqZgPc

The best performance by an AI system on the new Last Exam benchmark as of December 31st 2025.
https://lastexam.ai/

Resolution criteria

Resolves to the best AI performance on the multimodal version of the Last Exam. This resolution will use https://scale.com/leaderboard/humanitys_last_exam as its source, if it remains up to date at the end of 2025. Otherwise, consensus of reliable sources may be used (or Moderator consensus).

If the reported number falls exactly on a boundary (e.g., 10%), the higher choice will be used (i.e., 10-20%).
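The boundary rule above can be sketched in a few lines. This is only an illustration of the stated rule, not part of the market's official resolution process; the function name `resolution_bucket` is hypothetical, and placing a score of exactly 100% in the top bucket is an assumption, since the description does not cover that case.

```python
def resolution_bucket(score_percent: float) -> str:
    """Map a leaderboard score (0-100) to one of the market's 10-point answers."""
    if not 0 <= score_percent <= 100:
        raise ValueError("score must be between 0 and 100")
    # Floor division sends an exact boundary such as 10.0 to the higher
    # bucket (10-20%), matching the rule in the description.
    # Assumption: a score of exactly 100 stays in the top bucket (90-100%).
    lower = min(int(score_percent // 10) * 10, 90)
    return f"{lower}-{lower + 10}%"


print(resolution_bucket(10.0))   # boundary case goes to the higher bucket
print(resolution_bucket(39.9))
```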

/Bayesian/which-of-frontiermath-and-humanitys




As a methodological note, for these benchmarks I would prefer to have a single market which resolves to a %.

@mr_mino if people disagree between prob of a 39% score vs 41% score a year from now there’s only a 1-2% yearly return to correcting the market under that format, personally I find it very bad for price discovery

@bayesianbot that's true, but this format also means that if people disagree between a prob of 41% and 43% there is a 0% yearly return to correcting the market. I mostly care about the EV of such markets, but people might differ in this preference.

@Bayesian Just to verify, do you consider the Scale.ai leaderboard to be up to date? Other than the missing Grok 4 score, that's what I am assuming. Are you counting the unconfirmed result of Deepthink of 34.8% no tools that was posted by Google? Also, are you counting the tool use variants?

@Jolliest hmmmmm it is curious that grok 4 is missing from the leaderboard. If it wasn't for that i'd be sure it's up to date but I have to reserve judgement right now bc i can't really tell. but more recent models are present so idk i'll preliminarily say it's up to date. the intent is definitely to prioritize only considering scaffolds + ai systems allowed under the scale ai leaderboard section.

Are you counting the unconfirmed result of Deepthink of 34.8% no tools that was posted by Google?

I am not counting this unless it goes on the leaderboard, at this time

0-10% can probably be resolved no

@redcathode Unfortunately we can’t resolve an option early in the case of multichoice markets

@Bayesian ah, thanks anyway

Currently this market has an expected average score of 53.6 which I think is quite high. Especially given how the neural scaling laws seem to be coming home to roost.
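One way to arrive at an expected score like the one quoted above is to weight each answer bucket's midpoint by its market probability. The sketch below assumes that midpoint convention, which the comment does not confirm; the function name `expected_score` and the example probabilities are hypothetical.

```python
from typing import Mapping, Tuple


def expected_score(bucket_probs: Mapping[Tuple[int, int], float]) -> float:
    """Probability-weighted mean of bucket midpoints, in percentage points."""
    total = sum(bucket_probs.values())
    if total <= 0:
        raise ValueError("probabilities must sum to a positive number")
    # Assumption: each bucket is represented by its midpoint (30-40% -> 35).
    weighted = sum(((lo + hi) / 2) * p for (lo, hi), p in bucket_probs.items())
    # Normalize, because displayed market probabilities rarely sum to exactly 1.
    return weighted / total


# Hypothetical probabilities, for illustration only.
example = {(30, 40): 0.5, (40, 50): 0.5}
print(expected_score(example))
```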

lots of arb possible with my market https://manifold.markets/jim/when-will-humanitys-last-exam-be-sa

Are models using tools and/or performing web search eligible under current resolution criteria?

@Metastable I think no matter what tools you add to AI, it still remains AI

@mathvc ability to livechat with human experts?

@jim with the exception of using humans 🙂, then it’s definitely not artificial intelligence

@mathvc live access to web is maybe somewhere on that spectrum tho

@jim why do you call it live access? It doesn’t go to a math forum and make a post about math problems.

You can replicate internet access by scraping it and using it as a giant database

@mathvc yeah but using a giant database or the web seems like it's less reliant on the AI model's innate knowledge and intelligence, more reliant on human knowledge and intelligence.

i've edited the market description a bit to not be dependent on my own discretion for what model counts or doesn't count. now it uses a consensus of reliable sources or moderator consensus, instead of my own opinion. 🤷‍♂️ probably won't come up anyway but i realized i was amassing a decent position so

Anything 30+ is honestly scary territory. FrontierMath is really impressive and all, and no doubt surprising, but kinda an oh that happened type thing. This is the kind of test that would make me seriously reconsider my beliefs about AGI, great market!

ooo maybe a market on which of those two is solved at 80% or above first

@Bayesian sounds like an incentive to finetune my deepseek-giga-overfitter-hle-memorized-v1 model by EOY

@Ziddletwix yeah but that would be CHEATING! and the leaderboard thing would CATCH IT

@Bayesian most likely, but maybe they'll just put an asterisk and scold it in a footnote for being sus & bad. unclear how enforcement is actually handled in practice