Will AI achieve 95% or higher on the Humanity's Last Exam benchmark before 2028?
40% chance

Background

Humanity's Last Exam (HLE) is a benchmark designed to evaluate AI systems' reasoning and problem-solving capabilities across a wide range of academic disciplines, including mathematics, humanities, and natural sciences. Developed collaboratively by the Center for AI Safety and Scale AI, HLE comprises 3,000 unambiguous and verifiable academic questions contributed by nearly 1,000 subject-matter experts from over 500 institutions across 50 countries. The dataset is multimodal: approximately 10% of the questions require both image and text comprehension, while the remaining 90% are text-only.

As of early 2025, state-of-the-art AI models have demonstrated limited success on the HLE benchmark. For instance, OpenAI's o3-mini (high) model achieved an accuracy of 13% when evaluated solely on text-based questions. OpenAI's Deep Research agent, which leverages the o3 model for extensive web browsing and data analysis, reached an accuracy of 26.6% on the HLE benchmark.

Resolution Criteria

This question resolves to YES if a fully automated AI system achieves an average accuracy score of 95% or higher on the Humanity's Last Exam benchmark before January 1, 2028.

• Verification: The score must be verified by credible sources such as peer-reviewed research papers, arXiv preprints, or independent evaluations from reputable AI research institutions.

• Autonomy: The AI must solve problems without any human intervention, external assistance, or reliance on pre-existing solution datasets.

• Compute Resources: There is no limit on computational resources; AI systems may use as much compute as they need to attempt solutions.
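
The criteria above can be read as a simple conjunction of checks. The following is a minimal sketch of that reading, not an official resolution procedure: the `ReportedResult` type, its field names, and the example values are all hypothetical, and whether a given result counts as "verified" or "autonomous" remains a judgment call for whoever resolves the market.

```python
from dataclasses import dataclass
from datetime import date

# Values taken from the resolution criteria.
THRESHOLD = 0.95               # average accuracy of 95% or higher
DEADLINE = date(2028, 1, 1)    # result must come before January 1, 2028


@dataclass
class ReportedResult:
    """Hypothetical record of a claimed HLE score (illustrative only)."""
    accuracy: float            # fraction of HLE questions answered correctly
    reported_on: date          # date the score was achieved or reported
    credibly_verified: bool    # paper, preprint, or independent evaluation
    fully_autonomous: bool     # no human help, no pre-existing solution sets


def resolves_yes(result: ReportedResult) -> bool:
    """Return True only if every stated criterion is met.

    Compute is deliberately not checked: the criteria place no limit on it.
    """
    return (
        result.accuracy >= THRESHOLD
        and result.reported_on < DEADLINE
        and result.credibly_verified
        and result.fully_autonomous
    )


# Hypothetical examples; dates and flags are placeholders, not real reports.
early_2025_level = ReportedResult(0.266, date(2025, 3, 1), True, True)
late_but_high = ReportedResult(0.97, date(2028, 1, 1), True, True)
qualifying = ReportedResult(0.95, date(2027, 12, 31), True, True)

print(resolves_yes(early_2025_level))  # below the 95% threshold
print(resolves_yes(late_but_high))     # on the deadline, so not "before" it
print(resolves_yes(qualifying))        # meets every criterion
```

Note that the deadline comparison is strict: under this reading, a score reported on January 1, 2028 itself would not count.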
