Background
Humanity's Last Exam (HLE) is a benchmark designed to evaluate AI systems' reasoning and problem-solving capabilities across a wide range of academic disciplines, including mathematics, humanities, and natural sciences. Developed collaboratively by the Center for AI Safety and Scale AI, HLE comprises 3,000 unambiguous and verifiable academic questions contributed by nearly 1,000 subject-matter experts from over 500 institutions across 50 countries. The dataset is multimodal: approximately 10% of the questions require both image and text comprehension, while the remaining 90% are text-only.
As of early 2025, state-of-the-art AI models have demonstrated limited success on the HLE benchmark. For instance, OpenAI's o3-mini (high) model achieved an accuracy of 13% when evaluated solely on text-based questions. OpenAI's Deep Research agent, which leverages the o3 model for extensive web browsing and data analysis, reached an accuracy of 26.6% on the HLE benchmark.
Resolution Criteria
This question resolves to YES if a fully automated AI system achieves an average accuracy score of 95% or higher on the Humanity's Last Exam benchmark before January 1, 2028.
• Verification: The score must be verified by credible sources such as peer-reviewed research papers, arXiv preprints, or independent evaluations from reputable AI research institutions.
• Autonomy: The AI must solve problems without any human intervention, external assistance, or reliance on pre-existing solution datasets.
• Compute Resources: There is no limitation on computational resources; AI systems can use unlimited resources to attempt solutions.
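For concreteness, the sketch below shows how the 95% average-accuracy threshold could be checked from per-question grading results. This is only an illustrative assumption, not an official HLE grading script; the `results` list and the `meets_resolution_threshold` function are hypothetical names introduced here for illustration.

```python
# Minimal sketch (assumed, not the official HLE grader): check whether the
# fraction of correctly answered questions meets the resolution threshold.

def meets_resolution_threshold(results, threshold=0.95):
    """results: list of booleans, one per HLE question (True = graded correct)."""
    accuracy = sum(results) / len(results)
    return accuracy >= threshold

# Example: 2,850 correct out of 3,000 questions -> 95.0% accuracy -> would resolve YES.
example = [True] * 2850 + [False] * 150
print(meets_resolution_threshold(example))  # True
```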