Will there be an LLM which scores above what a human can do in 2 hours on METR's eval suite before 2026?
95% chance
METR has found that current frontier models get a score on their autonomy benchmark roughly similar to a human who is given 30 minutes. Will at least one model score at the level of a human given 2 hours by 2026?
Clarifications:
I will try to resolve this market in accordance with the current task suite. If METR makes the suite harder or easier, I will try to account for this in the resolution of this market.
If I am not able to determine the performance of frontier models at the end of 2025, this market will be resolved N/A.
This question is managed and resolved by Manifold.
Related questions
Will any LLM outrank GPT-4 by 150 Elo in LMSYS chatbot arena before 2025?
13% chance
In 2024, will METR or Google announce the results of a METR eval on a Google LLM?
72% chance
Will an LLM be able to solve the Self-Referential Aptitude Test before 2025?
19% chance
Will an LLM be able to solve the Self-Referential Aptitude Test before 2027?
66% chance
Will an LLM be able to match the ground truth >85% of the time when performing PII detection by 2024 end?
84% chance
LLM Hallucination: Will an LLM score >90% on SimpleQA before 2026?
60% chance
Will LLMs be better than typical white-collar workers on all computer tasks before 2026?
27% chance
Will we see improvements in the TruthfulQA LLM benchmark in 2024?
74% chance
Will there be any simple text-based task that most humans can solve, but top LLMs can't? By the end of 2026
63% chance
Will a publicly-available LLM achieve gold on IMO before 2026?
51% chance