Will there be an LLM which scores above what a human can do in 2 hours on METR's eval suite before 2026?
57% chance

METR has found that current frontier models score on its autonomy benchmark at roughly the level of a human given 30 minutes. Will at least one model score at the level of a human given 2 hours before 2026?

Clarifications:

  1. I will try to resolve this market according to the current task suite. If METR makes the suite harder or easier, I will try to account for that change when resolving.

  2. If I am not able to determine the performance of frontier models at the end of 2025, this market will resolve N/A.
