Resolves YES if any model surpasses a 50% time-horizon of 4h 49m on https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/ by the end of February.
The model must be scored by end of month (not just released by end of month).
Update 2026-01-22 (PST) (AI summary of creator comment): The market resolves YES if the official METR long tasks SOTA goes up (beyond 4h 49m), even if it's from a model which has already been tested. If METR releases a new test suite and uses it to update the values on the existing METR long tasks graph, and as a result one of the models gets a time-horizon greater than 4h 49m, this would resolve YES.
@Bayesian that would possibly meet the res criteria and could count, something similar did in:
but it would depend on the specifics
@Bayesian ok fair i just woke up and misread your message to begin with, anyway if it's a different suite it just depends on how it's presented.
Pretty much, this resolves YES if the official METR long tasks SOTA goes up, even if it's from a model which has already been tested. So if the new suite was used to update the values on the existing METR long tasks graph and as a result one of the models got a horizon greater than 4h 49m, this would resolve YES.