New METR SOTA by end of February, 2026?

Ṁ100Ṁ865

resolved Jan 29

Resolved

YES


Resolves YES if any model surpasses a 50% time horizon of 4h 49m on https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/ by the end of February 2026.

The model must be scored by end of month (not just released by end of month).

Update 2026-01-22 (PST) (AI summary of creator comment): The market resolves YES if the official METR long tasks SOTA goes up (beyond 4h 49m), even if it's from a model which has already been tested. If METR releases a new test suite and uses it to update the values on the existing METR long tasks graph, and as a result one of the models gets a time-horizon greater than 4h 49m, this would resolve YES.
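
The resolution rule above can be sketched as a small check. This is only an illustration of the stated criteria, not anything used by the market creator or METR: the `Score` record, the `resolves_yes` helper, and the choice of the last day of February 2026 as the cutoff date are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Threshold and deadline taken from the market description.
THRESHOLD = timedelta(hours=4, minutes=49)
DEADLINE = date(2026, 2, 28)  # assumed reading of "end of February"


@dataclass(frozen=True)
class Score:
    """Hypothetical record of one published 50% time-horizon value."""
    model: str
    horizon: timedelta
    scored_on: date  # date the score was published, not the model's release date


def resolves_yes(scores):
    """True if any score strictly exceeds the threshold and was published by the deadline.

    Re-scored values for already-tested models count too, per the 2026-01-22 update.
    """
    return any(s.horizon > THRESHOLD and s.scored_on <= DEADLINE for s in scores)
```

Because the check keys on the scoring date, a model released in February but scored in March would not count, while an older model whose published horizon is revised above 4h 49m before the deadline would.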


🏅 Top traders

#	Trader	Total profit
1		Ṁ96
2		Ṁ43
3		Ṁ41


5 Comments


bought Ṁ50 YES

@Philliesdog i'll bet more at 55% if you'd like

bought Ṁ50 NO

doesn't count if they announce their new test suite where they reevaluated Opus 4.5 and it got like 6 hours, right?

@Bayesian that would possibly meet the res criteria and could count, something similar did in:

/jim/new-metr-sota-by-eoy

but it would depend on the specifics

I don't understand how that case is analogous to this one

@Bayesian ok fair i just woke up and misread your message to begin with, anyway if it's a different suite it just depends on how it's presented.

Pretty much, this resolves YES if the official METR long tasks SOTA goes up, even if it's from a model which has already been tested. So if the new suite was used to update the values on the existing METR long tasks graph, and as a result one of the models got a horizon greater than 4h 49m, this would resolve as 'yes'.