What is Grok 4's performance on METR's task length evaluation?


resolved Jul 31

Final probabilities by answer (resolved 100% to "1.5 to 2 Hours"):

1.5 to 2 Hours: 99.0%
0 to 1.5 Hours: 0.3%
2 to 2.5 Hours: 0.3%
2.5 to 3 Hours: 0.2%
More than 3 Hours: 0.2%

Resolves based on METR's measurement of the duration of tasks that Grok 4 can complete with a 50% success rate (its 50% time horizon).

https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/

Grok 4 Heavy does not count
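
For readers unfamiliar with the metric: the linked METR post derives the 50% time horizon from a logistic curve fitted to success against the logarithm of human task time. A minimal sketch of reading a horizon off such a curve and mapping it to this market's answers is below. The function names, the toy coefficients, and the handling of exact boundary values (e.g. whether exactly 1.5 hours falls in the lower or upper answer) are assumptions for illustration, not METR's code or this market's stated rules.

```python
import math


def success_probability(minutes, intercept, slope):
    """Logistic success curve over log2 of human task time in minutes.

    `intercept` and `slope` are hypothetical fitted coefficients.
    """
    z = intercept + slope * math.log2(minutes)
    return 1.0 / (1.0 + math.exp(-z))


def horizon_minutes(intercept, slope):
    """Task length at which the curve crosses 50% success.

    Solves intercept + slope * log2(t) = 0 for t.
    """
    if slope >= 0:
        raise ValueError("slope must be negative: success should fall as tasks get longer")
    return 2.0 ** (-intercept / slope)


def market_bucket(hours):
    """Map a horizon in hours to one of this market's answers.

    Assumption: each answer includes its lower bound and excludes its upper bound.
    """
    edges = [
        (1.5, "0 to 1.5 Hours"),
        (2.0, "1.5 to 2 Hours"),
        (2.5, "2 to 2.5 Hours"),
        (3.0, "2.5 to 3 Hours"),
    ]
    for upper, label in edges:
        if hours < upper:
            return label
    return "More than 3 Hours"


# Toy coefficients only, not Grok 4's actual fit.
toy_horizon = horizon_minutes(intercept=6.0, slope=-1.0)
print(toy_horizon, market_bucket(toy_horizon / 60.0))
```

With the toy coefficients above the curve crosses 50% at 64 minutes, which would land in "0 to 1.5 Hours". These numbers say nothing about the actual resolution.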


WOW

@Bayesian These scores have wide confidence intervals. Anyone predicting METR task length evals needs to be aware of the randomness involved.

Grok 4 Heavy market here:

https://manifold.markets/AffineTyped/what-is-grok-4-heavys-performance-o
