What is Grok 4's performance on METR's task length evaluation?
resolved Jul 31
1.5 to 2 Hours: 99.0% (resolved 100%)
0 to 1.5 Hours: 0.3%
2 to 2.5 Hours: 0.3%
2.5 to 3 Hours: 0.2%
More than 3 Hours: 0.2%

Resolves based on METR's measurement of the duration of tasks that Grok 4 can complete with a 50% success rate.

https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/

Grok 4 Heavy does not count
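
As a rough illustration of how a measured 50% time horizon would map onto the answer options, here is a minimal sketch. The function name `bucket_for_horizon` is hypothetical, and treating each range as a half-open interval (lower bound included, upper bound excluded) is an assumption; the market does not say how a value landing exactly on a boundary would resolve.

```python
# Answer options from this market, as (low, high, label) in hours.
# ASSUMPTION: ranges are half-open, [low, high). The market text does not
# specify how exact boundary values (e.g. exactly 2.0 hours) are handled.
BUCKETS = [
    (0.0, 1.5, "0 to 1.5 Hours"),
    (1.5, 2.0, "1.5 to 2 Hours"),
    (2.0, 2.5, "2 to 2.5 Hours"),
    (2.5, 3.0, "2.5 to 3 Hours"),
]


def bucket_for_horizon(hours: float) -> str:
    """Return the answer option matching a 50% time horizon given in hours."""
    if hours < 0:
        raise ValueError("time horizon cannot be negative")
    for low, high, label in BUCKETS:
        if low <= hours < high:
            return label
    # Anything at or above 3 hours falls into the open-ended last option
    # (again under the half-open assumption above).
    return "More than 3 Hours"


if __name__ == "__main__":
    # Illustrative input only, not a reported measurement.
    print(bucket_for_horizon(1.75))
```

The input is the point estimate only; the confidence interval around a measured horizon can span several of these options.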


🏅 Top traders

Rank  Total profit
1     Ṁ59
2     Ṁ25
3     Ṁ15
4     Ṁ14
5     Ṁ5

WOW

@Bayesian These scores have wide confidence intervals. Predictors on METR task-length evals need to account for that randomness.
