What is Grok 4's performance on METR's task length evaluation?
125 · Ṁ1441 · Resolved Jul 31
| Answer | Probability |
|---|---|
| 1.5 to 2 Hours | 100% (99.0%) |
| 0 to 1.5 Hours | 0.3% |
| 2 to 2.5 Hours | 0.3% |
| 2.5 to 3 Hours | 0.2% |
| More than 3 Hours | 0.2% |
Resolves based on METR's measurement of the duration of tasks that Grok 4 can complete with a 50% success rate.
https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/
Grok 4 Heavy does not count.
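
For readers unfamiliar with the metric, here is a rough sketch of how a 50% time horizon can be estimated: fit a logistic curve of success against the log of task length, then solve for the length at which the fitted success probability is 0.5. This is a simplified illustration, not METR's actual pipeline (see the linked post for that). The function name `fit_time_horizon` and the task data are made up for the example.

```python
import math

def fit_time_horizon(task_minutes, successes, steps=5000, lr=0.1):
    """Fit P(success) = sigmoid(a + b * x), where x is centred log2(minutes),
    by plain gradient ascent on the log-likelihood. Return the task length
    in minutes at which the fitted P(success) equals 0.5."""
    xs = [math.log2(m) for m in task_minutes]
    mean_x = sum(xs) / len(xs)
    xs_c = [x - mean_x for x in xs]  # centring keeps the optimisation stable
    a, b = 0.0, 0.0
    n = len(xs_c)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs_c, successes):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += y - p
            grad_b += (y - p) * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    if b >= 0:
        raise ValueError("success rate does not fall with task length")
    # sigmoid(a + b * x) = 0.5  =>  x = -a / b (in centred log2 units)
    return 2 ** (mean_x - a / b)

# Synthetic data (an assumption for illustration): 4 attempts per task length,
# with fewer successes as tasks get longer.
attempts = {30: [1, 1, 1, 1], 60: [1, 1, 1, 0], 120: [1, 1, 0, 0],
            240: [1, 0, 0, 0], 480: [0, 0, 0, 0]}
task_minutes = [m for m, ys in attempts.items() for _ in ys]
successes = [y for ys in attempts.values() for y in ys]

horizon = fit_time_horizon(task_minutes, successes)
print(f"Estimated 50% time horizon: {horizon:.0f} minutes")
```

With this symmetric toy data the estimate lands at about 120 minutes, which would fall on the boundary between two answer buckets. Real estimates come with confidence intervals that can span several buckets.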
This question is managed and resolved by Manifold.
🏅 Top traders
| # | Name | Total profit |
|---|---|---|
| 1 | | Ṁ59 |
| 2 | | Ṁ25 |
| 3 | | Ṁ15 |
| 4 | | Ṁ14 |
| 5 | | Ṁ5 |
@Bayesian These scores have wide confidence intervals. Predictors on METR task length evals need to be aware of the randomness.