What is Claude Opus 4's performance on METR's task length evaluation?

Ṁ1.2k Ṁ20k

resolved Jul 1

1h to 1.5h: 100% (resolved; previously 99.0%)
Less than 1 hour: 0.1%
1.5h to 2h: 0.3%
2h to 3h: 0.4%
At least 3h: 0.2%

METR's evaluation measures AI performance by the duration of tasks that models can complete with a 50% success rate. This market predicts Claude Opus 4's time horizon, as reported by METR.

If no score is provided by the end of July 2025, this market resolves as N/A. If there are multiple scores provided by METR, I'll use my best judgment. I won't trade in this market.
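The 50% time horizon described above can be illustrated with a minimal sketch. It assumes success probability is modeled as a logistic function of log2 task length; the function names, the parameters `intercept` and `slope`, and the lower-inclusive bucket boundaries are all hypothetical choices made for illustration, not METR's code or this market's official resolution logic.

```python
import math

# Market answer buckets as (lower bound in minutes, label).
# Lower-inclusive boundaries are an assumption inferred from the labels
# "Less than 1 hour" and "At least 3h".
BUCKETS = [
    (0, "Less than 1 hour"),
    (60, "1h to 1.5h"),
    (90, "1.5h to 2h"),
    (120, "2h to 3h"),
    (180, "At least 3h"),
]


def success_probability(task_minutes, intercept, slope):
    """Hypothetical logistic model: P(success) as a function of log2(task length)."""
    z = intercept + slope * math.log2(task_minutes)
    return 1.0 / (1.0 + math.exp(-z))


def horizon_minutes(intercept, slope):
    """Task length at which the modeled success probability is exactly 50%."""
    # sigmoid(z) = 0.5 exactly when z = 0, so solve intercept + slope * log2(t) = 0.
    return 2.0 ** (-intercept / slope)


def bucket_for(minutes):
    """Return the label of the market answer containing the given time horizon."""
    label = BUCKETS[0][1]
    for lower, name in BUCKETS:
        if minutes >= lower:
            label = name
    return label
```

For example, with made-up parameters `intercept = math.log2(80)` and `slope = -1.0` (chosen only so the curve crosses 50% at 80 minutes, not fitted values), `horizon_minutes` returns roughly 80 and `bucket_for(80)` returns the "1h to 1.5h" answer.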

🏅 Top traders

#	Trader	Total profit
1		Ṁ1,148
2		Ṁ297
3		Ṁ55
4		Ṁ29
5		Ṁ9

80 minutes

Thomas Kwa works at METR. Link to post:

https://www.lesswrong.com/posts/RnKmRusmFpw7MhPYw/cole-wyeth-s-shortform?commentId=cZWcjhHMvEwCwDWHv

Does it count if it gets to use parallel processing?

@PhilosophyBear ~~Yes~~ See comment below.

@Loppukilpailija shouldn't you try to compare like-for-like with METR's previous results?

@JoshYou Huh. I had thought that METR's results already allowed for best-of-N, which would be equivalent/redundant with parallel processing, but apparently I was wrong. I retract my earlier comment and will try to do an apples-to-apples comparison.
