Sudden jump in AI long-horizon capabilities (1-4 hr version)

resolved Sep 3

Resolved YES

After an AI achieves >50% performance on 15-60 minute tasks, will it take less than one year for an AI to achieve >50% performance on 1-4 hour tasks?

We will default to using reporting from OpenAI, METR, or other large AI organizations. If a compelling third-party scaffolding demonstration reports on this first, I will accept it if I am >90% confident that its results are accurate. The results need not use SWE-bench or METR's pre-existing dataset; if, e.g., a model resolves this question on Metaculus, that would obviously be sufficient. Agent/assistant tasks and code tasks both count here: if either shows a jump in under 1 year, this resolves YES. I will not predict on this question.

Background: As of mid-2024, models are often far more efficient than humans at <15 minute tasks. However, for >15 minute tasks models remain highly inconsistent.

https://metr.org/blog/2024-08-06-update-on-evaluations/

https://openai.com/index/introducing-swe-bench-verified/

Update 2025-09-02 (PST) (AI summary of creator comment):
- Metric: Use METR's reported "max time at which a model maintains >=50% success" (50% horizon).
- Thresholds for ranges: Treat 15–60 min as 30 min and 1–4 hr as 2 hr (geometric means).
- Resolution rule: t0 = first model with >=30 min 50% horizon; t1 = first model with >=2 hr 50% horizon. If t1 − t0 < 1 year, resolve YES.
- Creator intends to resolve YES soon using this method; see the linked comment for details.
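
The resolution rule in the update can be sketched in a few lines of Python. This is only an illustration of the rule as summarized above, not the creator's actual procedure: the `HorizonReport` type, the helper names, and the model entries are all hypothetical, and "less than one year" is assumed to mean strictly before the same calendar date one year later.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional


@dataclass(frozen=True)
class HorizonReport:
    """Hypothetical record of one model's reported 50% time horizon."""
    model: str
    released: date
    horizon_minutes: float


def one_year_after(d: date) -> date:
    """Same calendar date one year later (Feb 29 maps to Feb 28)."""
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        return d.replace(year=d.year + 1, day=28)


def first_reaching(reports: Iterable[HorizonReport],
                   threshold_minutes: float) -> Optional[HorizonReport]:
    """Earliest-released model whose 50% horizon meets the threshold."""
    qualifying = [r for r in reports if r.horizon_minutes >= threshold_minutes]
    return min(qualifying, key=lambda r: r.released, default=None)


def resolves_yes(reports: Iterable[HorizonReport],
                 t0_threshold: float = 30.0,
                 t1_threshold: float = 120.0) -> Optional[bool]:
    """True if t1 - t0 < 1 year, False if not, None if a threshold is unmet."""
    reports = list(reports)
    t0 = first_reaching(reports, t0_threshold)
    t1 = first_reaching(reports, t1_threshold)
    if t0 is None or t1 is None:
        return None
    return t1.released < one_year_after(t0.released)


# Made-up entries, purely for illustration (not real METR data).
reports = [
    HorizonReport("model-a", date(2030, 1, 15), 35.0),
    HorizonReport("model-b", date(2030, 6, 1), 70.0),
    HorizonReport("model-c", date(2030, 11, 20), 130.0),
]
print(resolves_yes(reports))  # True
```

Returning `None` when either threshold has not yet been reached keeps "not resolvable yet" distinct from a NO resolution.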

Technology

Technical AI Timelines

OpenAI

AGI

3 Comments

6 Holders

13 Trades

I asked "After an AI achieves >50% performance on 15-60 minute tasks, will it take less than one year for an AI to achieve >50% performance on 1-4 hour tasks?"

METR shifted to reporting the max time at which a model can maintain 50% success. So I think it's reasonable to take the geometric mean of each range and require 50% time horizons of 30 min and 2 h respectively. So t0 is when 3.5 Sonnet v2 hit a 50% horizon of 29 min (Oct '24), and t1 is when GPT-5 cleared 2 h (Aug '25). That puts the gap at <1 yr. I will shortly resolve this YES unless someone objects; if you object, please propose a better methodology.

@traders
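
For anyone checking the arithmetic in the comment above, here is a minimal sketch. It assumes only the range endpoints and the month-level dates given in the comment (Oct '24 and Aug '25); exact release days are not stated, so the gap is counted in whole calendar months. The function names are illustrative.

```python
import math


def geometric_mean(low: float, high: float) -> float:
    """Geometric mean of a range's two endpoints."""
    return math.sqrt(low * high)


def months_between(start: tuple, end: tuple) -> int:
    """Whole calendar months from (year, month) start to (year, month) end."""
    (y0, m0), (y1, m1) = start, end
    return (y1 - y0) * 12 + (m1 - m0)


print(geometric_mean(15, 60))    # 30.0 minutes for the 15-60 minute range
print(geometric_mean(60, 240))   # 120.0 minutes (2 h) for the 1-4 hour range
print(months_between((2024, 10), (2025, 8)))  # 10 months, under one year
```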