
After an AI achieves >50% performance on 15-60 minute tasks, will it take less than one year for an AI to achieve >50% performance on 1-4 hour tasks?
By default, I will rely on reporting from OpenAI, METR, or other large AI organizations. If a compelling third-party scaffolding demonstration reports on this first, I will accept it if I am >90% confident that its results are accurate. The results need not use SWE-bench or METR's pre-existing dataset; for example, a model resolving this question on Metaculus would obviously be sufficient. Agent/assistant tasks and code tasks both count here: if either shows a sub-1-year jump, this resolves YES. I will not predict on this question.
Background: As of mid-2024, models are often far more efficient than humans at <15 minute tasks. However, for >15 minute tasks models remain highly inconsistent.

https://metr.org/blog/2024-08-06-update-on-evaluations/

https://openai.com/index/introducing-swe-bench-verified/
Update 2025-09-02 (PST) (AI summary of creator comment):
- Metric: Use METR's reported "max time at which a model maintains >=50% success" (50% horizon).
- Thresholds for ranges: Treat 15–60 min as 30 min and 1–4 hr as 2 hr (geometric means).
- Resolution rule: t0 = first model with >=30 min 50% horizon; t1 = first model with >=2 hr 50% horizon. If t1 − t0 < 1 year, resolve YES.
- Creator intends to resolve YES soon using this method; see the linked comment for details.
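
As a quick check on the thresholds in the update above, the geometric mean of each range's endpoints can be computed directly. This is only an illustrative sketch; the helper name `geometric_mean` and the variable names are made up here and are not part of METR's or the creator's tooling.

```python
import math


def geometric_mean(low, high):
    """Geometric mean of the two endpoints of a range."""
    return math.sqrt(low * high)


# Both ranges expressed in minutes (1-4 hours = 60-240 minutes).
t0_threshold_min = geometric_mean(15, 60)
t1_threshold_min = geometric_mean(60, 240)

print(t0_threshold_min)       # threshold for the 15-60 minute range, in minutes
print(t1_threshold_min / 60)  # threshold for the 1-4 hour range, in hours
```

The geometric mean is the midpoint of each range on a log scale, which is why it gives 30 minutes rather than the arithmetic midpoint of 37.5 minutes for the first range.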
I asked "After an AI achieves >50% performance on 15-60 minute tasks, will it take less than one year for an AI to achieve >50% performance on 1-4 hour tasks?"
METR shifted to reporting the max time at which a model can maintain 50% success. So I think it's reasonable to take the geometric mean of each range and require 50% time horizons of 30 min and 2h respectively. So, t0 is when Claude 3.5 Sonnet v2 hit a 50% horizon of 29 min (Oct '24), which I am treating as meeting the 30 min threshold. t1 is when GPT-5 cleared 2h (Aug '25). That puts this at <1 yr. I will shortly resolve this YES unless someone objects. If you object, please propose a better methodology.
@traders
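
For anyone who wants to check the date arithmetic in the comment above, here is a rough sketch. The comment only gives months (Oct '24 and Aug '25), so the day of month is an assumption (set to the 1st for both), and the helper `within_one_year` is a hypothetical name used only for illustration.

```python
from datetime import date

# Month-level dates from the comment; the day of month is an assumption.
t0 = date(2024, 10, 1)  # first model treated as reaching the 30 min 50% horizon
t1 = date(2025, 8, 1)   # first model clearing the 2 hr 50% horizon


def within_one_year(start, end):
    """True if `end` falls on or after `start` and strictly before
    the first anniversary of `start`.

    Note: this simple version would fail for a Feb 29 start date.
    """
    anniversary = start.replace(year=start.year + 1)
    return start <= end < anniversary


resolves_yes = within_one_year(t0, t1)
print(resolves_yes)
```

Because the gap is roughly ten months, the result does not depend on which days of the month are assumed.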