Sudden jump in AI long-horizon capabilities (1-4 hr version)
Resolved YES on Sep 3

After an AI achieves >50% performance on 15-60 minute tasks, will it take less than one year for AI to achieve >50% performance on 1-4 hour tasks?

We will default to using reporting from OpenAI, METR, or other large AI organizations. If a compelling third-party scaffolding demonstration reports on this first, I will accept it if I am >90% confident that its results are accurate. The results need not use SWE-bench or METR's pre-existing dataset; for example, a model resolving this question on Metaculus would obviously be sufficient. Agent/assistant tasks and code tasks both count: if either shows a sub-1-year jump, this resolves YES. I will not predict on this question.

Background: As of mid-2024, models are often far more efficient than humans at <15 minute tasks. However, for >15 minute tasks models remain highly inconsistent.

https://metr.org/blog/2024-08-06-update-on-evaluations/

https://openai.com/index/introducing-swe-bench-verified/

  • Update 2025-09-02 (PST) (AI summary of creator comment):

    • Metric: Use METR's reported "max time at which a model maintains >=50% success" (50% horizon).

    • Thresholds for ranges: Treat 15–60 min as 30 min and 1–4 hr as 2 hr (geometric means).

    • Resolution rule: t0 = first model with >=30 min 50% horizon; t1 = first model with >=2 hr 50% horizon. If t1 − t0 < 1 year, resolve YES.

    • Creator intends to resolve YES soon using this method; see the linked comment for details.
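The resolution rule above can be sketched in a few lines of Python. This is only an illustration: the function names, the `(release_date, horizon_minutes)` record shape, and the reading of "1 year" as 365 days are assumptions, not anything the creator or METR specified.

```python
from datetime import date
from math import sqrt


def geometric_mean(low, high):
    """Geometric mean of the two endpoints of a range."""
    return sqrt(low * high)


# Thresholds in minutes: 15-60 min -> 30 min, 1-4 hr (60-240 min) -> 2 hr.
T0_THRESHOLD_MIN = geometric_mean(15, 60)
T1_THRESHOLD_MIN = geometric_mean(60, 240)


def first_crossing(records, threshold_min):
    """Earliest date among records whose 50% horizon is >= threshold_min.

    `records` is a hypothetical list of (release_date, horizon_minutes) pairs.
    Returns None if no record reaches the threshold.
    """
    dates = [d for d, horizon in records if horizon >= threshold_min]
    return min(dates) if dates else None


def resolves_yes(records, year_days=365):
    """True if t1 - t0 is under one year (assumed here to mean 365 days)."""
    t0 = first_crossing(records, T0_THRESHOLD_MIN)
    t1 = first_crossing(records, T1_THRESHOLD_MIN)
    if t0 is None or t1 is None:
        return False
    return (t1 - t0).days < year_days
```

Called with placeholder data such as `resolves_yes([(date(2030, 1, 1), 35.0), (date(2030, 9, 1), 130.0)])`, the function compares the first date at or above 30 minutes with the first date at or above 120 minutes.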


🏅 Top traders

Rank  Total profit
1     Ṁ203
2     Ṁ179
3     Ṁ19
4     Ṁ13
5     Ṁ11

I asked "After an AI achieves >50% performance on 15-60 minute tasks, will it take less than one year for AI to achieve >50% performance on 1-4 hour tasks?"

METR shifted to reporting the max time at which a model can maintain 50% success, so I think it's reasonable to take the geometric mean of each range and require 50% time horizons of 30 min and 2 h respectively. So t0 is when 3.5 Sonnet v2 hit a 50% horizon of 29 min (Oct '24), and t1 is when GPT-5 cleared 2 h (Aug '25). That puts this at <1 yr. I will shortly resolve this YES unless someone objects. If you object, please propose a better methodology.
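As a quick check of the arithmetic in the comment above, the gap can be counted in whole calendar months. This is a rough sketch: it uses only the month-level dates given in the comment (Oct '24 and Aug '25), and `months_between` is a hypothetical helper, not part of any resolution tooling.

```python
def months_between(start, end):
    """Whole calendar months from start to end, each given as (year, month)."""
    (y0, m0), (y1, m1) = start, end
    return (y1 - y0) * 12 + (m1 - m0)


t0 = (2024, 10)  # 3.5 Sonnet v2, 50% horizon of 29 min (per the comment)
t1 = (2025, 8)   # GPT-5, 50% horizon above 2 h (per the comment)

gap = months_between(t0, t1)
print(gap, gap < 12)  # 10 True
```

At month granularity the gap is 10 months, which is under the 12-month limit whichever days of those months are used.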

@traders

I'm open to suggestions on this question's resolution criteria for a month, and then I'll try to keep revisions minimal afterwards.
