Will METR's "50%-task-completion time horizon" SotA double in under 7 months?

Ṁ10kṀ92k

resolved Apr 16

Resolved

YES

ALL

METR recently published an analysis called Measuring AI ability to complete long tasks.

The analysis suggests that the 50%-task-completion time horizon for the State of the Art AI agent is doubling every 7 months:

Figure 1 from the METR paper illustrates the historical exponential growth trend and the ~7 month doubling time.

Market details

This market resolves based on published data from METR (Model Evaluation & Threat Research) or a successor organization explicitly continuing this line of research using a demonstrably equivalent methodology.

Metric: The "50%-task-completion time horizon" as defined in METR's paper.
Baseline date (T0): Feb 24th 2025 (the public release of Claude 3.7 Sonnet)
Baseline State-of-the-Art (SotA) Horizon (H0): 40 minutes the value of the trend line at date T0.
Target Horizon (H1): 80 minutes (2 * H0, i.e. a single doubling).
Target Date (T1): The date on which a model is publicly accessible through for which METR (or another organization using equivalent methodology) reports a 50% task completion time horizon of at least H1 (80 minutes) using their established methodology. This date T1 is the model's public access date, not the date of the METR report itself.

Resolution Criterion

This market asks whether the next doubling of the SotA – reaching a horizon of at least 80 minutes (H1) – will occur faster than this historical average. Specifically, will the time elapsed between the effective date of the previous SotA (T0) and the date the new SotA horizon (H1) is achieved (T1) be less than 212 days (i.e. by 24th September 2025).

Market closing date

The market will close on the 24th of October 2025 to provide 30 days for METR (or another org) to run the METR analysis. It will close earlier if METR (or another org using equivalent methodology) confirm that the 80 minute 50%-task-completion time horizon is achieved before this date.

Update 2025-04-16 (PST) (AI summary of creator comment): Important Notes:
- The resolution will accept a METR-evaluated model even if changes in the task set prevent a direct numerical comparison to the original baseline figures.
- A model that exceeds the 80-minute threshold (for example, reaching 90 minutes) using METR’s current evaluation qualifies.
- The public release date of the model is used to determine the timing criteria, so a release date that meets the requirement (as in the case of o3 on 16th April) confirms resolution.

Market context

AI Benchmarks

METR

Get

1,000

to start trading!

🏅 Top traders

#	Trader	Total profit
1		Ṁ7,153
2		Ṁ583
3		Ṁ521
4		Ṁ488
5		Ṁ243

People are also trading

Grok 4.20 METR 50% time horizon

Claude Sonnet 4.6 METR 50% time horizon

R2 / V4-Thinking METR 50% time horizon

Grok 5 METR 50% time horizon

What will be the METR time horizon doubling time in 2026?

Will METR retire the 50% Time Horizon by EOY 2026

40% chance

Will the METR 50% Time Horizon be "ambiguous" at the end of 2026?

62% chance

Best AI time horizon by August 2026, per METR?

Best METR 50% Time Horizon in 2026

Will a multi-agent system have its time horizon evaluated by METR before August 2026?

Sort by:

market for next year: https://manifold.markets/JoshYou/how-autonomous-will-ai-be-in-februa

bought Ṁ20,000 YES

https://metr.github.io/autonomy-evals-guide/openai-o3-report/ o3 is 1.5 hours

Details about METR’s preliminary evaluation of OpenAI’s o3 and o4-mini

Resources for testing dangerous autonomous capabilities in frontier models

@EliLifland this is HCAST only, so not the same set of tasks as the paper, but I think HCAST makes up the majority of tasks in the paper.

I think the question criteria are ambiguous over whether this counts. e.g. it says an evaluation from a non-METR org with the same methodology (in what respects?) counts

@JoshYou Thanks for flagging.

While METR notes the task set changes prevent direct numerical comparison to the exact baseline figures used to establish the original trendline, the assessment and methodology seem sufficiently close to resolve to YES. The report shows a METR-evaluated model (o3) reaching a 90-minute time horizon (and therefore surpassing the 80-minute threshold) using METR's current evaluation. The public release date of o3 (16th April) meets the timing criteria.

@SamuelAlbanie When would the non-preliminary results have come out? Couldn't you have waited for those?

@Panfilo This is a fair question. My interpretation of the METR study is that this is sufficient evidence to resolve.

I'm not sure if there will be non-preliminary results. I'm interpreting their use of the word "preliminary" to mean something like "this is what we could achieve in 3 weeks, be warned that there may be bugs" rather than as indicative that they will continue to work directly on exactly reproducing the full task horizon set up. I could be wrong on this point. There's some ambiguity in the phrase "We are currently working to address some of these limitations, as well as prototyping novel forms of evaluations."

Claude was substantially above trend - wouldn't it make more sense to ask whether the next model will be above the trend line rather than whether it would be double claude?

@Benthamite there are clearly some pros and cons to using Claude 3.7 vs the trend line. However, on balance, I agree with your point - the trend line makes more sense.

I've updated the resolution criterion to reflect this.

@SamuelAlbanie Doesn't using Claude make more sense? Imagine the extreme scenario where Claude was already far enough above the trendline that it fit the "7 month above the trendline" criteria right now. In that case, the question would automatically resolve to yes. But what we are really interested in is whether or not in 7 months we will have something "twice as good" as Claude.

@Soren I feel like if Claude was that high then we would already say that the trend would be broken. And I interpret this market to be trying to measure something like "will the trend be broken?"

I don't know what you mean by that list bit. I'll just point out that Claude is already quite high.

Reasons in favor:
- in the METR report it is indicated that recent progress (~2024-2025) has been significantly faster than over the full timeline (~3 months/doubling instead of 7), likely due (in part) to inference time compute/reasoning

- Sonnet 3.7 is probably a last gen/Ne25-ish era model in terms of compute, but just with SotA post-training/fine-tuning, including reasoning and inference; we will be getting the new Ne26-ish SotA models (OAI, Anth, GDP, maaaybe others) in the coming months (Grok 3 is already out & ~5e26, Gemini 2.5 just released, GPT-5 is likely <7 months out, etc.)

- This whole METR time horizon trend is likely to be super-exponential, 2024-2025 discontinuity in data aside (too long for a fraction of a comment to explain)

Reasons against:

- Unclear robustness/replicability: if you read the report METR has good in-model validation and robustness, but not necessarily out-model, which they also admit explicitly.
- The trend holds over long time horizons with many data point falling above but also below the line of best fit: new model releases this year could easily fall below

- If a model is released that has the required capabilities within the required time frame, it's time horizon capability might not be demonstrated/measured before this market closes (could be a window as low as ~1 month)

Overall, seems likely to be significantly >50%, so great deal atm

People are also trading

Grok 4.20 METR 50% time horizon

Claude Sonnet 4.6 METR 50% time horizon

R2 / V4-Thinking METR 50% time horizon

Grok 5 METR 50% time horizon

What will be the METR time horizon doubling time in 2026?

Will METR retire the 50% Time Horizon by EOY 2026

40% chance

Will the METR 50% Time Horizon be "ambiguous" at the end of 2026?

62% chance

Best AI time horizon by August 2026, per METR?

Best METR 50% Time Horizon in 2026

Will a multi-agent system have its time horizon evaluated by METR before August 2026?

37% chance

🏅 Top traders

People are also trading

People are also trading

Related questions