METR recently published an analysis called "Measuring AI Ability to Complete Long Tasks".
The analysis suggests that the 50%-task-completion time horizon of state-of-the-art (SotA) AI agents is doubling every 7 months.

Figure 1 from the METR paper illustrates the historical exponential growth trend and the ~7 month doubling time.
Market details
This market resolves based on published data from METR (Model Evaluation & Threat Research) or a successor organization explicitly continuing this line of research using a demonstrably equivalent methodology.
Metric: The "50%-task-completion time horizon" as defined in METR's paper.
Baseline date (T0): Feb 24th 2025 (the public release of Claude 3.7 Sonnet)
Baseline State-of-the-Art (SotA) Horizon (H0): 40 minutes, the value of the trend line at date T0.
Target Horizon (H1): 80 minutes (2 * H0, i.e. a single doubling).
Target Date (T1): The public access date of a model for which METR (or another organization using equivalent methodology) reports a 50%-task-completion time horizon of at least H1 (80 minutes) using their established methodology. T1 is the model's public access date, not the date of the METR report itself.
Resolution Criterion
This market asks whether the next doubling of the SotA horizon – reaching at least 80 minutes (H1) – will occur faster than the historical average of one doubling every 7 months. Specifically, will the time elapsed between the effective date of the previous SotA (T0) and the date the new SotA horizon (H1) is achieved (T1) be less than 212 days (i.e. by 24th September 2025)?
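As a rough illustration of the date arithmetic behind this criterion, here is a minimal Python sketch. The helper name `resolves_yes` is hypothetical and not part of the market rules. It also assumes a strict reading of "less than 212 days", under which a public access date of exactly 24th September 2025 (day 212) would not count; the market creator's reading of that boundary case may differ.

```python
from datetime import date, timedelta

T0 = date(2025, 2, 24)   # baseline date from the market text
H1_MINUTES = 80          # target horizon from the market text
WINDOW_DAYS = 212        # ~7 months, from the market text


def resolves_yes(public_access_date: date, horizon_minutes: float) -> bool:
    """Hypothetical helper: True if the reported horizon meets H1 and the
    model became public fewer than WINDOW_DAYS days after T0.
    Assumes the strict reading of "less than 212 days"."""
    elapsed_days = (public_access_date - T0).days
    return horizon_minutes >= H1_MINUTES and elapsed_days < WINDOW_DAYS


# T0 plus 212 days is the date quoted in the criterion.
print(T0 + timedelta(days=WINDOW_DAYS))  # 2025-09-24
```

Note that the check uses the model's public access date, not the date of the report, in line with the definition of T1.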
Market closing date
The market will close on the 24th of October 2025 to provide 30 days for METR (or another org) to run the METR analysis. It will close earlier if METR (or another org using equivalent methodology) confirms that the 80-minute 50%-task-completion time horizon has been achieved before this date.
@Benthamite there are clearly some pros and cons to using Claude 3.7 vs the trend line. However, on balance, I agree with your point - the trend line makes more sense.
I've updated the resolution criterion to reflect this.
Reasons in favor:
- in the METR report it is indicated that recent progress (~2024-2025) has been significantly faster than over the full timeline (~3 months/doubling instead of 7), likely due (in part) to inference time compute/reasoning
- Sonnet 3.7 is probably a last gen/Ne25-ish era model in terms of compute, but just with SotA post-training/fine-tuning, including reasoning and inference; we will be getting the new Ne26-ish SotA models (OAI, Anth, GDM, maaaybe others) in the coming months (Grok 3 is already out & ~5e26, Gemini 2.5 just released, GPT-5 is likely <7 months out, etc.)
- This whole METR time horizon trend is likely to be super-exponential, 2024-2025 discontinuity in data aside (too long for a fraction of a comment to explain)
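To make the gap between the two doubling rates concrete, here is a back-of-the-envelope Python sketch. It assumes pure exponential growth starting from the market's trend-line value (40 minutes at T0) and an average month length of 212/7 days, chosen only so that 7 months matches the market's 212-day window. The function name `date_reaching` is hypothetical.

```python
import math
from datetime import date, timedelta

T0 = date(2025, 2, 24)     # baseline date from the market text
H0, H1 = 40.0, 80.0        # baseline and target horizons, in minutes
DAYS_PER_MONTH = 212 / 7   # assumption: matches the market's 212 days for 7 months


def date_reaching(target_minutes: float, doubling_months: float) -> date:
    """Hypothetical helper: date at which H0 * 2**(t / doubling time)
    reaches target_minutes, assuming pure exponential growth from T0."""
    doublings = math.log2(target_minutes / H0)
    days = doublings * doubling_months * DAYS_PER_MONTH
    return T0 + timedelta(days=round(days))


for months in (3, 7):
    print(f"{months} months/doubling -> {date_reaching(H1, months)}")
```

Under these assumptions, a 3-month doubling time would put the 80-minute mark in late May 2025, while the 7-month trend lands on the market's 24th September threshold.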
Reasons against:
- Unclear robustness/replicability: if you read the report, METR has good in-model validation and robustness, but not necessarily out-of-model, which they also admit explicitly.
- The trend holds over long time horizons with many data points falling above but also below the line of best fit: new model releases this year could easily fall below
- If a model is released that has the required capabilities within the required time frame, its time horizon capability might not be demonstrated/measured before this market closes (could be a window as low as ~1 month)
Overall, seems likely to be significantly >50%, so great deal atm