When will a Google AI system achieve an "Autonomy level 1" Machine Learning R&D risk level? [metaculus]

Before H1 2026: 11%
Before H2 2026: 24%
Before H1 2027: 37%
Before H2 2027: 45%
Before H1 2028: 55%
Before H2 2028: 55%
Before H1 2029: 60%
Before H2 2029: 66%
Before H1 2030: 72%
Before H2 2030: 80%
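Because every answer is phrased as "Before ...", the percentages above read as cumulative. A rough sketch of how they could be differenced into the probability mass the market places between consecutive cutoffs follows. It assumes each percentage belongs to the cutoff label listed with it, and it treats the answers as one consistent distribution, which the market does not guarantee since each answer may trade independently. The function name `interval_probabilities` is purely illustrative.

```python
# Cumulative market probabilities in percent, copied from the answer list above.
# Assumption: each percentage is paired with the cutoff label shown alongside it.
cumulative = [
    ("Before H1 2026", 11),
    ("Before H2 2026", 24),
    ("Before H1 2027", 37),
    ("Before H2 2027", 45),
    ("Before H1 2028", 55),
    ("Before H2 2028", 55),
    ("Before H1 2029", 60),
    ("Before H2 2029", 66),
    ("Before H1 2030", 72),
    ("Before H2 2030", 80),
]


def interval_probabilities(cumulative):
    """Difference consecutive cumulative percentages.

    Returns a list of (cutoff label, percentage points added at that cutoff)
    and the remaining mass beyond the last cutoff.
    """
    intervals = []
    previous = 0
    for label, pct in cumulative:
        intervals.append((label, pct - previous))
        previous = pct
    remainder = 100 - previous
    return intervals, remainder


intervals, remainder = interval_probabilities(cumulative)
for label, mass in intervals:
    print(f"{label}: +{mass} points over the previous cutoff")
print(f"Beyond the last cutoff (or never): {remainder} points")
```

Read this way, the two 2028 cutoffs carry the same cumulative probability, so the market assigns no additional mass between them, and 20 points remain beyond the final cutoff.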

Resolution is based on what Google reports. The criteria closely track the intuitive spirit of the question; for more detail, see the excellent Metaculus question that is the source of this market:

https://www.metaculus.com/c/risk/38789/dates-that-google-reports-an-ai-reached-these-machine-learning-rd-risk-levels/

Background info (from Metaculus):

Google’s updated Frontier Safety Framework, dated February 4, 2025, defines the Machine Learning R&D risk category, which is aimed at tracking:

Risks of the misuse of models capable of accelerating the rate of AI progress, the result of which could be the unsafe attainment or proliferation of other powerful AI models.

This includes two kinds of thresholds: autonomy and uplift.

The autonomy 1 threshold is defined as:

Can fully automate the AI R&D pipeline at a competitive cost. The cost comparison is relative to humans augmented by AI tools.

while the uplift 1 threshold is defined as:

Can or has been used to accelerate AI development, resulting in AI progress substantially accelerating (e.g. 2x) from 2020-2024 rates.

In the Gemini 2.5 Pro model card, Google reports that Gemini 2.5 Pro Experimental (03-25) did not reach either of these Machine Learning R&D risk levels, with the following justifications. Autonomy level 1:

Gemini 2.5 Pro Preview is still far from matching human performance on RE-Bench, which itself only covers a subset of the skills necessary to automate ML R&D.

Uplift level 1:

Likely no. We do not yet have dedicated evaluations for this CCL, but at current capability levels, RE-Bench can be used to rule out the CCL based on an inability argument: given Gemini 2.5 Pro Preview's poor median performance on RE-Bench relative to experts, the model likely lacks the necessary capabilities to automate or significantly uplift any significant fraction of the research process.