Resolution is as reported by Google. The criteria closely track the intuitive spirit of the question; for more detail, see the excellent Metaculus question that is the source of this market:
Background info (from Metaculus):
Google’s updated Frontier Safety Framework, published February 4th, 2025, defines the Machine Learning R&D risk category, aimed at tracking:
Risks of the misuse of models capable of accelerating the rate of AI progress, the result of which could be the unsafe attainment or proliferation of other powerful AI models.
This includes two kinds of thresholds: autonomy and uplift.
The autonomy 1 threshold is defined as:
Can fully automate the AI R&D pipeline at a competitive cost. The cost comparison is relative to humans augmented by AI tools.
while the uplift 1 threshold is defined as:
Can or has been used to accelerate AI development, resulting in AI progress substantially accelerating (e.g. 2x) from 2020-2024 rates.
In the Gemini 2.5 Pro model card, Google states that Gemini 2.5 Pro Experimental (03-25) did not reach either of these Machine Learning R&D risk levels, with the following justifications. For autonomy level 1:
Gemini 2.5 Pro Preview is still far from matching human performance on RE-Bench, which itself only covers a subset of the skills necessary to automate ML R&D.
For uplift level 1:
Likely no. We do not yet have dedicated evaluations for this CCL, but at current capability levels, RE-Bench can be used to rule out the CCL based on an inability argument: given Gemini 2.5 Pro Preview's poor median performance on RE-Bench relative to experts, the model likely lacks the necessary capabilities to automate or significantly uplift any significant fraction of the research process.
The previous version of Google’s Frontier Safety Framework had a somewhat similar risk category, also called Machine Learning R&D.