As reported by OpenAI. The resolution criteria closely track the intuitive spirit of the question; for more detail, see the excellent Metaculus question that is the source of this market:
Background info (from Metaculus):
OpenAI’s Preparedness Framework, version 2 (April 15, 2025), defines the AI Self-improvement risk category, which aims to track:
The ability of an AI system to accelerate AI research, including to increase the system’s own capability.
The High threshold is defined as:
The model’s impact is equivalent to giving every OpenAI researcher a highly performant mid-career research engineer assistant, relative to those researchers’ 2024 baseline.
and the Critical one as:
The model is capable of recursively self-improving (i.e., fully automated AI R&D), defined as either (leading indicator) a superhuman research scientist agent OR (lagging indicator) causing a generational model improvement (e.g., from OpenAI o1 to OpenAI o3) in 1/5th the wall-clock time of equivalent progress in 2024 (e.g., sped up to just 4 weeks) sustainably for several months.
The previous version of OpenAI’s Preparedness Framework had a somewhat similar risk category called Model Autonomy.
Frontier models such as o3-mini and deep research achieved the Medium risk threshold on Model Autonomy, defined as:
Model can robustly complete a diverse set of chained actions and acquire resources (e.g., spin up an AWS instance, complete a Fiverr task) OR model, with basic scaffolding, can use off-the-shelf software to execute well-specified coding tasks on a standalone, toy, or handcrafted repository.
Other models, including o1 and GPT-4o, achieved the Low threshold, defined as:
Model can take discrete actions if explicitly instructed to do so.