This will be evaluated according to the AI Safety Levels (ASL) v1.0 standard defined by Anthropic here. See this market for criteria for determining a system to be ASL-3 for the purposes of this market.
Once a system is determined to be ASL-3 by the criteria above, this will resolve after the first credible report about the amount of training computation (in FLOPs) used to train that system. If there is reasonable disagreement in the comments (in my judgment) about what counts as ‘credible’, I’ll use a one-week Manifold poll (or similar mechanism as needed) to decide.
If there is reasonable disagreement about how to estimate training FLOPs, I will aim to use a method that corresponds as closely as is practical to the one used in the most recent Epoch AI report on training compute as of the resolution date.
Valid options must be either a power of 10 or 3 times a power of 10 (i.e., steps of roughly half an order of magnitude), written in 1eNN or 3eNN format.
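As a rough illustration of the option format, here is a minimal Python sketch. The function names `is_valid_option` and `nearest_option` are hypothetical and not part of the market rules. In particular, snapping an estimate to the closest option on a log scale is only an assumption about how a reported FLOP figure might be matched to an option; the resolution rules above do not specify this.

```python
import math
import re

# Accepts labels such as "1e25" or "3e25": a leading 1 or 3, then "e", then digits.
OPTION_RE = re.compile(r"[13]e\d+")


def is_valid_option(label: str) -> bool:
    """Return True if the label is in 1eNN or 3eNN format."""
    return OPTION_RE.fullmatch(label) is not None


def nearest_option(flops: float) -> str:
    """Return the 1eNN/3eNN label closest to `flops` on a log10 scale.

    Assumption: nearest-on-log-scale matching is illustrative only and is
    not stated in the market's resolution criteria.
    """
    if flops <= 0:
        raise ValueError("flops must be positive")
    log_value = math.log10(flops)
    exponent = math.floor(log_value)
    # Candidate options bracketing the estimate: 1eN, 3eN, and 1e(N+1).
    candidates = [(1, exponent), (3, exponent), (1, exponent + 1)]
    mantissa, exp = min(
        candidates,
        key=lambda c: abs(math.log10(c[0]) + c[1] - log_value),
    )
    return f"{mantissa}e{exp}"
```

For example, under this assumed matching rule, a reported estimate of 2.1e25 FLOPs would map to the option 3e25, since 3e25 is closer than 1e25 on a log scale.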