Released = available to some portion of the public (including a subset of subscribers or a limited number of API developers drawn from the general public). A release solely for safety testing does not count.
New model = either announced by the company as a new model, clearly a distinct model from its numbering/naming, or selectable from some sort of menu as a distinct model. Something like "o1 extra mini" would count because, while it is part of the o1 family, it can be considered a distinct model in this market.
Must be publicly released for the first time between February 1st 00:00am PST and February 28th 11:59pm PST. If it is announced but not yet released to any members of the public, it will not count.
For answers where no specific model type is specified alongside the company, any type of generative AI model will cause that answer to resolve YES.
*OpenAI (other) refers to any model that is not their new flagship model (e.g. GPT-5), o3, a video generator, or an image generator. It could be a derivative of another language model or some other type of model, such as a voice generator.
**Anthropic flagship language model refers to a model comparable to Claude 3.5 or GPT-4o that should outperform Claude 3.5 Sonnet on a majority of performance benchmarks. This should not be a reasoning model.
***Anthropic reasoning model refers to a model that is not considered their everyday task model and is akin to what OpenAI's o1 is to GPT-4o.
****Anthropic (any other) refers to any model that is neither a reasoning model nor their new flagship model. For example, it could be a derivative of an existing language model or a different type of AI model entirely.
Update 2025-02-03 (PST) (AI summary of creator comment): Deep Research o3 Resolution Clarification
o3 from Deep Research: Even though the released agent uses a fine-tuned version of o3 and meets the OpenAI (other) derivative criteria, it will not count because the underlying model (o3) is not directly usable or publicly released.
Public Release Requirement: The model must be directly available to some portion of the public to be considered released.
Dylan Patel repeats claim about Anthropic having a better reasoning model than o3: https://x.com/mark_k/status/1886769660344877073
@Manifold Deep Research doesn't fulfill your requirements to resolve as "other". Other models must either be totally distinct models or be based on an expansion of an already existing and released model. You wrote "Something like "o1 extra mini" would count as while it is part of o1 it can be considered a distinct model in this market." This would make Deep Research resolve as "other" if and only if another version of o3 had already been released, which was not the case. Deep Research is not an expansion of a released model; it is the only model we have right now for o3. Only when OpenAI releases another version of o3 can the market resolve as "other". You cannot have another model if you don't have the base model first.
@SimoneRomeo Does that count as a release? It seems like it's only being used indirectly.
@TimothyJohnson5c16 It's not a version of 4o, not a version of o1, not a version of o3-mini, not a totally distinct model that went through a different training process. It's "powered by a version of OpenAI o3". It counts.
@Manifold resolve OpenAI (other) to YES
https://x.com/markchen90/status/1886341752245915903
@Bayesian "At around 45:30 Dylan Patel says that Anthropic has an unreleased reasoning model that's better than o3:
https://youtu.be/7EH0VjM3dTk?si=DHQJtbBDCphpbkuL" (h/t @MalachiteEagle)
Also, I bid it up to arb with this market
@summer_of_bliss Yeah, but like, obviously Anthropic is sitting on some amazing models; their biz model just doesn't require releasing them
@Arcmage7000 I think it mainly depends on:
Is there a reasoning trace?
Does Anthropic specifically call it a reasoning model with a different approach from previous models?
@Arcmage7000 Anthropic seems to be the one company that doesn't believe in differentiating/marketing models as reasoning vs. non-reasoning, so this might get ambiguous
@summer_of_bliss Perhaps; I don't know much about the difference between reasoning and non-reasoning models. I guess my question would be: will resolution depend more on how it's marketed, or on what's going on under the hood?
Once the sweepstakes answers are a bit more calibrated I will add more sweepcash liquidity!
For traders' convenience, I also plan to write a list of the models most recently released before February for each answer (I will do this tomorrow when I have time; if someone else wants to help, feel free to do so and I will pin it).
I didn't include GPT-5 in the market for this month as we have plenty of other markets for it and it seems very unlikely. I'll add it back in once it seems more plausible.