Which new AI models will be released in February 2025?
Resolved Mar 1

Resolved YES:
- OpenAI (other)*
- Google
- Meta
- Mistral language model
- XAI language model
- Microsoft
- Anthropic flagship language model**
- Anthropic reasoning language model***

Resolved NO:
- Deepseek language model
- XAI image or video generation
- Amazon language model
- Midjourney
- Anthropic (other)****
- OpenAI image generation
- OpenAI video generation
- Open AI o3

Released = available to some portion of the public (including a subset of subscribers or a limited number of API developers drawn from the general public). A release only for safety testing does not count.

New model = either announced by the company as a new model, clearly a distinct model from its numbering/naming, or selectable as a distinct model from some sort of menu. Something like "o1 extra mini" would count: while it is part of the o1 family, it can be considered a distinct model for this market.

Must be publicly released for the first time between February 1st 00:00am PST and February 28th 11:59pm PST. If it is announced but not yet released to any members of the public, it will not count.

For answers where no specific model type is specified alongside the company, any type of generative AI model will cause that answer to resolve YES.

*OpenAI (other) refers to any model that is not their new flagship model (e.g. GPT-5), o3, a video generator, or an image generator. It could be a derivative of another language model or some other type of model, such as a voice generator.

**Anthropic flagship language model refers to a model comparable to Claude 3.5 or GPT-4o that should outperform Claude 3.5 Sonnet on a majority of performance benchmarks. This should not be a reasoning model.

***Anthropic reasoning model refers to a model that is not considered their everyday-task model and is akin to what OpenAI's o1 is to GPT-4o.

****Anthropic (other) refers to any model that is neither a reasoning model nor their new flagship model. For example, it could be a derivative of an existing language model or a different type of AI model entirely.
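
To make the timing rule concrete, here is a minimal sketch of the release-window check, assuming timezone-aware timestamps and treating "PST" as America/Los_Angeles (UTC-8 in February); the helper name and the example timestamp are illustrative, not part of the market criteria:

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+

# Illustrative only: the market window runs Feb 1 00:00 to Feb 28 23:59 Pacific time.
PT = ZoneInfo("America/Los_Angeles")
WINDOW_START = datetime(2025, 2, 1, 0, 0, 0, tzinfo=PT)
WINDOW_END = datetime(2025, 2, 28, 23, 59, 59, tzinfo=PT)

def released_in_window(release_time: datetime) -> bool:
    """True if a timezone-aware first-public-release timestamp falls inside the window."""
    return WINDOW_START <= release_time.astimezone(PT) <= WINDOW_END

# Example: a release at 18:00 UTC on Feb 24, 2025 is inside the window.
print(released_in_window(datetime(2025, 2, 24, 18, 0, tzinfo=ZoneInfo("UTC"))))  # True
```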


🏅 Top traders (by total profit)

1. Ṁ40,580
2. Ṁ10,588
3. Ṁ8,735
4. Ṁ6,158
5. Ṁ5,600

@JanPydych
Unfortunately, we are in a position where the norms of AI companies are rapidly changing, so I'm going to try to be as fair to the spirit of the question as possible.

Here are my tentative thoughts:

If there is a toggle, then that would be sufficient for both the flagship and reasoning models to resolve YES. At the time this market was created, the norm was for AI companies to label such toggles as distinct models a user can choose from.

If the LLM dynamically decides whether it should reason and there is no toggle, then this is where I would consider things to become a bit more unclear. As Bayesian said, it would probably be fairest to resolve both to YES.

Is there a similar market for March?

@Guuber3 Will make one shortly

@Manifold I think the rest of these can resolve now.

@Manifold resolve no for remaining options

bought Ṁ100 YES

https://x.com/OpenAI/status/1895134318835704245

OpenAI live stream in 4.5 hours!!

@moozooh this doesn't count.

@SirSalty Does this suggest it's using already-released models?

@moozooh Yeah, it's being powered by Claude.

@SirSalty What about the speech adapters?

This market gives a genuine edge over Polymarket.

Their "will Anthropic lead on LMSYS on March 1st" market had Anthropic at 2% a week ago.

Then a model is released: a reasoning model in February, which this market had at 90%. The Polymarket LMSYS market shoots up to about 12% (with a lot of volatility). That's a move from odds of 1:49 to odds of 3:22 after the release, or a Bayes factor of 6.7x.

Maybe Sonnet 3.7 is a little more cracked at coding than people expected, but not "factor of 6.7" better. The release just wasn't priced in. Manifold dub.
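
As a quick check of the odds arithmetic in the comment above, here is a minimal sketch, assuming the quoted Polymarket probabilities of 2% before and 12% after the release (the helper name is illustrative):

```python
# Convert the quoted probabilities to odds and compute the implied Bayes factor.
def to_odds(p: float) -> float:
    """Odds in favour of an event with probability p, i.e. p : (1 - p)."""
    return p / (1 - p)

before, after = 0.02, 0.12  # assumed Polymarket prices before/after the release
bayes_factor = to_odds(after) / to_odds(before)

print(round(to_odds(before), 4))   # 0.0204 ≈ 1:49
print(round(to_odds(after), 4))    # 0.1364 ≈ 3:22
print(round(bayes_factor, 1))      # 6.7
```

This reproduces the 6.7x figure quoted in the comment.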

I disagree a bit. I think the market knew some model was probably coming out; if it was a reasoning model stacked on 3.5, it would not have topped LMSYS, because small reasoning models do terribly there. Coding only matters a bit, same for math; a lot of the points on LMArena are kinda random stuff, including creative writing, which small reasoning models do badly in, etc. I'm personally pretty sure Claude 3.7 isn't gonna get #1, but the fact that you can set it to no reasoning (was at 50% on Manifold before release) and the fact that it's fairly improved from 3.6 is the update, imo.

I am also interested in this question

Resolved both reasoning and flagship to YES.

I realise that the description failed to capture the evolving way companies are releasing new models and apologise that we weren't able to better anticipate this scenario. We will not be differentiating between reasoning and non-reasoning models in future versions of this market.

opened a Ṁ2,000 YES at 96% order

Looks like there’s a toggle

Market description:

**Anthropic flagship language model refers to a model comparable to claude 3.5 or gpt-4o that should outperform claude 3.5 sonnet on a majority of performance benchmarks. This should not be a reasoning model.

Anthropic description of Claude 3.7 Sonnet:

Today, we’re announcing Claude 3.7 Sonnet[1], our most intelligent model to date and the first hybrid reasoning model on the market.

It seems like a stretch to say that the spirit of a question that explicitly requires, "This should not be a reasoning model," is fulfilled by a model that is trained to reason and sometimes reasons just because it sometimes doesn't reason. Humans often don't reason either; are we not reasoning models?

bought Ṁ400 NO

@Jacy agreed, I traded based on the market description earlier today given what the AWS description stated about Claude 3.7. The toggle part being clarified only in comments is frustrating for sure. It's not clear to me that the toggle does anything more than what R1 does, which is start a new text completion with special tokens. Anthropic seems to be suggesting that Claude 3.7 is genuinely a reasoning model, which means this really should resolve NO.

@Soaffine I think @SirSalty's comment was reasonable, particularly in that he qualified it with: "Here are my tentative thoughts," rather than phrasing it as a permanent judgment. It would have been hard to make a durable judgment given the wide range of what a "toggle" can be.

Also note that the toggle, as labeled by Claude, is for "extended thinking," not "reasoning."

Yeah, it's unfortunate. I think the distinction that the market was trying to draw, but used not quite the right words for, was:

Flagship: whether there is a new model that can achieve better results with similarly low latency to the previous models

Reasoning: whether there is a new model that achieves better results by thinking longer (reasoning isn't the best description for this)

bought Ṁ800 NO


bought Ṁ100 NO

@Soaffine I agree

opened a Ṁ4,000 YES at 97% order

@Soaffine I don't see any reasoning traces for Sonnet 3.7

@PaulHabermas

Today, we’re announcing Claude 3.7 Sonnet[1], our most intelligent model to date and the first hybrid reasoning model on the market.

You probably don't see reasoning traces because you're on the free plan. Claude 3.7 Sonnet is capable of "reasoning."
