OpenAI released its o1 model to much fanfare.
https://deepnewz.com/ai/openai-unveils-o1-ai-model-advanced-reasoning-fact-checking-phd-level
LMSys has already announced that these models will be scored in the arena and will soon appear on the leaderboard.
The current LMSys leaderboard is headed by GPT-4o-08-08, followed by Gemini and Grok.
https://lmarena.ai/?leaderboard
Will OpenAI's o1 get to #1 on this leaderboard by October 1st?
Several caveats since LMSys is weird...
We will look at whatever is posted on October 1st.
If an update happens that day, we will count it [so the market resolves October 2nd].
We go by Eastern Time, not the "updated on" time shown on the LMSys site -- which will often lag by 7+ days.
We will use any OpenAI o1 style model and take the best result
This will probably be "o1-preview" but if they post a better model that will also count
If no o1 model is released by October 1st we will wait until one is posted and extend the market.
As usual, statistical ties count! The market is "will o1 (or any best OpenAI model) be first or tied for first on LMSys?"
But in most scenarios we will resolve this on October 2nd.
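To make the "statistical tie" criterion concrete: arena-style leaderboards report each model's score with a confidence interval, and two models count as tied when their intervals overlap. A minimal sketch (the function name and all numbers here are hypothetical, not LMArena's actual API or scores):

```python
# "Statistical tie" check as used informally for arena-style leaderboards:
# two models are tied if their score confidence intervals overlap.
def is_statistical_tie(ci_a, ci_b):
    """ci_a, ci_b: (lower, upper) confidence-interval bounds for each model."""
    lo_a, hi_a = ci_a
    lo_b, hi_b = ci_b
    return hi_a >= lo_b and hi_b >= lo_a

# Hypothetical example: o1 at 1290 (CI 1282-1298) vs the leader at
# 1295 (CI 1289-1301) -> intervals overlap, so o1 would resolve as tied-for-first.
print(is_statistical_tie((1282, 1298), (1289, 1301)))  # True
print(is_statistical_tie((1195, 1205), (1285, 1295)))  # False -- clearly behind
```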
We also have a market betting on the model's Elo rating.
https://manifold.markets/Moscow25/what-elo-will-openais-o1-model-get
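For intuition about the ratings that Elo market is over: here is the textbook Elo update rule (a sketch only -- LMArena's actual methodology is a Bradley-Terry fit with confidence intervals, and the K-factor of 32 below is a hypothetical choice):

```python
# Classic Elo update: expected score from the rating gap, then move both
# ratings by K times the surprise (actual result minus expected result).
def elo_update(rating_a, rating_b, score_a, k=32):
    """score_a: 1.0 if A wins, 0.5 for a tie, 0.0 if A loses."""
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# Equal ratings, A wins: expected score was 0.5, so A gains 16 and B loses 16.
print(elo_update(1300, 1300, 1.0))  # (1316.0, 1284.0)
```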
@yetforever ~10% that either Gemini 2 or Claude 3.5 Opus comes out by September 25 imo, roughly when they'd need to to show up on the leaderboard.
12 days is meaningfully long in AI. E.g. most 100-day periods this year have had significant releases.
@yetforever I think Gemini could beat this Elo, but only if they have been working on a similar approach for a while. Same for Claude.
The idea has been out there.... but OpenAI clearly beat everyone to doing it first.
But I agree there's some risk!
I don't think any model like this will come out in September. But I think you can probably get a better arena score without o1's specific type of training just by scaling more, especially since o1 probably wasn't optimized specifically for the highest chatbot arena score.
@StellarSerene An underrated reason for it not reaching #1 is how much response time factors into preference.
@Moscow25 FWIW I, as with previous markets, think this could be made clearer in the title. It's very reasonable as-is to interpret the current one as "no ranking --> NO" if someone reads the description sloppily. But it's clearly there, so not critical. :)
@HenriThunberg yeah I get it
Manifold limits title length, and longer titles generally don't do as well 🤷
I'm very clear on the extra details (many are not), though I find that no matter what, some non-trivial % of people will complain.
@HenriThunberg not at all! You're cool and I like the input.
I've tried it different ways and on this point... have concluded that the title needs to be simple, with all details in the story below -- as clear as possible but not to detract from the theme.
That works best for most people and for me personally. Some will always disagree.
You can't please everyone.
When markets don't describe the details... it seems lazy and drives me crazy! But then some have explanations so long that you don't see them below the ... either 🤷