Will any open-source model rank higher than GPT-4 on ChatBot Arena in 2024? (according to ELO Rating)
Resolved YES on Apr 9
I'm guessing by open source you mean the weights are freely available and not that the training code and data have to also be open source?
Jan 4

I saw a question titled "GPT4 or better model available for download by EOY 2024?" and liked it. Still, I wanted another one with more objective and straightforward resolution criteria.

We use a loose definition of open-source that encompasses all previous versions of Llama. In essence, if it is theoretically possible for anyone to download the weights and run the model, then it is considered open-source.

This market resolves YES if any open-source model achieves an Elo rating that ranks it higher than GPT-4 on ChatBot Arena at any point in 2024. New versions of GPT-4 do not count: the comparison is made against the earliest GPT-4 version.

FAQ

  • What is ChatBot Arena?

    ChatBot Arena is a benchmark platform for large language models (LLMs) that ranks AI models based on their performance. It uses the Elo rating system, widely adopted in competitive games and sports, to calculate the relative skill levels of AI models. This rating system is particularly effective for pairwise comparisons between models. In ChatBot Arena, users can interact with two anonymous AI models, compare their responses side-by-side, and vote for the one they find better. This crowdsourced approach contributes to the Elo rating of each model.
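
    As a rough illustration of how a single pairwise vote could move two ratings, here is a minimal sketch of the textbook Elo update. The K-factor of 32 and the 400-point scale are conventional defaults assumed for illustration, and the function names are hypothetical; the leaderboard's actual computation may differ.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model (400-point scale assumed)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return new (rating_a, rating_b) after one vote.

    score_a is 1.0 if A was preferred, 0.0 if B was preferred, 0.5 for a tie.
    k = 32 is an assumed K-factor, not a value taken from ChatBot Arena.
    """
    e_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - e_a)
    # The update is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Two equally rated models; a user votes for model A.
    print(update(1000.0, 1000.0, 1.0))
```

    Because each vote shifts both ratings, every model's score drifts as more comparisons arrive, which is why a ranking-based comparison is more stable than a fixed numeric threshold.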

This market believes there is a 97% chance someone will release an open-source model that scores higher than GPT-4 on LMSYS (I agree 😋)

@Soli command-r-plus already did

@notune then i should resolve the market 😅 - @traders any objections here?

@Soli think this should resolve as YES. Seems pretty straightforward

i am not endorsing or confirming this statement but i will just leave it here

@Soli AlpacaEval is extremely easy to game (easier to game than Chatbot Arena), mostly via response length. See "AlpacaEval limitations" here: https://tatsu-lab.github.io/alpaca_eval/

As new models are added and more comparisons are made, won't the ELO scores shift? Does this market resolve on the absolute score 1158 or whatever GPT-4's score is at the time?

@Vergissfunktor very good question - you are right that the ELO rating can and will move for the earliest version of GPT-4 so using a fixed ELO rating defeats the purpose.

@Soli can you clarify what you are using instead, if not a fixed ELO rating? Do you mean that an open-weights model needs to be above a version of GPT-4 on the LMSYS Chatbot Arena Leaderboard?

@Jacy exactly - do you have suggestions for how i can modify the description to make this clear? I thought it already was 😅

Edit: probably using the words rank higher

@Jacy done

@Soli Thanks! I believe all the criteria you have stated in the description and comments are entailed in this statement:

This market resolves yes if any open source model (i.e. anyone with sufficient hardware and domain knowledge can run the model locally) is ranked with higher elo than [the simultaneous ranking of] the earliest version of GPT-4, which was publicly known to exist as of market creation, on the LMSYS Chatbot Arena Leaderboard at any point in 2024. Otherwise, it resolves no.

To be super clear, you could include that bracketed phrase or add something like, "This means that, at some point in 2024, both an open source model and a version of GPT-4 must appear on the leaderboard simultaneously, and the elo of the open source model must be higher."

Edit: "the earliest version" previously said "any version," but I see the market resolution criteria clearly say earliest version.
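
For what it's worth, the criterion above could be checked mechanically against a leaderboard snapshot. Below is a minimal sketch, assuming the snapshot is a plain mapping from model name to rating; the function name, the `gpt-4-earliest` label, and `hypothetical-open-model` with its rating are placeholders for illustration, while 1158, 1150, and 1123 are the figures quoted in this thread.

```python
def resolves_yes(snapshot: dict, open_models: set, baseline: str) -> bool:
    """True if any open-weights model outranks the baseline in the same snapshot.

    snapshot    -- {model name: rating} taken at a single point in time
    open_models -- names of models whose weights anyone can download and run
    baseline    -- name of the earliest GPT-4 version on the leaderboard
    """
    if baseline not in snapshot:
        # Both models must appear on the leaderboard simultaneously.
        return False
    base_rating = snapshot[baseline]
    return any(
        name in snapshot and snapshot[name] > base_rating
        for name in open_models
    )


if __name__ == "__main__":
    snapshot = {
        "gpt-4-earliest": 1158,   # placeholder label for the baseline
        "mistral-medium": 1150,   # not open weights, so not in open_models
        "mixtral": 1123,
    }
    print(resolves_yes(snapshot, {"mixtral"}, "gpt-4-earliest"))

    # A made-up later snapshot in which an open model has overtaken the baseline.
    later = dict(snapshot, **{"hypothetical-open-model": 1170})
    print(resolves_yes(later, {"mixtral", "hypothetical-open-model"}, "gpt-4-earliest"))
```

The check compares ratings within one snapshot, so it is unaffected by how the baseline's absolute score drifts over time.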

@Soli I'd recommend dropping the number 1158 if it's not the absolute measure and just naming the model to compare against in the description

Do you mean Chatbot Arena - LMSYS Org or something else? Because Mistral has already scored 1150 and is open source...

@Snarflak 1150 is smaller than 1158

@Seeker 😂

@Snarflak that's Mistral-medium, which is not (yet?) open source. Mixtral scores 1123

Are people buying NO even aware of the highest score achieved by an open-source model right now? I am surprised by everyone buying NO here, but please continue.
