An AI model will rate higher than GPT-4-Turbo on the Chatbot Arena Leaderboard (OpenAI model or other) before May 2024
Resolved YES on Mar 27
Announcement of "Gemini 1.5" model with up to a 1M context: https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024/
Feb 15

tl;dr

To resolve "YES", a model not based on GPT-4 (see the "Defining GPT-4-Turbo Models" section below for this definition) must rate higher than GPT-4-Turbo on the "Chatbot Arena Leaderboard" on HuggingFace by April 30th 2024 (23:59 PDT).

This is in a similar spirit to two related markets, but this market considers any model (OpenAI, Google, open-source, etc.) compared to the current (2023-12-20) GPT-4-Turbo model.

Background 

Evaluating LLM-based chat assistants directly is a challenge, and there are multiple methods for doing it. One approach, the "Chatbot Arena", uses human preferences, as presented in this paper & blog post by the Large Model Systems Organization (LMSYS) team. The team maintains a "Chatbot Arena Leaderboard" on HuggingFace that turns these human preferences into an Elo rating used to rank the different LLM-based chatbots.
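As a rough illustration of how pairwise human preferences can become a rating, here is a minimal sketch of the textbook online Elo update. This is an assumption for illustration only: the leaderboard's own figures refer to MLE-based Elo estimates with bootstrapping, so its actual computation may differ, and the function names and the `k=32` step size below are hypothetical choices, not taken from LMSYS.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model (base 10, scale 400)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one battle.

    score_a is 1.0 if A was preferred, 0.0 if B was preferred, 0.5 for a tie.
    k is an assumed step size; it is not a value from the leaderboard.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b
```

With two equally rated models, a single win moves the winner up and the loser down by the same amount, so the gap between ratings reflects how often one model is preferred over the other.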

As of December 20th 2023, GPT-4 models (GPT-4-Turbo, GPT-4-0314, GPT-4-0613) sit at the top three spots.

Figure: Fraction of Model A Wins for All Non-tied A vs. B Battles (2023-12-20)

Highest to lowest ranking of models goes from top to bottom

Figure: Bootstrap of MLE Elo Estimates (1000 Rounds of Random Sampling) (2023-12-20)

Resolving "YES"

This resolves "YES" if any model version that is not a new version of GPT-4 (see section "Defining GPT-4-Turbo Models" below for definition) has a higher Elo rating than OpenAI's current (2023-12-20) GPT-4-Turbo model (not a variant of GPT-4-Turbo) on the public leaderboard at any point by April 30th 2024 (23:59 PDT). Note that this only compares against the current GPT-4-Turbo model (as of 2023-12-20), not the most recent model version or necessarily the highest-ranked model.

Resolving "NO"

This resolves "NO" if no new model that is not a new version of GPT-4 (see section "Defining GPT-4-Turbo Models" below for definition) ever scores a higher Elo rating than OpenAI's current (2023-12-20) GPT-4-Turbo model (not a variant of GPT-4-Turbo) on the public leaderboard at any point by April 30th 2024 (23:59 PDT). Note that this only compares against the current GPT-4-Turbo model (as of 2023-12-20), not the most recent model version or necessarily the highest-ranked model.

This will still resolve "NO" if for some reason OpenAI's GPT-4-Turbo model is no longer available on the leaderboard before the April 30th deadline.

Resolving "N/A"

This will resolve "N/A" if any of the following occurs:

  • The "Chatbot Arena Leaderboard" on HuggingFace (via this link: https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard) is no longer publicly accessible before the April 30th 2024 deadline.

  • The "GPT-4-Turbo" model is removed from or replaced on the leaderboard before the deadline, making it impossible to compare the current (2023-12-20) "GPT-4-Turbo" model with any new models that appear on the leaderboard. This may happen if the "GPT-4-Turbo" model is no longer accessible (via an API).

Defining GPT-4-Turbo Models

Since it's possible that OpenAI releases new models and model variants before the deadline, it helps to define what this market counts as a "new" model and what it means by the "current (2023-12-20) GPT-4-Turbo model".


The current version of GPT-4-Turbo is defined as of 2023-12-20 and this market will assume the model listed as "GPT-4-Turbo" on the Chatbot Arena Leaderboard is the model to beat. This model is assumed to be representative of the model announced on November 6th 2023 by OpenAI (see OpenAI's blog post).

If a new version of GPT-4 or GPT-4-Turbo appears on the leaderboard with a new name (such as the format "GPT-4-*" or "GPT-4-Turbo-*"), the model will not be considered a "new model" and will not be considered in ranking against "GPT-4-Turbo". If there is a new AI model from OpenAI that is a "new model" (not a "next generation of GPT-4/GPT-4-Turbo"), then that would count as a new model. For example, a GPT-5 model would likely count but a GPT-4.5/GPT-4.5-Turbo wouldn't.
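The naming rule and the comparison it feeds into can be sketched roughly as follows. This is only one possible reading of the rule, not the market creator's actual procedure: the regular expression, the function names, and the shape of the `ratings` mapping are all assumptions for illustration, and edge cases would still fall under the creator's judgement.

```python
import re

# Assumed pattern: "GPT-4", "GPT-4-*", and "GPT-4.5"-style names are treated
# as GPT-4 family members, so they are excluded from the comparison.
_GPT4_FAMILY = re.compile(r"^gpt-4(\.\d+)?(-.*)?$", re.IGNORECASE)


def is_gpt4_family(name: str) -> bool:
    """True if the leaderboard name looks like a GPT-4 / GPT-4-Turbo variant."""
    return _GPT4_FAMILY.match(name.strip()) is not None


def models_beating_baseline(ratings: dict[str, float],
                            baseline: str = "GPT-4-Turbo") -> list[str]:
    """Names of non-GPT-4-family models rated strictly above the baseline.

    `ratings` is a hypothetical mapping of leaderboard name -> Elo rating.
    Raises KeyError if the baseline is missing from the snapshot.
    """
    if baseline not in ratings:
        raise KeyError(f"{baseline!r} is not on this leaderboard snapshot")
    base = ratings[baseline]
    return sorted(name for name, rating in ratings.items()
                  if rating > base and not is_gpt4_family(name))
```

Under this reading, a non-empty result for any leaderboard snapshot taken before the deadline would point toward "YES". A tie with the baseline does not count, since the rule asks for a strictly higher Elo rating.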

The future is foggy, so there might be other edge cases that come about. I'll try to stick with the spirit of the description given for this market if something unexpected/unspecified occurs.

EDITS

20231221

  • Slight change in language to make it clearer that the model being compared is "GPT-4-Turbo" as of 2023-12-20.

  • Add "N/A" condition where the "GPT-4-Turbo" model is removed from or replaced on the leaderboard (such as if the model is inaccessible via the API).


Resolves YES :)

@VictorsOtherVector

@HenriThunberg thanks! Resolved yes!

Claude 3 Opus is now tied with GPT-4-1106-preview (originally labeled gpt-4-turbo in Nov 2023) at rank #1, but still has a lower Elo score than GPT-4-1106-preview (the "turbo" model).
