Gemini will achieve a higher rating than OpenAI's GPT-4 model on Chatbot Arena Leaderboard before the end of 2024
Resolved YES (Jan 26)
Made a slight edit to make clear that this market will focus on ***future*** versions of Google's Gemini model when comparing with OpenAI's GPT-4 model. These future iterations may or may not be designated as "Gemini Ultra" models.
This resolves to "YES" if some version of Google's Gemini Ultra (if there are multiple versions) has a higher Elo rating than any OpenAI GPT-4 model on the public leaderboard at any point by April 30th 2024 (23:59 PDT). Note, this is only comparing against any GPT-4 model and not necessarily the highest-ranked or most recent model.

This condition has been satisfied, no? Gemini Pro is now above GPT-4-0314 and GPT-4-0613.

EDIT: Realizing that the future versions of Gemini this market is interested in comparing might not be called "Ultra", I'm changing the language to include any future version of Google's model referred to as "Gemini".

This is essentially the same question as the one listed below but resolves at the end of 2024 instead of May 1st and includes any future version of Gemini (until the EOY deadline).

For completeness, below is essentially the same description as the first question, but with a different date and references to future versions of Gemini (not just "Ultra").


Gemini Ultra has not been publicly released as of December 16th 2023, but Google's technical report (linked from their blog post) claims it beats GPT-4 on various benchmarks.


Evaluating LLM-based chat assistants directly is a challenge, and there are multiple ways to do it, but one approach uses human preferences in a "Chatbot Arena", as presented in this paper & blog post by the Large Model Systems Organization (LMSYS) team. The team hosts a "Chatbot Arena Leaderboard" on HuggingFace built on this idea: human preference votes are used to compute an Elo rating that ranks the different LLM-based chatbots.
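For intuition, here's a rough Python sketch of how an Elo rating can be updated from pairwise human-preference "battles". This is not LMSYS's actual code; the K-factor, base, scale, starting rating, and model names below are purely illustrative assumptions.

```python
from collections import defaultdict

def compute_elo(battles, k=4, base=10, scale=400, init=1000):
    """battles: iterable of (model_a, model_b, winner) tuples, with winner in {"a", "b"}."""
    ratings = defaultdict(lambda: float(init))
    for model_a, model_b, winner in battles:
        ra, rb = ratings[model_a], ratings[model_b]
        expected_a = 1 / (1 + base ** ((rb - ra) / scale))  # P(model_a wins) under the Elo model
        score_a = 1.0 if winner == "a" else 0.0
        ratings[model_a] = ra + k * (score_a - expected_a)          # winner drifts up
        ratings[model_b] = rb - k * (score_a - expected_a)          # loser drifts down (zero-sum)
    return dict(ratings)

# Two toy votes favouring the same model nudge its rating upward.
print(compute_elo([("model-x", "model-y", "a"), ("model-x", "model-y", "a")]))
```

Each vote can shift a rating by at most K points, so many battles are needed before the rankings settle.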


As of December 16th, GPT-4 models sit in the top three spots, with Gemini Pro below GPT-3.5-Turbo-0613 but slightly above GPT-3.5-Turbo-0314.

[Figure: Fraction of Model A wins for all non-tied A vs. B battles (2023-12-16); models ranked highest to lowest from top to bottom]

[Figure: Bootstrap of MLE Elo estimates (1,000 rounds of random sampling) (2023-12-16)]
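The "bootstrap" in that second figure refers to resampling the recorded battles with replacement and recomputing the ratings many times to get a spread (confidence interval) per model. A rough sketch of that idea, reusing the illustrative compute_elo helper from the snippet above (again an assumption, not the leaderboard's exact code):

```python
import random
import statistics

def bootstrap_elo(battles, rounds=1000, seed=0):
    """Resample battles with replacement, recompute Elo each round,
    and report the median rating plus a rough min/max spread per model."""
    rng = random.Random(seed)
    samples = {}
    for _ in range(rounds):
        resampled = [rng.choice(battles) for _ in range(len(battles))]
        for model, rating in compute_elo(resampled).items():  # compute_elo defined earlier
            samples.setdefault(model, []).append(rating)
    return {m: (statistics.median(r), min(r), max(r)) for m, r in samples.items()}
```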


This resolves to "YES" if some version of Google's Gemini (if there are multiple versions) has a higher Elo rating than any OpenAI GPT-4 model on the public leaderboard at any point by December 31st 2024 (23:59 PDT). Note, this is only comparing against any GPT-4 model and not necessarily the highest-ranked or most recent model.
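In other words, the best-rated Gemini entry only has to exceed at least one GPT-4 entry, not the top-ranked one. A hypothetical snapshot (every name and rating below is invented for illustration) makes the check explicit:

```python
# Hypothetical leaderboard snapshot; names and ratings are made up for illustration only.
leaderboard = {
    "gpt-4-turbo": 1250,
    "gpt-4-0314": 1190,
    "gpt-4-0613": 1160,
    "gemini-(some-future-version)": 1175,
}

gemini_ratings = [r for m, r in leaderboard.items() if m.startswith("gemini")]
gpt4_ratings = [r for m, r in leaderboard.items() if m.startswith("gpt-4")]

# YES once some Gemini rating exceeds *some* GPT-4 rating on the board.
resolves_yes = bool(gemini_ratings) and bool(gpt4_ratings) and max(gemini_ratings) > min(gpt4_ratings)
print(resolves_yes)  # True here: 1175 beats GPT-4-0613's 1160, even though it trails the top GPT-4
```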


This resolves "NO" if no version of Google's Gemini (like "Gemini Ultra") appears on the leaderboard but does not ever score a higher Elo rating than an OpenAI GPT-4 model by December 31st 2024 (23:59 PDT).


This will resolve "N/A" if any of the following occurs:

  • The "Chatbot Arena Leaderboard" on HuggingFace (via this link: https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard) is no longer publicly accessible before the December 31st 2024 deadline.

  • A new version of Google's Gemini comparable to or more capable than "Gemini Ultra" never appears on the leaderboard before the December 31st 2024 deadline. (As of December 17th 2023, only "Gemini Pro" appears on the leaderboard).

  • There are no longer any OpenAI GPT-4 models on the leaderboard before the December 31st 2024 deadline, and no new version of Gemini received a rating while the GPT-4 models were still on the board.


There might be other edge cases I haven't thought of, but hopefully the above covers any that are likely to occur. I may edit/clarify the description if there are new developments or suggestions by others.

The spirit of this market is to have a public evaluation metric comparing Google's most capable Gemini model against one of the most capable models available today (OpenAI's GPT-4).
