I was browsing Twitter and saw a post by Karpathy speaking positively about Chatbot Arena, a platform for ranking LLMs based on human ratings. As expected, OpenAI is holding positions 1, 2 and 3. I wonder if anyone will be able to take that #1 position for a full week. I will resolve YES if it happens.
@Soli resolves YES, it has been 8 days since singer's comment below. I was fooled by the "Last updated March 29th" on the leaderboard, but that must have been some other change.
@traders Based on the comments below, I think it makes sense to resolve this question based on the ELO rating in case of a tie in "rank." The resolution criteria clearly stated, "I wonder if anyone will be able to take that #1 position for a full week." Right now, OpenAI still has position #1.
Based on the recent changes to the board's ranking system, I created a series of questions with slightly different resolution criteria to ensure we cover more definitions.
@JacobPfau @Gen @alexlitz, what do you think? OpenAI still has a higher Elo score than Anthropic. How should this case be handled?
@traders what do you all think? Does a tie count even if Anthropic has a lower Elo rating than OpenAI?
@Soli as written, they are rank 1, but it is a mega cringe fake rank 1. I don't really mind which way it goes
@Soli I have a position here so might be biased, but I personally don’t think this counts unless it’s an actual Elo tie. I don’t think anyone was trading on this possibility because they didn’t have any “ties” until now
@JacobPfau For what it's worth, I bet under the impression that "another rank #1" implied ChatGPT on 2nd or below. I have a NO bias.
@Soli I am pretty literalist with it so I would say it should. It is also effectively equivalent to a tie which I would have imagined would have resolved positively (if only due to tiebreaker seeming to have been alphabetical order :))
@Soli the question said '#1 position' not '#1 rank'. So currently I think it's a NO. It should resolve as YES if Claude actually gets to the first row in the table (even if ChatGPT is still listed as #1 by rank).
(I bet on YES myself)
@nsokolsky I agree. Going off the resolution criteria literally, the interpretation implies a negative here.
I wonder if anyone will be able to take that #1 position for a full week.
@singer My thoughts exactly.
The description implies that another model must replace ChatGPT as number 1, not just tie with it (plus GPT still has the highest Elo here).
Anthropic wasn't able to take the 1st spot, so I doubt they will for the rest of the year. OpenAI has the initiative now, since they still haven't released their next model.
However, if Google have a Gemini Ultra 1.5 in the works, this could potentially displace OA before they release their next model.
@alexlitz if no one other than OpenAI manages to stay in the #1 position for 1 week then the question resolves No
I think this will happen because Chatbot arena is allowing internet-enabled APIs (Bard) to compete with non internet-enabled APIs (GPT4). This is a bit surprising! But unless MS/OAI move fast to get an internet enabled version out, Bard Ultra should beat GPT4 even though I think Gemini Ultra will turn out to be noticeably worse than the latest GPT-4.
@WillSorenson I don't think it makes sense for OpenAI to release a model with internet search enabled by default, but I am also not sure Google will release Ultra via API with internet access anytime soon. Did they announce anywhere that they are planning to do this?
@dominic Especially because some other version of Gemini Pro is 100 ELO lower, which seems pretty significant