Who will have the best LLM at the end of 2024 (as decided by ChatBot Arena)?
Dec 31

I was browsing Twitter, and I saw a post by Karpathy positively talking about ChatBot Arena, which is a platform for ranking LLMs based on human ratings. As expected, OpenAI is holding positions 1, 2, and 3. I wonder which company will be #1 at the end of 2024.

Screenshot of the rankings table taken on the 13th of December:

bought Ṁ50 Meta YES

Argument in favor of Meta - they have a shit ton of GPUs coming in. It is very possible that OpenAI don't actually have any proprietary breakthroughs, so it's just a matter of executing on publicly known methods, data, and compute, and Meta has all of those.

bought Ṁ10 OpenAI YES

OpenAI retaining leadership

@traders Based on the comments below, I think it makes sense to resolve this question based on the ELO rating in case of a tie in "rank." When I created this question, a tie was not an option, so I doubt anyone even traded based on this assumption.

I created a similar question that only uses the rank. Feel free to trade on it.

I would resolve to highest ELO, fwiw.

Lmsys keeps throwing curveballs at prediction markets!


Can we have a clarification on whether the rank is what matters (as is literally stated) or the ELO?

bought Ṁ200 OpenAI YES

@WillSorenson hahaha, I never expected this to happen - I hope that at the end of the year, only one company will be in spot #1, but if this is not the case, then we can either resolve each winning option to a %, or we can rely on the ELO scores. What do you think makes more sense?

@Soli I am so surprised they're calling them all #1 (I guess because of confidence intervals, right?). I think quite clearly ELO should be what counts for this question. Feel like I can say this with a clean conscience as I think my positions just aren't that big.

@Soli Confidence intervals exist for a reason. They are there to tell us whether there is a meaningful or significant difference between the two ELO scores. So to ignore that and only focus on ELO is wrong, as you are leaving it up to chance.

The OP says "best LLM" (not highest ELO) and asks which will be rank #1. Both of these specifically reference the ranking. IMO it's clear in that respect.
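For anyone following the rank-vs-ELO debate, here is a minimal sketch of how the two readings can disagree. It assumes a shared-rank rule where a model's rank is one plus the number of models whose lower CI bound sits above its upper CI bound; that rule, the `Entry` type, the function names, and all the scores are illustrative assumptions, not something confirmed in this thread.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    name: str
    rating: float   # point estimate of the ELO score
    ci_low: float   # lower bound of the confidence interval
    ci_high: float  # upper bound of the confidence interval


def shared_ranks(entries):
    """Rank = 1 + number of models that are clearly better (assumed rule)."""
    ranks = {}
    for e in entries:
        # A model counts as clearly better only if its whole interval is above ours.
        better = sum(1 for o in entries if o.ci_low > e.ci_high)
        ranks[e.name] = 1 + better
    return ranks


def top_by_rating(entries):
    """The 'highest ELO' reading: ignore intervals, take the top point estimate."""
    return max(entries, key=lambda e: e.rating).name


# Made-up scores: model-a and model-b have overlapping intervals.
board = [
    Entry("model-a", 1250, 1243, 1257),
    Entry("model-b", 1248, 1240, 1256),
    Entry("model-c", 1200, 1194, 1206),
]

print(shared_ranks(board))   # model-a and model-b share rank 1
print(top_by_rating(board))  # model-a
```

With overlapping intervals, the rank reading gives a tie at #1 while the highest-ELO reading picks a single winner, which is exactly the disagreement above.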

@AJama I see where you are coming from. I have one more question: Does chess also have a confidence interval in the official rankings?

@Soli FIDE ranking doesn't. But we can only assume they do it like that for simplicity or historical reasons. Confidence intervals are more objective, and we ought to take the LLM rankings as is imo. If they use confidence intervals to determine the #1 rank, that's how they do it, right?

Hence "as decided by ChatBot Arena"

@Soli I think the way you've worded it initially (both #1, and as decided) means it should probably be resolved as a % if this happens.
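To make the "resolve as a %" option concrete, here is a rough sketch under the assumption that every company with a model in the shared top spot gets an equal share. The equal split, the function name `resolve_shares`, and the example ranks are hypothetical; the thread only mentions resolving each winning option to a %.

```python
from fractions import Fraction


def resolve_shares(rank_by_company):
    """Split resolution equally among companies tied for the best rank (assumed rule)."""
    best = min(rank_by_company.values())
    winners = [c for c, r in rank_by_company.items() if r == best]
    share = Fraction(1, len(winners))
    # Everyone outside the tie resolves to 0.
    return {c: (share if c in winners else Fraction(0)) for c in rank_by_company}


# Made-up ranks: two companies tied at #1.
example = resolve_shares({"OpenAI": 1, "Google": 1, "Meta": 3})
for company, share in example.items():
    print(f"{company}: {float(share):.0%}")
```

If only one company holds #1, the same function resolves it to 100%, so the single-winner case is unchanged.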

Funnily enough, this much less precise wording ended up getting at the essence of the question better! https://manifold.markets/HankyUSA/who-will-own-the-model-at-the-top-o

Looking at this screenshot from February, it's kind of annoying that they changed this: they didn't use to do shared spots (consider how close #6 and #7 are here, overlapping once you apply the CIs), but they have now revamped that...

bought Ṁ1 Mistral YES

@Soli I'm fine with many interpretations here given that we've still got a long time until this resolves. My slight inclination is to resolve to highest elo even if the CIs overlap with others.

opened a Ṁ50 Apple YES at 4% order

I've added an option for Apple

@MrLuke255 I don’t see any reason for Apple to release an LLM through an API

@Soli Hmmm good point. We'll see

Google is now at #3 and Gemini Ultra is yet to be released. Buying up Google to 20%.

@nsokolsky That was cheating. Bard is more than just an LLM.

if we allow agents on the board, GPT4+bing search can easily go to 1350 range

@Sss19971997 The resolution criteria only talk about the Chatbot Arena ranking. It should resolve to Google even if someone literally bribes Chatbot Arena to put Google on the first spot.

@nsokolsky google+search counts, although i do find it a bit weird, but bugs, obvious cheating, and bribes won’t count.

@Soli hm... then you need to update the market description, and add an explanation of what happens if ChatBot Arena shuts down.

@nsokolsky for me it is common sense but sure, will update later to avoid unnecessary complaints

@nsokolsky LLM is a term defined well enough to exclude LLM+search+agents

@Sss19971997 Realistically may not be much of a problem. OpenAI will probably include a version that can search like ChatGPT at some point.

Also it more realistically reflects what using these LLMs is like in the wild.

@Sss19971997 Idk, I put my predictions based off the text of the question. You want to define it more narrowly go ahead - but I always bet on the exact words in the prediction market.

@Sss19971997 I also hate that Chatbot Arena pits search-enabled and pure LLMs against each other, but I agree with @nsokolsky that unless we're being very pedantic, the Bard model would count under the current wording.

