Will Google have the best LLM by EOY 2024?

As with my other related questions, by default will judge based on the leaderboard here, based on Elo: https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard


If Google deploys a new model in 2024 that might or might not qualify, but it is not yet ranked on the leaderboard at year's end due to the time required for evaluation, I will hold off on resolving until it has been ranked, up to a maximum of February 1, 2025.

If Google releases a model that the public, or at least those who have signed up for its early testing programs, cannot access by the deadline, that does not count. I will use my own ability to access it absent any special treatment as a proxy here; if I do get special treatment, I will ask others.

As with other questions, I reserve the right to correct what I see as an egregious error in either direction, either by Twitter poll or outright fiat, including if the model is effectively available but does not appear on the leaderboard for logistical reasons.

(This is the EOY '24 version of the market here: https://manifold.markets/ZviMowshowitz/will-google-have-the-best-llm-by-eo)

Clarification (in response to Daniel): This resolves on the spot if Google has the best model - it's 'by EOY' not 'at EOY.'


@ZviMowshowitz it seems like many people are now realizing the LMSYS leaderboard allows models with live internet access and models that don't, which seems to be a good explanation for how Bard surprisingly jumped to the #2 spot. How do imbalances like this factor into your view of the "best LLM"?

@Jacy I believe that if the leaderboard put Bard with Gemini Pro in first over GPT-4 due to that, I would overrule it, but not if it was Ultra. But I'm not sure what the exact principle is.


Clarification: What if Google's LLM were to have some cognitive architecture (Tree of thought) added on top of it, running behind the scenes?

It seems possible that Google's LLM itself may be similarly capable to GPT-4, but due to the added cognitive architecture it will perform better than GPT-4 on the benchmarks.

How would this question resolve if this was the case?

@4168760 Note that the benchmark is human judgment and the humans are allowed to use prompt engineering, so I do not expect this to come up. If it does, then I would correct if it was an egregious error, meaning something along the lines of 'oh come on yes I know Gemini has slightly higher Elo here but GPT-4 is still obviously better than Gemini, all you have to do is use a reasonable system instruction like [some basic ToT script] and it blows Gemini away, whereas no similar prompt lets Gemini improve.' But it would have to be pretty egregious and obvious.

Clarification: What if a Google LLM takes the lead and then, say, GPT-5 unseats it again before EOY 2024? Related question: Can this resolve early?

@dreev Clarification: As per the question's original wording, this can resolve early, and will if the answer is already YES. It's by EOY 2024, not at EOY.

@ZviMowshowitz how long does it have to hold the top spot on the leaderboard? a week? a day? an hour?

@JaesonBooker No time minimum, so long as sample size is big enough for proper ranking.
