Will Google have the best LLM by EOY 2024?

As with my other related questions, by default will judge based on the leaderboard here, based on Elo: https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard


If Google deploys a new model in 2024 that might or might not qualify, but it is not yet ranked on the leaderboard at year's end due to the time required for evaluation, I will hold off on resolving until it has been ranked, up to a maximum of February 1, 2025.

If Google releases a model that the public, or at least those who have signed up for its early testing programs, cannot access by the deadline, that does not count. I will use my own ability to access it absent any special treatment as a proxy here; if I do get special treatment, I will ask others.

As with other questions, I reserve the right to correct what I see as an egregious error in either direction, either by Twitter poll or outright fiat, including if the model is effectively available but does not appear on the leaderboard for logistical reasons.

(This is the EOY '24 version of the market here: https://manifold.markets/ZviMowshowitz/will-google-have-the-best-llm-by-eo)

Clarification (in response to Daniel): This resolves on the spot if Google has the best model - it's 'by EOY' not 'at EOY.'


@ZviMowshowitz it seems like many people are now realizing the LMSYS leaderboard allows models with live internet access and models that don't, which seems to be a good explanation for how Bard surprisingly jumped to the #2 spot. How do imbalances like this factor into your view of the "best LLM"?

@Jacy I believe that if the leaderboard put Bard with Gemini Pro in first over GPT-4 due to that, I would overrule it, but not if it was Ultra. But I'm not sure what the exact principle is.


Clarification: What if Google's LLM were to have some cognitive architecture (Tree of thought) added on top of it, running behind the scenes?

It seems possible that Google's LLM itself may be similarly capable to GPT-4, but due to the added cognitive architecture it will perform better than GPT-4 on the benchmarks.

How would this question resolve if this was the case?

@4168760 Note that the benchmark is human judgment and the humans are allowed to use prompt engineering, so I do not expect this to come up. If it does, then I would correct if it was an egregious error, meaning something along the lines of 'oh come on yes I know Gemini has slightly higher Elo here but GPT-4 is still obviously better than Gemini, all you have to do is use a reasonable system instruction like [some basic ToT script] and it blows Gemini away, whereas no similar prompt lets Gemini improve.' But it would have to be pretty egregious and obvious.

Clarification: What if a Google LLM takes the lead and then, say, GPT-5 unseats it again before EOY 2024? Related question: Can this resolve early?

@dreev Clarification: As per the question's original wording, this can resolve early, and will if the answer is already YES. It's by EOY 2024, not at EOY.

@ZviMowshowitz how long does it have to hold the top spot on the leaderboard? a week? a day? an hour?

@JaesonBooker No time minimum, so long as sample size is big enough for proper ranking.
