Will I think that the top Chatbot Arena scores accurately reflect which LLMs are most capable and useful at EOY 2024?
Resolved NO on Jan 12

Many markets on capabilities are being resolved via Chatbot Arena, but I think that Chatbot Arena scores might not be a very good measurement. See https://manifold.markets/DanielKokotajlo/gpt4-or-better-model-available-for#Cd09AsavvH6UVdLod4N6 for some discussion.

Another reason why Chatbot Arena could fail is that as models get more powerful, chat use cases become less representative. At a minimum, a large fraction of current chat use cases are very easy, so performance on them saturates.

Note that this market just refers to whether I think that the top few (e.g. top 10) Chatbot Arena scores reflect the actual capabilities reasonably well, not whether Chatbot Arena can't be gamed. So if (e.g.) the best chatbots don't game Chatbot Arena (even if they could), then the scores could be sufficiently representative.

I'm open to comments trying to convince me either way, but I don't promise to keep up with this market.

I think this market isn't that clear cut, but currently I'm planning to resolve to No.

I'll leave it unresolved for another day if anyone wants to comment arguing for Yes/No.

My basic reasoning for No is:
- Probably the actual top 10 capability ranking is something like o1, o1-preview, 3.5 sonnet new, 3.5 sonnet old, gemini exp (1206), gemini 2.0 flash thinking, gemini 2.0 flash, DeepSeek V3, o1-mini, gpt-4o (2024-05-13).
- This is pretty different from the current ranking (without style control; note that this market was made before style control).

do you currently?

@jacksonpolack Yes, I think current scores track reasonably well, but not amazingly well. So I would resolve to Yes if this were the current end date.

@RyanGreenblatt do you still hold the same opinion?

@Soli Seems more dismal every day lol. Note that this market was created prior to style control and just reflects the pre-style-control results.
