Will I think that the top Chatbot Arena scores accurately reflect which LLMs are most capable and useful at EOY 2024?

resolved Jan 12

Many markets on capabilities are being resolved via Chatbot Arena, but I think that Chatbot Arena scores might not be a very good measure of capabilities. See https://manifold.markets/DanielKokotajlo/gpt4-or-better-model-available-for#Cd09AsavvH6UVdLod4N6 for some discussion.

Another reason why Chatbot Arena could fail is that as models get more powerful, chat use cases become less representative of what the models can do. At a minimum, a high fraction of chat use cases right now are very easy, so performance on them saturates.

Note that this market is only about whether I think that the top few (e.g. top 10) Chatbot Arena scores reflect the actual capabilities reasonably well, not about whether Chatbot Arena can be gamed. So if (e.g.) the best chatbots don't game Chatbot Arena (even if they could), then the scores could be sufficiently representative.

I'm open to comments trying to convince me either way, but I don't promise to keep up with this market.

I think this market isn't that clear cut, but currently I'm planning to resolve to No.

I'll leave it unresolved for another day if anyone wants to comment arguing for Yes/No.

My basic reasoning for No is:
- Probably the actual top 10 capability ranking is something like o1, o1-preview, 3.5 Sonnet (new), 3.5 Sonnet (old), Gemini Exp (1206), Gemini 2.0 Flash Thinking, Gemini 2.0 Flash, DeepSeek V3, o1-mini, GPT-4o (2024-05-13).
- This is pretty different from the current Chatbot Arena ranking. (I'm comparing against the ranking without style control, since this market was made before style control.)