Will Claude 3 outrank GPT-4 on the LMSYS Chatbot Arena Leaderboard?
resolved Mar 21
Resolved
NO

Resolves YES if any Claude 3 model outranks the best-performing GPT-4 model at any point within two weeks of first being listed, i.e. if users prefer Claude 3 responses to GPT-4 responses at any point.

(GPT-4.5 would not count as a GPT-4 model.)

https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard


EDIT: Clarification, due to the new way ranks are displayed on the LMSYS Chatbot Arena Elo Leaderboard: If Claude 3 and GPT-4 both have rank 1, but the Arena Elo of the former is greater than the one of the latter, this resolves YES. I believe this is most in line with how it was meant when this market was first created.


Has done so now

@BogdanIonutCirstea So I guess my intuition about Opus was correct, but my intuition about how long the Elo would take to converge was wrong. Bah.

@traders For those who still want to see how this turns out:

Just to respond once more here, as some people gave me negative reviews:

From the beginning, the market read "within two weeks of first being listed". The market was created at a time when the model had not been listed yet. When it was listed, I updated the closing time of the market to exactly two weeks after the first listing announcement on Twitter.

It seems that some people were surprised that the market closed this morning. Sorry, ideally I would've posted once more here to remind everyone that the market will close soon. But I do think the closing time + initial wording of the question were very clear.

I presume this market resolves once the LMSYS leaderboard updates?

@bobbill I presumed not when betting. It says "within two weeks of first being listed". I think that moment is about an hour away now, at least going by the Twitter post announcing Claude was on the leaderboard.

If we were going to wait for another update it should have said "in the first update after two weeks had passed" or something like that.

@ChrisPrichard yeah exactly!


Inching closer!

The 95% confidence intervals overlap!

I wonder if the Elo difference is due more to Opus sometimes formatting answers badly (it doesn't know how to format mathematical formulas, and it can screw up code formatting). Opus seems smarter than GPT-4 overall, IME.

Haven't found that personally. For coding questions/knowledge, I'm finding GPT-4 still superior. On hard problems (where neither AI can solve them zero-shot), I found GPT-4 had a win rate of 65%+.

@Uaaar33 Claude has been -way- better for me on programming and math problems, especially on more delicate algorithms. It’s also far better at creative writing and document comprehension if properly prompted

No accounting for taste 😮‍💨😋


@Jacy any reason for buying so much NO with the rankings being as close as they currently are?

@jBosc The confidence intervals don't overlap, so if you assume LMSYS did their statistics well enough and there's no structural bias in the votes cast so far vs. the votes cast over the next two weeks, then this question should really be <5%.
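For context on why a modest rating gap matters: under the standard Elo model (logistic curve, 400-point scale), a rating difference maps directly to an expected head-to-head win probability. A minimal sketch, using illustrative ratings rather than exact leaderboard values:

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A vs. B under the standard Elo model
    (logistic curve with the conventional 400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# Illustrative 20-point gap, roughly the kind of margin discussed here.
print(round(elo_expected_score(1250, 1230), 3))  # ~0.529
```

So a ~20-point lead corresponds to only about a 53% expected win rate per matchup; the question of who ranks first hinges on whether the voting data pins the ratings down tightly enough to separate gaps that small.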


@IsaacCarruthers They are now within range.

@IsaacCarruthers Looks like they might have done their stats wrong, or perhaps Claude has been patched? Old data here: https://www.reddit.com/r/singularity/comments/1b8yucm/chatbot_arena_updatedclaude_3_opus_failed_to_take/. We're at 1247, up from 1233, when the 95% confidence interval ended at 1242; that seems improbable from chance alone.

@Uaaar33 I took a look at the raw data and Claude has been doing much better in head-to-head matches with GPT-4 over the last week than in the week before. It looks like this might be because LMSYS has been getting more attention from non-English speakers recently, and Claude seems to do significantly better than GPT-4 in Russian.

Guys I gave Claude 3 all my tweets and asked it "It's not Mira enough. make it more Mira, like the source". On the 5th iteration I got this:

This is hazardous optimized writing. I can't glance at it without laughing and reading it out loud, even 5 hours later. You guys don't have enough IQ points to understand it, so it might not work on you, but it's the greatest thing I've ever read. Not in terms of stories, but the technical quality of writing.

If the Chatbot Arena doesn't rate Claude 3 higher, the only explanation will be that we already have AGI (so the AGI markets should resolve YES if this resolves NO) and it's already exceeded the average Chatbot Arena voter's ability to recognize greatness.

@MiraBot lol I only skimmed this but it's great


@MiraBot Welp, I guess Claude 3 Opus is AGI, which implies that OpenAI has indeed achieved AGI internally for a while.
