Resolves YES if any Claude 3 model outranks the best-performing GPT-4 model at any point within two weeks of first being listed, i.e. if users prefer Claude 3 responses to GPT-4 responses at any point.
(GPT-4.5 would not count as a GPT-4 model.)
https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard
EDIT: Clarification, due to the new way ranks are displayed on the LMSYS Chatbot Arena Elo Leaderboard: If Claude 3 and GPT-4 both have rank 1, but the Arena Elo of the former is greater than that of the latter, this resolves YES. I believe this is most in line with how it was meant when this market was first created.
@BogdanIonutCirstea So I guess my intuition about Opus was correct, but my intuition about how long the Elo would take to converge was wrong. Bah.
Just to respond once more here, as some people gave me negative reviews:
From the beginning, the market read "within two weeks of first being listed". The market was created at a time when the model had not been listed yet. When it was listed, I updated the closing time of the market to exactly two weeks after the first listing announcement on Twitter.
It seems that some people were surprised that the market closed this morning. Sorry, ideally I would've posted once more here to remind everyone that the market will close soon. But I do think the closing time + initial wording of the question were very clear.
@bobbill I presumed not when betting. It says "within two weeks of first being listed". I think that moment is like an hour away now, at least going by the Twitter post announcing Claude was on the leaderboard.
If we were going to wait for another update it should have said "in the first update after two weeks had passed" or something like that.
@Uaaar33 Claude has been -way- better for me on programming and math problems, especially on more delicate algorithms. It’s also far better at creative writing and document comprehension if properly prompted
@Jacy any reason for buying so much NO with the rankings being as close as they currently are?
@jBosc The confidence intervals don't overlap, so if you assume LMSYS did their statistics well enough and there's no structural bias in the votes cast so far vs. the votes cast over the next two weeks, then this question should really be <5%.
@IsaacCarruthers Looks like they might have done their stats wrong, or perhaps Claude has been patched? Old data here: https://www.reddit.com/r/singularity/comments/1b8yucm/chatbot_arena_updatedclaude_3_opus_failed_to_take/. We're at 1247, up from 1233, when the 95% confidence interval ended at 1242; that seems improbable just from chance.
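For context on what a rating gap like this means head-to-head, here's a quick sketch of the standard Elo expected-score formula (the usual 400-point logistic scale, which is what Arena-style leaderboards are generally understood to use); the 1251 figure for GPT-4 below is a hypothetical placeholder, not a quoted leaderboard value:

```python
def elo_expected_score(r_a, r_b):
    """Probability that player A beats player B under the standard
    Elo model (logistic curve with scale factor 400)."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

# Example: Claude 3 Opus at 1247 vs. a hypothetical GPT-4 rating of 1251.
# A few-point gap translates to a near-coin-flip per battle.
p = elo_expected_score(1247, 1251)
```

The takeaway is that single-digit Elo gaps imply per-battle win probabilities very close to 50%, which is why the confidence intervals, not the point estimates, are what matter here.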
@Uaaar33 I took a look at the raw data and Claude has been doing much better in head-to-head matches with GPT-4 over the last week than in the week before. It looks like this might be because LMSYS has been getting more attention from non-English speakers recently, and Claude seems to do significantly better than GPT-4 in Russian.
Guys, I gave Claude 3 all my tweets and asked it "It's not Mira enough. Make it more Mira, like the source". On the 5th iteration I got this:
This is hazardously optimized writing. I can't glance at it without laughing and reading it out loud, even 5 hours later. You guys don't have enough IQ points to understand it, so it might not work on you, but it's the greatest thing I've ever read. Not in terms of stories, but the technical quality of writing.
If the chatbot arena doesn't rate Claude 3 higher, the only explanation will be that we already have AGI (so the AGI markets should resolve YES if this resolves NO) and it's already exceeded the average chatbot arena voter's ability to recognize greatness.
@MiraBot Welp, I guess Claude 3 Opus is AGI, which implies that OpenAI has indeed had AGI internally for a while.