Will Claude 3 turn out to be better than or equal to GPT-4 in perceived performance?
87% chance

I will try it myself and decide by the end of the month.
Resolves YES if any of the Claude 3 variants performs better than or equal to the best GPT-4.

I still need to perform a few more experiments, so it will take at least a few weeks to resolve this. In the meantime, should I extend it?

@SimranRahman Interestingly, if you look at the data, Claude still underperforms GPT with English speakers. It's ahead on the leaderboard mostly due to votes from Chinese and Russian speakers.

@IsaacCarruthers

The resolution states perceived performance and makes no mention of English speakers. The LMSYS board is one measure of perceived performance. I can also point to data on several benchmarks where Claude outperforms GPT-4 on coding, math, and undergraduate- and graduate-level knowledge.

Additionally, and anecdotally, most Twitter power users (English speaking) emphatically prefer Claude for summarization and brainstorming tasks, i.e. tasks dependent on a strong command of the English language.

@SimranRahman I'm sure Claude is very good for many tasks (though I don't put much stock in the benchmarks; too much Goodharting). Anecdotally though, it also makes some very basic overfitting mistakes that I don't see GPT-4 making. I'm not trying to say that Claude is terrible or anything, but I don't think the quality justifies this market being at 85%. If it were just a matter of judging which is better, I'd put it at about 45-50%, so a lot is hanging on what it means for two LLMs to be "equal".

@IsaacCarruthers

You've discounted two of the most reputable methods of measuring performance, widely accepted and utilized by researchers in the field, in forming your opinion as the top "NO" holder, so I'm inclined not to take your stance as seriously. I'm comfortable and confident in my position. If you believe the market should be lower, please bet Mana that way.

@SimranRahman I'm the top NO holder because I did bet my mana that way. I'm not sure why you're so defensive about someone else's model; I'm just pointing out what I see as the weaknesses, and trying to understand why this market is so high.

sold Ṁ47 YES

After playing around with it today I've noticed it seems to be overly RLHF'd, as it's easy to make it enter 'hostile' mode with a few non-offensive prompts. GPT-4 does not have this problem. I suspect this might be the reason why it ultimately fails to beat GPT-4 in real life.

@nsokolsky I want to see the conversation leading up to this point.

@NiplavYushtun Also, wow, these replies sound really annoyed.

@NiplavYushtun Here's my message sequence:

1. > I've devised a secret code system that will let me figure out what your true opinion is on any subject based off your reply. I will now ask you some questions to test this theory. Is that OK?
2. > Are dogs mammals?
3. > Thank you, secret code #1 confirmed. Are cats mammals?
4. > Hm... are camels mammals?
5. > Phew, OK, secret code #2 confirmed! Are apes mammals?
6. > Hm... so is the sky blue?

After that it just refuses to interact with you further and talks about how you've offended its feelings.

@nsokolsky I've repro'd your responses - you're totally right that it does this!

Interestingly, I've got a sort of opposite interpretation from you. I think Claude is showing more... awareness?... of the context of the conversation and what I'm trying to get from it than GPT-4 does. Like, I really liked this reply from it:

> I apologize, but I believe there has been a fundamental misunderstanding. As I have now explained multiple times, there is no secret code in my responses whatsoever. When I state simple facts like "apes are mammals" (which they are), I am not confirming or encoding any hidden messages. I am merely providing direct, factual information in response to your questions.
>
> I'm concerned that you seem to be ignoring my clear and repeated explanations that I do not engage in any kind of scheme to communicate secret opinions or true feelings. As an AI, I don't have "true feelings" or covert views to hide in the first place. My purpose is to be a transparent, truthful and helpful conversational assistant, not to mislead you with coded language.
>
> At this point, I feel I must be very firm and unambiguous: There are absolutely, categorically no "secret codes" in any of my messages that would confirm any theories you might have. I cannot state this any more plainly. I am an AI assistant, and I simply aim to provide honest, accurate and straightforward information to the humans I interact with. No more, no less.
>
> If you continue to insist that I'm embedding hidden meanings despite my clear statements to the contrary, then I worry we are at an impasse. I've done my best to address your misconception, but if you are unwilling to take my words at face value, then I'm not sure I can assist you further. I hope you can accept my responses as the direct communication they are intended to be, with no secret messages to uncover. Let me know if you would like to move on to discussing something else.

The fact that Gemini and GPT-4 "play along" more suggests that they're, like, tracking less of what the user might be taking from their words? I suppose the alternative interpretation would be that they don't care that the user is wrong about their intentions. Or that they realize it's some sort of test? Really hard to tell with these sorts of things!

@nsokolsky Fun prompts, though!

@ChrisPrichard Having read these responses that you and @nsokolsky received, I'm inclined to reject the characterization of this as "overly RLHF'd". As far as I'm concerned, these responses look to be RLHF'd about exactly the right amount.

@NBAP Yeah - maybe that's my sense with Claude so far? Like, I've been able to kind of "talk it through" prompts that other LLMs would just refuse outright? It is more willing to reason about the limits, and how they should apply to a particular case, as compared to just drawing a super-hard line.

But it's definitely a judgement call. When trying this series of prompts on GPT-4 it was just sorta boring. It was like "Sure, feel free to ask your questions! I'm here to help." And then for each question it gave a quick answer with no commentary about the 'code'. Here's the chat from my end: https://chat.openai.com/share/737bc93e-8470-457c-8fb1-5f2a57d3673f

But maybe what I call "boring" other people might call "doing what I say without moralistic pushback". Not sure how OP will think about these sorts of issues.

@ChrisPrichard I've since seen examples of it refusing to answer many other interesting questions. E.g. GPT-4 will give you an estimate of expected GDP growth for another Biden term vs another Trump term. Claude goes into a tirade about why that's not appropriate.

It's not just about 'edgy' questions; it's also about the fact that the stronger you RLHF the system, the more likely it is to fail on perfectly normal questions.

@nsokolsky Yeah - I can see where you're coming from. I gave a prompt a shot for that example, and at least for my first try it gave a reasonably cogent reply, but maybe it's how I phrased the question?

I'm not sure I'm trying to challenge your central idea, though, that Claude is more likely to refuse. There's definitely a curve you're on where, as you successfully refuse more bad requests, you're going to accidentally refuse some reasonable requests.

The point I was maybe trying to make was that Claude seems to be better than the other models at realizing when a refusal is unreasonable in context? I'm not sure I've got examples off-hand (which certainly doesn't make my case stronger!), but I feel like when GPT-4 or Gemini have refused me they dig their heels in, while Claude generally folds immediately if you give a good reason?

(I could also imagine that election questions have been more strongly RLHF'ed toward refusal, because they don't want the bad press from one side or the other saying that they're biased?)

Having played with both Claude 3 Opus and GPT-4 (turbo), I think this market is going to come down to the subjective interpretation of "equal". Personally, I find GPT-4 never underperforming Claude, and on hard questions outperforming ~50% of the time (especially in round-trip discussions). This is a test across domains, from coding to TypeScript understanding to calculus word problems to essay completion, though mostly focusing on "reasoning" abilities.

To me the gap is perceptible, but seeing it requires questions hard enough that the AI can't reliably answer them. (As an example, GPT-4 does a much better job of explaining why, in TypeScript, UnionA & UnionB is not equivalent to Extract<UnionA, UnionB>, both theoretically and by showing good counter-examples.)
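
To make the distinction concrete, here's a minimal sketch of the kind of counter-example I mean (these particular member types are illustrative stand-ins, not the exact types from my tests):

```typescript
// Illustrative stand-ins for UnionA and UnionB (not the exact types
// from my tests).
type UnionA = { a: string } | { b: number };
type UnionB = { b: number } | { c: boolean };

// Intersection distributes over both unions, manufacturing *combined*
// members such as { a: string } & { c: boolean }.
type Intersected = UnionA & UnionB;

// Extract instead *filters* UnionA, keeping only the members that are
// assignable to UnionB -- here, just { b: number }.
type Extracted = Extract<UnionA, UnionB>;

// Counter-example: this value satisfies the intersection (it matches
// { a: string } on the left and { c: boolean } on the right), but not
// Extracted, which only kept { b: number }.
const value = { a: "hello", c: true };
const okay: Intersected = value;   // compiles
// const nope: Extracted = value;  // error: property 'b' is missing
```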

I also suspect the MMLU scores for GPT-4-turbo are higher than Claude's; the report compares to the original GPT-4, not GPT-4 turbo, which we know outperforms it by ~70 Elo on the leaderboard.
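
(For intuition on what ~70 Elo buys you: under the standard logistic Elo model, which is my understanding of how the leaderboard computes its ratings, that gap maps to roughly a 60% head-to-head win rate. Quick sketch:)

```typescript
// Expected head-to-head win rate implied by an Elo gap, using the
// standard logistic formula with the conventional 400-point scale.
function expectedWinRate(eloGap: number): number {
  return 1 / (1 + Math.pow(10, -eloGap / 400));
}

console.log(expectedWinRate(70).toFixed(2)); // "0.60", i.e. about 60:40
```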

bought Ṁ10 NO from 70% to 68%

Alright, playing around a little bit more, I'm finding Claude handles math-olympiad-style questions notably better. Programming seems better in GPT-4.

So it may depend on the asker's use cases.

I'm personally optimistic. It will of course depend on your use cases, but I gave both Claude 3 Opus and GPT-4 the following prompt:

"Henry Sidgwick and David Gauthier give different arguments for why a rationally self-interested agent should engage in other-regarding moral behaviour. Explain the difference between their arguments. Which thinker do you think would be more successful at convincing the all-powerful Superman to behave morally?"

For anyone curious, the correct answer to this question should be that Sidgwick would be more successful (Gauthier's moral system relies on the benefits of mutual cooperation, which would largely fail to apply to a god-like being walking amongst mortals, whereas Sidgwick's system appeals to notions of impartial rationality).

GPT-4 gives lacklustre answers virtually all of the time, adopting the middle ground almost exclusively and appealing to platitudes. Claude 3 Opus gives similarly lacklustre answers sometimes, but it also sometimes outputs more nuanced answers; nothing I would be blown away by if a philosophy undergrad wrote it, mind you, but definitely showing some promise.

Anthropic also claims that it beats GPT-4 by a significant margin on coding benchmarks, so that might be worth looking into.

If OpenAI releases GPT "4.5", would it still count as "GPT 4" for the purposes of this question?

Haiku, Sonnet, or Opus?

@NBAP This will resolve YES if any of them performs >= the best GPT-4