Will Gemini be widely considered better than GPT-4?
Resolved NO (Mar 6)

Resolves 3 months after Bard is switched from PaLM 2 to Gemini. (Resolves NA if it isn't switched to Gemini for some reason)

Unless I'm given a very compelling reason, resolution will be based on https://github.com/FranxYao/chain-of-thought-hub


Specifically: if, 3 months after Gemini is released, it has at least 3 scores better than GPT-4, and at least twice as many scores better than GPT-4 as worse, this market resolves YES. Otherwise, assuming no major contention, it resolves NO. If not enough benchmarks are available on this repo I will try to find something similar.
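
A minimal sketch of how I'd mechanically apply this rule, assuming scores are available as simple benchmark→score mappings (the function name and input format below are just illustrative, not part of the criteria):

```python
# Minimal sketch of the resolution rule above; names and input format are illustrative.
def resolve(gemini_scores, gpt4_scores):
    """Each argument maps benchmark name -> score, restricted here to
    benchmarks reported for both models on the chain-of-thought hub."""
    common = set(gemini_scores) & set(gpt4_scores)
    better = sum(1 for b in common if gemini_scores[b] > gpt4_scores[b])
    worse = sum(1 for b in common if gemini_scores[b] < gpt4_scores[b])
    # YES needs at least 3 wins and at least twice as many wins as losses.
    return "YES" if better >= 3 and better >= 2 * worse else "NO"
```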


The market will close soon, so now is the time to fill in the gaps in the chain-of-thought hub table https://github.com/FranxYao/chain-of-thought-hub.

Specifically, these numbers could affect the resolution:

  • GSM8K accuracy for Gemini Ultra

  • C-Eval accuracy for Gemini Ultra

  • BBH accuracy for GPT-4

In order to affect the resolution, the number needs to be accepted into the repo. If, when the market closes, there's an open PR for one of these that would change the resolution, I'll wait a few days for it to be merged before resolving.

@SimranRahman Right, but that's an updated version (not yet released, if I understand correctly), so it does not affect this market, which is about the state of affairs at release.

@chrisjbillington Yes. But more specifically, I will resolve the market based on Gemini Ultra 1.0. That's the treatment most consistent with how I said I'd treat new versions of GPT-4.

predicted YES

Microsoft has new benchmarks out that seem to be a better apples-to-apples comparison. Google did more rigorous prompt engineering than had been standard in benchmarks in the past.

https://www.microsoft.com/en-us/research/blog/steering-at-the-frontier-extending-the-power-of-prompting/

predicted NO

@WillSorenson Microsoft and Google are in a competition to see who can out-cherry-pick the other.

@WillSorenson Cool 🙂 Honestly, if Gemini isn't even as good as GPT-4, that's very good news for humanity.

@YoavTzfati but let's see what other evidence comes up

predicted YES

@WillSorenson The benchmark hacking is getting a bit comical at this point. I guess Microsoft is implying, "Google got its 90.04% score the same way. We're just being honest about it."

What stood out to me is the re-tested zero-shot scores, with GPT-4 now consistently ahead of Gemini Ultra.

Looks like Gemini Ultra scores are on the chain-of-thought hub. There are three tests in common with GPT-4, two surpassing it and one below (MMLU is below, I'm assuming because of taking the 5-shot results). Technically this would mean resolving NO unless something changes by March 6th, since it doesn't fulfill the "3 scores better" requirement, but it's definitely close (worked through in the snippet below).

If anyone who's NOT invested in the market has an opinion, I'd be happy to hear it 🙂
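
To make the arithmetic explicit, here's the current tally run through the same check as the sketch in the description (the benchmark names other than MMLU and all of the scores are placeholders; only the better/worse ordering matters):

```python
# Placeholder scores reproducing the current 2-better / 1-worse tally.
gemini = {"benchmark_a": 0.90, "benchmark_b": 0.85, "MMLU": 0.83}
gpt4   = {"benchmark_a": 0.88, "benchmark_b": 0.84, "MMLU": 0.86}

better = sum(1 for b in gemini if gemini[b] > gpt4[b])  # 2
worse  = sum(1 for b in gemini if gemini[b] < gpt4[b])  # 1

# At least 3 wins AND at least twice as many wins as losses are needed for YES.
print("YES" if better >= 3 and better >= 2 * worse else "NO")  # prints NO
```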

predicted YES

@YoavTzfati I'm invested in YES and agree that resolving NO based on your explanation is above board.

Somewhat related

Ṁ2k limit NO at 70% (50% seems ... low though? Like, if Gemini has data contamination that affects benchmarks, this still resolves YES)

predicted YES

Looks like the Gemini Ultra demo was mostly fake https://twitter.com/parmy/status/1732811357068615969

bought Ṁ200 NO from 55% to 49%

Makes me think their reported benchmarks are more likely to be heavily cherry-picked

@JonasVollmer

I love that ‘above 90%’ turns out to be exactly 90.04%, whereas human expert is 89.8%, prior SOTA was 86.4%. Chef’s kiss, 10/10, no notes. I mean, what a coincidence, that is not suspicious at all and no one was benchmark gaming that, no way.

https://thezvi.substack.com/p/gemini-10

predicted YES

@JonasVollmer Their benchmarks were industry standards like BBH and MMLU, run with the same number of shots on both models. I don't see how that could be cherry-picked.

predicted NO

@ShadowyZephyr You could fine-tune your model in such a way that it does well on MMLU, then selectively report further benchmarks only when it does better on them, omitting the ones it does worse on. You can also run the benchmarks multiple times (different temperature, etc.) and report only the best runs for your own model and only the worst ones for competitors.

predicted NO

Gemini Ultra is still very impressive of course, and IDK how much you can really game the benchmarks. I'm no expert on this.

predicted YES

@JonasVollmer The whole point of MMLU is that it's very generalized and covers a broad range of useful skills for an AI assistant. If it is "fine-tuned" to do well on MMLU, then it will be able to do normal tasks too, making it better. All the benchmarks listed are well-known standards as well.

predicted YES

@ShadowyZephyr Broad and generalized benchmarks are actually one of the major problems with LLMs. They tell you nothing meaningful about specific and narrow use cases. This is particularly a problem for high-stakes use cases.

predicted YES

@BTE These models are made to be broad, though, not to cover specific narrow use cases with the highest performance. There are separate models that do those things.

Are you going to extend the closing date on this market to 3 months from today?

@Eliza done 👍
