Will Google's Gemini beat GPT4 in terms of capabilities on release?
Apr 3

Resolves YES if the current iteration of GPT-4 (at the time of Gemini's release) scores lower than Gemini on a generalized capabilities task: BigBench, if no better one exists at the time of release.

Resolves NO if vice versa

Resolves N/A if unforeseen event causes comparison to be impossible.



I've closed this to trading.

The creator is inactive, and to my knowledge we aren't actually waiting on anything for the market to resolve - resolution is supposed to be about the state of things at the time Gemini (ultra) was released.

So this should be resolved now. Feel free to discuss how you think it should resolve and present your evidence. If it's clear I can resolve it and if not then it will go to a three-mods vote.

I'll leave some time for discussion and then post to discord requesting a mod vote if needed.

Alternatively, if @brubsby is around and seeing this ping, they can resolve themselves.

@brubsby what's the current status here?



sold Ṁ413 NO

@Joshua I'm quite happy to take mana off the table and hand the risk to @chrisjbillington
The MM has a history of being a bit "vibe-based" with the resolution of this specific market. Is 1.5 the same as 1? Ultra? Advanced? What version of GPT 4 are we talking about now? It really could go either way.

On top of that, @ZviMowshowitz and others are reporting that their experience with Gemini is quite good, superior to GPT-4 in many regards.

@jgyou The market says "on release", and whilst there were gripes about whether Pro would count for this (brubsby's market on whether "Gemini" would be released resolved YES on Pro being released), I think it's very clear a 1.5 release would not count. This market is about whether Gemini Ultra right now is better than GPT-4 right now.

bought Ṁ250 of NO

Recent data point not yet posted here: Bard with Gemini Pro scores below GPT-4 Turbo on Chatbot Arena

bought Ṁ50 of NO

As tested by whom?

Google modified their methodology until they got a higher score on many benchmarks, without even giving GPT-4 the equivalent methods. Are we doing that here, or do we choose one method and apply it to both?

predicts NO

@brubsby will GPT-4.5, if such a thing is released before Gemini Ultra, count as GPT-4 for the purposes of this market, or no?

predicts NO

the current iteration of gpt4

i would claim yes

sold Ṁ670 of YES

GPT4 + improved prompting gives 89% on Big-Bench-Hard.

(Evaluated Few Shot + CoT)

predicts YES

@jgyou I don't get these results. Why are they getting much better results for GPT-4 with what is reportedly the same methodology as the one reported in the Gemini report? Also, it seems like they just copied the results from the Gemini report, so they didn't actually run Gemini using their benchmarking code.

predicts NO

@Shump Yes, those are just the same Gemini Ultra number (maybe they don't have access yet?). The point is that some version of GPT-4 can be prompted to SOTA when MSFT plays the prompting game.

So I'm not so sure that Ultra is superior anymore.

predicts YES

@Shump >Why are they getting much better results for GPT-4 with what is reportedly the same methodology as the one reported in the Gemini report?

The test results in the Gemini paper are for the June GPT4. These new results are for the current Turbo model.

Either Turbo GPT4 is smarter, or (more likely) additional test answers leaked to the internet and are now in GPT4's training data.

predicts NO

@Coagulopath I think turbo is smarter? but it could be contamination as well

predicts YES

@Coagulopath I think if Turbo is THAT much smarter OpenAI would have been advertising that.
Can someone more familiar with LLMs than me say whether there can be variance in how exactly you n-shot or CoT prompt an AI? I think there are many different potential ways to do either but I'm not sure.

predicts YES
predicts NO

that has nothing to do with bigbench scores lol

sold Ṁ983 of YES

@jacksonpolack It's indicative of the fact that Google is trying to bend the truth about the model's capabilities.

bought Ṁ10 of NO

You see that GPT-4 is actually better if you don't cherrypick. https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf

bought Ṁ100 of YES

Context is everything? They are arguing for a particular way to do CoT. This all gets baked into the model.

predicts NO

@jgyou No it doesn't. CoT is a prompting approach, not something you bake into a model. Just as 3-shot is just referring to giving it 3 examples beforehand.
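To make the distinction concrete, here is a minimal sketch of how an n-shot prompt, with or without CoT, is just text assembled before the question. The `build_prompt` helper, the example questions, and the "Let's think step by step" suffix are all illustrative assumptions, not the methodology used in the Gemini report or by OpenAI.

```python
def build_prompt(examples, question, cot=False):
    """Assemble an n-shot prompt; n is simply len(examples)."""
    parts = []
    for ex in examples:
        # With CoT, each worked example shows its reasoning before the answer.
        if cot:
            answer = f"{ex['reasoning']} So the answer is {ex['answer']}."
        else:
            answer = ex["answer"]
        parts.append(f"Q: {ex['question']}\nA: {answer}")
    # The final question is left open for the model to complete.
    suffix = "A: Let's think step by step." if cot else "A:"
    parts.append(f"Q: {question}\n{suffix}")
    return "\n\n".join(parts)


# Hypothetical worked examples; three of them makes this "3-shot".
EXAMPLES = [
    {"question": "What is 2 + 3?", "reasoning": "2 plus 3 is 5.", "answer": "5"},
    {"question": "What is 4 + 4?", "reasoning": "4 plus 4 is 8.", "answer": "8"},
    {"question": "What is 9 + 1?", "reasoning": "9 plus 1 is 10.", "answer": "10"},
]

plain_prompt = build_prompt(EXAMPLES, "What is 6 + 7?")
cot_prompt = build_prompt(EXAMPLES, "What is 6 + 7?", cot=True)
```

The same model receives either string; only the prompt changes. The choice of examples, their wording, and their order are all free parameters, which is why two "3-shot CoT" evaluations are not automatically comparable.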

predicts YES

@PatrikCihal They don't have this data point for GPT-4, so this comparison isn't apples to apples.

predicts YES

@jgyou the issue is that Google can shop around, trying approach after approach, until they find one that lifts Gemini's score while depressing GPT4's. OpenAI had no such luxury. They couldn't try approach after approach until they found one that beats Gemini—obviously, Gemini hadn't even been trained yet! So this stacks the deck against GPT4.

I think the one-shot and CoT results are more meaningful than weird model-specific benchmark hacking.

You did not specify what "beat" means. Does it need to be one-shot answers?

predicts NO

@PatrikCihal The fact that Google is tweaking the definition of "beat" is really saddening
