Will Gemini be widely considered better than GPT-4?
closes Dec 31
69% chance

Resolves 3 months after Bard is switched from PaLM 2 to Gemini. (Resolves NA if it isn't switched to Gemini for some reason)

If no one recommends anything that captures the title's intention better, resolution will be based on https://github.com/FranxYao/chain-of-thought-hub


Specifically: if, 3 months after Gemini is released, it has at least 3 scores better than GPT-4 on that repo's benchmarks, and at least twice as many scores better than GPT-4 as worse, this market resolves YES. Otherwise, assuming no major contention, it resolves NO. If not enough benchmarks are available on this repo, I will try to find something similar.
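The stated rule can be sketched as a small check. This is a hypothetical helper, not the market creator's actual process, and the score lists in the assertions are illustrative numbers, not real benchmark results:

```python
def resolves_yes(gemini_scores, gpt4_scores):
    """Apply the stated rule to paired per-benchmark scores (same order)."""
    better = sum(g > o for g, o in zip(gemini_scores, gpt4_scores))
    worse = sum(g < o for g, o in zip(gemini_scores, gpt4_scores))
    # YES requires at least 3 wins AND at least twice as many wins as losses.
    return better >= 3 and better >= 2 * worse
```

Ties count toward neither side under this reading of the description.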


Related questions

Will Google's Gemini beat GPT4 in terms of capabilities on release? (brubsby, 70% chance)
Will Google's Gemini outperform GPT-4 in the SuperGLUE benchmark test by December 2023? (Franklin Baldo, 55% chance)
Will Gemini achieve a higher score on the SAT compared to GPT-4? (Victor Li, 72% chance)
Will Google's Gemini model be released in 2023? (brubsby, 85% chance)
Will "Gemini smash GPT-4 by 5x"? (Shump, 27% chance)
Will Google Gemini perform better (text) than GPT-4? (hyperion, 84% chance)
Will Gemini outperform GPT-4 at mathematical theorem-proving? (AdamK, 71% chance)
Will Google's Gemini LLM have qualitatively different capabilities to GPT-4? (RH, 84% chance)
Will Google Gemini do as well as GPT-4 on Sparks of AGI tasks?
Google Gemini will get fewer headlines in the week after release than ChatGPT got in 1 day after release. (Tessa Barton, 43% chance)
Does Google Gemini have more than 500B parameters per expert? (Adrian, 31% chance)
Will Gemini exceed the performance of GPT-4 on the 2022 AMC 10 and AMC 12 exams? (Adam Kaufman, 73% chance)
Will OpenAI announce a GPT-4 successor shortly after Google releases Gemini?
Is Google Gemini a Mixture of Experts? (Adrian, 79% chance)
Will Google Gemini's parameter count be public?
If DeepMind's Gemini uses MCTS, will the model be for a narrow domain? (1a3orn, 22% chance)
Will Gemini take video as an input modality? (Peter Barnett, 39% chance)
Will Gemini achieve a score above 90% on the MMLU benchmark? (CuriousNoob, 67% chance)
Will Google Gemini be free to use? (brubsby, 85% chance)
Will Google Deepmind's autonomous agent be based on Gemini? (CuriousNoob, 77% chance)
JD predicts NO

Given today's announcements, are speech as output and image as input still considered base GPT-4?

1 reply
Yoav Tzfati

@JeronimoD It shouldn't really matter, as the benchmark is a text benchmark.

That said, the image input has always been a part of GPT-4; they just didn't want to release it until now, so I would say it counts. The speech I/O isn't really GPT-4, it's other models working together. But they're also not really relevant to the question of intelligence, imo.

NielsW avatar
Niels W

I could imagine that Gemini will be widely considered better because it is extremely good at tool usage and at integrating information from different modalities, but I think neither of those aspects is measured by chain-of-thought-hub. If the general vibe is that it's better but the benchmarks don't reflect it, how would you resolve?

2 replies
Yoav Tzfati

@NielsW I think that if it's less intelligent at answering textual questions, it won't be widely considered better. But just in case: it's important to me to use a reasonable benchmark for resolution (instead of relying on my subjective judgement). If you find a benchmark that captures this complexity, and people are in favor of changing the resolution criteria, I'll change them; otherwise I'll stick to chain-of-thought-hub.

Yoav Tzfati

@NielsW A new benchmark that includes vision (but it uses GPT-4 as a judge, so probably not useful for this market):

https://link.sbstck.com/redirect/62d83976-87fd-4fea-9dc4-9bf74483d266?j=eyJ1IjoicmMzNW4ifQ.5rinJgpxgiXH9cE1sqgY2kU_Wy7ghf_npi0XFO4xXqs

Nikhil Vyas

If GPT-4 is updated which version will be considered for this question?

1 reply
Yoav Tzfati

@NikhilVyas Interesting question. On one hand, it makes more sense to predict relative to today's GPT-4. On the other hand, this question represents 'is OpenAI ahead of Google'.

Proposal: If it's different enough that it gets a separate entry on chain-of-thought-hub, I'll compare to the original version. If it's updated and chain-of-thought-hub just updates the gpt-4 row, I'll use the updated score.

brubsby bought Ṁ0 of YES

Similar market (algorithm only served me yours after I made mine)

Matthew Ritter

The market sits about where I expect, so I'm not betting yet, but it seems like the strongest case for YES is that Alphabet has lots of data which is not public. I don't know the exact terms of the OpenAI deal with Microsoft, but it seems reasonable to assume the Alphabet team will have an easier time accessing some specific data set that brings a lot of leverage to model quality.

I have zero idea what that data set will be, but Pareto distributions pop up in many places, and there is no reason to believe all of the "20% that brings 80%" happens to be available for scraping from the public web.

5 replies
Matthew Ritter

@MatthewRitter Also, I am trying out Bard now. It makes the "thumbs up / thumbs down" symbols really obvious; I'm embarrassed to say that I didn't notice them on OpenAI's interface until going back to check just now. They also seem to emphasize the "Google it" feature, which makes sense.

All that is to say that this is not just a scaling problem, but a multi-dimensional product problem (collecting user data is a core KPI). Google has made plenty of product mistakes in the past, which is why I agree that this sits around 60%, but there's value to all of that experience in iterating with user data.

Matthew Ritter

@MatthewRitter Oh wow, it's way faster than GPT-4 and lets you export code to Google Colab. That means they get feedback about whether it works, and how users had to edit it.

On the other hand, the core of the code just calls a custom function and leaves it to me to write the implementation!

Matthew Ritter

I hope everyone else is enjoying this journey of exploration as much as I am. An argument for NO is my experience trying to get it to write some code that GPT-3.5 handled pretty easily. Here's the transcript, emphasis mine:

Can you write a function that gives an output to the command line ever 4 seconds telling the user to breath, for 2 minutes total
Sure, here is a function that gives an output to the command line every 4 seconds telling the user to breathe, for 2 minutes total:

Python

import time

def breathe():
  for i in range(120):
    time.sleep(4)
    print("Breathe in... Breathe out...")

if __name__ == "__main__":
  breathe()

This function will print the message "Breathe in... Breathe out..." to the command line every 4 seconds. The function will run for a total of 2 minutes, or 120 seconds.

would that run for 2 minutes total?
Yes, the function will run for 2 minutes total. The range() function in Python takes two arguments: the starting number and the ending number. In this case, the starting number is 0 and the ending number is 120. This means that the function will iterate 121 times. The time.sleep() function takes one argument: the number of seconds to sleep. In this case, the number of seconds to sleep is 4. This means that the function will sleep for 4 seconds between each iteration. Therefore, the function will run for a total of 2 minutes, or 120 seconds.
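For what it's worth, the quoted function does not actually run for 2 minutes: range(120) iterations with a 4-second sleep is 480 seconds, i.e. 8 minutes, and Bard's explanation of range() is also wrong (range(120) iterates 120 times, not 121). A corrected sketch, keeping the function name from the transcript (the parameter names are mine), divides the total duration by the interval:

```python
import time

def breathe(total_seconds=120, interval=4):
    # 120 seconds at one message every 4 seconds is 30 iterations, not 120.
    for _ in range(total_seconds // interval):
        time.sleep(interval)
        print("Breathe in... Breathe out...")

if __name__ == "__main__":
    breathe()
```
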

Yoav Tzfati predicts NO

@MatthewRitter Thanks for sharing your experiments!

Clarifying in case you missed it: this market is for Gemini, which is still in training; the current Bard runs on PaLM 2.

Matthew Ritter

@YoavTzfati Yes, thanks. I think I started with that firmly in mind (as the UI likely won't change much), and then was just reporting my stream-of-consciousness reactions without thinking about whether they were still relevant to the yet-unpublished model.

Yoav Tzfati

Similar market for PaLM 2: