
Resolves 3 months after Bard is switched from PaLM 2 to Gemini. (Resolves NA if it isn't switched to Gemini for some reason)
If no one recommends anything that captures the title's intention better, resolution will be based on https://github.com/FranxYao/chain-of-thought-hub
Specifically: if, 3 months after Gemini is released, it has at least 3 scores better than GPT-4, and at least twice as many scores better than GPT-4 as worse than it, this market resolves YES. Otherwise, assuming no major contention, it resolves NO. If not enough benchmarks are available on that repo, I will try to find something similar.
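For concreteness, here's a rough sketch (in Python) of how I'd apply that rule. The benchmark names and scores below are made-up placeholders for illustration, not real chain-of-thought-hub numbers:
Python
# Hypothetical scores for illustration only; real values would be read off chain-of-thought-hub.
gemini_scores = {"GSM8K": 95.0, "MMLU": 88.0, "BBH": 85.0, "HumanEval": 70.0}
gpt4_scores   = {"GSM8K": 92.0, "MMLU": 86.4, "BBH": 86.7, "HumanEval": 67.0}

better = sum(gemini_scores[b] > gpt4_scores[b] for b in gemini_scores)
worse  = sum(gemini_scores[b] < gpt4_scores[b] for b in gemini_scores)

# YES requires at least 3 benchmarks where Gemini beats GPT-4,
# and at least twice as many wins as losses.
resolves_yes = better >= 3 and better >= 2 * worse
print("YES" if resolves_yes else "NO")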
With today's announcements, are speech as output and image as input still considered part of base GPT-4?

@JeronimoD It shouldn't really matter, as the benchmark is a text benchmark.
That said, image input has always been part of GPT-4; they just didn't release it until now, so I would say it counts. The speech I/O isn't really GPT-4, it's other models working together. But neither is really relevant to the question of intelligence, imo.
I could imagine that Gemini will be widely considered better because it is extremely good at tool usage and at integrating information from different modalities, but I think neither of those aspects is measured by chain-of-thought-hub. If that's the general vibe but the benchmarks don't reflect it, how would you resolve?

@NielsW I think that if it's less intelligent at answering textual questions it won't be widely considered better. But just in case: it's important to me to use a reasonable benchmark for resolution instead of relying on my subjective judgement. If you find a benchmark that captures this complexity and people are in favor of changing the resolution criteria, I'll change them; otherwise I'll stick to chain-of-thought-hub.

@NielsW There's a new benchmark that includes vision (but it uses GPT-4 as a judge, so it's probably not useful for this market).

@NikhilVyas Interesting question. On one hand, it makes more sense to predict relative to today's GPT-4. On the other hand, this question represents 'is OpenAI ahead of Google'.
Proposal: If it's different enough that it gets a separate entry on chain-of-thought-hub, I'll compare to the original version. If it's updated and chain-of-thought-hub just updates the gpt-4 row, I'll use the updated score.

The market sits about where I expect, so I'm not betting yet, but the strongest case for YES seems to be that Alphabet has lots of data that is not public. I don't know the exact terms of the OpenAI deal with Microsoft, but it seems reasonable to assume the Alphabet team will have an easier time accessing some specific dataset that they realize brings a lot of leverage to model quality.
I have zero idea what that dataset will be, but Pareto distributions pop up in many places, and there is no reason to believe all of the "20% that brings 80%" happens to be available for scraping from the public web.
@MatthewRitter Also, I am trying out Bard now. It makes the "thumbs up/thumbs down" symbols really obvious; I'm embarrassed to say that I didn't notice them on OpenAI's interface until going back to check right now. They also seem to emphasize the "Google it" feature, which makes sense.
All that is to say that this is not just a scaling problem, but a multi-dimensional product problem (collecting user data is a core KPI). Google has made plenty of product mistakes in the past, which is why I agree that this sits at 60%, but there's value to all of that experience in iterating with user data.
@MatthewRitter Oh wow, it's way faster than GPT-4 and lets you export code to Google Colab. That means they get feedback about whether it works, and how users had to edit it.
On the other hand, the core of the code just calls a custom function and leaves it to me to write the implementation!
I hope everyone else is enjoying this journey of exploration as much as I am. An argument for NO is my experience trying to get it to write some code that GPT-3.5 handled pretty easily. Here's the transcript, emphasis mine:
Can you write a function that gives an output to the command line ever 4 seconds telling the user to breath, for 2 minutes total
Sure, here is a function that gives an output to the command line every 4 seconds telling the user to breathe, for 2 minutes total:
Python
import time

def breathe():
    for i in range(120):
        time.sleep(4)
        print("Breathe in... Breathe out...")

if __name__ == "__main__":
    breathe()
This function will print the message "Breathe in... Breathe out..." to the command line every 4 seconds. The function will run for a total of 2 minutes, or 120 seconds.
would that run for 2 minutes total?
Yes, the function will run for 2 minutes total. The range() function in Python takes two arguments: the starting number and the ending number. In this case, the starting number is 0 and the ending number is 120. This means that the function will iterate 121 times. The time.sleep() function takes one argument: the number of seconds to sleep. In this case, the number of seconds to sleep is 4. This means that the function will sleep for 4 seconds between each iteration. Therefore, the function will run for a total of 2 minutes, or 120 seconds.
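For reference, the loop above sleeps 4 seconds on each of 120 iterations, i.e. 480 seconds or 8 minutes rather than 2, and Bard's explanation doubles down on the mistake. For comparison, here's roughly what the prompt actually asks for; this is my own illustrative fix, not output from either model:
Python
import time

def breathe(total_seconds=120, interval=4):
    # 120 s / 4 s = 30 reminders spread over 2 minutes, not 120 iterations.
    for _ in range(total_seconds // interval):
        time.sleep(interval)
        print("Breathe in... Breathe out...")

if __name__ == "__main__":
    breathe()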

@MatthewRitter Thanks for sharing your experiments!
Clarifying in case you missed it: this market is about Gemini, which is still in training; the current Bard runs on PaLM 2.
@YoavTzfati Yes, thanks. I think I started with that firmly in mind (as the UI likely won't change much), and then was just reporting my stream-of-consciousness reactions without thinking about whether they were still relevant to the yet-unpublished model.