I've closed this to trading.
The creator is inactive, and to my knowledge we aren't actually waiting on anything for the market to resolve - resolution is supposed to be about the state of things at the time Gemini (ultra) was released.
So this should be resolved now. Feel free to discuss how you think it should resolve and present your evidence. If it's clear I can resolve it and if not then it will go to a three-mods vote.
I'll leave some time for discussion and then post to discord requesting a mod vote if needed.
Alternatively, if @brubsby is around and seeing this ping, they can resolve themselves.
@Joshua I'm quite happy to take mana off the table and hand the risk to @chrisjbillington
The MM has a history of being a bit "vibe-based" with the resolution of this specific market. Is 1.5 the same as 1? Ultra? Advanced? What version of GPT 4 are we talking about now? It really could go either way.
On top of that, @ZviMowshowitz and others are reporting that their experience with Gemini is quite good, superior to GPT-4 in many regards.
@jgyou The market says "on release", and whilst there were gripes about whether Pro would count for this (brubsby's market on whether "Gemini" would be released resolved YES on Pro being released), I think it's very clear a 1.5 release would not count. This market is about whether Gemini Ultra right now is better than GPT-4 right now.
Recent data point not yet posted here: Bard with Gemini Pro scores below GPT-4 Turbo on Chatbot Arena.
https://twitter.com/lmsysorg/status/1750925807277781456
@brubsby will GPT-4.5, if such a thing is released before Gemini Ultra, count as GPT-4 for the purposes of this market, or no?
https://github.com/microsoft/promptbase
GPT4 + improved prompting gives 89% on Big-Bench-Hard.
(Evaluated Few Shot + CoT)
@jgyou I don't get these results. Why are they getting much better results for GPT-4 with what is reportedly the same methodology as the one reported in the Gemini report? Also, it seems like they just copied the results from the Gemini report, so they didn't actually run Gemini using their benchmarking code.
@Shump Yes, those are just the same Gemini Ultra number (maybe they don't have access yet?). The point is that some version of GPT-4 can be prompted to SOTA when MSFT plays the prompting game.
So I'm not so sure that Ultra is superior anymore.
@Shump >Why are they getting much better results for GPT-4 with what is reportedly the same methodology as the one reported in the Gemini report?
The test results in the Gemini paper are for the June GPT4. These new results are for the current Turbo model.
Either Turbo GPT4 is smarter, or (more likely) additional test answers leaked to the internet and are now in GPT4's training data.
@Coagulopath I think if Turbo were THAT much smarter, OpenAI would have been advertising it.
Can someone more familiar with LLMs than me say whether there can be variance in how exactly you n-shot or CoT-prompt a model? I think there are many different potential ways to do either, but I'm not sure.
@jacksonpolack It's indicative of the fact that Google is trying to bend the truth about the model's capabilities.
You see that GPT-4 is actually better if you don't cherrypick. https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf
@jgyou No it doesn't. CoT is a prompting approach, not something you bake into a model. Just as 3-shot is just referring to giving it 3 examples beforehand.
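To illustrate the variance question above: here's a minimal sketch (with made-up toy examples, not any lab's actual prompts) of how "3-shot" and "3-shot + CoT" prompts are typically assembled. Every choice here — which examples, how the reasoning is worded, the delimiters, the instruction phrasing — is a degree of freedom that can move benchmark scores, which is one reason two labs reporting the "same" methodology can get different numbers.

```python
# Toy illustration of n-shot vs n-shot + chain-of-thought (CoT) prompting.
# The questions, reasoning strings, and formatting below are invented for
# demonstration; real evaluation harnesses differ in all of these details.

EXAMPLES = [
    ("What is 2 + 3?", "2 plus 3 is 5.", "5"),
    ("What is 10 - 4?", "10 minus 4 is 6.", "6"),
    ("What is 3 * 3?", "3 times 3 is 9.", "9"),
]

def build_prompt(question: str, with_cot: bool = True) -> str:
    """Build an n-shot prompt; with_cot=True interleaves worked reasoning."""
    parts = []
    for q, reasoning, answer in EXAMPLES:
        if with_cot:
            # CoT: each demonstration includes the reasoning before the answer.
            parts.append(
                f"Q: {q}\nA: Let's think step by step. {reasoning} "
                f"The answer is {answer}."
            )
        else:
            # Plain few-shot: question and answer only.
            parts.append(f"Q: {q}\nA: {answer}")
    # The actual test question goes last, with the answer left blank.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)
```

Even with this fixed, harnesses still differ in how they extract the final answer from the model's completion (regex, last number, exact match), which adds yet more variance.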
@PatrikCihal They don't have this data point for GPT-4, so the comparison isn't apples to apples.
@jgyou the issue is that Google can shop around, trying approach after approach, until they find one that lifts Gemini's score while depressing GPT4's. OpenAI had no such luxury. They couldn't try approach after approach until they found one that beats Gemini—obviously, Gemini hadn't even been trained yet! So this stacks the deck against GPT4.
I think the one-shot and CoT results are more meaningful, as opposed to weird model-specific benchmark hacking.
@PatrikCihal The fact that Google is tweaking the definition of "beat" is really saddening