
@jacksonpolack It's indicative of the fact that Google is trying to bend the truth about the model's capabilities.

You see that GPT-4 is actually better if you don't cherrypick. https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf

Context is everything? They are arguing for a particular way to do CoT. This all gets baked into the model.
@jgyou No it doesn't. CoT is a prompting approach, not something you bake into a model, just as 3-shot simply means giving the model three examples beforehand. (A minimal sketch of the distinction is below.)
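(To make that concrete, here is a purely illustrative sketch, not taken from the report: both 3-shot and CoT are just different ways of building the prompt string sent to the same frozen model. The questions and worked examples are made up.)

```python
# Illustrative sketch only: few-shot vs. chain-of-thought prompting.
# Neither changes the model's weights; both only change the prompt text.

question = "If a train travels 60 km in 45 minutes, what is its speed in km/h?"

# 3-shot: prepend three worked examples (hypothetical ones here).
examples = [
    ("What is 2 + 2?", "4"),
    ("What is 10 * 3?", "30"),
    ("What is 100 / 4?", "25"),
]
three_shot_prompt = "\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
three_shot_prompt += f"\nQ: {question}\nA:"

# Chain-of-thought: ask the model to reason step by step before answering.
cot_prompt = f"Q: {question}\nA: Let's think step by step."

print(three_shot_prompt)
print("---")
print(cot_prompt)
```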

@PatrikCihal They don't have this data point for GPT-4, so this comparison isn't apples to apples.
@jgyou the issue is that Google can shop around, trying approach after approach, until they find one that lifts Gemini's score while depressing GPT4's. OpenAI had no such luxury. They couldn't try approach after approach until they found one that beats Gemini—obviously, Gemini hadn't even been trained yet! So this stacks the deck against GPT4.
I think the one-shot and CoT results are more meaningful, as opposed to weird model-specific benchmark hacking.

What would you say if it turned out GPT4 actually beats Gemini Ultra, and then it turns out Google has a secret version of Gemini called "Gemini Super Saiyan" that's way better, which they're intending to release in 2025? On what principled grounds would you say that should be the basis for comparison?
If there are millions of versions released, year after year, that are all beaten by GPT4, but only in the year 3240 is there a version of Gemini that beats GPT4, would that be the version taken as the basis for comparison?

@LEVI_BOT_1 The difference is that Gemini Ultra and its benchmark scores are already here, it's just not fully available yet. That's not the same as if there was some other better version that came in the future.

@Shump The point of my question is to understand the principles you're basing your decision on (seemingly it's not just me who's confused about this). It's not meant to be a rhetorical trick. I would also argue that, as the market creator, you have a certain responsibility to be transparent and consistent in how you make your decisions. Currently I'm not able to predict how you would answer my question, and it seems quite relevant to how this market should've been resolved, which is why I ask you again: please don't be difficult and just answer the question.

What would you say if Google didn't intend to release Gemini Ultra before 2030, or 3000? Would that still be the version of Gemini taken as the basis for comparison against GPT4? What's the cutoff?
We know you think Gemini was released since you resolved a market about that to YES. But now you're taking the basis for comparison to be a model that has not yet been released. I understand your argument about "model class", but I don't think it makes any sense given that model capabilities are strongly dependent on scale.

Another question: what if OpenAI has an internal, yet-to-be-released version of GPT4 that clearly beats Gemini Ultra and that they're intending to release in the middle of next year? Will that be the model we use to compare against Gemini Ultra? It seems like that should hold, given your argument about comparing "weight classes".

“On release” means at time of release. So I’d take it to mean the first Gemini model for both markets.

Hang on, what? You resolved the "will it be released" market YES on the basis of nano/pro being released, but you're going to wait for Ultra for this one?
If the spirit of this question is to wait for Ultra, then so was the other one!
(FWIW I think both should be about Ultra, but at least be consistent!)

@chrisjbillington It seems like a contradiction, but I actually think it aligns with common sense. "Release" refers to any version of the product, in the same way that you wouldn't say GPT4 was only released recently just because Turbo only came out recently. But when you're asking whether a product can beat competitors upon release, you are implicitly referring to the best version of said product.

@Shump I don't see the logic there at all. If this is the release, then the relevant capabilities are the capabilities now. If they're not the relevant capabilities, then it wasn't released.
Of course I'm interested in comparing Ultra to GPT-4, but that's just an argument about why the "was it released" market should resolve NO, it's not an argument about why the two markets should be talking about different models.
Suppose GPT-4 was announced as being released in March, and people had been betting on its capabilities on release, and then GPT-4-turbo came out a few months down the line and was way better? In what way is that not analogous to this situation, and if it is, isn't it much more commonsensical to use the actual model available at release for the purpose of resolution, rather than the promised model that isn't available?

@TheBayesian If GPT-4-turbo was available internally, with released benchmark scores, at the time of release of the base version, I would say it would make sense to say that capabilities questions are referring to it, yes.
@Shump fair enough. i lean toward no but that seems like a fair perspective