GPT-5 class models are real, will be significantly better than GPT-4, and are coming by the end of the year.
32% chance

I’m going to do my best to make this rigorous. This one is a bit fuzzy, but it’s what I want to know the answer to.

This will resolve to yes if:

  • by the end of the day on 12/31/2024

  • a new class of AI model is released by anyone (not just OpenAI). The model must be widely available. Some gates are acceptable (e.g., paid users only, or a rollout to 10% of users), but it cannot be limited to selected AI influencers.

  • that is a step change better in performance. For clarity: GPT-4 was a step change better than GPT-3/3.5 and would qualify; Claude 3.5 and GPT-4o are only narrowly better than GPT-4 and would not. I will use my best judgement to resolve this honestly using all inputs available (benchmarks, test cases, user reports, reviews by expert users).

Names don’t matter here. It could be called GPT-1, but if it’s obviously way better than GPT-4/Claude 3.5, then the market resolves to yes.

Given the ambiguity here, I will not be betting on this market.

sold Ṁ210 YES

Selling down a bit on the prospect that o1 might not get out of preview before the end of the year.

Okay folks, I'm starting to get back into a rhythm post-baby and am going to take a close look at o1 (at least, this is what I tell myself).

Over the next several weeks, I'm going to make an assessment of o1 against the criteria above.

If you have reviews or analyses of it that you think are particularly persuasive, point me to them (I'll be looking at the ones in the comments as well).

Thanks for understanding on the delay. Timing could not have been worse!

Particularly persuasive: simply try it with any math, engineering, or scientific use case. It’s better than “narrowly better.”

(preview is widely available)

bought Ṁ100 NO

@ismellpillows No.

I use it for work, and it is not substantially better than the others.

I.e., even if you can set up technical benchmarks where it is better, it is not much better for real-world use cases. And that is absolutely not because there isn't a meaningful way to be substantially better for those cases; there certainly is, and I would recognize it if it happened.

@DavidBolin that’s performance on competition math and code, not really a benchmark

@ismellpillows I use it for code.

It is not significantly better for real-world use cases.

@DavidBolin @ismellpillows fwiw, this is why I haven’t resolved the market yet.

On the one hand, the benchmarks look compelling, but on the other, I don’t see people flocking to it the way they did to GPT-4 (vs. 3.5 and Claude).

I notice this both in my behavior and in the smart people I observe. So at the moment I’m trying to gather more information to make this more definitive and less of a pure judgement call.

o1 seems to count as both a new class and a step change.

bought Ṁ500 YES

also fits in line with JD's comment below "I hear whispers / speculation (e.g., strawberries) that something much better is coming and want a price on the likelihood on that happening by the end of the year."

I think the only potential squabble I could see is someone arguing that "a step change better" contains an implicit "and is not worse in any existing area." That seems dubious under a "best judgement" reading; this feels more like the 3-to-4 "step," where new things became possible.

@JamesBaker3 I personally don't see it at the level of the 3.5-to-4 step change. GPT-4 cut error by ~50% across the board on all benchmarks; o1 is mostly about driving up math and (to a lesser degree) logical reasoning.

Win rate over gpt-4o hovers around 58%, with no gain for text/writing. That's about half the Elo difference between gpt-3.5 and gpt-4.
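To sanity-check that arithmetic, here is a minimal sketch of the standard logistic Elo conversion. The 58% win rate is the figure from the comment above; the gpt-3.5-to-gpt-4 arena gap of roughly 100+ points is an outside approximation, not a number from this thread.

```python
import math

def elo_gap(win_rate: float) -> float:
    # Standard logistic Elo model: win_rate = 1 / (1 + 10**(-gap / 400)),
    # solved here for the rating gap.
    return 400 * math.log10(win_rate / (1 - win_rate))

print(round(elo_gap(0.58)))  # ~56 points: the reported o1 vs gpt-4o win rate
print(round(elo_gap(0.65)))  # ~108 points: roughly the gpt-3.5 vs gpt-4 arena gap
```

On those numbers, a 58% win rate is indeed about half the 3.5-to-4 gap.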

The Information article on strawberry even noted that some testers felt the pause isn't worth the increased smartness (at least in some categories). I feel that hesitation myself using o1 even for reasoning tasks, unlike with gpt-4, where I fully switched and accepted the slowness.

@JamesBaker3 It obviously does not.

But the fact that someone can even claim such nonsense is a good reason for people not to bet in this market. In contrast, it is very much a reason to bet in the market about "GPT-5" coming out this year, which will not happen. OpenAI is not going to tarnish their brand in that way (by calling something GPT-5 when it is at most a slightly improved GPT-4).

@DavidBolin "does not" what? I'm guessing you mean the "step change" part, because "a new class of ai model" seems really really clear. James D gets to decide what "significantly better" (from the title) vs "narrowly better". I think that even if o1 mini & preview don't cross that line, their main o1 does/will (and will be even more so before 12/31).

Will be reviewing o1 vs the criteria above, but will be a little delayed - new baby came this week. Please bear with me.

@jdilla Sounds like the main improvement is training to do chain of thought without prompting, which doesn't really strike me as a step-change. https://x.com/GaryMarcus/status/1834293745782870488?t=qFvvCMC71-pJ3EWSjmODcg&s=19
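For concreteness, here is a rough sketch of that distinction as it looks from the API. The openai SDK calls are real, but the model names and prompts are purely illustrative.

```python
from openai import OpenAI  # official openai Python SDK

client = OpenAI()

# Pre-o1 recipe: chain of thought is elicited by the prompt itself.
prompted = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user",
               "content": "Think step by step, then answer: what is 17 * 24?"}],
)

# o1-style models are trained to run a hidden reasoning pass before
# answering, so the prompt no longer needs the explicit nudge.
trained = client.chat.completions.create(
    model="o1-preview",
    messages=[{"role": "user", "content": "What is 17 * 24?"}],
)

print(prompted.choices[0].message.content)
print(trained.choices[0].message.content)
```

Whether moving that step from the prompt into training counts as a "step change" is exactly the judgement call this market turns on.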

@WilliamGunn You are quoting Gary Marcus saying "It is definitely impressive" and counting that as evidence against?🤣

@JamesBaker3 Did you see the part after "but"?

@WilliamGunn Bet against me then! 😂 I'm not planning to argue much here; I've made my bets and will trust James D to resolve using his best judgement, in line with the spirit of the question.

I'm still going with no, because I think model improvement will be slow enough that there will be no obvious phase shift to a "GPT-5 class model" the way there was between GPT-3.5 and GPT-4.

bought Ṁ40 YES

one of 3.5 Opus, Grok-3, or Gemini 2 will probably fit the bill for me. Idk if you will agree.

These aren't live yet, are they? As far as I can tell, 3.5 Opus is still coming.

(Aside: I hate Anthropic's naming system - I can never remember which one is which)

I don't like this market's design, so I'm not bidding.

My main issue is that a model that only a randomly selected 10% of people are allowed to access is not publicly available. OpenAI has done this repeatedly with "beta" programs for GPT-4o and its previous models, and when it announced memory, that was rolled out to increasing numbers of users over weeks.

There's a reason why I use Anthropic models: they announce something and it is available to everyone who wants to pay for it on the same day. And Google's AI Studio allows immediate testing of everything they announce. These companies have actual models that can be used; a model that isn't available to anyone who wants to pay is not publicly available.

Totally fair! I struggled with that criterion as well.

Given OpenAI's rollout process, I felt I would be unable to prove that any release of theirs was available to all users, but would be able to tell if it was invite only (as in, Ethan Mollick has it but no one else does).

With that said, if you have better ideas on how to improve this, I'll take them!

Don't change this market, as it's too late. However, in the future, I would only resolve a market as YES when the model can be used by any person in the United States who is willing to pay for the service on the same day that they sign up for it.

Yeah, definitely not going to make any changes that I believe substantially change the way the market has been defined to date. With that said, I do believe in clarifying in the spirit of what's written.

I'll consider that in the future!

This contains several false statements.

Anthropic's Claude 3 announcement included Haiku, but it wasn't available until later. 3.5 Haiku and 3.5 Opus have been announced but are not available.

Google's Gemini Ultra was never made available to users. Gemini 1.5 Pro was provided to small cohorts, and the 1M and 2M context lengths were also rolled out slowly after their initial announcements.

It’s fair to be upset about not having access to something you want, but it’s inaccurate to state that OpenAI is the only one doing progressive rollouts.

I am specifically referring to instances where a model is demonstrated and made available to a small number of people. GPT-4o voice is an extreme example of vaporware like that.

None of Anthropic's products fits this description. They did state that other 3.5 models would be available in the future, but they never claimed those were ready yet, and nobody has ever had access to them. They aren't making videos that depict ten "beta" users running 3.5 Haiku or creating a "waitlist."

With Google, the longer Pro context lengths are indeed available to all users in AI Studio, which anyone can get access to if they know how to do it, and they actually were available to any user who wanted them.

So that leaves Gemini Ultra, which indeed is an example of Google promising something that wasn't delivered. I would say then that Anthropic is the most trustworthy and customer-focused, followed by Google, and then followed by OpenAI.
