I’m going to do my best to make this rigorous. This one is a bit fuzzy but it’s what I want to know the answer to.
This will resolve to yes if:
by the end of the day on 12/31/2024
a new class of AI model is released by anyone (not just OpenAI). The model must be in wide availability. Some gates are acceptable (e.g., paid users only, 10% of users), but it cannot be available only to selected AI influencers.
that is a step change better in performance. For clarity: GPT-4 was a step change better than 3/3.5 and would qualify; Claude 3.5/GPT 4o are narrowly better than GPT-4 and would not. I will use my best judgement to resolve this honestly using all inputs available (benchmarks, test cases, user reports, reviews by expert users).
Names don’t matter here. It could be called GPT-1 but if it’s obviously way better than GPT-4/Claude 3.5, then the market resolves to yes.
Given the ambiguity here, I will not be betting on this market.
Okay folks, I'm starting to get back into a rhythm post baby and going to take a close look at o1 (at least this is what I tell myself).
Over the next several weeks, I'm going to make an assessment of o1 against the criteria above.
If you have reviews or analyses of it that you think are particularly persuasive, point me to them (I'll be looking at the ones in the comments as well).
Thanks for understanding on the delay. Timing could not have been worse!
@ismellpillows No.
I use it for work, and it is not substantially better than the others.
i.e. even if you can set up technical benchmarks where it is better, it is not much better for real world use cases. And that is absolutely not because there isn't a meaningful way to be substantially better for those cases; there certainly is, and I would recognize it if it happened.
@DavidBolin @ismellpillows fwiw, this is why I haven’t resolved the market yet.
On one hand the benchmarks look compelling but on the other hand I don’t see people flocking to it the way they did 4 (vs 3.5 and Claude).
I notice this both in my behavior and in the smart people I observe. So at the moment I’m trying to gather more information to make this more definitive and less of a pure judgement call.
@JamesBaker3 I personally don't see it at the level of the 3.5 to 4 step change. GPT-4 cut error by ~50% across the board on all benchmarks - this is mostly about driving up math and (to a lesser degree) logical reasoning.
Win rate over gpt-4o hovers around 58%, with no gain for text/writing. That's about half the Elo difference between gpt-3.5 and gpt-4.
The Information article on strawberry even noted some testers felt the pause isn't worth the increased smartness (at least in some categories). I feel the same way using o1 even for reasoning tasks, unlike with gpt-4, where I fully switched and accepted the slowness.
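For anyone who wants to sanity-check the win-rate arithmetic above, here is a rough sketch. It assumes the standard logistic Elo model (a 400-point gap means 10:1 odds) and takes the ~58% figure at face value; the function names are made up for illustration and nothing here comes from an actual leaderboard.

```python
import math


def elo_gap_from_win_rate(win_rate: float) -> float:
    """Elo gap implied by a head-to-head win rate (standard logistic model)."""
    if not 0.0 < win_rate < 1.0:
        raise ValueError("win_rate must be strictly between 0 and 1")
    return 400.0 * math.log10(win_rate / (1.0 - win_rate))


def win_rate_from_elo_gap(gap: float) -> float:
    """Expected win rate of the stronger model for a given Elo gap."""
    return 1.0 / (1.0 + 10.0 ** (-gap / 400.0))


# Assumption: the ~58% win rate quoted in the comment above.
o1_gap = elo_gap_from_win_rate(0.58)

# If that gap is "about half" the gpt-3.5 -> gpt-4 gap, the larger gap
# would be roughly twice as big; convert it back to a win rate.
implied_gpt4_win_rate = win_rate_from_elo_gap(2 * o1_gap)

print(round(o1_gap, 1), round(implied_gpt4_win_rate, 3))
```

Under those assumptions, 58% works out to a gap of roughly 56 Elo points, and a gap twice that size corresponds to a win rate of roughly 66%. So "half the Elo difference" would mean gpt-4 beat gpt-3.5 about two times out of three in the same kind of comparison.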
@JamesBaker3 It obviously does not.
But the fact that someone can even claim such nonsense is good reason for people not to bet in this market. In contrast it is very much a reason to bet in the market about "GPT-5" coming out this year, which will not happen. OpenAI is not going to tarnish their brand in that way (by calling something GPT-5 when it is at most a slightly improved GPT-4).
@DavidBolin "does not" what? I'm guessing you mean the "step change" part, because "a new class of ai model" seems really really clear. James D gets to decide what counts as "significantly better" (from the title) vs "narrowly better". I think that even if o1 mini & preview don't cross that line, their main o1 does/will (and will be even more so before 12/31).
@jdilla Sounds like the main improvement is training to do chain of thought without prompting, which doesn't really strike me as a step-change. https://x.com/GaryMarcus/status/1834293745782870488?t=qFvvCMC71-pJ3EWSjmODcg&s=19
@WilliamGunn You are quoting Gary Marcus saying "It is definitely impressive" and counting that as evidence against?🤣
@WilliamGunn Bet against me then! 😂 I'm not planning to argue much here, I've made my bets and will trust James D to resolve using his best judgement, in line with the spirit of the question.
I don't like this market's design, so I'm not bidding.
My main issue is that a model with 10% of people randomly being allowed access to it is not publicly available. OpenAI has done this repeatedly with "beta" programs for GPT-4o and its previous models, or when it announced memory and that was rolled out to increasing numbers of users over weeks.
There's a reason why I use Anthropic models - they announce something and it is available to everyone who wants to pay for it on the same day. And, Google's AI studio allows immediate testing for everything they announce. These companies have actual models that can be used, while a model that isn't available to anyone who wants to pay is not publicly available.
Totally fair! I struggled with that criterion as well.
Given OpenAI's rollout process, I felt I would be unable to prove that any release of theirs was available to all users, but would be able to tell if it was invite only (as in, Ethan Mollick has it but no one else does).
With that said, if you have better ideas on how to improve this, I'll take them!
This contains several false statements.
Anthropic Claude 3 included the announcement of Haiku, but it wasn’t available until later. 3.5 Haiku and Opus are announced but not available.
Google Gemini Ultra was never made available to users. Pro 1.5 was provided to small cohorts, and 1M and 2M context lengths were also rolled out slowly after their initial announcement days.
It’s fair to be upset about not having access to something that you want, but it’s inaccurate to state that OpenAI is the only one doing progressive rollout launches.
I am specifically referring to instances where a model is demonstrated and made available to a small number of people. GPT-4o voice is an extreme example of vaporware like that.
None of Anthropic's products fit this criterion. They did state that other 3.5 models will be available in the future, but they never claimed they are ready yet and nobody has ever had access to them. They aren't making videos that depict 10 "beta" users running 3.5 Haiku or creating a "waitlist."
With Google, the Pro context lengths are indeed available to all users on the AI studio, which anyone can get access to if they know how to do it, and they actually were available to any user who wanted them.
So that leaves Gemini Ultra, which indeed is an example of Google promising something that wasn't delivered. I would say then that Anthropic is the most trustworthy and customer-focused, followed by Google, and then followed by OpenAI.