When will a general AI model be released that represents a significant reasoning improvement over GPT-4?

Ṁ2357

resolved Sep 12

September 30, 2024 or earlier: 13% (resolved 100%)

October 1 - December 31, 2024: 25%

January 1 - March 31, 2025: 29%

April 1 - June 30, 2025: 12%

July 1, 2025 or later: 21%

Since GPT-4's launch in March 2023, no released model has represented a large improvement in reasoning over the various versions of GPT-4. Claude 3.5 Sonnet is the closest, but in my opinion it doesn't represent an improvement similar to that between GPT-3 and GPT-4, or even between GPT-3.5 and GPT-4.

This market is an attempt to quantify when a model will next represent an impressive reasoning improvement in the same way that GPT-4 was.

By "general AI model" I mean a model that you can talk to in a similar way to today's LLMs, and that exhibits similar reasoning ability across a wide range of topics, not something like AlphaProof.

This is a somewhat subjective market, so I won't bet in it. I'm looking for a significant reasoning improvement, not "we got 3% better on MATH" or anything involving usage of external tools like a code interpreter. I also don't care about other types of improvements, such as including other modalities or improved prompting techniques, for the purposes of this market. I will use benchmarks as well as the general impression that people have of the model. The general improvement I am looking for is something like the GPT-3 to GPT-4 jump. I would expect GPT-5 to satisfy this (but it might not!).

The model must be released, not just announced. It's okay if I don't have access, but some random members of the public need to have access.

I'll resolve to my opinion, and not a poll of Manifold or whatever, but I expect my opinion to mostly be in line with what a poll would say. Feel free to ask questions of the form "would a model that does X resolve this market".

This question is managed and resolved by Manifold.

#️ Technology

#AI

#Technical AI Timelines

#OpenAI

#GPT-5 Speculation

6 Comments

25 Holders

42 Trades

o1-preview resolves this. I trust OpenAI not to have cheated on benchmarks (at least, more so than other people do), and after testing it with a few personal prompts it seems clearly better than past models, especially at reasoning.

bought Ṁ100 YES

What if, over the next ~year, we get various models that each improve on the last, such that the final model would be considered significantly better than GPT-4, but there wasn't really a discrete moment when one model became much more impressive than all other models? In this case, it might not really feel like there was a "jump," but would you just pick a cutoff that feels right to you?

A big jump isn't required (though I hope there is one because it will make it easier to resolve the market). I think with Claude 3.5 Sonnet we're about a third of the way there. If we get a bunch of models that are each slightly better than the previous one, then I'll pick a cutoff somewhere along there, at the place where the distance seems to be roughly of the same magnitude as GPT-3 to GPT-4. That said, if this happens I think that this market will likely resolve to the "July 1, 2025 or later" option: it seems really unlikely to me that we get both a smooth progression upwards and a gap of only 10 months from here to a theoretical GPT-5 level.

Would a large jump in Elo on the LLM arena leaderboard count for this?

https://chat.lmsys.org/?leaderboard

I believe the latest version of GPT-4o is at 1314, and the latest version of GPT-3.5 is at 1117. So maybe a score somewhere around 1500 would count?

bought Ṁ2 YES

I think the leaderboard should be totally ignored.

I would guess that a model which gets a score around 1500 would probably satisfy this. But it might also be possible to game that metric somehow. I’ll definitely look at that leaderboard, but would expect such a model to also show improvements on other benchmarks. (1500 might also be too high, if it turns out that most human prompts are not good enough to differentiate models at that level.)
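
For a rough sense of scale on the leaderboard scores quoted in this thread, here is a minimal sketch that converts a rating gap into an expected head-to-head win rate. It assumes the leaderboard's scores behave like standard Elo ratings (base 10, 400-point scale), which the thread does not confirm; the function and variable names are illustrative only.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score (roughly, win probability) of A against B.

    Uses the standard Elo formula; treating the leaderboard scores
    as standard Elo ratings is an assumption made for this sketch.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Scores quoted in the comment above; 1500 is the commenter's proposed threshold.
gpt_4o = 1314
gpt_35 = 1117
proposed_threshold = 1500

# Win rate implied by the existing GPT-3.5 -> GPT-4o gap (197 points).
existing_gap_win_rate = elo_expected_score(gpt_4o, gpt_35)

# Win rate implied by a hypothetical 1500-rated model against GPT-4o (186 points).
proposed_gap_win_rate = elo_expected_score(proposed_threshold, gpt_4o)

print(f"GPT-4o vs GPT-3.5:       {existing_gap_win_rate:.2f}")
print(f"1500-rated vs GPT-4o:    {proposed_gap_win_rate:.2f}")
```

Under that assumption both gaps imply a similar win rate (about 0.76 and 0.74), so a 1500-rated model would sit about as far above GPT-4o as GPT-4o sits above GPT-3.5 on this one metric. That says nothing about whether the metric can be gamed or whether it tracks reasoning ability.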
