Since GPT-4's launch in March 2023, no released model has represented a large improvement in reasoning over the various versions of GPT-4. Claude 3.5 Sonnet is the closest, but in my opinion it doesn't represent an improvement similar to that between GPT-3 and GPT-4, or even between GPT-3.5 and GPT-4.
This market is an attempt to quantify when a model will next represent an impressive reasoning improvement in the same way that GPT-4 was.
By "general AI model" I mean a model that you can talk to in a similar way to today's LLMs, and that exhibits similar reasoning ability across a wide range of topics, not something like AlphaProof.
This is a somewhat subjective market, so I won't bet in it. I'm looking for a significant reasoning improvement, not "we got 3% better on MATH" or anything involving the use of external tools like a code interpreter. For the purposes of this market, I also don't care about other types of improvements, such as added modalities or improved prompting techniques. I will use benchmarks as well as the general impression that people have of the model. The general improvement I am looking for is something like the GPT-3 to GPT-4 jump. I would expect GPT-5 to satisfy this (but it might not!).
The model must be released, not just announced. It's okay if I don't have access, but some random members of the public need to have access.
I'll resolve to my opinion, and not a poll of Manifold or whatever, but I expect my opinion to mostly be in line with what a poll would say. Feel free to ask questions of the form "would a model that does X resolve this market".
What if, over the next ~year, we get various models that each improve on the last, such that the final model would be considered significantly better than GPT-4, but there wasn't really a discrete moment when one model became much more impressive than all other models? In this case, it might not really feel like there was a "jump," but would you just pick a cutoff that feels right to you?
A big jump isn't required (though I hope there is one, because it will make the market easier to resolve). I think with Claude 3.5 Sonnet we're about a third of the way there. If we get a bunch of models that are each slightly better than the previous one, then I'll pick a cutoff somewhere along the way, at the point where the distance from GPT-4 seems to be roughly of the same magnitude as GPT-3 to GPT-4. That said, if this happens I think this market will likely resolve to the July 2025 option: it seems really unlikely to me that we get both a smooth progression upwards and only 10 months from here to theoretical GPT-5 level.
I would guess that a model which gets a score around 1500 would probably satisfy this. But it might also be possible to game that metric somehow. I’ll definitely look at that leaderboard, but I would expect a qualifying model to also show improvements on other benchmarks. (1500 also might be too high, if it turns out that most human prompts are not good enough to differentiate models at that level.)
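For a rough sense of what a score around 1500 would mean in head-to-head terms, here is a minimal sketch. It assumes the leaderboard score is an Elo-style rating on the usual 400-point logistic scale, which is an assumption and not something stated above. The 1300 baseline is a hypothetical placeholder, not a real leaderboard value, and `expected_win_rate` is an illustrative helper name.

```python
def expected_win_rate(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Standard Elo expected score of A against B (ties count as half a win).

    Assumption: the leaderboard uses an Elo-style rating on a 400-point scale.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# Hypothetical numbers for illustration only.
baseline_rating = 1300.0   # placeholder for a current top model
candidate_rating = 1500.0  # the "around 1500" score discussed above

win_rate = expected_win_rate(candidate_rating, baseline_rating)
print(f"Expected win rate vs. baseline: {win_rate:.2f}")  # prints 0.76
```

Under these assumptions, a 200-point gap corresponds to the higher-rated model being preferred in roughly three out of four comparisons. If most prompts can't tell the two models apart, measured win rates get pulled toward 50% and the rating gap shrinks, which is one way the 1500 threshold could turn out to be too high.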