
I will run a Manifold poll one month after the official GPT-5 release asking whether or not it exceeded expectations. Resolves to the results of that poll.
Update 2025-01-22 (PST): The market closure date has been extended to July 1, 2025 following updates on the GPT-5 release schedule. (AI summary of creator comment)
https://manifold.markets/strutheo/will-nvidia-stock-nvda-be-worth-188
This market should be correlated with this one, no?
@jessald Very loosely at best; I hardly expect the performance of one model to have a big effect on Nvidia's stock.
In an interview, Dario said that the average layperson wouldn't be able to appreciate an AI with, for example, PhD-level chemistry knowledge, because they simply wouldn't understand, or even ask for, information that would make use of or reveal that knowledge. I feel like GPT-5 could be an iceberg, and most of the respondents in the future poll will only see the surface.
@jessald Everyone knows something, and they could test it on that. That works if you're a car mechanic too. Besides, you'd hope to see better instruction following.
@jessald Dario's point is 100% true in theory but in practice I feel the opposite has happened, at least in some forums: people unable to actually judge models' performance in, say, advanced mathematics, go around claiming models are way better than they actually are.
@NBAP They should use 4.5, no?
Everyone just memory-holed 4.5! It's such an awful model: I keep trying to use it, but it really can't hold a candle to o3, and it's just as slow.
@No_uh In principle yes ... but expectations are pretty vibe-based; we are not living in a nation composed of Bayesian rationalists here.
@BorisBartlog Ya, but this will be resolved according to a Manifold poll, and GPT-5 reactions will be a multi-national thing anyway. I consider the likelihood that people vote according to their profit opportunity here a much larger complication than their awareness (or not) of how closely it aligns with METR scaling projections.
@Bayesian What's your model for what this will be? 4.5 is still too slow to use, and the IQ gain is not worth the time. If this is an OOM (or two) bigger, shouldn't it be even slower?
@FergusArgyll The main trained model is probably 4.5 or around that size, but significantly more MoE (they are all MoE). gpt-5-mini is the generally accessible model and is significantly smaller, and gpt-5-pro is parallel gpt-5s, like o3-pro is to o3. A lot of the capability comes from scaling RL with verifiable rewards, AI-judge-based RL, and the like, which doesn't require the model to be as big to pack in a lot of capabilities. It is definitely, definitely not 2 OOMs bigger; it could be as much as an OOM bigger, that's not completely implausible, but that makes RL harder to work with and seems generally very unlikely.
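A minimal sketch of what "parallel gpt-5s, like o3-pro is to o3" could mean, assuming it's a best-of-n pattern: draw several samples in parallel and let a judge pick one. All function names and the scoring below are hypothetical placeholders, not OpenAI's actual setup.

```python
# Hypothetical best-of-n sketch: several base-model samples, one judge-picked answer.
import random
from typing import Callable, List

def base_model(prompt: str, seed: int) -> str:
    """Stand-in for a single gpt-5 sample; here just a toy deterministic string."""
    random.seed(seed)
    return f"answer-{random.randint(0, 9)} to: {prompt}"

def judge(prompt: str, candidate: str) -> float:
    """Stand-in for an AI judge / verifiable-reward check; returns a score."""
    return len(candidate) + random.random()  # toy scoring only

def pro_model(prompt: str, n_samples: int = 8,
              sampler: Callable[[str, int], str] = base_model,
              scorer: Callable[[str, str], float] = judge) -> str:
    """Draw n parallel samples and return the one the judge scores highest."""
    candidates: List[str] = [sampler(prompt, seed) for seed in range(n_samples)]
    return max(candidates, key=lambda c: scorer(prompt, c))

if __name__ == "__main__":
    print(pro_model("What is 2 + 2?"))
```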
@Bayesian Are we done scaling then? I haven't seen anyone write about this explicitly. A few years ago it was scaling laws, scaling laws, scaling laws, the bitter lesson, just throw more compute. Is that over? If it is, why assume more progress going forward?
@FergusArgyll We're not done scaling; we just don't scale model size that fast relative to scaling, e.g., the RL regime, for now. Effective compute spent on models has been growing a lot, and you'd need something like a 10-OOM-scaled GPT-3 to get to present-day capabilities (source: I made it up; it would be nice to have a better estimate), and really you wouldn't have enough data to do it, so you'd be kind of screwed. Scaling laws are not over; there are just much better algorithms for making the AIs smart, which can and should be modeled as the new bottlenecks, and the models are much more pretrained than Chinchilla would have said, because they are optimizing more for low inference cost than they used to.
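To make the "more pretrained than Chinchilla would have said" point concrete, here's a rough back-of-the-envelope sketch (my illustrative numbers, not the commenter's): compute-optimal pretraining uses roughly 20 tokens per parameter, while an inference-cost-optimized model spends the same training compute (~6·N·D FLOPs) on a smaller model trained on many more tokens.

```python
# Rough illustration: same training budget, smaller model, far more tokens per param.
def train_flops(params: float, tokens: float) -> float:
    """Standard approximation: training compute ~ 6 * parameters * tokens."""
    return 6 * params * tokens

budget = train_flops(70e9, 20 * 70e9)   # Chinchilla-style: 70B params, ~1.4T tokens
small_n = 20e9                           # a smaller, cheaper-to-serve model
small_d = budget / (6 * small_n)         # tokens it can afford on the same budget

print(f"compute budget: {budget:.2e} FLOPs")
print(f"over-trained model: {small_n:.0e} params, {small_d:.2e} tokens "
      f"({small_d / small_n:.0f} tokens/param vs ~20 for Chinchilla-optimal)")
```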
@Bayesian But could we pause and pronounce naive pretrain scaling dead, and notice that we now have a new theory which we hope will scale but for which we don't have much evidence? So this GPT-5 won't be evidence one way or another about "scaling laws" (Gwern, LessWrong, et al., ~2022). I'm actually confused about this. Part of the problem, as you hinted, is that we have no clue how many params GPT-4, 4.5, o3, Opus 4, etc. have; it really bothers me, and I just want some kind of clarity. Scaling laws are dead, we have killed him?
@FergusArgyll Scaling laws are not dead and will never be dead, ignoring precision-related implementation details, which don't seem material. They're just a property of these kinds of pretraining algos: they get better with scale, with predictable scaling properties, as long as the data scales in a particular way that we can expect it to. The upper limits to scaling compute in terms of physical possibility are very, very high. It's just that naively scaling the size of the models is now not strategic (in fact, it's very stupid) in a way it wasn't previously. We won't get AGI by scaling pretraining, not because we can't, but because we have better stuff now.
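For anyone who hasn't seen what "predictable scaling properties" refers to, a minimal sketch of the Chinchilla-style parametric fit L(N, D) = E + A/N^alpha + B/D^beta, where N is parameters and D is training tokens. The constants below roughly follow the published Chinchilla fit but are used here purely as an illustration, not as a claim about any current model.

```python
# Chinchilla-style parametric scaling law: fit on small runs, extrapolate to big ones.
def loss(params: float, tokens: float,
         E: float = 1.69, A: float = 406.4, B: float = 410.7,
         alpha: float = 0.34, beta: float = 0.28) -> float:
    """Predicted pretraining loss for a model with `params` trained on `tokens`."""
    return E + A / params**alpha + B / tokens**beta

for n, d in [(1e9, 20e9), (10e9, 200e9), (100e9, 2e12)]:
    print(f"N={n:.0e}, D={d:.0e} -> predicted loss {loss(n, d):.3f}")
```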
@Bayesian So do you expect the modern strategies (test-time compute, maybe MoE too) to give us the same performance as naive scaling would have? E.g., if scaling was going 2 OOMs per year and that meant 10 points on MMLU[0], should o4, o5, o6 continue to give us that?
[0] Probably outdated, but some similar objective measure, maybe perplexity.
@FergusArgyll Perplexity was definitely the metric that mattered for pretraining, but it doesn't really work for post-training (afaict), and nothing else has nearly as good properties, so imo we're kind of doomed to go without really good, robust measures of improvement for a while, but that's ok. Frontier training compute has been going up 5x per year, so about 0.7 OOM per year, not 2 (quick conversion sketched below). Test-time compute, MoE, and other algorithmic improvements are scaling roughly 3x per year, maybe more, so it obviously already is, and has been for a while, much much better than naive scaling would have been. (This also varies by task, as some tasks are bottlenecked by reasoning and others aren't.)
I think mostly the straight lines will continue in expectation; some will bend and whatnot, but yeah, the AIs will continue getting smarter for a set amount of compute, getting cheaper for a set amount of capability, and getting smarter at the frontier, in a way that's non-trivial to compare to similar trends from years ago, but you can kinda get close to doing that, e.g. https://epoch.ai/trends#compute
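Quick sanity check on the growth-rate conversions above: a yearly multiplier of x corresponds to log10(x) orders of magnitude (OOM) per year.

```python
# Convert yearly growth multipliers to OOM per year.
import math

for label, multiplier in [("frontier training compute", 5),
                          ("algorithmic / test-time gains", 3)]:
    print(f"{label}: {multiplier}x/year = {math.log10(multiplier):.2f} OOM/year")
# 5x/year -> ~0.70 OOM/year; 3x/year -> ~0.48 OOM/year
```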