"Humans begin using speech to pass on what they've learned within a lifetime and then immediately become superintelligent (compared to other animals)" and "AI begins using continual learning to pass on what they've learned in-context within RL and deployment and then immediately becomes superintelligent" don't analogize perfectly, but it's close.
Will ASI happen less than 365 days after a frontier-ish AI company deploys better-than-nothing continual learning?
N/A if ASI happens first
Update 2026-05-29 (PST) (AI summary of creator comment): Continual learning is defined as models being able to learn new things at the weights level without being retrained from scratch. Key distinguishing features:
Current training loops (retraining from scratch or from base model) do not qualify
A rough indicator: continual learning would reduce the time between models knowing new things to under ~10 days (vs. current ~40-day release cycles)
Creator will go with community consensus on whether a specific system qualifies
Update 2026-05-29 (PST) (AI summary of creator comment): Continual learning does not require per-user weight modification — it can still qualify even if all users receive the same set of weights from the provider. Provider-level updates are sufficient.
Update 2026-05-29 (PST) (AI summary of creator comment): The creator clarifies that solving catastrophic forgetting and converting in-context learned knowledge into weights-level knowledge are distinct concepts. The latter would require better sample efficiency and is not the same as continual learning as defined for this market.
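For readers who want the resolution rule spelled out, here is a minimal sketch of how the question above could be evaluated: YES if ASI arrives less than 365 days after a qualifying continual-learning deployment, N/A if ASI comes first. The function name `resolve`, its parameters, and the "OPEN" state are hypothetical and not part of the market. Treating same-day ASI as YES is also an assumption. Whether a system qualifies is still decided by community consensus, not by code.

```python
from datetime import date
from typing import Optional


def resolve(cl_deployed: Optional[date], asi: Optional[date], today: date) -> str:
    """Hypothetical helper: returns 'YES', 'NO', 'N/A', or 'OPEN' (still undecided)."""
    # ASI arrived before any qualifying continual-learning deployment.
    if asi is not None and (cl_deployed is None or asi < cl_deployed):
        return "N/A"
    # Neither event has happened yet.
    if cl_deployed is None:
        return "OPEN"
    # Both events happened: compare the gap against the 365-day window.
    if asi is not None:
        return "YES" if (asi - cl_deployed).days < 365 else "NO"
    # Deployment happened, no ASI yet: NO once the window has fully elapsed.
    return "NO" if (today - cl_deployed).days >= 365 else "OPEN"


if __name__ == "__main__":
    # Illustrative dates only, not predictions.
    print(resolve(date(2027, 1, 1), date(2027, 6, 1), date(2027, 6, 2)))
```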
Generally speaking, the more complex and novel the training mechanism, the larger the gap between its perceived or theoretical performance and its performance in practice. The rate of groundbreaking discoveries hasn't slowed much, but on a decade-by-decade timeline it has stalled, or at least isn't distinguishable from natural variation (noise) above the baseline. This suggests that ASI, assuming it relies on continual learning under an AGI model, will be a standard step function up from AGI. If performance improvements get harder with time, as appears to be the case, then even with a continually learning model the best time-to-ASI we can expect once continual learning is mainstream is linear. In other words, the odds are better than even that it will take 5-10 years on average after CL is fully understood and exploited.
@0xseraphim I'll go with whatever the consensus is. Right now companies retrain models from scratch (or at least from the base model) in order to add new data; the main feature of continual learning is that models can learn new things on the weights level without having to be retrained from scratch. Plus current model releases are ~40 days apart; continual learning would reduce the time between models knowing new things down to under ~10 days.
@Interrobang so continual learning / weight modification within a user project would not be a requirement? Provider-level RL post-training and releases are sufficient?
@0xseraphim Correct; it can still be continual learning even if everyone gets the same set of weights from the provider.
@0xseraphim The problem with a standard MLP is catastrophic forgetting. LoRA and DoRA help, but they have their own problems. That is partly why mixture-of-experts has become mainstream, as have agent-collaboration frameworks.
@Interrobang If continual learning (CL) increases the speed of research, and no breakthroughs are made in either sample efficiency or out-of-distribution generalization, then the answer is a solid no. Strong sample efficiency would lower training data requirements (and compute requirements, and possibly model-size requirements), translating into a weak yes. Any strong result on out-of-distribution generalization, combined with even a better-than-nothing CL environment, translates to a strong 'maybe' at best.
@DavidAttenborough Hm, you're right. I was mentally conflating "solving catastrophic forgetting" with "converting in-context learned knowledge into weights-level knowledge" when the latter would require better sample efficiency.
@Interrobang That's fair. The rub is that converting in-context learned representations into a format amenable to direct representation in network weights leads to destructive interference. One update from the context might improve performance on some task A but modify weights critical to another task B, such that performance on task B is degraded or destroyed. It's the problem currently encountered with too many LoRAs (or with the improvement on LoRA, DoRA). There are a lot of experiments and papers trying to solve this with varying degrees of success, but none of them are there yet. And while it is true that sample efficiency improves learning at the network level (in the weights), it doesn't directly provide a mechanism for non-destructive weight updates. Incidentally, for the experiments exploring converting context to learned weights, sample efficiency is an underexplored metric under that precise regime, n-shot metrics in, say, language models notwithstanding.
Better sample efficiency, past a certain critical threshold at the network level, at least implies better out-of-distribution generalization at the level above, in the autoregressive inference. Solving upstream solves downstream. It's why better tokenizers lead to better scores even when the rest of a model hasn't changed: it's about conditioning the data with useful priors that (1) smooth out many of the ridges and local minima that correlate with gradient clashes and (2) have a spectral characterization approaching blue noise. The general principle is the same: the earlier in the pipeline optimization is applied, the more general the improvement in later components of the model, because the model has to do less work in weight-space on conditioning the data and extracting structures and priors. Learning at the weight level from extrapolations performed at the inference level starts as late as possible in the pipeline, which works against this principle.
It's plausible it could work anyway, which would be, as you wrote, the 'better than nothing' regime. The question is whether better-than-nothing in-context learning generalizes sufficiently under covariate shifts. A proof of that is sufficient, without going all the way, to say whether a weaker model (or an ensemble of such models) is enough, by itself, to lead to AGI, which is at least assumed to lead to ASI by default, or whether it is insufficient, which would be weak but positive evidence that optimization earlier in the pipeline is the direction research has to take to cross the finish line. Man, I'm loving the market you posted more and more.
edit: To be clear, the blue noise comment is an analogy that is still waiting on research to verify it, but it's sound in theory. The ICL-to-AGI pipeline is speculative, but the entire premise my argument hinges on is: "If one direction, early- or late-stage optimization in general, is most responsible for AGI (where I define AGI here as o.o.d. generalization), which will be the core contributing research direction? Late or early optimization?" My argument says it will (mostly) be early optimization, while your argument is that the biggest contributing factor will be late-stage optimization. Realistically it'll probably be a mix of both (because most pipelines at the cutting edge optimize all stages to varying degrees), but the question remains: before the 'big break' into AGI, and then (completely assumed) ASI, what will be the defining change, a major shift in early-stage optimization or a major improvement in late-stage optimization?
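To make the destructive-interference point in the comment above concrete (an update that helps one task while overwriting weights another task depends on), here is a toy sketch. It is an assumption-laden illustration, not anyone's actual method: a 4-weight linear model stands in for a network, the two "tasks" are hypothetical regression targets chosen to conflict, and plain full-batch gradient descent stands in for a weight update. It does not model LoRA, DoRA, or in-context learning.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_task(true_w, n=200):
    """Synthetic regression task: inputs X and noise-free targets X @ true_w."""
    X = rng.normal(size=(n, true_w.size))
    return X, X @ true_w


def mse(w, X, y):
    """Mean squared error of weights w on a task."""
    return float(np.mean((X @ w - y) ** 2))


def gradient_descent(w, X, y, lr=0.05, steps=500):
    """Full-batch gradient descent on MSE; returns a new weight vector."""
    w = w.copy()
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w


# Two hypothetical tasks that want different values for the same shared weights.
w_a_true = np.array([1.0, -2.0, 0.5, 0.0])
w_b_true = np.array([-1.0, 2.0, 0.5, 3.0])
XA, yA = make_task(w_a_true)
XB, yB = make_task(w_b_true)

# Learn task A, then update the same weights on task B only.
w_after_a = gradient_descent(np.zeros(4), XA, yA)
loss_a_before = mse(w_after_a, XA, yA)

w_after_b = gradient_descent(w_after_a, XB, yB)
loss_a_after = mse(w_after_b, XA, yA)
loss_b_after = mse(w_after_b, XB, yB)

print(f"task A loss after learning A: {loss_a_before:.6f}")
print(f"task A loss after updating on B: {loss_a_after:.6f}")
print(f"task B loss after updating on B: {loss_b_after:.6f}")
```

In this toy setup the update on B fits B well and wipes out performance on A, because nothing in the update protects the weights A relied on. Real networks have far more capacity and partial overlap between tasks, so the effect is usually less total, but the mechanism is the same one described in the thread.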