Most (unsaturated) benchmarks I respect imply linear progress in latent capabilities. An exception is the celebrated HCAST measure of AI agents' task time horizon, which is firmly exponential over 6 years of releases and at all success thresholds. (I'm trying to ignore people being highly naive about the absolute time values under the default 50% threshold.)
There’s much to be said against this, and it has been said, including by METR. Among the many reasons to doubt that this generalises is that the original HCAST tasks are almost entirely greenfield software development and low in "messiness". (And most actually useful tasks are high in messiness.) But in the paper they argue that improvement on the n=22 messy tasks is "not obviously slower".
The July update adds a few other domains, mostly still not very messy. The exception is Tesla self-driving, whose time horizon is growing more slowly than software and maths but is still on an exponential.
Resolution: at the end of next year, will I put >66% credence on tasks with messiness >3 improving in time horizon on the same exponential as clean tasks? That is, is the doubling time of the m>3 tasks within a factor of 2 of the doubling time of the m<3 tasks?
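To make the criterion concrete, here's a minimal sketch (my own, not METR's code) of the check I have in mind: fit log2(time horizon) against release date for each group, take the reciprocal of the slope as the doubling time, and compare the two. The data points below are invented purely to illustrate the arithmetic.

```python
# Sketch of the resolution check. The (release year, horizon) points are
# hypothetical, not METR data.
import numpy as np

def doubling_time_years(release_years, horizon_minutes):
    """Fit log2(horizon) ~ slope*year + intercept; doubling time = 1/slope (years)."""
    slope, _intercept = np.polyfit(release_years, np.log2(horizon_minutes), 1)
    return 1.0 / slope

# Hypothetical (release year, 50%-success horizon in minutes) for each group.
clean = ([2019, 2021, 2023, 2025], [0.5, 4.0, 30.0, 240.0])   # low-messiness (m<3) tasks
messy = ([2019, 2021, 2023, 2025], [0.3, 1.5, 8.0, 40.0])     # messy (m>3) tasks

d_clean = doubling_time_years(*clean)
d_messy = doubling_time_years(*messy)
ratio = d_messy / d_clean

print(f"clean doubling time: {d_clean:.2f} y, messy: {d_messy:.2f} y, ratio: {ratio:.2f}")
# Resolves YES if the doubling times are within a factor of 2 of each other.
print("within factor of 2:", 0.5 <= ratio <= 2.0)
```

With these made-up numbers the messy horizon doubles a bit more slowly but the ratio stays well under 2, so this toy example would resolve YES; the real question is whether the actual m>3 data behaves that way.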
My current credence (Dec 2025): 25%
If you want to use a model of me as well as your model of AI to answer, here are some of my views.