If the experiment in the Chinchilla paper is repeated by a credible actor before 2026, will the exponent in the data scaling term be greater than 0.32?

This question intends to test whether data efficiency will significantly improve over the next few years in machine learning.

The Chinchilla paper from DeepMind provided a parametric loss function in terms of the number of parameters and the number of training tokens. The data term takes the form B * D^{-beta}. The exponent beta was estimated to be 0.28 after fitting the loss function to the data using the L-BFGS algorithm with a Huber loss. The value of beta is particularly important because a larger exponent would indicate more efficient learning from additional data. A larger value of B would also be significant, but it would only provide a constant-factor speedup.
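The fitting procedure can be sketched as follows. This is a minimal illustration, not the paper's code: the (N, D, loss) data is synthetic, generated from assumed coefficients, and only the overall shape of the fit (log-space parametrization, Huber loss, L-BFGS) follows the paper's described setup.

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic (N, D, loss) triples standing in for real training runs.
# The "true" coefficients below are assumed for illustration only.
rng = np.random.default_rng(0)
N = 10 ** rng.uniform(7, 10, size=200)    # model parameters
D = 10 ** rng.uniform(8, 11, size=200)    # training tokens
true = dict(E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28)
L = (true["E"] + true["A"] / N ** true["alpha"]
     + true["B"] / D ** true["beta"]) * np.exp(rng.normal(0, 0.01, N.shape))

def huber(r, delta=1e-3):
    """Huber loss on residuals r: quadratic near 0, linear in the tails."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r ** 2, delta * (a - 0.5 * delta))

def objective(theta):
    # Optimize (log E, log A, log B, alpha, beta); compare log-predictions
    # to log-losses, combining the three terms with a logsumexp.
    logE, logA, logB, alpha, beta = theta
    pred = np.logaddexp.reduce(
        [np.full_like(N, logE),
         logA - alpha * np.log(N),
         logB - beta * np.log(D)], axis=0)
    return huber(pred - np.log(L)).sum()

res = minimize(objective, x0=[0.0, 6.0, 6.0, 0.3, 0.3], method="L-BFGS-B")
beta_hat = res.x[4]
print(f"fitted beta = {beta_hat:.3f}")  # should land near the assumed 0.28
```

A replication in the sense of this question would run a fit of this shape on fresh training runs and report the resulting beta.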

Assume for the purpose of this question that the Chinchilla paper is replicated before January 1st, 2026, in the sense that someone attempts to fit its parametric loss function to new data, possibly using a more efficient architecture. If that does not happen, this question will resolve to N/A. The question will resolve according to the first such credible replication that I become aware of.

The Chinchilla paper is said to have been replicated if an actor fits the parametric loss function to the relevant data and publicly releases their results, in a way that I personally judge to be credible. The actor does not need to use the exact same transformer architecture, or the specific L-BFGS algorithm or setup detailed in the original paper.

However, a similar training data distribution will be required: in particular, it must use data that was broadly obtained from crawling the internet, even if it is not exactly the same distribution as that detailed in the Chinchilla paper. If most of the data comes from a very specific part of the internet, like arXiv or Wikipedia, then that result will not count. I will decide at my sole discretion whether the new training data distribution counts as sufficiently similar to the data distribution in the Chinchilla paper.

This question resolves to YES if the best estimate of beta in the new paper exceeds 0.320, and resolves to NO otherwise.
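To see why the 0.320 threshold would be significant: shrinking the data term B * D^{-beta} by some factor requires growing D by that factor raised to the power 1/beta, so a larger beta compounds into much smaller data requirements. A quick back-of-the-envelope comparison of the two exponents:

```python
# Factor by which D must grow to shrink the data term B * D**-beta tenfold:
# 10**(1/beta). A larger beta means far less extra data is needed.
for beta in (0.28, 0.32):
    print(f"beta = {beta}: data must grow by ~{10 ** (1 / beta):,.0f}x")
# beta = 0.28 needs ~3,728x more data; beta = 0.32 only ~1,334x.
```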


Just to double check: this is a replication of the scaling law methodology, not of the underlying architecture/data? i.e. if someone builds a different architecture (doesn't replicate the Chinchilla LLM) but then does replicate the procedure for estimating the scaling law, that counts?
