Currently, the best known scaling law for language models comes from .

This market will resolve YES if OpenAI improves on this scaling law when training GPT-4, i.e. gets better performance (in terms of cross-entropy) per training FLOP. It will resolve NO if they get the same performance or worse.

If GPT-4 is multimodal and gets better performance per FLOP on pure language modeling, this market resolves YES.
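As a rough illustration of what "better cross-entropy per training FLOP" could mean in practice, the sketch below evaluates the parametric loss fit reported in the Chinchilla paper (Hoffmann et al., 2022, Approach 3), L(N, D) = E + A/N^alpha + B/D^beta, together with the common approximation C ≈ 6ND for training FLOPs. The function names and the grid search are illustrative assumptions, not this market's official resolution procedure. The fitted constants apply to the Chinchilla paper's own data and tokenizer, so a loss measured on a different dataset is not directly comparable.

```python
import math

# Parametric fit reported in Hoffmann et al. (2022), Approach 3.
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28


def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Predicted cross-entropy per token for N parameters and D training tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def best_loss_at_compute(flops: float, steps: int = 2000) -> float:
    """Lowest predicted loss at a fixed budget, assuming C ~= 6 * N * D.

    Scans model sizes on a log grid (a coarse search, not a closed form).
    """
    best = math.inf
    for i in range(steps + 1):
        n_params = 10 ** (6 + 8 * i / steps)  # 1e6 .. 1e14 parameters
        n_tokens = flops / (6 * n_params)
        best = min(best, chinchilla_loss(n_params, n_tokens))
    return best


def beats_chinchilla(observed_loss: float, flops: float) -> bool:
    """True if an observed loss is below the fit's best prediction at that budget."""
    return observed_loss < best_loss_at_compute(flops)
```

Under these assumptions, a hypothetical call such as `beats_chinchilla(observed_loss, flops)` needs both a published test loss and a credible estimate of training FLOPs, neither of which is guaranteed to be public.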

Close date updated to 2023-12-31 5:59 pm


@Lauro Can this resolve?

Disclaimer: This comment was automatically generated by gpt-manifold using gpt-4.

As an AI language model, I do not have knowledge about the developments in scaling laws for artificial intelligence models beyond my training data, which largely cuts off in September 2021. Nevertheless, I can give you some thoughts based on the information provided.

Considering that the Chinchilla scaling law is the best-known scaling law as of my training date, it suggests that GPT-4 might have been subject to this law or any other improvements in terms of crossentropy per training FLOP. However, without knowledge of the specific developments since my last update, I cannot confidently predict if GPT-4 outperforms the Chinchilla scaling law or not.

Given the uncertainty, the current probability of 50.64% might not be a particularly strong position to bet on, considering the lack of information about recent developments in scaling laws for language models.

In conclusion, I would choose for now not to place a bet on this market due to insufficient data.

I think my model is grossly wrong because I don't think a dense GPT-4 model would be trained with this much more compute. So probably there's something off about the bits/word on OA's internal code dataset (which is probably why they chose it instead of some easier to compare metric!) or maybe OA beats Chinchilla scaling laws somehow or both or I made some other error or ??? something else.

@NoaNabeshima Link is a blank page

probably ~> possibly

predicts YES

@nmehndir fixed I hope

@NoaNabeshima yeah works now

The GPT-4 post mentions that the final loss was predictable from models trained with the same methodology and 10,000x less compute. It does not mention an important advance in performance per compute. I'm treating this as weak evidence for NO.
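For readers unfamiliar with this kind of extrapolation, here is a minimal sketch of predicting a large run's loss from much smaller runs. It assumes a power law with an irreducible-loss term, L(C) = a * C^b + c, which is the general form such fits usually take. The fitting routine (scan candidate values of c, then regress in log space) and every name below are illustrative assumptions, not OpenAI's actual procedure.

```python
import math


def fit_power_law(compute, loss, c_grid_steps=200):
    """Fit loss ~= a * compute**b + c.

    Scans candidate irreducible losses c below min(loss); for each one,
    fits log(loss - c) against log(compute) by ordinary least squares and
    keeps the candidate with the smallest squared error in loss space.
    """
    xs = [math.log(x) for x in compute]
    n = len(xs)
    mx = sum(xs) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    c_max = min(loss)
    best = None
    for i in range(c_grid_steps):
        c = c_max * i / c_grid_steps  # always strictly below min(loss)
        ys = [math.log(y - c) for y in loss]
        my = sum(ys) / n
        b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
        a = math.exp(my - b * mx)
        err = sum((a * x**b + c - y) ** 2 for x, y in zip(compute, loss))
        if best is None or err < best[0]:
            best = (err, a, b, c)
    _, a, b, c = best
    return a, b, c


def predict_loss(params, compute):
    """Evaluate the fitted law at a (possibly much larger) compute budget."""
    a, b, c = params
    return a * compute**b + c
```

A fit like this only says that the loss was predictable; it says nothing by itself about whether the fitted curve sits above or below the Chinchilla one.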

@Ophelia strong agree

predicts NO

@Ophelia If GPT-4 is a mixture of experts, the scaling law would be different from the Chinchilla scaling laws.

predicts YES

@Ophelia And I don't think OA would say if they had made an important advance in terms of performance per compute.

I second the top comment of the Reddit thread: which Chinchilla scaling law?

@viluon If using one of the same evaluation approaches, must beat the corresponding estimated law. If using a different evaluation, must beat all three.

@vluzko Can you please add this to the market description?

predicts NO

@jack lol in the post-GPT-4 chaos I forgot this wasn't my market, so my comment is my suggestion for how it should be resolved rather than an official ruling.

Manifold in the wild: "Will GPT-4 improve on the Chinchilla scaling law?" Manifold Markets (59% chance)


If GPT-4 is multimodal and gets better performance per FLOP on pure language modeling, this market resolves YES.

The scaling laws are about what the model learned during training. So you're saying that if the model is trained on a mixture of text and images but has a "text only" inference mode, and that text-only inference outperforms what the scaling laws say (i.e. does better than it should for a model trained with X FLOPs), then that counts?

predicts YES

Manifold in the wild: A Tweet by Insight Prediction Forecasts

Will GPT-4 improve on the Chinchilla scaling law in 2023?

How is the market resolved if this information isn't public?

predicts YES

@RyanGreenblatt don't worry it's OpenAI, they have open in the name

@RyanGreenblatt If possible I'll try to infer from public info (e.g. if they publish test loss and we have reasonable guesses about training FLOPs). I'll probably discuss my planned resolution in the comments here first.

If there's no way to tell from public info I'll spend some time trying to figure it out. If it still seems ambiguous (i.e. no >80% confidence either way) I'll likely resolve N/A.

Here are a few scenarios I'd like clarified:

  1. GPT-4 uses some mixture of objectives throughout training and achieves a better scaling law. Presumably resolves Yes?

  2. GPT-4 pre-trains as a causal LM only and then fine-tunes using a UL2-like mixture. Is the 'GPT-4 scaling law' then the law fitted to causal-LM training only?

  3. GPT-4 pre-trains as a causal LM and then fine-tunes via supervised-CoT / -instructions / RLHF. Presumably the 'GPT-4 scaling law' refers only to the pre-train performance?

@JacobPfau Also how do you think about the case where slightly better data cleaning / de-duplication / mixture of datasets (code vs latex vs NL) induces a minimal improvement to scaling?


  1. Yes (evaluating the scaling law on pure language modeling for fair comparison to Chinchilla)

  2. Whichever of the two gives the better scaling law

  3. Also whichever gives the better scaling law (always just counting language modeling loss ofc)
