Will GPT-4 improve on the Chinchilla scaling law?
50%
chance

Currently, the best known scaling law for language models is the Chinchilla law from Hoffmann et al. 2022: https://arxiv.org/abs/2203.15556

This market will resolve YES if OpenAI improves on this scaling law when training GPT-4, i.e. gets better performance (in terms of cross-entropy loss) per training FLOP. It will resolve NO if they get the same performance or worse.

If GPT-4 is multimodal and gets better performance per FLOP on pure language modeling, this market resolves YES.
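For concreteness, here is a rough sketch of the comparison this description asks about, using the parametric loss fit published in the Chinchilla paper (E, A, B, alpha, beta below are Hoffmann et al.'s fitted constants; the FLOP budget at the end is a made-up placeholder, since GPT-4's actual training compute is not public):

```python
# Sketch of the "better cross-entropy per training FLOP" comparison.
# E, A, B, alpha, beta are the parametric fit from Hoffmann et al. 2022;
# the budget at the bottom is an invented placeholder, not a real GPT-4 figure.
import numpy as np

E, A, B = 1.69, 406.4, 410.7
alpha, beta = 0.34, 0.28

def chinchilla_loss(N, D):
    """Predicted cross-entropy for N parameters trained on D tokens."""
    return E + A / N**alpha + B / D**beta

def compute_optimal_loss(flops):
    """Best loss the Chinchilla fit predicts at a fixed FLOP budget,
    found by a coarse grid search over N with C ~= 6*N*D."""
    Ns = np.logspace(8, 13, 4000)   # candidate parameter counts
    Ds = flops / (6 * Ns)           # tokens implied by the budget
    return float(np.min(chinchilla_loss(Ns, Ds)))

# "Improving on the scaling law" = a lower pure language-modeling loss than
# this curve predicts at GPT-4's (unknown) training-FLOP budget.
budget = 1e25                       # placeholder FLOP budget
print(f"Chinchilla-optimal loss at {budget:.0e} FLOPs:",
      round(compute_optimal_loss(budget), 3))
```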

Close date updated to 2023-12-31 5:59 pm

Noa Nabeshima bought Ṁ100 of YES

https://www.getguesstimate.com/models/22241

I think my model is grossly wrong, because I don't think a dense GPT-4 model would be trained with this much more compute. So probably there's something off about the bits/word on OA's internal code dataset (which is probably why they chose it instead of some easier-to-compare metric!), or maybe OA beats the Chinchilla scaling laws somehow, or both, or I made some other error, or ??? something else.

nmehndir

@NoaNabeshima Link is a blank page

Noa Nabeshima bought Ṁ30 of YES

probably ~> possibly

Noa Nabeshima is predicting YES at 65%

@nmehndir fixed I hope

nmehndir

@NoaNabeshima yeah works now

Ophelia bought Ṁ350 of NO

The GPT-4 post mentions that the final loss was predictable using the same methodology and 10,000x less compute. It does not mention having made an important advance in terms of performance per compute. I'm treating this as weak evidence for NO.

https://openai.com/research/gpt-4
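(For intuition: the kind of prediction the GPT-4 post describes is a power-law extrapolation from small runs. Here is a toy sketch using a generic irreducible-loss-plus-power-law form with entirely invented data points; it is not OpenAI's actual methodology.)

```python
# Toy sketch: fit L(C) = L_inf + a * C**(-b) to small-compute runs, then
# extrapolate ~10,000x beyond the largest run. All numbers are invented.
import numpy as np
from scipy.optimize import curve_fit

def loss_curve(C, L_inf, a, b):
    # loss as a function of training compute C (here in units of 1e18 FLOPs)
    return L_inf + a * C ** (-b)

rng = np.random.default_rng(0)
C_small = np.array([1.0, 3.0, 10.0, 30.0, 100.0])     # 1e18 .. 1e20 FLOPs
L_small = loss_curve(C_small, 1.7, 1.2, 0.15) + rng.normal(0, 0.005, size=5)

params, _ = curve_fit(loss_curve, C_small, L_small, p0=[1.5, 1.0, 0.1])

# 1e24 FLOPs is 10,000x the largest small run above
print("predicted loss at 1e24 FLOPs:", round(loss_curve(1.0e6, *params), 3))
```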

Jon Simon bought Ṁ5 of YES

@Ophelia strong agree

Noa Nabeshima is predicting NO at 42%

@Ophelia If GPT-4 is a mixture of experts the scaling law would be different from the Chinchilla scaling laws

Noa Nabeshima is predicting YES at 56%

@Ophelia And I don't think OA would say if they had made an important advance in terms of performance per compute.

Andrew Kvapil (viluon)

I second the top comment of the Reddit thread: which Chinchilla scaling law?

Vincent Luczkow (vluzko)

@viluon If GPT-4 is evaluated using one of the same approaches, it must beat the corresponding estimated law. If it uses a different evaluation, it must beat all three (the Chinchilla paper fits the scaling law with three different approaches).

Jack

@vluzko Can you please add this to the market description?

Vincent Luczkow is predicting NO at 53%

@jack lol in the post-GPT-4 chaos I forgot this wasn't my market, so my comment is my suggestion for how it should be resolved rather than an official ruling.


Jon Simon bought Ṁ40 of NO

"If GPT-4 is multimodal and gets better performance per FLOP on pure language modeling, this market resolves YES."

The scaling laws are about what the model learned during training. So you're saying if the model is trained on a mixture of text and images but has a "text-only" inference mode, and that text-only inference outperforms what the scaling laws say (i.e. does better than it should for a model trained with X FLOPs), then that counts?

Lauro Langosco is predicting YES at 61%

Ryan Greenblatt

How is the market resolved if this information isn't public?

Franklin Baldo is predicting YES at 49%

@RyanGreenblatt don't worry, it's OpenAI, they have "open" in the name

Lauro Langosco

@RyanGreenblatt If possible I'll try to infer it from public info (e.g. if they publish test loss and we have reasonable guesses about training FLOPs). I'll probably discuss my planned resolution in the comments here first.

If there's no way to tell from public info I'll spend some time trying to figure it out. If it still seems ambiguous (i.e. no >80% confidence either way) I'll likely resolve N/A.
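(A back-of-the-envelope version of the "infer from public info" path, assuming guessed parameter and token counts, the standard C ≈ 6·N·D compute approximation, and the Chinchilla parametric fit; every specific number below is a placeholder, not a real GPT-4 figure.)

```python
# Back-of-the-envelope resolution check: guess N (parameters) and D (tokens),
# estimate training compute as C ~= 6*N*D, and compare a published loss
# against the Chinchilla parametric fit at that (N, D).
# All specific numbers below are placeholders, not real GPT-4 figures.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28   # Hoffmann et al. fit

N_guess = 1.0e12        # placeholder parameter count
D_guess = 1.3e13        # placeholder training-token count
C_guess = 6 * N_guess * D_guess

predicted_loss = E + A / N_guess**alpha + B / D_guess**beta
reported_loss = 1.9     # placeholder published test loss

print(f"estimated training compute: {C_guess:.2e} FLOPs")
print(f"Chinchilla-predicted loss at that (N, D): {predicted_loss:.3f}")
print("leans YES" if reported_loss < predicted_loss else "leans NO / unclear")
```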

Jacob Pfau

Here are a few scenarios I'd like clarified:

  1. GPT-4 uses some mixture of objectives (e.g. UL2: https://ai.googleblog.com/2022/10/ul2-20b-open-source-unified-language.html) throughout training and achieves a better scaling law. Presumably resolves YES?

  2. GPT-4 pre-trains as a causal LM only and then fine-tunes using UL2-like mixture. Is the 'GPT-4 scaling law' then the law fitted to causal-LM training only?

  3. GPT-4 pre-trains as a causal LM and then fine-tunes via supervised-CoT / -instructions / RLHF. Presumably the 'GPT-4 scaling law' refers only to the pre-train performance?

Jacob Pfau

@JacobPfau Also, how do you think about the case where slightly better data cleaning / de-duplication / mixture of datasets (code vs. LaTeX vs. natural language) induces a minimal improvement to scaling?

Lauro Langosco

@JacobPfau

  1. Yes (evaluating the scaling law on pure language modeling, for fair comparison to Chinchilla)

  2. Whichever of the two gives the better scaling law

  3. Also whichever gives the better scaling law (always just counting language modeling loss ofc)

Vincent Luczkow

You mean better performance per FLOP than expected by the scaling law, not better than Chinchilla, right?

Lauro Langosco