This question resolves YES if GPT-4 was trained on enough data to roughly match the prescriptions of the best scaling laws known at the time GPT-4 was trained. Currently, that means following the Chinchilla scaling laws. By "roughly," I mean it can be off by 20%. That is, if GPT-4 has 100B parameters, for which the (currently known) optimal scaling laws prescribe roughly 2T tokens, GPT-4 would need to have been trained on ~1.6T to ~2.4T tokens for this question to resolve positively.
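To make the tolerance concrete, here is a small sketch of the check I have in mind, assuming the usual Chinchilla rule of thumb of roughly 20 training tokens per parameter (the actual ratio used at resolution would be whatever the best-known laws prescribe at training time), with a hypothetical 100B-parameter GPT-4 as the example:

```python
# Minimal sketch of the resolution check. The 20-tokens-per-parameter ratio is
# the common Chinchilla rule of thumb; the 100B parameter count is hypothetical.

def token_band(params: float, tokens_per_param: float = 20.0, tolerance: float = 0.20):
    """Return the (low, high) training-token counts that would resolve YES."""
    optimal = tokens_per_param * params
    return optimal * (1 - tolerance), optimal * (1 + tolerance)

low, high = token_band(100e9)  # hypothetical 100B-parameter GPT-4
print(f"~{low / 1e12:.1f}T to ~{high / 1e12:.1f}T tokens")  # ~1.6T to ~2.4T tokens
```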
To expand on my comment: we can see from https://research.facebook.com/publications/llama-open-and-efficient-foundation-language-models/ that smaller models keep improving when trained on more tokens. Given that inference costs are high for OpenAI, it makes sense for them to train so as to minimize inference + training costs rather than training costs alone, which means a smaller-than-Chinchilla-optimal model is best.
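Here's a rough sketch of that argument (not OpenAI's actual analysis): fix a target loss, then compare the model size that minimizes training FLOPs alone with the one that minimizes training plus lifetime inference FLOPs. The loss fit and its constants are approximately the parametric estimates from Hoffmann et al. (2022); the target loss and the lifetime inference-token count are made-up illustrative numbers.

```python
import numpy as np

# Chinchilla-style parametric loss L(N, D) = E + A/N**alpha + B/D**beta,
# with constants taken approximately from the Hoffmann et al. (2022) fit.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def tokens_needed(N, target_loss):
    """Training tokens D an N-parameter model needs to reach target_loss."""
    gap = target_loss - E - A / N**alpha
    return (B / gap) ** (1 / beta) if gap > 0 else np.inf  # unreachable if N too small

def costs(N, target_loss, inference_tokens):
    """(training FLOPs, training + inference FLOPs) to hit target_loss at size N."""
    D = tokens_needed(N, target_loss)
    train = 6 * N * D                      # ~6*N*D FLOPs for training
    inference = 2 * N * inference_tokens   # ~2*N FLOPs per token served at inference
    return train, train + inference

target_loss = 1.95        # hypothetical capability target
lifetime_tokens = 1e13    # hypothetical 10T tokens served over deployment

sizes = np.logspace(9.5, 12, 500)  # ~3B to 1T parameters
train_costs, total_costs = zip(*(costs(N, target_loss, lifetime_tokens) for N in sizes))

n_train_opt = sizes[int(np.argmin(train_costs))]
n_total_opt = sizes[int(np.argmin(total_costs))]
print(f"train-FLOP-optimal size     : ~{n_train_opt / 1e9:.0f}B parameters")
print(f"train+inference-optimal size: ~{n_total_opt / 1e9:.0f}B parameters (smaller)")
```

With any nonzero inference volume, the total-cost optimum lands at a smaller N trained on more tokens than the training-only optimum, which is the point of the comment.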
@Lauro It doesn't matter whether it does better or worse per FLOP than Chinchilla scaling, as long as it is trained roughly compute-optimally according to the scaling laws known at the time. If a much-better-than-Chinchilla scaling law is discovered, it could very well be that GPT-4 is trained more compute-optimally than Chinchilla yet doesn't abide by the known optimal scaling laws.
@BionicD0LPH1N got it!
Does "known" mean "publicly known" here?
If a better scaling law is discovered by OpenAI and used to train GPT-4, does that count as YES (because it's the new best known law) or NO (because the scaling is better than the best publicly known at the time of training)?
@BionicD0LPH1N there is this article https://www.datacamp.com/blog/what-we-know-gpt4
It's not necessarily 100% reliable, though.