This cost should not include the salaries of the researchers who worked on developing it, but rather only the cost of electricity + hardware. I will resolve this as best I can, based on whatever estimates and other pieces of evidence are available.
What does "cost of hardware" mean? Do we calculate as if OpenAI also bought all the GPUs they used in their training run?
Does it resolve YES if the aggregate cost across all GPT-4 training runs is >$50M, or only if the final run (or at least one run) costs >$50M?
@RealityQuotient My understanding is that the test runs and experiments made before training the final iteration of an LLM are never counted in that LLM’s compute cost estimate. So this question is looking at the “relevant” run, i.e. the one used to train the model they end up calling GPT-4.
If others have other opinions about how these things are counted in practice I’d be curious to hear more.
What if the cost to MSFT Azure is >$50M, yet they charge OpenAI less?
I heard a rumour that it's only around $10M.
@ValeryCherepanov What percentage would you put on that rumor being correct? Was it from someone reliable?
As I understand it, the possibilities for it to be around $10M would be the following:
They stupidly trained an LM with not enough data. (unlikely? Could be explained by the cost of data acquisition for multi-T-token datasets)
They figured out ways to make the training super cheap and trained with Chinchilla-law-abiding numbers. (unlikely)
They somehow found ways to improve Chinchilla laws. (plausible)
GPT-4 has much less than 175B parameters. (not unlikely?)
Using GPT-3-proportional compute cost, training a Chinchilla-abiding GPT-4 with 175B parameters would cost hundreds of millions of dollars.
My guess is that it's some combination of all of the above, or something like that. If you're relatively confident that your rumor is correct, it lets us update on other related markets.
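For a sense of where “hundreds of millions” comes from, here's a back-of-envelope sketch using the standard C ≈ 6·N·D FLOPs approximation and Chinchilla's ~20 tokens per parameter. The GPT-3 run-cost figures below are rough public estimates I'm plugging in for illustration, not official OpenAI numbers:

```python
# Back-of-envelope: training compute ~ 6 * N (params) * D (tokens);
# Chinchilla-optimal training uses roughly D ~ 20 * N tokens.
def train_flops(n_params, n_tokens):
    return 6 * n_params * n_tokens

gpt3_flops = train_flops(175e9, 300e9)             # GPT-3: 175B params, ~300B tokens
chinchilla_flops = train_flops(175e9, 20 * 175e9)  # 175B params, Chinchilla-optimal data

ratio = chinchilla_flops / gpt3_flops  # ~11.7x GPT-3's training compute

# Assumed GPT-3 run costs (in $M) are rough public estimates, spanning low to high.
for gpt3_cost_musd in (5, 12, 25):
    print(f"GPT-3 at ${gpt3_cost_musd}M -> ~${ratio * gpt3_cost_musd:.0f}M")
    # ~$58M / ~$140M / ~$292M
```

So the “hundreds of millions” figure falls out only at the higher end of GPT-3 cost estimates; at the low end, the same arithmetic lands nearer $60M, still well above $10M.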
@BionicD0LPH1N I would put maybe 70% on it being mostly correct, but without much confidence.
It mentioned a few things, including that GPT-4 is significantly larger than GPT-3.
I think $10M is not a small amount. It's plausible that somebody like OpenAI can train a 175B+ LLM following the Chinchilla laws for $10M. MosaicML claims they can train a 30B-parameter model with original-GPT-3-like quality for $450k; they probably started only a couple of months ago and may have worse methods and/or more expensive hardware: https://www.mosaicml.com/blog/gpt-3-quality-for-500k
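To see why the MosaicML number makes $10M look plausible, here's a rough extrapolation under the 6·N·D approximation. I'm assuming (my assumption, not stated in their post) that the 30B run is roughly Chinchilla-scale (~20 tokens/param) and that cost scales linearly with FLOPs at the same price per FLOP:

```python
# If a ~Chinchilla-scale 30B run costs $450k (MosaicML's claimed price),
# what does a 175B Chinchilla run cost at the same price per FLOP?
def train_flops(n_params, n_tokens):
    return 6 * n_params * n_tokens

cost_30b = 0.45e6  # USD, MosaicML's claim
flops_30b = train_flops(30e9, 20 * 30e9)     # assumed ~20 tokens/param
flops_175b = train_flops(175e9, 20 * 175e9)  # 175B, Chinchilla-optimal

# With D = 20*N on both sides, compute scales as (175/30)^2 ~ 34x.
cost_175b = cost_30b * flops_175b / flops_30b
print(f"~${cost_175b / 1e6:.0f}M")  # -> ~$15M
```

~$15M is in the same ballpark as the rumoured $10M, so cheap training alone could nearly explain the rumour without any scaling-law breakthrough.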
Scaling laws can be somewhat different too. Especially for multimodal data.
Another possibility (this is my idea) would be to train on maybe 50% of the optimal data, then wait until they get H100s or better datasets etc., and then continue training and release a new checkpoint.