Conversational LLM
I would bet more, but betting NO at a low % is inefficient.
But while I would imagine most people/applications won't be using the "best LLMs", those best LLMs will still rely on at least that many parameters. We can assume we can at least close-to-Chinchilla-saturate, probably fully saturate, 1.75 trillion parameters (GPT-4), and I have to imagine we'll be generating huge new swaths of synthetic data, several orders of magnitude beyond what was used for GPT-4, that we'll want to incorporate, which means more parameters. Of course, we may get much more parameter-efficient, but more is generally better, and compute will also be getting massively cheaper four years from now with NVIDIA's competition at full steam. Even if it's no big deal to run at fewer params, why not go big if you're shooting for the best?
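For reference, here's a rough sketch of what "Chinchilla-saturating" 1.75T parameters would imply for data, assuming the commonly cited ~20 tokens per parameter rule of thumb from the Chinchilla paper (the exact ratio and the 1.75T figure are both assumptions):

```python
# Rough Chinchilla-optimal data estimate.
# Assumptions: ~20 tokens per parameter (Hoffmann et al. 2022 rule of thumb),
# and the rumored 1.75T parameter count for GPT-4.
PARAMS = 1.75e12          # parameters
TOKENS_PER_PARAM = 20     # approximate Chinchilla ratio

optimal_tokens = PARAMS * TOKENS_PER_PARAM
print(f"Chinchilla-optimal tokens for {PARAMS:.2e} params: ~{optimal_tokens:.1e}")
# -> ~3.5e+13, i.e. roughly 35 trillion tokens
```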
@TomPotter I guess I could imagine a scenario where compute is most efficiently used with smaller parameter matrices, where it's really just about the ratio, a tradeoff between parameters and cycles, and maybe that ratio gets pushed very far toward cycles over parameters because of whatever the techniques of the time demand. Still, 29% seems too high.
@TomPotter
I think there may be an answer described here:
https://www.youtube.com/watch?v=1CpCdolHdeA
At 8:00 in, he says that the amount of data and the size of the model [number of parameters] both scale as sqrt(compute).
And in the previous few minutes of the conversation he references the scaling of dollars going into compute, roughly $100M for GPT-4 -> $1B -> $10B, multiplied by the compute-per-dollar gains from new GPU tech. So of course there will be a huge increase in compute going toward the best 2027 model; Dario was only referencing through 2025, so that's another two years beyond that, and we're already talking ~100-300x GPT-4 compute in 2025 based on Dario's numbers.
So that means we should expect parameters to be roughly sqrt(300) * 1.75T ≈ 30.3T.
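A quick sanity check of that arithmetic, assuming parameters scale as sqrt(compute) and taking the ~100-300x compute multiplier range from the video at face value (both the scaling rule and the multipliers are assumptions from the discussion above, not established figures):

```python
import math

# Assumptions pulled from the thread above, not authoritative numbers:
GPT4_PARAMS_T = 1.75                 # rumored GPT-4 parameter count, in trillions
COMPUTE_MULTIPLIERS = [100, 300]     # rough compute-vs-GPT-4 range cited for ~2025

# If parameters scale as sqrt(compute), then
# params_new = params_old * sqrt(compute_new / compute_old).
for mult in COMPUTE_MULTIPLIERS:
    params_t = GPT4_PARAMS_T * math.sqrt(mult)
    print(f"{mult:>3}x GPT-4 compute -> ~{params_t:.1f}T parameters")

# Output:
# 100x GPT-4 compute -> ~17.5T parameters
# 300x GPT-4 compute -> ~30.3T parameters
```

So even at the low end of that multiplier range, the same scaling assumption still lands well into the tens of trillions of parameters.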