Will resolve to N/A if the condition fails, i.e. if no such shortage of data ends up taking place.
@BTE Intuitively:
If you imagine training as a function of model parameters (P), architecture efficiency (E), number of unique data points (N_U), number of augmented data points (N_A), computational capacity (C), storage (S), etc. (to name the major ones), then
training = f(P, E, N, C, S)
where N = N_U + N_A is the total number of data points. As P increases, we should increase N proportionally so that we don't end up with P >> N, which might lead to overfitting (C and S have a similar relationship to N).
Now, usually when people say "run out of data", the implication is that as the parameter count P increases, there might not be a proportional increase in data points N for optimal training, which leads to diminished training efficiency. That is what I'm going with for now.
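As a rough illustration of that condition, here is a minimal Python sketch. The function name `data_shortfall` and the target ratio `points_per_param` are hypothetical, chosen only for this example; the thread does not specify what "proportional" should mean numerically, and the numbers in the loop are made up.

```python
def data_shortfall(params, unique_points, augmented_points, points_per_param=1.0):
    """Return how many data points are missing for a model with `params` parameters.

    `points_per_param` is an assumed target ratio N / P, not a value from the thread.
    """
    n = unique_points + augmented_points      # N = N_U + N_A
    required = points_per_param * params      # N needed to stay proportional to P
    return max(0.0, required - n)             # 0.0 means no shortage


# P grows 10x per step while the available data stays fixed (illustrative numbers).
for p in (1e6, 1e7, 1e8):
    missing = data_shortfall(p, unique_points=8e6, augmented_points=2e6)
    print(f"P={p:.0e}  shortfall={missing:.0e}")
```

Under these assumptions, "running out of data" corresponds to the shortfall becoming positive as P keeps growing while N does not.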
@BTE Just text-generating models. However, it is possible for multi-modality input --> single-modality (text) output models to become the norm.
@firstuserhere And how does synthetic or augmented data factor in? Like, how many times can you augment a dataset before it loses its value as grounded in truth?
@BTE Yes, that's fine. People speculate that augmenting data has diminishing returns and doesn't help beyond a few iterations, and the same argument is made for synthetic data.
I don't personally buy it for synthetic data, especially with models of the form (multimodal -> text) generating synthetic text data. So I expect that as long as synthetic data isn't bad for model training, it is going to be juiced as much as possible. If it is, we might start running out of data :)
@firstuserhere I think this point is usually overlooked. You could have a model of the form
text + audio + image -> text
and this text is possibly "richer" in quality than just synthetic data taken from a model of the form
text -> text
@firstuserhere Maybe. It all gets converted to tokens so I guess it depends whether there is added value in having pixel tokens and text tokens (both numbers once tokenized) used to generate new text. It’s an interesting idea.
@BTE Agreed, which is why I think a bigger tokenizer is almost certainly better for performance. Although the following questions are not very useful because they ask for OpenAI's actions (i.e. whether they will release it instead of using it), they're still worthwhile
@firstuserhere So if growth in the number of parameters slows so that it stays proportional to the amount of data, then you wouldn't say we're running out of data?