We are going to start running out of data to train large language models in [YEAR]
Market probabilities by year:

2023: 5%
2024: 13%
2025: 26%
2026: 18%
2027: 7%
2028: 8%
2029: 8%
2030: 14%

Will resolve to N/A if the condition fails, i.e. no such shortage of data ends up taking place.


Models will never run out of data to train on. The key distinction here is organic vs. synthetic data; we can always produce more of the latter but are quite limited (from a long-term perspective) by the former.

Please note that this question is specifically about large language models.

Can you elaborate on what you mean by “run out” and “shortage”?

@BTE Intuitively:

If you imagine training as a function of model parameters (P), architecture efficiency (E), number of unique data points (N_U), number of augmented data points (N_A), computational capacity (C), storage (S), etc. (to name the major ones), then

training = f(P, E, N, C, S), where N = N_U + N_A

As P increases, we should increase the N = N_U + N_A data points proportionally to ensure that we don't end up with P >> N, which might lead to overfitting (C and S have a similar relationship to N as well).

Now, usually when people say "run out of data", the implication is that as the parameter count P increases, there may not be a proportional increase in data points N for optimal training, which leads to diminished training efficiency. That is what I'm going with for now.
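The "proportional increase in N" idea can be made concrete with a rough rule of thumb. A minimal sketch, not from this thread: the Chinchilla scaling result (Hoffmann et al., 2022) suggests on the order of 20 training tokens per parameter for compute-optimal training, so the data requirement grows linearly with P. The constant and the example model sizes below are illustrative assumptions.

```python
# Illustrative sketch: compute-optimal data requirement grows linearly with
# parameter count P. The ~20 tokens/parameter constant is the Chinchilla
# rule of thumb; treat it as a rough assumption, not an exact law.

TOKENS_PER_PARAM = 20  # Chinchilla rule of thumb (approximate)

def optimal_tokens(params: float) -> float:
    """Rough compute-optimal training-token count N for P parameters."""
    return TOKENS_PER_PARAM * params

for p in (1e9, 70e9, 1e12):
    print(f"P = {p:.0e} params -> N ~ {optimal_tokens(p):.0e} tokens")
```

Under this rule, a trillion-parameter model would want on the order of 2e13 tokens, which is why fixed stocks of organic text become a binding constraint as P keeps growing.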

@firstuserhere Do we know the rate at which new data is generated?

@firstuserhere Also is this about all modalities or just text?

@BTE Just text-generative models. However, it is possible that multiple-modality-input --> single-modality (text) output models become the norm.

@firstuserhere And how does synthetic or augmented data factor? Like how many times can you augment a dataset before it loses its value as grounded in truth?

@BTE Yes, that's fine. People speculate that augmenting data doesn't help beyond a few iterations and has diminishing returns, and the same argument is made for synthetic data.

I don't personally buy it for synthetic data, especially with models of the form (multimodal -> text) generating synthetic text data. So I expect that, as long as synthetic data isn't bad for model training, it is going to be juiced as much as possible. If not, we might start running out of data :)

@firstuserhere I think this is an overlooked point usually. You could have a model of the form

text + audio + image -> text

and this text is possibly "richer" in quality than just synthetic data taken from a model of the form

text -> text

@firstuserhere Maybe. It all gets converted to tokens, so I guess it depends on whether there is added value in having pixel tokens and text tokens (both just numbers once tokenized) used to generate new text. It's an interesting idea.
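The "both just numbers once tokenized" point can be sketched concretely. A hypothetical illustration, assuming a VQ-style image codebook alongside a text vocabulary (the vocabulary sizes and the `to_joint_ids` helper are made up for this example): the two modalities can share one ID space simply by offsetting image-token IDs past the text range.

```python
# Hypothetical sketch: once each modality is tokenized, a multimodal model
# just sees one stream of integer IDs. Text tokens and image ("pixel")
# tokens share a single vocabulary by offsetting image IDs.

TEXT_VOCAB = 50_000   # assumed text vocabulary size
IMAGE_VOCAB = 8_192   # assumed image-codebook size (e.g. a VQ codebook)

def to_joint_ids(text_ids: list[int], image_ids: list[int]) -> list[int]:
    """Map both modalities into one shared integer ID space."""
    assert all(0 <= t < TEXT_VOCAB for t in text_ids)
    assert all(0 <= i < IMAGE_VOCAB for i in image_ids)
    # Image tokens land in [TEXT_VOCAB, TEXT_VOCAB + IMAGE_VOCAB)
    return text_ids + [TEXT_VOCAB + i for i in image_ids]

print(to_joint_ids([12, 345], [7, 4090]))  # -> [12, 345, 50007, 54090]
```

This is the sense in which "a bigger tokenizer" covers more modalities: the joint vocabulary is just the concatenation of the per-modality codebooks.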

@BTE Agreed, which is why I think a bigger tokenizer is almost certainly better for performance. The following questions are less directly useful because they ask about OpenAI's actions (i.e. whether they will release such a model rather than use it internally), but they're still worthwhile.

@firstuserhere So if the rate of increase in the number of parameters falls to keep the number of parameters proportional to the amount of data, then you wouldn't say we're running out of data?