By end of 2025, will it be generally agreed upon that LLM-produced text/code > human text/code for training LLMs?
Quality.
I am doubtful
Perhaps relevant: https://arxiv.org/abs/2305.15717
I just sold my NO position since I realised I need more clarification. I could imagine two ends of the spectrum of what would make this resolve YES:
1. LLMs are, in future, trained purely on the output of previous-generation LLMs, applied recursively, so that only some distant ancestor ever saw raw human data. This approach is found to be superior (on some benchmarks) to training on human data. (I would bet NO on this.)
2. LLMs are used to sanitise/summarise/filter etc. the training data, as a kind of preprocessing pipe, before it is used to train the next-generation LLM. (I would bet YES in this case, depending on the exact wording.)
In fact, you could water down the second definition further by sanitising only some of the data, possibly even a minority (see the sketch at the end of this comment).
Please can you clarify which of these definitions you had in mind, or provide another? Thanks!
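For concreteness, here is a minimal Python sketch of what I mean by the second interpretation. Everything here is hypothetical: `llm_rewrite` and `quality_filter` are placeholder stand-ins for real model calls, and the `fraction` parameter covers the watered-down variant where only part of the data is sanitised.

```python
# A minimal sketch of interpretation 2 (hypothetical, not any real pipeline):
# an LLM sanitises/filters raw human data before it trains the next model.

def llm_rewrite(text: str) -> str:
    """Hypothetical LLM call that sanitises/summarises one document."""
    return text.strip()  # placeholder: a real pipeline would call a model here

def quality_filter(text: str) -> bool:
    """Hypothetical quality gate, e.g. an LLM-as-judge score threshold."""
    return len(text.split()) >= 5  # placeholder heuristic

def preprocess(corpus: list[str], fraction: float = 1.0) -> list[str]:
    """Sanitise the first `fraction` of the corpus; pass the rest through raw.

    fraction < 1.0 is the watered-down variant where only some of the
    data (possibly a minority) goes through the LLM pipe. Either way,
    every document still originates from raw human data.
    """
    cutoff = int(len(corpus) * fraction)
    cleaned = [llm_rewrite(doc) for doc in corpus[:cutoff] if quality_filter(doc)]
    return cleaned + corpus[cutoff:]

if __name__ == "__main__":
    raw_human_data = ["  a long human-written document goes here  ", "short junk"]
    print(preprocess(raw_human_data, fraction=0.5))
```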
@Tomoffer The second thing could happen. The first thing would not improve the data, but would eventually make it completely meaningless (by detaching the text from contact with reality).
@DavidBolin Totally agree: I originally bet NO with the first interpretation in mind, but would bet YES if it's the second.