Will any widely used LLM be pre-trained with abstract synthetic data before 2030?
74% chance

For this purpose, abstract synthetic data refers to data generated by an algorithm whose implementation can be stored in less than 100 MB, for example an algorithm that randomly generates programs and runs them.
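As a rough illustration only (not part of the resolution criteria), here is a minimal Python sketch of what such a generator might look like: it randomly builds tiny arithmetic "programs", runs them, and emits (program, result) pairs as training text. The expression grammar and all names are made up for the example; the generator itself is a few hundred bytes of code, comfortably under the 100 MB bound.

```python
import random

OPS = ["+", "-", "*"]

def random_expression(depth=0, max_depth=3):
    """Randomly build a small arithmetic expression as a string."""
    if depth >= max_depth or random.random() < 0.3:
        return str(random.randint(0, 9))
    left = random_expression(depth + 1, max_depth)
    right = random_expression(depth + 1, max_depth)
    return f"({left} {random.choice(OPS)} {right})"

def synthetic_sample():
    """One training example: a random program and the result of running it."""
    expr = random_expression()
    return f"{expr} = {eval(expr)}"  # eval is safe here: we control the grammar

if __name__ == "__main__":
    random.seed(0)
    for _ in range(5):
        print(synthetic_sample())
```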

Motivation:

Neural network models can learn the same task through different methods:

  • Pre-training: 10^7-10^8 samples

  • Fine-tuning: 500-50000 samples

  • Few-shot learning: 5-10 samples

  1. Initially, most models were directly pre-trained on the required task, such as digit classification.

  2. Later, models were pre-trained on more general but still directly useful tasks, such as classifying images into thousands of classes via supervised learning, and then fine-tuned on the required task.

  3. Currently, models are pre-trained on seemingly less useful tasks, like next-token prediction, and then fine-tuned on more useful tasks, such as question answering. The final task can also be learned few-shot or even zero-shot.

  4. In the future, models might be pre-trained on completely abstract tasks, such as predicting the initial state of a Turing machine from its output (see the sketch after this list). This approach could let them learn tasks that require longer context and deeper reasoning, while the data itself is cheaper to produce, in terms of infrastructure, than data collected from the real world. They could then learn about the real world through fine-tuning and/or few-shot learning.
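To make the Turing-machine example in item 4 concrete, below is a minimal, hypothetical sketch of a generator for that kind of task: it draws a random bounded-step machine with a circular tape, runs it, and emits examples asking the model to recover the initial tape from the output tape. The state count, tape length, step budget, and text format are all assumptions for illustration; a real dataset would also need to encode the machine itself (or fix a single machine) so that the inversion task is well posed.

```python
import random

def random_machine(n_states=4, n_symbols=2):
    """Randomly generate a transition table: (state, symbol) -> (write, move, next_state)."""
    table = {}
    for s in range(n_states):
        for sym in range(n_symbols):
            table[(s, sym)] = (
                random.randrange(n_symbols),   # symbol to write
                random.choice([-1, 1]),        # head movement
                random.randrange(n_states),    # next state
            )
    return table

def run(table, tape, steps=20):
    """Run the machine for a fixed number of steps on a circular tape."""
    tape = list(tape)
    head, state = 0, 0
    for _ in range(steps):
        write, move, state = table[(state, tape[head])]
        tape[head] = write
        head = (head + move) % len(tape)
    return tape

def sample(tape_len=12):
    """One training example: the model must recover the initial tape from the output tape."""
    table = random_machine()
    initial = [random.randrange(2) for _ in range(tape_len)]
    final = run(table, initial)
    fmt = lambda t: "".join(map(str, t))
    return f"output: {fmt(final)} initial: {fmt(initial)}"

if __name__ == "__main__":
    random.seed(1)
    for _ in range(3):
        print(sample())
```

The point of the sketch is only that such data depends on no real-world corpus and can be produced at arbitrary scale from a tiny generator.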

State-of-the-art models for in-context tabular data prediction, such as TabPFN, are already trained on fully synthetic data.
