For this purpose, abstract synthetic data refers to data produced by a generating algorithm that itself fits in less than 100MB, for example an algorithm that randomly generates programs and runs them.
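To make this concrete, here is a minimal sketch of such a generator (illustrative only; the choice of straight-line integer programs and the function names are my own assumptions, not a fixed proposal):

```python
import random

# Illustrative sketch of "abstract synthetic data": sample tiny random
# straight-line programs over a few integer operations, run them, and emit
# (program, inputs, output) examples. The whole generator is a few kilobytes
# of code, far below the 100MB bound mentioned above.

OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}

def random_program(n_steps=5, n_inputs=2, rng=random):
    """Sample a straight-line program: each step applies an op to two earlier values."""
    program = []
    for step in range(n_steps):
        op = rng.choice(list(OPS))
        i = rng.randrange(n_inputs + step)  # index of first operand
        j = rng.randrange(n_inputs + step)  # index of second operand
        program.append((op, i, j))
    return program

def run_program(program, inputs):
    """Execute the program; each step's result is appended to the value list."""
    values = list(inputs)
    for op, i, j in program:
        values.append(OPS[op](values[i], values[j]))
    return values[-1]  # the last value is the program's output

def sample_example(rng=random):
    """One synthetic training example: (program, inputs) -> output."""
    program = random_program(rng=rng)
    inputs = [rng.randint(-9, 9) for _ in range(2)]
    return program, inputs, run_program(program, inputs)

if __name__ == "__main__":
    random.seed(0)
    for _ in range(3):
        print(sample_example())
```

Each sampled example exists only in the abstract world of the generator, yet an unbounded number of them can be drawn from this small piece of code.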
Motivation:
Neural network models can learn the same task through different methods:
- Pre-training: 10^7-10^8 samples
- Fine-tuning: 500-50,000 samples
- Few-shot learning: 5-10 samples
Initially, most models were trained directly on the target task, such as digit classification.
Later, models were pre-trained on more general but still directly useful tasks, such as classifying images into thousands of classes via supervised learning, and then fine-tuned on the required task.
Currently, models are pre-trained on seemingly less useful tasks, like next-token prediction, and then fine-tuned on more useful tasks, such as question answering. The final task can then often be learned few-shot or even zero-shot.
In the future, models might be pre-trained on completely abstract tasks, such as predicting the initial state of a Turing machine from its output. This approach could let them learn tasks that require longer context and deeper reasoning, while the training data would be far cheaper to generate in terms of infrastructure. They could then learn about the real world through fine-tuning and/or few-shot learning.
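As an illustration of that last example (again only a sketch under my own assumptions: a circular tape, a handful of states, and a fixed step budget), the data generator could look like this:

```python
import random

def random_tm(n_states=4, n_symbols=2, rng=random):
    """Sample a random transition table: (state, symbol) -> (write, move, next_state)."""
    table = {}
    for s in range(n_states):
        for sym in range(n_symbols):
            table[(s, sym)] = (
                rng.randrange(n_symbols),  # symbol to write
                rng.choice([-1, 1]),       # head move: left or right
                rng.randrange(n_states),   # next state
            )
    return table

def run_tm(table, tape, steps=64):
    """Run the machine for a fixed number of steps on a circular tape."""
    tape = list(tape)
    head, state = 0, 0
    for _ in range(steps):
        write, move, state = table[(state, tape[head])]
        tape[head] = write
        head = (head + move) % len(tape)
    return tape

def sample_pair(tape_len=16, rng=random):
    """One training example: given the final tape, predict the initial tape.
    Note the inverse map need not be unique; the model learns a distribution."""
    table = random_tm(rng=rng)
    initial = [rng.randrange(2) for _ in range(tape_len)]
    final = run_tm(table, initial)
    return final, initial  # (model input, prediction target)

if __name__ == "__main__":
    random.seed(0)
    x, y = sample_pair()
    print("output tape :", x)
    print("initial tape:", y)
```

Inverting the machine's run forces the model to track long chains of dependencies, while generating the data costs essentially nothing beyond CPU time.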
State-of-the-art models for in-context tabular data prediction, such as TabPFN, are already trained on fully synthetic data.
Pre-training image models on fractals is already a common practice. For example, see: