The LM must have 'frontier' performance (perplexity on PILE or a similar benchmark at least as good as the SotA from one year prior). The LM must have been trained after 2022.
If it is unclear whether this has happened, I will give this a year to resolve. If it remains plausibly unclear, the market will resolve N/A.
Fine-tuning includes all RL training. Training on synthetic data, or additional supervised learning deliberately performed after training on a PILE-like generic dataset, counts as fine-tuning. If the nature of pre-training changes such that all SotA models do RL/instruction training/etc. during the initial imitation-learning phase, I will probably resolve this question as ambiguous. Multi-modal training on text+images will by default count as pre-training.
A quick back-of-the-envelope calculation suggests Llama 3.1 used roughly 100B tokens of post-training data (see the sketch below).
https://scontent-lga3-2.xx.fbcdn.net/v/t39.2365-6/452387774_1036916434819166_4173978747091533306_n.pdf?_nc_cat=104&ccb=1-7&_nc_sid=3c67a6&_nc_ohc=t6egZJ8QdI4Q7kNvgGLd_WP&_nc_ht=scontent-lga3-2.xx&oh=00_AYAi3mcauEKuekcEn4CRpsF-igaR2I_3eBGde533LIM8eQ&oe=66A6EB8D#page=14.39
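For transparency, here is a minimal sketch of that kind of estimate. The round counts, example counts, and lengths are illustrative assumptions (not figures taken from the Llama 3.1 paper), chosen so the total lands near the ~100B figure quoted above.

```python
# Back-of-the-envelope estimate of post-training token volume.
# All figures below are illustrative assumptions, not numbers from the Llama 3.1 paper.

def post_training_tokens(rounds: int, examples_per_round: float,
                         avg_tokens_per_example: float,
                         epochs_per_round: float = 1.0) -> float:
    """Total tokens seen across all post-training (SFT + preference tuning) rounds."""
    return rounds * examples_per_round * avg_tokens_per_example * epochs_per_round

# Hypothetical inputs chosen so the total lands near ~100B tokens.
estimate = post_training_tokens(
    rounds=6,                    # several rounds of SFT/DPO (assumed)
    examples_per_round=3e6,      # a few million examples per round (assumed)
    avg_tokens_per_example=5e3,  # long / multi-turn examples (assumed)
)
print(f"~{estimate / 1e9:.0f}B post-training tokens")  # ~90B
```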
For reference, the current estimate of the most expensive training run before 2026 is about 10x the price of GPT-4. https://www.metaculus.com/questions/17418/most-expensive-ai-training-run-by-year/
Meanwhile, information on the available supply of text data can be found here: https://epochai.org/trends#data-trends-section
Added some detail to clarify "Fine-tuning includes all RL training. Training on synthetic data, or additional supervised learning which is deliberately trained on separately from a PILE-like generic dataset counts as fine-tuning. If the nature of pre-training changes such that all SotA models do RL/instruction training/etc. during the initial imitation learning phase, I will probably resolve this question as ambiguous."
This does not really make sense to me, given that the purpose of pre-training is bulk knowledge acquisition while fine-tuning sets particular behaviours. In fact, very little SFT data is required to elicit the desired behaviour; papers such as LIMA have shown this.
Yann LeCun's famous cake analogy describes this: self-supervised learning is the bulk of the cake, supervised learning the icing, and RL the cherry on top.
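To make the scale gap concrete, here is a rough order-of-magnitude comparison. The per-example length is an assumption; the ~1,000-example LIMA figure and ~15T-token Llama 3.1 pre-training figure are approximate public numbers.

```python
# Order-of-magnitude comparison: LIMA-scale SFT data vs. frontier pre-training data.
# Per-example length is an assumption; the other figures are approximate public numbers.

lima_examples = 1_000            # LIMA fine-tuned on ~1,000 curated prompt/response pairs
avg_tokens_per_example = 1_000   # assumed average length of one example
sft_tokens = lima_examples * avg_tokens_per_example

pretrain_tokens = 15e12          # Llama 3.1 pre-trained on roughly 15T tokens

ratio = sft_tokens / pretrain_tokens
print(f"SFT tokens: {sft_tokens:.1e}")                 # 1.0e+06
print(f"Pre-training tokens: {pretrain_tokens:.1e}")   # 1.5e+13
print(f"SFT/pre-training ratio: {ratio:.0e}")          # 7e-08
```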