In what year will a language model that was pre-trained on an LM objective (causal or acausal) be fine-tuned using a self-play/expert iteration (ExIt) technique and achieve a significant performance increase on any of (1) natural language modelling, (2) code generation, or (3) proof generation?
Whether a performance increase is significant will be judged at my discretion, but roughly I consider the difference between the original GPT-3 (with chain-of-thought) and Chinchilla/PaLM+CoT to be significant -- with anything less considered insignificant.
Both the no-fine-tune language model and the post-self-play/ExIt model will be evaluated using the best available techniques, e.g. CoT and prompt engineering. If the model is not publicly available but I believe with >50% credence that this bar has been met, I will resolve this market to the corresponding year.
Caveats: If the pre-ExIt model was already fine-tuned with e.g. RLHF, I will still accept it as the baseline. If the no-fine-tune LM is far below SotA, I may not consider the result valid; roughly, the LM in question will need to be at the previous year's SotA level for language modelling.
Motivation: nostalgebraist has estimated that there are currently ~3T tokens of text data available on the internet, so current methods have <1 OOM of data scaling available. To predict continued LM scaling, we need to consider how soon alternative methods for compute scaling will become widespread.