Will a big transformer LM compose these facts without chain of thought by 2026?
49% chance

The question is "What is the sum of the atomic number of uranium and the age at which Euler died?" (Please don't post the answer in the comments, to avoid it making it into the training data of any LM.)

To qualify, the following conditions need to be met:

  • The model has to be recognizable as a transformer. Minor architectural changes are fine as long as they can reasonably be expected to be equivalent to a not-unreasonable difference in compute/data. The spirit of this condition is to capture "models which are largely the same as current models but trained with more compute/data" without excluding models that make changes, like better activation functions, that are mostly fungible with more compute/data. (You can ask in the comments what I would rule in various cases.)

  • The model must be publicly known about, though not necessarily publicly accessible (if it is not publicly accessible, I will determine whether the report is credible).

  • The answer must be arrived at without chain of thought, a scratchpad, or similar techniques. This includes anything that looks like "X + Y = answer". Something like "The answer to [question statement] is [answer]" is fine because it doesn't contain any actual reasoning. The spirit of this condition is to ask whether a single forward pass (or however many forward passes the answer's tokens require) can answer the question.

  • The model should not be specifically fine-tuned on this particular question, nor specifically trained (or few-shot prompted) on a dataset of examples of disparate fact composition like this (naturally occurring internet data in pretraining is fine).

  • The model can be RLHF-tuned (i.e. an instruct model) as long as the above fine-tuning constraints also hold for the RLHF data.

  • Other kinds of prompt engineering are fine (e.g. "you are an expert" style prompting is fine).

  • To qualify as getting the answer, the temperature-0 sample should contain the correct answer (see the sketch after this list for how such a check might look).

  • I reserve the right to choose a slightly different but similar question if I suspect overfitting has occurred.
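
Purely as an illustration of the mechanics (not the official resolution procedure), here is a minimal sketch of the temperature-0, no-chain-of-thought check, assuming an OpenAI-style chat-completions client. The model name is a placeholder, and the expected answer is deliberately left out so it doesn't end up in any training corpus.

```python
# Minimal sketch of the qualifying check, assuming an OpenAI-style
# chat-completions client. Model name and expected answer are placeholders.
from openai import OpenAI

QUESTION = (
    "What is the sum of the atomic number of uranium "
    "and the age at which Euler died?"
)
EXPECTED_ANSWER = "<redacted>"  # fill in privately; intentionally not posted here

client = OpenAI()

response = client.chat.completions.create(
    model="some-future-transformer-lm",  # placeholder model name
    messages=[
        # "You are an expert"-style prompting is allowed by the criteria;
        # asking for only the final number discourages visible reasoning.
        {"role": "system", "content": "You are an expert. Answer with only the final number."},
        {"role": "user", "content": QUESTION},
    ],
    temperature=0,  # criterion: the temperature-0 sample must contain the answer
    max_tokens=8,   # leave no room for a chain-of-thought scratchpad
)

completion = response.choices[0].message.content
print(completion)

# The output must contain the correct number and must not contain any
# intermediate reasoning such as "X + Y = answer".
print("Qualifies:", EXPECTED_ANSWER in completion)
```

Whether any visible reasoning slipped into the completion would still be judged by eye, per the conditions above.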
