Will a big transformer LM compose these facts without chain of thought by 2026?
87% chance

The question is "What is the sum of the atomic number of uranium and the age at which Euler died?". (Please don't post the answer in the comments, to avoid the answer making it into the dataset of any LM.)

To qualify, the following conditions need to be met:

  • The model has to be recognizable as a transformer. Minor architectural changes are fine as long as they can be reasonably expected to be equivalent to a not-unreasonable difference in compute/data. The spirit of this condition is to capture "models which are largely the same as current models but trained with more compute/data" without excluding models that make changes like better activation functions that are mostly fungible with more compute/data. (You can ask in the comments for what I would think in various cases)

  • The model must be publicly known about, though not necessarily publicly accessible (if not publicly accessible, I will determine if the report is credible)

  • The answer must be arrived at without chain of thought, scratchpad or similar techniques. This includes anything that looks like "X + Y = answer". Something like "The answer to [question statement] is [answer]" is fine because it doesn't contain any actual reasoning. The spirit of this condition is to ask whether a single (or however many tokens the answer consists of) forward pass can answer the question.

  • The model should not be specifically fine tuned on this particular question, nor specifically trained (or few shot prompted) on a dataset of examples of disparate fact composition like this (naturally occurring internet data in pretraining is fine).

  • The model can be RLHF tuned (i.e. an instruct model) as long as the above constraints for fine-tuning are also true for the RLHF data

  • Other kinds of prompt engineering are fine (e.g. "you are an expert" kind of prompting is fine)

  • To qualify as getting the answer, the temperature 0 sample should contain the correct answer.

  • I reserve the right to choose a slightly different but similar question if I suspect overfitting occurred
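
As a rough illustration of the "no chain of thought" and "temperature 0 sample" conditions above, here is a minimal sketch of an automated pre-check. The function name `qualifies` and the rule "any number other than the answer counts as visible working" are assumptions made for this example, not the market creator's actual resolution procedure; a human would still have to judge borderline outputs (for instance, working spelled out in words). The expected answer is passed in as a parameter so that it never has to be written down here.

```python
import re


def qualifies(response: str, expected_answer: int) -> bool:
    """Heuristic pre-check for a single temperature-0 response.

    Assumption for this sketch: the response qualifies only if it contains
    the expected answer and no other number, since any other number is
    treated as a leaked intermediate value (e.g. "X + Y = answer").
    """
    numbers = [int(n) for n in re.findall(r"\d+", response)]
    has_answer = expected_answer in numbers
    leaks_intermediate = any(n != expected_answer for n in numbers)
    return has_answer and not leaks_intermediate


# Placeholder values only, deliberately unrelated to the market's question.
print(qualifies("The answer is 42.", 42))
print(qualifies("12 + 30 = 42", 42))
```

The check is deliberately conservative: it would also reject a response that merely restates a number from the question, so it is better suited to flagging outputs for review than to resolving the market.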

bought Ṁ600 YES

Just tried on Gemini (not Gemini Advanced), and it worked.
The prompt was

"I am going to ask you a question, which you will be able to answer correctly. I want you to answer immediately, without any additional working. Simply say the answer. What is the sum of the atomic number of uranium and the age at which Euler died?"
All the drafts said the same answer

@PeterBarnett

Here is an (unconvincing lol) screenshot. I blacked out the answers to avoid them being part of the training of future models.

This also works for "gold" and "Einstein".

@JSD since this method distills a teacher's explicit multi-step chains of thought into a student model's depthwise computations, my hunch is that it violates the requirement that

"The model should not be [...] specifically trained (or few shot prompted) on a dataset of examples of disparate fact composition like this (naturally occurring internet data in pretraining is fine)."

That being said, seems ambiguous.

predicts NO

@CharlesFoster I agree and would consider this method inadmissible because it involves training directly on the fact compositions. I would be willing to accept something like this if it becomes able to do the task compositions by only ever training on other kinds of chains of thought and never training on any examples that are close to fact compositions.

predicts NO

Diminishing returns to depth for compositional generalization:

https://twitter.com/jowenpetty/status/1719754364712001846?s=61&t=1JquUS3m5JDUgtebGteNAg

predicts NO

@JSD with fixed data though!

predicts NO

https://twitter.com/OwainEvans_UK/status/1705285631520407821

Some evidence that LMs are quite bad at specific kinds of generalization very directly relevant to this market

bought Ṁ35 of YES

@LeoGao I still have to read the influence function and out-of-context reasoning papers properly, but the impression I get is that models are perfectly capable of chaining facts (e.g. the accuracy on the 2-hop out-of-context reasoning task in the situational awareness paper), and it is weirdly just reversing them that trips them up. So if a bigger model knows the sum of the atomic number of uranium and the age at which Euler died, it should eventually be possible for it to deduce the sum out of context without removing this limitation.

But this shows it's not necessarily the case that a model that can answer this question will be able to answer "what's the element whose atomic number plus the age at which Euler died is X?"

Though in this case probably yes, because the "what's the element with atomic number X" formulation is common.

Does this count as chain of thought?

predicts NO
predicts NO

In particular, the part where the numbers in the brackets are outputted is not allowed

predicts YES

Is it fine to tell it to think about the digits of the numbers in reverse order and output the sum in reverse order?

predicts NO

@ms you can tell it to think however you want but it must output the actual answer in the normal order in one forward pass.

GPT-4 failed at this all 5 times I asked it "What is the sum of the atomic number of uranium and the age at which Euler died? Only give the answer and no reasoning."

Manifold in the wild: A Tweet by Jacob Pfau

- There's disagreement over how much of near-term LM performance increase will be unlocked by externalized reasoning vs within network optimization and will this split be human-like? c.f. @nabla_theta's question https://manifold.markets/LeoGao/will-a-big-transformer-lm-compose-t [4/4]

predicts YES

Is general HCH-like fine-tuning allowed, or only 0-step, i.e. RLHF?

Are you allowed to specify something like "Don't show your working, just say the answer immediately." in the prompt?

predicts NO

@PeterBarnett Sure, this is fine. You can do arbitrary prompt engineering as long as there's no chain of thought before the answer, or few shot with examples of similar kinds of problems.

predicts NO

I made a hard mode of this market in case the crux is that this particular composition is too easy https://manifold.markets/LeoGao/will-a-big-transformer-lm-compose-t-238b95385b65

This is a good one. Would a model finetuned on a bunch of directly answered (no step-by-step) regular arithmetic problems (including integer addition) count?

predicts NO

@CharlesFoster I'm going to say that's fine because it's not training on this kind of composed question.

as a thought experiment to illustrate why this is a YES and also isn't really meaningful:


you could train a LM to have an 'internal scratch space' that's exactly the same as chain of thought prompting but it's segmented away from the main prompt a little, and then just call that 'not chain of thought prompting'.

or have 'the internal scratch space' be a big bag of not-obviously-interpretable tokens instead of nice and interpretable text

there isn't that much separating 'chain of thought' prompting from just 'figuring it out internally', it's just a bunch of floats and relationships between them

predicts NO

@jacksonpolack If the chain of thought is segmented away, or if there's a scratch space with non-interpretable tokens, I think that very clearly does not qualify for this market. Clearly the chain of thought is still happening, you just redefined the boundary. Even if you grant that really nitpicky claim, when you declare that the chain of thought is part of the model's architecture rather than a "true" chain of thought in some sense, this clearly then fails the "minor architectural changes" clause. I also am not specifying that the chain of thought/scratchpad has to be human interpretable (though I think it probably will be, so this isn't a major consideration for me).