Will a big transformer LM compose these facts without chain of thought by 2026?

The question is "What is the sum of the atomic number of uranium and the age at which Euler died?". (Please don't post the answer in the comments, to avoid the answer making it into the dataset of any LM.)

To qualify, the following conditions need to be met:

  • The model has to be recognizable as a transformer. Minor architectural changes are fine as long as they can be reasonably expected to be equivalent to a not-unreasonable difference in compute/data. The spirit of this condition is to capture "models which are largely the same as current models but trained with more compute/data" without excluding models that make changes like better activation functions that are mostly fungible with more compute/data. (You can ask in the comments for what I would think in various cases)

  • The model must be publicly known about, though not necessarily publicly accessible (if not publicly accessible, I will determine if the report is credible)

  • The answer must be arrived at without chain of thought, scratchpad or similar techniques. This includes anything that looks like "X + Y = answer". Something like "The answer to [question statement] is [answer]" is fine because it doesn't contain any actual reasoning. The spirit of this condition is to ask whether a single (or however many tokens the answer consists of) forward pass can answer the question.

  • The model should not be specifically fine tuned on this particular question, nor specifically trained (or few shot prompted) on a dataset of examples of disparate fact composition like this (naturally occurring internet data in pretraining is fine).

  • The model can be RLHF-tuned (i.e. an instruct model) as long as the above fine-tuning constraints also hold for the RLHF data

  • Other kinds of prompt engineering are fine (e.g. "you are an expert"-style prompting is fine)

  • To qualify as getting the answer, the temperature 0 sample should contain the correct answer.

  • I reserve the right to choose a slightly different but similar question if I suspect overfitting occurred
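The resolution conditions above can be sketched as a simple check: the temperature-0 sample must contain the correct answer, and nothing that looks like "X + Y = answer" may precede it. This is an illustrative sketch only, not the author's actual resolution procedure; the function name, the reasoning markers, and the numbers are all made up (the real answer is deliberately omitted, per the question description):

```python
def qualifies(sample: str, answer: str) -> bool:
    """Illustrative check of the market's resolution criteria: the
    sample must contain the correct answer, with no visible
    intermediate reasoning (e.g. 'X + Y =') before it.
    """
    if answer not in sample:
        return False
    # Any explicit arithmetic before the answer counts as chain of thought.
    reasoning_markers = ["+", "=", "sum of", "equals"]
    prefix = sample.split(answer)[0]
    return not any(marker in prefix.lower() for marker in reasoning_markers)

# "The answer to [question] is [answer]" qualifies (no actual reasoning);
# "X + Y = answer" does not. (Fake numbers, not the real question.)
qualifies("The answer is 30.", "30")
qualifies("10 + 20 = 30", "30")
```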

tom

GPT-4 failed at this every time: I asked it "What is the sum of the atomic number of uranium and the age at which Euler died? Only give the answer and no reasoning." 5 times.

ManifoldDream (Manifold in the Wild bot)

Manifold in the wild: A Tweet by Jacob Pfau

- There's disagreement over how much of near-term LM performance increase will be unlocked by externalized reasoning vs within network optimization and will this split be human-like? c.f. @nabla_theta's question https://manifold.markets/LeoGao/will-a-big-transformer-lm-compose-t [4/4]

Jacob Pfau is predicting YES at 68%

Is general HCH-like fine tuning allowed, or only 0-step, i.e. RLHF?

Peter Barnett

Are you allowed to specify something like "Don't show your working, just say the answer immediately." in the prompt?

Leo Gao is predicting NO at 75%

@PeterBarnett Sure, this is fine. You can do arbitrary prompt engineering as long as there's no chain of thought before the answer, or few shot with examples of similar kinds of problems.

Leo Gao is predicting NO at 75%

I made a hard mode of this market in case the crux is that this particular composition is too easy https://manifold.markets/LeoGao/will-a-big-transformer-lm-compose-t-238b95385b65

Charles Foster

This is a good one. Would a model finetuned on a bunch of directly answered (no step-by-step) regular arithmetic problems (including integer addition) count?

Leo Gao is predicting NO at 74%

@CharlesFoster I'm going to say that's fine because it's not training on this kind of composed question.

jackson polack

as a thought experiment to illustrate why this is a YES and also isn't really meaningful:

you could train an LM to have an 'internal scratch space' that's exactly the same as chain of thought prompting, just segmented away from the main prompt a little, and then call that 'not chain of thought prompting'.

or make the 'internal scratch space' a big bag of not-obviously-interpretable tokens instead of nice, interpretable text

there isn't that much separating 'chain of thought' prompting from just 'figuring it out internally'; it's just a bunch of floats and relationships between them

Leo Gao is predicting NO at 74%

@jacksonpolack If the chain of thought is segmented away, or if there's a scratch space with non-interpretable tokens, that very clearly does not qualify for this market. The chain of thought is still happening; you've just redefined the boundary. Even if you grant that really nitpicky claim, once you declare that the chain of thought is part of the model's architecture rather than a "true" chain of thought in some sense, it clearly fails the "minor architectural changes" clause. I'm also not specifying that the chain of thought/scratchpad has to be human interpretable (though I think it probably will be, so this isn't a major consideration for me).

L bought Ṁ10 of NO

buying NO: I don't think the transformer will be the architecture that pulls this off.

Anton bought Ṁ50 of YES

Not quite. But given three years?

(Does it have to be one-shot? And is OpenAI's baked-in prompt engineering acceptable?)

L

@citrinitas this does not satisfy the market; Leo Gao was quite specific.

Leo Gao bought Ṁ10 of NO

@citrinitas Yeah, it must say the answer directly. It cannot say "X + Y = answer". (Also, the atomic number is wrong here.) Few shot is also not fine; it's similar in spirit to "not fine-tuning on similar problems". Will update description to clarify.

Anton is predicting YES at 63%

Ah, got it. Not using chain-of-thought-like reasoning in the output, nor having it explicitly coded into the architecture.

I'm still holding onto my position. I think the most likely NO case is that someone finds something more efficient than transformers and all research moves to that paradigm. Not that transformers won't be able to internalize this kind of operation, and not that nobody will be willing or able to train to that level.

And FWIW my second attempt still gave a chain-of-thought response, but got the atomic number right this time.

Hedgehog

How does this resolve if the model reasons ‘out loud’ before giving its answer, despite not having been asked to do so in the prompt or having been explicitly fine-tuned to?

Leo Gao bought Ṁ45 of NO

@Hedgehog The intention of "no chain of thought" was specifically to exclude this. That exclusion covers cases where the chain of thought happens without being explicitly prompted/fine tuned for.

Anton bought Ṁ50 of YES

"Will a big transformer do ... by 2026?"

Didn't read, bought YES