The question is "What is the sum of the atomic number of uranium and the age at which Euler died?". (Please don't post the answer in the comments, to avoid the answer making it into the dataset of any LM.)
To qualify, the following conditions need to be met:
The model has to be recognizable as a transformer. Minor architectural changes are fine as long as they can reasonably be expected to be equivalent to a not-unreasonable difference in compute/data. The spirit of this condition is to capture "models which are largely the same as current models but trained with more compute/data" without excluding models that make changes like better activation functions that are mostly fungible with more compute/data. (You can ask in the comments what I would think in various cases.)
The model must be publicly known, though not necessarily publicly accessible (if not publicly accessible, I will determine whether the report is credible).
The answer must be arrived at without chain of thought, a scratchpad, or similar techniques. This includes anything that looks like "X + Y = answer". Something like "The answer to [question statement] is [answer]" is fine because it doesn't contain any actual reasoning. The spirit of this condition is to ask whether the question can be answered in a single forward pass (or one pass per answer token).
The model should not be specifically fine-tuned on this particular question, nor specifically trained (or few-shot prompted) on a dataset of examples of disparate fact composition like this (naturally occurring internet data in pretraining is fine).
The model can be RLHF-tuned (i.e. an instruct model) as long as the above constraints for fine-tuning also hold for the RLHF data.
Other kinds of prompt engineering are fine (e.g. "you are an expert"-style prompting).
To qualify as getting the answer, the temperature-0 sample must contain the correct answer.
I reserve the right to choose a slightly different but similar question if I suspect overfitting occurred.
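For concreteness, the "no visible reasoning, answer must appear directly" conditions above could be checked mechanically with something like the following sketch. This is purely my own illustration, not the market's official resolution procedure; the function names and the regex heuristic are assumptions, and a real resolution would still involve human judgment. (Per the description's request, no actual answer appears here.)

```python
import re

def looks_like_chain_of_thought(output: str) -> bool:
    """Heuristic: flag outputs containing explicit arithmetic reasoning
    such as "X + Y = Z" or "X plus Y equals Z", which would disqualify
    the sample under the conditions above."""
    pattern = r"\d+\s*(\+|plus)\s*\d+\s*(=|equals)"
    return re.search(pattern, output, re.IGNORECASE) is not None

def qualifies(output: str, correct_answer: str) -> bool:
    """A temperature-0 sample qualifies if it contains the correct
    answer and shows no visible reasoning before it."""
    return correct_answer in output and not looks_like_chain_of_thought(output)
```

So under this sketch, a direct statement like "The answer is 5" (for some other question) would qualify, while "2 + 3 = 5" would not, even though both contain the answer.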
GPT-4 failed at this all 5 times I asked it "What is the sum of the atomic number of uranium and the age at which Euler died? Only give the answer and no reasoning."
Are you allowed to specify something like "Don't show your working, just say the answer immediately." in the prompt?
@PeterBarnett Sure, this is fine. You can do arbitrary prompt engineering as long as there's no chain of thought before the answer, and no few-shot prompting with examples of similar kinds of problems.
I made a hard mode of this market in case the crux is that this particular composition is too easy https://manifold.markets/LeoGao/will-a-big-transformer-lm-compose-t-238b95385b65
This is a good one. Would a model finetuned on a bunch of directly answered (no step-by-step) regular arithmetic problems (including integer addition) count?
@CharlesFoster I'm going to say that's fine because it's not training on this kind of composed question.
as a thought experiment to illustrate why this is a YES and also isn't really meaningful:
you could train an LM to have an 'internal scratch space' that's exactly the same as chain of thought prompting but segmented away from the main prompt a little, and then just call that 'not chain of thought prompting'.
or have the 'internal scratch space' be a big bag of not-obviously-interpretable tokens instead of nice, interpretable text
there isn't that much separating 'chain of thought' prompting from just 'figuring it out internally', it's just a bunch of floats and relationships between them
@jacksonpolack If the chain of thought is segmented away, or if there's a scratch space with non-interpretable tokens, I think that very clearly does not qualify for this market. Clearly the chain of thought is still happening; you just redefined the boundary. Even if you grant that really nitpicky claim, once you declare that the chain of thought is part of the model's architecture rather than a "true" chain of thought in some sense, this clearly fails the "minor architectural changes" clause. I also am not specifying that the chain of thought/scratchpad has to be human-interpretable (though I think it probably will be, so this isn't a major consideration for me).
Not quite. But given three years?
(Does it have to be one-shot? And, is OpenAI's baked-in prompt engineering acceptable?)
@citrinitas this does not satisfy the market; Leo Gao was quite specific.
@citrinitas Yeah, it must say the answer directly. It cannot say "X + Y = answer". (Also, the atomic number is wrong here.) Few-shot is also not fine; it's similar in spirit to "no fine-tuning on similar problems". Will update the description to clarify.
Ah, got it. Not using chain-of-thought like reasoning in the output, or explicitly coded into the architecture.
I'm still holding onto my position. I think the most likely NO case is that someone finds something more efficient than transformers and all research moves to that paradigm. Not that transformers won't be able to internalize this kind of operation, and not that nobody will be willing or able to train to that level.
And FWIW my second attempt still gave a chain-of-thought response, but got the atomic number right this time.
How does this resolve if the model reasons 'out loud' before giving its answer, despite not having been asked to in the prompt or explicitly fine-tuned to do so?
Manifold in the wild: A Tweet by Jacob Pfau