Referring to Jim Propp's Self-Referential Aptitude Test. The model should output the correct solution at least 75% of the time when suitably prompted.
The ability should be demonstrated before 1st Jan 2027 for this market to resolve YES, but if there is an exceptionally strong suspicion that a specific non-public LM should be able to solve it, I'll wait until I can test that or another very similar model. However, if someone finds a prompt setup that makes, say, GPT-4 solve it correctly, but they only find it after 2026, this market still resolves NO.
The model is allowed to be arbitrarily prompted, as long as:
1. There is no human interaction following the initial query (but the model is allowed to, e.g., critique its own outputs and refine them, as in the sketch after this list).
2. There is no information leak about the solution in the prompt, with the possible exception of the answer to question 20 (which is somewhat subjective).
3. The model does not use any outside tools (e.g. an interpreter), except possibly a scratchpad where it can write down thoughts outside of its context window, or something similar.
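For concreteness, here is a minimal sketch of one setup I would consider compliant: a single initial query followed by fully automated self-critique rounds, with no tools and no human in the loop. The OpenAI-style client, the model name, and the number of refinement rounds are illustrative assumptions on my part, not requirements.

```python
# Minimal sketch of a compliant prompting setup (assumptions: the tested model
# is reachable through an OpenAI-compatible chat API; the model name and the
# number of refinement rounds are placeholders, not part of the criteria).
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4"  # placeholder; substitute whichever model is being tested

INITIAL_PROMPT = (
    "Here is Jim Propp's Self-Referential Aptitude Test: <full text of the "
    "20 questions>. Reason step by step and state your final 20 answers."
    # No hints about the solution may appear here (question 20 possibly excepted).
)

def ask(messages):
    """One call to the model: no interpreter, no search, no human involvement."""
    reply = client.chat.completions.create(model=MODEL, messages=messages)
    return reply.choices[0].message.content

messages = [{"role": "user", "content": INITIAL_PROMPT}]
answer = ask(messages)

# Automated refinement: the model's own output is fed back verbatim and it is
# asked to critique and correct itself. No human reads or edits anything
# between the initial query and the final answer.
for _ in range(3):  # number of rounds is arbitrary
    messages += [
        {"role": "assistant", "content": answer},
        {"role": "user", "content": (
            "Check your 20 answers against every question for consistency, "
            "fix any mistakes, and restate the final answers."
        )},
    ]
    answer = ask(messages)

print(answer)  # this final answer set is what I would grade
```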
The model should ideally be able to explain its reasoning in detail, which I would then check. If the model gets the right answer despite erroneous reasoning, or without thinking out loud at all, I'll default to assuming it's just a leak, since the solution can be found online. But if there is a strong reason to suspect it is not a leak (e.g. the model is known to display very strong logical reasoning in other contexts), I'll create variations on the test and see whether the model can solve them correctly.
Creator policy: I won't bet.
See the 2024 version: