Will GPT-5 be able to solve A::B system puzzles consistently?

Twitter user VictorTaelin tweeted the following: (original post)

"A simple puzzle GPTs will NEVER solve: As a good programmer, I like isolating issues in the simplest form. So, whenever you find yourself trying to explain why GPTs will never reach AGI - just show them this prompt. It is a braindead question that most children should be able to read, learn and solve in a minute; yet, all existing AIs fail miserably. Try it! It is also a great proof that GPTs have 0 reasoning capabilities outside of their training set, and that they'll will never develop new science. After all, if the average 15yo destroys you in any given intellectual task, I won't put much faith in you solving cancer. Before burning 7 trillions to train a GPT, remember: it will still not be able to solve this task. Maybe it is time to look for new algorithms."

The tweet contained an image with the following prompt:

"A::B is a system with 4 tokens: A#, #A, B# and #B.

An A::B program is a sequence of tokens. Example:

B# A# #B #A B#

To compute a program, we must rewrite neighbor tokens, using the rules:

A# #A ... becomes ... nothing

A# #B ... becomes ... #B A#

B# #A ... becomes ... #A B#

B# #B ... becomes ... nothing

In other words, whenever two neighbor tokens have their '#' facing each-other,

they must be rewritten according to the corresponding rule. For example, the

first example shown here is computed as:

B# A# #B #A B# =

B# #B A# #A B# =

A# #A B# =

B#

The steps were:

1. We replaced A# #B by #B A#.

2. We replaced B# #B by nothing.

3. We replaced A# #A by nothing.

The final result was just B#.

Now, consider the following program:

A# B# B# #A B# #A #B

Fully compute it, step by step."

(The original post has tabs which I couldn't get working here)
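
For reference, the rewrite rules in the prompt can be sketched as a small reducer. This is only an illustration of what "fully compute it, step by step" means, not part of the prompt or the resolution criteria. The names `RULES`, `step` and `compute` are made up for this sketch, and always rewriting the leftmost matching pair first is an assumption: the prompt does not fix an order, but this choice reproduces the worked example above.

```python
# Rewrite rules from the quoted prompt: (left token, right token) -> replacement.
RULES = {
    ("A#", "#A"): [],
    ("A#", "#B"): ["#B", "A#"],
    ("B#", "#A"): ["#A", "B#"],
    ("B#", "#B"): [],
}


def step(tokens):
    """Apply the leftmost applicable rule once; return None if no rule applies.

    Leftmost-first is an assumption of this sketch, not something the prompt states.
    """
    for i in range(len(tokens) - 1):
        pair = (tokens[i], tokens[i + 1])
        if pair in RULES:
            return tokens[:i] + RULES[pair] + tokens[i + 2:]
    return None


def compute(program):
    """Return every state of the program, from the input to the final result."""
    tokens = program.split()
    trace = [" ".join(tokens)]
    while (nxt := step(tokens)) is not None:
        tokens = nxt
        trace.append(" ".join(tokens))
    return trace


if __name__ == "__main__":
    for line in compute("B# A# #B #A B#"):
        print(line)
```

Run on the worked example, this prints the same four states as the prompt, ending in B#.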

Resolution criterion (important details)

This market will resolve YES if GPT-5 can solve puzzles of this kind with good consistency (as judged by me) and NO if it can't. Ultimately, every detail will be decided by my best judgement, but the clarifications below apply.

Some important details:

To count, the puzzles will have to be sufficiently long (at least 20 in-game tokens).

The given prompt has to be identical to the original tweet's image except for the line after "Now, consider the following program".

GPT-5 is not allowed to use external tools (what counts as an external tool is decided by my best judgement). For example, it is not allowed to write a Python program and run it with the code interpreter.

If GPT-5 has a built-in code interpreter (or something equivalent) that can't be turned off, the market will resolve as N/A.

The model has to be named GPT-5 (or something very similar; this will again be decided by my best judgement).

If OpenAI doesn't release a model called GPT-5 before 2030, the market will resolve as N/A.
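
As a rough illustration of the length requirement, the sketch below generates a random 20-token program and checks a claimed final result against a reference answer. It is not the judging procedure, which remains a judgement call. The helpers `random_program`, `normal_form` and `is_correct` are hypothetical names. The reference answer is computed in a single left-to-right pass instead of step-by-step rewriting; this assumes the final result does not depend on the order in which rules are applied, which appears to hold here because two rule applications can never overlap on the same token.

```python
import random

TOKENS = ["A#", "#A", "B#", "#B"]


def random_program(length=20, seed=None):
    """Return a random A::B program with `length` tokens (hypothetical helper)."""
    rng = random.Random(seed)
    return " ".join(rng.choice(TOKENS) for _ in range(length))


def normal_form(program):
    """Compute the final result of a program in one left-to-right pass."""
    left_facing = []  # '#X' tokens that have moved past everything on their left
    stack = []        # 'X#' tokens still waiting for something on their right
    for tok in program.split():
        if tok.endswith("#"):  # A# or B#: nothing to rewrite yet
            stack.append(tok)
            continue
        # tok is #A or #B: it swaps past non-matching 'X#' tokens and is
        # cancelled by the nearest matching one, if there is any.
        partner = tok[1] + "#"
        for i in range(len(stack) - 1, -1, -1):
            if stack[i] == partner:
                del stack[i]
                break
        else:
            left_facing.append(tok)
    return " ".join(left_facing + stack)


def is_correct(program, claimed_result):
    """Compare a claimed final result with the reference, ignoring spacing."""
    return claimed_result.split() == normal_form(program).split()


if __name__ == "__main__":
    puzzle = random_program(20, seed=1)
    print(puzzle)
    print(normal_form(puzzle))
```

Only the final line of a model's answer would be checked this way; whether the intermediate steps are valid would still need a separate check.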

bought Ṁ50 YES

Kenshin could do it with GPT-3.5 after he defeats every modern chess engine of course

bought Ṁ250 YES

Solved with Opus, so this will likely work with GPT-5 too.

@jgyou That was with a specialized prompt right?

@Metastable I looked at the solution and it’s comically long and verbose for what the problem is. I’d be surprised if GPT-5 advanced so much over Opus that it could do it from just the prompt in the image, as is required in this market.

@JureSmolar @Guuber3 Can you clarify what the input to the model will be: the image from the tweet, or the text string visible in the image?

@Metastable I’m reading it as the text in the image, i.e. the original gitlab problem. What I mean by my comments is that the current solutions don’t use the original problem as a prompt but rather an engineered one, which is very different from just the problem statement here. Solving it from the problem statement alone seems to me a much harder task for an LLM that has only its training data to go on.

@Metastable The prompt will be the text from the image. GPT-5 has to be able to solve the problem(s) using only the description and the one example.

opened a Ṁ1,500 YES at 50% order

Also, just in case there's no GPT-5

Limit order for YES up at 50%

"GPTs will never be able to compute a derivation in a (specific, simple) formal grammar" is certainly a hot take, but not one I expect to change anyone's mind when it inevitably turns out to be false.

I suspected current models have trouble with it partly because of the tokenizer, so I tried the same problem with the tokens changed to A, B, C and D, plus a few extra, helpful instructions. GPT-4 made only one mistake, in step 2, which left it with one extra D (B#) in the result.
