Skip to main content
MANIFOLD
Will an LLM be able to solve the following programming problem?
20
Ṁ460Ṁ3.6k
resolved Apr 8
Resolved
YES

Will a GPT (pre-trained transformer model) give the right answer to this problem below?

Clarifications:

1) the model needs to be able to solve it for any arbitrary sequence of length 7

2) the model needs to be a pretrained transformer (does not need to be developed by OpenAI).

3) The model does not need to be called “GPT” or have “GPT” in its name, as long as I consider that it is likely based on a transformer architecture.

4) Variations in the prompt are permissible as long as the problem relies on the same logic and has the same solution. I will only try the prompt provided, unless a comment on the market suggests a different prompt.

Inspired by this tweet:

https://x.com/victortaelin/status/1776096481704804789?s=46&t=A87brvrcPNLRU2RL6LBEZQ

Market context
Get
Ṁ1,000
to start trading!

🏅 Top traders

#TraderTotal profit
1Ṁ224
2Ṁ102
3Ṁ46
4Ṁ34
5Ṁ21
Sort by:

This has been solved. If it’s good enough for the original creator to pay $10k, I trust it.

https://x.com/victortaelin/status/1777049193489572064?s=46&t=A87brvrcPNLRU2RL6LBEZQ

Taelin has conceded defeat for the more difficult 12-token case, with a prompt for Claude Opus achieving 94% accuracy on 50 attempts.

The prompt remains secret for now for some reason (unclear how keeping it secret helps the aim of finding a more efficient solution as discussed), but presumably will be public later. I would guess it will be able to do 7-token case flawlessly.

@Santiago would a fine-tuned GPT-3.5 count? Assuming it was trained in such a way as to preclude it simply memorising the examples it was to be tested on?

@chrisjbillington It doesn’t take that many examples to show the full universe of cases for the model to memorize (for a length of 11, it takes 4^11= ~4.2M examples, for a length of 7 it only takes ~16k), so the bar would be very high for the people who fine-tuned it to demonstrate it had not memorized (or cuasi-memorized) the examples.

If I was fully convinced that it hadn’t memorized it, then I think in the spirit of the original tweet I’d resolve Yes.

The other thing I’d have to be certain of is that the “finetuning” is not doing the computation outside of the transformer architecture (which is also fairly easy to do).

@Santiago well this person

https://github.com/reissbaker/clevergpt

fine-tuned GPT-3.5 using

2000 randomly generated nets of up to 21 tokens and 30 long-context nets (interaction nets with 40-100 tokens)

For which

2k random samples appears to be good enough to get very close to perfect performance at any size in the training range; originally we trained on 200 samples, which provided near-perfect performance on <10 token nets but could sometimes struggle with ones approaching 21 tokens. 2000 samples is still far below the 5 trillion valid 21-token inputs, so the model isn't memorizing the outputs, it's just getting better at applying the pattern it's learned.

@chrisjbillington Ha!

Have you found any evidence that it actually works?

It’s solving it in a way that I suspect is not in the spirit of the question, but I didn’t really anticipate it, so it technically meets the criteria.

I think if you find any evidence that it really works (other than the creator’s comment in his own repo), I’ll have to resolve Yes and create a separate market with the additional requirements I had in mind.

@Santiago No, code is provided so presumably would be easy to reproduce, though I don't know how much it costs to fine-tune GPT-3.5. I'm probably not going to bother, since probably we'll soon have a regular prompt that solves it using some existing model.

I scoffed at first, but after 30 minutes of talking with Claude 3 Opus, it is completely unable to understand what's going on, and even screws up this badly on 3 token problems:

A# B# #A

Step 1: #B A# #A // A# B# replaced with #B A#

Step 2: #B A# #A // No rules apply

Final state: #B A# #A

I don't see an excuse this time. It really just can't understand it.

@singer I had the same reaction.

Is this the answer? I couldn't tell if the symbols all get swapped at once, or if only one change happens per line:

A# B# B# #A B# #A #B
A# B# #A B# B# #A #B
A# #A B# B# B# #A #B
B# B# B# #A #B 
B# B# #A B# #B
B# #A B# B# #B
#A B# B# B# #B
#A B# B# 

@singer Based on the steps described, my understanding is that they get swapped one at a time, left to right.

So yes, that looks like the correct sequence to me.