Will GPT4/Opus report >50% score on ARC in 2024?
Dec 31
65% chance

ARC is a general-purpose AI eval designed to test intelligence as opposed to memorization. https://arcprize.org/arc

This market resolves to Yes if there are public demonstrations of GPT4 or Claude Opus solving at least 50% of the ARC questions in 2024.

(Note that this is separate from winning the ARC Prize, which requires using only open-source models.)

bought Ṁ100 YES

https://www.lesswrong.com/posts/Rdwui3wHxCeKb7feK/getting-50-sota-on-arc-agi-with-gpt-4o#comments

I recently got to 50%[1] accuracy on the public test set for ARC-AGI by having GPT-4o generate a huge number of Python implementations of the transformation rule (around 8,000 per problem) and then selecting among these implementations based on correctness of the Python programs on the examples.
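For readers who want to picture the "generate many programs, then select" idea in the quote, here is a minimal sketch. It is not Ryan's code: the candidate functions are hand-written stand-ins for LLM samples, and the majority vote among surviving programs is an assumption made for illustration. All names (`passes_examples`, `select_output`, and so on) are hypothetical.

```python
from collections import Counter

def passes_examples(program, examples):
    """Return True if the program reproduces every example output exactly."""
    for grid_in, expected in examples:
        try:
            if program(grid_in) != expected:
                return False
        except Exception:
            # A crashing candidate counts as incorrect.
            return False
    return True

def select_output(candidates, examples, test_input):
    """Keep candidates that solve all examples, then majority-vote their test outputs."""
    survivors = [p for p in candidates if passes_examples(p, examples)]
    votes = Counter()
    for program in survivors:
        try:
            out = program(test_input)
        except Exception:
            continue
        votes[tuple(map(tuple, out))] += 1  # grids made hashable for counting
    if not votes:
        return None
    best, _ = votes.most_common(1)[0]
    return [list(row) for row in best]

# Hypothetical stand-ins for sampled programs (a real run would have thousands).
def flip_horizontal(grid):
    return [row[::-1] for row in grid]

def flip_vertical(grid):
    return grid[::-1]

def identity(grid):
    return grid

def broken(grid):
    return grid[5]  # raises IndexError on small grids

examples = [([[1, 0], [2, 0]], [[0, 1], [0, 2]])]
test_input = [[3, 4], [5, 6]]
result = select_output(
    [flip_horizontal, flip_vertical, identity, broken], examples, test_input
)
print(result)
```

Only `flip_horizontal` reproduces the example, so it alone decides the answer for the test grid.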

Just need 1 more %

Shouldn't this already resolve YES, since the criteria say "at least 50%", not "more than 50%"? What do you think, @Lonis?

Just awaiting an update here on the ARC-AGI-PUB leaderboard! https://arcprize.org/leaderboard

I’m confident Ryan’s approach can’t be counted as YES.
What he did is impressive and is a great demonstration of LLM capabilities. However, the description of the current market mentions only the LLMs and nothing about assisting scripts, so I think it's a huge stretch to resolve it as YES. What do you imagine when you read the question "Will X report >50% score on Y"? I'm sure it's something like "give X some Y input with instructions, then expect 50% of the output to be correct." Does that sound like the solution we have so far?
Sure, we don't have to be so technical, and I'd agree with YES even if the solution included any of the following:

  • converting the puzzles to ASCII representations to work around weak vision

  • giving the LLM very detailed instructions in the prompt, and giving different prompts for different types of puzzles.

  • allowing multiple tries with additional prompts explaining the mistakes to the LLM

But it's more than that. A Python program generates thousands of prompts with different representations of the grid and detailed instructions on how to approach the puzzles (I could accept YES if it stopped at this point). Then Ryan's script uses some sophisticated calculations to find the best 12 solutions and, based on those, generates new prompts that compare the expected results with what GPT's code produced, explaining where the LLM fell short. The second stage also includes 3000 new samples and picking the best ones.
I have nothing against this approach, but it doesn't fall under the "ChatGPT solving ARC" definition. It's more like "Ryan Greenblatt's script solves ARC by generating 8000 prompts for ChatGPT".
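To make the second stage described above more concrete, here is a rough sketch of ranking candidates and building a revision prompt. The scoring rule (count of mismatched cells) and the prompt wording are assumptions chosen for illustration; the comment only says "sophisticated calculations", so this should not be read as the script's actual method. Only the default of 12 kept candidates comes from the description above, and all function names are hypothetical.

```python
def cell_mismatches(actual, expected):
    """Count differing cells; a shape mismatch counts every expected cell as wrong."""
    same_shape = len(actual) == len(expected) and all(
        len(a) == len(e) for a, e in zip(actual, expected)
    )
    if not same_shape:
        return sum(len(row) for row in expected)
    return sum(
        a != e
        for row_a, row_e in zip(actual, expected)
        for a, e in zip(row_a, row_e)
    )

def rank_candidates(outputs_by_candidate, expected_outputs, keep=12):
    """Return the names of the `keep` candidates with the fewest mismatched cells."""
    scored = []
    for name, outputs in outputs_by_candidate.items():
        total = sum(
            cell_mismatches(out, exp) for out, exp in zip(outputs, expected_outputs)
        )
        scored.append((total, name))
    scored.sort()
    return [name for _, name in scored[:keep]]

def revision_prompt(code, actual, expected):
    """Build a follow-up prompt showing expected vs. actual output (assumed wording)."""
    lines = ["Your previous program:", code, "Expected output:"]
    lines += [" ".join(map(str, row)) for row in expected]
    lines.append("Actual output:")
    lines += [" ".join(map(str, row)) for row in actual]
    wrong = cell_mismatches(actual, expected)
    lines.append(f"{wrong} cell(s) differ. Please fix the program.")
    return "\n".join(lines)

expected_outputs = [[[0, 1], [0, 2]]]
outputs_by_candidate = {
    "a": [[[0, 1], [0, 2]]],  # exact match
    "b": [[[1, 1], [0, 2]]],  # one wrong cell
    "c": [[[0, 1]]],          # wrong shape
}
top = rank_candidates(outputs_by_candidate, expected_outputs, keep=2)
prompt = revision_prompt("def f(g): ...", [[1, 1], [0, 2]], [[0, 1], [0, 2]])
```

In a real pipeline, each prompt built this way would go back to the model to sample revised programs, which is the loop the comment is objecting to.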

@SashaPomirkovany I don't think that's a fair description. AFAIU it's certainly a kind of hybrid approach, where the scripting and the LLM's intelligence work together, but they strongly complement each other and bring out of each other an unprecedentedly general problem-solving capability, something that neither could have achieved before. We cannot discount this: a simple script could never achieve this result by itself. It leans very heavily on the LLM, and vice versa.

@Sjeyua, but what's not fair about my description? I mentioned how impressive the script itself is and how it opens up LLM capabilities. I'd say your description is a bit harsh, as you called Ryan's code a "simple script." It's not a simple script but a smart, sophisticated solution.
I agree that the way I mentioned 8000 prompts may have sounded toxic. My point wasn't to criticize the solution, though, but to emphasize that it can't count as YES for this bet.
You mentioned that it's a hybrid solution, but the description does not mention hybrid solutions being allowed. If the description said, "Automated solution scoring 50% on ARC with the use of ChatGPT," I wouldn't vote NO.

@SashaPomirkovany I'm sorry if I come off as harsh. I've thought a bit more about it since then, and my position is now a bit softer, somewhere between resolving this as YES and resolving at 50%. If only a 75% resolution were possible, or some such.

For the record, I did not call his script simple, I just made the assertion that a simple script couldn't have achieved this by itself. But re-reading the context, I should likely omit the word 'simple' from there altogether, as maybe no currently human-writable script could achieve 50% by itself, no matter how complex.

I have another observation: even though there was scripting (however advanced), scripting is in a very important sense static. Write once, execute many times. Which means that either

1) the script by itself embodies (at least some aspect of) general intelligence to some extent

2) helps general intelligence be amplified from the LLM, even if it's dimly present there

3) the ARC-AGI challenge falls short in an important way from measuring the generality of intelligence

4) some combination of the above.

Is there any other option that's possible?

And if no other option is possible, I'm once again leaning more towards resolving as YES (85%?), as I think those are all good reasons to resolve as YES.

Let me know what you think of that.

75% resolution is possible

@Sjeyua

"I'm sorry if I come off as harsh"

No, I didn't feel like you were harsh; it was mostly a joke. Sorry.

"scripting is in a very important sense static. Write once, execute many times."

The ARC evaluation set is also static. You can use the power of IF statements if you know beforehand what kind of puzzles you will get. But this static script would fail as soon as we replaced the static set with auto-generated puzzles of different (non-rectangular) shapes. You can verify that by checking the code.

"the script by itself embodies (at least some aspect of) general intelligence to some extent"

Even if this statement is true, it's a strong point for the market to resolve as NO. ARC is an attempt at a metric for general intelligence. If it's the script that embodies GI, then the credit for scoring 50% should go to the script and not to ChatGPT.

"helps general intelligence be amplified from the LLM, even if it's dimly present there"

How would you prove this take? Let me start by producing a counter-take. I think that the general intelligence here is provided by human input. The script results from human intelligence generating a strategy and converting it to code. Ryan's code includes the strategy of how to get better prompts, how many tries to do, and how to choose the best results. That's what GI does: implementing, trying, and then adapting strategies to solve a problem. However, the script's "GI" is limited to the hardcoded implementation, which is just enough to solve tasks from the public evaluation set. For example, notice how the script has hardcoded logic to break the puzzles into two buckets: (1) with input and output grids sized the same and (2) sized differently. The script will fail if you give it a puzzle with a grid that is shaped like, for example, a heptagon. To adapt, you would need a human to extend the script again. In the chain of Human/Script/ChatGPT, only the human can adapt when the puzzle becomes slightly different.
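To illustrate what that two-bucket split could look like, here is a small sketch. It is written from the description above, not copied from Ryan's repository, and the names `shape` and `bucket_for` are hypothetical.

```python
def shape(grid):
    """Return (rows, cols), assuming the grid is rectangular."""
    return (len(grid), len(grid[0]) if grid else 0)

def bucket_for(examples):
    """Assign a puzzle to a bucket based on its (input, output) example pairs."""
    same = all(shape(grid_in) == shape(grid_out) for grid_in, grid_out in examples)
    return "same_size" if same else "different_size"

same_size_puzzle = [([[1, 2]], [[2, 1]])]
resized_puzzle = [([[1, 2]], [[1], [2]])]
print(bucket_for(same_size_puzzle), bucket_for(resized_puzzle))
```

Note that `shape` reads only the first row's length, so a non-rectangular grid would be described incorrectly. A rule like this is fixed when the code is written, which is the sense of "hardcoded" meant here.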

the ARC-AGI challenge falls short in an important way from measuring the generality of intelligence

This could be true, but it has nothing to do with this market. The question is specifically about ChatGPT/Opus getting a 50% score on ARC.

@Lonis, the leaderboard won't be updated for now; as Ryan mentioned, he has to write a Kaggle notebook to follow the rules for Public leaderboard submission. That would require RAM optimization for his script and can take some time.
I think you should clarify whether you will count his solution as YES. I don't know why you would, though, as the question is about ChatGPT solving it. Ryan's approach is a program synthesis script that uses ChatGPT. ChatGPT does a significant part of the task, but in the end it wouldn't be possible without the script that separates types of puzzles into buckets, generates prompts, runs statistical analysis on the results, makes adjustments, etc.

opened a Ṁ1,000 NO at 50% order

@Lonis is this the training or the test dataset?

The public evaluation set! https://arcprize.org/guide#public