Above human scores on ARC-AGI-3 in 2026?
Dec 31 · 25% chance

Resolves to YES if at least one model scores at or above the human baseline, as evaluated by the creators of the benchmark, in 2026.

Resolves to NO if this does not happen.

sold Ṁ770 NO

@PlasmaPower wanna bet more?

@Bayesian nah I was just exiting my position after ARC-AGI-3 updated the human baseline: https://arcprize.org/blog/arc-agi-3-human-dataset

I think this is still likely NO (and I probably sold too quickly), but I just wanted to sell my position since the rules changed. Not sure what % I'd put this at yet. It is technically possible for an AI to score up to 115% now (though I still think it would be quite an accomplishment to do so).

Ya @ZviMowshowitz is there a document or something that defines how they're evaluating "above the human baseline" on the aggregate across the benchmark? Are they going to publish that value, so that it's normalized against the values they're using to assess AI?

@bens @ZviMowshowitz This market says "at or above the human baseline, as evaluated by the creators of the benchmark", and their technical report defines the human baseline (using those exact words) as 100%. This is not at all representative of an average human, for reasons I have explained in another comment, but the market description says it needs to go with ARC-AGI's definition of "human baseline". It's not possible to exceed 100%; per the technical report, "To stop a single glitch-level from distorting an entire environment score, we cap the per-level efficiency an AI can receive at 1.0x human baseline." But theoretically, if an AI matched or exceeded the second-best human performance on every level, it would score 100% and this would resolve YES.

bought Ṁ60 NO

confirming that this is their human baseline, which has been criticized for being the second-best human performance, not the average human?

@nostream I was under the impression that it's the 2nd-best human performance per game/level, meaning if the game were "win rock-paper-scissors against a machine", the 2nd-best human for a given level would in most cases win, and so the "human baseline" would be almost 100% despite 50% being the maximum attainable average score.

@Bayesian right, I read something similar.

wanted to confirm that’s how this market operates since it makes the benchmark much harder.

(Also, unlike ARC-AGI-1 and -2, it's efficiency-graded with a quadratic penalty, really pulling out all the stops to make it as hard as possible. And no harnesses allowed.)

@nostream Right, plus they cap the scores at the human baseline, so even a perfect solution would still only be 100%.

I tried one of the puzzles myself, and I crushed the overall human baseline. Across eight levels, I used 469 actions, compared to a baseline total of 577: https://arcprize.org/replay/029460e3-fa75-405a-87dc-13565820bf1e.

But because I did worse than the baseline on a couple levels, my final score was only 76.6%.
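A minimal sketch of how that can happen, assuming the scoring described in this thread: per-level efficiency is capped at 1.0x the human baseline, and slower-than-baseline play takes a quadratic penalty that bottoms out at 5x (the published formula may differ). The per-level numbers below are hypothetical; only the totals match the replay above:

```python
# Hypothetical sketch of ARC-AGI-3-style per-level scoring, inferred from
# descriptions in this thread: per-level efficiency is capped at 1.0x the
# human baseline, and slower-than-baseline play takes a quadratic penalty
# reaching 0 at 5x baseline. Not the published formula.

def level_score(ai_actions: int, baseline_actions: int) -> float:
    r = ai_actions / baseline_actions      # 1.0 = human-baseline pace
    if r <= 1.0:
        return 1.0                         # capped: can't exceed 100% per level
    if r >= 5.0:
        return 0.0                         # 5x slower or worse scores nothing
    return 1.0 - ((r - 1.0) / 4.0) ** 2    # quadratic falloff between 1x and 5x

# Hypothetical per-level action counts; only the totals (469 vs 577)
# match the replay linked above.
baseline = [120, 110, 100, 92,  30,  25, 50, 50]   # sums to 577
mine     = [ 35,  30,  30, 28, 144, 122, 40, 40]   # sums to 469

scores = [level_score(a, b) for a, b in zip(mine, baseline)]
print(f"{sum(mine)} actions vs baseline {sum(baseline)}")   # 469 vs 577
print(f"final score: {sum(scores) / len(scores):.1%}")      # ~77%
```

Because each level is capped at 1.0, big wins on easy levels can't offset a couple of badly blown levels.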

@nostream to be fair, for ARC-AGI-2 they also said at least 2 humans solved each task, so there is some level of consistency. Even though they claim the human baseline is 100%, the average human score was ~60%, and that was passed 3-4 months ago.

Ideally this market would work similarly.

Initially I was a big hater of the ARC-AGI-3 scoring methodology, namely:

  1. The shift from task completion to efficiency.

  2. First try only; it doesn't allow for learning from failure.

  3. No harness (I heard they already get like 97.4% with a harness, so the benchmark is easily gameable).

  4. The games themselves are super contrived: again no vision, just a massive JSON blob and an unhelpful system prompt.

  5. The weird quadratic penalty; the cap at 5x slower and 1x faster makes it even more contrived (see the sketch after this list).
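To make that assumed quadratic penalty concrete (same inferred shape as the sketch earlier in the thread; the published formula may differ), here is what it yields at a few speeds:

```python
# Assumed quadratic penalty from the thread's description: 1.0 at or below
# baseline pace, 0.0 at 5x baseline or slower, quadratic in between.
def penalty(r: float) -> float:
    return 1.0 if r <= 1.0 else max(0.0, 1.0 - ((r - 1.0) / 4.0) ** 2)

for r in (0.5, 1, 2, 3, 4, 5):
    print(f"{r}x baseline actions -> {penalty(r):.1%}")
# 0.5x -> 100.0%, 1x -> 100.0%, 2x -> 93.8%, 3x -> 75.0%, 4x -> 43.8%, 5x -> 0.0%
```

Under that shape the penalty is mild out to roughly 2x baseline and then falls off fast, while the 1x cap on the fast side prevents averaging above 100%.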

Ideally we'd hear more from the labs themselves on whether they think this is a useful benchmark, the way they viewed ARC-AGI-1 and -2.

Anyway, in their defense:

"2nd-best" is likely out of ~10 humans, not ~500, since not everyone does every puzzle. I think it's like 90 minutes and $110, plus $5 for each task completed, so I'd assume they attempt ~30 per person, meaning they probably have a corpus of ~1500 such games.

So really this is like the 80th percentile of a slightly skewed sample (let's say the top quartile of humans), but a fair benchmark representing roughly the 90th to 95th percentile of human ability.

So the only remaining critique is that it over-indexes on gaming, which humans have a ton of experience with, whereas LLMs can do things like Claude Plays Pokémon and chess with a harness, and things like Dota with RLVR all the way back in 2018 or so. So I guess it follows the theme of ARC-AGI-1 and -2 of having an IQ test that hasn't leaked into the training data, forcing labs to hill-climb on this axis. I assume this would take no more than 1-2 years, assuming the no-harness rule isn't too strict for text-only LLMs, in which case it'd take maybe ~3 years. It will definitely be saturated in terms of task completion by EOY, but idk how big a barrier efficiency is; I expect between 4% and 25%, which is arguably around the human baseline / average.

opened a Ṁ5,000 NO at 66% order

5k on NO at 66%, any takers @Mochi @Bayesian @jim ?

@bens Idk what human baseline means, they do basically fraud to say humans get 100%, very unserious company

opened a Ṁ5,000 NO at 60% order

@Bayesian ermm, ya I'm not sure, I'd guess it's some sort of aggregate of the number of moves it takes or...? idk

Tbh, I think the "number of moves" thing is kind of unserious. This penalizes AI for not literally training on this precise dataset, because why in the world would an AI exploring a new virtual environment be like "I must complete this in as few moves as possible"?

@bens and that's for just one game, so I'd guess it's some sort of average across all games? idk

@Bayesian also do we hate ARC-AGI now? Idk the lore. Is Chollet unserious? I always thought the benchmark was cool but that the scoring and leaderboard were fairly inscrutable.

@bens the benchmark itself is pretty good! The claims made by Chollet et al., their communication around the human baseline, the analysis of what the benchmarks are "really testing", etc. are often extremely bad and unserious.

What's the human score?