Which lab will be the first to reach 50% on ARC-AGI-3?
Dec 31
19% OpenAI
31% Anthropic
27% Google
23% Other

Based on the verified leaderboard at https://arcprize.org/leaderboard. Same concept as https://manifold.markets/EchoNolan/when-will-the-first-model-reach-50, but along the opposite dimension (which lab, rather than when).

I will likely extend the close date until this is achieved.

  • Update 2026-04-14 (PST) (AI summary of creator comment): The creator is considering that scaffolds and non-LLMs may not count for resolution. The current inclination is that only relatively "plain" LLM models would qualify, meaning:

    • Third-party scaffolds around existing models may not count

    • Non-traditional LLMs may not count

No final decision has been made yet; the creator is open to suggestions from traders.

  • Update 2026-04-15 (PST) (AI summary of creator comment): The creator is concerned that third-party harnesses/scaffolds (e.g., a trivial harness around an existing model) should not count for resolution, as this would turn the question into "which lab submits a harness to the verification process first." Only solutions on the verified leaderboard at arcprize.org/leaderboard are in-bounds for resolution. The spirit of the market is the first relatively plain LLM model reaching 50%, not a scaffolded solution.


Hmm, I realize that it is unclear from the resolution criteria (and frankly unclear to my own intuitive understanding of the question) how this should resolve if the first verified solution over 50% is:

  • Not a traditional LLM at all?

  • A third-party scaffold around an existing model?

I can see arguments for these results not counting and only a “pure” model counting. I can see arguments that third-party scaffolds should be allowed and resolve to “Other”, or equally to the lab of the underlying model.

I dunno.

I think I am currently inclined to say scaffolds and non-LLMs don’t count, and the spirit of the market is the first relatively “plain” LLM model, but I realize that’s probably contentious given I argued against this exact interpretation in a previous similar market 😛

@traders no decision yet, I’m open to suggestions.

@eapache I don't know how you could possibly define "scaffolding" here in an objective way. I would ignore all "scaffolding" objections unless it's shown that the solution was only possible because the private ARC-AGI-3 questions leaked. Otherwise, whichever lab's model first reaches 50% should count as the answer, ignoring everything else.

I would advocate for keeping things simple: leave https://arcprize.org/leaderboard as the only arbiter of resolution, and use the underlying model(s) to determine which "lab" reached 50% first (split among the winners if several reach it at the same time). They've said in communications that the point of ARC-AGI-3 is to test general capabilities, so they won't be running it against specialized scaffolds at all, but if they end up going back on that they'll probably have had a good reason to do it? I'd rather this just reflect top AI capabilities that are sufficiently not-gotchas to be judged by the ARC Prize team as counting.
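For anyone who wants the "split among the winners" rule spelled out, here is a rough Python sketch of how it could work. Everything in it is an assumption for illustration: the `Entry` fields, the `resolve` function, and the idea that each leaderboard entry carries a verification date are all hypothetical, and nothing here reflects how arcprize.org actually publishes results or how the creator will resolve.

```python
from dataclasses import dataclass
from datetime import date

# Market answer options other than the catch-all "Other".
NAMED_LABS = ("OpenAI", "Anthropic", "Google")


@dataclass(frozen=True)
class Entry:
    """Hypothetical verified leaderboard row (field names are assumptions)."""
    lab: str            # lab that trained the underlying model
    score: float        # verified ARC-AGI-3 score, in percent
    verified_on: date   # assumed: date the result was verified


def resolve(entries, threshold=50.0):
    """Return {answer: share} for the earliest entries at or above threshold.

    Labs outside NAMED_LABS map to "Other". If several answers qualify on
    the same earliest date, the resolution is split equally among them.
    An empty dict means nothing has reached the threshold yet.
    """
    qualifying = [e for e in entries if e.score >= threshold]
    if not qualifying:
        return {}
    first_date = min(e.verified_on for e in qualifying)
    winners = {
        e.lab if e.lab in NAMED_LABS else "Other"
        for e in qualifying
        if e.verified_on == first_date
    }
    share = 1 / len(winners)
    return {answer: share for answer in winners}
```

With this rule, two labs verified above 50% on the same day would each get a 50% share, and a later, higher score from a third lab would not change the outcome.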


My original comment was prompted by https://blog.alexisfox.dev/arcagi3 which uses Opus 4.6 to score ~80% on the public set using a fairly trivial harness. It’s not verified (yet?) so it’s not in-bounds for resolution (yet?) but I think accepting solutions like this would turn the question into “which lab submits a benchmark-specific harness to the verification process first” which was not at all the spirit of the question…

@eapache the public set is much easier than the private set, and idk if they'll even verify these, nor would they show up in the leaderboard (which is about the semi-private set). You could resolve based on which qualifying model (one that hit 50%) was released first, to avoid the problem of one harness simply getting evaluated first. But fwiw none of this is imo likely to come up; maybe simpler to explicitly ban third-party scaffolds.

Big labs will not abandon their stack to pursue a pure path. They will layer verification on top of generation and reduce friction incrementally. A clean solution is more likely to emerge from a group not constrained by the current paradigm, and then get absorbed.