When will the first model reach 50% on ARC-AGI-3?

MANIFOLD

January 31, 2028

1.1%

Before July 1 2026

86%

Before January 1 2027

97%

Before July 1 2027

98%

Before January 1 2028

98%

Before July 1 2028

98.1%

Before January 1 2029

Any time after January 1 2029

I'll use the verified leaderboard at https://arcprize.org/leaderboard as the authoritative source of scores - claims that aren't officially verified don't count for this market.
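To make the resolution rule concrete, here is a minimal sketch of how the date of the first verified 50% score could map onto the answers above. It assumes the "Before ..." answers are cumulative (every cutoff later than the date resolves YES) and that a score landing exactly on January 1 2029 counts as "after"; neither assumption is stated in the description. The names `CUTOFFS`, `AFTER_LABEL`, and `resolve` are hypothetical.

```python
from datetime import date

# Cutoff dates copied from the answer list above.
CUTOFFS = [
    ("Before July 1 2026", date(2026, 7, 1)),
    ("Before January 1 2027", date(2027, 1, 1)),
    ("Before July 1 2027", date(2027, 7, 1)),
    ("Before January 1 2028", date(2028, 1, 1)),
    ("Before July 1 2028", date(2028, 7, 1)),
    ("Before January 1 2029", date(2029, 1, 1)),
]
AFTER_LABEL = "Any time after January 1 2029"


def resolve(first_hit: date) -> dict:
    """Map the date a verified score first reaches 50% to YES/NO per answer.

    Assumption: "Before X" answers are cumulative, so each one resolves YES
    when first_hit falls strictly before its cutoff.
    """
    outcome = {label: first_hit < cutoff for label, cutoff in CUTOFFS}
    # Assumption: on or after the last cutoff counts as "after".
    outcome[AFTER_LABEL] = first_hit >= CUTOFFS[-1][1]
    return outcome


if __name__ == "__main__":
    # Hypothetical date, for illustration only.
    for label, value in resolve(date(2026, 12, 15)).items():
        print(f"{label}: {'YES' if value else 'NO'}")
```

Only scores shown on the verified leaderboard would feed `first_hit`; self-reported numbers would not.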

Update 2026-04-14 (PST) (AI summary of creator comment): The ARC-AGI-3 scoring rules have changed since this market was created. The creator will resolve based on the new (more generous) scoring rules as reflected on the public leaderboard, since the original scoring data is not publicly available. Key changes to scoring:
- Per-level baseline now uses median human player (previously 2nd-best player)
- Per-level score cap increased from 100% to 115%
- Net effect: scores increase by about 0.5pp for both humans and AI

According to Anthropic, Opus 5 scores 30.2% on the benchmark, approximately tripling the previous high score. See https://www.anthropic.com/news/claude-opus-5

There have been some changes to the scoring:

Based on what we’ve observed, we’re announcing two updates to ARC-AGI-3 scoring:
- The per-level baseline is now less sensitive to outlier performances, reducing the impact of luck on individual levels. A single unusually efficient human run no longer defines the baseline for ARC-AGI-3 scoring. Rather, the baseline now reflects more typical human play. Technical change: the human baseline which normalizes scores moves from 2nd-best player to median player per level.
- A single subpar level no longer disproportionately drags down an overall score. A test taker who generalizes well across an entire environment is no longer penalized by a single subpar level. Technical change: per-level score cap increases from 100% to 115%.
The net result of these changes is a marginal increase in scores for both humans and AI (+0.5pp), which better reflects our desire to fairly compare efficiency between test takers.
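
For intuition, here is a rough sketch of how the two changes could interact on a single level. The announcement does not give the scoring formula, so this assumes a per-level score of baseline action count divided by the test taker's action count, clipped at the cap, with the overall score taken as the mean over levels. The function names and all action counts are made up for illustration.

```python
from statistics import median


def level_score(agent_actions, human_actions, baseline="median", cap=1.15):
    """Per-level score under an ASSUMED formula: baseline / agent, capped.

    human_actions: action counts of human players on this level
    (fewer actions = more efficient). Requires at least one entry.
    """
    ordered = sorted(human_actions)
    if baseline == "second_best":
        # Old rule: 2nd-most-efficient human run sets the baseline.
        base = ordered[1] if len(ordered) > 1 else ordered[0]
    else:
        # New rule: median human run sets the baseline.
        base = median(ordered)
    return min(cap, base / agent_actions)


def overall_score(level_scores):
    """Assumed aggregation: plain mean of the per-level scores."""
    return sum(level_scores) / len(level_scores)


if __name__ == "__main__":
    humans = [20, 30, 40, 50, 60]  # hypothetical human action counts
    agent = 50                     # hypothetical AI action count
    old = level_score(agent, humans, baseline="second_best", cap=1.00)
    new = level_score(agent, humans, baseline="median", cap=1.15)
    print(f"old rules: {old:.2f}, new rules: {new:.2f}")
```

The toy numbers exaggerate the effect. The announcement reports a net change of only about +0.5pp on real data.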

https://arcprize.org/blog/arc-agi-3-human-dataset

I'm pretty annoyed by this. The literal reading of this market's title and description says that it resolves based on the public leaderboard, but what the public leaderboard means has changed since the market was created, and people have made trades based on the original meaning. If it were straightforward, I think I'd resolve this based on the original scoring rules and add a note. But AFAICT there's no way to see what scores would be under the original rules (they don't publish the results on private environments), so I guess we're going with the new (more generous) scoring rules now.