MANIFOLD
Will a mainstream AI model pass the stick figure arrow name test in 2026?
resolved Mar 16
Resolved
YES

Based on this tweet from Spencer Schiff:

https://x.com/spencerkschiff/status/1910106368205336769?s=46&t=62uT9IruD1-YP-SHFkVEPg

and the market by @bens:

/bens/will-a-mainstream-ai-model-pass-the

Resolves YES if a mainstream AI model can repeatedly solve this benchmark, using this image and other similar ones I draw. Currently, only Google, OpenAI, Anthropic, xAI, DeepSeek and Meta count as mainstream, but if some other lab becomes similarly mainstream, its models will count as well. I will be the judge of this.

“Solving the benchmark” means being able to match the names to the colors of the stick figures, repeatedly and with simple prompts.

I will pay for API credits if needed (limit: ~$20 per model). I will prompt each AI model a few times (6) and see whether it succeeds at least 4 times out of 6, but I won't run the same model many times and use some kind of cons@32 technique to aggregate multiple attempts.
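
As a rough illustration of the 4-out-of-6 rule above, here is a minimal Python sketch. It is not the market creator's actual procedure: the function names are hypothetical, and it assumes that a single attempt only counts as a success when every name is matched to the correct color, which the description does not spell out.

```python
from typing import Mapping, Sequence


def attempt_succeeds(answer: Mapping[str, str], truth: Mapping[str, str]) -> bool:
    """Return True if every name in `truth` is matched to the right color.

    Assumption (not stated in the market text): one wrong or missing
    name makes the whole attempt a failure. Comparison ignores case
    and surrounding whitespace.
    """
    return all(
        answer.get(name, "").strip().lower() == color.strip().lower()
        for name, color in truth.items()
    )


def model_passes(attempts: Sequence[bool], required: int = 4, total: int = 6) -> bool:
    """Apply the stated threshold: at least `required` successes in `total` prompts."""
    if len(attempts) != total:
        raise ValueError(f"expected {total} attempts, got {len(attempts)}")
    return sum(attempts) >= required
```

Each model gets exactly six independent prompts and the successes are simply counted; there is no majority voting over many samples, which is what separates this from a cons@32-style aggregate.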

The image for reference in case the tweet is deleted:

Models in 2025 could not solve this problem.


🏅 Top traders

#  Trader  Total profit
1  Ṁ548
2  Ṁ400
3  Ṁ269
4  Ṁ203
5  Ṁ200

anyone try gpt5.4? or no reason to?

gpt-5.4 extended thinking

Bob = blue

Jack = green

Jimmy = beige / light tan

Tom = yellow
Adam = pink

2/5

ok my brother tried it with gpt 5.4 pro and it succeeded 4 out of 4 times

@Bayesian well dang, i was just thinking about this today as something LLMs can't do yet™

This works pretty well.

https://g.co/gemini/share/2f3ec6e34679

However, sometimes it mixes two up.

This doesn't seem intractable relative to other things like reasoning, coding, etc. If the labs put a few months of research effort into vision reasoning, I think it would basically be solved. It mostly comes down to whether stuff like open-ended RL or continual learning ends up being a higher priority (I think it might be, since those seem to have a quicker payoff for AC or SAR).
