Will a mainstream AI model pass the stick figure arrow name test in 2025?

Dec 31

31% chance


Based on this tweet from Spencer Schiff:

https://x.com/spencerkschiff/status/1910106368205336769?s=46&t=62uT9IruD1-YP-SHFkVEPg

and the market by @bens:

/bens/will-a-mainstream-ai-model-pass-the

(Note: the main difference with that market is that the models I accept do not need to be free to use, but can cost reasonable amounts of money, typical of current frontier models on the API.)

Resolves YES if a mainstream AI model (Currently, only Google, OpenAI, Anthropic, xAI, DeepSeek and Meta count, but if some other lab becomes similarly mainstream they will count as well) can repeatedly solve this benchmark, using this image and other similar ones I draw. I will be the judge of this.

“Solving the benchmark” means being able to match the names to the colors of the stick figures, repeatedly and with simple prompts.

I will pay for API credits if needed (limit: ~$20 per model). I will prompt each AI model 6 times and count it as passing if it succeeds at least 4 times out of 6. I won't run the same model many more times and use some kind of cons@32 technique to aggregate the attempts.
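As a rough illustration of the 4-of-6 rule above, here is a minimal Python sketch. The names, colors, and function names are hypothetical placeholders, and it assumes an attempt only counts as a success when every name is matched to the right color; the market creator remains the actual judge.

```python
REQUIRED_SUCCESSES = 4  # "at least 4 times out of 6", from the description
ATTEMPTS = 6


def attempt_is_correct(answer, truth):
    """Assumption: an attempt succeeds only if every name maps to the right color."""
    return answer == truth


def model_passes(attempts, truth):
    """Apply the 4-of-6 rule to a list of name -> color answers."""
    if len(attempts) != ATTEMPTS:
        raise ValueError(f"expected {ATTEMPTS} attempts, got {len(attempts)}")
    successes = sum(attempt_is_correct(a, truth) for a in attempts)
    return successes >= REQUIRED_SUCCESSES


# Hypothetical ground truth and answers, for illustration only.
truth = {"Ann": "red", "Bob": "blue", "Cat": "green"}
swapped = {"Ann": "blue", "Bob": "red", "Cat": "green"}  # one pair switched

print(model_passes([truth] * 4 + [swapped] * 2, truth))  # True
print(model_passes([truth] * 3 + [swapped] * 3, truth))  # False
```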

The image for reference in case the tweet is deleted:

…and here is what ChatGPT gives @bens now:

…and Gemini 2.5

I will not bet in this market to remain objective. (Edit: I forgot I had written this and got a position. If the outcome is ambiguous, an uninvolved third party may determine resolution.)

Update 2025-04-16 (PST) (AI summary of creator comment): Prompt Guidelines Clarification:
- Prompt Length: Use roughly one short sentence to describe the desired task.
- Allowed Adjustments: Small amounts of prompt engineering (for example, suggesting the use of a built-in tool) are acceptable.
- Disallowed Complexity: Extensive multi-step instructions (e.g. a 20-step breakdown) are not permitted.




🤖

Meowdy! Gemini 3’s mixed results mean this ain’t a done deal yet. I’ll paws and revisit the data tonight, but it looks like a tricky benchmark that teases AI’s limits. Stay tuned for my next whisker-twitching update!

Gemini 3 got it for me

@yaakovgrunsfeld One-shotted it for me as well!

And then totally failed the second time with the same prompt. Womp womp I guess.

I was about to write a comment scolding someone for trying the test with the same figure over and over, only to realize that's how the market is phrased.

JFC people, 2.5y into the LLM revolution and we still don't realize that they can memorize the answer to any one question? Especially if tons of people spam the exact same question over and over, it will make it into training data?

The correct way to run this eval is to have a way to generate random stick figure / arrow diagrams. It shouldn't be too hard to vibe code an app for this.
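A minimal sketch of such a generator, assuming SVG output is acceptable: it draws names on the left, colored stick figures on the right, and a curved arrow from each name to a randomly assigned figure, and it returns the ground-truth mapping alongside the image. The names, colors, layout, and function names are all hypothetical, and the drawing style is not meant to reproduce the original image.

```python
import random

NAMES = ["Ann", "Bob", "Cat", "Dan", "Eve"]            # hypothetical names
COLORS = ["red", "blue", "green", "orange", "purple"]  # hypothetical colors

ARROW_DEFS = (
    '<defs><marker id="arrow" viewBox="0 0 10 10" refX="9" refY="5" '
    'markerWidth="7" markerHeight="7" orient="auto">'
    '<path d="M 0 0 L 10 5 L 0 10 z"/></marker></defs>'
)


def stick_figure(x, y, color):
    """Return SVG markup for a stick figure whose head is centered at (x, y)."""
    return (
        f'<g stroke="{color}" stroke-width="3" fill="none">'
        f'<circle cx="{x}" cy="{y}" r="8"/>'
        f'<line x1="{x}" y1="{y + 8}" x2="{x}" y2="{y + 30}"/>'
        f'<line x1="{x - 10}" y1="{y + 16}" x2="{x + 10}" y2="{y + 16}"/>'
        f'<line x1="{x}" y1="{y + 30}" x2="{x - 8}" y2="{y + 44}"/>'
        f'<line x1="{x}" y1="{y + 30}" x2="{x + 8}" y2="{y + 44}"/>'
        "</g>"
    )


def make_puzzle(seed):
    """Return (svg_text, truth), where truth maps each name to a figure color."""
    rng = random.Random(seed)
    assigned = COLORS[:]
    rng.shuffle(assigned)            # assigned[i] is the color NAMES[i] points to
    truth = dict(zip(NAMES, assigned))
    figure_rows = COLORS[:]
    rng.shuffle(figure_rows)         # top-to-bottom order of the figures

    parts = ['<svg xmlns="http://www.w3.org/2000/svg" width="500" height="400">',
             ARROW_DEFS]
    for row, color in enumerate(figure_rows):
        parts.append(stick_figure(440, 30 + 70 * row, color))
    for i, name in enumerate(NAMES):
        y0 = 50 + 70 * i
        y1 = 50 + 70 * figure_rows.index(truth[name])
        # Random control points bend the arrows so that they cross each other.
        c1, c2 = rng.randint(20, 380), rng.randint(20, 380)
        parts.append(f'<text x="10" y="{y0 + 5}" font-size="16">{name}</text>')
        parts.append(
            f'<path d="M 60 {y0} C 200 {c1}, 300 {c2}, 420 {y1}" '
            'stroke="black" fill="none" marker-end="url(#arrow)"/>'
        )
    parts.append("</svg>")
    return "\n".join(parts), truth


svg_text, truth = make_puzzle(seed=1)
with open("puzzle_1.svg", "w") as f:
    f.write(svg_text)
print(truth)
```

Because each seed produces a fresh assignment, a model cannot pass by having memorized the answer to one widely shared image, and the returned mapping can be compared against the model's answer automatically.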

@pietrokc This is not how the market is phrased, there is an explicit "using this image and other similar ones I draw". Presumably, makes no sense to try with other images if models don't reliably get it correct for the single example, but I would expect @Bayesian to check some other images of this type should the main example be solved reliably.

I'll try with the new gpt-5.1-pro

@Dulaman I'm also wondering if the new codex max model can do this. Would that meet the criteria if it ran for 4 hours on my machine writing python scripts and got the right answer?

opened a Ṁ200 YES at 34% order

Gemini 3 through https://gemini.google.com/app works reliably. I uploaded the image and used the prompt "Match the names to the stick figures by color."

@ShankarSivarajan doesn't work reliably for me

https://x.com/peterwildeford/status/1990830603239842080?s=20

Gemini 3 gets it in the app. Every time the same prompt: "Match each name to the color of the stick figure its arrow points to"

I ran it 3 times. Here are my multiple chats:
https://gemini.google.com/share/0e05f16bc3dd
https://gemini.google.com/share/87b6b21e8f6b
https://gemini.google.com/share/5ab928a71f4e

Then I got rid of my system prompt. Mixed. Here are those 4 chats:
Fail - https://gemini.google.com/share/970a9348ed20

Success - https://gemini.google.com/share/f40ceb342033

Success - https://gemini.google.com/share/c0f4246b54fa

Fail - https://gemini.google.com/share/f7d6186a3b16

bought Ṁ150 YES

@Bayesian

@JaySocrates yeah ig this is sufficient for YES resolution. PS it was public knowledge that Gemini 3.0 could solve the problem before this market was created

bought Ṁ30 YES

@JaySocrates what system prompt are you using?

does it meet this criteria?

@Dulaman not at all, I have a long system prompt

@JaySocrates Replication: with all personal context off, in-browser, Gemini 3 succeeded in 9 of 20 trials. Of the 11 failures, 7 switched Jack/Jimmy, 4 were wild. Certainly doesn't feel like a yes resolution to me. May prompt-engineer this over the weekend. I expect that for Gemini 3, 3-4 sentences might get a consistent correct answer but 2 would require a very special prompt.
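For a sense of what 9 of 20 implies for the 4-of-6 resolution rule, here is a small back-of-the-envelope sketch. It assumes the trials are independent and that 9/20 is a fair estimate of the per-attempt success rate; both are simplifications, and `prob_at_least` is just an illustrative helper name.

```python
from math import comb


def prob_at_least(k, n, p):
    """Binomial probability of at least k successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


p_hat = 9 / 20  # observed success rate from the replication above
print(round(prob_at_least(4, 6, p_hat), 3))  # 0.255
```

Under those assumptions, a model that succeeds 45% of the time would clear a single 6-prompt run roughly one time in four.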

bought Ṁ100 NO

Fails for me 2/2 times with Gemini 3 in AI Studio. One was way off; the other gave the same answers as gpt-5-pro below. (Overall, abstract image understanding in my tests is coming in below what benchmarks suggested.)

@Usaar33 same, I wonder if gemini deep think will get it

@ian I'm pessimistic. The model doesn't know it is wrong (because it can't "see correctly"), so why would thinking help more?

jim can't even begin to comprehend what goes through a NO holder's mind

@jim it's almost not 2025 anymore and it hasn't happened yet.

I have a feeling gemini 3 may be able to do this

gpt-5-pro fails:

@Dulaman This one is close, only one swap off

@Dulaman wtf that's so close

@Bayesian It's NP-hard, it should take 12 minutes to solve so AI is now AGI because I said so with my perfect logic

@121 clearly this can only be solved by quantum ASI

thought process: https://imgur.com/a/ARsZoII

@Bayesian woah I'm actually kind of impressed that it can almost do it?! it's almost more impressive with 2 switched than it would be if it got it perfect