Will a mainstream AI model pass the stick figure arrow name test in 2025?

1kṀ5673

Dec 31

61% chance

Based on this tweet from Spencer Schiff:

https://x.com/spencerkschiff/status/1910106368205336769?s=46&t=62uT9IruD1-YP-SHFkVEPg

and the market by @bens:

/bens/will-a-mainstream-ai-model-pass-the

(Note: the main difference with that market is that the models I accept do not need to be free to use, but can cost reasonable amounts of money, typical of current frontier models on the API.)

Resolves YES if a mainstream AI model (Currently, only Google, OpenAI, Anthropic, xAI, DeepSeek and Meta count, but if some other lab becomes similarly mainstream they will count as well) can repeatedly solve this benchmark, using this image and other similar ones I draw. I will be the judge of this.

“Solving the benchmark” means being able to match the names to the colors of the stick figures, repeatedly and with simple prompts.
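A minimal sketch of how a single attempt could be graded, assuming an attempt counts as solved only when every name is paired with the correct color. The market text does not spell out partial credit, and in practice the creator judges each attempt by hand. The function name, the names, and the colors below are hypothetical placeholders, not the ones in the actual image.

```python
def attempt_solved(answer: dict[str, str], truth: dict[str, str]) -> bool:
    """Return True if the model's name -> color mapping matches the answer key.

    Assumption: exact match on every name, ignoring case and surrounding spaces.
    """
    def normalize(mapping: dict[str, str]) -> dict[str, str]:
        return {k.strip().lower(): v.strip().lower() for k, v in mapping.items()}

    return normalize(answer) == normalize(truth)


# Hypothetical answer key for one drawing.
truth = {"Alice": "red", "Bob": "blue", "Carol": "green"}

# All three names matched to the right colors: solved.
print(attempt_solved({"alice": "Red", "bob": "Blue", "carol": "Green"}, truth))

# Two names swapped: not solved.
print(attempt_solved({"Alice": "blue", "Bob": "red", "Carol": "green"}, truth))
```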

I will pay for API credits if needed (limit: ~$20 per model). I will prompt each AI model a few times (6) and see if it succeeds at least 4 times out of 6, but I won't run the same model many more times and use some kind of cons@32 technique to aggregate multiple attempts.
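The 4-out-of-6 rule can be sketched as follows. This is an illustration, not the creator's actual procedure: the function names are hypothetical, and the probability calculation assumes the six attempts are independent with the same per-attempt success rate, which may not hold for real models.

```python
from math import comb

ATTEMPTS = 6   # prompts per model, from the market description
REQUIRED = 4   # successes needed out of those 6


def model_passes(results: list[bool]) -> bool:
    """Apply the threshold to one model's six graded attempts."""
    if len(results) != ATTEMPTS:
        raise ValueError(f"expected {ATTEMPTS} attempts, got {len(results)}")
    return sum(results) >= REQUIRED


def pass_probability(p: float) -> float:
    """Chance of at least REQUIRED successes in ATTEMPTS independent tries.

    Assumption: each attempt succeeds independently with probability p.
    """
    return sum(
        comb(ATTEMPTS, k) * p**k * (1 - p) ** (ATTEMPTS - k)
        for k in range(REQUIRED, ATTEMPTS + 1)
    )


print(model_passes([True, True, False, True, True, False]))  # 4 of 6 succeed
print(pass_probability(0.5))  # 22/64 = 0.34375
```

Under these assumptions, a model that gets the image right half the time would clear the bar only about a third of the time, so the threshold favors models that solve the task reliably.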

The image for reference in case the tweet is deleted:

…and here is what ChatGPT gives @bens now:

…and Gemini 2.5

I will not bet in this market to remain objective.

Update 2025-04-16 (PST) (AI summary of creator comment): Prompt Guidelines Clarification:
- Prompt Length: Use roughly one short sentence to describe the desired task.
- Allowed Adjustments: Small amounts of prompt engineering (for example, suggesting the use of a built-in tool) are acceptable.
- Disallowed Complexity: Extensive multi-step instructions (e.g. a 20-step breakdown) are not permitted.


@ian @rohanvisme any news?

@Bayesian Just saw the tweet you posted 5 days ago

@Bayesian nada but given the timeline and utility of the task, I'd figure someone in a lab will take it upon themselves to fix this (just like strawberry)

Also I'm very bored.

https://www.reddit.com/r/OpenAI/comments/1k0z2qs/o3_thought_for_14_minutes_and_gets_it_painfully/

https://x.com/lukeprog/status/1912592191282712777

Hey @Bayesian, how "simple" does the prompt need to be? Can there be a small amount of prompt engineering, or like NONE?

@bens the wording is not set in stone, but it should be roughly one short sentence explaining the desired task. so pretty simple, but i will accept small amounts of prompt engineering suggestions. for example, if in a year the llm obviously has the capability if you mention to the ai "oh btw use your 'follow-arrows' tool", i will mention that to the ai if that helps. but no like 20 step plan where you explain how it should break down the problem or wtv

i am begging someone to try o3 on this

@Bayesian on it

@Bayesian https://chatgpt.com/share/67fff082-0704-800f-91ab-cb90cf0b4841

@Bayesian o3 reasoned for 14 minutes about this, during which it did a lot of cropping and zooming in on different parts of the image. I could see from the Reasoning text which flashed by that it had made substantial errors, for example "these two loops aren't part of any arrows". After all that it returned

"To="

and nothing else. I'm going to try again with a tweaked prompt but so far it feels like watching Claude play pokemon.

@Bayesian oops ya I tried o3 and o4-mini right away and did some prompt engineering but could get neither to do anything intelligent with this. I think that non-gestalt image processing is like… not at all native for these things, and that the zoom tool helps only insofar as it lets it do gestalt image processing for smaller subsections of the image.