Will a mainstream AI model pass the stick figure arrow name test in 2025?
49
1kṀ3680
Dec 31
61% chance

Based on this tweet from Spencer Schiff:

https://x.com/spencerkschiff/status/1910106368205336769?s=46&t=62uT9IruD1-YP-SHFkVEPg

and the market by @bens:

/bens/will-a-mainstream-ai-model-pass-the

(Note: the main difference with that market is that the models I accept do not need to be free to use, but can cost reasonable amounts of money, typical of current frontier models on the API.)

Resolves YES if a mainstream AI model can repeatedly solve this benchmark, using this image and other similar ones I draw. Currently, only Google, OpenAI, Anthropic, xAI, DeepSeek and Meta count as mainstream, but if some other lab becomes similarly mainstream, its models will count as well. I will be the judge of this.

“Solving the benchmark” means being able to match the names to the colors of the stick figures, repeatedly and with simple prompts.

I will pay for API credits if needed (limit: ~$20 per model). I will prompt each AI model 6 times and see if it succeeds at least 4 times out of 6, but I won't be running the same model many more times and using some kind of cons@32 technique to aggregate multiple attempts.
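As a rough illustration of the threshold described above, here is a minimal Python sketch. The function names and the boolean/list input formats are assumptions made for this example, not anything the market creator has specified; the judge's own assessment of each attempt is what actually counts. The second function shows what a consensus-style (cons@k) aggregation would look like, purely for contrast, since the description rules it out.

```python
from collections import Counter

# Numbers taken from the market description.
TOTAL_PROMPTS = 6
REQUIRED_SUCCESSES = 4


def passes_benchmark(attempt_results):
    """Return True if at least 4 of 6 independent attempts succeeded.

    attempt_results: list of 6 booleans, one per prompt, as judged by hand
    (hypothetical input format chosen for this sketch).
    """
    if len(attempt_results) != TOTAL_PROMPTS:
        raise ValueError(f"expected {TOTAL_PROMPTS} attempts")
    return sum(attempt_results) >= REQUIRED_SUCCESSES


def consensus_answer(answers):
    """cons@k-style aggregation: return the most common answer among k samples.

    Shown only for contrast; the market description says this kind of
    aggregation will NOT be used for resolution.
    """
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    print(passes_benchmark([True, True, False, True, True, False]))
    print(passes_benchmark([True, False, False, True, True, False]))
```

The difference matters: under the 4-of-6 rule each attempt is scored on its own, whereas consensus aggregation would let a model that is right only a plurality of the time still produce a single correct final answer.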

The image for reference in case the tweet is deleted:

…and here is what ChatGPT gives @bens now:

…and Gemini 2.5

I will not bet in this market to remain objective.

  • Update 2025-04-16 (PST) (AI summary of creator comment): Prompt Guidelines Clarification:

    • Prompt Length: Use roughly one short sentence to describe the desired task.

    • Allowed Adjustments: Small amounts of prompt engineering (for example, suggesting the use of a built-in tool) are acceptable.

    • Disallowed Complexity: Extensive multi-step instructions (e.g. a 20-step breakdown) are not permitted.


Hey @Bayesian how "simple" does the prompt need to be? Can there be a small amount of prompt engineering, or like NONE?

@bens the wording is not set in stone, but it should be roughly one short sentence explaining the desired task. so pretty simple, but i will accept small amounts of prompt engineering suggestions. for example, if in a year the llm obviously has the capability if you mention to the ai "oh btw use your 'follow-arrows' tool", i will mention that to the ai if that helps. but no like 20 step plan where you explain how it should break down the problem or wtv

i am begging someone to try o3 on this

@Bayesian on it

@Bayesian o3 reasoned for 14 minutes about this, during which it did a lot of cropping and zooming in on different parts of the image. I could see from the Reasoning text which flashed by that it had made substantial errors, for example "these two loops aren't part of any arrows". After all that it returned

"To="

and nothing else. I'm going to try again with a tweaked prompt but so far it feels like watching Claude play pokemon.

@Bayesian oops ya I tried o3 and o4-mini right away and did some prompt engineering but could get neither to do anything intelligent with this. I think that non-gestalt image processing is like… not at all native for these things, and that the zoom tool helps only insofar as it lets it do gestalt image processing for smaller subsections of the image.

bought Ṁ30 YES

o1-pro fails
