Based on this tweet from Spencer Schiff:
https://x.com/spencerkschiff/status/1910106368205336769?s=46&t=62uT9IruD1-YP-SHFkVEPg
and the market by @bens:
/bens/will-a-mainstream-ai-model-pass-the
(Note: the main difference from that market is that the models I accept do not need to be free to use; they can cost reasonable amounts of money, typical of current frontier models accessed via the API.)
Resolves YES if a mainstream AI model can repeatedly solve this benchmark, using this image and other similar ones I draw. (Currently only Google, OpenAI, Anthropic, xAI, DeepSeek, and Meta count as mainstream labs, but if another lab becomes similarly mainstream it will count as well.) I will be the judge of this.
“Solving the benchmark” means being able to match the names to the colors of the stick figures, repeatedly and with simple prompts.
I will pay for API credits if needed (limit: ~$20 per model). I will prompt each AI model six times and consider it a success if it solves the benchmark at least 4 times out of 6. I will not run the same model many times per attempt and aggregate the results with something like a cons@32 technique.
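To make the resolution rule above concrete, here is a minimal sketch of the pass/fail check. The `attempt_results` list and the `required` threshold are illustrative names, not part of the market; each entry stands for one independent prompt of the model, judged manually.

```python
def passes_benchmark(attempt_results, required=4):
    """Return True if the model succeeded on at least `required` attempts.

    attempt_results: list of booleans, one per independent prompt
    (6 attempts per model, per the market description). No cons@k
    aggregation is applied within a single attempt.
    """
    return sum(attempt_results) >= required

# A model that matched names to colors on 5 of 6 attempts passes:
print(passes_benchmark([True, True, False, True, True, True]))   # True
# A model that succeeded only 3 of 6 times does not:
print(passes_benchmark([True, False, False, True, True, False])) # False
```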
The image for reference in case the tweet is deleted:

…and here is what ChatGPT gives @bens now:

…and Gemini 2.5:

I will not bet in this market to remain objective.
Update 2025-04-16 (PST) (AI summary of creator comment): Prompt Guidelines Clarification:
Prompt Length: Use roughly one short sentence to describe the desired task.
Allowed Adjustments: Small amounts of prompt engineering (for example, suggesting the use of a built-in tool) are acceptable.
Disallowed Complexity: Extensive multi-step instructions (e.g. a 20-step breakdown) are not permitted.