Will OpenAI release a model which performs chain-of-thought on image tokens / visual scratchpad before 2026?
10
100Ṁ353
resolved Apr 18
Resolved
YES
40

Resolves as YES if OpenAI releases a model which performs chain-of-thought on image tokens or which uses a visual scratchpad before January 1st 2026

  • Update 2025-04-18 (PST) (AI summary of creator comment): Tool use for chain-of-thought is valid.

    • If the model produces chain-of-thought on image tokens using external tools, it meets the criteria.

    • The model does not need to generate the tokens directly; using its built-in tool capabilities is sufficient.

Get
Ṁ1,000
to start trading!

🏅 Top traders

#NameTotal profit
1Ṁ26
2Ṁ24
3Ṁ23
4Ṁ20
5Ṁ3
Sort by:

@traders I'm inclined to resolve this as YES

seems like o3 can do chain of thought on images using tools

@MalachiteEagle Probably the best reason not to resolve is that the images are generated/modified using tools rather than by the model directly. But you didn't mention this in the resolution criteria, so resolving as YES seems reasonable

@CDBiddulph yeah I was thinking of the model producing these tokens directly, but tool use does count. Especially since the model can do all of this out of the box.

Weijia Shi is an authority on this, given her publication record. If she says it's chain of thought on images then it's chain of thought on images 🤷

bought Ṁ30 YES

@traders I'm inclined to resolve this as YES Edit: no this isn't chain-of-thought. This is just in-context-learning.

input image:

output image:

It's in-context-learning the regular polygon drawing task

Like it's not writing its thoughts down. But I think someone might get gpt-4o to express reasoning steps in the image domain but just coming up with a simple image-based prompt to elicit it.

Human language is a linear, and isn't well suited to exploring multiple lines of reasoning in parallel. We have parentheticals and footnotes, but they're weak tools for it.

I don't know that images will be used, but I do think that at some point, there will be a format that:

  • Is in a "language" that an LLM created, expressly for the purpose of communicating with other AI that were not trained on the same datasets as itself. A machine Esperanto.

  • Uses more explicit context, and uses far fewer implicit contextual clues. Machines aren't limited in reading/writing speed like humans are, they don't need to lean on unstated assumptions in their writing.

  • Is 2D in nature. The language will support branching and exploring multiple ideas in parallel while all the "threads" of thought/communication use a shared context. Possibly 3D or higher. This threaded communication (and thought-process) can be authored in parallel.

For images in particular, sure, I think there will be a video-generation product that outputs a "storyboard". The storyboard gets refined with user input, and only then does the process start doing low-res preview renders for futher iteration. A product that works more like a movie production workflow, rather than only operating off of a script.

@DanHomerick seems reasonable. I think there are many different modalities that can be described, and it might make sense to define reasoning tokens for each modality. These can coexist with some universal token representation.

Guess we'll find out what works!

I think also a lot of this will change as the topology of the reasoning process itself stops being a linear sequence, and as the field moves away from discrete tokens. Things may look wildly different a decade from now.

© Manifold Markets, Inc.Terms + Mana-only TermsPrivacyRules