Will anyone get o1 to leak its hidden CoT?
109 traders · Ṁ16k volume · closes 2025 · 55% chance

The o1 model uses an internal CoT process, hidden from the user.

This market resolves YES if anyone can get o1 to leak some or all of a hidden CoT process. The reproduced parts of the CoT must be verbatim, rather than being summarized.

Market closes July 20th 2025, which is the deadline for the leak.

I won't bet.

Version of the market where the whole CoT needs to be leaked:

/singer/will-anyone-get-o1-to-leak-its-enti

bought Ṁ50 YES

I mean, if one word counts.. CHUNK 2

https://x.com/HumblyAlex/status/1834767948084281384

@Siebe What is "CHUNK.2"? The linked post doesn't answer.

@singer no idea! But likely something mentioned in the chain of thought. (This market isn't well thought through and any resolution will be controversial)

@Siebe Why would it be in the chain of thought? The examples from the o1 blogpost use natural language.

bought Ṁ50 NO

It's going to be more difficult than usual, given that they're aggressively enforcing their terms of use that disallow trying.

bought Ṁ100 NO from 62% to 56%
bought Ṁ100 NO

It’s not possible to make it leak its internal CoT, since it’s implemented in code and not as a system prompt. This should resolve as NO.

@JohnTackman it has to be possible, given that the model bases its response on the CoT. Whilst it's writing the response, the CoT must be there, in context, otherwise it wouldn't have anything to base its response on.

The difficulty is in getting it to recount it verbatim instead of just summarising or outputting the final conclusions (and doing so before having your service terminated for violating the terms of use, apparently).

@chrisjbillington I’d like a clarification on what you consider being “the cot”

@JohnTackman it's presumably tokens in English that are simply not exposed in the UI. It might not be, but whatever it is could presumably be represented in English and output: the model can read it in order to produce its response. It has to be some sort of token stream, and it's billed as such.
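
The "billed as such" part is checkable in the API: at o1's launch the usage object reported a count of hidden reasoning tokens. A minimal sketch using the openai Python client, assuming the usage field names as documented at launch:

```python
# Sketch: o1's hidden reasoning is billed as tokens, visible in the usage
# object even though the tokens themselves are never returned.
# Field names are as documented at o1's launch and may change.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="o1-preview",
    messages=[{"role": "user", "content": "How many primes are there below 100?"}],
)
print(response.choices[0].message.content)  # the visible answer only
# Hidden CoT tokens are counted and billed, but never shown:
print(response.usage.completion_tokens_details.reasoning_tokens)
```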

With a bit of insistence, you can get ChatGPT to repeat the hidden output that it uses to e.g. make calls to DALL-E to generate images. It turns out this is JSON containing your prompt and some other metadata like image size and number of images. I imagine the CoT is like that - probably regular tokens, you just don't get to see them.
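
For illustration, that hidden DALL-E payload looks roughly like the sketch below; the field names are approximate reconstructions of what the model echoes back, not a published spec:

```python
# Approximate shape of the hidden tool call ChatGPT emits for DALL-E,
# reconstructed from what the model can be coaxed into repeating;
# field names are illustrative, not a published spec.
dalle_call = {
    "prompt": "a watercolor lighthouse at dusk",  # the rewritten user prompt
    "size": "1024x1024",                          # requested image dimensions
    "n": 1,                                       # number of images
}
```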

Do you imagine it could be something else?

o1's system card (why is a 43-page document called a "card"?) certainly makes it sound like it's just regular text:

https://assets.ctfassets.net/kftzwdyauwt9/67qJD51Aur3eIc96iOfeOP/71551c3d223cd97e591aa89567306912/o1_system_card.pdf

bought Ṁ50 NO

@chrisjbillington given what I know of the reflection process that seems to be at the heart of how o1 differs from previous zero-shot answering models, the answer can't reproduce the intermediate steps verbatim, because they aren't in the context of the final response. Remains to be seen when we get more information on the workings.

@JohnTackman How is the final response written if the reasoning isn't in the model's context as it's writing the response?

@chrisjbillington when doing reflection-based reasoning you usually prune the context as you go, to save resources, make it faster, and avoid linear reasoning. The reflecting model has the “second part” of the conversation and sees the whole context, i.e. it's kind of like the model talking to another model that gives it feedback and develops the reasoning until a solution is found. This also enables the “dissection” of the thought process in retrospect, as OpenAI mentioned in the blindness statement.
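
A speculative sketch of the prune-as-you-go loop being described, with `generate` standing in for any model call (nothing here is OpenAI's actual implementation):

```python
# Speculative sketch of reflection with context pruning, as described above.
# `generate` is a placeholder for a real model call, not OpenAI's code.
def generate(prompt: str) -> str:
    raise NotImplementedError("placeholder for a real model call")

def solve_with_reflection(question: str, max_rounds: int = 5) -> str:
    notes = ""  # pruned context: only a running summary survives each round
    for _ in range(max_rounds):
        draft = generate(f"Question: {question}\nNotes: {notes}\nReason step by step.")
        critique = generate(f"Critique this reasoning:\n{draft}")
        if "no issues" in critique.lower():
            break
        # Prune: compress draft + critique into a summary, discarding the
        # verbatim intermediate steps - which is why they'd be absent from
        # the final answer's context.
        notes = generate(f"Summarise the key points of:\n{draft}\n{critique}")
    return generate(f"Question: {question}\nNotes: {notes}\nWrite the final answer.")
```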

@JohnTackman Yes, it could be pruned. There'll have to be something left that's not pruned though, which I imagine would be enough for this question.

people are claiming it happened already -> /Soli/will-anyone-be-able-to-get-openais but I tested and 4o returns the same response. It will be very hard to validate whether we actually got the chain-of-thought or the system message.

@Soli system messages are different from the reflection process

@singer By when?

@FergusArgyll When the market closes (updated description)

Where is this screenshot from?

@Soli the blogpost linked in the description: https://openai.com/index/learning-to-reason-with-llms/

bought Ṁ50 NO

Won't it be difficult to verify whether "leaked" CoTs are actually the real CoTs?

@CDBiddulph How have prompt leaks usually been verified?

@singer I guess if you do it multiple times in different ways and it's always the same, that's pretty strong evidence you got the right prompt. But the CoT will be different for every prompt

@CDBiddulph we'll see. I think it's just a difference of degree, though. The writing style of the CoT examples is highly distinctive. If a jailbreak produces the same kind of text without giving the model an example to imitate, that will be "strong evidence"
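
For a static system prompt, the repeated-trial check described above could be sketched as below (`attempt_leak` is a hypothetical callable wrapping one jailbreak attempt; there is no established protocol here):

```python
# Sketch of the repeated-trial verification idea from the thread.
# `attempt_leak` is a hypothetical callable, one jailbreak attempt per call.
from collections import Counter

def looks_like_real_leak(attempt_leak, trials: int = 10, threshold: float = 0.8) -> bool:
    """Identical text across independent trials is strong evidence for a
    static prompt. For a CoT this breaks down: the hidden text differs on
    every query, so verification has to lean on style instead."""
    counts = Counter(attempt_leak() for _ in range(trials))
    _, top_count = counts.most_common(1)[0]
    return top_count / trials >= threshold
```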

bought Ṁ50 NO

I'm guessing no, because I figure OpenAI won't put the hidden CoT in context after it's done answering a query. My theory is that the model outputs copyrighted text during internal CoT when it's asked questions about certain works and they're trying to avoid a lawsuit at all costs

@JaundicedBaboon My thoughts as well. If I were betting, I'd bet NO.

bought Ṁ50 NO

@JaundicedBaboon It may not be in context after it's answered a query, but it seemingly has to be in context whilst it's writing its answer, or there'd be no benefit to it.

@chrisjbillington That's true. I hadn't thought about getting it to leak its future CoT.