What will be true of o3?

1.6kṀ7138

resolved Apr 24

Resolved

YES

Releases in April

Resolved

YES

GPQA Diamond >= 80%

Resolved

YES

#1 on SimpleBench

Resolved

YES

Significantly better at creative writing than o1

Resolved

Releases in May

Resolved

Context window >=500k tokens

Resolved

General consensus it’s a stronger model than Gemini 2.5 Pro

Technology

OpenAI

LLMs

Get

1,000

to start trading!

🏅 Top traders

#	Name	Total profit
1		Ṁ139
2		Ṁ137
3		Ṁ91
4		Ṁ85
5		Ṁ83

People are also trading

Will o3.5 be released before 2026?

1% chance

Does O3 use the same base model as O1? (Conditional on any OpenAI confirmation)

43% chance

How much cheaper to use will o3-equivalent or better models get before 2026?

30 Comments

56 Holders

123 Trades

Sort by:

I asked o3, and it says there is not a strong consensus that it’s better than Gemini 2.5 Pro. That one resolves as NO.

Related markets for GPT-5, trying a new approach

Now how to resolve the Gemini one…

@gallerdude 😂😂😂 good luck. I believe most of people would agree that o3 is better. But there a no absolute consensus. Good luck :)

https://x.com/andrew_n_carr/status/1913073740885508195

I find it to be less good than gpt4.5 tbh, but better than o1 for sure

@MalachiteEagle I’ve tried it a few times, and it’s kind of slop. But r/locallama really likes it, I guess I concede it is better than o1.

Simplebench resolves true

o3 gets first place https://eqbench.com/creative_writing.html

https://x.com/sam_paech/status/1912747345370075215

https://simple-bench.com

o3 (high) new #1!

Related market for 4.1

Significantly better at creative writing than o1

@MalachiteEagle I would be surprised, the reasoning models kinda ass at creativity. I think I set it so anyone can add answers tho.

@gallerdude ah nah only you can add answers

@gallerdude I'm willing to bet that they got a limited version of the RL training working on soft targets like creative writing (for the o3 release)

https://x.com/sama/status/1908167790336651720

@MalachiteEagle added it.

idk in my mental model, success at nebulous tasks like creativity are hard for RL paradigms like the o-series to succeed at, in comparison to tasks like math or coding where you can check immediately whether they succeed or not.

Sama has the tweet about the creative writing model they’re working on, but I would imagine that has to do with being less assertive during post-training. Like @gwernbranwen talks about.

@gallerdude yes I agree that the naive RL training works primarily for domains that have strong verifiability. I think there are other things going on now though at the point they are at along the RL scaling curve. They've likely added more complex verifiers and there may be generalisation from code/math to domains like creative writing.

@gallerdude also o1 is terrible at creative writing so the bar is very low

@MalachiteEagle well now that we’ve both bet on this in opposite directions, what do we want the criteria to be 😂

I’m happy with like a manifold poll, or maybe during the presentation they specifically mention increased creative writing capabilities.

@gallerdude I would ask that this not be resolved immediately after they announce it. There are some creative writing benchmarks that are starting to get popular

https://x.com/omarsar0/status/1910325041343902198

https://eqbench.com/creative_writing_longform.html

Not sure they've benchmarked o1 yet though

Think this benchmark could be a good candidate:

https://x.com/LechMazur/status/1876301424482525439

https://github.com/lechmazur/writing/

bought Ṁ5 YES

@MalachiteEagle I'd also count:

Official statements from oai that o3 is much better at creative writing
Online buzz like what happened for 4.5

@gallerdude think it happened

https://x.com/sam_paech/status/1912747345370075215

@MalachiteEagle

My only beef with that benchmark is its LLMs rating other LLMs. Let's wait to see LMArena scores, or some other human rated benchmarks.

@gallerdude that would be consistent with them using LLMs as reward for creative writing tasks in the RL post training phase

@MalachiteEagle agree, but that does not mean it’s a great writer. am curious to see other benchmarks, but honestly creative writing is one of the hardest things to benchmark

@gallerdude the only objective way to resolve a market about "creative writing" is a benchmark on creative writing. The question should resolve True, unless there are different highly contradicting benchmarks scores claiming the opposite

People are also trading

Will o3.5 be released before 2026?

1% chance

Does O3 use the same base model as O1? (Conditional on any OpenAI confirmation)

43% chance

How much cheaper to use will o3-equivalent or better models get before 2026?

🏅 Top traders

People are also trading

People are also trading

Related questions