How much cheaper to use will o3-equivalent or better models get before 2026?

1kṀ4860

Dec 31

≥2x: 95%
≥5x: 83%
≥10x: 71%
≥30x: 55%
≥100x: 39%

Any model with publicly known benchmark scores and inference costs counts, not just OpenAI's o series.

I will consider a model to be "o3-equivalent or better" if it scores ≥25% on FrontierMath (o3 scored 25.2%) and performs similarly on other benchmarks.

(Note that, IIUC, o3's exact inference costs in the configuration used for benchmarking are currently unknown; this market description will be updated with exact figures if they become public. The market can still resolve without exact figures if, e.g., OpenAI announces an o4 that's "10x cheaper" for roughly the same performance.)
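As a rough illustration of the criteria above, here is a minimal sketch of how a claimed cost reduction maps onto the listed options. The helper names are hypothetical, and the sketch assumes each option is met whenever the reduction factor reaches its threshold. The "performs similarly on other benchmarks" condition is a judgment call and is not modeled.

```python
# Hypothetical helpers; only the thresholds and the 25% bar come from the market text.
THRESHOLDS = (2, 5, 10, 30, 100)   # the market's listed options (>=2x ... >=100x)
FRONTIERMATH_BAR = 25.0            # percent; o3 scored 25.2%


def meets_frontiermath_bar(score_percent: float) -> bool:
    """Check only the FrontierMath part of 'o3-equivalent or better'."""
    return score_percent >= FRONTIERMATH_BAR


def options_met(cost_reduction_factor: float) -> dict[int, bool]:
    """Map each threshold to whether the given cost reduction reaches it.

    Assumption: an option counts as met when factor >= threshold.
    """
    return {t: cost_reduction_factor >= t for t in THRESHOLDS}


if __name__ == "__main__":
    # The description's own example: a model announced as "10x cheaper".
    print(meets_frontiermath_bar(25.2))
    print(options_met(10.0))
```

Under these assumptions, a "10x cheaper" model would meet the ≥2x, ≥5x, and ≥10x options but not ≥30x or ≥100x.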

6 Comments

25 Holders

74 Trades

bought Ṁ40 YES

deepseek-r1-new is ~as good (but probably like a bit worse) and ~72x cheaper. nuts

Double scaling law, they just need to run the RL/CoT training for longer to get better perf with a more efficient model.

o4-mini might be cracked

this may be hard to resolve because the inference costs for specific benchmark performances or tasks can vary so much.

bought Ṁ10 YES

@JoshYou as a concrete example, let's say o4 costs the same per-token (for simplicity) and can achieve 25% on FrontierMath with 1/10 as many tokens as o3 did, but requires 1/5 as many tokens to match o3 on ARC-AGI.

What's worse, those ratios probably vary a lot depending on the performance thresholds with a given benchmark. For example, it's over 100x more expensive to get 88% on ARC-AGI with o3 than it is to get 76% on ARC-AGI with o3. So it could turn out that o4 is 5x cheaper than o3 at the 76% threshold, but over 100x cheaper at the 88% threshold.
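The hypothetical above can be worked through in a few lines. This is only a sketch: the token counts are arbitrary placeholder units, the per-token price is assumed equal for both models (as in the comment), and the function names are made up for illustration.

```python
def cost(tokens: int, price_per_token: float) -> float:
    """Total inference cost for a run, in arbitrary units."""
    return tokens * price_per_token


def cost_reduction(old_tokens: int, new_tokens: int,
                   old_price: float = 1.0, new_price: float = 1.0) -> float:
    """How many times cheaper the new run is than the old one."""
    return cost(old_tokens, old_price) / cost(new_tokens, new_price)


# Placeholder token budget for o3; only the ratios matter here.
O3_TOKENS = 1000

reductions = {
    # Hypothetical o4 needs 1/10 as many tokens to reach 25% on FrontierMath.
    "FrontierMath": cost_reduction(O3_TOKENS, O3_TOKENS // 10),
    # ...but 1/5 as many tokens to match o3 on ARC-AGI.
    "ARC-AGI": cost_reduction(O3_TOKENS, O3_TOKENS // 5),
}

conservative = min(reductions.values())  # 5x: the >=10x option would not be met
optimistic = max(reductions.values())    # 10x: the >=10x option would be met

if __name__ == "__main__":
    print(reductions, conservative, optimistic)
```

Whether the ≥10x option is met in this scenario depends entirely on which benchmark (or which aggregate, such as the minimum) is used, which is the ambiguity being pointed out.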

@JoshYou Hmmm... Yeah, there might be a relatively high chance of this resolving N/A when you put it that way, but I'll do what I can when the time comes.