Resolves when the o3 model is usable in the ChatGPT app (mobile or web) for people with a Plus-level subscription (which at the time of writing is $20/month).
@gallerdude isn't "before march" going to be the market that supplies the most informative content though?
@MalachiteEagle looks like people are betting that o3 will require a ChatGPT Pro subscription, at least initially (I disagree with those people)
@JoshYou yeah seems odd that they would only limit it to a tiny percentage of their users. The whole point of the new double scaling law paradigm is that if you continue the RL training then you get an improved model for the same inference-time compute cost. They can entirely release o3 with fairly low compute settings and just call that "o3", and have different names for higher compute tiers like "o3-pro" etc
@4fa oh, I see what you mean. I guess this’ll have to resolve in 2027. If it doesn’t resolve by then, it’s a NO.
interesting part of this is the cost to run:
https://arcprize.org/blog/oai-o3-pub-breakthrough
$7,000 to get 83% on the eval. That's definitely not coming to a $20/month subscription unless there's a breakthrough on compute or very stringent limits on usage.
@MalachiteEagle orders of magnitude more expensive. We will see but that’s a log scale on x with linear y.
@LiamZ sure, but the versions they tuned for arc agi are intended to use the available compute openai allocated for the benchmark evaluation. This is going to be very different from the product they will release to their users. It seems very unlikely that they will release a version of o3 that, on average, takes more than 3 orders of magnitude more compute to respond per task than their o1 release model
@LiamZ the version they release to their users will neither be the version tuned on arc agi, nor will it consume 7k dollars of compute per response
@MalachiteEagle I agree, I think it’s most likely we’ll see a cheaper “o3-mini” type model at entry subscription tier first and possibly only that at the $20/month price for quite a while. Even the “high efficiency” version here was $20 per task which is a lot more than o1 and they’re currently losing money on o1.
@LiamZ I think that's still only relevant to arc-agi for the 20 dollars per task. It's in no way representative of what the "full o3" will cost openai when they release it to users. Openai just threw a load of compute at the benchmark with different settings to produce a nice graph with numbers that look good.
@LiamZ the full o3 version they release to users can absolutely be tuned so it costs openai 10 cents per request for example
@LiamZ the advantage of running the RL training for longer is that they can get improved performance for the same inference-time compute settings. OpenAI doesn't have a strong incentive to make the "full o3" model compute hungry if they have a 200 dollar per month "o3 pro" tier above it
@LiamZ if anything I think they're likely to keep the compute allocation roughly the same between "full o1" and "full o3" at least for the initial release
@MalachiteEagle we’ll see obviously but my current world model for o3 is that the main driver of o3 performance is guessing how much “thinking time” to allocate to a task, generating many different private CoTs, and then discriminating between them.
I think currently, if they lobotomize it by restricting it to only one CoT to cut cost then it won’t be a noticeable improvement over o1 and that risks killing hype. The one thing they actually rely on right now is investment cash flow so hype is the one thing they need.
Since they rolled out the $200 a month subscription tier, I could see them keeping higher compute for the o3 branding, restricting it to that tier, and giving a lower cost version with a slightly different name to the $20 tier for a while.
I don’t know obviously, that’s the point of this game but if there’s additional information or I’m wrong about how o3 is getting those performance improvements, I’d definitely be interested.
@MalachiteEagle from that page:
For now, we can only speculate about the exact specifics of how o3 works. But o3's core mechanism appears to be natural language program search and execution within token space – at test time, the model searches over the space of possible Chains of Thought (CoTs) describing the steps required to solve the task, in a fashion perhaps not too dissimilar to AlphaZero-style Monte-Carlo tree search. In the case of o3, the search is presumably guided by some kind of evaluator model.
It’s that on the fly CoT search process that they are throwing more or less compute at in my understanding.
@LiamZ I think that o3 doesn't have an evaluator model, it's just an autoregressive LLM that's been trained with RL, and it performs the search in-context. They can fine-tune the model to terminate more or less rapidly, and it uses positional encodings to keep track of how many tokens of thought it has used so far. It's the model that chooses when to halt, and they can fine-tune the release model to simply perform a more rapid search of the solution space. There may also be items in the system prompt which direct the model towards a more efficient search.
@LiamZ I agree that it seems plausible OpenAI will create an even higher tier than the 200 dollar one. Think they'll have a range of o3 models across all paying tiers though.
@LiamZ I think the main factor right now is the lack of competition for o3-class models. Google and Anthropic are likely 12 months away (but maybe less!) from producing a model with the same reasoning capabilities. OpenAI doesn't have a strategic incentive to turn up the dial for the inference-time compute of their release o* models for now. They can always do this later as their competitors start to encroach on the benchmark/elo rankings.
@MalachiteEagle my understanding is that it is not just an LLM with RL. From the ARC post:
The model does test-time search over a space of "programs" (in this case, natural language programs – the space of CoTs that describe the steps to solve the task at hand), guided by a deep learning prior (the base LLM). The reason why solving a single ARC-AGI task can end up taking up tens of millions of tokens and cost thousands of dollars is because this search process has to explore an enormous number of paths through program space – including backtracking.
I don’t have more reliable information than this ARC page at the moment but have broadly agreed with Chollet on prior LLM evaluations and his views of AGI vs memorization.
I will admit, despite being very early to OpenAI and having private access before ChatGPT existed, my degrees are in physics so I’m still catching up on some of these emerging techniques and need a better mental model of how the search is performed in that token space.
@LiamZ No, Chollet is wrong. The reason that he is speculating about the use of an inference-time evaluator model is that he is strongly averse to admitting that he was wrong about LLMs being able to get a high score on his benchmark. However, these days he is fairly open in terms of public discussion and debate and is likely to come around to this realisation within the next year or so. Either way he's made a very real contribution to the field with this benchmark (just maybe not in the way he intended/expected) and will no doubt continue to do so with its next iteration.
@MalachiteEagle I’m actually not guessing they will make a higher than $200/month tier for individuals. I’m guessing they will put o3-mini (or o3-low or something) at the Plus tier (this market) and the “full o3” at the Pro tier to start and then maybe drop it down to Plus when making OmniUltra5 or whatever they decide to name the next big “thinking” model.
I don’t know obviously, but I’m betting YES on some months for the Pro market and NO on those months here.
@MalachiteEagle ah, until I have more reliable information I’m going to put more credence in Chollet’s impression having worked with o3 and hung out with Altman, et al. than this comment thread, sorry.
@MalachiteEagle sure, I could still be off in timing here but I’d need more evidence to put credence in that explanation.
@LiamZ that's entirely ok 😂 no-one should blindly trust what anonymous accounts on the internet say, they could be anyone!
@LiamZ Arc assumed that o3 has the same price per token as o1 when calculating those costs.... I'm not sure if that will turn out to be correct, but either way, I don't think o3 is vastly more expensive than o1.
@LiamZ Here's Nat McAleese on o1/o3:
https://x.com/__nmca__/status/1870170101091008860
In practice I suspect that they're stretching the definition of "just" an LLM. They probably have special tokens and sparsity tricks which affect context management and possibly enable backtracking (and more or less suppress erroneous/useless paths from the causal attention mask). There's also a good chance that they've figured out a way to include some kind of self-evaluation and tree search in-context. So this is all being done with an autoregressive LLM, just with some very clever tricks to enable it to do many things people didn't know an autoregressive LLM could do.
I suspect there are other details where o1 and o3 differ in significant ways. For instance, running multiple paths in parallel over the batch dimension at inference time, and dynamically switching these over according to some form of model decision / voting process.
These are details that OpenAI are saying very little about so far, but it seems obvious that just running a single CoT for hundreds of millions of tokens is an inefficient way of utilising the resources of the server-grade Nvidia GPUs these models are running on. It's likely that there is some form of parallelism at work in the o3 model, but it's possibly not all that sophisticated. Subsequent iterations of these models may substantially improve this aspect of inference-time scaling.
@LiamZ And so coming back to what you were saying above, yes it sounds plausible that there is some form of configuration for o3 which sets a maximum number of parallel CoT paths. They may have "set this dial to 11" when running on arc-agi, and will set it far lower for their release version.
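To make the "dial" idea concrete, here is a toy sketch of sampling several independent paths and taking a majority vote, with `max_paths` as the dial. Everything in it is an assumption for illustration: `sample_answer`, `answer_with_voting`, `max_paths`, and the 60% per-path success rate are made up, and nothing here is known about how o3 actually works.

```python
import random
from collections import Counter

def sample_answer(rng, p_correct=0.6):
    """Hypothetical stand-in for one CoT path: right with probability p_correct."""
    if rng.random() < p_correct:
        return "correct"
    # Wrong paths disagree with each other, which is what makes voting help.
    return rng.choice(["wrong_a", "wrong_b", "wrong_c"])

def answer_with_voting(max_paths, rng, p_correct=0.6):
    """Sample up to max_paths answers and return the most common one."""
    votes = Counter(sample_answer(rng, p_correct) for _ in range(max_paths))
    return votes.most_common(1)[0][0]

def accuracy(max_paths, trials=2000, seed=0):
    """Fraction of simulated tasks answered correctly at a given dial setting."""
    rng = random.Random(seed)
    hits = sum(answer_with_voting(max_paths, rng) == "correct" for _ in range(trials))
    return hits / trials

if __name__ == "__main__":
    for n in (1, 5, 33):
        print(n, accuracy(n))
```

In this toy model the cost grows linearly with `max_paths` while accuracy saturates, which is roughly the trade-off being argued about in this thread.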
@MalachiteEagle this is good info and discussion, my basic model of the “breakthrough” is that the differentiation of o3 vs prior models is the multiple, likely parallel CoT. That ARC post says:
At OpenAI's direction, we tested at two levels of compute with variable sample sizes: 6 (high-efficiency) and 1024 (low-efficiency, 172x compute).
My understanding is that the “high efficiency” one is still too expensive for the $20/month tier. My guess for this market right now is that turning it down below that risks giving people the impression it isn’t much better than o1 so they would want to avoid the impression that it’s the “full” o3.
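A rough back-of-the-envelope version of that, using only figures quoted in this thread (6 vs 1024 samples, 172x compute, about $20 per task for the high-efficiency run). The assumptions that cost scales linearly with compute and that the whole subscription fee goes to inference are mine, not anything OpenAI or ARC has stated.

```python
# Figures quoted in the thread.
high_eff_samples = 6
low_eff_samples = 1024
reported_compute_ratio = 172        # "172x compute" from the ARC post
high_eff_cost_per_task = 20.0       # USD per task, as cited above
plus_price_per_month = 20.0         # USD, Plus subscription

# Ratio of sample counts, for comparison with the reported 172x.
sample_ratio = low_eff_samples / high_eff_samples

# ASSUMPTION: cost scales linearly with compute.
implied_low_eff_cost_per_task = high_eff_cost_per_task * reported_compute_ratio

# ASSUMPTION: the entire monthly fee is spent on inference for one user.
tasks_per_month_at_cost = plus_price_per_month / high_eff_cost_per_task

print(f"sample ratio: {sample_ratio:.2f}")
print(f"implied low-efficiency cost per task: ${implied_low_eff_cost_per_task:,.0f}")
print(f"high-efficiency tasks covered by one Plus month: {tasks_per_month_at_cost:.0f}")
```

Under those assumptions a Plus subscription covers about one ARC-style task per month at the benchmark's high-efficiency setting, so any Plus release would need a much cheaper per-request configuration.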
@MalachiteEagle also worth mentioning this isn’t a deeply held belief. Regardless of anything about the tech itself, OpenAI has been wildly unpredictable with naming. When I did have that early access to GPT-3, I’d never have guessed the series of my most used models years later would look like GPT-4o -> o1 -> (maybe) o3. I think we agree an o3 will be available in Plus, I’m just guessing the “full” one will be in Pro for a while first.
@LiamZ lots of interesting details in the deepseek-r1 technical report on these questions
@LiamZ coming back to the hypothesis that this comment from chollet is probably wrong:
"For now, we can only speculate about the exact specifics of how o3 works. But o3's core mechanism appears to be natural language program search and execution within token space – at test time, the model searches over the space of possible Chains of Thought (CoTs) describing the steps required to solve the task, in a fashion perhaps not too dissimilar to AlphaZero-style Monte-Carlo tree search. In the case of o3, the search is presumably guided by some kind of evaluator model."