
By March 31, 2025, will an open-source AI model—with weights available for commercial use and requiring attribution similar to Meta’s Llama—be released that outperforms OpenAI’s new o1-preview model on established benchmarks?
1. Time Frame: the deadline is the end of the first quarter of 2025 (March 31, 2025).
2. Criteria for the Open-Source Model:
- Availability: Weights must be available for commercial use.
- Attribution and license: terms must resemble what Meta and others have done previously (e.g., the Llama license).
3. Performance Benchmark: The model must outperform OpenAI’s new o1-preview model on established benchmarks (at least 2 major ones) that it currently leads on.
🏅 Top traders
# | Name | Total profit
---|---|---
1 | | Ṁ639
2 | | Ṁ331
3 | | Ṁ324
4 | | Ṁ251
5 | | Ṁ248
Qwen's QwQ has met all criteria for the challenge.
- It beats o1-preview on both AIME and MATH-500
- Is available for commercial use under the Apache 2.0 license
- Has weights available on Hugging Face
https://qwenlm.github.io/blog/qwq-32b-preview/
@MalachiteEagle it's the only model available with metrics compared against o1-preview. This is also made clear in the description.
Score for o1 just posted at 1355. It's further ahead than I thought, so this market now seems less likely to resolve Yes.

o1 seems particularly strong at coding tasks, so you should probably specify which benchmarks you will use
And I bet on the previous spec before it was changed. I would prefer this market be N/Aed and a fresh market made
"3. Performance Benchmark: The model must outperform OpenAI’s new o1-preview model on established benchmarks (at least 2 major ones) that it currently leads on."
There are so many ways of evaluating, and so many benchmarks out there. IMO there's a lot to gain from specifying concretely, e.g. LMSYS code, GPQA, SWE-bench, etc. @JohnL ? Probably also worth specifying: use the best available result (any scaffold) at time of resolution for both o1 and the OSS contender.
The model must outperform OpenAI’s o1 preview (or full) model on at least two widely recognized AI benchmarks
That's already true today. https://github.com/openai/simple-evals?tab=readme-ov-file#benchmark-results
MGSM and DROP are higher on Llama 405B
o1 is specialized for STEM, so it sucks at creative writing. It wouldn't be surprising for any model to beat it on two major creative-writing benchmarks, or something like that
Maybe. If so, I think the title or description should make clear that the open-source model has to beat o1 on things o1 is good at
Does 'o1' refer to the recently released 'o1-preview', the upcoming 'o1' (which OpenAI has claimed to be meaningfully better than 'o1-preview'), or whatever the best iteration of o1 is that is publicly available by the deadline? What happens in the second case if 'o1' isn't released by the deadline?
Because that's not a "clarification", that's a substantial change from one model to a different model