By March 31, 2025, will an open-source AI model—with weights available for commercial use and requiring attribution similar to Meta’s Llama—be released that outperforms OpenAI’s new o1-preview model on established benchmarks?
1. Time Frame: The deadline is the end of the first quarter of 2025 (March 31, 2025).
2. Criteria for the Open-Source Model:
- Availability: Weights must be available for commercial use.
- Attribution and license: Must resemble what Meta and others have done previously (e.g., the attribution requirement in Meta's Llama license).
3. Performance Benchmark: The model must outperform OpenAI’s new o1-preview model on established benchmarks (at least 2 major ones) that it currently leads on.
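One way to read the three criteria above as a checklist is sketched below. This is only an illustration, not the official resolution procedure: the function name, its parameters, and the benchmark names and scores in the usage example are all hypothetical placeholders, and whether a license "resembles" Meta's is a judgment call that the sketch reduces to a boolean.

```python
from datetime import date

# Values taken from the description above.
DEADLINE = date(2025, 3, 31)
MIN_BENCHMARK_WINS = 2  # "at least 2 major ones"


def meets_criteria(release_date, weights_available, commercial_use_allowed,
                   license_ok, candidate_scores, o1_preview_scores,
                   o1_leading_benchmarks):
    """Return True if a candidate model satisfies all three criteria.

    All parameters are hypothetical inputs for illustration:
    - license_ok: a human judgment on whether the attribution and
      license terms resemble what Meta and others have done.
    - candidate_scores / o1_preview_scores: dicts of benchmark -> score,
      where higher is assumed to be better.
    - o1_leading_benchmarks: benchmarks o1-preview leads on.
    """
    # 1. Time frame: released on or before the deadline.
    if release_date > DEADLINE:
        return False
    # 2. Open-source criteria: weights, commercial use, license terms.
    if not (weights_available and commercial_use_allowed and license_ok):
        return False
    # 3. Performance: strictly beat o1-preview on at least two of the
    #    benchmarks it leads on, counting only benchmarks with scores
    #    reported for both models.
    wins = [
        name for name in o1_leading_benchmarks
        if name in candidate_scores
        and name in o1_preview_scores
        and candidate_scores[name] > o1_preview_scores[name]
    ]
    return len(wins) >= MIN_BENCHMARK_WINS


# Usage with placeholder numbers (NOT real benchmark results).
example = meets_criteria(
    release_date=date(2024, 12, 1),
    weights_available=True,
    commercial_use_allowed=True,
    license_ok=True,
    candidate_scores={"bench_a": 0.61, "bench_b": 0.72, "bench_c": 0.40},
    o1_preview_scores={"bench_a": 0.55, "bench_b": 0.70, "bench_c": 0.45},
    o1_leading_benchmarks=["bench_a", "bench_b", "bench_c"],
)
print(example)
```

The sketch leaves open exactly what the comments below debate: which benchmarks count as "established", and which reported score to use for each model.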
Qwen's QwQ has met all criteria for the challenge.
- It beats o1-preview on both AIME and MATH-500
- Is available for commercial use under the Apache 2.0 license
- Has weights available on Hugging Face
https://qwenlm.github.io/blog/qwq-32b-preview/
@MalachiteEagle o1-preview is the only model available with metrics. This is also made clear in the description.
"3. Performance Benchmark: The model must outperform OpenAI’s new o1-preview model on established benchmarks (at least 2 major ones) that it currently leads on."
There are so many ways of evaluating, and so many benchmarks out there. IMO there's a lot to gain from specifying them concretely, e.g. lmsys code, GPQA, SWE-bench, etc. @JohnL? Probably worth specifying further: use the best available result (any scaffold) at time of resolution for both o1 and the OSS contender.
The model must outperform OpenAI’s o1 preview (or full) model on at least two widely recognized AI benchmarks
That's already true today. https://github.com/openai/simple-evals?tab=readme-ov-file#benchmark-results
MGSM and DROP are higher on Llama 405B.
@Bayesian Agreed. Hmm, the description now says 'that it currently leads on', which still isn't as clear as I'd like.
Does 'o1' refer to the recently released 'o1-preview', the upcoming 'o1' (which OpenAI has claimed to be meaningfully better than 'o1-preview'), or whatever the best iteration of o1 is that is publicly available by the deadline? What happens in the second case if 'o1' isn't released by the deadline?