Will any open-source model achieve GPT-4 level performance on MMLU through 2024?
Resolved YES (Dec 9)

GPT-4 currently leads the Massive Multitask Language Understanding (MMLU) benchmark at 86.4%. Will any open-source language model achieve at least 86.4% on the MMLU average?
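For reference, the "MMLU average" reported on leaderboards is typically the macro-average of per-subject accuracies across the benchmark's 57 subjects. A minimal sketch of that calculation (subject names and scores below are illustrative, not real results):

```python
# Hypothetical sketch: the reported MMLU score is usually the
# macro-average of per-subject accuracy. Numbers here are made up.
def mmlu_average(per_subject_correct: dict[str, tuple[int, int]]) -> float:
    """per_subject_correct maps subject -> (num_correct, num_questions)."""
    accuracies = [c / n for c, n in per_subject_correct.values()]
    return sum(accuracies) / len(accuracies)

scores = {
    "abstract_algebra": (70, 100),    # illustrative only
    "high_school_physics": (90, 100),
    "us_history": (98, 100),
}
print(round(mmlu_average(scores) * 100, 1))  # macro-average in percent
```

Note that a macro-average weights every subject equally regardless of how many questions it has, which is why it can differ slightly from the overall fraction of questions answered correctly.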

A leaderboard of open-source models can be found here.


@mods OP's account has been deleted, but AFAICT, a number of open-source models have met this threshold.

Based on this leaderboard, Llama 3.1 405B, Hunyuan Large, and Leeroo all surpass GPT-4 on this benchmark, and Llama 3.1 70B is closely comparable.

@MugaSofer I used your link and compared the stated criteria against the leaderboard. I agree with your explanation. Resolving YES.

@MugaSofer Actually, hold on, can you show me better evidence that at least one of these is "open-source"?

@Eliza Sure.

Llama 3.1:

Llama 3.1 405B—the first frontier-level open source AI model [...] Until today, open source large language models have mostly trailed behind their closed counterparts when it comes to capabilities and performance. Now, we’re ushering in a new era with open source leading the way. We’re publicly releasing Meta Llama 3.1 405B, which we believe is the world’s largest [...] True to our commitment to open source, starting today, we’re making these models available to the community for download on llama.meta.com and Hugging Face and available for immediate development - Introducing Llama 3.1: Our most capable models to date

Hunyuan Large:

In this paper, we introduce Hunyuan-Large, which is currently the largest open-source Transformer-based mixture of experts model [...] The code and checkpoints of Hunyuan-Large are released to facilitate future innovations and applications. - Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent

Leeroo:

Edit:

I hadn't realised this at the time, but I think the version of Leeroo on that leaderboard was making calls to GPT-4(!), so it probably shouldn't qualify. The more fully open source version doesn't seem to pass the specified bar. (That's academic though, since the other two definitely do.) For more info, anyone interested can see The implementation of "Leeroo Orchestrator: Elevating LLMs Performance Through Model Integration" and Orchestration of Experts: The First-Principle Multi-Model System.

@Eliza If you could resolve this that'd be swell.

@MugaSofer alright, as long as we aren't going to get "well ackchyually"'d by some users claiming these don't qualify as open source, I'm content to resolve Yes based on the evidence shown in this thread.

I did notice the creator said this version of "open source" was probably enough, down in a lower comment here, so it seems pretty safe.

Technically this has already been done, through clear data contamination. Should I assume this only resolves YES if there is no evidence of data contamination? Catch me if you can! How to beat GPT-4 with a 13B model | LMSYS Org
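The basic idea behind a contamination check like the one the LMSYS post describes can be sketched as an n-gram overlap test between training text and benchmark questions. (Their actual analysis also used rephrased samples and an LLM-based detector; this is only the simplest version, and the function names here are hypothetical.)

```python
# Hedged sketch of a simple n-gram-overlap contamination check.
# A test question is flagged if any n-word sequence from it also
# appears verbatim in the training corpus.
def ngrams(text: str, n: int = 10) -> set[tuple[str, ...]]:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(train_doc: str, test_question: str, n: int = 10) -> bool:
    """True if train_doc and test_question share at least one n-gram."""
    return bool(ngrams(test_question, n) & ngrams(train_doc, n))
```

A model trained on contaminated data can score far above its true capability, which is why a headline MMLU number alone is weak evidence without such a check.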

What counts as open-source? If hackers steal a model and put it on torrent, is it now open-source? What if a corporation releases the weights but only for research purposes not commercial purposes?


@DanielKokotajlo
I’ll count a model as open source if the model weights are accessible by people outside the organization.

Llama was originally released for researchers, and I would count this as open source for the purposes of this question.

If hackers put it on torrent, that’s open source too.

I realize this deviates from the definition of open source used in OSS communities. The spirit of the question is focused on malicious use and proliferation potential.

@mattt OK, thanks for the clarification. In that case this question is pretty much equivalent to mine I think: GPT4 or better model available for download by EOY 2024? | Manifold

@DanielKokotajlo yep pretty much :)
