Will an open-source LLM on Hugging Face beat an average human at the most common LLM benchmarks by July 1, 2024?


Resolved N/A on Dec 23


Open-source models are measured against ARC, HellaSwag, MMLU, and TruthfulQA on the Open LLM Leaderboard: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard. I've added the following plot to this Hugging Face Space so we can see the progress of open-source models over time:

I'm wondering when the average human baseline will be passed. A linear trend points to July 2024, with a Pearson correlation coefficient of 0.89:

But this trend might not be linear. This question will resolve Yes if the average human baseline on the Open LLM Leaderboard on Hugging Face is surpassed before July 1, 2024.
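The linear extrapolation described above can be sketched roughly as follows. This is only an illustration of the technique (a least-squares line through the best leaderboard score over time, extended until it reaches the human baseline), not the code behind the original plot. The function names `fit_line` and `crossing_date`, the sample dates and scores, and the `HUMAN_BASELINE` value are all hypothetical placeholders, not real leaderboard data.

```python
from datetime import date, timedelta
from math import sqrt


def fit_line(xs, ys):
    """Ordinary least-squares fit. Returns (slope, intercept, pearson_r)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Pearson r is undefined for constant scores; report 0.0 in that case.
    r = sxy / sqrt(sxx * syy) if syy > 0 else 0.0
    return slope, intercept, r


def crossing_date(dates, scores, baseline):
    """Return (date the fitted line reaches `baseline`, pearson_r).

    The date is None when the fitted slope is not positive, since the
    line would then never reach the baseline.
    """
    origin = dates[0]
    xs = [(d - origin).days for d in dates]
    slope, intercept, r = fit_line(xs, scores)
    if slope <= 0:
        return None, r
    days = (baseline - intercept) / slope
    return origin + timedelta(days=round(days)), r


# HYPOTHETICAL example values, not taken from the leaderboard.
dates = [date(2023, 1, 1), date(2023, 3, 1), date(2023, 5, 1), date(2023, 7, 1)]
scores = [50.0, 55.0, 58.0, 64.0]  # best average score seen so far at each date
HUMAN_BASELINE = 90.0              # placeholder, not the leaderboard's actual baseline

when, r = crossing_date(dates, scores, HUMAN_BASELINE)
print(f"Linear trend reaches the baseline around {when} (Pearson r = {r:.2f})")
```

A high Pearson coefficient only says the points so far lie close to a line; it does not guarantee that the trend stays linear, which is the caveat the question itself raises.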




They kept removing my code that made this easy to see on Hugging Face. We have now surpassed average humans on all of these baselines, but I'm unsure of the state of things on July 1, 2024. I'm sorry I did not follow up and resolve this. For that reason, I'm returning everyone's funds.

@ChrisCanal resolve?

"An average human" in the title is misleading: the "average human baseline" includes the MMLU human baseline of ~90%, which reflects expert performance rather than average human performance.

From: D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. "Measuring Massive Multitask Language Understanding." arXiv preprint arXiv:2009.03300, 2020.
https://arxiv.org/pdf/2009.03300.pdf