Will an opensource LLM on huggingface beat an average human at the most common LLM benchmarks by July 1, 2024?
Jul 2

Open source models are measured against ARC, HellaSwag, MMLU, and TruthfulQA on https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard . I've added the following plot to this huggingface space so we can see the progress of open source models over time:

I'm wondering when the average human baseline will be passed. A linear trend indicates July 2024 with a 0.89 pearson coef:

But this trend might not be linear. This questions will resolve Yes if the average human baseline on Open LLM Leaderboard on huggingface is surpassed before July 1, 2024.

Get Ṁ600 play money
Sort by:

"an average human" in the title is misleading given that "average human baseline" consists of MMLU human baseline at ~90% which is expert performance instead of average human performance.

from: D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020

predicts YES

@Metastable yeah it's so hard to measure these things. Don't worry I have an idea to fix this.

More related questions