Will resolve to YES if Meta releases an open-source model that achieves a higher average score than GPT-4 on the following benchmarks by the end of 2024:
HellaSwag (few-shot): 0.953
MMLU (few-shot): 0.864
AI2 Reasoning Challenge (ARC): 0.963
Update 2025-03-01 (PST) (AI summary of creator comment): - Llama 3.1 405B achieves the following benchmark scores:
MMLU (zero-shot CoT): 0.886
ARC (zero-shot): 0.969
HellaSwag score is not reported due to potential contamination and is not considered in the resolution criteria.
The model is deemed open-source, and based on its performance and higher Elo on LMSYS, the market is resolved to YES.
According to the Llama 3.1 release [1], Llama 3.1 405B achieves the following scores on two benchmarks in the original question:
MMLU (zero-shot CoT): 0.886
ARC (zero-shot): 0.969
However, they do not report a score on HellaSwag, and there do not appear to be reliable third-party HellaSwag results for Llama 3.1 405B Instruct elsewhere, likely due to contamination as noted in the Llama 3.1 paper [2]. Based on improved performance on the above benchmarks, as well as a higher Elo on LMSYS, I am inclined to say that Llama 3.1 405B does indeed "outperform" GPT-4.
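For transparency, the average-score comparison behind this resolution can be sketched as follows, using only the two benchmarks with reported scores (HellaSwag excluded, per the above):

```python
# Benchmark scores: GPT-4 figures from the original question,
# Llama 3.1 405B figures from the Llama 3.1 release [1].
# HellaSwag is excluded due to potential contamination.
gpt4 = {"MMLU": 0.864, "ARC": 0.963}
llama_31_405b = {"MMLU": 0.886, "ARC": 0.969}

def average(scores: dict) -> float:
    """Unweighted mean over the reported benchmarks."""
    return sum(scores.values()) / len(scores)

print(f"GPT-4 average:          {average(gpt4):.4f}")           # 0.9135
print(f"Llama 3.1 405B average: {average(llama_31_405b):.4f}")  # 0.9275

# Resolution check: Llama 3.1 405B has the higher average.
assert average(llama_31_405b) > average(gpt4)
```

On these two benchmarks, Llama 3.1 405B's average (0.9275) exceeds GPT-4's (0.9135) by 0.014.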
Whether Llama 3.1 405B is truly "open source" is a debated topic; however, I am considering it to be open source.
For this reason I am resolving the market to YES.
[1] https://ai.meta.com/blog/meta-llama-3-1/
[2] https://ai.meta.com/research/publications/the-llama-3-herd-of-models/
Related news: Meta is currently training Llama 3 and plans to ramp up to compute equivalent to almost 600K H100s by the end of the year: https://www.instagram.com/reel/C2QARHJR1sZ/.