Will Grok 2 'exceed current [March 28 2024] AI on all metrics'?

On March 29, Elon Musk tweeted this (https://twitter.com/elonmusk/status/1773655245769330757 ):

"Should be available on 𝕏 next week. Grok 2 should exceed current AI on all metrics. In training now."

Is that so? Let's find out.

Note that for this purpose a model counts as 'Grok 2' even if it is renamed. A newly announced xAI model fails to count only if it is named Grok 1.X or is otherwise clearly pre-2; the thing in training now counts whatever they ultimately call it, provided they release it.


Resolves YES if Grok 2 is released and it exceeds or ties (to 1 decimal place) Claude 3 Opus, and every other model available to the public in some form on or before 3/28/24, on every metric in this chart:

So MMLU, GPQA, GSM8K, MATH, MGSM, HumanEval, DROP, Big-Bench-Hard, ARC-Challenge and HellaSwag.

Resolves NO if Grok 2 is released and does NOT exceed or tie these numbers on one or more of these metrics, or if Grok 2 is not released by EOY 2025.

If xAI does not test on all of these metrics, but it succeeds on all metrics that it does test, and there is no way to test on the others, I will use best judgment - if it clearly would have exceeded I will still resolve YES, but by default (or if it would have been close) I will assume they chose which metrics to test on based on results, and be inclined to count that as NO. Will clarify further if this gets a lot of interest, as needed.
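For anyone who wants the rule spelled out mechanically, here is a rough sketch of the resolution logic as I read it. It is not an official resolver: the function names, the "JUDGMENT" outcome label, and the baseline numbers in the usage example are all made up for illustration, and the real baseline would be the best publicly available score per metric as of 3/28/24.

```python
from typing import Mapping

# The ten metrics named in the resolution criteria.
METRICS = [
    "MMLU", "GPQA", "GSM8K", "MATH", "MGSM",
    "HumanEval", "DROP", "Big-Bench-Hard", "ARC-Challenge", "HellaSwag",
]


def meets_or_ties(candidate: float, baseline: float) -> bool:
    """True if the candidate exceeds or ties the baseline at 1 decimal place."""
    return round(candidate, 1) >= round(baseline, 1)


def resolve(released: bool,
            candidate: Mapping[str, float],
            baseline: Mapping[str, float]) -> str:
    """Return "YES", "NO", or "JUDGMENT" (a hypothetical label for the
    case the description leaves to the market creator's best judgment)."""
    if not released:
        return "NO"  # not released by the deadline
    reported = [m for m in METRICS if m in candidate]
    if any(not meets_or_ties(candidate[m], baseline[m]) for m in reported):
        return "NO"  # falls short on at least one reported metric
    if len(reported) < len(METRICS):
        return "JUDGMENT"  # untested metrics: creator decides, leaning NO
    return "YES"


if __name__ == "__main__":
    # Placeholder numbers only, NOT real benchmark scores.
    baseline = {m: 50.0 for m in METRICS}
    candidate = {m: 50.04 for m in METRICS}  # rounds to 50.0, so it ties
    print(resolve(True, candidate, baseline))
```

Note that the sketch treats a shortfall on any reported metric as an immediate NO, and only falls back to judgment when every reported metric passes but some are missing, which matches the default described above.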


How do you adjudicate differences in evaluations across models? e.g. "0-shot CoT" vs "4-shot" on MATH in the table? Does Grok 2 have to report the same evaluation type as Claude 3 for each benchmark?

bought Ṁ100 NO

@ZviMowshowitz Will you resolve this question NO, if X-Twitter fails to release Grok 2 by some deadline?

@HankyUSA Explicitly yes EOY 2025.

bought Ṁ10 YES

I think there's a pretty good shot Grok 2 gets released in a year and a half, behind whatever OpenAI and Anthropic are up to at that point, but exceeding the current standard.