For the first HLE score for grok 4 reasoning (or if multiple are released at once, the highest), unless it is a month+ after grok 4 release, will it show up on
https://scale.com/leaderboard/humanitys_last_exam
As a score of 45% or more, after rounding, for the reasoning version of grok 4?
Update 2025-07-06 (PST) (AI summary of creator comment): This market is about the model currently expected to be called grok 4, not strictly any model with that specific name.
Update 2025-07-11 (PST) (AI summary of creator comment): If the reasoning score reported on the linked website includes tool use, it will count for this market's resolution.
Update 2025-07-17 (PST) (AI summary of creator comment): The creator is resolving the market to NO. See the linked comment for their reasoning.
Grok 4's score is now up at 25.4%, but I'd suggest waiting to see if they release Grok 4-heavy or Grok 4 (heavy or not) with reasoning capabilities before resolving. They released Grok 4, Grok 4-Heavy (and reasoning capabilities for both) at once, so from the criteria, I assume that if it takes HLE a couple weeks to update with Grok 4 Heavy's score, that's okay? It says "the first HLE score for Grok 4 reasoning", but I'd assume that means the highest released version of Grok 4, which would be Heavy? It's also unclear if the current one on the leaderboard is reasoning or not.
@bens I’ll resolve NO now, as this is the first score for a Grok 4 model, and it isn’t a month+ after release. They’ve indicated it’s a reasoning model to a sufficient extent that I count it as one.
@Bayesian hmm, I mean, when this market was made, it wasn’t clear that Grok 4 would be split into Grok 4-regular and Grok 4-heavy… if they’d been instead called Grok 4-mini and Grok 4, would you have waited for the release of the full one’s scores?
@bens That would depend on whether, in that hypothetical, it would be most reasonable to consider grok 4 mini the model previously expected to be called grok 4, or not. If yes, then it would be the same as in this case. If not, then we would wait until a model that would have been expected to be called grok 4 appeared on the lb. If grok 4 regular (as named in that hypothetical) took over a month to show up on the lb, the market would resolve NA.
@SimoneRomeo even with coin flip, the number given was 44.4% on their slides. The 50.1% is a little misleading, and not actually the final value, I think. I guess there's like a 20% chance that the tool use version somehow qualifies, within that a 50% chance that the heavy version actually gets rated this month, and within that a 20% chance that by some luck, HLE calculates the final value as 45% rather than 44%. So... about 2% all together, maybe.
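The back-of-the-envelope estimate above can be checked with a small sketch. It multiplies the three conditional guesses from the comment (20%, 50%, 20%) and tests scores against the 45% threshold. The function names are hypothetical, and rounding half-up to a whole percent is an assumption, since the market only says "after rounding".

```python
import math


def chained_probability(*conditional_probs):
    """Multiply a chain of conditional probabilities into one joint estimate."""
    result = 1.0
    for p in conditional_probs:
        result *= p
    return result


def meets_threshold(score_pct, threshold_pct=45):
    """True if the score reaches the threshold after rounding.

    Assumption: rounding is half-up to a whole percent. The market text
    does not spell out the rounding rule.
    """
    return math.floor(score_pct + 0.5) >= threshold_pct


# Guesses from the comment: tool-use version qualifies (20%), Heavy gets
# rated within the month (50%), HLE's own number lands at 45% (20%).
p_yes = chained_probability(0.20, 0.50, 0.20)
print(f"{p_yes:.1%}")         # 2.0%

print(meets_threshold(44.4))  # False: the slide figure rounds to 44
print(meets_threshold(25.4))  # False: the score currently on the leaderboard
```

Under these assumptions the 44.4% slide figure falls just short, which is why the estimate hinges on an independent HLE run coming in slightly higher.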
@SCS this one is base, not Heavy, and probably lines up with Grok's estimate of 25% with no tools. Ultimately this makes me slightly optimistic because it suggests there's some variability and perhaps when HLE tests it independently, they might get 45% instead of 44% for Heavy with tools