Humanity’s Last Exam lists grok 4 at 45%+?
resolved Jul 17
Resolved
NO

For the first HLE score for grok 4 reasoning (or if multiple are released at once, the highest), unless it is a month+ after grok 4 release, will it show up on

https://scale.com/leaderboard/humanitys_last_exam

As a score of 45% or more, after rounding, for the reasoning version of grok 4?

  • Update 2025-07-06 (PST) (AI summary of creator comment): This market is about the model currently expected to be called grok 4, not strictly any model with that specific name.

  • Update 2025-07-11 (PST) (AI summary of creator comment): If the reasoning score reported on the linked website includes tool use, it will count for this market's resolution.

  • Update 2025-07-17 (PST) (AI summary of creator comment): The creator is resolving the market to NO. See the linked comment for their reasoning.


🏅 Top traders

#   Total profit
1   Ṁ13,519
2   Ṁ1,135
3   Ṁ800
4   Ṁ621
5   Ṁ503

Grok 4's score is now up at 25.4%, but I'd suggest waiting to see if they release Grok 4-heavy or Grok 4 (heavy or not) with reasoning capabilities before resolving. They released Grok 4, Grok 4-Heavy (and reasoning capabilities for both) at once, so from the criteria, I assume that if it takes HLE a couple weeks to update with Grok 4 Heavy's score, that's okay? It says "the first HLE score for Grok 4 reasoning", but I'd assume that means the highest released version of Grok 4, which would be Heavy? It's also unclear if the current one on the leaderboard is reasoning or not.

https://agi.safe.ai/

@bens I’ll resolve NO now, as this is the first score for a grok 4 model, and it isn’t a month+ after release. they’ve indicated it’s a reasoning model to a sufficient extent that I count it as one

@Bayesian hmm, I mean, when this market was made, it wasn’t clear that Grok 4 would be split into Grok 4-regular and Grok 4-heavy… if they’d been instead called Grok 4-mini and Grok 4, would you have waited for the release of the full one’s scores?

@bens That would depend on whether, in that hypothetical, it would be most reasonable to consider grok 4 mini the model previously expected to be called grok 4, or not. If yes, then it would be the same as in this case. If not, then we would wait until a model that would have been expected to be called grok 4 appeared on the lb. If grok 4 regular (as named in that hypothetical) took over a month to show up on the lb, the market would resolve NA

To rephrase, the intent of the wording was to avoid having to wait for future updates to the leaderboard (or new versions of the model) when I deemed these a priori pretty unlikely, so really it was meant to resolve based on the first sufficiently grok4ish update to the lb

@bens Grok 4 Heavy is the same model as Grok 4. The non-Heavy score evaluates Grok 4 under settings that are similar to those used for other models.

And it is definitely a reasoning model; in fact it generates more tokens than most.

I think this is potentially relevant: i think it's really unlikely that grok 4 heavy shows up on the lb. any takers around 28%?

bought Ṁ3 YES

As far as I understand this market may resolve YES or NO whether HLE decides to allow tool use or not. Sounds like a coin flip, why are odds so low?

it is known that 2.5 pro does 25%+ with tool use from the grok 4 livestream graph. It is not presently on the lb. Simple inference

it might end up wrong but coinflip seems more wrong

@SimoneRomeo even with coin flip, the number given was 44.4% on their slides. The 50.1% is a little misleading, and not actually the final value, I think. I guess there's like a 20% chance that the tool use version somehow qualifies, within that a 50% chance that the heavy version actually gets rated this month, and within that a 20% chance that by some luck, HLE calculates the final value as 45% rather than 44%. So... about 2% all together, maybe.
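The chained estimate in the comment above can be checked with a quick sketch. It assumes the three steps can simply be multiplied together as conditional probabilities; the function and variable names are made up for illustration, and the inputs are the commenter's guesses, not measured values.

```python
import math

def joint_probability(conditional_steps):
    """Multiply a chain of conditional probabilities into one joint probability."""
    return math.prod(conditional_steps)

# The commenter's rough guesses (assumptions, not data):
steps = [
    0.20,  # the tool-use version somehow qualifies
    0.50,  # given that, the Heavy version gets rated this month
    0.20,  # given that, HLE computes 45% rather than 44%
]

p_yes = joint_probability(steps)
print(f"Estimated chance of YES: {p_yes:.1%}")
```

Under those guesses the product comes out to about 2%, matching the figure in the comment.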

Does this include Tool use or no?

@KJW_01294 it's whatever they decide to put on their website at the link provided

I suspect the website doesn't allow tools but if it does, it counts!

bought Ṁ50 YES

@Trazyn grok 4 reasoning (or if multiple are released at once, the highest),

@SCS this one is base, not Heavy, and probably lines up with Grok's estimate of 25% with no tools. Ultimately this makes me slightly optimistic because it suggests there's some variability and perhaps when HLE tests it independently, they might get 45% instead of 44% for Heavy with tools

lmao this is cursed

@bens @jim congrats hahahahaah

@bens wow

@jim I feel like my live trading was rational, tbh, but kudos

@bens I'm mad

@bens feckin brutal!!!
