For the first HLE score for grok 4 reasoning (or if multiple are released at once, the highest), unless it is a month+ after grok 4 release, will it show up on https://scale.com/leaderboard/humanitys_last_exam as a score of 45% or more, after rounding, for the reasoning version of grok 4?

Update 2025-07-06 (PST) (AI summary of creator comment): This market is about the model currently expected to be called grok 4, not strictly any model with that specific name.

Update 2025-07-11 (PST) (AI summary of creator comment): If the reasoning score reported on the linked website includes tool use, it will count for this market's resolution.

Update 2025-07-17 (PST) (AI summary of creator comment): The creator is resolving the market to NO. See the linked comment for their reasoning.

No — resolved on Jul 17, 2025 by Manifold Markets prediction market.

Humanity’s Last Exam lists grok 4 at 45%+?

Grok 4's score is now up at 25.4%, but I'd suggest waiting to see if they release Grok 4-heavy or Grok 4 (heavy or not) with reasoning capabilities before resolving. They released Grok 4, Grok 4-Heavy (and reasoning capabilities for both) at once, so from the criteria, I assume that if it takes HLE a couple weeks to update with Grok 4 Heavy's score, that's okay? It says "the first HLE score for Grok 4 reasoning", but I'd assume that means the highest released version of Grok 4, which would be Heavy? It's also unclear if the current one on the leaderboard is reasoning or not.

https://agi.safe.ai/

@bens I’ll resolve NO now, as this is the first score for a grok 4 model, and it isn’t a month+ after release. they’ve indicated it’s a reasoning model to a sufficient extent that I count it as one

@Bayesian hmm, I mean, when this market was made, it wasn’t clear that Grok 4 would be split into Grok 4-regular and Grok 4-heavy… if they’d been instead called Grok 4-mini and Grok 4, would you have waited for the release of the full one’s scores?

@bens That would depend on whether, in that hypothetical, it would be most reasonable to consider grok 4 mini the model previously expected to be called grok 4, or not. If yes, then it would be the same as in this case. If not, then we would wait until a model that would have been expected to be called grok 4 appeared on the lb. If grok 4 regular (as named in that hypothetical) took over a month to show up on the lb, the market would resolve NA

To rephrase, the intent of the wording was to avoid having to wait for future updates to the leaderboard (or new versions of the model) when I deemed these a priori pretty unlikely, so really it was meant to resolve based on the first sufficiently grok4ish update to the lb

@bens Grok 4 Heavy is the same model as Grok 4. The non-Heavy score evaluates Grok 4 under settings that are similar to those used for other models.

And it is definitely a reasoning model; in fact it generates more tokens than most.

I think this is potentially relevant: I think it's really unlikely that grok 4 heavy shows up on the lb. Any takers around 28%?

bought Ṁ3 YES

As far as I understand, this market may resolve YES or NO depending on whether HLE decides to allow tool use. Sounds like a coin flip, so why are the odds so low?

It is known from the grok 4 livestream graph that 2.5 pro does 25%+ with tool use. That score is not presently on the lb. Simple inference

it might end up wrong but coinflip seems more wrong

@SimoneRomeo even with coin flip, the number given was 44.4% on their slides. The 50.1% is a little misleading, and not actually the final value, I think. I guess there's like a 20% chance that the tool use version somehow qualifies, within that a 50% chance that the heavy version actually gets rated this month, and within that a 20% chance that by some luck, HLE calculates the final value as 45% rather than 44%. So... about 2% all together, maybe.
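The chained estimate in the comment above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the three probabilities are the commenter's own guesses, and the dictionary keys are hypothetical labels added here for readability.

```python
from math import prod

# The commenter's rough guesses; the key names are illustrative labels only.
steps = {
    "tool_use_version_qualifies": 0.20,
    "heavy_rated_within_month": 0.50,
    "hle_measures_45_not_44": 0.20,
}

# Each step is conditional on the previous one, so the chance of YES
# is the product of the three probabilities.
p_yes = prod(steps.values())
print(f"Estimated chance of YES: {p_yes:.1%}")
```

Multiplying the steps gives 0.20 × 0.50 × 0.20 = 0.02, which matches the "about 2%" figure.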

Does this include Tool use or no?

@KJW_01294 it's whatever they decide to put on their website at the link provided

I suspect the website doesn't allow tools but if it does, it counts!

bought Ṁ50 YES

@Trazyn grok 4 reasoning (or if multiple are released at once, the highest),

https://x.com/ArtificialAnlys/status/1943166841150644622 24%

@SCS this one is base, not Heavy, and probably lines up with Grok's estimate of 25% with no tools. Ultimately this makes me slightly optimistic because it suggests there's some variability and perhaps when HLE tests it independently, they might get 45% instead of 44% for Heavy with tools
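To make the "45% instead of 44%" point concrete, here is a small sketch of the resolution threshold. It assumes "after rounding" means rounding to the nearest whole percent with halves rounded up; the market text does not spell out the rounding rule, and the function name is hypothetical.

```python
import math

def rounds_to_at_least(score_pct: float, threshold: int = 45) -> bool:
    """Return True if the score, rounded half-up to a whole percent, meets the threshold.

    Assumption: nearest-integer rounding with halves rounded up.
    """
    return math.floor(score_pct + 0.5) >= threshold

# Scores mentioned in the thread.
for score in (25.4, 44.4, 50.1):
    print(score, rounds_to_at_least(score))
```

Under that assumption, the 44.4% from the slides falls just short, and an independent measurement would need to come in at 44.5% or higher to count.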