Will an LLM beat human experts on GPQA by Jan 1, 2025?
Jan 2
29% chance

GPQA dataset here: https://arxiv.org/abs/2311.12022

"Human expert" means 74% accuracy.

Currently, GPT-4 gets 39%.

The LLM is allowed to use external tools (e.g. Google, Wolfram Alpha).

Usaar33 bought Ṁ100 NO

Which set? Main? Diamond?

@Usaar33 Extended. (That’s where the 74% number comes from.)

sold Ṁ46 NO

AFAICT Anthropic models report only diamond. So will this rely on 3rd party evals?

If there are no Extended evals in the official model report, then yes, I'll rely on 3rd party evals.

Is this using unlimited scaffolding/maj vote/etc. or does this have to be zero-shot CoT?

Good question. Given that I hadn't specified at the beginning, I'll allow any sort of scaffolding or ensembling.
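
For anyone unfamiliar with "maj vote": the idea is to sample several answers per question and keep the most common one. A rough sketch of how that could be scored against the 74% bar follows. The function names and the toy data are hypothetical and not taken from the GPQA paper or any lab's eval harness, and reading "beat" as strictly greater than 74% is my assumption, not something the market creator has stated.

```python
from collections import Counter

# Human-expert accuracy quoted in the market description.
HUMAN_EXPERT_ACCURACY = 0.74


def majority_vote(samples):
    """Return the most common answer among several sampled answers.

    Ties go to the answer that was seen first.
    """
    if not samples:
        raise ValueError("need at least one sampled answer")
    return Counter(samples).most_common(1)[0][0]


def accuracy(sampled_answers, gold):
    """Fraction of questions where the majority-vote answer matches gold."""
    if len(sampled_answers) != len(gold):
        raise ValueError("one list of samples is needed per gold answer")
    votes = [majority_vote(samples) for samples in sampled_answers]
    correct = sum(vote == answer for vote, answer in zip(votes, gold))
    return correct / len(gold)


def beats_human_experts(acc):
    """Assumption: 'beat' means strictly above the 74% figure."""
    return acc > HUMAN_EXPERT_ACCURACY


# Toy data (made up): three sampled answers for each of four questions.
toy_samples = [
    ["A", "B", "A"],
    ["C", "C", "D"],
    ["B", "D", "D"],
    ["A", "A", "A"],
]
toy_gold = ["A", "C", "B", "A"]

toy_accuracy = accuracy(toy_samples, toy_gold)
print(toy_accuracy, beats_human_experts(toy_accuracy))
```

On the toy data three of the four majority answers match, so the score lands just above the bar. A real run would need the Extended set and whatever answer-extraction the evaluator uses, neither of which is shown here.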